LLMs become fluent at constructing coherent, sophisticated text in natural language by training on obscene amounts of coherent, sophisticated text in natural language. Importantly, there is no corpus of text that contains only accurate knowledge, let alone knowledge as it unambiguously applies to some specific domain.
It's unclear whether any such corpus could exist (a millennia-old discussion in philosophy with no settled resolution), but even if you grant that it could, we don't have one.
So what happens is that after the model learns how to construct coherent, sophisticated text in natural language from all the bullshit-addled general text (truth and fiction and lies and fantasy and garbage, old text and new text), there is a subsequent effort to tune it toward generating useful text for some purpose. And here, again, it's important to recognize that this subsequent training is about utility ("you're a helpful chatbot", "this will trigger a function call that will supplement results", etc.) and so still can't focus strictly on knowledge.
LLMs can produce intelligent output that may be correct and may be verifiable, but the way they work and the way they need to be trained prevent them from ever actually representing knowledge itself. The best they can do is produce text that is more or less fluent and more or less useful.
It's awesome and has lots and lots of potential, but it's a radically different thing from a material individual composed of countless disparate linguistic and non-linguistic systems that have never yet been technologically replicated or modeled.
Wrong. This is the common groupspeak on uninformed places like HN, but it is not what the current research says. See e.g. this: https://arxiv.org/abs/2210.13382
Most of what you wrote shows that you have zero education in modern deep learning, so I really wonder what makes you form such strong opinions on something you know nothing about.
The person you are replying to said it clearly: "there is no such corpus of text that contains only accurate knowledge"
Deep learning learns a model of the world, and that model can be arbitrarily inaccurate. As far as a DL model is concerned, Earth may as well have 10 moons. For the model to hold that Earth has only 1 moon, there has to be a dataset that encodes only that information, and never once more moons. A drunk person who stares at the moon, sees more than one, and writes about it on the internet has to be excluded from the training data.
Also, a model of the Othello world is very different from a model of the real world. I don't know about Othello, but in chess it is well known that the number of possible games is greater than the number of atoms in the universe. For all practical purposes, the dataset of all possible chess positions is infinite.
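(For rough scale, these are standard figures rather than anything from this thread: Shannon's classic estimate puts the number of possible chess games around 10^120, while the observable universe holds roughly 10^80 atoms.)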
The dataset of every possible event on Earth, every second, is also larger than the number of atoms in the universe. For all practical purposes, it is infinite as well.
Do you know whether one dataset is more infinite than the other? Does modern DL say that all infinities are the same?
Wrong again. When you apply statistical learning over a large enough dataset, the wrong answers simply become random normal noise (a consequence of the central limit theorem) - the kind of noise which deep learning has always excelled at filtering out, long before LLMs were a thing - and the truth becomes a constant offset. If you have thousands of pictures of dogs and cats and some were incorrectly labelled, you can still train a perfectly good classifier that will achieve more or less 100% accuracy (and even beat humans) on validation sets. It doesn't matter if a bunch of drunk labellers tainted the ground truth as long as the dataset is big enough. That was the state of DL 10 years ago. Today's models can do a lot more than that. You don't need infinite datasets; they just need to be large enough and cover your domain well.
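To make the label-noise point concrete, here is a minimal sketch; the synthetic two-class data and the 20% flip rate are my own illustration, not figures from any particular paper or dataset:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Two reasonably separable classes standing in for "dogs" and "cats".
    X, y = make_classification(n_samples=20000, n_features=20,
                               n_informative=10, class_sep=2.0,
                               random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # Corrupt 20% of the training labels at random ("drunk labellers").
    rng = np.random.default_rng(0)
    flip = rng.random(len(y_train)) < 0.2
    y_noisy = np.where(flip, 1 - y_train, y_train)

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_noisy)
    print("accuracy on the clean test set:", clf.score(X_test, y_test))
    # With enough data the random flips average out and test accuracy
    # stays close to what clean labels would give.

Because the flips are symmetric random noise, the decision boundary learned from a large noisy training set is essentially the same one that clean labels would produce.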
> You don't need infinite datasets; they just need to be large enough and cover your domain well.
When you are talking about distinguishing noise from signal, or truth from not-totally-truth, and the domain is sufficiently small, e.g. a game like Othello or data from a corporation, then I agree with everything in your comment.
When the domain is huge, distinguishing truth from lies/non-truth/not-totally-truth is impossible. There will never be such a high-quality dataset, because everything changes over time; truth and lies are a moving target.
If we humans cannot distinguish between truth and non-truth but the A.I. can, then we are talking about AGI. Then we can put the machines to work discovering new laws of physics. I am all for it; I just don't see it happening anytime soon.
What you're talking about are, by definition, no longer facts but opinions. Even AGI won't be able to turn opinions into facts. But LLMs are already very good at giving opinions rather than facts, thanks to alignment training.