OLMo uses open datasets, such as CommonCrawl and StackOverflow, for training, about 5 TB worth of text. I wonder how well it would perform if it were also trained on Anna's Archive/LibGen (>600 TB).
A better question might be how well it would perform if it were trained on carefully selected material - see the efforts of Mortimer Adler in the USA, or those of any good publishing house in curating editorial collections.
But I remain skeptical that, without "critical thinking as a condition for writing into 'conscious' memory," the barrier of conformism will ever be broken.
Not a lawyer, but I would assume downloading material from LibGen is, in the vast majority of cases, illegal because it's a breach of copyright or similar. That's gotten Meta into quite a spectacle of late [1].
CommonCrawl is composed of copyrighted content too. You gain copyright on your work automatically the moment you create it, including this very comment.
One could argue that using copyrighted content in LLMs, much like reposting, should fall under fair use. This is also Microsoft's claim in the GitHub Copilot lawsuits. It's up to the courts to decide, though. (IANAL)
It’s a catchy term, but loaded. Copyright protects only original expression, not ideas and information. So if a computer algorithm reads the former and outputs the latter, arguably copyright isn’t involved at all.
There are plenty of good counterarguments to this as well, when you consider the effects of automation and scale. I’m definitely interested in seeing how the jurisprudence develops as these cases go through the courts.