
OLMo uses open datasets, such as CommonCrawl and StackOverflow, for training, about 5TB worth of text. I wonder how well it would perform if it were also trained on Anna's Archive/LibGen (>600TB).


A possibly better question could be how well it would perform if it was trained on selected material - see the efforts of Mortimer Adler in the USA, or the efforts of any good publishing house in the definition of editorial collections.

But I remain skeptical that the barrier of "conformism" will ever be broken without critical thinking as a condition for writing into "conscious" memory.


Not a lawyer, but I would assume downloading material from LibGen is, in the vast majority of cases, illegal because it's a breach of copyright or similar. That's landed Meta in quite a spectacle of late [1].

[1] https://www.loeb.com/en/insights/publications/2023/12/richar...


CommonCrawl is composed of copyrighted content too. You gain copyright on your work automatically the moment you create it, including this very comment.


What if I repost your comment without your permission?


One could argue that using copyrighted content in LLMs, much like reposting, should fall under fair use. This is also Microsoft's claim in the GitHub Copilot lawsuits. It's up to the court to decide though. (IANAL)


In many jurisdictions it's just sharing that is illegal, not obtaining.


Yes. The interesting legal question is whether you are sharing the original knowledge if you've transformed it by teaching it to an AI.

https://www.reuters.com/legal/litigation/ai-companies-lose-b... reports on the ongoing case on the image generation side of the fence.


That is called copyright laundering FYI.


It’s a catchy term, but loaded. Copyright protects only original expression, not ideas and information. So if a computer algorithm reads the former and outputs the latter, arguably copyright isn’t involved at all.

There are plenty of good counterarguments to this as well, when you consider the effects of automation and scale. I’m definitely interested in seeing how the jurisprudence develops as these cases go through the courts.



