OLMo uses open datasets, such as CommonCrawl and StackOverflow, for training, about 5 TB worth of text. I wonder how well it would perform if it were also trained on Anna's Archive/LibGen (>600 TB).
A better question might be how well it would perform if it were trained on carefully selected material - see the efforts of Mortimer Adler in the USA, or those of any good publishing house in curating editorial collections.
But I remain skeptical that, without "critical thinking as a condition for writing into 'conscious' memory," the barrier of conformism will ever be broken.
Not a lawyer, but I would assume downloading material from LibGen is, in the vast majority of cases, illegal because it's a breach of copyright or similar. That's gotten Meta into quite a spectacle of late [1].
CommonCrawl is composed of copyrighted content too. You gain copyright on your work automatically the moment you create it, including this very comment.
One could argue that using copyrighted content in LLMs, much like reposting, should fall under fair use. This is also Microsoft's claim in the GitHub Copilot lawsuits. It's up to the courts to decide, though. (IANAL)
It’s a catchy term, but loaded. Copyright protects only original expression, not ideas and information. So if a computer algorithm reads the former and outputs the latter, arguably copyright isn’t involved at all.
There are plenty of good counterarguments to this as well, when you consider the effects of automation and scale. I’m definitely interested in seeing how the jurisprudence develops as these cases go through the courts.