
> This would mean there is an "official" source of all web data. LLM people can use snapshots of this

That already exists; it's called Common Crawl:

https://commoncrawl.org/
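For anyone wondering what "use snapshots of this" looks like in practice: each crawl is published as a set of WARC files, with the file paths listed in a gzipped manifest. A minimal Python sketch, assuming the crawl label CC-MAIN-2024-10 as an example (check commoncrawl.org for the list of published crawls):

  import gzip
  import urllib.request

  # Example crawl label; see commoncrawl.org for the published crawl list.
  CRAWL = "CC-MAIN-2024-10"
  MANIFEST = f"https://data.commoncrawl.org/crawl-data/{CRAWL}/warc.paths.gz"

  # The manifest is gzipped text with one relative WARC path per line.
  with urllib.request.urlopen(MANIFEST) as resp:
      paths = gzip.decompress(resp.read()).decode().splitlines()

  print(len(paths), "WARC files in this crawl")
  print("first file: https://data.commoncrawl.org/" + paths[0])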



Common Crawl, while a massive dataset of the web, does not represent the entirety of the web.

It's smaller than Google's index, and Google's index does not represent the entirety of the web either.

For LLM training purposes this may or may not matter, since it does cover a large amount of the web. It's hard to prove scientifically whether the additional data would train a better model, because no one (AFAIK), not Google, not Common Crawl, not Facebook, not the Internet Archive, has a copy that holds the entirety of the currently accessible web (let alone dead links). Even with decent Google-fu, I'm often surprised at how many pages I know exist, including ones by famous authors, that just don't appear in Google's index, Common Crawl, or IA.
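One way to spot-check coverage is the Common Crawl index API: query a crawl's CDX endpoint for a URL and see whether any captures come back. A rough sketch, with CC-MAIN-2024-10 and example.com as placeholders (the index appears to return 404 when there are no matches):

  import json
  import urllib.error
  import urllib.parse
  import urllib.request

  def in_common_crawl(url, crawl="CC-MAIN-2024-10"):
      # Ask one crawl's CDX index for captures of this URL.
      api = (f"https://index.commoncrawl.org/{crawl}-index?"
             + urllib.parse.urlencode({"url": url, "output": "json"}))
      try:
          with urllib.request.urlopen(api) as resp:
              lines = resp.read().decode().splitlines()
          return [json.loads(line) for line in lines]
      except urllib.error.HTTPError as e:
          if e.code == 404:  # no captures for this URL in this crawl
              return []
          raise

  captures = in_common_crawl("https://example.com/")
  print(len(captures), "captures found")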


Is there any way to find patterns in what doesn't make it into Common Crawl, and perhaps help them become more comprehensive?

Hopefully it's not sites intentionally allowing the Google crawler while excluding Common Crawl's crawler via robots.txt?
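That part is at least checkable per site: fetch robots.txt and compare what Googlebot and CCBot (Common Crawl's crawler) are allowed to fetch. A quick sketch with the standard library, using example.com as a placeholder:

  import urllib.robotparser

  def compare_crawlers(site, path="/"):
      # Parse the site's robots.txt and test the same path for both agents.
      rp = urllib.robotparser.RobotFileParser()
      rp.set_url(site.rstrip("/") + "/robots.txt")
      rp.read()
      url = site.rstrip("/") + path
      return {
          "Googlebot": rp.can_fetch("Googlebot", url),
          "CCBot": rp.can_fetch("CCBot", url),
      }

  # A site that allows Google but blocks Common Crawl would show
  # {'Googlebot': True, 'CCBot': False}.
  print(compare_crawlers("https://example.com"))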


Cool! I will check it out



