I'm 99% confident there are currently multiple companies active in the "curated LLM dataset" space, going through heaps of data and organizing it into curated datasets for just that purpose.
But it's a huge undertaking. Google had the objective of indexing all data in the world 20-odd years ago, and that's just putting it all on a big pile; curating it is an even bigger job that can only partially be automated. Compare it with social media moderation, which is a full-time job for tens if not hundreds of thousands of people worldwide, and that's after the automated tools have had their first pass. And that's sort-of real-time, but there's 30+ years of it to go through if you want to curate a dataset (and more if you include pre-internet media).