I think the answer depends on how many documents you have. Thinking in terms of tokens (assuming a page is roughly 750-1000 tokens), if you have a good estimate of the number of pages you want to query over, you can decide on the approach. Three popular approaches:
1. RAG: Most popular, and works really well on smaller datasets. It is limited by the number of vectors/embeddings; a typical embedded chunk might be around 1,000 tokens. LlamaIndex did a lot of engineering on this and their techniques work pretty well. The problem with large datasets is almost always that users don't like writing long prompts/queries, so the answers come back more generic. (A rough retrieval sketch follows this list.)
2. Finetuning + RAG: You can finetune a model on the expected outputs. If your dataset contains knowledge that is already on the open internet (blog posts, articles, anything non-proprietary), then finetuning works really well in combination with RAG, especially for large datasets. It may not work if you are dealing with proprietary knowledge that is hard to find on the open internet.
3. Continual pretraining: For very large datasets, and when the knowledge is proprietary. I talked to a firm with 70GB worth of data; no way a RAG pipeline would give them results, and they are struggling to get LLMs to work for them. That needs a model trained on their data with instruction tuning on top of it. Most likely you won't need to do this.
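To make the RAG option in point 1 concrete, here's a minimal retrieval sketch, assuming sentence-transformers and numpy; the model name, chunk size, and example query are illustrative choices, not anyone's production setup:

```python
# Minimal RAG retrieval sketch: chunk documents, embed the chunks,
# and pull the top-k most similar chunks into the prompt as context.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def chunk(text, max_words=750):
    """Split a document into roughly page-sized chunks (~750-1000 tokens)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

docs = ["... your documents here ..."]
chunks = [c for d in docs for c in chunk(d)]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

def retrieve(query, k=4):
    """Return the k chunks most similar to the query (cosine similarity)."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q          # cosine similarity, since vectors are normalized
    top = np.argsort(-scores)[:k]
    return [(chunks[i], float(scores[i])) for i in top]

# the retrieved chunks get pasted into the LLM prompt as context
context = "\n\n".join(c for c, _ in retrieve("What does clause 7 say about refunds?"))
```

The answer quality then depends almost entirely on how well the user's (usually short) query matches the right chunks, which is the "generic answers" problem above.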
KGs in the intuitive 90s sense are fundamentally worse, but with LLM-era rethinking, potentially useful:
- KG DBs have the same retrieval-scale problem, and are often built on top of the same databases
- traditional KG mining would still use LLMs, except instead of storing rich text ("in a clearly friendly and playful manner, the dog chased the cat") or less-rich high-dimensional chunk embeddings of the same, the content gets discretized down to much lossier RDF triples like (dog, chased, cat). There are uses, like building an entity index, but I wouldn't use it for core RAG operations like answering questions accurately
- KGs can be useful in the sense of, after chunking & embedding, adding *additional* linkages, such as building summarization hierarchies and then linking related summaries back to their source citation chunks. Runtimes can then look up not just similar chunks, but also those that are a logical hop away, even when the embeddings are dissimilar, and via pre-summarization incorporate a lot more insight (rough sketch below). Though that's not a traditional KG, it highlights the need for non-vector linkage tracking.
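A rough sketch of that linkage idea, not a traditional KG and not any particular product's API; summarize() is a stand-in for an LLM call, and the grouping of chunks is deliberately naive:

```python
# Build summary nodes over groups of chunks and keep edges back to the source
# chunks, so retrieval can hop from a matched summary (or chunk) to its neighbors.
from collections import defaultdict

def summarize(texts):
    # placeholder for an LLM summarization call
    return "SUMMARY OF: " + " | ".join(t[:40] for t in texts)

source_texts = ["text of chunk 0", "text of chunk 1", "text of chunk 2",
                "text of chunk 3", "text of chunk 4", "text of chunk 5"]  # your chunked corpus
chunks = {f"chunk_{i}": text for i, text in enumerate(source_texts)}

edges = defaultdict(set)          # node_id -> linked node_ids (non-vector linkage)
summaries = {}
chunk_ids = list(chunks)
for g, start in enumerate(range(0, len(chunk_ids), 5)):   # naive grouping, 5 chunks at a time
    group = chunk_ids[start:start + 5]
    sid = f"summary_{g}"
    summaries[sid] = summarize([chunks[c] for c in group])
    for cid in group:
        edges[sid].add(cid)       # summary -> source chunks (citations)
        edges[cid].add(sid)       # chunk -> its summary

def expand(hit_ids):
    """Given vector-search hits, also return nodes one logical hop away."""
    out = set(hit_ids)
    for nid in hit_ids:
        out |= edges[nid]
    return out
```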
We are working on real-time, large-scale projects like teaching LLMs to understand the news as it breaks as part of louie.ai, and if folks have projects like that, happy to chat as we figure out our Q1+Q2 cohorts. It's a fascinating time -- we had to basically scrap our pre-2023 stack here because of the significant advances, and it's been amazing being able to tackle much harder problems.
Some caution here. Not everything needs to go into a RAG pipeline (e.g., a database table would not necessarily need to be embedded, but its schema should be; see the sketch below). There will be a lot of repetition, lots of junk and useless data, and parsing through numerical data will be a pain. Then comes how the users behave: you need a longer query to get accurate results, but most non-tech users would rather write shorter queries and expect the technology to read their mind (it's a human issue, not a tech issue).
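As a sketch of the schema-not-rows point, assuming sentence-transformers; the table names and descriptions are made up:

```python
# Index a short natural-language description of each table rather than millions
# of numeric rows, and let the LLM write SQL against whichever table matches.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

schemas = {
    "orders":  "Table orders(order_id, customer_id, order_date, total_amount): one row per customer order.",
    "refunds": "Table refunds(refund_id, order_id, amount, reason, refund_date): one row per refund issued.",
}
names = list(schemas)
schema_vecs = embedder.encode([schemas[n] for n in names], normalize_embeddings=True)

def pick_table(question):
    """Return the table whose schema description best matches the question."""
    q = embedder.encode([question], normalize_embeddings=True)[0]
    return names[int((schema_vecs @ q).argmax())]

# e.g. pick_table("How much did we refund customers last quarter?") -> "refunds";
# the LLM then generates SQL against that table instead of reading raw rows.
```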
A simpler way here is to just train the model unsupervised so all the knowledge is in the model, and then instruction-tune it on the use cases you want (a rough sketch of that route follows). It's simpler from a human-effort perspective. Somewhat costly, though the cost of storing that many vectors could exceed the cost of training the model itself. Everything else requires a lot of custom effort. Knowledge-graph augmentation is probably the next step in the hype cycle, but it does not solve the fundamental human problem of wanting to type fewer characters. (Training does: changing 1-2 keywords does the trick if the generic query does not get the answer; see how ChatGPT changes its answers when you tweak your prompt a bit.) In a way RAG is an engineering solution to what is basically a data problem. It works for many cases, but when it does not, people will have to solve it via data science.
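A hedged sketch of that train-then-instruction-tune route with Hugging Face Transformers; the base model, hyperparameters, and corpus below are placeholders meant only to show the shape of the loop:

```python
# Continued causal-LM pretraining on a domain corpus; instruction tuning is a
# second pass of the same loop over (instruction, response) pairs.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "gpt2"                                      # stand-in; in practice a much larger open base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base)

domain_texts = ["internal report text ...", "more proprietary text ..."]  # your cleaned corpus
ds = Dataset.from_dict({"text": domain_texts})
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True, max_length=1024),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="continued-pretrain",
                           num_train_epochs=1,
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM objective
)
trainer.train()
```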
> Wow so RAG is basically a toy for demos and low effort MVPs
I would not say it's for demos or low-effort MVPs. Many companies won't have that amount of data, and you can also segregate it by team: e.g., customer support has one, sales has one, product has one. A golden use case is parsing user docs. We created one for GST queries in India that works quite well [1]. It's a search engine, but it points to the right docs at the source when you ask about any clause. It's useful for CAs only and addresses a very narrow use case (it's a market need, as the notifications are published as PDFs and not indexed by Google).
"Toy" is the wrong word to describe it but it seems like another order of magnitude or two increase in context size will solve all their problems.
On the other hand, I've got a terabyte of text extracted from LibGen - let's say I can ignore the half that is fiction and dedupe the rest further by 80% - that's still 100GB. On top of that I've got 300GB of text extracted from court documents, and that's just from California! I haven't even downloaded the federal dump yet, let alone the other 49 states. Even if I limited myself to just the US Code and the Code of Federal Regulations, that's hundreds of millions of tokens of very dense text. Embedding-based RAG has been pretty much useless in each of these cases, but maybe I just suck at implementing the retrieval part.
What's the data size on the GST search engine? Do you have any examples of more complex queries?
The only thing that has been even remotely useful for tackling the kind of questions I want to ask of my data sources is having the LLM generate search queries, navigate the knowledge graph, and rank the relevance of retrieved snippets, but that often takes dozens if not hundreds of LLM calls, which is incredibly slow and expensive.
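For what it's worth, the loop I mean looks roughly like this; llm() and search() below are trivial stand-ins for the model call and the index (vector store, KG traversal, keyword search), so only the structure matters:

```python
def llm(prompt):
    return f"<model output for: {prompt[:60]}...>"   # stand-in for a real LLM call

def search(query, k=10):
    return [f"<snippet {i} for '{query[:40]}'>" for i in range(k)]  # stand-in retrieval

def answer(question, max_rounds=5, call_budget=50):
    calls, notes = 0, []
    for _ in range(max_rounds):
        # 1. ask the LLM what to look up next, given the notes gathered so far
        query = llm(f"Question: {question}\nNotes so far: {notes}\n"
                    "Write the single best search query to run next.")
        calls += 1
        # 2. retrieve candidate snippets (vector hit, KG hop, keyword match, ...)
        snippets = search(query)
        # 3. have the LLM keep only the snippets that actually help
        kept = llm(f"Question: {question}\nSnippets: {snippets}\n"
                   "Return only the snippets relevant to the question.")
        calls += 1
        notes.append(kept)
        if calls >= call_budget:
            break
    # 4. final synthesis with citations back to the kept snippets
    return llm(f"Question: {question}\nEvidence: {notes}\nAnswer with citations."), calls
```

Even with a tight call budget, the per-question cost adds up quickly, which is the slow-and-expensive part.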
Forgive a relative layman chiming in, but isn't the legal corpus already pre-chunked in various forms, like section/para/etc., e.g. 18 U.S. Code § 371? It seems that you could slice up the data, RAG from the slices, then connect something like Mixtral's so-called "mixture of experts" (MoE, i.e. 8x7b) for combinations.
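Roughly what I was imagining for the slicing, sketched with a crude regex over US Code-style section headings; it's an illustration, not a real parser of the official format:

```python
# Slice a statute dump on its existing structure so each RAG chunk is one
# citable unit (e.g. "18 U.S.C. § 371" -> section text, ready to embed).
import re

SECTION_RE = re.compile(r"(?m)^(§\s?\d+[A-Za-z\-]*\.)")   # e.g. "§ 371."

def slice_sections(title_text, title_label="18 U.S.C."):
    parts = SECTION_RE.split(title_text)
    # parts alternates: [preamble, "§ 371.", body, "§ 372.", body, ...]
    sections = {}
    for heading, body in zip(parts[1::2], parts[2::2]):
        sections[f"{title_label} {heading.rstrip('.')}"] = body.strip()
    return sections
```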
Word of warning: we've done the "slicing" thing with aerospace-data LLMing - we had a similar problem to yours, so we just made RAGs for each functional system (Fuel, Engine, Model for maintenance logs per 14 CFR 43, 91, etc.) based on some simple filename filters - but not the MoE thing. Sigh.
Stoppage was not due to failure but due to... let's say, lack of interest. No one wants to solve the problem in-house, but at the same time, no one's allowed to use any cloud-based LLM solutions off the shelf. Far easier to sit on one's hands and wait for the program to yell at you.
As a human would you read 100GB of data all at once?
Or would you read it bit by bit, taking notes and summarising as you went along, then compile your notes/summaries into a final report?
Because I don't see why we expect these models to be so superhuman when a 100K context would already be considered superhuman memory.
Imagine me regurgitating 100k tokens' worth of dialogue at you and expecting you to take into account everything I said. I know I couldn't do it, ha ha.
As a human would you do tens of billions of multiplies and additions per second? Store tens of thousands of books on something the size of a finger nail and recall them with perfect fidelity every time? Communicate with another human via optical signals using thousand mile long optical fiber across the entire Pacific ocean? Eat electricity instead of food? Project images from your eyes? Can you stick an audio cable in your butt to power speakers?
I need it to be able to cite answers and explore surrounding context, and while it might have been trained on LibGen, that doesn't mean it "internalized" all the data, let alone enough of it to be useful.
Most use cases that actually require this much data are probably best solved by more traditional ML architectures (i.e. classification).
LLMs work best on use cases where the working context is the length of a short research paper (or less). Building with LLMs is mostly an exercise in application engineering: how to get them the most relevant context at the right time, and how to narrow their scope to produce reliable outputs.
Fine-tuning can help specialize the model to perform better, but AFAIK the training sets are relatively small (in big-data terms).
With a large amount of data, a large amount of it can be "relevant" to a loose query.
I think in those situations it's fine to use a model with an extra-large context and keep the similarity (and other) filters quite tight.
Developing it to realise when there are too many results, and to prompt the user to clarify or be more specific, would help.
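Something shaped like this, where retrieve_with_scores() is a stand-in for whatever vector search is already in place and the threshold/cap numbers are arbitrary:

```python
def retrieve_with_scores(query):
    # stand-in for your vector search, returning (chunk, similarity) pairs
    return [("some chunk of text", 0.81), ("another chunk", 0.62)]

def retrieve_or_clarify(query, sim_threshold=0.75, max_hits=20):
    hits = [(chunk, score) for chunk, score in retrieve_with_scores(query)
            if score >= sim_threshold]            # keep the similarity filter tight
    if len(hits) > max_hits:
        # too many passages cleared the bar: ask the user to narrow it down
        return {"action": "clarify",
                "message": f"{len(hits)} passages matched. Can you be more specific "
                           "(date range, product, document type)?"}
    return {"action": "answer", "context": [chunk for chunk, _ in hits]}
```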
Companies that want to trawl data like this can just deal with it and pay for hardware that can run a model with >100k context.
If >all< of the 70GB of data is meant to be relevant, i.e. "summarise all financial activity over 5 years into one report", then well... it has to be developed to do what a human would. A 100k context already far exceeds what a human brain is capable of "keeping in your head" IMO; you just need multiple steps to summarise, take notes, and compress the overall data down smaller and smaller with each pass until it's manageable in a single 100k query.
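i.e. something shaped like this, where llm() is a stand-in for the summarisation call and the word budget is a crude proxy for a real token count:

```python
def llm(prompt):
    return f"<summary of {len(prompt)} characters of input>"   # stand-in for a real call

def fits(texts, budget_words=60_000):          # crude proxy for ~100k tokens of context
    return sum(len(t.split()) for t in texts) <= budget_words

def compress(chunks, group_size=20):
    """Map-reduce style: summarise groups of chunks, then summarise the summaries,
    repeating until everything fits in a single context window."""
    layer = list(chunks)
    while not fits(layer):
        layer = [llm("Summarise, keeping key figures and dates:\n\n" +
                     "\n\n".join(layer[i:i + group_size]))
                 for i in range(0, len(layer), group_size)]
    return llm("Compile a final report from these notes:\n\n" + "\n\n".join(layer))
```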
I have a _small_ e-commerce company and we have >300GB. Most of that bulk is photos and videos, but in an ideal world I’d like my AI assistant to find that stuff too: “I’m making a Boxing Day ad campaign. Can you show me the ads that we’ve made in previous years and all of the photos that we’ve taken of our new Reindeer and Elf designs?”
That can be done with ImageBind from Meta (it embeds text, images, video, and audio in the same vector space). I would want to explore this, if possible, just for a POC if you are okay with it. Would you be interested?
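A rough sketch of how that could look, based on the example in the facebookresearch/ImageBind README; module paths and helper names may have drifted since, so treat it as a shape rather than a verified recipe, and the file paths are made up:

```python
# Cross-modal search: embed text prompts and images into one space, then rank
# images by similarity to a text query.
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)

texts = ["Boxing Day ad campaign", "reindeer design product photo", "elf design product photo"]
image_paths = ["assets/ads/boxing_day_2022.jpg", "assets/products/reindeer_mug.jpg"]  # example paths

inputs = {
    ModalityType.TEXT: data.load_and_transform_text(texts, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
}
with torch.no_grad():
    emb = model(inputs)

# text and image embeddings share one space, so a text query can rank the images
scores = emb[ModalityType.VISION] @ emb[ModalityType.TEXT].T
print(scores)   # higher score = image more similar to that text prompt
```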