I think the answer depends on how many documents you have. Thinking in terms of tokens (assuming a page is roughly 750-1000 tokens), if you have a good estimate of the number of pages you want to query over, you can decide on the approach. Three popular approaches:
1. RAG: Most popular, and works really well on smaller datasets. It is limited by the number of vectors/embeddings; a typical embedded chunk might be around 1,000 tokens. LlamaIndex did a lot of engineering on this and their techniques work pretty well. The problem with large datasets is almost always that users don't like writing long prompts/queries, so the answers come back more generic. (A rough retrieval sketch follows this list.)
2. Finetuning + RAG: You can finetune a model on the expected outputs. If your dataset contains knowledge that is already on the open internet (blog posts, articles, anything non-proprietary), then finetuning works really well in combination with RAG, especially for large datasets. It may not work if you are dealing with proprietary knowledge that is hard to find on the open internet.
3. Continual pretraining: For very large datasets, and when the knowledge is proprietary. I talked to a firm with 70GB worth of data; no way a RAG pipeline would give them results, and they are struggling to get LLMs to work for them. That needs a model trained on their data with instruction tuning on top of it. Most likely you won't need to do this.
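To make the RAG option in point 1 concrete, here's a minimal retrieval sketch, assuming sentence-transformers and numpy; the model name, chunk size, and example query are illustrative choices, not anyone's production setup:

```python
# Minimal RAG retrieval sketch: chunk documents, embed the chunks,
# and pull the top-k most similar chunks into the prompt as context.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def chunk(text, max_words=750):
    """Split a document into roughly page-sized chunks (~750-1000 tokens)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

docs = ["... your documents here ..."]
chunks = [c for d in docs for c in chunk(d)]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

def retrieve(query, k=4):
    """Return the k chunks most similar to the query (cosine similarity)."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q          # cosine similarity, since vectors are normalized
    top = np.argsort(-scores)[:k]
    return [(chunks[i], float(scores[i])) for i in top]

# the retrieved chunks get pasted into the LLM prompt as context
context = "\n\n".join(c for c, _ in retrieve("What does clause 7 say about refunds?"))
```

The answer quality then depends almost entirely on how well the user's (usually short) query matches the right chunks, which is the "generic answers" problem above.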
KGs in the intuitive 90s sense are fundamentally worse, but with LLM-era rethinking, potentially useful:
- KG DBs have the same retrieval-scale problem, and are often built on top of the same databases
- traditional KG mining would still use LLMs, except instead of storing rich text ("in a clearly friendly and playful manner, the dog chased the cat") or less-rich high-dimensional chunk embeddings of the same, the content gets discretized down to much lossier RDF triples like (dog, chased, cat). There are uses, like building an entity index, but I wouldn't use it for core RAG operations like answering questions accurately
- KGs can be useful in the sense of, after chunking & embedding, adding *additional* linkages, such as building summarization hierarchies and then linking related summaries back to their source citation chunks. Runtimes can then look up not just similar chunks, but also those that are a logical hop away, even when the embeddings are dissimilar, and via pre-summarization incorporate a lot more insight (rough sketch below). Though that's not a traditional KG, it highlights the need for non-vector linkage tracking.
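A rough sketch of that linkage idea, not a traditional KG and not any particular product's API; summarize() is a stand-in for an LLM call, and the grouping of chunks is deliberately naive:

```python
# Build summary nodes over groups of chunks and keep edges back to the source
# chunks, so retrieval can hop from a matched summary (or chunk) to its neighbors.
from collections import defaultdict

def summarize(texts):
    # placeholder for an LLM summarization call
    return "SUMMARY OF: " + " | ".join(t[:40] for t in texts)

source_texts = ["text of chunk 0", "text of chunk 1", "text of chunk 2",
                "text of chunk 3", "text of chunk 4", "text of chunk 5"]  # your chunked corpus
chunks = {f"chunk_{i}": text for i, text in enumerate(source_texts)}

edges = defaultdict(set)          # node_id -> linked node_ids (non-vector linkage)
summaries = {}
chunk_ids = list(chunks)
for g, start in enumerate(range(0, len(chunk_ids), 5)):   # naive grouping, 5 chunks at a time
    group = chunk_ids[start:start + 5]
    sid = f"summary_{g}"
    summaries[sid] = summarize([chunks[c] for c in group])
    for cid in group:
        edges[sid].add(cid)       # summary -> source chunks (citations)
        edges[cid].add(sid)       # chunk -> its summary

def expand(hit_ids):
    """Given vector-search hits, also return nodes one logical hop away."""
    out = set(hit_ids)
    for nid in hit_ids:
        out |= edges[nid]
    return out
```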
We are working on real-time, large-scale projects like teaching LLMs to understand the news as it breaks as part of louie.ai, and if folks have projects like that, happy to chat as we figure out our Q1+Q2 cohorts. It's a fascinating time -- we had to basically scrap our pre-2023 stack here because of the significant advances, and it's been amazing being able to tackle much harder problems.
Some caution here. Not everything needs to go into a RAG pipeline (e.g., a database table would not necessarily need to be embedded, but its schema should be; see the sketch below). There will be a lot of repetition, lots of junk and useless data, and parsing through numerical data will be a pain. Then comes how the users behave: you need a longer query to get accurate results, but most non-tech users would rather write shorter queries and expect the technology to read their mind (it's a human issue, not a tech issue).
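As a sketch of the schema-not-rows point, assuming sentence-transformers; the table names and descriptions are made up:

```python
# Index a short natural-language description of each table rather than millions
# of numeric rows, and let the LLM write SQL against whichever table matches.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

schemas = {
    "orders":  "Table orders(order_id, customer_id, order_date, total_amount): one row per customer order.",
    "refunds": "Table refunds(refund_id, order_id, amount, reason, refund_date): one row per refund issued.",
}
names = list(schemas)
schema_vecs = embedder.encode([schemas[n] for n in names], normalize_embeddings=True)

def pick_table(question):
    """Return the table whose schema description best matches the question."""
    q = embedder.encode([question], normalize_embeddings=True)[0]
    return names[int((schema_vecs @ q).argmax())]

# e.g. pick_table("How much did we refund customers last quarter?") -> "refunds";
# the LLM then generates SQL against that table instead of reading raw rows.
```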
A simpler way here is to just train the model unsupervised so all the knowledge is in the model, and then instruction-tune it on the use cases you want (a rough sketch of that route follows). It's simpler from a human-effort perspective. Somewhat costly, though the cost of storing that many vectors could exceed the cost of training the model itself. Everything else requires a lot of custom effort. Knowledge-graph augmentation is probably the next step in the hype cycle, but it does not solve the fundamental human problem of wanting to type fewer characters. (Training does: changing 1-2 keywords does the trick if the generic query does not get the answer; see how ChatGPT changes its answers when you tweak your prompt a bit.) In a way RAG is an engineering solution to what is basically a data problem. It works for many cases, but when it does not, people will have to solve it via data science.
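A hedged sketch of that train-then-instruction-tune route with Hugging Face Transformers; the base model, hyperparameters, and corpus below are placeholders meant only to show the shape of the loop:

```python
# Continued causal-LM pretraining on a domain corpus; instruction tuning is a
# second pass of the same loop over (instruction, response) pairs.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "gpt2"                                      # stand-in; in practice a much larger open base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base)

domain_texts = ["internal report text ...", "more proprietary text ..."]  # your cleaned corpus
ds = Dataset.from_dict({"text": domain_texts})
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True, max_length=1024),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="continued-pretrain",
                           num_train_epochs=1,
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM objective
)
trainer.train()
```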
> Wow so RAG is basically a toy for demos and low effort MVPs
I would not say it's for demos or low-effort MVPs. Many companies won't have that amount of data, and you can also segregate it by team: e.g., customer support has one, sales has one, product has one. A golden use case is parsing user docs. We created one for GST queries in India that works quite well [1]. It's a search engine, but it points to the right docs at the source when you ask about any clause. It's useful for CAs only and addresses a very narrow use case (it's a market need, as the notifications are published as PDFs and not indexed by Google).
"Toy" is the wrong word to describe it but it seems like another order of magnitude or two increase in context size will solve all their problems.
On the other hand, I've got a terabyte of text extracted from LibGen - let's say I can ignore the half that is fiction and dedupe the rest further by 80% - that's still 100GB. On top of that I've got 300GB of text extracted from court documents, and that's just from California! I haven't even downloaded the federal dump yet, let alone the other 49 states. Even if I limited myself to just the US Code and the Code of Federal Regulations, that's hundreds of millions of tokens of very dense text. Embedding-based RAG has been pretty much useless in each of these cases, but maybe I just suck at implementing the retrieval part.
What's the data size on the GST search engine? Do you have any examples of more complex queries?
The only thing that has been even remotely useful for tackling the kind of questions I want to ask of my data sources is having the LLM generate search queries, navigate the knowledge graph, and rank the relevance of retrieved snippets, but that often takes dozens if not hundreds of LLM calls, which is incredibly slow and expensive.
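For what it's worth, the loop I mean looks roughly like this; llm() and search() below are trivial stand-ins for the model call and the index (vector store, KG traversal, keyword search), so only the structure matters:

```python
def llm(prompt):
    return f"<model output for: {prompt[:60]}...>"   # stand-in for a real LLM call

def search(query, k=10):
    return [f"<snippet {i} for '{query[:40]}'>" for i in range(k)]  # stand-in retrieval

def answer(question, max_rounds=5, call_budget=50):
    calls, notes = 0, []
    for _ in range(max_rounds):
        # 1. ask the LLM what to look up next, given the notes gathered so far
        query = llm(f"Question: {question}\nNotes so far: {notes}\n"
                    "Write the single best search query to run next.")
        calls += 1
        # 2. retrieve candidate snippets (vector hit, KG hop, keyword match, ...)
        snippets = search(query)
        # 3. have the LLM keep only the snippets that actually help
        kept = llm(f"Question: {question}\nSnippets: {snippets}\n"
                   "Return only the snippets relevant to the question.")
        calls += 1
        notes.append(kept)
        if calls >= call_budget:
            break
    # 4. final synthesis with citations back to the kept snippets
    return llm(f"Question: {question}\nEvidence: {notes}\nAnswer with citations."), calls
```

Even with a tight call budget, the per-question cost adds up quickly, which is the slow-and-expensive part.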
Forgive a relative layman chiming in, but isn't the legal corpus already pre-chunked in various forms, like section/para/etc., e.g. 18 U.S. Code § 371? It seems that you could slice up the data, RAG from the slices, then connect something like Mixtral's so-called "mixture of experts" (MoE, i.e. 8x7b) for combinations.
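Roughly what I was imagining for the slicing, sketched with a crude regex over US Code-style section headings; it's an illustration, not a real parser of the official format:

```python
# Slice a statute dump on its existing structure so each RAG chunk is one
# citable unit (e.g. "18 U.S.C. § 371" -> section text, ready to embed).
import re

SECTION_RE = re.compile(r"(?m)^(§\s?\d+[A-Za-z\-]*\.)")   # e.g. "§ 371."

def slice_sections(title_text, title_label="18 U.S.C."):
    parts = SECTION_RE.split(title_text)
    # parts alternates: [preamble, "§ 371.", body, "§ 372.", body, ...]
    sections = {}
    for heading, body in zip(parts[1::2], parts[2::2]):
        sections[f"{title_label} {heading.rstrip('.')}"] = body.strip()
    return sections
```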
Word of warning: we've done the "slicing" thing with aerospace-data LLMing - we had a similar problem to yours, so we just made RAGs for each functional system (Fuel, Engine, Model for maintenance logs per 14 CFR 43, 91, etc.) based on some simple filename filters - but not the MoE thing. Sigh.
Stoppage was not due to failure but due to... let's say, lack of interest. No one wants to solve the problem in-house, but at the same time, no one's allowed to use any cloud-based LLM solutions off the shelf. Far easier to sit on one's hands and wait for the program to yell at you.
As a human would you read 100GB of data all at once?
Or would you read it bit by bit, taking notes and summarising as you went along, then compile your notes/summaries into a final report?
Because I don't see why we expect these models to be so superhuman when a 100K context would already be considered superhuman memory.
Imagine me regurgitating 100k tokens' worth of dialogue at you and expecting you to take into account everything I said. I know I couldn't do it, ha ha.
As a human would you do tens of billions of multiplies and additions per second? Store tens of thousands of books on something the size of a finger nail and recall them with perfect fidelity every time? Communicate with another human via optical signals using thousand mile long optical fiber across the entire Pacific ocean? Eat electricity instead of food? Project images from your eyes? Can you stick an audio cable in your butt to power speakers?
I need it to be able to cite answers and explore surrounding context, and while it might have been trained on LibGen, that doesn't mean it "internalized" all the data, let alone enough of it to be useful.
Most use cases that actually require this much data are probably best solved by more traditional ML architectures (i.e. classification).
LLMs work best on use cases where the working context is the length of a short research paper (or less). Building with LLMs is mostly an exercise in application engineering: how to get them the most relevant context at the right time, and how to narrow their scope to produce reliable outputs.
Fine-tuning can help specialize the model to perform better, but AFAIK the training sets are relatively small (in big-data terms).
With a large amount of data, a large amount of it can be "relevant" to a loose query.
I think in those situations it's fine to use a model with an extra-large context and keep the similarity (and other) filters quite tight.
Developing it to realise when there are too many results, and to prompt the user to clarify or be more specific, would help.
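Something shaped like this, where retrieve_with_scores() is a stand-in for whatever vector search is already in place and the threshold/cap numbers are arbitrary:

```python
def retrieve_with_scores(query):
    # stand-in for your vector search, returning (chunk, similarity) pairs
    return [("some chunk of text", 0.81), ("another chunk", 0.62)]

def retrieve_or_clarify(query, sim_threshold=0.75, max_hits=20):
    hits = [(chunk, score) for chunk, score in retrieve_with_scores(query)
            if score >= sim_threshold]            # keep the similarity filter tight
    if len(hits) > max_hits:
        # too many passages cleared the bar: ask the user to narrow it down
        return {"action": "clarify",
                "message": f"{len(hits)} passages matched. Can you be more specific "
                           "(date range, product, document type)?"}
    return {"action": "answer", "context": [chunk for chunk, _ in hits]}
```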
Companies that want to trawl data like this can just deal with it and pay for hardware that can run a model with >100k context.
If >all< of the 70GB of data is meant to be relevant, i.e. "summarise all financial activity over 5 years into one report", then well... it has to be developed to do what a human would. A 100k context already far exceeds what a human brain is capable of "keeping in your head" IMO; you just need multiple steps to summarise, take notes, and compress the overall data down smaller and smaller with each pass until it's manageable in a single 100k query.
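i.e. something shaped like this, where llm() is a stand-in for the summarisation call and the word budget is a crude proxy for a real token count:

```python
def llm(prompt):
    return f"<summary of {len(prompt)} characters of input>"   # stand-in for a real call

def fits(texts, budget_words=60_000):          # crude proxy for ~100k tokens of context
    return sum(len(t.split()) for t in texts) <= budget_words

def compress(chunks, group_size=20):
    """Map-reduce style: summarise groups of chunks, then summarise the summaries,
    repeating until everything fits in a single context window."""
    layer = list(chunks)
    while not fits(layer):
        layer = [llm("Summarise, keeping key figures and dates:\n\n" +
                     "\n\n".join(layer[i:i + group_size]))
                 for i in range(0, len(layer), group_size)]
    return llm("Compile a final report from these notes:\n\n" + "\n\n".join(layer))
```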
I have a _small_ e-commerce company and we have >300GB. Most of that bulk is photos and videos, but in an ideal world I’d like my AI assistant to find that stuff too: “I’m making a Boxing Day ad campaign. Can you show me the ads that we’ve made in previous years and all of the photos that we’ve taken of our new Reindeer and Elf designs?”
That can be done with ImageBind from Meta (it embeds text, images, video, and audio in the same vector space). I would want to explore this, if possible, just for a POC if you are okay with it. Would you be interested?
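A rough sketch of how that could look, based on the example in the facebookresearch/ImageBind README; module paths and helper names may have drifted since, so treat it as a shape rather than a verified recipe, and the file paths are made up:

```python
# Cross-modal search: embed text prompts and images into one space, then rank
# images by similarity to a text query.
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)

texts = ["Boxing Day ad campaign", "reindeer design product photo", "elf design product photo"]
image_paths = ["assets/ads/boxing_day_2022.jpg", "assets/products/reindeer_mug.jpg"]  # example paths

inputs = {
    ModalityType.TEXT: data.load_and_transform_text(texts, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
}
with torch.no_grad():
    emb = model(inputs)

# text and image embeddings share one space, so a text query can rank the images
scores = emb[ModalityType.VISION] @ emb[ModalityType.TEXT].T
print(scores)   # higher score = image more similar to that text prompt
```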