You don't train on documents. There are many startups claiming that but they are deliberately using a misleading term because they know that's what people are searching for.
You still do RAG. Llamaindex is still the best option that I know of. Most of the startups that have working products are likely using llamaindex. All of the ones that say they are training on documents are actually using RAG.
Test it out. If it really and truly doesn't work, search for a script that creates question and answer pairs automatically with gpt-4. Then try using that for qLoRA. I have never heard of anyone successfully using that for a private document knowledgebase though. Only for skills like math, reasoning, Python, etc. I think the issue is that you need a LOT of data and it needs to repeat concepts or any facts you need to learn many, many times in different supporting ways.
What absolutely does not work is trying to just feed a set of documents into fine tuning. I personally have proven that dozens of times because I have a client who is determined to do it. He has been misled.
What it will do is learn the patterns that are in those documents.
We just held a workshop about this a few weeks ago: https://red.ht/llmappdev
We created a simple chatbot using local models with Ollama (llamacpp), LlamaIndex and streamlit.
Have a look at the streamlit folder, it's super easy.
I used this simple example to teach about RAG, the importance of the system prompt and prompt injection.
The notebook folder has a few more examples, local models can even do natural language SQL querying now.
You probably don't need fine-tuning, at least if it's just new content (and no new instructions). It may even be detrimental, since LLMs are also good at forgetting: https://twitter.com/abacaj/status/1739015011748499772
Good question, as you can see I haven't touched it for a month. I wanted to show what's possible then with open source and (open) local models and there's already so much new stuff out there.
I'll probably fix some things this week and then either update it or start from scratch. Guided generation, structured extraction, function calling and multi-modal are things I wanted to add and chainlit looks interesting.
Retrieval Augmented Generation - in brief, using some kind of search to find documents relevant to the user’s question (often vector DB search, which can search by “meaning”, but also other forms of more traditional search), then injecting those into the prompt to the LLM alongside the question, so it hopefully has facts to refer to (and its “generation” can be “augmented” by documents you’ve “retrieved”, I guess!)
So, as a contrived example, with RAG you make some queries, in some format, like “Who is Sauron?” And then start feeding in what books he’s mentioned in, paragraphs describing him from Tolkien books, things he has done.
Then you start making more specific queries? How old is he, how tall is he, etc.
And the game is you run a “questionnaire AI” that can look at a blob of text, and you ask it “what kind of questions might this paragraph answer”, and then turn around and feed those questions and text back into the system.
Is that a 30,000 foot view really of how this works?
The 3rd paragraph missed the mark but previous ones are in the right ballpark.
You take the user's question and either embed it directly or augment it for embedding (you can, for example, use an LLM to extract keywords from the question), query the vector db containing the data related to the question, and then feed it all to the LLM as: here is a question from the user and here is some data that might be related to it.
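To make the "augment it for embedding" step concrete, here is a rough sketch. It assumes a plain stopword filter as a stand-in for the LLM keyword extraction mentioned above; the names `STOPWORDS`, `extract_keywords` and `augment_for_embedding` are made up for illustration.

```python
import re

# Tiny illustrative stopword list; a real setup might ask an LLM for keywords instead.
STOPWORDS = {"a", "an", "the", "is", "are", "do", "does", "i", "my",
             "which", "what", "who", "how", "of", "to", "in", "have"}

def extract_keywords(question):
    """Lowercase the question, split it into words and drop the stopwords."""
    words = re.findall(r"[a-z0-9']+", question.lower())
    return [w for w in words if w not in STOPWORDS]

def augment_for_embedding(question):
    """Append extracted keywords so the text that gets embedded is more focused."""
    keywords = extract_keywords(question)
    if not keywords:
        return question
    return f"{question} | keywords: {' '.join(keywords)}"

augmented = augment_for_embedding("Which pets do I have?")
print(augmented)
```

The augmented string is what you would hand to the embedding model in place of the raw question.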
Essentially you take any decent model trained on factual information regurgitation, or well any decently well rounded model, a llama 2 variant or something.
Then you craft a prompt for the model along the lines of "you are a helpful assistant, you will provide an answer based on the provided information. If no information matches simply respond with 'I don't know that'".
Then, you take all of your documents and divide them into meaningful chunks, ie by paragraph or something. Then you take these chunks and create embeddings for them. An embedding model is another type of model (not an llm) that generates vectors for strings of text, often based on how similar the words are in _meaning_. Ie if I generate embeddings for the phrase "I have a dog" it might (simplified) be a vector like [0.1,0.2,0.3,0.4]. This vector can be seen as representing a point in a multidimensional space. What an embedding model does with word meaning is something like this: the word "cat" might embed (in one dimension, to keep it simple) as [0.42]. Now, say we want to search for the query "which pets do I have". First we generate embeddings for this phrase, and the word "pet" might be embedded as [0.41]. Because it's based on trained meaning, the vectors for "pet", "cat" and "dog" will be close together in our multidimensional space. We can choose how strict we want to be with this search (basically a limit on how close the vectors need to be in space to count as a match).
Next step is to put this into a vector database, a db designed with vector search operations in mind. We store each chunk, the part of the file it's from and that chunk's embedding vector in the database.
Then, when the LLM is queried, say "which pets do I have?", we first generate embeddings for the query, then we use the embedding vector to query our database for things that match close enough in space to be relevant but loose enough that we get "connected" words. This gives us a bunch of our chunks ranked by how close each chunk's vector is to our query vector in the multidimensional space. We can then take the n highest ranked chunks, concatenate their original text and prepend this to our original LLM query. The LLM then digests this information and responds in natural language.
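A minimal sketch of the chunk-then-embed idea, in Python. Big caveat: `toy_embed` is a made-up stand-in that only counts words, so it measures word overlap, not meaning. A real embedding model returns a dense vector that puts "pet" near "dog" even with no shared words. All function names here are illustrative.

```python
import math
import re
from collections import Counter

def chunk_by_paragraph(text):
    """Split a document into chunks on blank lines, dropping empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def toy_embed(text):
    """Stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse vectors (1.0 means same direction)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc = "I have a dog.\n\nMy dog likes steak.\n\nThe weather is nice today."
chunks = chunk_by_paragraph(doc)
vectors = [toy_embed(c) for c in chunks]

# Score every chunk against the query; higher means closer in the vector space.
query_vec = toy_embed("what does my dog like")
scores = [cosine(query_vec, v) for v in vectors]
for chunk, score in zip(chunks, scores):
    print(f"{score:.3f}  {chunk}")
```

The "how strict do we want to be" knob from above would be a minimum value on these scores.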
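Roughly, the lookup step could look like the sketch below. The "vector db" is just a Python list, and the vectors are numbers I made up by hand, not output from any real embedding model; `ROWS`, `search`, `top_n` and `min_score` are hypothetical names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical in-memory "vector db": (chunk text, where it came from, embedding).
ROWS = [
    ("I have a dog", "notes.txt#1", [0.9, 0.1, 0.0]),
    ("my dog's name is Fenrir", "notes.txt#2", [0.8, 0.3, 0.1]),
    ("the invoice is due on Friday", "mail.txt#7", [0.0, 0.2, 0.9]),
]

def search(query_vec, rows, top_n=2, min_score=0.5):
    """Return up to top_n (score, text, source) hits scoring at least min_score."""
    scored = [(cosine(query_vec, vec), text, source) for text, source, vec in rows]
    hits = [hit for hit in scored if hit[0] >= min_score]
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits[:top_n]

query_vec = [0.9, 0.15, 0.0]  # pretend embedding of "which pets do I have?"
results = search(query_vec, ROWS)

# Concatenate the original text of the best chunks, ready to prepend to the LLM query.
context = "\n".join(text for _, text, _ in results)
print(context)
```

A real vector database does the same ranking with an index so it doesn't have to compare against every row.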
So the query sent to the LLM might be something like: "you are a helpful assistant, you will provide an answer based on the provided information. If no information matches simply respond with 'I don't know that'
Information:I have a dog,my dog likes steak,my dog's name is Fenrir
User query: which pets do I have?"
All under "information" is passed in from the chunked text returned from the vector db. And the response from that LLM query would ofc be something like "You have a dog, its name is Fenrir and it likes steak."
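Putting that prompt together in code is just string formatting. A small sketch, where `build_prompt` is a made-up helper and the chunk list stands in for whatever the vector db returned:

```python
SYSTEM_PROMPT = (
    "you are a helpful assistant, you will provide an answer based on the "
    "provided information. If no information matches simply respond with "
    "'I don't know that'"
)

def build_prompt(chunks, user_query):
    """Join the retrieved chunks and place them between the instructions and the query."""
    information = ",".join(chunks)
    return f"{SYSTEM_PROMPT}\nInformation:{information}\nUser query: {user_query}"

retrieved = ["I have a dog", "my dog likes steak", "my dog's name is Fenrir"]
prompt = build_prompt(retrieved, "which pets do I have?")
print(prompt)
```

The resulting string is what gets sent to the LLM; nothing about the model itself changes.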
Stupid Question: Eli5; Can/Does/Would it make sense to 'cache' (for lack of a better term) a 'memory' of having answered that question.... and so if that question is asked again, it knows that it has answered it in the past, and can/does better?
(Seems like this is what reinforcement training is, but I am just not sure? Everything seems to mush together when talking about gpts logic)
You can decide to store whatever you like in the vector database.
For example you can have a table of "knowledge" as I described earlier, but you can just as easily have a table of the conversation history, or have both.
In fact it's quite popular afaik to store the conversation this way because then if you query on a topic you've queried before, even if the conversation history has gone beyond the size of the context, it can still retrieve that history. So yes, what you describe is a good idea/would work/is being done.
It really all comes down to the non model logic/regular programming of how your vector db is queried and how you mix those query results in with the user's query to the LLM.
For example you could embed their query as I described, then search the conversation history + general information storage in the vector db and mix the results. You can even feed it back into itself in a multi step process a la "agents" where your "thought process" takes the user query and breaks it down further by querying the LLM with a different prompt; instead of "you are a helpful assistant" it can be "you have x categories of information in the database, given query {query} specify what data to be extracted for further processing" obv that's a fake general idea prompt but I hope you understand.
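As a sketch of the "mix the results" part: assume each store has already been searched and returned (score, text) pairs. The scores below are invented, and `mix_results` and `ROUTER_PROMPT` are illustrative names, with the prompt text being the same fake general-idea prompt as above.

```python
import heapq

def mix_results(knowledge_hits, history_hits, top_n=3):
    """Merge hits from both stores, tag where each came from, keep the best top_n."""
    tagged = [(score, "knowledge", text) for score, text in knowledge_hits]
    tagged += [(score, "history", text) for score, text in history_hits]
    return heapq.nlargest(top_n, tagged, key=lambda hit: hit[0])

# Pretend search results from the two tables in the vector db.
knowledge_hits = [(0.91, "I have a dog"), (0.40, "the invoice is due on Friday")]
history_hits = [(0.88, "user asked about Fenrir's food yesterday"),
                (0.75, "assistant said the dog likes steak")]
mixed = mix_results(knowledge_hits, history_hits)

# Prompt for the intermediate "thought process" step of a multi step pipeline.
ROUTER_PROMPT = ("you have {x} categories of information in the database, "
                 "given query {query} specify what data to be extracted "
                 "for further processing")
router_prompt = ROUTER_PROMPT.format(x=2, query="which pets do I have?")

for score, origin, text in mixed:
    print(f"{score:.2f} [{origin}] {text}")
```

Tagging the origin lets the final prompt say which lines are facts and which are earlier conversation.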
Well there's technically no model training involved here, but I guess you could consider the corpus of conversation data a kind of training, and yeah that would be RLHF based, which LLMs lean pretty heavily on afaik (I've not fine tuned my own yet).
You can fine tune models to be better at certain things or respond in certain ways. This is usually done via a kind of reinforcement learning (with human feedback...idk why it's called this, any human feedback is surely just supervised learning right?). This is useful, for example, to take a model trained on all kinds of text from everywhere, then fine tune it on text from scifi novels to make it particularly good at writing scifi.
A fine tune I would say is more the "personality" of the underlying LLM. Saying this, you can ask an LLM to play a character, but the underlying "personality" of the LLM is still manufacturing said character.
Vector databases are more for knowledge storage, as if your LLM personality had a table full of open books in front of them; world atlases, a notebook of the conversation you've been having, etc.
Eg, personality: LLM fine tune on all David Attenborough narration = personality like a biologist/natural historian
Knowledge base = chunks of text from scientific papers on chemistry + chunks of the current conversation
Which with some clever vector db queries/feeding back into model = bot that talks like Attenboroughish but knows about chemistry.
Tbf, for the feedback model it's better to use something strict, ie an instruct based model, bc your internal thought steps are heavily goal orientated. All of the personality can be added in the final step using your fine tune.
It fascinates me how much variance there is in people's searching skills.
some people think they are talking to a person when searching e.g 'what is the best way that i can {action}'
I think the number one trick is to forget grammar and other language niceties and just enter concepts e.g. 'clean car best'
I used to do this. Then when Google's search results started declining in quality, I often found it better to search by what the average user would probably write.
Over the last couple of years, at least with Google, I've found that no strategy really seems to work all that well - Google just 'interprets' my request and assumes that I'm searching for a similar thing that has a lot more answers than what I was actually searching for, and shows me the results for that.
same experience. I'm generally getting better results on a client's (VPN) network, we are all googling for the same stuff, I guess.
It must be possible to create a fixed set of google searches and rate the location based on the results. So you could physically travel to a Starbucks 20miles away to get the best results for the 'best USB-C dongle reddit'.
Unfortunately search engines have learned to, well, basically ignore user input.
Amazon is the worst.
I used "" and + and - for terms to get what I want, and its search engine still gives you the sponsored results and an endless list of matches based on what you might buy instead of what you searched for.
I had the same query and instead of just scrolling down, I copy and pasted the paragraph into Bing chat and asked it what it meant. It got it right, but I probably should have scrolled farther first lol.
Or just using a traditional search engine and "rag" plus literally any ML/AI/LLM term will yield a half dozen results at the top with "Retrieval-augmented generation" in the page title.
What percentage of people could you fool if you told them it was AI and replayed standard search results but with the "karaoke-like" prompt that highlights each word (as if we're 2nd graders in Special Ed learning how to string more than 2 sentences together)
To sing the praises of Bedrock again, it does have continuous pre-training as well as RAG “knowledge bases”. The former is based on JSON fragments and the RAG stuff is PDFs and other document formats.
With regards to its efficacy, I haven’t gone to production with it yet but I was reasonably impressed.
I uploaded 100 legal case documents to Bedrock via Claude and could push it pretty hard asking about the various cases and for situations across the knowledge base.
It did feel like it broke down and got confused at a certain point of complexity of questioning, but I still think it’s already useful as a “copilot” or search engine and surely it will only improve over time.
I forgot about the continuous pre-training thing. How long and how much did they cost on Bedrock?
I had tried to suggest continuous pre-training to my client but it seemed expensive and when I mentioned that he lost interest and just kept wanting me to do fine tuning.
Also to clarify, did you do the continuous pre-training or RAG? And did you compare the efficacy of one or the other or both?
Oh Great! How did you evaluate the LLM responses? I'm cofounder of an evaluation and monitoring platform - Athina AI (www.athina.ai)
You can use our monitoring dashboard and evals to check your LLM performance and iterate quickly.
LlamaIndex can't do chunk-level metadata, only document-level metadata, so you can't put precise references to where materials the LLM synthesized answers from originated, e.g. HTML anchors. Just write your own RAG with Pinecone and OpenAI APIs directly.
Thanks! Even with better documentation, document importers don't extract node metadata so one needs to write their own "text and metadata extractor" as well. It's then easier to skip LlamaIndex altogether, or just get inspiration from some re-ranking etc. you guys did.
You basically don't use langchain for anything besides 30 minute demos that you copied from someone else's github. It has a completely spaghettified API, is not performant, and forces you into excessive mental contortions to reason about otherwise simple tasks.
Yea discovered this with Langchain last week. Was great for a demo then started to push it harder and spent ages trawling Reddit, discord, GitHub trying to find solutions to issues only to discover what was supposed to be supported was deprecated. Got a massive headache for what should have been a simple change. Moved on now.
We originally started out building features with LangChain (loading chains from YAML sounded good—it felt like it would be easy to get non-engineers to help with prompt development) but in practice it’s just way too complicated. Nice idea, but the execution feels lacking.
It also doesn’t help that LangChain is evolving so rapidly. When we first started using it a lot of code samples on the internet couldn’t be copy/pasted because of import paths changing, and at one point we had to bump by ~60 patch versions to get a bug fix, which was painful because it broke all kinds of stuff
Echoing others’ sentiments, I was frustrated with the bloat and obscurity of existing tools. This led me to start building Langroid with an agent-oriented paradigm 8 months ago https://github.com/langroid/langroid
We have companies using it in production for various use-cases. They especially like our RAG and multi-agent orchestration.
See my other comment for details.
Besides the other comments in this thread, I'd really recommend looking, at least first, at the (relatively new) "Managed index" in LlamaIndex: https://docs.llamaindex.ai/en/stable/community/integrations/... . These handle combining the retrieval with the generative side. I've seen a lot of users both get frustrated and get bad results by trying to write their own glue to string together various components of retrieval and generation, and these are much easier to get started with.
Ouch your client! I had one earlier this year like this. We were doing some audio processing for word matching. He had also been misled before coming to us, and he fully believed that this was going to be some form of super AI trained on his 5 audio records of him repeating the words over and over...
We did all we could to steer him toward a correct path of understanding. Sadly we launched a working product but he doesn't understand it and continues to misrepresent and mis-sell it.
After continuing to give him time and follow up with him (I tend to personally do this with Clients like this), I can tell he is starting to realize his lack of understanding...
The OpenAI assistants API is an implementation of a RAG pipeline. It performs both RAG on any documents you upload, and on any conversation you have with it that exceeds the context.
Not public but internally I wrote a tool to help us respond to RFPs. You pass in a question from a new RFP and it outputs surprisingly great answers most of the time. Is writing 75%+ of our RFP responses now (naturally we review and adjust sometimes and as needed). And best of all it was very quickly hacked together and it’s actually useful. Copied questions/answers from all previous ones into a doc, and am using OpenAI embeddings api + FAISS vector db + GPT-4 to load the chunks + store the embeddings + process the resulting chunks.
Another super easy option for RAG is AWS Bedrock Knowledge Base. It can ingest docs from S3. Just don’t use the OpenSearch serverless store, it’s $$$. You can use a low end RDS with the pgvector extension instead.
What’s the benefit of llamaindex over just storing documents in chroma and using chroma to query? I’ve done the latter and trying to understand if there’s a performance gain to the former?
For simpler usecases - inbuilt vector database RAG is sufficient
For more complex ones - LlamaIndex or Langchain options are suitable
For enterprise grade production use cases - Lyzr's SOTA RAG architecture comes in handy
Well said.
The problem is, there are way too many alternatives. Any idea how llamaindex's ingestion engine compares to unstructured.io? ( Which is used in langchain)
You don't just feed documents in, you need to build a dataset representative of how you want to interact with it. So likely using gpt-4 or something to create: a chunk of a document, a question that can be answered by that chunk and a good answer. (Or something)
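Something like the sketch below is what I mean by the dataset shape. The `placeholder_generate_qa` function is a stub standing in for the gpt-4 call (no API is called here), and the field names `chunk`, `question`, `answer` are just my assumption of a reasonable layout.

```python
import json

def build_dataset(chunks, generate_qa):
    """For each chunk, ask generate_qa for (question, answer) pairs and collect records."""
    records = []
    for chunk in chunks:
        for question, answer in generate_qa(chunk):
            records.append({"chunk": chunk, "question": question, "answer": answer})
    return records

def placeholder_generate_qa(chunk):
    """Stub for a gpt-4 (or similar) call that would write Q/A pairs for a chunk."""
    return [("What does this passage state?", chunk)]

chunks = ["I have a dog", "my dog's name is Fenrir"]
records = build_dataset(chunks, placeholder_generate_qa)

# One JSON object per line (JSONL), a common input format for fine tuning scripts.
jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```

Swap the stub for a real model call and you get several varied questions per chunk instead of one canned one.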
Have you tried that? I find the results hard to believe because he says he is using Llama 7b and asking it questions but that is not a chat model. Also he does not appear at all in the issues when people ask him about reproducing the results. Instead there are people in the issues recommending RAG or creating a QA dataset.
Has anyone tried using an LLM for the retrieval stage? Instead of using vector embeddings, have a (small, fast) LLM literally scan the entire corpus in chunks extracting relevant sections?
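To be clear about what I'm picturing, a sketch: walk every chunk and let a judge decide whether to keep it. Here `keyword_judge` is only a placeholder for the small, fast LLM (it just checks shared words), and all names are invented for illustration.

```python
def scan_corpus(chunks, question, judge):
    """Scan every chunk in order and keep the ones the judge marks as relevant."""
    return [chunk for chunk in chunks if judge(question, chunk)]

IGNORED = {"do", "i", "the", "a", "is", "to", "what", "does"}

def keyword_judge(question, chunk):
    """Placeholder for an LLM call answering: is this chunk relevant to the question?"""
    q_words = {w.strip("?.,!").lower() for w in question.split()}
    c_words = {w.strip("?.,!").lower() for w in chunk.split()}
    return len((q_words & c_words) - IGNORED) > 0

corpus = ["I have a dog", "my dog likes steak", "the invoice is due on Friday"]
relevant = scan_corpus(corpus, "what does my dog like to eat?", keyword_judge)
print(relevant)
```

The obvious cost is one model call per chunk per question, which is why embeddings are usually the first pass.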