Here's a summary of what's happened the past couple of years and what tools are out there.
After ChatGPT was released, there was a lot of hype in the space, but open source was far behind. IIRC the best open foundation LLM at the time was GPT-2, which was two generations behind.
A while later Meta released LLaMA[1], a well-trained base foundation model, which brought an explosion to open source. It was soon implemented in the Hugging Face Transformers library[2] and the weights were spread across the Hugging Face website for anyone to use.
At first, it was difficult to run locally. Few developers had the hardware or money to run it. It required too much RAM, and IIRC Meta's original implementation didn't support running on the CPU, but developers soon came up with methods to make it smaller via quantization. The biggest project for this was Llama.cpp[3], which is probably still the biggest open source project today for running LLMs locally. Hugging Face Transformers also added quantization support through bitsandbytes[4].
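To give a sense of what that looks like today, here's a minimal sketch of loading a model in 4-bit through Transformers and bitsandbytes. The model ID and generation settings are just placeholders, not a specific recommendation:

```python
# Minimal sketch: load an LLM in 4-bit with Transformers + bitsandbytes.
# The model ID is illustrative; any causal LM on the Hugging Face Hub works,
# provided you have access to the weights and enough VRAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumption: a gated repo you have access to

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # do the actual math in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                     # spread layers across available devices
)

inputs = tokenizer("Explain quantization in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```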
Over the next few months there was rapid development in open source. Quantization techniques improved which meant LLaMA was able to run with less and less RAM with greater and greater accuracy on more and more systems. Tools came out that were capable of finetuning LLaMA, and hundreds of finetunes followed, trained on instruction-following, RLHF, and chat datasets, which drastically increased accuracy even further. During this time, Stanford's Alpaca, LMSYS's Vicuna, Microsoft's Wizard, 01.AI's Yi, Mistral, and a few others made their way onto the open LLM scene with some very good LLaMA finetunes.
A new inference engine (software for running LLMs, like Llama.cpp, Transformers, etc.) called vLLM[5] came out which could run LLMs more efficiently than was previously possible in open source, thanks to techniques like PagedAttention and continuous batching. Soon it would even get good AMD (ROCm) support, making it possible for those with AMD GPUs to run open LLMs locally and with relative efficiency.
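As a rough illustration, offline batch generation with vLLM's Python API looks something like this; the model ID is again just a placeholder:

```python
# Minimal sketch: offline batch generation with vLLM.
# The model ID is illustrative; vLLM pulls the weights from the Hugging Face Hub.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")  # assumption: a model you can access
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Explain what an inference engine does.",
    "Why is batching important for LLM serving?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```

vLLM also ships an OpenAI-compatible HTTP server if you'd rather query it from other tools instead of embedding it in a script.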
Then Meta released Llama 2[6]. Llama 2 was by far the best open LLM of its time: it shipped with RLHF instruction finetunes for chat and with human evaluation data that put its open LLM leadership beyond doubt. Existing tools like Llama.cpp and Hugging Face Transformers quickly added support, and users had access to the best LLM open source had to offer.
At this point in time, despite all the advancements, it was still difficult to run LLMs. Llama.cpp and Transformers were great engines for running LLMs, but the setup process was difficult and required a lot of time. You had to find the best LLM, quantize it in the best way for your computer (or figure out how to identify and download a pre-quantized one from Hugging Face), set up whatever engine you wanted, figure out how to use your quantized LLM with that engine, fix any mistakes you made along the way, and finally figure out how to prompt your specific LLM in a chat-like format.
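For concreteness, this is roughly what the "use your quantized LLM with the engine" step looks like today using the llama-cpp-python bindings around Llama.cpp, assuming you've already downloaded a quantized GGUF file. The file path and Llama-2-style prompt template below are placeholders; the right template depends on the specific model:

```python
# Minimal sketch: run a pre-quantized GGUF model with llama-cpp-python.
# The model path is a placeholder; the prompt template must match your model.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-7b-chat.Q4_K_M.gguf",  # assumption: a quantized file you downloaded
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)

prompt = "[INST] Write a haiku about quantization. [/INST]"
result = llm(prompt, max_tokens=128, temperature=0.7)
print(result["choices"][0]["text"])
```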
However, tools started coming out to make this process significantly easier. The first one of these that I remember was GPT4All[7]. GPT4All was a wrapper around Llama.cpp that made it easy to install, easy to select the LLM you want (pre-quantized options available through a built-in download manager), and easy to use through a chat UI. This significantly reduced the barrier to entry for those who were interested in using LLMs.
The second project that I remember was Ollama[8]. Also a wrapper around Llama.cpp, Ollama gave most of what GPT4All had to offer but in an even simpler way. Today, I believe Ollama is bigger than GPT4All although I think it's missing some of the higher-level features of GPT4All.
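For reference, Ollama exposes a local HTTP API (by default on port 11434), so once you've pulled a model you can call it from any language. A minimal sketch, assuming the daemon is running and a model named "llama3" has already been pulled:

```python
# Minimal sketch: call a locally running Ollama server over its HTTP API.
# Assumes the Ollama daemon is running and the model has been pulled
# beforehand (e.g. `ollama pull llama3`).
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "In one sentence, what does an inference engine do?",
        "stream": False,  # return a single JSON object instead of a token stream
    },
)
print(response.json()["response"])
```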
Another important tool that came out during this time is Exllama[9] (now ExLlamaV2). Exllama is an inference engine focused on modern consumer Nvidia GPUs, with advanced quantization support based on GPTQ. It is probably the best inference engine for squeezing performance out of consumer Nvidia GPUs.
Months later, Nvidia came out with another new inference engine called TensorRT-LLM[10]. TensorRT-LLM is capable of running most LLMs and does so with extreme efficiency; it is the most efficient open source inference engine that exists for Nvidia GPUs. However, it also has the most difficult setup process of any inference engine and is made primarily for production use cases and Nvidia data center GPUs, so don't expect it to work on your personal computer.
With the rumors of GPT-4 being a Mixture of Experts (MoE) LLM, research breakthroughs in MoE, and some small MoE LLMs coming out, interest in MoE LLMs was at an all-time high. The company Mistral, which had proven itself in the past with very impressive LLaMA finetunes, capitalized on this interest by releasing Mixtral 8x7b[11], the best accuracy-for-its-size LLM that the local LLM community had seen to date. Eventually MoE support was added to all the major inference engines, and Mixtral became a very popular mid-to-large sized LLM.
Cohere released their own LLM as well, called Command R+[12], built specifically for RAG-related tasks with a context length of 128k. It's quite large and doesn't stand out on most general benchmarks, but it has some interesting RAG features no other open LLM has.
More recently, Llama 3[13] was released, which, like previous Llama releases, blew every other open LLM out of the water. The smallest version (Llama 3 8b) has the best accuracy for its size of any open LLM, and the largest version released so far (Llama 3 70b) beats every other open LLM on almost every metric.
Less than a month ago, Google released Gemma 2[14], the largest version of which performs very well under human evaluation despite being less than half the size of Llama 3 70b, but only decently on automated benchmarks.
If you're looking for a tool to get started running LLMs locally, I'd go with either Ollama or GPT4All. They make the process about as painless as possible. I believe GPT4All has more built-in features, like using your local documents for RAG, but you can also pair Ollama with something like Open WebUI[15] to get the same functionality.
If you want to get into the weeds a bit and extract some more performance out of your machine, I'd go with Llama.cpp, Exllama, or vLLM, depending upon your system. If you have a normal consumer Nvidia GPU, I'd go with Exllama. If you have an AMD GPU that supports ROCm 5.7 or 6.0, I'd go with vLLM. For anything else, including running on your CPU or an M-series Mac, I'd go with Llama.cpp. TensorRT-LLM only makes sense if you have a data center Nvidia GPU like the A100, V100, A10, or H100.
I think Stable Diffusion (August 2022) was the first locally runnable SOTA model to be released, not in language but in image generation, but it set the tone for Meta. LLaMA only came in February 2023.
> The company Mistral had proven itself in the past with very impressive LLaMA finetunes
Mistral is not a finetune of LLaMA; it is a model trained from scratch. Also, Mistral was better than LLaMA most of the time during this period.
> Quantization techniques improved which meant LLaMA was able to run with less and less RAM with greater and greater accuracy
Quantization does not improve accuracy, except maybe if you trade off precision for longer context, but not on similar prompts. It is like JPEG compression: the original is always better for a specific image, but for the same byte size you get more resolution from JPEG than from, say, a PNG.
> I think Stable Diffusion (August 2022) was the first locally runnable SOTA model to be released, not in language but in image generation, but it set the tone for Meta. LLaMA only came in February 2023.
Sure, I was only covering LLMs though. If I wanted to cover image generation models and tools as well, the comment would be double its size.
> Mistral is not a finetune of LLaMA; it is a model trained from scratch. Also, Mistral was better than LLaMA most of the time during this period.
Oh, that's right. IIRC it just used the Llama 2 architecture with sliding window attention added.
> Quantization does not improve accuracy, except maybe if you trade off precision for longer context, but not on similar prompts. It is like JPEG compression: the original is always better for a specific image, but for the same byte size you get more resolution from JPEG than from, say, a PNG.
I'm well aware of how quantization works. I meant that quantization methods became increasingly able to retain accuracy, such as methods that quantize less important weights more heavily, improving accuracy at the same model size.
All the main IDE-integrated ones seem very much on par (Copilot, Sourcegraph Cody, Continue.dev), with cursor.sh liked by some as it has a code-assistant-first UI.
I've personally gone back to the browser with Claude 3.5 Sonnet (and the Projects + Artifacts features), as it is one of the most industrious ones, and I really like the UX of Artifacts, plus it integrates new code well into existing code you paste into it.
In the end I think it also often comes down to which languages/frameworks you are using and how well the LLM/product handles them, so I'd still recommend testing around. E.g. some of the main frameworks I work with daily went through big refactors/interface changes 1-2 years ago, and I stopped using ChatGPT because it had a strong tendency to produce code based on the old interfaces/paradigms.
Aider[0] is also quite interesting, especially when it comes to more significant refactorings across the codebase, and it has gotten quite good at that with the last few bigger model releases, but it takes some time to get used to and doesn't have good IDE integration.
I've been following the state of things, but I'm not sure which ones are the best. There's Meta's CodeLlama[1], Mistral's Codestral[2], DeepSeek AI's DeepSeek-Coder-V2-Instruct[3], CodeGemma[4], Alibaba's CodeQwen[5], and Microsoft's WizardCoder[6].
I'm pretty sure CodeLlama is out of date now. I've heard DeepSeek's LLMs are good, and DeepSeek-Coder-V2-Instruct was released recently. Given its good reputation and massive size (236b), I'd guess it is the best coding LLM, but if it isn't trained efficiently for that size, maybe Codestral and Codestral Mamba come close.
I don't think the best open coding LLMs are close to GitHub Copilot, but I could be wrong since I'm just relaying information I've heard secondhand.
[1] https://ai.meta.com/blog/large-language-model-llama-meta-ai/
[2] https://github.com/huggingface/transformers
[3] https://github.com/ggerganov/llama.cpp
[4] https://github.com/bitsandbytes-foundation/bitsandbytes
[5] https://github.com/vllm-project/vllm
[6] https://ai.meta.com/blog/llama-2/
[7] https://www.nomic.ai/gpt4all
[8] http://ollama.ai/
[9] https://github.com/turboderp/exllamav2
[10] https://github.com/NVIDIA/TensorRT-LLM
[11] https://mistral.ai/news/mixtral-of-experts/
[12] https://cohere.com/blog/command-r-plus-microsoft-azure
[13] https://ai.meta.com/blog/meta-llama-3/
[14] https://blog.google/technology/developers/google-gemma-2/
[15] https://github.com/open-webui/open-webui