>Most of the 7b instruct models are very bad outside very simple queries.
I can't agree with "very bad". Maybe your standards are set by the best, largest models, but have a little perspective: a modern 7b model is a friggin magical piece of software. Fully in the realm of sci-fi until basically last Tuesday. It can reliably summarize documents, bash a 30-minute rambling voice note into a terse proposal, and give you social counseling at least on par with r/Relationship_Advice. It might not always get facts exactly right, but it is smart in a way that computers have never been before. And for all this capability, you can get it running on a decade-old computer, maybe even a Raspberry Pi or a smartphone.
To answer the parent: download a "gguf" file (a blob of weights) for a popular model like Mistral from Hugging Face, git clone and compile llama.cpp, then run ./main -m path/to/gguf -p "prompt".
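Roughly, the whole flow looks like this (a sketch, not gospel: the gguf filename is just an example quant of Mistral 7B Instruct, and newer llama.cpp builds name the binary llama-cli instead of main):

    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp && make
    # grab a quantized weights file from Hugging Face (any Q4/Q5 gguf of a 7b model will do)
    wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf
    # run a prompt against it, generating up to 256 tokens
    ./main -m mistral-7b-instruct-v0.2.Q4_K_M.gguf -p "Summarize: <paste text here>" -n 256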
Even better: install ollama and run "ollama run llama3". It works like docker: it pulls the model locally and starts a chat session right there in the terminal, no compiling needed. Or just run the docker image "ollama/ollama".
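For reference, both routes look something like this (commands from memory of the ollama docs, so double-check them there):

    # native install, then pull + chat
    curl -fsSL https://ollama.com/install.sh | sh
    ollama run llama3

    # or fully containerized
    docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
    docker exec -it ollama ollama run llama3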
I'm looking to run something on a 24GB GPU so I can go wild with agentic use of LLMs. Is there anything worth trying that would fit in that amount of VRAM, or are all the open-source PC-sized LLMs still laughable?
You can run llama 70b based models at faster than 10 tok/s on 24GB of VRAM. I've found that the quality of this class of LLMs is heavily swayed by your configuration and system prompt, and results may vary. This Reddit post seems to have some input on the topic:
You would probably get the same tokens per second with llama 3 70b if you just unplugged the 24GB GPU. For something that actually fits in 24GB of VRAM, I recommend gemma 2 27b up to q6. I use q4 and it works quite well for my needs.
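If you're on llama.cpp rather than ollama, the knob that matters is -ngl (how many layers to offload to the GPU). Something like this should keep a q4 27b entirely on a 24GB card (filename is illustrative, and -ngl 99 just means "all layers"):

    # llama.cpp: offload everything to the GPU, q4 27b fits in 24GB
    ./main -m gemma-2-27b-it-Q4_K_M.gguf -ngl 99 -c 8192 -p "your prompt"
    # or with ollama, which pulls a ~4-bit quant by default, I believe
    ollama run gemma2:27b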
You can run a 7b on most modern hardware. How fast it runs will vary.
To run 30-70b models, you're getting into the realm of needing 24GB or more of VRAM.
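The back-of-the-envelope math (weights only, ignoring KV cache and runtime overhead, so treat these as rough floors):

    7b  at 4-bit:  ~7e9  x 0.5 bytes ≈  3.5 GB -> runs on most modern hardware
    27b at 4-bit: ~27e9 x 0.5 bytes ≈ 13.5 GB -> fits on a 24GB card with headroom
    70b at 4-bit: ~70e9 x 0.5 bytes ≈ 35 GB   -> doesn't fit in 24GB, so layers spill to CPU RAM and speed tanks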