This looks very awesome. Can someone tell me why there is no chatter about this? Is there something else out there that blows this out of the water in terms of ease of use and access to sample many LLMs?
Ollama provides a web server with API that just works out of the box, which is great when you want to integrate multiple applications (potentially distributed on smaller edge devices) with LLMs that run on a single beefy machine.
In my home I have a large gaming rig that sometimes runs Ollama+Open WebUI, then I also have a bunch of other services running on a smaller server and a Raspberry Pi which reach out to Ollama for their LLM inference needs.
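For anyone wondering what "reaching out to Ollama" looks like from one of the smaller boxes, here is a minimal sketch using only the Python standard library. The hostname `gaming-rig.local` and the model name `llama2` are placeholders I made up. Port 11434 and the `/api/generate` endpoint are Ollama's defaults as far as I know. The server also has to listen on a non-loopback address to be reachable from other machines (I believe via the `OLLAMA_HOST` environment variable).

```python
import json
import urllib.request

# Assumptions: "gaming-rig.local" is a hypothetical hostname for the machine
# running Ollama; 11434 is Ollama's default port.
OLLAMA_URL = "http://gaming-rig.local:11434/api/generate"


def build_generate_request(prompt, model="llama2", url=OLLAMA_URL):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="llama2", url=OLLAMA_URL, timeout=120):
    """Send the prompt to the remote Ollama server and return the generated text."""
    request = build_generate_request(prompt, model, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read())
    # With "stream": False the server replies with a single JSON object
    # whose "response" field holds the full completion.
    return body["response"]
```

On the Raspberry Pi side you would then just call `generate("Summarize today's sensor log")` and let the big machine do the inference.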
Are you talking about the Hugging Face Python libraries, the Hugging Face hosted inference APIs, the Hugging Face web interfaces, the Hugging Face iPhone app, Hugging Face Spaces (hosted Docker environments with GPU access) or something else?
But it's likely to be much slower than what you'd get with a backend like llama.cpp on CPU (particularly if you're running on a Mac, but I think on Linux as well), and it doesn't support features like CPU offloading.
I think the biggest selling point of ollama (llama.cpp) is quantization: for a slight hit in quality (with q8 or q4) you can get a significant performance boost.
Does ollama/llama.cpp provide low bit operations (avx or cuda kernels) to speed up inference? Or just model compression with inference still done in fp16?
My understanding is that modern quantization algorithms are typically implemented in PyTorch.
The only thing I know (from using it) is that with quantization I can fit models like llama2 13b in my 24GB of VRAM when I use q8 (16GB) instead of fp16 (26GB). This means I can get nearly the full quality of llama2 13b's output while still being able to use only my GPU, without the need to do very slow inference on only CPU+RAM.
And the models are quantized before inference, so I'd only download 16GB for the llama2 13b q8 instead of the full 26GB, which means it's not done on the fly.
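The back-of-the-envelope arithmetic behind those numbers, as a rough sketch: it counts weights only (parameters times bits per weight), so it reproduces the 26GB fp16 figure but comes out below the 16GB quoted for q8. Real quantized files also carry per-block scale factors, and the runtime needs extra memory for things like the KV cache, none of which this models. The function names are mine, not anything from ollama or llama.cpp.

```python
def weight_size_gb(n_params, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def fits_in_vram(size_gb, vram_gb=24):
    """True if the weights alone fit in the given amount of VRAM."""
    return size_gb <= vram_gb


N_PARAMS = 13e9  # llama2 13b

for name, bits in [("fp16", 16), ("q8", 8), ("q4", 4)]:
    size = weight_size_gb(N_PARAMS, bits)
    print(f"{name}: ~{size:.1f} GB, fits in 24 GB: {fits_in_vram(size)}")
```

So fp16 is already over a 24GB card before any overhead, while q8 leaves room to spare.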
As an aside, even gpt4 level quality does not feel satisfactory to me lately. I can’t imagine willingly using models as dumb as llama2-13b. What do you do with it?
Yeah I agree, every time a new model releases I download the highest quantization (or fp16) that fits into my VRAM, test it out with a few prompts, and then realize that downloadable models are still not as good as the closed ones (except speed-wise).
I don't know why I still do it, but every time I read so many comments about how good model X is and how it outperforms everything else, I want to see it for myself.
Ollama is really organized: it relies on llama.cpp, but the UX and organization it provides make it legit. We recently made a one-click wizard to run Open WebUI and Ollama together, self-hosted and remotely accessible [1]
LM Studio is a lot more user friendly, probably the easiest UI to use out there. No terminal nonsense, no manual to read. Just double click and chat. It even explains to you what the model names mean (e.g. the difference between Q4_1, Q4_K, Q4_K_M... for whatever reason all the other tools assume you know what they mean).
Why do you think there is no chatter about this? There have been hundreds of posts about ollama on HN. This is a point release of an already well known project.
I use a mix of llama.cpp directly via my own Python bindings and via llama-cpp-python for function calling and full control over parameters and loading, but otherwise ollama is just great for ease of use. There's really no reason not to use it if you just want to load GGUF models and don't have any intricate requirements.
> ... if a statement can be proved, it also has a zero-knowledge proof.
Mind blown.
>Feeding the pseudorandom bits (instead of the random ones) into a probabilistic algorithm will result in an efficient deterministic one for the same problem.
This is nuts. AI is a probabilistic computation ... so what they're saying - if i'm reading this right - is that we can reduce the complexity of our current models by orders of magnitude.
If I'm living in noobspace someone please pull me out.
I don't know exactly what it's saying but it definitely isn't that. AI already uses pseudorandom numbers and is deterministic. (Except some weird AI accelerator chips that use analogue computation to improve efficiency.)
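To make the "pseudorandom and deterministic" point concrete, here is a toy sketch: sampling "tokens" from a fixed distribution with a seeded generator gives the same sequence every time. The vocabulary and weights are invented for illustration and have nothing to do with any real model.

```python
import random


def sample_tokens(seed, n=5):
    """Sample n toy 'tokens' using a pseudorandom generator with a fixed seed."""
    rng = random.Random(seed)  # private generator, fully determined by the seed
    vocab = ["the", "cat", "sat", "on", "mat"]    # made-up vocabulary
    weights = [0.4, 0.2, 0.2, 0.1, 0.1]           # made-up probabilities
    return rng.choices(vocab, weights=weights, k=n)


first = sample_tokens(seed=42)
second = sample_tokens(seed=42)
print(first == second)  # same seed, same "random" output
```

The sampling looks random, but given the seed the whole computation is an ordinary deterministic program.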
> AI is a probabilistic computation ... so what they're saying - if i'm reading this right - is that we can reduce the complexity of our current models by orders of magnitude.
Unfortunately, no. First, the result applies to decision, not search problems. Second, the resulting deterministic algorithm is much less efficient than the randomized algorithm, albeit it still belongs to the same complexity class (under some mild assumptions).
Nature says eat or be eaten (or die). Now that we are on top of the food chain, it's useful, for mental health and societal reasons, to not want to rip everyone's throat out, even though every other species (including us) continues to do so. You tend to steer and veer towards what you look at. The good news is we get to choose what we look at... the bad news is we (statistically) choose wrong. Race car drivers don't look at the wall when they drive, because when they do, they tend to hit it. Stop looking at the wall, guys ...
Rather than having selections for multiple languages (for each task) it seems like language detection or a selection/setup screen would be best, with a fallback to English or whatever your default is. Maybe use online translation services?
edit: Oh it seems you do have a language drop down, but there are still multiple languages appearing in quests... this just means more quests I guess eh :)
This is exactly like what I was looking for but it seems very incomplete. Does anyone else know of a resource like this that isn't the first hit on google?
It made me cringe reading the first line "The Linux kernel is an abundant component of modern IT systems". Come on guys, it's the kernel, there is one of them per system.
I liked the link to Load Averages on his site [0] - including the heroic tale of finding the source of the original patch that changed it to include more than just processes in CPU wait state.
Every time I come across Brendan Gregg's work I'm blown away. And I come across his work a lot! (It also brings a semi-patriotic tear to my eye to hear his aussie twang)