
thedatamonger · on April 28, 2024

This looks very awesome. Can someone tell me why there is no chatter about this? Is there something else out there that blows this out of the water in terms of ease of use and access to sample many LLMs?

brrrrrm · on April 28, 2024

HN isn't really the best space for LLM news - r/LocalLlama and twitter are much better. I think HN has some cultural issues with “AI” news.

wkat4242 · on April 28, 2024

Hmm I don't think so. Most comments are pretty positive.

I think the articles are just not really upvoted unless it's really big news, which makes sense because HN is for more than just AI.

But I don't think it's anti-AI in the way most people here are pretty anti-cryptocurrency (and for good reason IMO).

p1esk · on April 29, 2024

I didn’t upvote it because I don’t use Ollama. To experiment with LLMs I use Huggingface. Does Ollama provide something I cannot get with Huggingface?

lolinder · on April 29, 2024

Ollama provides a web server with API that just works out of the box, which is great when you want to integrate multiple applications (potentially distributed on smaller edge devices) with LLMs that run on a single beefy machine.

In my home I have a large gaming rig that sometimes runs Ollama+Open WebUI, then I also have a bunch of other services running on a smaller server and a Raspberry Pi which reach out to Ollama for their LLM inference needs.
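
For anyone wondering what "reach out to Ollama" looks like from one of those smaller machines, here is a minimal sketch using only the Python standard library. It assumes Ollama is listening on its default port 11434 and uses the `/api/generate` endpoint in non-streaming mode. The hostname `gaming-rig.local` and the model name `llama2` are hypothetical placeholders, not details from the setup described above.

```python
import json
import urllib.request

# Assumption: Ollama is reachable on its default port (11434).
# The hostname and the model name below are hypothetical placeholders.
OLLAMA_URL = "http://gaming-rig.local:11434/api/generate"


def build_generate_request(prompt, model="llama2", url=OLLAMA_URL):
    """Build a non-streaming POST request for Ollama's /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="llama2", url=OLLAMA_URL, timeout=120):
    """Send the prompt to the Ollama server and return the generated text."""
    request = build_generate_request(prompt, model, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

A client on the Raspberry Pi would then call something like `generate("Summarize today's sensor log")`, provided the server is configured to accept connections from other hosts.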

p1esk · on April 29, 2024

Sure, maybe it’s better for niche use cases like yours.

HF is the biggest provider of LLMs, and I guess I haven’t run into its limitations yet.

jkh1 · on April 29, 2024

Running locally is sometimes necessary, e.g. you don't want to send sensitive data to any random third party server.

Zambyte · on April 29, 2024

Both Ollama and Huggingface distribute models. The latter site has model hosting services too, but that isn't the only way to use models from there.

gertop · on April 29, 2024

Hugging face is a model repository.

Ollama allows you to run those models.

Different things.

p1esk · on April 29, 2024

I run models using HF just fine. I mean I’m using HF transformers repo, which gets models from HF hub.

Or do you mean commercial deployment of models for inference?

simonw · on April 29, 2024

Are you talking about the Hugging Face Python libraries, the Hugging Face hosted inference APIs, the Hugging Face web interfaces, the Hugging Face iPhone app, Hugging Face Spaces (hosted Docker environments with GPU access) or something else?

p1esk · on April 29, 2024

I updated my comment above: I’m using HF transformers repo, which gets models from HF hub.

simonw · on April 29, 2024

Do you have an NVIDIA GPU? I have not had much luck with the transformers library on a Mac.

p1esk · on April 29, 2024

Of course. I thought Nvidia GPUs are pretty much a must have to play with DL models.

objektif · on April 29, 2024

Well being able to run these models on CPU was pretty much the revolutionary part of llama.cpp.

p1esk · on April 29, 2024

I can run them on CPU - HF uses plain Pytorch code - fully supported on CPU.

tmostak · on April 29, 2024

But it's likely to be much slower than what you'd get with a backend like llama.cpp on CPU (particularly if you're running on a Mac, but I think on Linux as well), as well as not supporting features like CPU offloading.

p1esk · on April 29, 2024

Are there benchmarks? 2x speed up would not be enough for me to return to c++ hell, but 5x might be, in some circumstances.

SushiHippie · on April 29, 2024

I think the biggest selling point of ollama (llama.cpp) is quantization: for a slight hit in quality (with q8 or q4) you can get a significant performance boost.

p1esk · on April 29, 2024

Does ollama/llama.cpp provide low bit operations (avx or cuda kernels) to speed up inference? Or just model compression with inference still done in fp16?

My understanding is the modern quantization algorithms are typically implemented in Pytorch.

SushiHippie · on April 29, 2024

Sorry I don't know much about this topic.

The only thing I know (from using it) is that with quantization I can fit models like llama2 13b in my 24GB of VRAM when I use q8 (16GB) instead of fp16 (26GB). This means I can get nearly the full quality of llama2 13b's output while still being able to use only my GPU, without the need to do very slow inference on only CPU+RAM.

And the models are quantized before inference, so I'd only download 16GB for the llama2 13b q8 instead of the full 26GB, which means it's not done on the fly.
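
The size difference can be sanity-checked with some rough arithmetic. The sketch below estimates weight storage only, assuming a round 13 billion parameters and, as far as I understand the llama.cpp formats, `q8_0` and `q4_0` blocks of 32 weights that each carry one fp16 scale. It ignores the context cache and runtime overhead, so it will not match the download sizes or VRAM numbers quoted above exactly.

```python
# Rough weight-only size estimate; real files and VRAM use will differ.
PARAMS = 13e9  # assumption: llama2 13b rounded to 13 billion weights

# Bits per weight. Assumption: q8_0 and q4_0 store blocks of 32 weights
# plus one fp16 (16-bit) scale per block, giving 8.5 and 4.5 bits per weight.
BITS_PER_WEIGHT = {
    "fp16": 16.0,
    "q8_0": (32 * 8 + 16) / 32,
    "q4_0": (32 * 4 + 16) / 32,
}


def weight_size_gb(params, bits_per_weight):
    """Size of the weights alone, in gigabytes (10**9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_size_gb(PARAMS, bits):.1f} GB")
```

The fp16 line comes out at 26 GB, in line with the figure above, and the 8-bit estimate is roughly half of that before any overhead is added.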

p1esk · on April 30, 2024

As an aside, even gpt4 level quality does not feel satisfactory to me lately. I can’t imagine willingly using models as dumb as llama2-13b. What do you do with it?

SushiHippie · on April 30, 2024

Yeah I agree, every time a new model releases I download the highest quantization (or fp16) that fits into my VRAM, test it out with a few prompts, and then realize that downloadable models are still not as good as the closed ones (except speed wise).

I don't know why I still do it, but every time I read so many comments about how good model X is and how it outperforms anything else, I want to see it for myself.

simonw · on April 29, 2024

There's a Python binding for llama.cpp which is actively maintained and has worked well for me: https://github.com/abetlen/llama-cpp-python

wkat4242 · on April 29, 2024

Ollama supports many radeons now. And I guess llama.cpp does too, after all it's what ollama uses as backend.

p1esk · on April 29, 2024

PyTorch (the underlying framework of HF) supports AMD as well, though I haven’t tried it.

chadsix · on April 28, 2024

Ollama is really organized - it relies on llama but the UX and organization it provides makes it legit. We recently made a one-click wizard to run Open WebUI and Ollama together, self hosted and remotely accessible but locally hosted [1]

[1] https://github.com/ipv6rslimited/cloudseeder

gertop · on April 29, 2024

LM Studio is a lot more user friendly, probably the easiest UI to use out there. No terminal nonsense, no manual to read. Just double click and chat. It even explains to you what the model names mean (e.g. the difference between Q4_1, Q4_K, Q4_K_M... for whatever reason all the other tools assume you know what they mean).

Built-in model recommendations are also handy.

Very friendly tool!

However it's not open-source.

Cheer2171 · on April 29, 2024

Why do you think there is no chatter about this? There have been hundreds of posts about ollama on HN. This is a point release of an already well known project.

FieryTransition · on April 28, 2024

I use a mix of llamacpp directly via my own python bindings and llamacpp-python for function calling and full control over parameters and loading, but otherwise ollama is just great for ease of use. There's really no reason not to use it if you just want to load gguf models and don't have any intricate requirements.

CharlesW · on April 29, 2024

I can recommend LM Studio and Msty if you're looking for something with an integrated UX.

perrygeo · on April 30, 2024

Opposite reaction here. I was just thinking, man I hear about Ollama every single day on HN. Not sure a point release is news :-)

throw03172019 · on April 28, 2024

Ollama has been brought up many times on HN. It’s a great tool!

thedatamonger · on April 10, 2024

From the related article: https://www.quantamagazine.org/avi-wigderson-complexity-theo...

> ... if a statement can be proved, it also has a zero-knowledge proof.

Mind blown.

>Feeding the pseudorandom bits (instead of the random ones) into a probabilistic algorithm will result in an efficient deterministic one for the same problem.

This is nuts. AI is a probabilistic computation ... so what they're saying - if i'm reading this right - is that we can reduce the complexity of our current models by orders of magnitude.

If I'm living in noobspace someone please pull me out.

IshKebab · on April 10, 2024

I don't know exactly what it's saying but it definitely isn't that. AI already uses pseudorandom numbers and is deterministic. (Except some weird AI accelerator chips that use analogue computation to improve efficiency.)
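
A toy illustration of the "pseudorandom and deterministic" point, using Python's standard `random` module: once the generator is seeded, "random" sampling from a distribution gives the same result on every run. The token list and probabilities are made up for the example, and a real model would compute them.

```python
import random

# Hypothetical next-token distribution; a real model would produce this.
TOKENS = ["cat", "dog", "fish"]
PROBS = [0.5, 0.3, 0.2]


def sample_tokens(seed, count=10):
    """Draw `count` tokens using a PRNG seeded with `seed`."""
    rng = random.Random(seed)
    return rng.choices(TOKENS, weights=PROBS, k=count)


# Same seed, same "random" choices: the whole procedure is deterministic.
assert sample_tokens(seed=42) == sample_tokens(seed=42)
```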

ilya_m · on April 10, 2024

> AI is a probabilistic computation ... so what they're saying - if i'm reading this right - is that we can reduce the complexity of our current models by orders of magnitude.

Unfortunately, no. First, the result applies to decision, not search problems. Second, the resulting deterministic algorithm is much less efficient than the randomized algorithm, albeit it still belongs to the same complexity class (under some mild assumptions).

mxkopy · on April 10, 2024

Can’t you build search from decision by deciding on every possible input?
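
For some problems you can, and without trying every input. A rough sketch for SAT, assuming clauses are lists of signed integers (DIMACS style): the search routine only ever asks a yes/no oracle "is this still satisfiable?" and fixes one variable per step. The brute-force `is_satisfiable` here is just a stand-in for whatever decision procedure is available, and this is my own illustration rather than anything from the article.

```python
from itertools import product


def satisfies(clauses, assignment):
    """True if every clause has at least one literal made true."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )


def is_satisfiable(clauses, num_vars, fixed):
    """Decision oracle (brute-force stand-in): answers yes/no only."""
    free = [v for v in range(1, num_vars + 1) if v not in fixed]
    for values in product([False, True], repeat=len(free)):
        assignment = dict(fixed)
        assignment.update(zip(free, values))
        if satisfies(clauses, assignment):
            return True
    return False


def find_assignment(clauses, num_vars):
    """Search built from the oracle: fix one variable per step."""
    fixed = {}
    if not is_satisfiable(clauses, num_vars, fixed):
        return None
    for var in range(1, num_vars + 1):
        fixed[var] = True
        if not is_satisfiable(clauses, num_vars, fixed):
            fixed[var] = False  # True fails, so False must work
    return fixed
```

The search makes `num_vars + 1` oracle calls. Whether the derandomization result carries over to the search version is a separate question, and this sketch does not settle it.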

thedatamonger · on April 8, 2024

thank you for that. I startled the dog with that laugh :)

thedatamonger · on Jan 7, 2024

Nature says eat or be eaten (or die). Now that we are on top of the food chain, it's useful, for mental health and societal reasons, to not want to rip everyone's throat out, even though every other species (including us) continues to do so. You tend to steer and veer towards what you look at. The good news is we get to choose what we look at... the bad news is we (statistically) choose wrong. Race car drivers don't look at the wall when they drive, because when they do, they tend to hit it. Stop looking at the wall guys ...

thedatamonger · on March 18, 2022

Rather than having selections for multiple languages (for each task) it seems like language detection or a selection/setup screen would be best. With a fallback to English, or whatever your default is. Maybe use online translation services?

edit: Oh it seems you do have a language drop down, but there are still multiple languages appearing in quests... this just means more quests I guess eh :)

qw3rty01 · on March 18, 2022

He means more programming languages

blindpirate · on March 19, 2022

Exactly.

thedatamonger · on March 9, 2022

Bravo! I enjoyed this.

thedatamonger · on March 5, 2022

Bravo! Well written and informative! And, as someone who's obsessed with Dart at the moment, timely! Thanks!

thedatamonger · on Feb 18, 2022

This is exactly like what I was looking for but it seems very incomplete. Does anyone else know of a resource like this that isn't the first hit on google?

thedatamonger · on Feb 2, 2022

Herbert was a prophet. https://dune.fandom.com/wiki/Plasteel

thedatamonger · on May 3, 2018

I prefer the deeper dive approach presented by Brendan Gregg: http://www.brendangregg.com/ebpf.html

It made me cringe reading the first line "The Linux kernel is an abundant component of modern IT systems". Come on guys, it's the kernel, there is one of them per system.

isostatic · on May 3, 2018

I liked the link to Load Averages on his site [0] - including the heroic tale of finding the source of the original patch that changed it to include more than just processes in CPU wait state.

[0] http://www.brendangregg.com/blog/2017-08-08/linux-load-avera...

anitil · on May 3, 2018

Every time I come across Brendan Gregg's work I'm blown away. And I come across his work a lot! (It also brings a semi-patriotic tear to my eye to hear his aussie twang)