Hacker News | mingodad's comments

I'm still a bit confused because it says "All uploads use Unsloth Dynamic 2.0", but then when looking at the available options, like for 4-bit, there are:

IQ4_XS 5.17 GB, Q4_K_S 5.39 GB, IQ4_NL 5.37 GB, Q4_0 5.38 GB, Q4_1 5.84 GB, Q4_K_M 5.68 GB, UD-Q4_K_XL 5.97 GB

And there is no explanation of what they are or what tradeoffs they have, but the tutorial explicitly used Q4_K_XL with llama.cpp.

I'm using a Mac mini M4 16GB, and so far my preferred model is Qwen3-4B-Instruct-2507-Q4_K_M, although it's a bit chatty; my test with Qwen3.5-4B-UD-Q4_K_XL shows it's a lot more chatty. I'm basically using it in chat mode for basic man-page-style questions.

I understand that each user has their own specific needs, but it would be nice to have a place that lists typical models/hardware with their common config parameters and memory usage.

Even on specific Reddit channels it's a bit of a nightmare: lots of talk but no concrete, clear config/usage examples.

I've been following this topic heavily for the last 3 months, and I see more confusion than clarification.

Right now I'm getting good cost/benefit results with the qwen CLI with the coder model in the cloud, and watching constantly to see when a local model on affordable hardware with environmentally friendly energy consumption arrives.


Oh, https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks might be helpful - it provides benchmarks for Q4_K_XL vs Q4_K_M etc. for disk space vs KL divergence (a proxy for how close a quant is to the original full-precision model).

Q4_0 and Q4_1 were supposed to provide faster inference, but tests showed they reduced accuracy by quite a bit, so they are deprecated now.

Q4_K_M and UD-Q4_K_XL are the same, just _XL is slightly bigger than _M

The naming convention is _XL > _L > _M > _S > _XS
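For reference, the KL divergence those benchmarks use as a closeness metric can be sketched in a few lines of Python. The probabilities here are toy numbers, not real model logits; the real benchmarks average this over many token positions:

```python
import math

# D(P || Q): how much the quantized model's next-token distribution Q
# diverges from the full-precision model's distribution P.
# 0 means identical; bigger means the quant changed the model more.
def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]     # full-precision next-token probabilities (toy)
q = [0.65, 0.23, 0.12]  # quantized model's probabilities (toy)
print(kl_divergence(p, q))  # small positive number -> quant is close
```

A smaller average KLD at a given file size is what makes one quant a better tradeoff than another.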


Thanks for all your contributions.

Do you think it's time for version numbers in filenames? Or at least a sha256sum of the merged files when they're big enough to require splitting?

Even with gigabit fiber, it still takes a long time to download model files, and I usually merge split files and toss the parts when I'm done. So by the time I have a full model, I've often lost track of exactly when I downloaded it, so I can't tell whether I have the latest. For non-split models, I can compare the sha256sum on HF, but not for split ones I've already merged. That's why I think we could use version numbers.


Thanks! Oh, we do split if over 50GB - do you mean also split into 50GB shards? HuggingFace XET has an interesting feature where each file is divided into blocks, so it'll do a SHA256 on each block and only update the blocks that changed.
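The block idea can be sketched like this (assumed behavior for illustration only - not XET's actual chunking scheme or format):

```python
import hashlib

# Hash a byte string in fixed-size blocks. If the file is re-uploaded,
# only blocks whose hash changed need to be transferred.
def block_hashes(data, block_size=4):
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

old = block_hashes(b"aaaabbbbcccc")
new = block_hashes(b"aaaaXXXXcccc")
changed = [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
print(changed)  # only the middle block differs
```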

That might be the answer -- something like BitTorrent that updates only the parts that need updating.

But I do think I'm identifying an unmet need. Qwen3.5-122B-A10B-BF16.gguf, for example: what's its sha256sum? I don't think the HF UI will tell you. I can only download the shards, verify each shard's sha256sum (which the HF UI does provide), llama-gguf-split --merge them, and then sha256sum the merged file myself. But I can't independently confirm that final sha256sum from any other source I trust.
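For the hash-the-merged-file step, streaming the hash avoids loading a 100GB+ GGUF into RAM. A minimal Python sketch (stdlib only; the file path is whatever llama-gguf-split --merge produced):

```python
import hashlib

# Stream a (possibly huge) file through SHA-256 in 1 MiB chunks,
# so memory use stays constant regardless of file size.
def sha256_file(path, block_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(block_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

This gives you a checksum of the merged file, but as noted above there's currently no trusted upstream value to compare it against.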


> it would be nice to have a place that lists typical models/hardware with their common config parameters and memory usage

https://www.localscore.ai from Mozilla Builders was supposed to be this, but I guess there are not enough users; I didn't find any Qwen 3.5 entries yet.


I tried qwen3.5:4b in ollama on my 4-year-old Mac M1 with my own coding harness, and it exhibited pretty decent tool calling, but it is a bit slow and seemed a little confused by the more complex tasks (also, I have it code Rust, which might add complexity). The task was "find the debug that does X and make it conditional based on whichever variable is controlled by the CLI '/debug foo'" - I didn't do much with it after that.

It may be interesting to try a 6bit quant of qwen3.5-35b-a3b - I had pretty good results with it running it on a single 4090 - for obvious reasons I didn’t try it on the old mac.

I am using 8bit quant of qwen3.5-27b as more or less the main engine for the past ~week and am quite happy with it - but that requires more memory/gpu power.

HTH.


What matters for Qwen models, and most/all local MoE models (i.e. that's where performance is limited), is memory bandwidth. This goes for small models too. Here are the top Apple chips by memory bandwidth (and, to steal from clickbait: Apple definitely does not want you to think too closely about this):

M3 Ultra — 819 GB/s

M2 Ultra — 800 GB/s

M1 Ultra — 800 GB/s

M5 Max (40-core GPU) — 610 GB/s

M4 Max (16-core CPU / 40-core GPU) — 546 GB/s

M4 Max (14-core CPU / 32-core GPU) — 410 GB/s

M2 Max — 400 GB/s

M3 Max (16-core CPU / 40-core GPU) — 400 GB/s

M1 Max — 400 GB/s

Or, just counting portable/MacBook chips: M5 Max (top model, 64/128GB), M4 Max (top model, 64/128GB), M1 Max (64GB). Everything else is slower for local LLM inference.

TL;DR: An M1 Max chip is faster than all M5 chips, with the sole exception of the 40-GPU-core M5 Max, the top model, only available in 64 and 128GB configurations. Any M5 Pro (or any M* Pro, or M3/M2 Max chip) will be slower than an M1 Max at LLM inference, and any Ultra chip, even the M1 Ultra, will be faster than any Max chip, including the M5 Max (though you may want the M2 Ultra for bfloat16 support, maybe; it doesn't matter much for quantized models).
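The rule of thumb behind this: bandwidth-limited token generation has to stream all active weights from memory once per token, so tokens/s is roughly bounded by bandwidth divided by active weight bytes. A back-of-envelope sketch (the 0.55 bytes/param figure is a loose assumption for a ~4-bit quant with overhead, not a measured value):

```python
# Rough upper bound on memory-bandwidth-limited token generation.
# Real throughput is lower (KV cache reads, scheduling, etc.), but the
# bound scales linearly with bandwidth, which is the point above.
def max_tokens_per_sec(bandwidth_gbs, active_params_b, bytes_per_param=0.55):
    """bandwidth in GB/s, active parameters in billions."""
    active_weight_gb = active_params_b * bytes_per_param
    return bandwidth_gbs / active_weight_gb

# e.g. a 3B-active-parameter MoE at ~Q4 on an M1 Max (400 GB/s):
print(round(max_tokens_per_sec(400, 3), 1))
```

This is also why a MoE with few active parameters feels fast even on modest bandwidth, and why doubling bandwidth roughly doubles generation speed for the same model.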


For comparison, most recent (consumer) NVIDIA GPUs released:

- 5050 - MSRP: 249 USD - 320 GB/s

- 5060 - MSRP: 299 USD - 448 GB/s

- 5060 Ti - MSRP: 379 USD - 448 GB/s

- 5070 - MSRP: 549 USD - 672 GB/s

- 5070 Ti - MSRP: 749 USD - 896 GB/s

- 5080 - MSRP: 999 USD - 960 GB/s

- 5090 - MSRP: 1999 USD - 1792 GB/s

The M3 Ultra comes close to a 5070 Ti, more or less.


You should really list memory alongside the graphics cards, and the list above should include (unified) memory and prices as well, at the particular price points.

What I (and maybe others) was curious about was comparing it to the parent's post, which is all about memory bandwidth, hence the comparison.

But it doesn't matter if you have 1000GB/s of memory bandwidth if you only have 32GB of VRAM. Well, maybe for some applications it works out (image generation?), but it's not seriously competing with an Ultra with 128GB of unified memory, or even a Max with 64GB of unified memory.

> but it's not seriously competing with an Ultra with 128GB of unified memory, or even a Max with 64GB of unified memory.

No one is arguing that either, this sub-thread is quite literally about the memory bandwidth. Of course there are more things to care about in real-life applications of all this stuff, again, no one is claiming otherwise. My reply was adding additional context to the "What matters [...] is memory bandwidth" parent comment, nothing more, hence the added context of what other consumer hardware does in memory bandwidth.


If we are talking about Apple silicon, where we can configure the memory separately from the bandwidth (and the memory costs the same for each processor), we can say something like "it's all about bandwidth". If we switch to GPUs, where that is no longer true - NVIDIA won't let you buy a 5090 with more than 32GB of VRAM - then we aren't comparing apples to apples anymore.

A 10GB 3080 still beats even an M2 Ultra with 192GB... memory bandwidth is not the only factor.

https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inferen...


If the model is small enough to fit in to 10GB of VRAM the GPU can win.

But the bigger models are more useful, so that’s what people fixate on.


There is also prompt processing, which is compute-bound, and for agentic workflows it can matter more than token generation (tg), especially if the model is not of the "thinking" type.

This may help you work out the best quant to use for your use case.

https://www.siquick.com/blog/model-quantization-fine-tuning-...


There is also https://sparse.docs.kernel.org/ that can be used to help with refactoring C code.


Interestingly enough, today I found that https://huggingface.co/openai/gpt-oss-20b, when downloaded and run locally, answered this:

====

when have you been built?

I’m a GPT‑4‑based model that OpenAI released on March 14 2023. The underlying training data goes up to early 2023, and the model was trained in the months leading up to that release. If you’re asking about the ChatGPT product that ships the model to users, it went live in November 2022 and has since received updates (GPT‑3.5, GPT‑4, etc.) that keep it current.

====

But when supposedly running it from https://huggingface.co/chat/models/openai/gpt-oss-20b:

====

when have you been built?

I’m a language model created by OpenAI. The current generation (GPT‑4) that powers this chat was first released in March 2023 and has been updated and fine‑tuned up through the end of 2024. My training data runs up to the beginning of June 2025, so I’m built on knowledge available up to that point.

====

And that makes me think that although https://huggingface.co/chat claims to be using the models available to the public at https://huggingface.co , it doesn't seem to be true, and I raised this question here https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/discussions... , https://github.com/huggingface/inference-playground/issues/1... and https://github.com/ggml-org/llama.cpp/discussions/15396#disc... .


I did an experiment while learning about LLMs and llama.cpp: I tried to create a Lua extension for the llama.cpp API to enhance LLMs with agent/RAG code written in Lua, keeping the code simple to learn the basics. After more than 5 hours chatting with https://aistudio.google.com/prompts/new_chat?model=gemini-3-... (see the scraped output of the whole session attached), I got a lot further in terms of learning how to use an LLM to help develop/debug/learn about a topic (in this case agent/RAG with the llama.cpp API using Lua).

I'm posting it here just in case it can help others to see and comment on/improve it (it was using around 100K tokens at the end and started getting noticeably slow, but was still very helpful).

You can see the scraped text of the whole session here: https://github.com/ggml-org/llama.cpp/discussions/17600
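For anyone curious what the agent/RAG basics being explored look like, the retrieval half of the loop can be sketched in a few lines (Python here for brevity; naive word overlap stands in for real embeddings, and the documents are made up for illustration):

```python
# RAG shape: score stored chunks against the query, keep the top-k,
# and prepend them to the prompt before calling the model.
def score(query, chunk):
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split()))

def retrieve(query, chunks, k=2):
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]

docs = ["llama.cpp exposes a C API for loading GGUF models",
        "Lua extensions can wrap C APIs via the Lua C API",
        "railroad diagrams visualize grammars"]
context = retrieve("how to wrap the llama.cpp C API from Lua", docs)
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: ..."
```

A real setup would replace word overlap with embedding similarity, but the retrieve-then-augment flow is the same.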


I asked it to index my project https://github.com/mingodad/parsertl-playground and the result https://deepwiki.com/mingodad/parsertl-playground seems reasonably good (still going through it in more detail, but overall impressive).



Here I've got it to work with recent compilers/OSs https://github.com/mingodad/cfront-3


I'm building a collection of PEG grammars here https://mingodad.github.io/cpp-peglib and Yacc/Lex grammars here https://mingodad.github.io/parsertl-playground/playground ; both are wasm-based playgrounds to test/develop/debug grammars.

The idea is to improve the tooling to work with grammars, for example generating railroad diagrams, source, stats, state machines, traces, ...

On both of them, select one grammar from "Examples", then click "Parse" to see a parse tree or AST for the content in "Input source", then edit the grammar/input to test new ideas.

There is also https://mingodad.github.io/plgh/json2ebnf.html to generate EBNF for railroad diagram generation from tree-sitter grammars.

Any feedback or contribution is welcome!


This is awesome! I've recently begun diving deeper into working with grammars, using them as part of a new project, and these tools look super useful.


There is also https://github.com/ricomariani/CG-SQL-author , which has powerful stored procedure capabilities that can be transpiled to C/Lua/...; you can try it in your browser here: https://mingodad.github.io/CG-SQL-Lua-playground .


There is also https://sciter.com/ , whose author tried to find financing to make it open source but couldn't find enough supporters.


It's not a browser.

