Update: Done! [If I find the time, I'll keep an eye on it for a few days to see whether it's computed daily, but I'm fairly sure that's the case.]
By that logic, there was no value added in updating the syntax from ANSI C to C99: no reason to add double-slash comments, no reason to allow function-local variables to be declared and initialized where you use them instead of at the top of the function, and no reason to allow omitting the return statement from main().
From your reasoning, it also follows that using infix notation instead of Polish notation for arithmetic does absolutely nothing to lower the learning curve.
Yes, but that's not what's happening here. That part of the AGPL is there to stop people from adding more restrictions, but here Mattermost is loosening the restrictions.
> > We promise that we will not enforce the copyleft provisions in AGPL v3.0 against you if your application ... [set of conditions]
I mean... I don't really see how they are. Technically they are, but at the same time they aren't, because the set of conditions makes the loosening of the AGPL conditional. Which to me sounds like a violation of the AGPL, because it's a further restriction: "We will (not) hold the AGPL against you... as long as you do these things..." I really don't think the AGPL was written to be... abused? That way.
You can also see the spirit of what they're going for with the MIT binaries - that's likewise saying the whole project is AGPL, but with a loosening for using it as-is.
Given their goals seem to be
- Permissive use without modification, even in combined works ("MIT binaries"); but
- Copyleft with modification, including for the Affero "network hole", or commercial terms
could you suggest a clearer license option? AGPL triggers copyleft across combined works, LGPL doesn't cover the network hole, and GPL has both problems. Their goals seem really reasonable, so honestly there should be a simple answer. It seems messy, but I like it more than the SSPL/BSL/other neo-licenses.
I don't know anything more reasonable, but I would argue that this (isn't) reasonable precisely because it causes so much confusion due to the ambiguity and their refusal to clarify exactly what the terms really are.
zsync is better for that. zsync precalculates all the hashes and puts them in a file alongside the main one. The client downloads the hashes, compares them to what it already has, then downloads only the parts it's missing.
With rsync, the direction is reversed: you upload hashes of what you have, and the source has to do all the hashing work to figure out what to send you. That's slightly more bandwidth-efficient, but if you're serving even tens of downloads it's a lot of work for the source.
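To make that flow concrete, here's a toy sketch of the client-side comparison (Python, with a made-up block size; real zsync uses rolling checksums plus strong hashes so it can also match blocks at unaligned offsets):

```python
import hashlib

BLOCK = 4096  # bytes per block; arbitrary for this sketch

def block_hashes(data: bytes) -> list[str]:
    """Hash every fixed-size block. The server does this once per release
    and publishes the list as a small sidecar file."""
    return [hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

def blocks_to_fetch(published: list[str], local: bytes) -> list[int]:
    """Client side: compare the published hash list against what we already
    have and return the indices of the blocks we still need to download."""
    mine = block_hashes(local)
    return [i for i, h in enumerate(published)
            if i >= len(mine) or mine[i] != h]
```

The client then issues range requests only for the missing blocks, so the server never hashes anything per download; with rsync, that comparison (and the hashing) happens on the source side for every client.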
The other option is to send just a diff, which is what Google Chrome does, I believe. Google invented Courgette and Zucchini, which partially decompile binaries and then recompile them on the other end to reduce the size of the diffs. These only work against an exact, known previous version, though.
I wonder if the ideas of Courgette and Zucchini could be incorporated into zsync's hashes, so that you get a near-minimal diff but keep the flexibility of not needing an exact previous version to work from.
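To illustrate why the partial-decompilation trick shrinks diffs so much, here's a toy sketch (the fake instruction format is invented for this example; real Courgette works on actual x86/PE internals): inserting one instruction shifts every absolute jump target after it, so a raw diff touches lines all over the file, but after rewriting the targets symbolically the diff collapses to the single real change.

```python
import difflib

def assemble(program, labels):
    """Encode jumps with the absolute index of their target,
    the way machine code embeds absolute addresses."""
    return [f"jmp {labels[op[1]]}" if isinstance(op, tuple) else op
            for op in program]

def disassemble(binary):
    """Strip the concrete addresses back out, leaving symbolic jumps."""
    return ["jmp <label>" if line.startswith("jmp") else line for line in binary]

def delta(a, b):
    """Count added/removed lines in a unified diff of two line lists."""
    return sum(1 for line in difflib.unified_diff(a, b, lineterm="")
               if line.startswith(("+", "-")) and not line.startswith(("+++", "---")))

old_src = ["nop", ("jmp", "end"), "nop", ("jmp", "end"), "ret"]
new_src = ["nop", "nop", ("jmp", "end"), "nop", ("jmp", "end"), "ret"]  # one nop inserted

old_bin = assemble(old_src, {"end": 4})  # jumps encode address 4
new_bin = assemble(new_src, {"end": 5})  # everything after the insert has shifted

print(delta(old_bin, new_bin))                            # raw diff: every shifted jump shows up
print(delta(disassemble(old_bin), disassemble(new_bin)))  # symbolic diff: only the inserted nop
```

As I understand it, Courgette's real pipeline is fancier (disassemble, diff the symbolic form with bsdiff, reassemble on the client), but the win comes from exactly this normalization of shifted pointers.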
I just kept scrolling, hoping it would learn from how long I paused over content to read it, the way FB's feed seems to, but it seems you're right: in this case "likes" are required.
> The 1.8-bit (UD-TQ1_0) quant will run on a single 24GB GPU if you offload all MoE layers to system RAM (or a fast SSD). With ~256GB RAM, expect ~10 tokens/s. The full Kimi K2.5 model is 630GB and typically requires at least 4× H200 GPUs.
If the model fits in VRAM, you will get >40 tokens/s when using a B200.
To run the model at near-full precision, you can use the 4-bit or 5-bit quants. You can also go higher just to be safe.
For strong performance, aim for >240GB of unified memory (or combined RAM+VRAM) to reach 10+ tokens/s. If you're below that, it'll still work (llama.cpp can run via mmap/disk offload), but throughput may fall from ~10 tokens/s to under 2 tokens/s.
We recommend UD-Q2_K_XL (375GB) as a good size/quality balance. Best rule of thumb: RAM+VRAM ≈ the quant size; otherwise it’ll still work, just slower due to offloading.
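As a rough illustration of that rule of thumb (just the budget arithmetic, using figures from this thread plus one hypothetical quant size; it ignores KV cache, context length, and OS overhead):

```python
def fits(quant_gb: float, vram_gb: float, ram_gb: float, slack: float = 1.05) -> str:
    """Back-of-the-envelope check: does RAM + VRAM roughly cover the quant size?"""
    budget = vram_gb + ram_gb
    if budget >= quant_gb * slack:
        return "should stay fully resident -> decent speed"
    return "still runs via mmap/disk offload -> expect a big slowdown"

# 24GB GPU + 256GB system RAM, as mentioned above:
print(fits(quant_gb=375, vram_gb=24, ram_gb=256))  # UD-Q2_K_XL (375GB) -> offloading, slower
print(fits(quant_gb=240, vram_gb=24, ram_gb=256))  # hypothetical ~240GB quant -> fits
```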
I'm running the Q4_K_M quant on a Xeon with 7x A4000s and I'm getting about 8 tok/s with a small context (16k). I need to do more tuning - I think I can get more out of it - but it's never gonna be fast on this suboptimal machine.
You could add one more GPU so you can take advantage of tensor parallelism. I get about the same speed with 5x 3090s with most of the model in 2400MHz DDR4 RAM: 8.5 tok/s almost constant. I don't really do agents, just chat, and it holds up to 64k.
That is a very good point and I would love to do it, but I built this machine in a desktop case and the motherboard has seven slots. I did a custom water cooling manifold just to make it work with all the cards.
I'm trying to figure out how to add another card on a riser hanging off a SlimSAS port, or maybe I could turn the bottom slot into two vertical slots... the case (Fractal Meshify 2 XL) has room for a vertically mounted card that wouldn't interfere with the others, but I'd need to make a custom riser with two slots on it to make it work. I dunno, it's possible!
I also have an RTX Pro 6000 Blackwell and an RTX 5000 Ada... I'd be better off pulling all the A4000s and throwing both of those cards in this machine, but then I wouldn't have anything for my desktop. Decisions, decisions!