[flagged] Revert for jart’s llama.cpp MMAP miracles (github.com/ggerganov)
86 points by mmoustafa on April 2, 2023 | 86 comments


I really dislike giving HN exposure to this kind of issue; it only brings us the forbidden pleasure of voyeurism while not helping the maintainers & contributors in the slightest – and can even crystallize conflicts while we eat popcorn.

Let us let them take their time, wash their dirty laundry among themselves, and take the time they need to go forward on the project.


There's an opportunity for a wider discussion about how these situations should be handled. We can all learn and grow from that.


Such discussions are a dime a dozen on any major GH repo; there's no need to follow the latest, most fashionable scandal.


Ideally, this will be resolved diplomatically by someone who knows how to handle these personalities. From that point we might learn from their example.


i vote for someone like dang to handle these open-source drama episodes.

i have limited knowledge on what is being discussed but i echo the "ego-driven" comment below. because it's Sunday. and the replies are coming in so fast that it's certain, they are not giving enough thought to the other side's argument. and already talking about forking, not accepting collaborators, etc.


[flagged]


Please tell me what people who would discover this thread on HN could constructively add to this conversation.

> Real discussion and transparency can involve multiple viewpoints that conflict with each other while each having their plusses and minuses.

Which is exactly what's currently happening between the concerned people in the GH issue.


> Please tell me what people who would discover this thread on HN could constructively add to this conversation.

I mean is that really the bar for posting hacker news articles?

> > Real discussion and transparency can involve multiple viewpoints that conflict with each other while each having their plusses and minuses.

> Which is exactly what's currently happening between the concerned people in the GH issue.

Yeah that was my point.


At least one reasonable bar is not effectively vandalizing other communities by airlifting in a bunch of uninvolved commenters into an already drama filled situation.


ggerganov commented Apr 2, 2023

So this is pretty stupid - I just lost my Sunday trying to figure out how to salvage this stupid drama

@anzz1 and @jart

You are no longer welcome as collaborators to the project.


But you should quote it completely, else it's manipulation:

> You are no longer welcome as collaborators to the project. I know you really care about it and only doing it because you really want to make it better - I'm 100% sure about this. But in fact, you are doing the opposite. And if you fail to see this - I'm sorry


It's a really bad reply from Greg.

He's the owner of the project, he has the power to accept / not accept changes, and he didn't object to the version change; now he pushes responsibility onto the contributors. It's an ugly way of dealing with other people.

The way to solve this situation is to set up a video call between them to deescalate the emotional part of the situation (which is not a big deal anyways, we can wait a few days for the technical details to get resolved).


Not super surprising. These projects have (understandably given the current environment) blown up in the last few weeks. The level of exposure / pace / etc is something very few people in general are probably equipped to deal with.


I agree. Also as a person who tried to contribute to open source projects, I understand the pressure Jart had to put all his changes into 1 big change instead of splitting it up to pieces: there's always a power asymmetry that favors the project maintainer by default as he's getting more famous (which happened to Greg).

If Jart had tried to do everything the "proper way", this change wouldn't have had this visibility, and he would have had a much harder time changing the memory layout in the file.

I think things are going great, I just wish people would be able to not care about the negative feedback part that comes with it naturally.


looks like a she/her, and she _might_ have mis-appropriated attribution. Not sure, but we live in a world that doesn't assume good-will. https://rentry.org/Jarted


Then he'd have to do a video call between him, myself, and 4chan.


huh? Of all things what does 4chan have to do with anything here?


See https://rentry.org/Jarted I'm not sure if it's 4chan but it appears to be some chan.

When I was working on solving the mmap() problem in https://github.com/ggerganov/llama.cpp/issues/91 (that issue tracks the full history) I originally wrote a malloc() hack to prove zero latency load times could work. But I thought we should hold off and wait until we could fix the file format before merging anything into master.

A week or so later, an anonymous GitHub user called Slarin pointed out something really smart. He said the 7B model only has 1D tensors, so we could bring the benefits of mmap() to 7B users right away. My file format change would have solved it for all models. But it wasn't ready yet. So when Slarin asked me to abandon my mmap branch, start over, and use his change as the starting point, I said sure. Since that would have ensured users of the project get some benefit as soon as possible. And we could iterate.

So Slarin and I collaborated on Discord. I had a really positive impression. As far as I knew, everything was fine. Slarin made a pull request, which he kept in draft mode. The work he was doing was basically to (1) call mmap() to load the full weights into memory, and (2) update the tensor->data instead of calling read() if the tensor was 1D. The issue is that Slarin got blocked on getting the WIN32 support to work. He had difficulties getting it to not crash. And by the time he was blocked, I had already finished rewriting the file format and data conversion tools. So I said, since I'm done with the full change, just push whatever you did to your PR branch, and I'll just rebase on that, I'll fix your WIN32 code, and then we'll merge our commits as part of the same PR. That PR was https://github.com/ggerganov/llama.cpp/pull/613 where I was very careful to recognize and document all of Slarin's contributions, which are now reflected in the master branch.
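
The two steps described above can be sketched roughly as follows. This is a minimal Python illustration and not the llama.cpp code: the flat file of native-endian float32 values and the names `write_demo_weights` and `map_tensor` are assumptions made up for the example. The point it tries to show is that the tensor view refers into the mapping, rather than being filled by a read() into a separate buffer.

```python
import mmap
import os
import struct
import tempfile


def write_demo_weights(path, values):
    """Write a hypothetical flat file of native-endian float32 values."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"{len(values)}f", *values))


def map_tensor(path, offset, count):
    """Map the whole file read-only and return (mapping, view).

    The view is a zero-copy float32 window into the mapping. It plays the
    role of pointing tensor->data at the mapped file instead of calling read().
    """
    with open(path, "rb") as f:
        mapping = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mapping)[offset:offset + 4 * count].cast("f")
    return mapping, view


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "weights.bin")
        write_demo_weights(path, [0.5, 1.5, 2.5, 3.5])
        # Skip the first float (4 bytes) and view the next two in place.
        mapping, view = map_tensor(path, offset=4, count=2)
        print(list(view))
        view.release()   # the view must be released before the mapping closes
        mapping.close()
```

Nothing is copied until the values are actually touched, which is where the near-instant load time would come from under these assumptions.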

Slarin agreed, and told me "it looks like you've got this situation handled" and then left. We haven't seen him on the Discord since. I honestly had no idea he was so unhappy about the way things went. It really bums me out, because I have no way of ensuring people stay happy if they don't tell me they're unhappy. It appears, based on the link above, that Slarin felt I had put him in charge of the mmap() contribution and that the change belonged to him because his work was going to be its genesis once merged in the master branch. After the contribution went viral, he probably felt I didn't do enough to give him credit (I'm not exactly sure how to credit an anonymous developer more than I did?) and as a result he went to an online forum to ask a troll brigade to give justice.

The fallout of that "chan" troll brigade, is what you see here. They've also been going after my Wikipedia page among other things. I'm probably never going to work with an anonymous person again. I'm a public figure. I live with the sword of damocles hanging over my head. The power that gives anonymous people over me is really quite huge. The consequences here have damaged my reputation and career. And I don't even know who I'd reach out to if I wanted to mediate this. Quite a shame.


I was trying to distance myself from this situation, but this is just too painful to read. I am sincerely sorry that people have harassed you on my behalf, but I have no control over what some people say or do on an anonymous board based on publicly available information.

That doesn't mean that I am happy with the way our collaboration was handled. Why did you create a new converter when you knew there was already an existing pull request that addressed the same issue? Why did you modify the model format and break backwards compatibility when the current format was proven to work with mmap? Why did you change the magic string of the file format to include your initials, when there was an explicit version number field for this purpose? Why did you create a new pull request when you could have added your changes to mine? Why did you rush to merge the PR instead of taking your time to verify that everything worked properly, while listening to feedback from the other contributors and users? Why did you ignore concerns raised by other contributors in my PR? Why are you claiming that I was unable to make the WIN32 code work when the final version in your PR is virtually identical to mine, making me look incompetent?

Ultimately, it was my decision to move on, close my PR and allow yours to continue unchallenged, and I owned that decision every single time that I have commented about it, including in the PR linked in this post, where I recommended keeping your PR and working on fixing the issues being raised. I am sorry that some people have harassed you, but making me responsible about this is extremely unfair. There are plenty of reasons for people to feel disappointed about your behavior without me having to say anything about it.

I don't expect that anyone will believe me about this, after all I am just an "anonymous person". The truth is, I am extremely wary about posting this because I know how much damage you can do to me if you insist on this route to your followers. What is your theory, that because I am nobody I have nothing to lose? How are you not aware of the huge power imbalance between a "celebrity programmer" with thousands of followers and a nobody like me?

Anyway, all the information is publicly available on github for anyone who cares enough to verify it.

- slaren


  > I was trying to distance myself from this situation [...]
You were trying to distance yourself from the situation?

Looking at https://rentry.co/Jarted [0], there was somebody claiming to be you saying things like:

1. "Slaren here. Pretty much all that has been said here is correct, what jart did was to take my code, remove backwards compatibility, add a new converter and then proceed to take all the credit."

2. "To understand why I did that, you have to go back here: https://github.com/ggerganov/llama.cpp/issues/91#issuecommen... jart initially created an implementation of mmap a couple of week back that was an abomination that relied on doing things like replacing malloc. Completely unworkable in a real code base. [...] So anyway, I joined jart's discord and talked to her about this a bit, she seemed to be interested in collaborating and that's why I added her as co-author, even thought she didn't write a line of code of the PR. Eventually out of nowhere she opened the PR that you all know and asked me to close mine. That's when I realized what was happening. So whatever, I did what she asked, left her discord and tried to forget about it".

This was intermingled into comments calling jart a "troon" (a derogatory term for transwoman).

Are you saying unequivocally that this wasn't you?

Even assuming that somebody was stealing your identity, why didn't you point out the problems with the PR upfront? One can't sit on the sidelines casting aspersions (even posthoc) while also claiming that they are distanced from the situation.

Either (a) people are competently pretending to be you on some kind of anti-trans imageboard by somehow managing to make past statements that are perfectly consistent with the statements you've made today, or (b) you did pop into that thread to stir up shit.

I think there is a later comment in that thread from you responding to someone saying `#JusticeForSlaren` and requesting that they don't do anything, but that was yesterday and at this point it was too late.

In general what can be seen online doesn't look good for you and you are lucky to be anonymous. I know that you feel that jart stole your glory and it seems they did, but you responded with passive aggressive behaviour and whipped up a mob -- it's hard to believe you are stupid enough to not realise what you were doing with your comments. There were many better responses you could have made: you picked the worst one.

[0] The `.org` is mysteriously down: https://web.archive.org/web/20230402192741/https://rentry.or...


They weren't calling for drama:

> Maybe I should have contested it but I was and still am going through a pretty rough patch in my life and just didn't have the willpower to start any drama. Mostly I think it sucks because IMO the worse technical solution got merged because their PR had a more flashy title.

>That's when I realized what was happening. So whatever, I did what she asked, left her discord and tried to forget about it.

>I really don't want to start any drama so I'll just say that I wrote the code in my commits.

No one claiming to be that user said anything derogatory or ever called for drama. You're making hollow accusations, basically: "Your messages were intermingled among bad comments" "some other users said bad things," "do you denounce you ever gave your side of the story?"

And I'd certainly consider it distancing yourself from the conversation when you close your PR and remain silent when someone essentially steals your code, and goes from "I was co-author" to "my code," "my work," "I did this," "I'm the author" all over Twitter, a bragging PR where she changed the magic number to her initials, etc.

I'm the one "responsible" for noticing this and raising the flag that something isn't right. Code was stolen and the toxic user responsible was taking more and more credit. The community reacted appropriately, on the whole, as did the owner of the project in banning the plagiarist. She's free to add her side of the story, of course. This issue was actually raised with her on Twitter twice and she ignored it, before it made its way to Github. To the extent that drama was caused by this, it's wholly the fault of the person who created this situation with her unethical behavior and intentionally misleading statements.

Of course no one can verify if that user is the same one in the 4chan threads. It's 4chan. But the commit history speaks for itself, and is well-documented by now.

None of this is an excuse for derogatory terms or slurs to be used on 4chan (or elsewhere), but you're intentionally muddying the waters.


>The major point I make is that the posts online that purport to come from Slaren do not show that they had "distanced themselves from [the] situation".

It's evident from simply looking at the PR that the user @slaren on the Github distanced himself from the situation. Days had passed with no one discussing the stolen code until I brought it up in the original PR (which jart rebased off of and created the infamous "Make loading weights 10-100x faster" PR)

> Yes, I can see that you were one of the key people that created drama, by asking "I'm wondering how much was written by you and how much by jart?" [0]

I didn't create drama. Jart created drama. By stealing code. Plagiarism. Then shameless self-promotion, to this very moment.

Do you have absolutely no integrity whatsoever?

>and then when they publicly said they didn't want to start drama, trying to get private comment from them by saying "My contact info is in my profile if there's more to say."

They didn't contact me, and I simply went through the public Github history to document what jart had done, and continues to do.

>I think somebody could choose to believe this, but somebody that reads the GitHub and desuarchive.org threads might also feel that @InconsolableCellist and @slaren had a part to play, too.

Yes, correct. I told you my part, I noticed what jart was doing. Why have you continually ignored what jart did? Why do you seem to think it's some minor issue that she stole code, bragged about it, took all the credit, and damaged the community with unnecessary drama? Why are you so focused on everything except the central ethical issue?

>Yes, it's technically possible that somebody pretending to be Slaren investigated the GitHub and was able to correctly infer exactly what happened chronologically including that they had collaborated on jart's Discord. However, Occam's razor suggests it was Slaren themselves and not a very clever troll.

Even if that were true--which it isn't, to my knowledge--it changes nothing about jart's behavior. Even if every user also used derogatory slurs, it changes nothing about the wrong that was committed (but adds additional wrongs).

Fortunately, for anyone level-headed enough to look at what's been discussed so far, you can see the unethical behavior of the user that stole code, the aftermath, and the appropriate reaction for that user to be banned. The behavior that you've ignored and seem wholly unconcerned with, as if blind to it. Plagiarism.


I believe you meant to respond to: https://news.ycombinator.com/item?id=35430052

However, I won't respond to you here, since (1) it should be quite clear that I think @slaren wasn't given enough recognition for their work from my prior comments and that there is a more positive approach you could have taken to helping to give them this, and (2) the rest of what you said about ethics is subjective, and I think wrong in magnitude -- for example, I'm not sure it's correct to call it "plagiarism" when @jart's PR mentioned the collaboration with @slaren, used co-authored commits and linked to their PR.


"my changes"

"Here's how folks in the community have been reacting to my work."

"I just wrote a change that's going to let your LLaMA models load instantly..."

https://archive.ph/PyPFZ

"I'm the author"

https://archive.ph/qFrcY

"Author here..."

"Tragedy of the commons...We're talking to a group of people who live inside scientific papers and jupyer notebooks."

"My change helps inference go faster."

"The point of my change..."

"I stated my change offered a 2x improvement in memory usage."

https://archive.ph/k34V2

"I can only take credit for a 2x recrease in RAM usage."

https://archive.ph/MBPN0

"I just wrote a change that's going to let your LLaMA models load instantly, thanks to custom malloc() and the power of mmap()"

https://archive.ph/yrMwh

jart was working on a malloc() approach that didn't work and slaren wrote all the code actually doing mmap, which jart then rebased in a random new PR, changed to support an unnecessary version change, magic numbers, a conversion tool, and WIN32 support when that was already working in the draft PR. https://archive.ph/Uva8c


From what I can see, @jart had spent a considerable amount of time on this problem and had posted an interesting-but-not-production hack to it (https://github.com/ggerganov/llama.cpp/commit/5b8023d9354010...) on March 17th, which they had also excitedly posted about on Twitter.

This was 2 weeks prior to @slaren's contribution (https://github.com/slaren/llama.cpp/commit/fc685122f95f212d1...) on March 29th, so in a sense, it's quite possible that what you've just shown is that @slaren saw that @jart was working on mmap support, worked out a cleaner solution and then wasn't happy with only being a co-author -- for their contribution, they believed that they must be the only person mentioned on the PR: although this is weird, since I don't think they even have a public profile, so maybe instead the truth is that they weren't comfortable with working with somebody that hypes up any changes they've worked on for popularity?

I don't think saying "my changes" on Twitter and other social media means what you suggest it does, as it is just informal speech to refer to things you've worked on with "my", and particularly when you see the times this was expanded (e.g. "yesterday my changes to the LLaMA C++ file format were approved") it seems more reasonable than it does without this context.


If you read the rentry you'll see that both of them were working on an issue that l29ah raised, along with other users. jart's work was on something that didn't end up making it in, the malloc() approach. slaren is the one who wrote the code in the commits I linked to, and that's the code that was adopted. You can (and should) do a comparison of the mmap code and see. What I wrote about the version change, magic number, WIN32, etc., is all true too. As is the haste with which the new PR was made, leading to the recent pushes to revert due to swap thrashing and anger over false and rushed claims about "miracle RAM reduction" etc.

In fact, if you read the thread you linked to, you'll see this for yourself too, no rentry required. There's nothing actually objectionable or "repulsive," as jart put it, in that rentry, with the exception of the "r word" being applied to a proposed technical solution.

Your interpretation is incompatible with what we see and the clear timeline. The social media bragging, the second PR, etc., are further evidence. I hope whatever anger you had going into this has abated to the point where you can now actually judge the evidence.


The major point I make is that the posts online that purport to come from Slaren do not show that they had "distanced themselves from [the] situation".

  > No one claiming to be that user said anything derogatory or ever called
  > for drama. You're making hollow accusations, basically: "Your messages 
  > were intermingled among bad comments"
I did not make that accusation. My accusation is that it is ill-advised to enter an anonymous imageboard where people use words like "troon" and often show mob-like behaviour, and to decide there to mouth-off about how someone took all credit for your code, removed backwards compatibility, and to add that their original attempt was an "abomination".

This is not "removing yourself from [the] situation" as Slaren asserts.

  > And I'd certainly consider it distancing yourself from the conversation
  > when you close your PR and remain silent when someone essentially steals
  > your code [...] I'm the one "responsible" for noticing this and raising
  > the flag that something isn't right.
Yes, I can see that you were one of the key people that created drama, by asking "I'm wondering how much was written by you and how much by jart?" [0] and then when they publicly said they didn't want to start drama, trying to get private comment from them by saying "My contact info is in my profile if there's more to say."

  > To the extent that drama was caused by this, it's wholly the fault of the
  > person who created this situation with her unethical behavior and 
  > intentionally misleading statements.
I think somebody could choose to believe this, but somebody that reads the GitHub and desuarchive.org threads might also feel that @InconsolableCellist and @slaren had a part to play, too.

  > Of course no one can verify if that user is the same one in the 4chan threads.
Yes, it's technically possible that somebody pretending to be Slaren investigated the GitHub and was able to correctly infer exactly what happened chronologically including that they had collaborated on jart's Discord. However, Occam's razor suggests it was Slaren themselves and not a very clever troll.

I'm really not muddying the waters here. What you're trying to argue is difficult for me to believe, and whether or not you disagree with the level of recognition given by jart, your comment that "it's wholly the fault of the person who created this situation with her unethical behavior" is ugly. It pins all the blame on jart when it's quite clear from both Slaren and your comments that you were trying to cause drama (anonymously and publicly).

I'd just like to add that if yourself and @anzz1 had wanted to give @slaren the recognition that they deserved in a positive way, you'd have linked to https://github.com/ggerganov/llama.cpp/issues/91#issuecommen... and signal-boosted that as the key insight that enabled the PR to land, rather than taking the approach you took.

[0] https://github.com/ggerganov/llama.cpp/pull/586#issuecomment...


Pretending to be someone else to stir up drama is pretty much par for the course for 4chan, especially because they really, REALLY hate transexuals. I'd have been more surprised if someone WASN'T claiming to be slaren.

Trust absolutely nothing from that site or any other imageboard unless people provide documented evidence, e.g. timestamped picture or signed message, etc

>The stories and information posted here are artistic works of fiction and falsehood.

>Only a fool would take anything posted here as fact.

I don't actually have any idea where this quote appears these days, I've just been hearing it in reference to chans for well over a decade now. I think it might have been on the bottom of every page in the past.


It's /b/'s banner.


I can't thank you enough for posting this. I found the link too repulsive to click that far. I learned a new word today. So that's what they call people like me these days. Hate is such a lost opportunity.


it is good to hear your side. sympathy for all involved. let's hope this is resolved amicably and this important project and its helpful contributors' lives are not further impacted. (I myself believe you. sincerity comes straight through and +1 for not using "probably" in describing what "happened".)


Thanks, that clears it up somewhat (especially the trollings and downvotes).

The cool thing is nobody cares about your wikipedia page, people care about your contributions.

I still think you should hop onto a video chat with Greg and be humble (I hope he can be more understanding and humble as well). File formats don't matter, that's just technical detail (although important) that will be resolved. People relations do matter.


I tried getting him on video chat before this troll brigade even happened. He hasn't responded to any of my messages over the past few days.


I see, I'm sorry about that.


No reason to be sorry. GG has had more success than he can handle right now. It's a good problem to have. It takes quite a stressful toll the first time one of your community projects skyrockets into the big leagues. I'm sure things will be fine if we just wait a little bit for this to blow over.


You're Wondering Now, But All Results Well :)


Hey Jart, I read that rentry and suspected there was more to it, so I'm relieved to see your input and the full picture here.

Such is the way with online drama. Though I think only working with known people might not be it, as the same could still happen (not being communicative on their part, silently brooding, then doing something regrettable). Perhaps it's less likely.

I would venture those in the field understand that unfortunate misunderstandings of this sort happen all the time and find nuance in the situation. Please don't let this stop you from contributing, your work is amazing. Cheers.


ugh, that's so shitty. so many people in this space seem to be absurdly demanding and angry at devs, but one thing I've noticed is that every text AI project discord I've hung out in has this sleazy, obsessive 4chan /g/ vibe hiding somewhere in it.


Wow... I feel sorry for you.

Having an entire troll brigade going after you, after you acted with the best intentions... not nice. I hope you get over it, soon.


The moment you touch a topic that is hot with a bunch of bystanders, you are endangering yourself by fighting with, or drawing in, stupid people with abundant time.


That's where the brunt of the coordination, outreach, and testing around these models is happening.


What is the purpose of re-posting a comment in the linked thread here verbatim?


Updated context for those who read the issue before it was posted, that wasn’t there during the time of the original submission.


This is someone angrily filing a revert-all PR due to a performance regression, rather than helping diagnose the issue or make it configurable. Don't bother reading.

It sounds like one person experienced a performance regression as a result of the llama.cpp MMAP changes, and decided to create a pull request to revert all of those changes. While they propose wrapping the mmap changes behind a feature flag / command-line flag, that's not what this PR does -- it just reverts the original commits. It's a "tear it down NOW" reaction rather than a "how can I improve this" reaction.

jart and a few other people in the PR have now proposed a variety of feature-flags or forked versions to address the issue in a more nuanced way.


That’s not what’s happening. File format compatibility is broken while performance degrades by 10x for some people.


This is the part of Open Source I really despise. It looks like the top contributors in this repository have contributed a few hundred lines of code (as opposed to the 20Kloc by the author). I understand that lines of code is not comparable to level of effort, but there is at least some level of correlation there.

The predominant attitude I have seen with my open source projects is entitlement and anger at decisions I have made, (whether that's because my license isn't MIT or because I don't want to use the latest and greatest features of language X, or because I use 2 spaces instead of 4). I just want to share my code, but some people make this unnecessarily difficult and want to cause drama where there never needed to be any.

Now, with that said, I have also met amazing people who have offered invaluable insights. These people have made contributions, to code and discussions, and on the other side provided amazing libraries and support. I really love Open Source, but there is a certain aspect of the community that can be downright hostile, and I hate that. I never understood why some developers feel the need to belittle others or to scoff at what other people want to share. I hope that if people know my name it's because I encouraged them and gave them help and/or praise for a cool project, and not because I was a dick and made them feel like crap for an inconsequential action.


This contributor doesn't appear to know how mmap works if they're claiming the only benefit is sharing data between processes (what? MAP_PRIVATE mappings aren't shared), and that memory is leaked after the process exits.

There are a lot of thorny issues with mmap, and I'm sure there are legitimate regressions with the approach and things to be fixed, but it sure would be nice to see an analysis from someone who actually knows what mmap is.


The code is related to MAP_PRIVATE mappings of the same file that are not written to. Such mappings are effectively deduplicated and thus occupy the RAM once no matter how many processes map the file.
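
As a small illustration of what a private mapping means, the sketch below uses Python's `mmap` module, where `ACCESS_COPY` requests a copy-on-write mapping (MAP_PRIVATE on Unix). It shows only the visible semantics: two private mappings of the same file read the same bytes, and a write to one stays local to it. Whether the untouched pages are backed by the same physical memory is a kernel detail that this snippet does not measure. The helper names and file contents are made up for the example.

```python
import mmap
import os
import tempfile


def map_private(path):
    """Open a copy-on-write mapping of the whole file."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)


def demo(path):
    """Return what mapping a, mapping b, and the file contain after a writes."""
    with open(path, "wb") as f:
        f.write(b"model weights")
    a = map_private(path)
    b = map_private(path)
    a[0:5] = b"MODEL"  # only a's private copy of the touched page changes
    with open(path, "rb") as f:
        on_disk = f.read()
    result = (a[:5], b[:5], on_disk[:5])
    a.close()
    b.close()
    return result


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(demo(os.path.join(tmp, "weights.bin")))
```

Pages that are never written stay backed by the file, which is why read-only use of such mappings does not need a separate copy per process.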


I don’t understand the controversy in this issue. It seems they could have saved a lot of time spent throwing shade back and forth by just implementing a feature flag.

The argument against the feature flag is ultimately more egregious; it’s an experimental feature, breaks compatibility, decreases memory usage for a fair portion of the population while 10x’ing load speed for the rest so very YMMV for an optimization.

In another project this wouldn’t even be a revert PR but just a PR to feature flag it. Can’t help but notice that more than half the replies on this PR are from people who have a limited understanding of LLMs, admit to it, and are just adding noise because this project is popular right now. First step would be to lock this to contributors only so they can get a more streamlined discussion going.
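
For what a feature flag of this kind might look like, here is a minimal Python sketch. It is not the project's code: the `--no-mmap` option name, `build_parser`, and `load_weights` are hypothetical names chosen for illustration, and the loader simply returns the raw bytes or a mapping of them.

```python
import argparse
import mmap


def build_parser():
    """Command line for a hypothetical loader with an opt-out flag."""
    parser = argparse.ArgumentParser(description="hypothetical model loader")
    parser.add_argument("model", help="path to the weights file")
    parser.add_argument(
        "--no-mmap",
        dest="use_mmap",
        action="store_false",
        help="read the file into memory instead of mapping it",
    )
    return parser


def load_weights(path, use_mmap=True):
    """Return a read-only mapping of the file, or its bytes if mmap is off."""
    with open(path, "rb") as f:
        if use_mmap:
            return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        return f.read()
```

A caller would do something like `args = build_parser().parse_args()` followed by `load_weights(args.model, args.use_mmap)`, so users who hit a regression can fall back to the old path without a revert.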


In my experience, there will always be a population of developers/users that prefer for things to always stay the same (and therefore never break). Unfortunately that means never improving.


> > > memory mapping means that the model will stay behind and eat your memory even after the process is closed

> > I don't think I'm understanding this right: You're saying that memory will not be freed by the OS after the process terminates?

> You're understanding it perfectly. The whole raison d'etre for mmap() is the ability to leave stuff in RAM (or swap, albeit if that happens it's completely detrimental to this use case) unlinked from the process itself. Basically it's just storing a file/memory block in RAM which can be accessed from multiple processes.

Wow, does this developer not understand the nuances of mmap().


The fundamental operation of mmap is to add new entries to the page table of a process, and the precise properties of those entries are heavily dependent on what the arguments to mmap are.

When you mmap a regular file, you're essentially adding an entry to the page table that shares the data with the kernel's filesystem cache. I think he was trying to explain the implications of this fact, but doing so in an incredibly garbled manner, and getting his conclusions wrong.

There are performance implications to using mmap (not always good, not always bad), but both sides of the discussion here immediately dug their heels in on their conclusion without anyone trying to do any analysis to see what the actual implications were, and why.


> I think he was trying to explain the implications of this fact, but doing so in an incredibly garbled manner, and getting his conclusions wrong.

Yeah. I like using the expression "knows just enough to be dangerous" (usually applied in humility to myself), and this situation is such a perfect example. Someone who seems to know just enough about the advanced workings under mmap() to completely misunderstand the implications.


If you squint maybe you can argue that mmap will leave things in the page cache. But, you know, it doesn’t matter and not even munmap will save you there so I have no idea what they’re getting at.


Rather than just pointing and laughing, your post would have more substance if you could correct the flawed assumptions.


Why revert it instead of building on it? The author of the revert could easily start working on reintroducing the previous format behind a flag. AFAIK llama.cpp is not even v1 yet. I see this revert PR as unnecessary.


I’m not in the know at all here, but the original PR wasn’t purely additive: there was code deletion, and additions across a number of files. It seems to change the checkpoint format. The code would have to be abstracted differently for it to be placed behind a flag.


I understood that, but it was accepted. We don't need to cry over spilled milk; one can re-add the previous model based on the removal PR. No need to push a revert.


Feature flags are great but they should not be used as a crutch to leave unfinished or non-working code in the codebase. IMO they should be used to rapidly pull back the change before the slower revert can take place.


Can someone in the know describe what the hullabaloo is about?

Seems like ego-driven optimization that breaks compatibility?


A project that has been generating a lot of buzz lately (CPU-based inference for Facebook's LLaMA model that works on commodity hardware) has attracted contributions from a tech/activist celebrity (https://en.wikipedia.org/wiki/Justine_Tunney). Their somewhat overly self-assured/-aggrandizing style (e.g. GitHub posts written in a tone like they run the place, changing the file format magic number to include their own initials alongside those of the project's originator) rubbed many people the wrong way, and a sweeping change they introduced may have resulted in performance regressions for several users (while also being hugely oversold: a fantastical and quickly disproven claim about significantly reduced memory usage sat at >1k upvotes on HN yesterday). Then another long-standing contributor made a PR just seeking to flat out revert the patch in question. The discussion quickly turned toxic, with a (thankfully) not-yet-quite-vocalised US culture war undercurrent and people bandwagoning based on their personal disposition towards the person at the core of it.


Also worth mentioning that there is some level of controversy over how much of this work involving mmap should be attributed to jart vs slaren. Slaren originally authored a PR using mmap which some people are claiming was the much better implementation of the feature (including not needing to change the model format) and jart basically re-wrote it so that she could take credit for it.


> basically re-wrote it so that she could take credit for it.

compare the actual PR: https://github.com/ggerganov/llama.cpp/pull/613

You'll note that it includes the original commits from Slaren, explicitly mentions the collaboration with Slaren and explicitly requests to preserve these commits on merge.


It's not ego-driven: the requirements for running any model dropped by more than half, and you can now run the largest model, 30b, on off-the-shelf home computers. The change is very welcome in the community.


There were no improvements to memory use, the earlier GBs were a measurement error. If you couldn't run a model before (or were swapping, so running very slow), then you still have the same problem. You will actually "swap" a lot more than before if you have barely enough memory, but this is fixable with the --mlock flag.

Edit: For everyone downvoting, please tell me what is wrong with my comment. I don't have a bone in this fight.


4chan came here to chime in a bit. You were downvoted by 10 year old children, good luck explaining to them how memory locking works :)


mmap breaks previous (not guaranteed) compatibility, and a (few) people are demanding an option to turn it off by reverting all of the commits, throwing their hands up, and demanding more testing, more documentation, and more optional arguments. While I agree with some of the premises, it’s up to ggerganov in the end. If he wants this to be the default, so be it. Throwing a (polite) tantrum on the grounds that the project should please everyone, while not offering to do any of the work, is an entitled demand.


The tl;dr as I understand it is that jart misunderstood what was actually happening and what the benefits of the mmap optimization were… the claims of actually being able to shrink the model size from 20GB to 6GB were just completely false, and while there was a model loading time improvement, actual memory required and used did not change.

A number of people saw this and said that making a breaking change to the repo that a lot of people are using and have forked for other models was a bad idea, thus this new PR.


The 20GB to 6GB confusion appears to have come from the title of the Hacker News post the other day: https://news.ycombinator.com/item?id=35393284

The PR it linked to said nothing of the sort: https://github.com/ggerganov/llama.cpp/pull/613


Sadly I’m on my phone at the moment and can’t find the specific post, but in that PR or related discussion there was talk of only a few GB of the weights actually being used during the computation, which anyone who understands how a multi-headed attention transformer works would know is impossible… your QKV matmuls need to touch all of the weights once you go through all the layers. That post getting 1200+ upvotes yesterday has since resulted in multiple conversations in my social circle that took that untrue statement as fact.
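
To spell out the "touch all of the weights" point, here is the standard per-layer attention projection in generic transformer notation (not taken from the PR; $X$ is the layer input and $W_Q$, $W_K$, $W_V$ are assumed names for that layer's projection matrices):

```latex
Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V,
\qquad\text{with}\qquad
Q_{ij} = \sum_{k} X_{ik}\,(W_Q)_{kj}
```

Each output entry sums over a full column of $W_Q$, so a dense matmul reads every entry of the projection matrices for every token, in every layer.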


The comment you're looking for is here[1], where @jart does indeed seem to be echoing the claim that memory usage is reduced to surprising levels given the LLM architecture. Justine's original guess was that this is because the model is sparse (?) but in actuality it seems like memory usage is just being reported strangely by the system and that a model that takes up 20GB on disk still effectively needs 20GB of RAM in the end.

[1] https://github.com/ggerganov/llama.cpp/discussions/638#discu...


> [which anyone who understands] how a multi headed attention transformer works [would know is impossible]

How is that productive? The author already claimed in their changes that they aren't an ML expert and asked for advice.

It could be written like

multi-headed attention transformers frobnicate all the weights in their QKV matmuls. <supposition ... I would look at IO rates during inference ...>

You have an opportunity to teach rather than admonish.


I appreciate your feedback, and I agree I could have worded it better and taken it in a more constructive direction. Thank you, and I hope that I and others will try to have more productive discourse.


here mmap is being used for essentially lazy loading

that's it
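
A small sketch of what "lazy" means here, in Python with a sparse 64 MiB temp file standing in for a model (`SIZE` and `read_tail_marker` are made-up names). Creating the mapping reads nothing up front; only the pages that actually get touched are faulted in from disk.

```python
import mmap
import os
import tempfile

SIZE = 64 * 1024 * 1024  # 64 MiB stand-in for a model file


def read_tail_marker():
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "r+b") as f:
            f.truncate(SIZE)      # sparse on most filesystems
            f.seek(SIZE - 4)
            f.write(b"tail")      # marker in the last four bytes
            f.flush()
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                # Slicing touches only the pages holding these bytes.
                return len(m), m[SIZE - 4:SIZE], m[:4]
    finally:
        os.unlink(path)
```

The catch raised elsewhere in the thread still applies: if inference ends up touching every page anyway, lazy loading changes when the reads happen, not how much memory is eventually needed.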


But then it's pretty much the same as using the old version and turning on swap, so I don't really see the point. As far as I understand the whole model needs to be read constantly so there's no benefit from the random access mmap provides.


Swap is a system-level property, not a program-level property. They are similar and use similar mechanisms, but the experience a user would see is very different.


I'm not sure it would be that different but mmap has the benefit that it can swap directly to the model file on the disk instead of making a copy in swap space.


mmap is like tightly scoped, targeted swap. If the PoR is on disk, the OS is free to reclaim that memory for something else at any time. It really is a beautiful hack, but if you turn on swap for the system in that way, it has to balance memory usage across all the running programs.

In this use of mmap, there is nothing to swap out as the source of truth is always on disk. During general swap usage, memory has to travel in both directions.


What's ego-driven optimization?


Pushing optimizations for personal glory that look good with some benchmarks but may have unthought of or hidden regressions on other aspects of the code/user base.

By the same token, this may be an ego-driven revert. From reading the issue, it seems like some people might be salty that jart gets a lot of publicity for a few changes while other people with bigger contributions to the project don't.


The issue was with the model versioning compatibility change and someone inserting their initials in the magic number. The optimizations worked great for me on Linux: the first time you run the CLI program you have to wait a long time, more than 1 minute for a 20GB model, but the next runs are instant. Maybe some Windows guys are salty that this does not work as well on their OS; if that is the case, they should update it to get the performance boost. These are not some nanoseconds you gain but actual minutes for each run.


Someone should use GPT3 and a voice model to make a dramatic version of github issues.


So this matches a pattern of submissions that eventually is or should be flagged. Can we not post GitHub issues that represent inter-project drama? It's not the discussion here that is the problem; it's effectively bringing commenters FROM here to stir up drama there, where tensions are already high.


The gist is that people running proprietary operating systems without the capability don't benefit from the change and don't want it to be the default.


Editorialized title. Please rename it to "Bring back the ggml model format and revert breaking mmap change (#613)" @dang


The only editorialized bit is "miracles," in my opinion. The current title gets closer to including the relevant context than does your updated title.


So much snark! Here and in the issue. Programming is hard enough.



