Hacker News
Code for the Byte Pair Encoding algorithm, commonly used in LLM tokenization (github.com/karpathy)
81 points by magoghm on Feb 18, 2024 | 31 comments


Hi all, it's still too early to look at this code :) I wanted to put up my alpha version to start the feedback going a bit. I'm recording a video alongside where we build minBPE and expect that to be a lot more useful, coming out in a few days.

I will say tokenization is UGLY. Really ugly. There are a ton of little, subtle, sharp edges and gotchas. Eternal glory awaits anyone who figures out a way to delete this stage.


Really wish that more folks had read Gwern's authoritative piece on why BPE causes serious issues for LLMs, particularly for creativity and also for things like math: https://gwern.net/gpt-3#bpes


Karpathy himself has raised some of these issues with BPE https://twitter.com/karpathy/status/1657949234535211009

There is research on getting rid of tokenization, some of which is linked in that thread.


The Gwern article significantly predates the Twitter thread.


That is fair. I didn't mean to imply it was earlier, just that he had mentioned it and research continued in this area. The article you linked is definitely more comprehensive.


Do you have a link to the original Gage paper from 1994?



Hehe.

https://github.com/karpathy/minbpe/blob/master/minbpe/base.p...

This code will incorrectly count the pairs in [1,1,1] as two occurrences of (1,1), even though there is only one if you consider what the merge function does. ;)

This can be seen in real-world scenarios, for example when there are many consecutive spaces.

My attempt to implement BPE has the same flaw, and the fix appears to be fairly complex. But, looking at the stats, it just makes such pairs show up at the top a couple of merges earlier, so it seems to be a benign flaw. Still, it's interesting that this is not mentioned in the code.
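
A quick sketch of the mismatch, using my own minimal stand-ins for the repo's pair-counting and merge helpers (not code from the repo):

  def count_pairs(ids):
      counts = {}
      for pair in zip(ids, ids[1:]):  # every overlapping adjacent pair
          counts[pair] = counts.get(pair, 0) + 1
      return counts

  def merge(ids, pair, idx):
      # replace non-overlapping, left-to-right occurrences of `pair` with `idx`
      out, i = [], 0
      while i < len(ids):
          if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
              out.append(idx)
              i += 2
          else:
              out.append(ids[i])
              i += 1
      return out

  print(count_pairs([1, 1, 1]))       # {(1, 1): 2} -- counted twice
  print(merge([1, 1, 1], (1, 1), 2))  # [2, 1]      -- but only one merge happens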


  def get_corrected_stats(ids):
      corrected_counts = {}
      i = 0

      while i < len(ids) - 1:
          current_pair = (ids[i], ids[i+1])
          # Count the pair if it is new, or if the current element is not a
          # repeat of the previous one (or it's the first element)
          if current_pair not in corrected_counts:
              corrected_counts[current_pair] = 1
          else:
              if i == 0 or ids[i] != ids[i-1]:
                  corrected_counts[current_pair] += 1

          # If the current and next element are the same, skip to the next
          # unique element
          if i < len(ids) - 2 and ids[i] == ids[i+1]:
              while i < len(ids) - 1 and ids[i] == ids[i+1]:
                  i += 1
          else:
              i += 1

      return corrected_counts

I have a basic grasp of this topic. Can you explain why the proposed solution is (or isn’t) effective in this scenario?


I do not understand why your solution skips to the next unique element.

Consider the sequence [1,1,1,1,1,1]. We should count 3 (three) (1,1) pairs there. Karpathy's (and my) algorithm would count 5 pairs; yours will count 1.

One should just skip one or two elements while counting pairs: one element when the pair's elements are different, two when they are the same (see the sketch below). This may break vectorization, though.

I should also note that the error is there, but no one hits it. My experiments show that (A,A) pairs get counted as most frequent ahead of their natural turn, about half a dozen symbols earlier. I also never saw the symbol B, assigned to the pair (A,A), used in any most-frequent pair before the actual turn for (A,A). One would need to construct a special string for that to happen.
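
Roughly, that skipping rule as a minimal sketch (not my actual code):

  def nonoverlapping_stats(ids):
      counts = {}
      i = 0
      while i < len(ids) - 1:
          pair = (ids[i], ids[i + 1])
          counts[pair] = counts.get(pair, 0) + 1
          # advance by two when the pair is a repeat, so its middle element
          # is not reused; otherwise advance by one
          i += 2 if ids[i] == ids[i + 1] else 1
      return counts

  print(nonoverlapping_stats([1, 1, 1]))           # {(1, 1): 1}
  print(nonoverlapping_stats([1, 1, 1, 1, 1, 1]))  # {(1, 1): 3}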


It sounds like that sequence is being k-merized. In the case of byte-pair that should be a look-ahead of 1 and then a shift up of 1. Maybe I am missing something.


Could something like that be why generated images tend to put people in a t-shirt that reads 'fooobar' or similar if you ask for, say, 'man in t-shirt fiddling with foobars'? Or is that expecting far too tractable and deterministic a link, considering that understanding how the model actually works (I can't recall the name for it) is a research area in its own right?


Nice catch. Seeing as this is implemented by pretty much every LLM out there, it's probably not such a big deal?


More or less so. The resulting vocabulary will be structurally the same even if you correct for that; only the indices will differ slightly.

When I spotted this in my code, I investigated whether it affects my encoding. It turned out that it does not.


And, for Indic languages, there is orthographic syllable pair encoding, which reduces the amount of space taken by Indic characters by encoding the most common letter pairs as a single byte. This was presented, I think last year, at the URTC at MIT.

(Disclosure: it was published by my friend.)


Isn't that a specialization of BPE for a specific distribution? BPE is essentially a data compression scheme that constructs a straight-line grammar (SLG) that can be used directly as a model encoding, and it is hardly surprising that some other SLG would work well for some other inputs.
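
For a concrete picture of the grammar view, here is the small textbook example often used to introduce BPE (a sketch, not from this thread):

  # BPE viewed as a straight-line grammar on the classic example string.
  text = "aaabdaaabac"
  rules = [("Z", "aa"), ("Y", "ab"), ("X", "ZY")]  # merge rules, in order

  compressed = text
  for symbol, pair in rules:
      compressed = compressed.replace(pair, symbol)
  print(compressed)  # XdXac

  # Expanding the rules in reverse order regenerates the input, i.e. the
  # rules plus the start string "XdXac" form a grammar for the text.
  expanded = compressed
  for symbol, pair in reversed(rules):
      expanded = expanded.replace(symbol, pair)
  print(expanded == text)  # True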


It is less byte-pair at that point (as the Indic languages generally take up 2-3 bytes per codepoint).


Yeah, of course, but I thought BPE should pick it up anyway if your training data is primarily written in Indic scripts.


> for Indic languages

Amateur here, but "Indic languages" are quite different from one another. Maybe you mean Hindi and that group?


Presumably, they’re referring to the Indic scripts (also known as Brahmic scripts): https://en.wikipedia.org/wiki/Brahmic_scripts


I am! I mean to say the languages usually represented in Brahmic scripts (with most Brahmic scripts being phonetic, one can write multiple languages in the same script in a sensible manner).


He's been unemployed for a couple of days and is already producing code that looks like it's meant to teach people. Very happy to see that! The world will be a better place with more Karpathy educational content.


Relatedly, Karpathy had an insightful post a while back on the perils of tokenization: https://twitter.com/karpathy/status/1657949234535211009

(I cited it in my Excel tutorial on BPE https://www.youtube.com/watch?v=PvZN3-WqAOI&t=1916s )


Also very neat is the Unigram algorithm. IIRC that paper also introduces subword regularization, which I think is really cool.

https://arxiv.org/abs/1804.10959


Wouldn't this introduce a hidden bias towards word-like content, i.e. make the output seem more coherent than it actually is, masking LLM output quality?


It also destroys a model's ability to follow phonetic or syntactic constraints: https://paperswithcode.com/paper/most-language-models-can-be...

I've been waiting for real innovation to come in tokenization algorithms. I think a truly well-trained character-level model would be awesome. Far easier to guide, and it might even know how to "count" its own output.


It's not really hidden; that's an intentional feature. All our LMs have been using this (or something similar) since GPT-1.


The world is already benefiting from Karpathy becoming unemployed.

Now if only other OpenAI researchers would follow in his footsteps.


Interesting parallels here with how the Japanese written language works


In what way?


Well, that's also a tokenizing system: the language can be expressed using 'sound' characters (hiragana, which work similarly to Korean in that each character expresses a syllable), and there's a deeper 'layer' of tokenization in the form of kanji, roughly one more complex character per noun.

I'm probably not expressing this well, but the concepts do seem similar to me.



