Hacker News
Code for the Byte Pair Encoding algorithm, commonly used in LLM tokenization (github.com/karpathy)
81 points by magoghm on Feb 18, 2024 | 31 comments


Hi all, it's still too early to look at this code :) I wanted to put up my alpha version to start the feedback going a bit. I'm recording a video alongside where we build minBPE and expect that to be a lot more useful, coming out in a few days.

I will say tokenization is UGLY. Really ugly. There are a ton of little, subtle, sharp edges and gotchas. Eternal glory awaits anyone who figures out a way to delete this stage.


Really wish that more folks had read Gwern's authoritative piece on why BPE causes serious issues for LLMs, particularly for creativity and also for things like math: https://gwern.net/gpt-3#bpes


Karpathy himself has raised some of these issues with BPE https://twitter.com/karpathy/status/1657949234535211009

There is research on getting rid of tokenization, some of which is linked in that thread.


The Gwern article significantly predates the Twitter thread.


That is fair. I didn't mean to imply it was earlier, just that he had mentioned it and research continued in this area. The article you linked is definitely more comprehensive.


Do you have a link to the original Gage paper from 1994?



Hehe.

https://github.com/karpathy/minbpe/blob/master/minbpe/base.p...

This code will incorrectly count the pairs in [1,1,1] as two occurrences of (1,1), even though there is only one if you consider what the merge function does. ;)

This can be seen in real-world scenarios, for example when there are many consecutive spaces.

My attempt to implement BPE has the same flaw, and the fix appears to be fairly complex. But, looking at the stats, it just makes such pairs show up at the top a couple of merges earlier, so it seems to be a benign flaw. Still, it's interesting that this is not mentioned in the code.
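
A quick sketch of the mismatch, using my own minimal stand-ins for the repo's pair-counting and merge helpers (not code from the repo):

  def count_pairs(ids):
      counts = {}
      for pair in zip(ids, ids[1:]):  # every overlapping adjacent pair
          counts[pair] = counts.get(pair, 0) + 1
      return counts

  def merge(ids, pair, idx):
      # replace non-overlapping, left-to-right occurrences of `pair` with `idx`
      out, i = [], 0
      while i < len(ids):
          if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
              out.append(idx)
              i += 2
          else:
              out.append(ids[i])
              i += 1
      return out

  print(count_pairs([1, 1, 1]))       # {(1, 1): 2} -- counted twice
  print(merge([1, 1, 1], (1, 1), 2))  # [2, 1]      -- but only one merge happens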


  def get_corrected_stats(ids):
      corrected_counts = {}
      i = 0

      while i < len(ids) - 1:
          current_pair = (ids[i], ids[i+1])
          # Count the pair if it is new, or if the current element is not a
          # repeat of the previous one (or it's the first element)
          if current_pair not in corrected_counts:
              corrected_counts[current_pair] = 1
          else:
              if i == 0 or ids[i] != ids[i-1]:
                  corrected_counts[current_pair] += 1

          # If the current and next element are the same, skip to the next
          # unique element
          if i < len(ids) - 2 and ids[i] == ids[i+1]:
              while i < len(ids) - 1 and ids[i] == ids[i+1]:
                  i += 1
          else:
              i += 1

      return corrected_counts

I have a basic grasp of this topic. Can you explain why the proposed solution is (or isn’t) effective in this scenario?


I do not understand why your solution skips to the next unique element.

Consider the sequence [1,1,1,1,1,1]. We should count 3 (three) (1,1) pairs there. Karpathy's (and my) algorithm would count 5 pairs; yours will count 1.

One should just skip one or two elements while counting pairs: one element when the pair's elements are different, two when they are the same (see the sketch below). This may break vectorization, though.

I should also note that the error is there, but no one hits it. My experiments show that (A,A) pairs get counted as most frequent ahead of their natural turn, about half a dozen symbols earlier. I also never saw the symbol B, assigned to the pair (A,A), used in any most-frequent pair before the actual turn for (A,A). One would need to construct a special string for that to happen.
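
Roughly, that skipping rule as a minimal sketch (not my actual code):

  def nonoverlapping_stats(ids):
      counts = {}
      i = 0
      while i < len(ids) - 1:
          pair = (ids[i], ids[i + 1])
          counts[pair] = counts.get(pair, 0) + 1
          # advance by two when the pair is a repeat, so its middle element
          # is not reused; otherwise advance by one
          i += 2 if ids[i] == ids[i + 1] else 1
      return counts

  print(nonoverlapping_stats([1, 1, 1]))           # {(1, 1): 1}
  print(nonoverlapping_stats([1, 1, 1, 1, 1, 1]))  # {(1, 1): 3}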


It sounds like that sequence is being k-merized. In the case of byte-pair that should be a look-ahead of 1 and then a shift up of 1. Maybe I am missing something.


Could something like that be why generated images tend to put people in a t-shirt that reads 'fooobar' or similar if you ask for, say, 'man in t-shirt fiddling with foobars'? Or is that expecting far too tractable and deterministic a link, considering that understanding how the model actually works (I can't recall the name for it) is a research area in its own right?


Nice catch. Seeing as this is implemented by pretty much every LLM out there, it's probably not such a big deal?


More or less so. The resulting vocabulary will be structurally the same even if you correct for that; only the indices will differ slightly.

When I spotted this in my code, I investigated whether it affects my encoding. It turned out that it does not.


And, for Indic languages, there is orthographic syllable pair encoding, which reduces the amount of space taken by Indic characters by encoding the most common letter pairs as a single byte. This was presented, I think last year, at the URTC at MIT.

(Disclosure: it was published by my friend.)


Isn't that a specialization of BPE for a specific distribution? BPE is essentially a data compression scheme that constructs a straight-line grammar (SLG) that can be used directly as a model encoding, and it is hardly surprising that some other SLG would work well for some other inputs.
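
For a concrete picture of the grammar view, here is the small textbook example often used to introduce BPE (a sketch, not from this thread):

  # BPE viewed as a straight-line grammar on the classic example string.
  text = "aaabdaaabac"
  rules = [("Z", "aa"), ("Y", "ab"), ("X", "ZY")]  # merge rules, in order

  compressed = text
  for symbol, pair in rules:
      compressed = compressed.replace(pair, symbol)
  print(compressed)  # XdXac

  # Expanding the rules in reverse order regenerates the input, i.e. the
  # rules plus the start string "XdXac" form a grammar for the text.
  expanded = compressed
  for symbol, pair in reversed(rules):
      expanded = expanded.replace(symbol, pair)
  print(expanded == text)  # True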


It is less byte-pair at that point (as the Indic languages generally take up 2-3 bytes per codepoint).


Yeah, of course, but I thought BPE should pick it up anyway if your training data is primarily written in Indic scripts.


> for Indic languages

Amateur here, but "Indic languages" are quite different from one another. Maybe you mean Hindi and that group?


Presumably, they’re referring to the Indic scripts (also known as Brahmic scripts): https://en.wikipedia.org/wiki/Brahmic_scripts


I am! I mean to say the languages usually represented in Brahmic scripts (with most Brahmic scripts being phonetic, one can write multiple languages in the same script in a sensible manner).


He's been unemployed for a couple of days and is already producing code that looks like it's meant to teach people. Very happy to see that! The world will be a better place with more Karpathy educational content.


Relatedly, Karpathy had an insightful post a while back on the perils of tokenization: https://twitter.com/karpathy/status/1657949234535211009

(I cited it in my Excel tutorial on BPE https://www.youtube.com/watch?v=PvZN3-WqAOI&t=1916s )


Also very neat is the Unigram algorithm. IIRC that paper also introduces subword regularization, which I think is really cool.

https://arxiv.org/abs/1804.10959


Wouldn't this introduce a hidden bias towards word-like content, i.e. make the output seem more coherent than it actually is, masking LLM output quality?


It also destroys a model's ability to follow phonetic or syntactic constraints: https://paperswithcode.com/paper/most-language-models-can-be...

I've been waiting for real innovation to come in tokenization algorithms. I think a truly well-trained character-level model would be awesome. Far easier to guide, and it might even know how to "count" its own output.


It's not really hidden; that's an intentional feature. All our LMs have been using this (or something similar) since GPT-1.


The world is already benefiting from Karpathy becoming unemployed.

Now if only other OpenAI researchers would follow in his footsteps.


Interesting parallels here with how the Japanese written language works


In what way?


Well, that's also a tokenizing system: the language can be expressed using 'sound' characters (hiragana, which work similarly to Korean in that each character expresses a syllable), and there's a deeper 'layer' of tokenization in the form of kanji, roughly one more complex character per noun.

I'm probably not expressing this well, but the concepts do seem similar to me.



