Hi all, it's still too early to look at this code :) I wanted to put up my alpha version to start the feedback going a bit. I'm recording a video alongside where we build minBPE and expect that to be a lot more useful, coming out in a few days.
I will say tokenization is UGLY. Really ugly. There are a ton of little, subtle, sharp edges and gotchas. Eternal glory awaits anyone who figures out a way to delete this stage.
Really wish that more folks had read the last authoritative piece on why BPE causes serious issues for LLMs, particularly in creativity and also for things like math: https://gwern.net/gpt-3#bpes
That is fair. I didn't mean to imply it was earlier, just that he had mentioned it and research continued in this area. The article you linked is definitely more comprehensive.
This code will incorrectly count the pairs in [1,1,1] as two (1,1) pairs, even though there is only one if you consider what the merge function does. ;)
This can be seen in real-world scenarios, for example when there are many spaces.
My attempt to implement BPE has the same flaw, and the fix appears to be quite complex. But, looking at the stats, it just makes such pairs show up at the top a couple of rounds earlier than they should, so it seems to be a benign flaw. Still, it's interesting that this is not mentioned in the code.
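For reference, this is roughly the shape of the counting and merge steps in question (paraphrased from memory, so treat it as a sketch rather than the exact repo code). The overlapping count sees two (1,1) pairs in [1,1,1], while the merge step consumes both elements and performs only one merge:

    def get_stats(ids):
        counts = {}
        for pair in zip(ids, ids[1:]):   # overlapping sliding window
            counts[pair] = counts.get(pair, 0) + 1
        return counts

    def merge(ids, pair, idx):
        newids, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i+1]) == pair:
                newids.append(idx)       # both elements are consumed here
                i += 2
            else:
                newids.append(ids[i])
                i += 1
        return newids

    print(get_stats([1, 1, 1]))            # {(1, 1): 2} -- counted twice
    print(merge([1, 1, 1], (1, 1), 256))   # [256, 1]    -- but only one merge happens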
    def get_corrected_stats(ids):
        corrected_counts = {}
        i = 0
        while i < len(ids) - 1:
            current_pair = (ids[i], ids[i+1])
            # If the pair is new, or if the current element is not a repeat of the previous one (or it's the first element)
            if current_pair not in corrected_counts:
                corrected_counts[current_pair] = 1
            else:
                if i == 0 or ids[i] != ids[i-1]:
                    corrected_counts[current_pair] += 1
            # If the current and next element are the same, skip to the next unique element
            if i < len(ids) - 2 and ids[i] == ids[i+1]:
                while i < len(ids) - 1 and ids[i] == ids[i+1]:
                    i += 1
            else:
                i += 1
        return corrected_counts
I have a basic grasp of this topic. Can you explain why the proposed solution is (or isn’t) effective in this scenario?
I do not understand why your solution skips to the next unique element.
Consider the sequence [1,1,1,1,1,1]. We should count three (1,1) pairs there. Karpathy's (and my) algorithm would count 5 pairs; yours counts 1.
One should just advance by one or two elements while counting pairs: one element when the pair's elements are different, two when they are the same. This may break vectorization, though.
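A minimal sketch of that skip rule (my own code, not what is in the repo):

    def get_stats_nonoverlapping(ids):
        counts = {}
        i = 0
        while i < len(ids) - 1:
            pair = (ids[i], ids[i+1])
            counts[pair] = counts.get(pair, 0) + 1
            # a merge of (A, A) consumes both elements, so the next countable
            # pair starts two positions later; otherwise advance by one
            i += 2 if ids[i] == ids[i+1] else 1
        return counts

    print(get_stats_nonoverlapping([1, 1, 1]))           # {(1, 1): 1}
    print(get_stats_nonoverlapping([1, 1, 1, 1, 1, 1]))  # {(1, 1): 3}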
I should also note that the error is there, but no one really hits it. My experiments show that (A,A) pairs get counted as most frequent ahead of their natural turn, by about half a dozen symbols. I also never saw the symbol B assigned to the pair (A,A) used in any most-frequent pair before (A,A)'s actual turn; one would need to construct a special string for that to happen.
It sounds like that sequence is being k-merized. In the case of byte-pair that should be a look-ahead of 1 and then a shift up of 1. Maybe I am missing something.
Could something like that be why generated images tend to put people in a t-shirt that reads 'fooobar' or similar if you ask for, say, 'man in t-shirt fiddling with foobars'? Or is that expecting far too tractable and deterministic a link, considering that understanding how the model actually works (I can't recall the name for it?) is a research area in its own right?
And, for Indic languages, there is orthographic syllable pair encoding, which reduces the amount of space taken by Indic characters by encoding the most common letter pairs as a single byte.
This was presented, I think last year, at the URTC at MIT.
Isn't that a specialization of BPE for a specific distribution? BPE is essentially a data compression scheme that constructs a straight-line grammar (SLG) that can be directly used as a model encoding, and it is hardly surprising that some other SLG would work well for some other inputs.
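To make the SLG point concrete, a toy sketch (my own illustration, not anything from that presentation or from minbpe): each BPE merge is a production new_symbol -> (left, right), and the merge table plus the final sequence is exactly a straight-line grammar for the input.

    def train_bpe(ids, num_merges):
        rules = {}                                    # new symbol id -> (left, right): the SLG productions
        next_id = 256
        for _ in range(num_merges):
            counts = {}
            for pair in zip(ids, ids[1:]):            # naive overlapping count, fine for illustration
                counts[pair] = counts.get(pair, 0) + 1
            if not counts:
                break
            top = max(counts, key=counts.get)         # most frequent adjacent pair
            rules[next_id] = top
            out, i = [], 0
            while i < len(ids):                       # replace occurrences of the top pair
                if i + 1 < len(ids) and (ids[i], ids[i + 1]) == top:
                    out.append(next_id)
                    i += 2
                else:
                    out.append(ids[i])
                    i += 1
            ids, next_id = out, next_id + 1
        return ids, rules

    ids, rules = train_bpe(list(b"abababcd"), 2)
    print(rules)   # {256: (97, 98), 257: (256, 256)} -- grammar productions
    print(ids)     # [257, 256, 99, 100]              -- compressed sequence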
I am!
I mean the languages usually represented in Brahmic scripts (since most Brahmic scripts are phonetic, one can write multiple languages in the same script in a sensible manner).
He’s a couple days unemployed and is already producing code that looks like it’s meant to teach people. Very happy to see that! The world will be a better place with more Karpathy educational content.
Wouldn't this introduce a hidden bias towards word-like content, i.e. making the output seem more coherent than it actually is, masking LLM output quality?
I've been waiting for real innovation to come in tokenization algorithms. I think a truly well-trained character-level model would be awesome. Far easier to guide, and it might even know how to "count" its own output.
Well, that's also a tokenizing system; the language can be expressed using 'sound' characters (hiragana, which work similarly to Korean in that each character expresses a syllable), and there's a deeper 'layer' of tokenization in the form of kanji: one more complex character per noun (roughly).
Probably not expressing this well, but the concepts do seem similar to me.