
You could, but you wouldn't when those keywords can all change in equivalent contexts.




WordPiece tokenization greedily takes the longest vocabulary token that matches a prefix of the remaining text, and BPE's learned merges usually end up with a similar result. So if your text starts with “public static void main”, it will try to find the longest token that matches that prefix. Even if “public” is a token, it will prefer to tokenize “public static” together, provided that longer string is itself in the vocabulary.
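
To make the greedy matching concrete, here is a minimal sketch in Python. It is not any real tokenizer's implementation: the function name `greedy_tokenize`, the single-character fallback, and the toy vocabulary are all made up for illustration, and real tokenizers typically pre-split on whitespace, which this sketch skips.

```python
def greedy_tokenize(text, vocab):
    """Repeatedly take the longest vocabulary entry that prefixes the remaining text."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first and shrink until one is in the vocabulary.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                tokens.append(piece)
                pos = end
                break
        else:
            # Assumption: with no match, fall back to emitting a single character.
            tokens.append(text[pos])
            pos += 1
    return tokens


# Hypothetical toy vocabulary, not taken from any real model.
TOY_VOCAB = {"pub", "public", "public static", " static", " void", " main"}

text = "public static void main"
with_merged = greedy_tokenize(text, TOY_VOCAB)
without_merged = greedy_tokenize(text, TOY_VOCAB - {"public static"})
```

With the merged entry present, the sketch picks `public static` as one token; with it removed, it falls back to `public` followed by ` static`. Whether a real vocabulary contains such a merged entry is a separate question from how the matching behaves once it does.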

Yes, but then you have both alternatives as tokens, which nullifies GP's argument.

What do you mean?

`public` might have a token by itself, even though you can have `pub` occurring in other contexts, too.


I meant that it wouldn't be efficient to agglomerate tokens in that way, and that's why the system won't do it.


