Hacker News
Hand-written lexer in Javascript compared to the regex-based ones (thegreenplace.net)
81 points by joeyespo on July 16, 2013 | hide | past | favorite | 22 comments


I'm not surprised the hand-written version is faster. You need to store an object for each regex. If these objects are DFAs, for example, then there is a considerable amount of overhead storing the graph to represent the DFA. The hand-written version has comparatively little state, and no generalized logic which would further slow it down.

Error-reporting is another good reason for hand-written lexers (and parsers). It's harder to get readable and accurate error messages when the logic is embedded in the regex or some library/framework.
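
As a concrete illustration (a minimal sketch, not from the article - the function name and token shape are made up), a hand-written lexer can carry line/column counters almost for free and use them directly in its error messages:

```javascript
// Hypothetical hand-written lexer for identifiers only, tracking
// position as it scans so errors can point at the exact spot.
function lexIdentifiers(input) {
  const tokens = [];
  let pos = 0, line = 1, col = 1;
  while (pos < input.length) {
    const c = input[pos];
    if (c === '\n') { line++; col = 1; pos++; continue; }
    if (/\s/.test(c)) { pos++; col++; continue; }
    if (/[A-Za-z_]/.test(c)) {
      const start = pos;
      while (pos < input.length && /[A-Za-z0-9_]/.test(input[pos])) pos++;
      tokens.push({ type: 'ID', value: input.slice(start, pos), line, col });
      col += pos - start;
    } else {
      // The whole point: the error message costs nothing extra to produce.
      throw new Error(`Unexpected character '${c}' at line ${line}, column ${col}`);
    }
  }
  return tokens;
}
```

Getting an equally precise message out of a bag of regexes means mapping match offsets back to lines yourself, which is exactly the kind of plumbing that tends to get skipped.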

EDIT: The link posted in the comments of the article is great. From Rob Pike:

A regular expression library is a big thing. Using one to parse identifiers is like using a Ferrari to go to the store for milk.

http://commandcenter.blogspot.com/2011/08/regular-expression...


The reason you should be surprised is this: regular expressions are implemented in C. In some implementations they are even JIT'ed.

It's a testament to how good V8 is that a char-by-char loop in JavaScript is faster than optimized C code, but if you were to benchmark pre-V8 JavaScript interpreters or current implementations of Python or Ruby, regex matching would smoke a char-by-char loop written in the language, simply by virtue of being C code.


No. This isn't too surprising to anybody who knows how regular expressions work under the hood. The hand-rolled parser is asymptotically faster than using regexes. It wouldn't surprise me at all if the hand-rolled parser were still faster for a complicated enough language and a large enough input file.

It's analogous to a hand-optimized assembly-language implementation of bubble sort being slower than quicksort in a slow, high-level language. At some point the bubble sort will lose, simply because it's asymptotically worse.


Well hold on; that "asymptotically faster" part needs a bit more detail so people don't get wrong ideas. In a DFA-based regexp implementation (which, admittedly, V8 doesn't use) matching is guaranteed O(n), where n is the number of characters you look at before finding a match or deciding that the match has failed. There are DFA-based lexer generators, like GNU flex, which take regular expressions for your tokens, turn them into a big DFA, and spit out a linear-time lexer. This is the same asymptotic time as the hand-rolled parser.

The regexp-based version linked to in the article, on the other hand, has a list of regular expressions which are sequentially tried against the input to grab each token. For this particular lexer, a failed match will fail on the first character, i.e. in constant time. So this is asymptotically O(n*k), where n is the input length and k is the (small, constant) number of token regexes.
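
The "list of regexes" scheme looks roughly like this (an illustrative sketch with a made-up rule set, not the article's actual code). Each rule is anchored with `^` and tried in order at the current position; a failed attempt fails on the first character, which is why the whole scan stays around O(n*k):

```javascript
// Hypothetical token rules, tried in order against the remaining input.
const rules = [
  { type: 'NUMBER', re: /^[0-9]+/ },
  { type: 'ID',     re: /^[A-Za-z_][A-Za-z0-9_]*/ },
  { type: 'OP',     re: /^[+\-*\/=]/ },
];

function lexWithRegexList(input) {
  const tokens = [];
  let rest = input;
  while ((rest = rest.replace(/^\s+/, '')).length > 0) {
    let matched = false;
    for (const { type, re } of rules) {
      const m = re.exec(rest);
      if (m) {
        tokens.push({ type, value: m[0] });
        rest = rest.slice(m[0].length);
        matched = true;
        break;
      }
    }
    if (!matched) throw new Error('Unexpected input: ' + rest[0]);
  }
  return tokens;
}
```

(Slicing `rest` on every token is itself wasteful; real implementations keep an index into the string, or use sticky regexes, instead of reallocating.)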

Finally, most regular expression engines are not DFA-based; they use a backtracking algorithm which is, theoretically, exponential-time. However, with the particular regular expressions used here, successful matching will be O(n) for each one, and failed matching O(1).
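
The classic pathological case for backtracking engines (V8's included), for anyone who wants to try it:

```javascript
// The nested quantifier forces a backtracking engine to retry every way
// of splitting the run of a's before concluding there is no match.
// A DFA-based engine (RE2-style) rejects the same input in linear time.
const pathological = /^(a+)+$/;
const evil = 'a'.repeat(22) + 'b'; // the trailing 'b' guarantees failure

console.log(pathological.test('aaaa')); // true, instantly
console.log(pathological.test(evil));   // false, after exponentially many retries
```

Bump the repeat count a little and the second call goes from milliseconds to effectively forever.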


> I'm not surprised the hand-written version is faster. You need to store an object for each regex. If these objects are DFAs, for example, then there is a considerable amount of overhead storing the graph to represent the DFA.

I'm not sure what this means specifically.

Presumably, the DFA is created once and then only some sort of small state has to be maintained (i.e. which state we're in, perhaps a little more). It's not like a bunch of heavyweight regex objects have to be created for each input character.


I meant, you have a number of regular expressions, and each gets compiled (e.g. by calling re.compile in Python), so you have one DFA per regex.

At least in Python, there are also transient objects created whenever you match something with the regex (e.g. calling regex.match).

>It's not like a bunch of heavyweight regex objects have to be created for each input character.

I'm not familiar with the Python or JavaScript implementations. Abstractly, if the DFA is just a graph (nodes and edges), then you would need to store something for every node and every edge. This could be an object per character (depending on the implementation and the regex, I suppose?). EDIT: I'm confusing myself. Each node represents a state. The characters would be on the edges. You would need to store a set of characters for every state.

I just had this fuzzy notion that each regex requires a bit more state and a bit more logic than a handwritten version, so the hand-written version could very well be faster.
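
For what it's worth, the states-and-edges picture sketched above can be made concrete with a toy table-driven DFA (purely illustrative - real engines use much denser encodings, but the per-state, per-character-class storage is real):

```javascript
// Toy DFA for identifiers: each state maps a character class to the
// next state (null = dead state / no transition).
const ID_DFA = {
  start: { letter: 'inId', digit: null,   other: null },
  inId:  { letter: 'inId', digit: 'inId', other: null },
};
const accepting = new Set(['inId']);

function classify(c) {
  if (/[A-Za-z_]/.test(c)) return 'letter';
  if (/[0-9]/.test(c)) return 'digit';
  return 'other';
}

function matchesId(s) {
  let state = 'start';
  for (const c of s) {
    state = ID_DFA[state][classify(c)];
    if (state === null) return false;
  }
  return accepting.has(state);
}
```

Matching is a single table lookup per character - which is also why a hand-written loop doing essentially the same lookups inline can compete with it.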


Aside from technical difficulty, there's really nothing stopping the V8 guys from spitting out native code to match their regular expressions. And, in fact, they seem to do this:

http://blog.chromium.org/2009/02/irregexp-google-chromes-new...

With a sufficiently smart regular expression engine and JIT, they should be able to do about as well as an equivalent hand-rolled thing.


Yep, compiling the regex into some sort of "VM" is not a new technique, but eventually the algorithm this bytecode implements is what matters most. The article you linked to suggests that even after the rewrite, V8's regexes are backtracking rather than DFA (the re2 approach), which explains the results, in a way.

This is really interesting. The most common point raised in the backtracking vs. DFA debates is pathological inputs that induce exponential behavior. But large alternations seem to be affected quite a bit as well, and are more common. Admittedly, there's no exponential behavior there - just an unfortunate constant factor.


An oddity in this lexer is that it allows numbers followed by an identifier with no separation: "99foo" turns into two tokens, number "99" and identifier "foo".


This is normal for lexers in whitespace-insensitive languages. In spirit this is similar to allowing "foo+" (plus after "foo") or any other token concatenation.

If this is an error in the language, the parser easily catches it.

Some languages AFAIR allow "units" to be attached to numbers so some use cases are actually valid. In any case, it's a higher-level decision the parser makes.


C, for example, allows trailing letters to specify the type of a numeric literal:

    42 // int
    42U // unsigned int
    42LL // long long
    42ULL // unsigned long long
    42.0 // double
    42.0f // float
Edit: except that whitespace isn't allowed here. "42 LL" is not legal. So that's kind of the opposite case!


In the case of C it's a bit special, because that suffix is usually handled by the lexer - i.e. there are no production rules for it in the C grammar. This is also why whitespace is not allowed - with a space in between, it's no longer a single token!


The token -8 is a bit special as well.


Not sure what you mean by this.

In the C standard, numeric tokens include signs, floating point dots, type specifiers (U, L, etc.), and much more. For the gory details see: https://github.com/eliben/pycparser/blob/master/pycparser/c_...

But in the end, it's all in the same token.


I mean tokenizing things like "8-8" vs "8+-8" vs "8---8" presents some special challenges for the lexer.
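
Concretely, the usual resolution (a sketch, assuming a conventional maximal-munch lexer that never folds signs into numbers - not any particular implementation from the article):

```javascript
// '-' is always its own token; "8---8" becomes five tokens, and it is
// the parser's job to decide which minuses are unary.
function lexArith(input) {
  const tokens = [];
  let i = 0;
  while (i < input.length) {
    const c = input[i];
    if (/[0-9]/.test(c)) {
      const s = i;
      while (i < input.length && /[0-9]/.test(input[i])) i++;
      tokens.push({ type: 'NUMBER', value: input.slice(s, i) });
    } else if (c === '+' || c === '-') {
      tokens.push({ type: 'OP', value: c });
      i++;
    } else {
      i++; // skip whitespace and anything else in this sketch
    }
  }
  return tokens;
}
```

So "8---8" lexes fine either way; whether it *parses* (as `8 - (-(-8))`, say) is a separate, grammar-level question.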


That's fairly common in languages that aren't particularly whitespace sensitive.


Agreed. The lexer just recognizes tokens; it does (usually/ideally) not know about the grammar. The parser then checks whether the "token stream" actually complies with the grammar.

If the grammar allowed, for example, "2x" as a shortcut for "2 * x" (say, when parsing mathematical expressions), it would make sense.

Nitpick: Depending on the syntax, the lexer might also yield "99f" (a float literal) and the identifier "oo".


The benchmark here is on a large document, but it is very short-running. It would be interesting to see something even larger. The regex engine might not decide to JIT for short-running code, for example - I'm not sure how that works.


Is that decision time-based? Because in terms of exercising the regexen and hand-written functions, they are surely run many thousands of times.

I've concatenated a number of Tablegen files together (the benchmark is described better here: http://eli.thegreenplace.net/2013/06/25/regex-based-lexical-...) for a total of ~20KLOC, so we're talking about tens of thousands of tokens here (around a megabyte of data). That sounds like enough to kick in the JITs, don't you think?


I honestly do not know how regex JITs are tuned, but in JS JITs, even tens of thousands of iterations may not suffice.

Just to be sure, I would concatenate some more stuff ;) so the test runs for say 10 seconds.


While it seems strange to me that a highly repetitive computation that lasts ~0.2 seconds would not kick in the JITs (who needs them, then? ;-)), I tried a simple blow-up of the input (by abusing "cat") - same results. For a 16MB input with > 2.6 million tokens, the handwritten lexer smokes the single-alternating-regex one: 1.1 sec vs. 2.9 sec.
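
For anyone who wants to poke at this themselves, a rough self-contained sketch in the same spirit (not the article's actual benchmark - the input, token rules, and counts here are made up, and timings will vary by machine and V8 version):

```javascript
// Synthetic input: repeated identifier/number soup, a couple of MB.
const input = 'foo 42 bar 7 baz 99 '.repeat(100000);

// Single alternating sticky regex, driven from JS.
const re = /[A-Za-z_][A-Za-z0-9_]*|[0-9]+|\s+/y;
function countWithRegex(s) {
  re.lastIndex = 0;
  let n = 0, m;
  while ((m = re.exec(s)) !== null) {
    if (!/^\s/.test(m[0])) n++; // count non-whitespace tokens
  }
  return n;
}

// Hand-written char-by-char scan counting the same token runs.
function countByHand(s) {
  let n = 0, i = 0;
  while (i < s.length) {
    if (/[A-Za-z0-9_]/.test(s[i])) {
      while (i < s.length && /[A-Za-z0-9_]/.test(s[i])) i++;
      n++;
    } else {
      i++;
    }
  }
  return n;
}

console.time('regex');
const a = countWithRegex(input);
console.timeEnd('regex');

console.time('hand');
const b = countByHand(input);
console.timeEnd('hand');

console.log('same token count:', a === b);
```

Scaling the `repeat` factor up makes it easy to check whether the gap grows, shrinks, or stays a constant factor on a given runtime.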


Nice, thanks for verifying that. It was a long shot, but might as well rule out everything possible.



