A slightly more accurate summary of Tom Christiansen's excellent answer there wo...

lloeki · on July 8, 2011

That said, his HTML chunker is, dare I say, gorgeous.

If that is an example of what should not be done, I wish there were more examples like it around.

Besides, lexing HTML in 234 lines grand total, most of them being whitespace, (169 SLOCs according to sloccount) is impressive. Even a basic parser that isn't regex-based is bound to take up quite a bit of space.

To me the real conclusion is not "don't try to parse random HTML using regexes" but "don't try to write your own general-purpose HTML parser".

Or, as Tom put it in his SO answer:

> The correct and honest answer is that they shouldn’t attempt [trying to parse arbitrary HTML] because it is too much of a bother to figure out from scratch

jerf · on July 8, 2011

"Besides, lexing HTML in 234 lines grand total, most of them being whitespace, (169 SLOCs according to sloccount) is impressive."

I mean no disrespect at all to tchrist, but it isn't impressive; not because tchrist is wrong, but because lexing isn't hard. If you understand the problem, you can almost literally read the lexer right off the standard; indeed, that's part of the purpose of the standard. Look at http://www.w3.org/TR/html401/types.html#h-6.2 (taking HTML4 here as it's easier to see). You can read the lexer expression for ID and NAME right off 'must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".").'
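
To make that concrete, here is a rough sketch of that one rule transcribed into a Python regex. The function names `is_id_or_name` and `lex_id_or_name` are made up for illustration, and this covers only the ID/NAME token, not a full HTML lexer:

```python
import re

# HTML 4.01, section 6.2: ID and NAME tokens must begin with a letter
# ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]),
# hyphens, underscores, colons, and periods.
ID_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_:.-]*")


def is_id_or_name(text):
    """Return True if the whole string is a valid HTML4 ID/NAME token."""
    return ID_NAME.fullmatch(text) is not None


def lex_id_or_name(text, pos=0):
    """Return (token, new_pos) for a token starting exactly at pos, else None."""
    m = ID_NAME.match(text, pos)
    if m is None:
        return None
    return m.group(0), m.end()
```

The character class is the spec sentence almost word for word: one leading letter, then any mix of the six allowed character groups. Something like `lex_id_or_name("foo bar", 4)` picks up the token starting at offset 4, and a string with a leading digit is rejected by `is_id_or_name`.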

Generally, if you're having a hard time putting a lexer together for some language you're creating (bearing in mind this includes the broader definition of "language" beyond just "programming language", which includes things like JSON or text formats you may create ad hoc), that's a sign that you've got an overcomplicated language on your hands. (Hi, C++! I see you over there!)