
A slightly more accurate summary of Tom Christiansen's excellent answer there would be: "Oh yes you can use regexes to parse HTML, but you usually shouldn't, unless what you want to do is really, really simple."

Actual quotations: "Even if my program is taken as illustrative of why you should not use regexes for parsing general HTML -- which is ok, because I kinda meant for it to be that"; "That was kinda my point, actually. I wanted to show how hard it is." (the latter in response to someone else who said "You can write a novel, like tchrist did, or you can use a DOM library and write one line of XPath").



That said, his HTML chunker is, dare I say, gorgeous.

If that is an example of what should not be done, I wish there were more examples like it around.

Besides, lexing HTML in 234 lines grand total, most of them being whitespace, (169 SLOCs according to sloccount) is impressive. Even a basic parser written without regexes is bound to take up a fair amount of space.

To me, the real conclusion is not "don't try to parse random HTML using regexes" but "don't try to write your own general-purpose HTML parser".

Or, as Tom put it in his SO answer:

> The correct and honest answer is that they shouldn’t attempt [to parse arbitrary HTML] because it is too much of a bother to figure out from scratch


"Besides, lexing HTML in 234 lines grand total, most of them being whitespace, (169 SLOCs according to sloccount) is impressive."

I mean no disrespect at all to tchrist, but it isn't impressive at all; not because tchrist is wrong, but because lexing isn't hard. If you understand the problem, you can almost literally read the lexer right off the standard; indeed, that's part of the purpose of the standard. Look at it (taking HTML4 here as it's easier to see): http://www.w3.org/TR/html401/types.html#h-6.2 You can literally read off the lexer expression for ID and NAME right from 'must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".").'
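To make that concrete, here is a minimal sketch of the ID/NAME rule quoted above, written as a regex. Python is my choice here, and the names `ID_NAME` and `is_id_or_name` are made up for illustration; nothing in the standard or in tchrist's program uses them.

```python
import re

# HTML 4.01, section 6.2: ID and NAME tokens must begin with a letter
# ([A-Za-z]) and may be followed by any number of letters, digits,
# hyphens, underscores, colons, and periods.
# ID_NAME is a hypothetical name chosen for this sketch.
ID_NAME = re.compile(r"[A-Za-z][A-Za-z0-9\-_:.]*")


def is_id_or_name(token: str) -> bool:
    """Return True if the whole string is a valid HTML4 ID/NAME token."""
    return ID_NAME.fullmatch(token) is not None
```

The character class is the sentence from the standard transcribed almost word for word, which is the point: `is_id_or_name("main-nav")` is accepted, while `is_id_or_name("9lives")` is rejected because it starts with a digit.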

Generally, if you're having a hard time putting a lexer together for some language you're creating (bearing in mind this includes the broader definition of "language" beyond just "programming language", which includes things like JSON or text formats you may create ad hoc), that's a sign that you've got an overcomplicated language on your hands. (Hi, C++! I see you over there!)



