
I actually find myself needing something which is close to the front-end of a compiler:

I get a file in a C-like language (say it's C, for the sake of discussion and to make life easy), and I want to figure out the names of the top-level functions defined in it. I am willing to assume that no "gotcha" macros are used, i.e. nothing that redefines keywords or types or otherwise messes up the syntax. The caveat is that I don't want to process any included files, even though this file has some include directives; skipping them means some types will be left undefined, etc.

What would I do in such a case? Should I take the "not a compiler" approach and start matching regexes?



If the syntax for function definitions is relatively fixed, I'd say use regular expressions. Others have mentioned recursive descent parsers, and you'd be implementing the "base case" portion of one of those.

Not including "includes" is easy. Just don't go looking for them.
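To make the regex route concrete, here is a minimal sketch in Python. It is not a full C front-end: it assumes the "no gotcha macros" condition from the question, blanks out comments and literals, drops preprocessor lines (so includes are never looked at), and then treats any `name(...)` that sits directly before a `{` at brace depth 0 as a function definition. The names `top_level_function_names`, `SKIP` and `FUNC_HEAD` are made up for this example. Known gaps: parameter lists with nested parentheses (function-pointer parameters) and multi-line macros are not handled.

```python
import re

# Comments and literals are blanked out first, so braces inside them
# are never counted.
SKIP = re.compile(
    r"//[^\n]*"               # line comment
    r"|/\*.*?\*/"             # block comment
    r'|"(?:\\.|[^"\\])*"'     # string literal
    r"|'(?:\\.|[^'\\])*'",    # character literal
    re.DOTALL,
)

# An identifier followed by a parenthesised list with no nested parentheses,
# sitting at the very end of the text that precedes an opening brace.
FUNC_HEAD = re.compile(r"\b([A-Za-z_]\w*)\s*\([^;{}()]*\)\s*$")

# Keywords that can look like "name(...) {" but are not functions.
KEYWORDS = {"if", "for", "while", "switch"}


def top_level_function_names(source):
    """Return names of functions defined at brace depth 0, in file order."""
    text = SKIP.sub(" ", source)
    # Drop preprocessor lines; includes are simply never followed.
    text = re.sub(r"^[ \t]*#[^\n]*", "", text, flags=re.MULTILINE)

    names = []
    depth = 0
    chunk_start = 0  # start of the current top-level declaration
    for i, ch in enumerate(text):
        if ch == "{":
            if depth == 0:
                m = FUNC_HEAD.search(text[chunk_start:i])
                if m and m.group(1) not in KEYWORDS:
                    names.append(m.group(1))
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                chunk_start = i + 1
        elif ch == ";" and depth == 0:
            chunk_start = i + 1
    return names


SAMPLE = '''
#include <stdio.h>
static int helper(int x) { return x + 1; }
struct point { int x; int y; };
/* int fake(void) { } */
int main(int argc, char **argv)
{
    if (argc > 1) { puts("{"); }
    return helper(argc);
}
void proto(int a);
'''

print(top_level_function_names(SAMPLE))  # ['helper', 'main']
```

Tracking brace depth is what keeps this to top-level definitions only: the `if (...) {` inside `main` is never tested against the pattern, and the prototype `proto` is skipped because it ends in `;` rather than `{`.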

---

Edit: the fun part will be trying to implement doc-strings. :D

You wind up with an RDP (recursive descent parser) that has a grammar like

    fn_def := documented
        | un_documented

    documented := doc_string def

    un_documented := def

    doc_string := <your docstring format regex here>

    def := <your function definition regex here>

Note, for the last two, it will be easier to break those up into pieces. If I were writing an RDP for C# method declarations, I'd have something like

    def := ACCESS_SCOPE MUTABILITY RETURN_TYPE FN_NAME L_PAREN PARAMS R_PAREN

    ACCESS_SCOPE := "private" <-- these would be grammar "terminals", i.e. the literal strings we are looking for
        | "public"
        | <etc>

    MUTABILITY := "static"
        | <others?>

    RETURN_TYPE := "bool" <-- note that this should actually be much more complex
        | "int"               because you can (almost) arbitrarily specify types
        | <etc>               based on C#'s internal types

    FN_NAME := <some regular expression for allowed function names in C#>

    L_PAREN := "("

    R_PAREN := ")"

Something like that. You'll need to test as you implement, though.

Each of the above grammar definitions should correspond 1 to 1 with a function that you implement. If you need/want help with implementation, my email is in my profile. I'd be more than happy to walk you through this (I've done it before :D for this specific use case, too, when I was trying to auto-document a language that doesn't have any development/documentation tools).
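As a rough illustration of the "one function per grammar rule" idea, here is a small Python sketch of the grammar above. Everything in it is an assumption made for the example: the tiny keyword lists, treating MUTABILITY as optional, using `///` lines as the doc-string format, and all of the function names. Each rule function takes the text and a position and returns `(value, new_position)`, or `None` if the rule does not match there.

```python
import re

ACCESS = ("private", "public", "protected", "internal")
RETURN_TYPES = ("bool", "int", "void", "string")  # deliberately tiny
WS = re.compile(r"\s*")
IDENT = re.compile(r"[A-Za-z_]\w*")
DOC = re.compile(r"(?:\s*///[^\n]*)+")  # assumed doc-string format
PARAMS = re.compile(r"[^()]*")          # no nested parentheses


def skip_ws(text, pos):
    return WS.match(text, pos).end()


def keyword(text, pos, words):
    """Match one whole identifier that must be one of `words`."""
    m = IDENT.match(text, skip_ws(text, pos))
    return (m.group(), m.end()) if m and m.group() in words else None


def literal(text, pos, lit):
    pos = skip_ws(text, pos)
    return (lit, pos + len(lit)) if text.startswith(lit, pos) else None


# Terminals: one function per rule.
def access_scope(text, pos): return keyword(text, pos, ACCESS)
def return_type(text, pos): return keyword(text, pos, RETURN_TYPES)
def l_paren(text, pos): return literal(text, pos, "(")
def r_paren(text, pos): return literal(text, pos, ")")


def mutability(text, pos):
    # Treated as optional here: no keyword means (None, unchanged position).
    return keyword(text, pos, ("static",)) or (None, pos)


def fn_name(text, pos):
    m = IDENT.match(text, skip_ws(text, pos))
    return (m.group(), m.end()) if m else None


def params(text, pos):
    m = PARAMS.match(text, pos)
    return m.group().strip(), m.end()


def definition(text, pos):
    """def := ACCESS_SCOPE MUTABILITY RETURN_TYPE FN_NAME L_PAREN PARAMS R_PAREN"""
    rules = (("access", access_scope), ("mutability", mutability),
             ("return_type", return_type), ("name", fn_name),
             (None, l_paren), ("params", params), (None, r_paren))
    found = {}
    for key, rule in rules:
        step = rule(text, pos)
        if step is None:
            return None
        value, pos = step
        if key:
            found[key] = value
    return found, pos


def doc_string(text, pos):
    m = DOC.match(text, pos)
    if not m:
        return None
    lines = [ln.strip().lstrip("/").strip() for ln in m.group().splitlines()]
    return " ".join(ln for ln in lines if ln), m.end()


def fn_def(text, pos=0):
    """fn_def := documented | un_documented"""
    doc = doc_string(text, pos)
    if doc:  # documented := doc_string def
        d = definition(text, doc[1])
        if d:
            return {"doc": doc[0], **d[0]}, d[1]
    d = definition(text, pos)  # un_documented := def
    return ({"doc": None, **d[0]}, d[1]) if d else None
```

For example, `fn_def("/// Adds one.\npublic static int AddOne(int x)")` yields a dict with `name` set to `"AddOne"` and `doc` set to `"Adds one."`. Returning the new position from every rule is what lets `definition` chain the rules in sequence and bail out as soon as one fails.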



