I actually find myself needing something which is close to the front-end of a compiler:
I get a file in a C-like language (say it's C for the sake of discussion and to make life easy), and I want to figure out the names of the top-level functions defined in this file. I am willing to assume that there are no "Gotcha" macros used, which would redefine keywords or types, or otherwise mess up the syntax. The caveat is that I don't want to include any files - even though this file has some include directives; and not including them would mean some types are not defined etc.
What should I do in such a case? Should I take the "not a compiler" approach and start matching regexes?
If the syntax for function definitions is relatively fixed, I'd say use regular expressions. Others have mentioned recursive descent parsers, and you'd be implementing the "base case" portion of one of those.
Not including "includes" is easy. Just don't go looking for them.
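A minimal sketch of the regex route for plain C, written in Python purely for illustration. It assumes what the question allows (no syntax-altering macros) and adds a few assumptions of its own: ANSI-style definitions only, no function-pointer parameters, and nothing such as an attribute between the parameter list and the body. The names `top_level_functions`, `_collapse_bodies` and `SAMPLE` are made up for this example. The idea is to blank out comments, literals and preprocessor lines (so includes are never followed), collapse every top-level `{ ... }` to `{}`, and then look for `name ( ... ) {}`.

```python
import re

# Comments and string/char literals. Alternatives are tried left to right at
# the earliest matching position, so a "/*" inside a string is not a comment.
_NOISE = re.compile(
    r'/\*.*?\*/'                # block comment
    r'|//[^\n]*'                # line comment
    r'|"(?:\\.|[^"\\\n])*"'     # string literal
    r"|'(?:\\.|[^'\\\n])*'",    # character literal
    re.DOTALL,
)

# A preprocessor line, including backslash-newline continuations.
_PREPROCESSOR = re.compile(r'^[ \t]*#(?:\\\n|[^\n])*', re.MULTILINE)

# identifier, parameter list without nested parentheses, collapsed body.
_DEFINITION = re.compile(r'\b([A-Za-z_]\w*)\s*\([^()]*\)\s*\{\}')


def _collapse_bodies(code):
    """Keep only text at brace depth 0; every top-level block becomes '{}'."""
    out, depth = [], 0
    for ch in code:
        if ch == '{':
            if depth == 0:
                out.append('{')
            depth += 1
        elif ch == '}':
            depth = max(depth - 1, 0)
            if depth == 0:
                out.append('}')
        elif depth == 0:
            out.append(ch)
    return ''.join(out)


def top_level_functions(source):
    """Return names of functions that have a body at file scope."""
    code = _NOISE.sub(' ', source)       # drop comments and literals
    code = _PREPROCESSOR.sub('', code)   # drop #include, #define, ...
    return [m.group(1) for m in _DEFINITION.finditer(_collapse_bodies(code))]


SAMPLE = r'''
#include <stdio.h>
#include "mytypes.h"

/* int fake(void) { } */
static my_t *helper(const my_t *p, int n);   // prototype only

struct point { int x; int y; };

int add(int a, int b)
{
    if (a > b) { return a; }
    return a + b;
}

static my_t *helper(const my_t *p, int n) {
    printf("not_a_func() {");
    return 0;
}

int main(void) { return add(1, 2); }
'''

if __name__ == "__main__":
    print(top_level_functions(SAMPLE))
```

Because the unknown type `my_t` is never looked up, missing headers do not matter here: the pattern only cares about the shape `name(...) {}`, not about whether the return type is declared.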
---
Edit: the fun part will be trying to implement doc-strings. :D
Note, for the last two, it will be easier to break those up into pieces. If I were writing a recursive descent parser (RDP) for C# method declarations, I'd have something like:
    def          := ACCESS_SCOPE MUTABILITY RETURN_TYPE FN_NAME L_PAREN PARAMS R_PAREN
    ACCESS_SCOPE := "private"    <-- these would be grammar "terminals", or the literal strings we are looking for
                  | "public"
                  | <etc>
    MUTABILITY   := "static"
                  | <others?>
    RETURN_TYPE  := "bool"       <-- note that this should actually be much more complex
                  | "int"            because you can (almost) arbitrarily specify types
                  | <etc>            based on C#'s internal types
    FN_NAME      := <some regular expression for allowed function names in C#>
    L_PAREN      := "("
    R_PAREN      := ")"
Something like that. You'll need to test as you implement, though.
Each of the above grammar definitions should correspond one-to-one with a function that you implement. If you need or want help with the implementation, my email is in my profile. I'd be more than happy to walk you through this. I've done it before :D for this specific use case, too, where I was trying to auto-document a language that doesn't have any development or documentation tools.
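To make the one-function-per-rule idea concrete, here is a rough Python sketch of the grammar above. Several things in it are assumptions rather than part of the grammar: the keyword sets are small illustrative subsets, `MUTABILITY` is treated as optional, `FN_NAME` uses a simplified identifier pattern, and `PARAMS` (which the grammar leaves undefined) simply collects tokens up to the next `)`. The names `Parser`, `tokenize` and `parse_definition` are invented for this example.

```python
import re

# Illustrative subsets only; real C# allows more modifiers and arbitrary types.
ACCESS = {"private", "public", "protected", "internal"}
MUTABILITY = {"static"}
RETURN_TYPES = {"bool", "int", "void", "string"}

_TOKEN = re.compile(r'\s*([A-Za-z_]\w*|\S)')
_IDENT = re.compile(r'[A-Za-z_]\w*')   # simplified FN_NAME pattern (assumption)


def tokenize(text):
    """Split into identifiers and single punctuation characters."""
    return _TOKEN.findall(text)


class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, options, rule):
        tok = self.peek()
        if tok not in options:
            raise SyntaxError(f"{rule}: unexpected {tok!r}")
        self.pos += 1
        return tok

    # One method per grammar rule.
    def access_scope(self):
        return self.expect(ACCESS, "ACCESS_SCOPE")

    def mutability(self):
        # Assumption: the modifier may be absent.
        return self.expect(MUTABILITY, "MUTABILITY") if self.peek() in MUTABILITY else None

    def return_type(self):
        return self.expect(RETURN_TYPES, "RETURN_TYPE")

    def fn_name(self):
        tok = self.peek()
        if tok is None or not _IDENT.fullmatch(tok):
            raise SyntaxError(f"FN_NAME: unexpected {tok!r}")
        self.pos += 1
        return tok

    def l_paren(self):
        return self.expect({"("}, "L_PAREN")

    def r_paren(self):
        return self.expect({")"}, "R_PAREN")

    def params(self):
        # Placeholder rule: gather raw tokens until the closing parenthesis.
        collected = []
        while self.peek() not in (")", None):
            collected.append(self.peek())
            self.pos += 1
        return collected

    def definition(self):
        result = {
            "access": self.access_scope(),
            "mutability": self.mutability(),
            "return_type": self.return_type(),
            "name": self.fn_name(),
        }
        self.l_paren()
        result["params"] = self.params()
        self.r_paren()
        return result


def parse_definition(text):
    return Parser(tokenize(text)).definition()


if __name__ == "__main__":
    print(parse_definition("public static bool IsReady(int count, string label)"))
```

Each rule either consumes the tokens it recognises or raises a `SyntaxError` naming the rule that failed, which makes it straightforward to test the rules one at a time as you implement them.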