Meditation, The Art of Exploitation: Compilers Part 1, top-down vs. bottom-up and why nested C style comment is disallowed

Yacc allows BNF syntax such as this (note definition section is omitted for illustration purpose):



program:
      program statement '\n'
    | 
    ;

statement:
      expression
    | VARIABLE '=' expression
    ;

expression:
      INTEGER
    | VARIABLE
    | expression '+' expression
    | expression '-' expression
    | expression '*' expression
    | expression '/' expression
    | '(' expression ')'
    ;

A program is a collection of statements and has a left recursion in its grammar. Now this would have been a problem for a top-down (predicative) or recursive descent or LL (left to right and left most derivation) parser due to the fact that left recursive grammar causes indefinite parsing of input string. Yacc has no problem with such kind of grammar because Yacc is a bottom-up or shift-reduce or LR (left to right and producing right most derivation) parser. In fact left recursive grammar produces better parser with Yacc due to less number of stack entries used during shift-reduce.

Often hand crafted lexers and parsers take LL(k) approach, that is LL with k # of characters look ahead. As the LL parser reads a input string, it generates a syntax tree started from nothing (root). It's done more often simply because it's easier to write a LL(k) parser.

LR or shift-reduce parser often has an easier time parsing because a LR parser is an automaton suitable for parsing string patterns efficiently (Refer to the finite automaton regex pattern matching algorithm in Introduction to Algorithm). Often course it's possible and done to hand craft LR parsers.

In improved form LALR (look ahead left to right right most derivation production) parser such as Yacc, stacks are used to support shift-reduce and reduce-reduce operations. Yacc takes a default action when there is a conflict. For shift-reduce conflicts, yacc will shift. For reduce-reduce conflicts, it will use the first rule in the listing. It also issues a warning message whenever a conflict exists.

A common problem is parsing of a c comment /* this is a comment *****/, such syntax can be expressed as:

comment.l



%%
"/*"        {
            register int c;

            for ( ; ; )
                {
                while ( (c = input()) != '*' &&
                        c != EOF )
                    ;    /* eat up text of comment */

                if ( c == '*' )
                    {
                    while ( (c = input()) == '*' )
                        ;
                    if ( c == '/' )
                        break;    /* found the end */
                    }

                if ( c == EOF )
                    {
                    error( "EOF in comment" );
                    break;
                    }
                }
            }
%%

In this example, the lexer simply skips the comment, also note that nested comments are not allowed. This lexer code is prevalent in most C compiler implementation and is the reason why nested comment is still not allowed in C regardless the advance of parsing technology.

Do use lex/yacc to implement scanner/parsers instead of handcrafting them.

On using lex/yacc with C++:
"To summarize: don't bother to compile your Lexer in C++, keep it in C. Make your Parser in C++ and explain your compiler that some functions are C functions with extern "C" statements."

References
1. http://www.lysator.liu.se/c/ANSI-C-grammar-l.html
2. http://www.garshol.priv.no/download/text/bnf.html

Meditation, The Art of Exploitation

About Me

Tuesday, January 22, 2008

Compilers Part 1, top-down vs. bottom-up and why nested C style comment is disallowed