Meditation, The Art of Exploitation

Thinking? At last I have discovered it--thought; this alone is inseparable from me. I am, I exist--that is certain. But for how long? For as long as I am thinking. For it could be, that were I totally to cease from thinking, I should totally cease to exist....I am, then, in the strict sense only a thing that thinks.

Thursday, January 31, 2008

Compilers Part 3, lex & yacc debugging and error recovery

So far we deliberately didn't talk much about the basics of Lex and Yacc because they are better covered in books and online documents (check the reference section of this entry). In our last entry we focused on how to use C++ code in Lex&Yacc generated parsers and lexers. In this entry, we will talk about debugging lex & yacc and show simple error recovery techniques. These topics are important because as we develop a compiler (or to develop any software), debugging becomes a necessary repeated task if not most frequently. We collect the tips here hidden in corners of various documents.

To help with debugging, start with compile

1) add -d to lex
2) add --debug to yacc (add -t to bison)

In yacc source code, in the definition section, add the following code:

extern "C"{
#include
#define YYDEBUG 1
}
extern int yydebug;
yydebug = 1;

These additions will allow verbose debugging messages displayed in both lex and yacc to learn how tokens are matched and states transitioned.

For error recovery:
The Lex & Yacc book has a dedicated chapter on error recovery. The basic idea is that we need a user defined yyerror function to report error and a hint where the error occured. In the grammar file, we provide action for error state that allows the parser to recover from syntax error in user input.

To demonstrate the debugging techniques and error recovery techniques, a complete lex&yacc specifications are provided. In this example, we would like to provide an interface to allow client modify and display a c++ std::map object. The syntax will be similar to sql. We are going to build a small in memory dictionary (database) that lives in a tcp/ip server that provides a minimal terminal interface to allow data manipulation. We start with simple insert/select commands to demonstrate various techniques that will be used in building this application. We will cover alternative input, symbol table, and tcp/ip client/server in the succeeding entries.



#ifndef MAP_H
#define MAP_H
#include < string>
#include < iostream>
#include < map>

#include < boost/lambda/lambda.hpp>
typedef std::map map_t;
#endif

%{
#include "map.tab.h"
#include "map.h"

extern map_t symtab;
int lineno;
std::string line;
int tokenpos;
bool report_err;
%}
D [0-9]
N {D}+
L [a-zA-Z]
A [a-zA-Z0-9]
ID ({L}{A}*)
%%

select { tokenpos += yyleng; return SELECT; }
insert { tokenpos += yyleng; return INSERT; }
into { tokenpos += yyleng; return INTO; }
from { tokenpos += yyleng; return FROM; }
\${ID} { tokenpos += yyleng; return OBJECT; }
{ID} { tokenpos += yyleng;
symtab[std::string(yytext)] = std::string(yytext);
std::cout << symtab[yytext] << '\n';
yylval.text=strdup(yytext);
return TEXT;
}
[ \t] { tokenpos += yyleng; /* ignore white space */ }
. { tokenpos += yyleng; return yytext[0]; }
\n.* { report_err = true; tokenpos = 0; line = yytext+1; yyless(1); lineno++; return '\n'; }

%%

void yyerror(char * s){
if(report_err){
std::cout << lineno << " : " << s << " at \n" << line << '\n';
printf("%*s\n", 1+tokenpos, "^");
}
}
%{
extern "C"{
#include < stdio.h>
#define YYDEBUG 1
}
extern int yyerror(char *);
extern int yylex();

#include "map.h"

map_t object;
map_t symtab;
%}

%union{
char * text;
}

%token INSERT SELECT INTO TEXT OBJECT FROM
%%

statements: statements statement
| statement
;
statement: insert_stmt '\n'
| select_stmt '\n'
| '\n'
| error '\n' { yyclearin; yyerrok; std::cout << "Enter another command\n"; }
;
insert_stmt:
INSERT INTO OBJECT '<' TEXT ',' TEXT '>' {
object[std::string($5)] = std::string($7);
}
;
select_stmt:
SELECT '*' FROM OBJECT {
std::cout << "SELECT\n";
map_t::const_iterator it = object.begin();
for(; it != object.end(); ++it)
std::cout << it->first << ' ' << it->second << '\n';
it = symtab.begin();
for(; it != symtab.end(); ++it)
std::cout << "symbol: " << it->first << ' ' << it->second << '\n';
}
%%

int main(){
extern int yydebug;
yydebug = 1;
yyparse();
}



We will use the same makefile provided in last entry. 'make map' will build the binary 'map'. Try
./map and input 'insert into $mm ' and 'insert int $ss ', we get the following output and diagnosis from our parser. There is a caveat with the error reporting, if the first line the user entered has syntax error, it won't be able to report it because the first line's text is not saved (it can be altered to save the text of every line but I haven't found a good way to do it).

Starting parse
Entering state 0
Reading a token: --(end of buffer or a NUL)
insert into $mm
--accepting rule at line 19 ("insert")
Next token is token INSERT ()
Shifting token INSERT ()
Entering state 2
Reading a token: --accepting rule at line 29 (" ")
--accepting rule at line 20 ("into")
Next token is token INTO ()
Shifting token INTO ()
Entering state 10
Reading a token: --accepting rule at line 29 (" ")
--accepting rule at line 22 ("$mm")
Next token is token OBJECT ()
Shifting token OBJECT ()
Entering state 16
Reading a token: --accepting rule at line 29 (" ")
--accepting rule at line 30 ("<")
Next token is token '<' ()
Shifting token '<' ()
Entering state 18
Reading a token: --accepting rule at line 23 ("abc")
abc
Next token is token TEXT ()
Shifting token TEXT ()
Entering state 20
Reading a token: --accepting rule at line 30 (",")
Next token is token ',' ()
Shifting token ',' ()
Entering state 21
Reading a token: --accepting rule at line 23 ("def")
def
Next token is token TEXT ()
Shifting token TEXT ()
Entering state 22
Reading a token: --accepting rule at line 30 (">")
Next token is token '>' ()
Shifting token '>' ()
Entering state 23
Reducing stack by rule 7 (line 31):
$1 = token INSERT ()
$2 = token INTO ()
$3 = token OBJECT ()
$4 = token '<' ()
$5 = token TEXT ()
$6 = token ',' ()
$7 = token TEXT ()
$8 = token '>' ()
-> $$ = nterm insert_stmt ()
Stack now 0
Entering state 7
Reading a token: --(end of buffer or a NUL)

Entering state 5
Reading a token: --(end of buffer or a NUL)
insert int $ss
--accepting rule at line 31 ("
insert int $ss ")
Next token is token '\n' ()
Shifting token '\n' ()
Entering state 4
Reducing stack by rule 5 (line 27):
$1 = token '\n' ()
-> $$ = nterm statement ()
Stack now 0 5
Entering state 13
Reducing stack by rule 1 (line 22):
$1 = nterm statements ()
$2 = nterm statement ()
-> $$ = nterm statements ()
Stack now 0
Entering state 5
Reading a token: --accepting rule at line 19 ("insert")
Next token is token INSERT ()
Shifting token INSERT ()
Entering state 2
Reading a token: --accepting rule at line 29 (" ")
--accepting rule at line 23 ("int")
int
Next token is token TEXT ()
2 : syntax error at
insert int $ss <------------ Nice diagnosis from the parser
^
Error: popping token INSERT ()
Stack now 0 5
Shifting token error ()
Entering state 1
Next token is token TEXT ()
Error: discarding token TEXT ()
Error: popping token error ()
Stack now 0 5
Shifting token error ()
Entering state 1
Reading a token: --accepting rule at line 29 (" ")
--accepting rule at line 22 ("$ss")
Next token is token OBJECT ()
Error: discarding token OBJECT ()
Error: popping token error ()
Stack now 0 5
Shifting token error ()
Entering state 1
Reading a token: --accepting rule at line 29 (" ")
--accepting rule at line 30 ("<")
Next token is token '<' ()
Error: discarding token '<' ()
Error: popping token error ()
Stack now 0 5
Shifting token error ()
Entering state 1
Reading a token: --accepting rule at line 23 ("a")
a
Next token is token TEXT ()
Error: discarding token TEXT ()
Error: popping token error ()
Stack now 0 5
Shifting token error ()
Entering state 1
Reading a token: --accepting rule at line 30 (",")
Next token is token ',' ()
Error: discarding token ',' ()
Error: popping token error ()
Stack now 0 5
Shifting token error ()
Entering state 1
Reading a token: --accepting rule at line 23 ("b")
b
Next token is token TEXT ()
Error: discarding token TEXT ()
Error: popping token error ()
Stack now 0 5
Shifting token error ()
Entering state 1
Reading a token: --accepting rule at line 30 (">")
Next token is token '>' ()
Error: discarding token '>' ()
Error: popping token error ()
Stack now 0 5
Shifting token error ()
Entering state 1
Reading a token: --(end of buffer or a NUL)

References:
1. Lex & Yacc John R. Levine, Tony Mason, Doug Brown ISBN: 1565920007
2. http://dinosaur.compilertools.net/yacc/index.html