Meditation, The Art of Exploitation

Thinking? At last I have discovered it--thought; this alone is inseparable from me. I am, I exist--that is certain. But for how long? For as long as I am thinking. For it could be, that were I totally to cease from thinking, I should totally cease to exist....I am, then, in the strict sense only a thing that thinks.

Friday, February 01, 2008

Compilers Part 4, Symbol tables

Continue from last entry, we will add a symbol table to our little sql parser that interacts with in memory dictionaries. We have chosen to support dictionaries of the type std:map< std::string, std::string> because coupled with serialization/deserialization techniques, a dictionary is fit to describe a hierarchical organization of data objects. Theoretically, all things can be represented by strings in a Turing machine. We may cover automatic type deduction and more generic (in terms of coding convenience) solution in future entries on symbol tables.

The reason a map is used instead of using a plain std::set is coding convenience. For std::set we have to supply user defined ordering functions to work with properly. Since we know that symbol table entries are always keyed by symbol name, a straight forward std::string key type works well. Our symbol table itself is a std::map type with key_type std::string and value_type value. struct value is user defined in map.h that contains a union ivalue type in which we hold multiple types our scripting language supports. One of the ivalue type is the dictionary type std::map, typdef map_t. Our mini sql language revolves around manipulating this type of data structure.

We will also make significant enhancement to the syntax of our little scripting language, allowing greater freedom in dictionary creation, object manipulation. Let's first see a transcript of the interpreter in action:

./map
create table $sss;
list
symbol: sss m
insert into $sss (abc, efg)
list
symbol: sss m
key = abc value = efg
insret into $sss( ddd, hik)
4 : syntax error at
insret into $sss( ddd, hik)
^
insert into $sss (ddd, hik)
Enter another command
list
symbol: sss m
key = abc value = efg
key = ddd value = hik
select * from $sss;
key = abc value = efg
key = ddd value = hik
select * from $sss where key = ddd
key = ddd value = hik


We have added more reserved keywords, more grammars to enhance the scripting language. Here is the complete code listing:


#ifndef MAP_H
#define MAP_H
#include < string>
#include < iostream>
#include < map>

#include < boost/lambda/lambda.hpp>
typedef std::map< std::string, std::string> map_t;

// a value type that describe the value of a symbol table entry
// Essentially the symbol table entry data structure
struct value{
unsigned char type; // the first letter of corresponding union type
union {
int ival;
float fval;
double dval;
char * sval;
map_t * mval;
} ivalue;
};

// The symbol table data structure
// Symbol table entry is keyed by the symbol string value
typedef std::map< std::string, value> symtab_t;

extern int yyerror(char *);
#endif


%{
#include "map.tab.h"
#include "map.h"

extern symtab_t symtab;
std::string line;
bool report_err;
int lineno;
int tokenpos;
%}
D [0-9]
N {D}+
L [a-zA-Z]
A [a-zA-Z0-9]
ID ({L}{A}*)
%%

select { tokenpos += yyleng; return SELECT; }
insert { tokenpos += yyleng; return INSERT; }
into { tokenpos += yyleng; return INTO; }
from { tokenpos += yyleng; return FROM; }
create { tokenpos += yyleng; return CREATE; }
table { tokenpos += yyleng; return TABLE; }
list { tokenpos += yyleng; return LIST; }
where { tokenpos += yyleng; return WHERE; }
key { tokenpos += yyleng; return KEY; }
value { tokenpos += yyleng; return VALUE; }

${ID} { tokenpos += yyleng; yylval.text = strdup(yytext+1); return OBJECT; }
{ID} { tokenpos += yyleng; yylval.text = strdup(yytext); return TEXT; }
[ \t] { tokenpos += yyleng; /* ignore white space */ }
. { tokenpos += yyleng; return yytext[0]; }
\n.* { report_err = true; tokenpos = 0; line = yytext+1; yyless(1); lineno++; return '\n'; }

%%

int yyerror(char * s){
if(report_err){
std::cout << lineno << " : " << s << " at \n" << line << '\n';
printf("%*s\n", 1+tokenpos, "^");
}
}

%{
extern "C"{
#include < stdio.h>
#define YYDEBUG 1
}
extern int yyerror(char *);
extern int yylex();

#include "map.h"

symtab_t symtab;
bool where_by_key = false;
bool where_by_value = false;
std::string tablename;
std::string where;
%}

%union{
char * text;
}

%token < text> INSERT SELECT INTO TEXT OBJECT FROM CREATE TABLE LIST
%token < text> WHERE KEY VALUE
%%

statements: statements statement
| statement
;
statement: insert_stmt opt_semicolon '\n'
| select_stmt opt_semicolon '\n'
| create_stmt opt_semicolon '\n'
| assign_stmt opt_semicolon '\n'
| list_stmt opt_semicolon '\n'
| '\n'
| error '\n' { yyclearin; yyerrok; std::cout << "Enter another command\n"; }
;
opt_semicolon:
| ';'
;
assign_stmt:
OBJECT '=' TEXT
{
// string variable assignment
symtab_t::iterator it = symtab.find($1);
if(it != symtab.end() && it->second.type == 's') // Symbol found and type is correct
it->second.ivalue.sval = $3;
else{ // New symbol, add to symbol table
value v;
v.ivalue.sval = $3;
v.type = 's';
symtab[$1] = v;
}
}
;
create_stmt:
CREATE TABLE OBJECT
{
// Create a new dictionary
std::string symbol = $3;
symtab_t::iterator it = symtab.find(symbol);
if(it != symtab.end() && it->second.type == 'm'){ // Symbol found and type is correct
std::cerr << "symbol: " << symbol << " already exists\n";
}else{ // New symbol, create new map(table), add to symbol table
value v;
v.ivalue.mval = new(map_t);
v.type = 'm';
symtab[symbol] = v;
}
}
;
insert_stmt:
INSERT INTO OBJECT '(' TEXT ',' TEXT ')'
{
// insert key, value pair into an existing dictionary
symtab_t::const_iterator it = symtab.find($3);
if(it != symtab.end() && it->second.type == 'm'){ // Symbol found and type is correct
(*(it->second.ivalue.mval))[std::string($5)] = std::string($7);
}else
std::cerr << "unknown symbol: " << $3 << " create first\n";
}
;
select_stmt: simple_select_stmt
{
// go through all key, value pair of a dictionary
symtab_t::const_iterator it = symtab.find(tablename);
if(it != symtab.end() && it->second.type == 'm'){
map_t::const_iterator mit = it->second.ivalue.mval->begin();
for(; mit != it->second.ivalue.mval->end(); ++ mit)
std::cout << "key = " << mit->first << ' '
<< "value = " << mit->second << '\n';
}else
std::cerr << "invalid object\n";

}
| simple_select_stmt opt_where_stmt
{
// go through all key, value pair of a dictionary
// based on where criteria, search by key or value
symtab_t::const_iterator it = symtab.find(tablename);
if(it != symtab.end() && it->second.type == 'm'){
map_t::const_iterator mit = it->second.ivalue.mval->begin();
for(; mit != it->second.ivalue.mval->end(); ++ mit)
if( (where_by_key && mit->first == where) ||
(where_by_value && mit->second == where) ||
(!where_by_key && !where_by_value) )
std::cout << "key = " << mit->first << ' '
<< "value = " << mit->second << '\n';
}else
std::cerr << "invalid object\n";

where_by_key = where_by_value = false;

}
;
simple_select_stmt:
SELECT '*' FROM OBJECT { tablename = $4; }
;
opt_where_stmt: WHERE KEY '=' TEXT { where_by_key = true; where = $4; }
| WHERE VALUE '=' TEXT { where_by_value = true; where = $4; }
;
list_stmt:
LIST
{
// Dump the entire symbol table
// For dictionaries, dump all key, value pairs as well
//
// Iterate through the symbol table
symtab_t::const_iterator it = symtab.begin();
for(; it != symtab.end(); ++it){
std::cout << "symbol: " << it->first << ' ' << it->second.type << '\n';
switch(it->second.type){
case 's': std::cout << "value = " << it->second.ivalue.sval << '\n';
break;
case 'm': {
// iterate through the dictionary
map_t::const_iterator mit = it->second.ivalue.mval->begin();
for(; mit != it->second.ivalue.mval->end(); ++ mit)
std::cout << "key = " << mit->first << ' '
<< "value = " << mit->second << '\n';
}
break;
default:
std::cerr << "Unknown data type\n";
break;
}
}
}
;
%%

int main(){
extern int yydebug;
yydebug = 0;
yyparse();
}




We do not provide symlookup kind of functions because it's fairly straightforward to find symtable entry with our symtab_t data structure and the returned iterator is of immediate use in the action code. The complex object chaining action code may seem a little taunting at first but they are just regular C++ STL ways of getting/setting std::map data. We also let STL to handle all the memory issues, computing efficiency (std::map is typically implemented as a red-black tree, a kind of balanced binary search tree with very decent insertion, search, deletion speed O(lgN)), APIs. This is primarily why we would like to get lex and yacc to work with C++.

Perhaps most telling is the 'list' command covered by list_stmt grammar, whose action code performs a complete dump of the entire symbol table. We first iterate the symbol table data structure, upon seeing a map_t type, we also iterate all key, value pairs of this symbol. This code illustrate the data structures we use to contain data.