Meditation, The Art of Exploitation

Thinking? At last I have discovered it--thought; this alone is inseparable from me. I am, I exist--that is certain. But for how long? For as long as I am thinking. For it could be, that were I totally to cease from thinking, I should totally cease to exist....I am, then, in the strict sense only a thing that thinks.

Thursday, January 31, 2008

Compilers Part 3, lex & yacc debugging and error recovery

So far we have deliberately not said much about the basics of Lex and Yacc, because they are better covered in books and online documents (see the reference section of this entry). In the last entry we focused on how to use C++ code in Lex & Yacc generated parsers and lexers. In this entry we will talk about debugging lex & yacc and show simple error recovery techniques. These topics matter because when developing a compiler (or any software), debugging is a necessary and frequently repeated task. We collect here the tips that are otherwise hidden in corners of various documents.

To help with debugging, start at compile time:

1) add -d to lex
2) add -t to yacc (bison also accepts --debug)

In the yacc source code, in the definition section, add the following code:

extern "C"{
#include <stdio.h>
#define YYDEBUG 1
}

and in main, before calling yyparse, turn on the debug flag:

extern int yydebug;
yydebug = 1;

With these additions, both the lexer and the parser print verbose debugging messages showing how tokens are matched and how states transition.

For error recovery:
The Lex & Yacc book has a dedicated chapter on error recovery. The basic idea is that we need a user-defined yyerror function to report the error and hint at where it occurred. In the grammar file, we provide an action for the error token that allows the parser to recover from syntax errors in user input.

To demonstrate these debugging and error recovery techniques, complete lex & yacc specifications are provided below. In this example, we provide an interface that lets a client modify and display a C++ std::map object, with syntax similar to SQL. We are going to build a small in-memory dictionary (database) that lives in a TCP/IP server providing a minimal terminal interface for data manipulation. We start with simple insert/select commands to demonstrate the techniques that will be used in building this application. Alternative input, the symbol table, and the TCP/IP client/server will be covered in succeeding entries.



#ifndef MAP_H
#define MAP_H
#include <string>
#include <iostream>
#include <map>

#include <boost/lambda/lambda.hpp>
typedef std::map<std::string, std::string> map_t;
#endif

%{
#include "map.tab.h"
#include "map.h"

extern map_t symtab;
int lineno;
std::string line;
int tokenpos;
bool report_err;
%}
D [0-9]
N {D}+
L [a-zA-Z]
A [a-zA-Z0-9]
ID ({L}{A}*)
%%

select { tokenpos += yyleng; return SELECT; }
insert { tokenpos += yyleng; return INSERT; }
into { tokenpos += yyleng; return INTO; }
from { tokenpos += yyleng; return FROM; }
\${ID} { tokenpos += yyleng; return OBJECT; }
{ID} { tokenpos += yyleng;
symtab[std::string(yytext)] = std::string(yytext);
std::cout << symtab[yytext] << '\n';
yylval.text=strdup(yytext);
return TEXT;
}
[ \t] { tokenpos += yyleng; /* ignore white space */ }
. { tokenpos += yyleng; return yytext[0]; }
\n.* { report_err = true; tokenpos = 0; line = yytext+1; yyless(1); lineno++; return '\n'; }

%%

int yyerror(char * s){ /* int, to match the declaration in the yacc file */
if(report_err){
std::cout << lineno << " : " << s << " at \n" << line << '\n';
printf("%*s\n", 1+tokenpos, "^");
}
return 0;
}
%{
extern "C"{
#include <stdio.h>
#define YYDEBUG 1
}
extern int yyerror(char *);
extern int yylex();

#include "map.h"

map_t object;
map_t symtab;
%}

%union{
char * text;
}

%token INSERT SELECT INTO OBJECT FROM
%token <text> TEXT
%%

statements: statements statement
| statement
;
statement: insert_stmt '\n'
| select_stmt '\n'
| '\n'
| error '\n' { yyclearin; yyerrok; std::cout << "Enter another command\n"; }
;
insert_stmt:
INSERT INTO OBJECT '<' TEXT ',' TEXT '>' {
object[std::string($5)] = std::string($7);
}
;
select_stmt:
SELECT '*' FROM OBJECT {
std::cout << "SELECT\n";
map_t::const_iterator it = object.begin();
for(; it != object.end(); ++it)
std::cout << it->first << ' ' << it->second << '\n';
it = symtab.begin();
for(; it != symtab.end(); ++it)
std::cout << "symbol: " << it->first << ' ' << it->second << '\n';
}
;
%%

int main(){
extern int yydebug;
yydebug = 1;
yyparse();
}



We will use the same makefile provided in the last entry; 'make map' builds the binary 'map'. Run
./map and enter 'insert into $mm <abc, def>' and then 'insert int $ss <a, b>'; we get the following output and diagnosis from our parser. There is a caveat with the error reporting: if the first line the user enters has a syntax error, it cannot be reported, because the first line's text is not saved (the lexer can be altered to save the text of every line, but I haven't found a good way to do it).

Starting parse
Entering state 0
Reading a token: --(end of buffer or a NUL)
insert into $mm
--accepting rule at line 19 ("insert")
Next token is token INSERT ()
Shifting token INSERT ()
Entering state 2
Reading a token: --accepting rule at line 29 (" ")
--accepting rule at line 20 ("into")
Next token is token INTO ()
Shifting token INTO ()
Entering state 10
Reading a token: --accepting rule at line 29 (" ")
--accepting rule at line 22 ("$mm")
Next token is token OBJECT ()
Shifting token OBJECT ()
Entering state 16
Reading a token: --accepting rule at line 29 (" ")
--accepting rule at line 30 ("<")
Next token is token '<' ()
Shifting token '<' ()
Entering state 18
Reading a token: --accepting rule at line 23 ("abc")
abc
Next token is token TEXT ()
Shifting token TEXT ()
Entering state 20
Reading a token: --accepting rule at line 30 (",")
Next token is token ',' ()
Shifting token ',' ()
Entering state 21
Reading a token: --accepting rule at line 23 ("def")
def
Next token is token TEXT ()
Shifting token TEXT ()
Entering state 22
Reading a token: --accepting rule at line 30 (">")
Next token is token '>' ()
Shifting token '>' ()
Entering state 23
Reducing stack by rule 7 (line 31):
$1 = token INSERT ()
$2 = token INTO ()
$3 = token OBJECT ()
$4 = token '<' ()
$5 = token TEXT ()
$6 = token ',' ()
$7 = token TEXT ()
$8 = token '>' ()
-> $$ = nterm insert_stmt ()
Stack now 0
Entering state 7
Reading a token: --(end of buffer or a NUL)

Entering state 5
Reading a token: --(end of buffer or a NUL)
insert int $ss
--accepting rule at line 31 ("
insert int $ss ")
Next token is token '\n' ()
Shifting token '\n' ()
Entering state 4
Reducing stack by rule 5 (line 27):
$1 = token '\n' ()
-> $$ = nterm statement ()
Stack now 0 5
Entering state 13
Reducing stack by rule 1 (line 22):
$1 = nterm statements ()
$2 = nterm statement ()
-> $$ = nterm statements ()
Stack now 0
Entering state 5
Reading a token: --accepting rule at line 19 ("insert")
Next token is token INSERT ()
Shifting token INSERT ()
Entering state 2
Reading a token: --accepting rule at line 29 (" ")
--accepting rule at line 23 ("int")
int
Next token is token TEXT ()
2 : syntax error at
insert int $ss <------------ Nice diagnosis from the parser
^
Error: popping token INSERT ()
Stack now 0 5
Shifting token error ()
Entering state 1
Next token is token TEXT ()
Error: discarding token TEXT ()
Error: popping token error ()
Stack now 0 5
Shifting token error ()
Entering state 1
Reading a token: --accepting rule at line 29 (" ")
--accepting rule at line 22 ("$ss")
Next token is token OBJECT ()
Error: discarding token OBJECT ()
Error: popping token error ()
Stack now 0 5
Shifting token error ()
Entering state 1
Reading a token: --accepting rule at line 29 (" ")
--accepting rule at line 30 ("<")
Next token is token '<' ()
Error: discarding token '<' ()
Error: popping token error ()
Stack now 0 5
Shifting token error ()
Entering state 1
Reading a token: --accepting rule at line 23 ("a")
a
Next token is token TEXT ()
Error: discarding token TEXT ()
Error: popping token error ()
Stack now 0 5
Shifting token error ()
Entering state 1
Reading a token: --accepting rule at line 30 (",")
Next token is token ',' ()
Error: discarding token ',' ()
Error: popping token error ()
Stack now 0 5
Shifting token error ()
Entering state 1
Reading a token: --accepting rule at line 23 ("b")
b
Next token is token TEXT ()
Error: discarding token TEXT ()
Error: popping token error ()
Stack now 0 5
Shifting token error ()
Entering state 1
Reading a token: --accepting rule at line 30 (">")
Next token is token '>' ()
Error: discarding token '>' ()
Error: popping token error ()
Stack now 0 5
Shifting token error ()
Entering state 1
Reading a token: --(end of buffer or a NUL)

References:
1. Lex & Yacc, John R. Levine, Tony Mason, Doug Brown. ISBN 1-56592-000-7
2. http://dinosaur.compilertools.net/yacc/index.html

Tuesday, January 29, 2008

Compilers Part 2, lex & yacc with C++, types of external linkage

Lex and Yacc were traditionally used with C. More importantly, the default functions provided by the lex/yacc (or flex/bison) libraries all have C linkage, meaning their function names are not mangled; notably yyparse, yywrap, and yyerror.

yyparse, generated by yacc, internally calls yylex, generated by lex. Therefore it's important that both yyparse and yylex use the same linkage, either C or C++. Linkage is determined in the definition section.

For example in this yacc definition of grammar.y:
%{
extern "C"{
extern int yyerror(char *);
}
extern int yylex(void);
%}

yyerror has C linkage; this parser uses the default yyerror implementation provided by the yacc library (-ly). yylex gets whatever linkage the compiler gives the generated source code: in the case of (gcc -c y.tab.c), the resulting yylex has C linkage, as evidenced by (nm y.tab.o | grep yylex) showing 'U yylex'.

On the other hand, if g++ is used to compile (g++ -c -x c++ y.tab.c), the yylex symbol in the grammar object file has C++ linkage, and its name is mangled as shown by nm: 'U _Z5yylexv'. In both cases 'U' means undefined symbol: it has external linkage and will be provided by another compilation unit. The demangled name can be inspected with 'nm -C y.tab.o | grep lex', which yields 'U yylex()'.

OK, enough introduction to libraries, external linkage, and nm tricks. The point is that when using lex & yacc with C++, it's very important to pay attention to function names and declare the proper linkage.

If we intend to use C++ to compile/link our grammar.y example with a grammar.l lex file, the lex file needs to have the following definition:
%{
extern "C"{
//extern int yylex(void);
}
#include "y.tab.h"
%}

Note that yylex is deliberately commented out, to make clear that it uses the compiler's default linkage (C for gcc, C++ for g++).

Here is a makefile that can be used to compile lex/yacc with C++ code embedded directly.


LEX=flex
YACC=bison
CXX=g++
CXXFLAGS=-g -O0
%: %.l %.y
$(LEX) -t $@.l > $@.c
if [[ -e $@.y ]] ; then \
$(YACC) -d --verbose --debug $@.y; \
$(CXX) $(CXXFLAGS) -c -x c++ $@.tab.c; \
$(CXX) $(CXXFLAGS) -c -x c++ $@.c; \
$(CXX) $@.o $@.tab.o -o $@ -ly -lfl -lm ; \
else \
$(CXX) $(CXXFLAGS) -o $@ $@.c -lfl -lm ; \
fi
@if [[ -e $@.tab.c ]] ; then rm $@.tab.c ; fi
@if [[ -e $@.tab.h ]] ; then rm $@.tab.h ; fi
#@-rm $@.c
clean:
rm *.o



If you examine the lex generated source code, you will see something like this:


/* Default declaration of generated scanner - a define so the user can
* easily add parameters.
*/
#ifndef YY_DECL
#define YY_DECL_IS_OURS 1

extern int yylex (void);

#define YY_DECL int yylex (void)
#endif /* !YY_DECL */
/** The main scanner function which does all the work.
*/
YY_DECL
{



The extra 'extern' storage specifier on int yylex() is redundant and can be confusing. According to the C and C++ linkage rules, an 'extern' function with a visible definition in the same file has external linkage. Since yylex is defined later in the lex-generated source code, yylex has external linkage and an internal definition (for lack of a better term). In the following example, test has external linkage and an internal definition; test1 has external linkage and an external definition; test2 causes a compilation failure (a static declaration follows an extern one). In the nm output, 'T' means test is defined in the text section of this object file; 'U' means test1 is undefined (external definition, provided by another translation/compilation unit/object file).

00000000 T test
U test1



#include <errno.h>
extern int errno; /* note: with modern glibc errno is a macro, so these two
                     errno lines may no longer compile; they illustrate the
                     same external linkage rule applied to objects */

int errno;

extern int test();
extern int test1();
extern int test2();

int test(){
test1();
test();
errno = 10;
return 0;
}

/* error: static declaration of test2 follows the extern declaration above */
static int test2(){
test();
return 0;
}


References:
1. http://publications.gbdirect.co.uk/c_book/chapter4/linkage.html
2. http://publications.gbdirect.co.uk/c_book/chapter8/declarations_and_definitions.html

Tuesday, January 22, 2008

Compilers Part 1, top-down vs. bottom-up and why nested C style comment is disallowed

Yacc allows BNF syntax such as this (the definition section is omitted for illustration purposes):


program:
program statement '\n'
|
;

statement:
expression
| VARIABLE '=' expression
;

expression:
INTEGER
| VARIABLE
| expression '+' expression
| expression '-' expression
| expression '*' expression
| expression '/' expression
| '(' expression ')'
;



A program is a collection of statements, and its grammar is left-recursive. This would be a problem for a top-down (predictive), recursive descent, or LL (left-to-right scan, leftmost derivation) parser, because a left-recursive rule sends such a parser into infinite recursion. Yacc has no problem with this kind of grammar because Yacc builds a bottom-up, shift-reduce, LR (left-to-right scan, rightmost derivation in reverse) parser. In fact, left-recursive grammars produce better parsers with Yacc because they use fewer stack entries during shift-reduce.

Hand-crafted lexers and parsers often take the LL(k) approach, that is, LL with k tokens of lookahead. As an LL parser reads the input string, it builds the syntax tree starting from the root. This approach is popular mostly because an LL(k) parser is easier to write by hand.

An LR or shift-reduce parser often has an easier time parsing, because an LR parser is an automaton well suited to matching string patterns efficiently (compare the finite-automaton regex pattern matching algorithm in Introduction to Algorithms). Of course it is possible, and sometimes done, to hand-craft LR parsers.

An improved form, the LALR (look-ahead LR) parser such as Yacc, uses a stack to support the shift and reduce operations. Yacc takes a default action when there is a conflict: for shift-reduce conflicts it shifts; for reduce-reduce conflicts it uses the first rule in the listing. It also issues a warning message whenever a conflict exists.

A common problem is parsing a C comment such as /* this is a comment *****/. This can be expressed as:

comment.l


%%
"/*" {
register int c;

for ( ; ; )
{
while ( (c = input()) != '*' &&
c != EOF )
; /* eat up text of comment */

if ( c == '*' )
{
while ( (c = input()) == '*' )
;
if ( c == '/' )
break; /* found the end */
}

if ( c == EOF )
{
error( "EOF in comment" );
break;
}
}
}
%%


In this example the lexer simply skips the comment; note also that nested comments are not allowed. This lexer code is prevalent in most C compiler implementations, and it is the reason nested comments are still disallowed in C regardless of advances in parsing technology.

Do use lex/yacc to implement scanners and parsers instead of handcrafting them.

On using lex/yacc with C++:
"To summarize: don't bother to compile your Lexer in C++, keep it in C. Make your Parser in C++ and explain your compiler that some functions are C functions with extern "C" statements."


References
1. http://www.lysator.liu.se/c/ANSI-C-grammar-l.html
2. http://www.garshol.priv.no/download/text/bnf.html

Friday, January 18, 2008

Notes on Linux signal in the context of process and thread

Handling Linux signals correctly is difficult, for a few reasons: 1) Linux signals descend from the archaic Unix SysV signal system, which still supports the race-prone signal/pause calls and all kinds of haphazard signal handling practices; 2) the POSIX standard is intentionally vague on a couple of signal-related issues, e.g. the fields of siginfo_t; 3) Linux signals do not always follow the POSIX standard; 4) there is still a lot of code using the SysV signal mechanism that should be migrated to the better POSIX system.

Linux Programming by Example has an excellent chapter on Linux signal handling. The picture is not complete, though, because Linux threads increase the complexity of signal handling. The following code demonstrates a few important points of Linux signal handling in the context of Linux threads:



/*
1. signal dispositions (handlers) are per process, not per thread; each
thread does, however, have its own signal mask (pthread_sigmask)

2. raise and kill work differently with pthreads

3. a synchronous signal raised by a thread goes to that thread itself *only*, not the process group

*/
#include <iostream>
using namespace std;

#include <boost/thread/thread.hpp>

extern "C"{
#include <signal.h>
#include <unistd.h>
#include <sys/syscall.h>
}

volatile sig_atomic_t interrupted = 0;
// syscall 224 is gettid on 32-bit x86; syscall(SYS_gettid) is more portable
int standby_thr_pid = (long int)syscall(224);

//#define SI_USER 0 /* sent by kill, sigsend, raise */
//#define SI_KERNEL 0x80 /* sent by the kernel from somewhere */
//#define SI_QUEUE -1 /* sent by sigqueue */
//#define SI_TIMER __SI_CODE(__SI_TIMER,-2) /* sent by timer expiration */
//#define SI_MESGQ __SI_CODE(__SI_MESGQ,-3) /* sent by real time mesq state change */
//#define SI_ASYNCIO -4 /* sent by AIO completion */
//#define SI_SIGIO -5 /* sent by queued SIGIO */
//#define SI_TKILL -6 /* sent by tkill system call */
//#define SI_DETHREAD -7 /* sent by execve() killing subsidiary threads */

// pid = 0 uid = 0 -> process itself
// else pid > 0 uid > 0 -> external process sent signal
// si_code = 128 (0x80) sent by kernel, e.g. interactive terminal ctl+c
// si_code = -6 sent by tkill
// si_code = 0 sent by kill/killpg/raise call
// there is no per thread signal handler, signal handler is installed
// process wise always
void ctlc(int sig, siginfo_t * info, void * context){
interrupted = 1;
int pid = (long int)syscall(224);
cout << pid << " received: " <<
sig << ' ' << info->si_code << ' ' << info->si_pid << ' ' << info->si_uid << '\n';
}

// raise SIGINT in a separate thread
void sig_sender(void){
cout << "sig sender " << (long int)syscall(224) << '\n';
sleep(3);
raise(SIGINT); // raise = kill(getpid(), sig) in process, signal sent to sender itself only
sleep(1);
kill(standby_thr_pid, SIGINT); // signal only sent to the standby thread process/thread
}

void sig_receiver(void){
//sigset_t set, old_set;

//sigaddset(&set, SIGINT);
//sigprocmask(SIG_BLOCK, &set, &old_set);

cout << "sig installer: " << (long int)syscall(224) << '\n';
struct sigaction act, old_act;
sigemptyset(&act.sa_mask); /* initialize the mask before adding signals */
sigaddset(&(act.sa_mask), SIGINT);
sigaddset(&(act.sa_mask), SIGSEGV);
act.sa_flags = SA_SIGINFO;
act.sa_sigaction = ctlc;

sigaction(SIGINT, &act, &old_act);
sigaction(SIGSEGV, &act, &old_act);
}

void standby(void){
standby_thr_pid = (long int)syscall(224);
cout << "sig standby " << standby_thr_pid << '\n';
sleep(5);
}
// one thread acts as sender, the other receiver
int main(){
boost::thread trs(sig_sender);
boost::thread trr(sig_receiver);
boost::thread trsb(standby);
while(true){
sleep(1);
if(interrupted){
cout << "interrupt handler invoked\n";
interrupted = 0;
}
}
}

Thursday, January 10, 2008

Annoying issue with Linux sound

To this day, a Linux sound device cannot be shared by multiple sound players. Typically /dev/dsp is locked by a single process, and no other process can access it or output any sound. This is quite annoying, because sometimes it's difficult to figure out which process holds the lock on /dev/dsp; lsof does not reveal anything.

If you insist on reusing the sound device, you have to restart the sound service; this is /etc/init.d/alsasound on SuSE Linux. Restarting the sound service terminates the locking process (through SIGIO or SIGPIPE, I imagine) and releases the sound device. Afterward (in GNOME), add the volume control applet back to the tray in the lower right corner.

Thursday, January 03, 2008

Graph algorithms, data structures, pattern, and algorithm correctness

GNU tool chain

I have been quite busy with several projects over the last month (hence the lack of blog activity), and in the process I've learnt a few tricks about makefiles, vim/cscope, and man pages.

Given a large project on GNU/Linux, it's often necessary to first cross-reference the code, get a high-level overview of the data structures, and generate man pages for essential APIs and data types.

The following tools are my favorite

1. umbrello, for creating high level UML diagrams of essential data structures and APIs
2. cscope, ctags to generate cross reference, sometimes I also use lxr for c/c++ projects
3. creating man pages; this generally involves a few shell and perl scripts to convert HTML documents to man pages.
4. use small test programs to understand the existing framework's APIs and design structure.

During the process, I found it's essential to have a basic knowledge of the following GNU toolchain to make a developer's life easier:

1. bash scripting. Writing a bash script is like writing assembly: succinct, efficient, and to the point. One additional trick is the bash built-in 'help' command for looking up information on bash builtins, e.g. 'help for'.

2. vim or emacs. After 13 years of vim, there are still new things to learn. Here is a key mapping I devised recently to look up C++ STL APIs/data structures directly from the SGI website inside vim (look at the HTML source of this page to see how the macro is written; there is no direct way to display it through blogspot):

:vmap :!links -dump http://www.sgi.com/tech/stl/=expand('').html\|vim -R -

Enter visual mode (v), highlight your keyword, and press Ctrl+K; this takes you to another vim session with the page downloaded and formatted. Isn't it neat?

3. makefile. It's naive to think of makefile/make as only a compile/link tool; it's more than that. Ever notice its similarity to EBNF in structure? It's actually an automaton, and in fact Turing-complete: it can literally perform any task C/C++/Perl etc. can do. Its EBNF-like structure provides a powerful and intuitive hierarchical approach to difficult problems.
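For instance, make can drive a pipeline with no compiler involved at all; each rule declares what it produces and what it needs, and make resolves the hierarchy (a toy sketch with made-up file names):

```make
# toy example: make as a general dependency engine, not just a compiler driver
words.txt: corpus.txt
	tr -cs 'a-zA-Z' '\n' < $< | sort -u > $@

report.txt: words.txt
	wc -l < $< > $@
```

Running 'make report.txt' rebuilds words.txt first, but only if corpus.txt has changed since the last run.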

4. The good old man page. Use Shift+K inside vim on a keyword (in non-visual mode) to get its man page; this is installed by default in vim. MANPATH is the search path for man pages. I have not found a good way to break up long lines in man pages; COLUMNWIDTH etc. does not seem to affect man page generation from a text file with troff.

References:
1. http://www.hsrl.rutgers.edu/ug/shell_help.html
2. http://vim.wikia.com/wiki/Mapping_keys_in_Vim_-_Tutorial_(Part_1)#Visual_mode_maps
3. http://www.osdev.org/wiki/Makefile
4. Bash Cookbook solutions and examples for bash users
5. Hacking vim a cookbook to get the most out of the latest vim editor
6. http://www.gnu.org/software/make/manual/make.html (unfortunately there is not a single book available to systematically introduce gnu make to general public. The manual remains the sole source of comprehensive explanation of gnu make)