Meditation, The Art of Exploitation

Thinking? At last I have discovered it--thought; this alone is inseparable from me. I am, I exist--that is certain. But for how long? For as long as I am thinking. For it could be, that were I totally to cease from thinking, I should totally cease to exist....I am, then, in the strict sense only a thing that thinks.

Thursday, February 28, 2008

modprobe returns Invalid kernel module Format

I ran into this problem recently on my new Gentoo installation. I wanted to use my USB wireless card (Netgear MA111), which requires a custom build of linux-wlan-ng to provide the prism2 kernel module (this driver has been part of the official kernel since 2.6.24).

So I downloaded the latest (0.2.8) version of linux-wlan-ng and went through the configure/make chore. Everything worked fine (on newer kernels, one has to change socket->mac to socket->mac_header) and I got the modules compiled. But when I tried to insert the modules into kernel space, modprobe told me the newly built kernel modules have an 'Invalid Format' and cannot be used.

This is rather interesting. After much research, the problem turned out to be a mismatch between the running kernel's config and the config file found at /usr/src/linux/.config. It appears the 2.6 kernel has a tighter requirement on how a kernel module must be built to be usable with the live running kernel.

Two things I learned helped me diagnose the problem:
1. The 2.6 kernel can expose a copy of the running kernel's config as /proc/config.gz. This is now a kernel configuration option; make sure you turn it on if you build a 2.6 kernel from source, it can be a life/time-saver.

2. modinfo. Running this command on a kernel module file shows detailed information about the module; in particular, the vermagic string shows the options used to build it. The options in this magic string must match between the module and the running kernel for the module to be usable. A mismatch here produces the mysterious 'Invalid Format' error I initially stumbled upon.

With this knowledge, it's simple to fix the problem:
1. emerge the correct kernel source (yum, aptitude, smart, etc. on other distros)
2. copy /proc/config.gz to /usr/src/linux and decompress it; rename config to .config
3. in /usr/src/linux, run make (it doesn't have to complete, just make sure include/linux/version.h is produced)
4. re-configure/make linux-wlan-ng
5. make sure modinfo of the newly built kernel module agrees with the existing modules of the live kernel
6. modprobe/insmod the prism2 module (it should work perfectly now)
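A quick sketch of the diagnosis, assuming the usual default paths (the module file names below are only examples, substitute whatever you just built and whatever stock module your kernel already ships):

zcat /proc/config.gz > /tmp/running.config
diff /tmp/running.config /usr/src/linux/.config

# compare the vermagic of the freshly built module with that of a module
# shipped with the running kernel
modinfo prism2_usb.ko | grep vermagic
modinfo /lib/modules/$(uname -r)/kernel/drivers/usb/storage/usb-storage.ko | grep vermagic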

Tuesday, February 19, 2008

Bayes probability model

The Bayes probability model is an important tool for inferring first stage (prior) probabilities in a multi-stage probability model. Typically it's applied to a 2 stage probability model where the 2nd stage is dependent on the events in the first stage.

Unlike a 2 part probability model, where events in either part are independent of events in the other part, a 2 stage probability model is more complicated because the 2nd stage probability depends on the 1st stage probability, so conditional probability applies. The Bayes probability model quantifies the relationship between the probabilities of different events in the model. It can be used to compute a 1st stage probability given the conditional probabilities of 2nd stage events.

A simple dependent probability for events E and F is: P(E and F) = P(E) P(F|E) (the multiplication law). It says the probability that both E and F happen is the product of the probability that E happens and the conditional probability that F happens given that E happens. It is often written as P(F|E) = P(E and F)/P(E).

The symmetry in this formulation can be exploited: P(E and F) = P(F) P(E|F) = P(E) P(F|E). Thus P(E) = P(F) P(E|F) / P(F|E): given the conditional probabilities P(E|F) and P(F|E), together with P(F), one can compute P(E).

Now let the first stage consist of events E_1, ..., E_n and let F_j be a 2nd stage event. Since P(E_i and F_j) = P(E_i) P(F_j|E_i), the chance that F_j happens in a 2 stage experiment is the sum over all first stage events: P(F_j) = sum(P(E_i and F_j), i = 1..n) = sum(P(E_i) P(F_j|E_i), i = 1..n).

Thus we can rewrite the symmetry formula as P(F_j) P(E_i|F_j) = P(E_i) P(F_j|E_i). This shows that given the 2nd stage probability P(F_j), we can compute the first stage conditional probability P(E_i|F_j) by rewriting it once more as:

P(E_i|F_j) = P(E_i) P(F_j|E_i) / P(F_j) = P(E_i) P(F_j|E_i) / [sum(P(E_k) P(F_j|E_k), k = 1..n)]

which is Bayes' theorem.
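A quick numeric illustration (numbers chosen arbitrarily): suppose the first stage has two events with P(E_1) = 0.3 and P(E_2) = 0.7, and the 2nd stage event F_1 has conditional probabilities P(F_1|E_1) = 0.8 and P(F_1|E_2) = 0.1. Then P(F_1) = 0.3*0.8 + 0.7*0.1 = 0.31, and the first stage conditional probability is P(E_1|F_1) = 0.3*0.8 / 0.31 = 0.24/0.31, roughly 0.77.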

Suse 10.2, how to load a kernel module during system boot automatically

Different Linux distros have different ways of loading kernel modules during system boot. The way it works in Suse Linux is by adding the module name to MODULES_LOADED_ON_BOOT in /etc/sysconfig/kernel. This configuration file is read by /etc/init.d/boot.loadmodules and the specified modules will be loaded during boot.
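For example, to load the prism2_usb module from the earlier post at boot (the module name is just an illustration, use whatever module you need):

# /etc/sysconfig/kernel
MODULES_LOADED_ON_BOOT="prism2_usb"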

Monday, February 18, 2008

makefile, dependencies, and single target multiple rules

To work with a large software system, a reliable and fast build system is one of the most important pieces. Without one, development can grind to a halt. Imagine if it takes 10-15 minutes to rebuild the software after only one or two changes. Unfortunately this happens more often than one might expect. GNU make provides a framework to construct a build system through makefiles. Most of us use it on a daily basis.

There is one important and interesting feature of makefiles that is often overlooked: automatic dependency generation and inclusion. This feature is enabled by the fact that gmake allows a single target to have multiple rules listing dependencies, as long as only one of those rules has an associated command. (Rules, dependencies, it's like lex/yacc, isn't it?)



objects = foo.o bar.o
foo.o : defs.h
bar.o : defs.h test.h
$(objects) : config.h



In this quoted example, foo.o depends on both defs.h and config.h.

Thus automatic dependency generation is typically done like this:

foo.o: f1.c f2.h
	gcc -c f1.c
dep:
	gcc -E -M -MF dep/foo.d -c f1.c
-include dep/foo.d

In this example, a dependency file dep/foo.d is generated and included in the makefile. Initially the include file does not exist, and an error would be reported without the '-' before include; '-' suppresses error reporting for gmake directives. The first time gmake runs through this makefile with 'make dep', it will generate the dependency file and compile f1.c. Alternatively, one can force dependency generation during 'make' as in the following example (if all is the first rule in this makefile), but since its effect is not immediately available when compiling foo.o it's generally considered bad practice. It forces regeneration of the dependency files every time during 'make all' (this can be somewhat alleviated by using make or shell conditionals). Unless a user of the makefile has a good reason to do so, avoid it:

all: dep foo.o
foo.o: f1.c f2.h
	gcc -c f1.c
dep:
	gcc -E -M -MF dep/foo.d -c f1.c
-include dep/foo.d

The various flags used in this example, taken from the gcc manual:

-E Stop after the preprocessing stage; do not run the compiler proper.
The output is in the form of preprocessed source code, which is
sent to the standard output.

Input files which don't require preprocessing are ignored.

-M Instead of outputting the result of preprocessing, output a rule
suitable for make describing the dependencies of the main source
file. The preprocessor outputs one make rule containing the object
file name for that source file, a colon, and the names of all the
included files, including those coming from -include or -imacros
command line options.

Unless specified explicitly (with -MT or -MQ), the object file name
consists of the basename of the source file with any suffix
replaced with object file suffix. If there are many included files
then the rule is split into several lines using \-newline. The
rule has no commands.

This option does not suppress the preprocessor's debug output, such
as -dM. To avoid mixing such debug output with the dependency
rules you should explicitly specify the dependency output file with
-MF, or use an environment variable like DEPENDENCIES_OUTPUT.
Debug output will still be sent to the regular output stream as
normal.

Passing -M to the driver implies -E, and suppresses warnings with
an implicit -w.

-MF file
When used with -M or -MM, specifies a file to write the dependen-
cies to. If no -MF switch is given the preprocessor sends the
rules to the same place it would have sent preprocessed output.

When used with the driver options -MD or -MMD, -MF overrides the
default dependency output file.



-M implies -E, but for illustration '-E' is explicitly specified in this example. This keeps the dependency-generation step fast, since that gcc command stops right after preprocessing is done.

The second time the make command is issued, the dependency files are all available and will participate in the dependency check of their targets. This is why with old software build systems we often see instructions such as: do 'make dep' first, then do 'make' (e.g. the pre-2.4 Linux kernel build system). The first step generates all the dependencies and the second step does the actual compilation. Of course, we can drop '-E' and '-M' and use '-MD' instead, which produces the dependency and object files in a single step. The following is taken from autoconf/make output using a 'Makefile' generated by the 'configure' command:

/bin/sh ../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I../include -I../include -I/usr/include -g -O2 -Werror -MT binarystring.lo -MD -MP -MF .deps/binarystring.Tpo -c -o binarystring.lo binarystring.cxx

g++ -DHAVE_CONFIG_H -I../include -I../include -I/usr/include -g -O2 -Werror -MT binarystring.lo -MD -MP -MF .deps/binarystring.Tpo -c binarystring.cxx -o binarystring.o


The dependency file foo.d will always depend on the source file foo.c from which it's generated. So whenever we update foo.c and issue 'make', a new dependency file will be generated. Voila, we now remember why we were required to do 'make dep' and then 'make' with this kind of makefile system.
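For completeness, here is a minimal sketch (not from the original post) of the single-step approach mentioned above: -MMD writes the .d file as a side effect of compilation, and -MP adds phony targets for the headers so that deleting a header does not break the build.

objects = foo.o bar.o

all: $(objects)

%.o: %.c
	gcc -MMD -MP -c $< -o $@

-include $(objects:.o=.d)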

References:
1. man gcc
2. http://www.gnu.org/software/make/manual/make.html#Multiple-Rules
3. http://www.gsp.com/cgi-bin/man.cgi?section=1&topic=mkdep
4. http://www.linuxdriver.net/make3/make3-CHP-8-SECT-3.html

Wednesday, February 13, 2008

Compilers Part 7, theoretical considerations of the parsing process

There are a few key concepts and data structures commonly utilized in a parser: the abstract syntax tree (AST)/parse tree/syntax tree, type and value stacks, action execution.

It's not initially obvious, but a parser is essentially a stack machine operating on an abstract syntax tree with post-order action execution. The AST (constructed by observing the predefined hierarchical shift/reduce rules embedded in the grammar during parsing) is traversed depth-first and the action associated with each node is executed post-order. Incidentally, a stack is a perfect data structure for such a depth-first traversal of a graph; another candidate is the deque. A deque allows stack operations at either end, so a stack can be considered a half-closed (usually head-closed) deque. By definition a stack operates in last in first out (LIFO) order, while a deque allows either FIFO or LIFO access.

The parse tree construction process explains why it's mandatory that the predefined grammar, assisted by associativity and precedence rules, cannot have any unresolved syntactic ambiguity, or at least the ambiguity must be such that only one parse tree can be constructed according to the grammar rules, as in the case of a GLR parser. A syntactic ambiguity translates into an undefined parse tree and does not constitute a proper grammar of the kind commonly implemented for programming languages (such grammars are in fact a relaxed form of regular grammar, since LR parsers allow production rules beyond the TERMINAL NONTERMINAL form).

The parsing process comes alive once it's understood that: 1) the AST or parse tree is constructed based on the hierarchical grammar (defined by the hierarchical structure and shift/reduce rules); 2) the parse tree is then traversed depth-first and terminals/non-terminals (productions) are evaluated post-order; 3) the result of each evaluation is pushed back onto the traversal stack, equivalent to tree leaves folding upward (try to visualize the content of an internal vertex being replaced by the result of evaluating its child vertices according to the user defined actions).

Interestingly, hierarchical (which implies a tree) grammars can also be used generatively, to produce strings of arbitrary length that conform to the production rules. Although the generation process can produce any such string, a parser effectively runs the generation process bound by its input so as to produce one and only one syntax tree. It is thus a deterministic finite state machine when considering only the start state (an empty syntax tree) and the end state (the syntax tree generated for the given input). This is not to be confused with the state transition process inside a GLR parser during parsing; a GLR parser can be a nondeterministic finite state machine.
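To make the "leaf folding" picture concrete, here is a minimal C++ sketch (not part of a real parser; the node layout and evaluate function are purely illustrative): an expression tree is walked with an explicit stack, and each internal node's action runs post-order, after its children have been reduced to values, much like a parser's value stack folding results upward.

#include <iostream>
#include <map>
#include <stack>

struct Node {
    char op;            // '+' or '*' for internal nodes, 0 for leaves
    int value;          // meaningful only for leaves
    Node *left, *right; // children, 0 for leaves
};

static int evaluate(Node *root) {
    std::stack<Node *> work;          // nodes still to be processed
    std::map<Node *, int> results;    // folded value of each finished subtree
    work.push(root);
    while (!work.empty()) {
        Node *n = work.top();
        if (n->op == 0) {                       // leaf: its value is final
            results[n] = n->value;
            work.pop();
        } else if (results.count(n->left) && results.count(n->right)) {
            // both children folded: run this node's action post-order
            int l = results[n->left], r = results[n->right];
            results[n] = (n->op == '+') ? l + r : l * r;
            work.pop();
        } else {                                // descend into children first
            work.push(n->left);
            work.push(n->right);
        }
    }
    return results[root];
}

int main() {
    // (1 + 2) * 3
    Node one = {0, 1, 0, 0}, two = {0, 2, 0, 0}, three = {0, 3, 0, 0};
    Node plus = {'+', 0, &one, &two};
    Node times = {'*', 0, &plus, &three};
    std::cout << evaluate(&times) << '\n'; // prints 9
    return 0;
}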

References
1. http://en.wikipedia.org/wiki/Formal_grammar

Flavors of Linux, the Gentoo distro

Suppose you have a computer with a 3 GB hard drive and 128 MB of memory. What do you do if you want to install a Linux distro on it? This kind of computer configuration is probably considered archaic and unusable. Even the mainstream Linux distros have trouble with it, including Fedora Core, openSUSE, and Ubuntu. Here comes Gentoo to the rescue.

The Gentoo quick installation guide shows that the user has full control over how the system is built. With a network connection established, the system can be built from a remote shell by executing a bunch of command line commands! This convenience is invaluable, considering that you don't have to sit next to a noisy box and stare at its screen. The base system uses only about 1 GB of hard drive space with most of the useful development tools installed.

The Gentoo 'emerge' system provides decent package management features that most other small Linux distros don't have. Therefore you pretty much get the best of both worlds (the all-in-one big desktop distro and the down-to-earth barebones small distro). The only drawback is that emerge usually needs to compile a package from source; the '-k' flag rarely works. :( But fortunately, for what it's good for, you shouldn't have to emerge big apps all the time.

One of the headaches of using Gentoo is what I call the 'dead patch' problem. Sometimes the ebuild info that came with a distro or portage snapshot does not get updated to reflect patch changes, and you end up with dead patch files that are no longer available from the Gentoo distfiles mirrors. In these rare situations, one has to create new ebuild files and update the package ebuild database. The steps are (a hypothetical session follows the list):

1. after the emerge failure, figure out a live patch file and note the difference between the dead patch number and the live patch number
2. cd /usr/portage/category/package-name/
3. cp deadpatch.ebuild livepatch.ebuild and make the necessary changes in livepatch.ebuild; as of the 2007.1 distro it is no longer necessary to modify the newly created livepatch.ebuild, since the ebuild system automatically deduces the patch number from the ebuild file name
4. issue the command: ebuild livepatch.ebuild digest (this will update the ebuild database)
5. redo emerge and repeat the process if there are remaining errors related to bad ebuild info
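A hypothetical session (the category, package, and version numbers are made up for illustration):

cd /usr/portage/net-wireless/linux-wlan-ng/
cp linux-wlan-ng-0.2.7.ebuild linux-wlan-ng-0.2.8.ebuild
ebuild linux-wlan-ng-0.2.8.ebuild digest
emerge linux-wlan-ng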

References:
1. http://www.gentoo.org/doc/en/gentoo-x86-quickinstall.xml
2. http://www.gentoo.org/doc/en/handbook/handbook-x86.xml?part=1&chap=10

Monday, February 11, 2008

Compilers Part 6, DSEL interpreter in a tcp/ip server

We are finally ready to port our interpreter into a tcp/ip server. DSEL stands for domain specific embedded language. In our project, we created a simple sql-like scripting language to directly manipulate data stored in memory.

There are a few technicalities, apart from the lexer/parser, involved in building a tcp/ip server. We use the TCP_server implementation from the STLplus library, which provides a non-blocking, poll based tcp server. Combined with forking or a boost thread, it can easily be adapted into a multiplexing tcp/ip server.

We introduce a new output string stream as our global (or per-thread if necessary) output buffer, because we no longer have yyout (stdout) to display results from the interpreter. The result of interpreting user input is put into the output string stream and sent back to the client. We simply rename the 'main' method in the bison grammar file to 'parse' and invoke it from the server's worker thread as the server processing function; it simply calls 'yyparse' to parse client input. Before the server calls parse, it calls set_yybuffer(TCP_connection & conn) to set up the flex in-memory string buffer, using the techniques discussed in the previous entry on compiler construction.

In set_yybuffer, we send the server output to the client, accept client input, and create a flex string buffer from it. Every time the client input is exhausted, yywrap is called: the current flex buffer is released and set_yybuffer is called again to read more input from the client. Because we use the poll based non-blocking tcp IO provided by STLplus, we don't have to worry much about synchronizing between client and server; the library provides a convenient interface to operate IO based on the socket status.

As usual, the complete listing including its makefile is posted here:


#ifndef MAP_H
#define MAP_H
#include <string>
#include <iostream>
#include <sstream>
#include <iomanip>
#include <map>

#include <boost/lambda/lambda.hpp>
typedef std::map<std::string, std::string> map_t;

#include "tcp.hpp"
extern int set_yybuffer(TCP_connection & );
extern int parse();
// a value type that describe the value of a symbol table entry
// Essentially the symbol table entry data structure
struct value{
unsigned char type; // the first letter of corresponding union type
union {
int ival;
float fval;
double dval;
char * sval;
map_t * mval;
} ivalue;
};

// The symbol table data structure
// Symbol table entry is keyed by the symbol string value
typedef std::map<std::string, value> symtab_t;

extern symtab_t symtab;
extern std::ostringstream os;
extern int yyerror(char *);
#endif

%{
#include "map.tab.h"
#include "map.h"

extern "C"{
#include <unistd.h>
#include <fcntl.h>
#include <time.h>
#include <string.h>
}
TCP_connection conn_yy;
std::string data;
std::ostringstream os;

bool report_err;
int lineno;
int tokenpos;
%}
D [0-9]
N {D}+
L [a-zA-Z]
A [a-zA-Z0-9]
ID ({L}{A}*)

%option yylineno
%%

select { tokenpos += yyleng; return SELECT; }
insert { tokenpos += yyleng; return INSERT; }
into { tokenpos += yyleng; return INTO; }
from { tokenpos += yyleng; return FROM; }
create { tokenpos += yyleng; return CREATE; }
table { tokenpos += yyleng; return TABLE; }
list { tokenpos += yyleng; return LIST; }
where { tokenpos += yyleng; return WHERE; }
key { tokenpos += yyleng; return KEY; }
value { tokenpos += yyleng; return VALUE; }
quit { tokenpos += yyleng; return QUIT; }

${ID} { tokenpos += yyleng; yylval.text = strdup(yytext+1); return OBJECT; }
{ID} { tokenpos += yyleng; yylval.text = strdup(yytext); return TEXT; }
[ \t] { tokenpos += yyleng; /* ignore white space */ }
. { tokenpos += yyleng; return yytext[0]; }
\n { return '\n'; }

%%

int yyerror(char * s){
extern int yylineno;
os << yylineno << " : " << s << " at \n" << data;
for(int i = 0; i < tokenpos; i ++) os << ' ';
os << "^\n";
return 0; // bison expects yyerror to return int
}

YY_BUFFER_STATE cur_buffer;

int set_yybuffer(TCP_connection & conn){
conn_yy = conn;
tokenpos = 0;
int ntry = 0; // time out after 120 seconds

while(!conn.send_ready(100000)) ;
std::string send_data = os.str();
std::cout << "send to client: " << send_data;
if(!conn.send(send_data)) return 1;
os.str("");

while(!conn.receive_ready(100000) && ntry ++ < 1200) ;
data = "";
if(ntry >= 1200 || !conn.receive(data)) return 1;
os << data;

std::cout << "analyze: " << data;
cur_buffer = yy_scan_string(data.c_str());

return 0;
}

int yywrap(){
yy_delete_buffer(cur_buffer);
return set_yybuffer(conn_yy);
}

%{
extern "C"{
#include <stdio.h>
#define YYDEBUG 1
}
extern int yyerror(char *);
extern int yylex();

#include "map.h"

symtab_t symtab;
bool where_by_key = false;
bool where_by_value = false;
std::string tablename;
std::string where;
%}

%union{
char * text;
}

%token INSERT SELECT INTO TEXT OBJECT FROM CREATE TABLE LIST
%token WHERE KEY VALUE
%token QUIT
%%

statements: statements statement
| statement
;
statement: insert_stmt opt_semicolon '\n'
| select_stmt opt_semicolon '\n'
| create_stmt opt_semicolon '\n'
| assign_stmt opt_semicolon '\n'
| list_stmt opt_semicolon '\n'
| QUIT opt_semicolon '\n'
| '\n'
| error '\n' { yyclearin; yyerrok; }
;
opt_semicolon:
| ';'
;
assign_stmt:
OBJECT '=' TEXT
{
// string variable assignment
symtab_t::iterator it = symtab.find($1);
if(it != symtab.end() && it->second.type == 's') // Symbol found and type is correct
it->second.ivalue.sval = $3;
else{ // New symbol, add to symbol table
value v;
v.ivalue.sval = $3;
v.type = 's';
symtab[$1] = v;
}
}
;
create_stmt:
CREATE TABLE OBJECT
{
// Create a new dictionary
std::string symbol = $3;
symtab_t::iterator it = symtab.find(symbol);
if(it != symtab.end() && it->second.type == 'm'){ // Symbol found and type is correct
os << "symbol: " << symbol << " already exists\n";
}else{ // New symbol, create new map(table), add to symbol table
value v;
v.ivalue.mval = new(map_t);
v.type = 'm';
symtab[symbol] = v;
}
}
;
insert_stmt:
INSERT INTO OBJECT '(' TEXT ',' TEXT ')'
{
// insert key, value pair into an existing dictionary
symtab_t::const_iterator it = symtab.find($3);
if(it != symtab.end() && it->second.type == 'm'){ // Symbol found and type is correct
(*(it->second.ivalue.mval))[std::string($5)] = std::string($7);
}else
os << "unknown symbol: " << $3 << " create first\n";
}
;
select_stmt: simple_select_stmt
{
// go through all key, value pair of a dictionary
symtab_t::const_iterator it = symtab.find(tablename);
if(it != symtab.end() && it->second.type == 'm'){
map_t::const_iterator mit = it->second.ivalue.mval->begin();
for(; mit != it->second.ivalue.mval->end(); ++ mit)
os << "key = " << mit->first << ' '
<< "value = " << mit->second << '\n';
}else
os << "invalid object\n";

}
| simple_select_stmt opt_where_stmt
{
// go through all key, value pair of a dictionary
// based on where criteria, search by key or value
symtab_t::const_iterator it = symtab.find(tablename);
if(it != symtab.end() && it->second.type == 'm'){
map_t::const_iterator mit = it->second.ivalue.mval->begin();
for(; mit != it->second.ivalue.mval->end(); ++ mit)
if( (where_by_key && mit->first == where) ||
(where_by_value && mit->second == where) ||
(!where_by_key && !where_by_value) )
os << "key = " << mit->first << ' '
<< "value = " << mit->second << '\n';
}else
os << "invalid object\n";

where_by_key = where_by_value = false;

}
;
simple_select_stmt:
SELECT '*' FROM OBJECT { tablename = $4; }
;
opt_where_stmt: WHERE KEY '=' TEXT { where_by_key = true; where = $4; }
| WHERE VALUE '=' TEXT { where_by_value = true; where = $4; }
;
list_stmt:
LIST
{
// Dump the entire symbol table
// For dictionaries, dump all key, value pairs as well
//
// Iterate through the symbol table
symtab_t::const_iterator it = symtab.begin();
for(; it != symtab.end(); ++it){
os << "symbol: " << it->first << ' ' << it->second.type << '\n';
switch(it->second.type){
case 's': os << "value = " << it->second.ivalue.sval << '\n';
break;
case 'm': {
// iterate through the dictionary
map_t::const_iterator mit = it->second.ivalue.mval->begin();
for(; mit != it->second.ivalue.mval->end(); ++ mit)
os << "key = " << mit->first << ' '
<< "value = " << mit->second << '\n';
}
break;
default:
os << "Unknown data type\n";
break;
}
}
}
;
%%

int parse(){
extern int yydebug;
yydebug = 0;
return yyparse();
}

#include <vector>
#include <iostream>
#include <algorithm>
#include <functional>

#include "map.h"

#include "tcp.hpp"
#include "fileio.hpp"
#include "debug.hpp"
using namespace std;

int main (int argc, char* argv[])
{
DEBUG_TRACE;
if (argc != 2)
ferr << "usage: " << argv[0] << " <port>" << endl;
else
{
// create a server listening on the port specified by
// command argument 1
TCP_server main_server((unsigned short)atoi(argv[1]), 5);
// test to see if the connection completed OK within the timeout
if (!main_server.initialised())
{
ferr << "server failed to initialise" << endl;
return -1;
}
if (main_server.error())
{
ferr << "server initialisation failed with error " << main_server.error() << endl;
return -1;
}
while(!main_server.connection_ready(1000000)) ;
TCP_connection server = main_server.connection();

std::cout << "Got a new connection.\n";
if(!set_yybuffer(server))
parse();
}
}




A few notes: the program lacks rigorous memory management; there are memory leaks associated with the strdup usage (the fix is simple: free the strings in the grammar action code). The server does not shut down cleanly without proper QUIT action code, and a production version would turn the server into a multiplexing server and add proper synchronization on shared objects. These things are important for a real world application but they are not the focus of our project.

We have successfully created a DSEL interpreter living inside a tcp/ip server, utilizing the powerful C++ STL. This is a good starting point for implementing more robust and useful server side DSEL interpreters.
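As a quick smoke test (assuming the binary is still called map as in the earlier post, and using 5000 as an arbitrary example port), one can talk to the server with a line oriented client such as netcat:

./map 5000              # start the interpreter server
nc localhost 5000       # in another terminal, connect and type commands:
create table $sss;
insert into $sss (abc, efg)
select * from $sss;
quit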

Friday, February 08, 2008

Bash Programming Cheat Sheet (Repost)

Help File Library: Bash Programming Cheat Sheet

Written By: ph34r

A quick cheat sheet for programmers who want to do shell scripting. This is not intended to teach programming, etc., but it is intended for someone who knows one programming language and wants to begin learning about bash scripting.

Basics

All bash scripts must tell the o/s what to use as the interpreter. The first line of any script should be:
#!/bin/bash

You must make bash scripts executable.
chmod +x filename

Variables

Create a variable - just assign a value. Variables are not datatyped (a variable can hold strings, numbers, etc. without being declared as such).
varname=value

Access a variable by putting $ on the front of the name
echo $varname

Values passed in from the command line as arguments are accessed as $N, where N is the position of the argument in the list of values being passed in. This list is base 1, not base 0 ($0 holds the command name).
command var1 var2 var3 .... varX
$1 contains whatever var1 was, $2 contains whatever var2 was, etc.

Built in variables:

Variable Use
$1-$N Stores the arguments (variables) that were passed to the shell program from the command line.
$? Stores the exit value of the last command that was executed.
$0 Stores the first word of the entered command (the name of the shell program).
$* Stores all the arguments that were entered on the command line ($1 $2 ...).
"$@" Stores all the arguments that were entered on the command line, individually quoted ("$1" "$2" ...).

Quote Marks
Regular double quotes ("like these") make the shell ignore whitespace and count it all as one argument being passed or string to use. Special characters inside are still noticed/obeyed.

Single quotes 'like this' make the interpreting shell ignore all special characters in whatever string is being passed.

The back single quote marks (`command`) perform a different function. They are used when you want to use the results of a command in another command. For example, if you wanted to set the value of the variable contents equal to the list of files in the current directory, you would type the following command: contents=`ls`, the results of the ls program are put in the variable contents.

Logic and comparisons
A command called test is used to evaluate conditional expressions, such as an if-then statement or the entry/exit criteria of a loop.

test expression
Or
[ expression ]

Numeric Comparisons
int1 -eq int2 Returns True if int1 is equal to int2.
int1 -ge int2 Returns True if int1 is greater than or equal to int2.
int1 -gt int2 Returns True if int1 is greater than int2.
int1 -le int2 Returns True if int1 is less than or equal to int2
int1 -lt int2 Returns True if int1 is less than int2
int1 -ne int2 Returns True if int1 is not equal to int2


String Comparisons
str1 = str2 Returns True if str1 is identical to str2.
str1 != str2 Returns True if str1 is not identical to str2.
str Returns True if str is not null.
-n str Returns True if the length of str is greater than zero.
-z str Returns True if the length of str is equal to zero. (zero is different than null)


File Comparisons
-d filename Returns True if file, filename is a directory.
-f filename Returns True if file, filename is an ordinary file.
-r filename Returns True if file, filename can be read by the process.
-s filename Returns True if file, filename has a nonzero length.
-w filename Returns True if file, filename can be written by the process.
-x filename Returns True if file, filename is executable.


Expression Comparisons
!expression Returns true if expression is not true
expr1 -a expr2 Returns True if expr1 and expr2 are true. ( && , and )
expr1 -o expr2 Returns True if expr1 or expr2 is true. ( ||, or )


Logic Con't.

If...then

if [ expression ]
then
commands
fi

If..then...else

if [ expression ]
then
commands
else
commands
fi

If..then...else If...else


if [ expression ]
then
commands
elif [ expression2 ]
then
commands
else
commands
fi

Case select

case string1 in
str1)
commands;;
str2)
commands;;
*)
commands;;
esac

string1 is compared to str1 and str2. If one of these strings matches string1, the commands up until the double semicolon (;;) are executed. If neither str1 nor str2 matches string1, the commands associated with the asterisk are executed. This is the default case condition because the asterisk matches all strings.

Iteration (Loops)


for var1 in list
do
commands
done

This executes once for each item in the list. This list can be a variable that contains several words separated by spaces (such as output from ls or cat), or it can be a list of values that is typed directly into the statement. Each time through the loop, the variable var1 is assigned the current item in the list, until the last one is reached.
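For example, a loop over all .txt files in the current directory (the pattern is just an illustration):

for f in *.txt
do
echo "$f"
done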


while [ expression ]
do
commands
done

until [ expression ]
do
commands
done

Functions

Create a function:

fname(){
commands
}

Call it by using the following syntax: fname

Or, use arguments inside a function. Bash functions do not declare named parameters; whatever arguments the caller passes are available inside the function body as $1, $2, ... $N:

fname2(){
commands using $1, $2, ... $N
}

And call it with: fname2 arg1 arg2 ... argN

Thursday, February 07, 2008

Compilers Part 5, working with in memory data buffers

In the previous entries, we were able to set up the spec for the scripting language. To port the interpreter into a tcp/ip server, the first task is to allow the lexer to work with in-memory data buffers instead of stdin and stdout. The reason is simple: user input will come from a network client, and there are subtle differences between a network socket and stdin.

Fortunately, flex provides several interfaces for setting up an in-memory data buffer as token input. The following lex source code demonstrates how to use the relevant interface:


%{
extern "C"{
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
}
#include <iostream>
#include <sstream>
#include <fstream>
#include <string>
#include <vector>
#include <algorithm>
using namespace std;

unsigned int line = 0;
std::vector<std::string> text;

%}

extern int yywrap();
%%

\/\/.*$ { cout << "comment: " << yytext; }
.|\n ;
%%

YY_BUFFER_STATE cur_buffer;
int main(int argc, char * argv[]){

cout << argv[1] << '\n';
ifstream ifs(argv[1]);

char buf[256];
int len;
while(ifs.good()){
memset(buf, 0, 256);
ifs.getline(buf, 254);
len = strlen(buf);
buf[len] = '\n';
text.push_back(buf);
cout << buf;
}
cout << "\nlines read: " << text.size() << '\n';

cur_buffer = yy_scan_string(text[line].c_str());
extern int yylex();
yylex();

return 0;
}

int yywrap(){

yy_delete_buffer(cur_buffer);
line ++;                                   // move on to the next line first,
if(line >= text.size()) return 1;          // otherwise the first line is scanned twice
cur_buffer = yy_scan_string(text[line].c_str());
return 0;
}



yywrap gets called by yylex whenever an input buffer is exhausted. If yywrap returns 1, yylex returns; therefore it's a common technique to set up the next available data buffer and return 0 to let yylex continue processing, as done in this example.

References:
1. http://flex.sourceforge.net/manual/Multiple-Input-Buffers.html#Multiple-Input-Buffers

Friday, February 01, 2008

Compilers Part 4, Symbol tables

Continuing from the last entry, we will add a symbol table to our little sql parser that interacts with in-memory dictionaries. We have chosen to support dictionaries of the type std::map<std::string, std::string> because, coupled with serialization/deserialization techniques, a dictionary is well suited to describe a hierarchical organization of data objects. Theoretically, all things can be represented by strings in a Turing machine. We may cover automatic type deduction and more generic (in terms of coding convenience) solutions in future entries on symbol tables.

The reason a map is used instead of a plain std::set is coding convenience: for std::set we would have to supply user defined ordering functions to make it work properly. Since we know that symbol table entries are always keyed by the symbol name, a straightforward std::string key type works well. Our symbol table itself is a std::map with key_type std::string and value_type value. struct value is user defined in map.h and contains a union ivalue in which we hold the multiple types our scripting language supports. One of the ivalue members is the dictionary type std::map, typedef'd as map_t. Our mini sql language revolves around manipulating this type of data structure.

We will also make significant enhancements to the syntax of our little scripting language, allowing greater freedom in dictionary creation and object manipulation. Let's first see a transcript of the interpreter in action:

./map
create table $sss;
list
symbol: sss m
insert into $sss (abc, efg)
list
symbol: sss m
key = abc value = efg
insret into $sss( ddd, hik)
4 : syntax error at
insret into $sss( ddd, hik)
^
insert into $sss (ddd, hik)
Enter another command
list
symbol: sss m
key = abc value = efg
key = ddd value = hik
select * from $sss;
key = abc value = efg
key = ddd value = hik
select * from $sss where key = ddd
key = ddd value = hik


We have added more reserved keywords and more grammar rules to enhance the scripting language. Here is the complete code listing:


#ifndef MAP_H
#define MAP_H
#include <string>
#include <iostream>
#include <map>

#include <boost/lambda/lambda.hpp>
typedef std::map<std::string, std::string> map_t;

// a value type that describe the value of a symbol table entry
// Essentially the symbol table entry data structure
struct value{
unsigned char type; // the first letter of corresponding union type
union {
int ival;
float fval;
double dval;
char * sval;
map_t * mval;
} ivalue;
};

// The symbol table data structure
// Symbol table entry is keyed by the symbol string value
typedef std::map<std::string, value> symtab_t;

extern int yyerror(char *);
#endif


%{
#include "map.tab.h"
#include "map.h"

extern symtab_t symtab;
std::string line;
bool report_err;
int lineno;
int tokenpos;
%}
D [0-9]
N {D}+
L [a-zA-Z]
A [a-zA-Z0-9]
ID ({L}{A}*)
%%

select { tokenpos += yyleng; return SELECT; }
insert { tokenpos += yyleng; return INSERT; }
into { tokenpos += yyleng; return INTO; }
from { tokenpos += yyleng; return FROM; }
create { tokenpos += yyleng; return CREATE; }
table { tokenpos += yyleng; return TABLE; }
list { tokenpos += yyleng; return LIST; }
where { tokenpos += yyleng; return WHERE; }
key { tokenpos += yyleng; return KEY; }
value { tokenpos += yyleng; return VALUE; }

${ID} { tokenpos += yyleng; yylval.text = strdup(yytext+1); return OBJECT; }
{ID} { tokenpos += yyleng; yylval.text = strdup(yytext); return TEXT; }
[ \t] { tokenpos += yyleng; /* ignore white space */ }
. { tokenpos += yyleng; return yytext[0]; }
\n.* { report_err = true; tokenpos = 0; line = yytext+1; yyless(1); lineno++; return '\n'; }

%%

int yyerror(char * s){
if(report_err){
std::cout << lineno << " : " << s << " at \n" << line << '\n';
printf("%*s\n", 1+tokenpos, "^");
}
return 0; // bison expects yyerror to return int
}

%{
extern "C"{
#include <stdio.h>
#define YYDEBUG 1
}
extern int yyerror(char *);
extern int yylex();

#include "map.h"

symtab_t symtab;
bool where_by_key = false;
bool where_by_value = false;
std::string tablename;
std::string where;
%}

%union{
char * text;
}

%token <text> INSERT SELECT INTO TEXT OBJECT FROM CREATE TABLE LIST
%token <text> WHERE KEY VALUE
%%

statements: statements statement
| statement
;
statement: insert_stmt opt_semicolon '\n'
| select_stmt opt_semicolon '\n'
| create_stmt opt_semicolon '\n'
| assign_stmt opt_semicolon '\n'
| list_stmt opt_semicolon '\n'
| '\n'
| error '\n' { yyclearin; yyerrok; std::cout << "Enter another command\n"; }
;
opt_semicolon:
| ';'
;
assign_stmt:
OBJECT '=' TEXT
{
// string variable assignment
symtab_t::iterator it = symtab.find($1);
if(it != symtab.end() && it->second.type == 's') // Symbol found and type is correct
it->second.ivalue.sval = $3;
else{ // New symbol, add to symbol table
value v;
v.ivalue.sval = $3;
v.type = 's';
symtab[$1] = v;
}
}
;
create_stmt:
CREATE TABLE OBJECT
{
// Create a new dictionary
std::string symbol = $3;
symtab_t::iterator it = symtab.find(symbol);
if(it != symtab.end() && it->second.type == 'm'){ // Symbol found and type is correct
std::cerr << "symbol: " << symbol << " already exists\n";
}else{ // New symbol, create new map(table), add to symbol table
value v;
v.ivalue.mval = new(map_t);
v.type = 'm';
symtab[symbol] = v;
}
}
;
insert_stmt:
INSERT INTO OBJECT '(' TEXT ',' TEXT ')'
{
// insert key, value pair into an existing dictionary
symtab_t::const_iterator it = symtab.find($3);
if(it != symtab.end() && it->second.type == 'm'){ // Symbol found and type is correct
(*(it->second.ivalue.mval))[std::string($5)] = std::string($7);
}else
std::cerr << "unknown symbol: " << $3 << " create first\n";
}
;
select_stmt: simple_select_stmt
{
// go through all key, value pair of a dictionary
symtab_t::const_iterator it = symtab.find(tablename);
if(it != symtab.end() && it->second.type == 'm'){
map_t::const_iterator mit = it->second.ivalue.mval->begin();
for(; mit != it->second.ivalue.mval->end(); ++ mit)
std::cout << "key = " << mit->first << ' '
<< "value = " << mit->second << '\n';
}else
std::cerr << "invalid object\n";

}
| simple_select_stmt opt_where_stmt
{
// go through all key, value pair of a dictionary
// based on where criteria, search by key or value
symtab_t::const_iterator it = symtab.find(tablename);
if(it != symtab.end() && it->second.type == 'm'){
map_t::const_iterator mit = it->second.ivalue.mval->begin();
for(; mit != it->second.ivalue.mval->end(); ++ mit)
if( (where_by_key && mit->first == where) ||
(where_by_value && mit->second == where) ||
(!where_by_key && !where_by_value) )
std::cout << "key = " << mit->first << ' '
<< "value = " << mit->second << '\n';
}else
std::cerr << "invalid object\n";

where_by_key = where_by_value = false;

}
;
simple_select_stmt:
SELECT '*' FROM OBJECT { tablename = $4; }
;
opt_where_stmt: WHERE KEY '=' TEXT { where_by_key = true; where = $4; }
| WHERE VALUE '=' TEXT { where_by_value = true; where = $4; }
;
list_stmt:
LIST
{
// Dump the entire symbol table
// For dictionaries, dump all key, value pairs as well
//
// Iterate through the symbol table
symtab_t::const_iterator it = symtab.begin();
for(; it != symtab.end(); ++it){
std::cout << "symbol: " << it->first << ' ' << it->second.type << '\n';
switch(it->second.type){
case 's': std::cout << "value = " << it->second.ivalue.sval << '\n';
break;
case 'm': {
// iterate through the dictionary
map_t::const_iterator mit = it->second.ivalue.mval->begin();
for(; mit != it->second.ivalue.mval->end(); ++ mit)
std::cout << "key = " << mit->first << ' '
<< "value = " << mit->second << '\n';
}
break;
default:
std::cerr << "Unknown data type\n";
break;
}
}
}
;
%%

int main(){
extern int yydebug;
yydebug = 0;
yyparse();
}




We do not provide symlookup-style functions because it's fairly straightforward to find a symbol table entry with our symtab_t data structure, and the returned iterator is of immediate use in the action code. The complex object chaining in the action code may seem a little daunting at first, but it's just the regular C++ STL way of getting/setting std::map data. We also let the STL handle all the memory issues, computing efficiency (std::map is typically implemented as a red-black tree, a kind of balanced binary search tree with very decent O(log N) insertion, search, and deletion), and APIs. This is primarily why we would like to get lex and yacc to work with C++.

Perhaps most telling is the 'list' command covered by the list_stmt grammar rule, whose action code performs a complete dump of the entire symbol table. We first iterate over the symbol table data structure; upon seeing a map_t type, we also iterate over all key, value pairs of that symbol. This code illustrates the data structures we use to contain the data.