Meditation, The Art of Exploitation

Thinking? At last I have discovered it--thought; this alone is inseparable from me. I am, I exist--that is certain. But for how long? For as long as I am thinking. For it could be, that were I totally to cease from thinking, I should totally cease to exist....I am, then, in the strict sense only a thing that thinks.

Thursday, November 15, 2007

Trier Tree distilled

If you walk into any interview that deals with text processing. You will inevitably be asked a question that involves a Trier Tree (TT). A TT is a special tree with its edge marked by identification symbol and its vertex (node) marked by success/failure of predetermined condition.

Take the typical definition of a TT whose edge is a alphabet letter and vertex indicating if a word can be found in a dictionary connecting all edge letters from root to here. This is the most common use of a TT. Another scenario involves having edges marked by numeric letter and node indicating if a telephone number can be found in a phone book connecting all edge numbers from root to here.

It can be summarized that a TT is a tree to describe a structure of composition from elemental building blocks to some known super structures. Symbolically, let E{x | x belongs to E} denote elemental building blocks and S{y | y belongs to S} denote super structures. All edge symbols of TT belongs to E and connecting edge symbols along a path results in c that either belongs to S {y} or not.

A easy way to understand TT is to think of it as if it's composed of many paths, some path terminates with a success symbol and some path terminates with a failure symbol.

Take this example, banana, we know that ban and banana are both valid words by looking up a dictionary. R is root, F is failure, S is success, ---(alpha)--- is a edge. According to the dictionary we construct a TT such as following:

R---b--->F---a--->F---n--->S---a--->F---n--->F---a--->S

Now given the challedge 'is ba a word?', by following such a TT, we know that it ends up with a vertex symbol F indicating 'ba' is not a word.

More complicated TT can be built by following such a principle, first there is a dictionary of all super structures S that can be broken down into elemental building blocks E. Start from root R, create a path for each y in S using its building blocks x in E as edge symbols, mark each vertex on its path F if it's a new vertex and mark its ending vertex S. After creating such a TT, then it can be used to verify any y in S in O(length(y)) time.