Building tokenizers with Ragel

Wednesday, April 10, 2013, at 05:36PM

By James Dabbs

One of the problems we've been wrestling with lately here at Emcien is finding patterns in network traffic (with an eye towards detecting intrusion attempts). We've exerted considerable effort to make our pattern detection engine run as fast as possible, but some recent profiling showed that significantly more time was being spent tokenizing the input than analyzing it. I've been building a new tokenization toolchain using Ragel and wanted to share a little of my experience by walking through the design of a simple Ragel parser that converts Apache log lines to JSON.

This article assumes a passing familiarity with finite state machines (FSMs) and regular expressions. Those of you who prefer code to text can check out the finished product on Github.

Why Ragel?

I had two main design constraints for this tokenizer project:

  • Make it fast
  • Make it work consistently in the engine (C), app (Rails) and on the client-side (Backbone)

Ragel works by defining a state machine directly and then using that definition to generate code in any of several host languages - C, D, Go, Java and Ruby are supported out of the box, and there are projects to allow generation in other languages, including JavaScript. That nicely covers the second point, with the incidental benefit that each tokenizer is essentially a declarative specification of the input it expects to see.
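
Concretely, switching targets is just a matter of a flag when you invoke ragel - the file names below are placeholders, not from this project:

    ragel tokenizer.rl -o tokenizer.c       # C is the default target
    ragel -D tokenizer.rl -o tokenizer.d    # D
    ragel -J tokenizer.rl -o tokenizer.java # Java
    ragel -R tokenizer.rl -o tokenizer.rb   # Ruby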

As for the first point - speed - Ragel exposes code generation strategies that can produce some really fast C code, as we'll see shortly.
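
For the C target in particular, the -T, -F and -G flag families pick the style of the generated code, roughly trading code size for speed (file name again a placeholder); -G2 is the one we'll reach for later:

    ragel -T0 tokenizer.rl -o tokenizer.c   # table driven (the default)
    ragel -F1 tokenizer.rl -o tokenizer.c   # flat table driven
    ragel -G2 tokenizer.rl -o tokenizer.c   # "really fast" goto driven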

How it works

Broadly, a Ragel program consists of two pieces: an abstract machine specification, and host-language code that embeds the generated machine.

The input specification

We'll want to generate a machine that can read lines like this (clearly old) log entry:

71.109.86.22 - - [19/Apr/2006:02:27:44 -0700] "GET /bg/henna-take1.jpg HTTP/1.1" 200 133028 "http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&friendid=6980420" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"

Here's the specification I ended up writing:

    %%{
      machine clf;
      action mark { mark(); }

      host    = [0-9\.]+       >mark %{ emit("host");    };
      user    = alpha+         >mark %{ emit("user");    };
      date    = [^\]]+         >mark %{ emit("date");    };
      request = [^"]+          >mark %{ emit("request"); };
      status  = digit+         >mark %{ emit("status");  };
      size    = (digit+ | '-') >mark %{ emit("size");    };
      url     = [^"]*          >mark %{ emit("url");     };
      agent   = [^"]*          >mark %{ emit("agent");   };

      line = (
        host            space
        '-'             space
        '-'             space
        '[' date ']'    space
        '"' request '"' space
        status          space
        size            space
        '"' url '"'     space
        '"' agent '"'   '\n'
      );
    }%%

Let's step through that -

The %%{ ... }%% delimiters mark this off as a block of Ragel directives, and machine clf declares the clf (common log format) machine and allows us to include it in other machines later.

Code in a {} block is embedded code in the host language, so action mark { mark(); } declares the mark action, but delegates its actual implementation to whatever code we end up embedding this machine in.

Ragel features several simple built-in machines and knows how to generate a machine from a basic regex. host = [0-9\.]+ defines a machine named host that accepts a string of one or more digits or periods (you could, of course, specify this more carefully). The other named machines are similar (recall that e.g. [^"] matches anything except a ").
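
For instance, if we wanted to pin host down to dotted quads, a sketch like this would do (it's not part of the original spec, just an illustration):

    # A stricter host: exactly four groups of one to three digits
    octet = digit{1,3};
    host  = (octet ('.' octet){3}) >mark %{ emit("host"); };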

The interesting part of the simple machine definitions is the >mark %{ emit("type"); }. > and % define actions to take on certain state transitions - in this case, marking the parser's position when entering (>) one of these machines and emitting the parsed token upon leaving (%). Again, the { emit("type"); } block is just code in the host language; we'll have to define an emit function in every program we write using this machine - but fortunately, that syntax works for calling functions in C, Ruby and JavaScript.
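
Besides > and %, Ragel has a few other embedding operators. Here's a reference sketch using inline action blocks (trace is an assumed helper, not something this post defines):

    word = alpha+ >{ trace("entering");  }   # on the first character of the match
                  ${ trace("each char"); }   # on every character of the match
                  %{ trace("leaving");   };  # on the transition out of the final state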

line is a compound state machine built up from the simpler ones we've just defined. The default operation is concatenation, so line just looks for these parts one after the other.

Fun aside: Ragel integrates nicely with Graphviz - running ragel -Vp dumps the machine we just built as a dot graph you can render.

Now that we've declared a machine, we should write a program to take it for a spin:

A Ruby debugger

In fairness, nailing down your input format can be a delicate and error-prone process. I found it helpful to write a Ruby tokenizer first so that I could stop and introspect the parser's state at the point of failure. Here's what that might look like:

    %%{
      machine debugger;
      include clf "clf.rl";

      variable data @data;
      variable p    @p;

      main := line $!{ error };
    }%%

%%{ ... }%% delimits a block of Ragel directives (a bare %% prefixes a single-line directive, which we'll use inside the class below). Here we're defining a new main machine which includes our previously defined line and adds a call to the error function if the parser hits an error in any state ($!). Internally, Ragel uses the variable data for the text to parse and p for the index of the character currently being read. We want these to be instance variables on a Tokenizer class, and fortunately the variable directive lets us tell the generated code which names to use. Here's the class definition:

    # Needed by the debugging helpers below: colorize provides the String color
    # methods (red, light_red, ...), pry provides binding.pry
    require 'colorize'
    require 'pry'

    class Tokenizer
      def initialize path
        @path = path
        # This writes the actual state transitions into the generated .rb file
        %% write data;
        #% # Fix syntax highlighting
      end

      def run
        File.foreach @path do |line|
          # Store the current line for later reference
          @line  = line
          # Ragel expects data to be an array of integers and needs a little 
          # extra bookkeeping to know where to start and stop parsing
          @data  = line.unpack 'c*'
          @p, pe = 0, line.length
          eof    = pe

          # This writes the actual machine execution code into the file - it
          # will consume input and call `mark` and `emit` when applicable
          %% write init;
          %% write exec;
        end
      end

      def mark
        # Simply store the current position
        @ts = @p
      end

      def emit type
        # Print the type and text of the last read token
        puts "#{type}: #{@data[@ts...@p].pack('c*')}"
      end

      def line
        # Print the current line, highlighting the current character
        pre, curr, post = @line[0 ... @p], @line[@p], @line[@p+1 .. -1]
        highlight = curr =~ /\s/ ? curr.on_light_red : curr.light_red
        pre + highlight + post
      end

      def error
        # Print out the erroneous line, and halt so we can inspect the state of
        # the machine
        puts "ERROR: ".red + line
        binding.pry && raise  # `exit` in pry resumes parsing; `exit 1` makes this raise and halt
      end
    end

    Tokenizer.new(ARGV.first).run

Run

ragel -R <filename> -o debugger

to interpolate these directives into actual Ruby source code. You'll be doing this a lot, so I'd recommend wrapping the command up in a Makefile. Once the code is generated, you can run it with ruby debugger /path/to/infile. Now if we try to parse some unexpected input, the offending line gets printed with the failing character highlighted, and we're dropped into a pry session to poke around. Very convenient.

How are we doing on speed?

    $ time ruby debugger log.100k > out

    real  0m47.341s
    user  0m47.098s
    sys 0m0.201s

About 2100 lines/s. Tolerable for small files. Ragel does have the -T1 optimization option for Ruby code, but we can do better.

A C binary

The C program is fairly similar to the Ruby one, just with more fiddly string details and less robust error handling. In this case, we'll be outputting the results as a JSON-formatted array of objects:

    #include <stdio.h>
    #include <string.h>

    #define MAX_LINE_LENGTH 4096
    char inbuffer[MAX_LINE_LENGTH], outbuffer[MAX_LINE_LENGTH];

    %%{
      machine parser;
      include clf "clf.rl";
      main := line;
      write data;
    }%%

    // Declare these in global scope so that the machine and our action
    // definitions can access them.
    int cs;
    char *p, *pe, *ts;

    // Mark is similar, but C has no notion of instance variables
    // To emit, we write this key:value pair to our output buffer
    static inline void mark() { ts = p; }
    static inline void emit(char *type) {
      *p = '\0';
      char fmtbuffer[MAX_LINE_LENGTH];
      sprintf(fmtbuffer, "\"%s\":\"%s\",", type, ts);
      strcat(outbuffer, fmtbuffer);
    }

    int main() {
      int lines = 0;

      fputc('[', stdout); // Start the list
      while(fgets(inbuffer, MAX_LINE_LENGTH, stdin) != NULL) {
        // Start writing the JSON object to the buffer 
        // Include a leading comma if we need to separate it from its predecessor
        if (lines) { 
          strcpy(outbuffer, ",\n{");
        } else {
          strcpy(outbuffer, "{");
        }

        // Start the machine running on the input buffer
        p  = inbuffer;
        pe = inbuffer + strlen(inbuffer);
        ts = p;
        %% write init;  
        %% write exec;

        // Finalize the object, chomping off the last comma
        int len = strnlen(outbuffer, MAX_LINE_LENGTH);
        outbuffer[len - 1] = '\0';
        fprintf(stdout, "%s}", outbuffer);
        lines ++;
      }
      fprintf(stdout, "]\n");
      return 0;
    }

Easy, modulo C's usual string-handling headaches. Generating the C source is similar to the Ruby case, only without the -R flag; the output then gets compiled like any other C program. There are several code generation options available - I'm using -G2, which generates a "really fast goto driven FSM".
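
Concretely, the build looks something like this (file names assumed):

    ragel -G2 parser.rl -o parser.c
    cc -O3 -o parser parser.c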

And just how fast is the finished product? Compiled with -O3:

    $ time ./parser < log.100k > out

    real  0m0.418s
    user  0m0.318s
    sys 0m0.045s

~240k lines/s and two orders of magnitude faster than the Ruby version. Not bad.

Parting thoughts

This machine barely scratches the surface of what Ragel can do - it also supports really fine-grained action specifications and tools for writing scanners, backtracking and controlling non-determinism. Full details are available in the Ragel user guide.
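
For a taste of the scanner syntax, a longest-match tokenizer looks something like this sketch (illustrative only - scanners also expect ts, te and act variables in the host program):

    %%{
      machine toy_scanner;

      main := |*
        alpha+ => { emit("word");   };
        digit+ => { emit("number"); };
        space  => {};                # skip whitespace
      *|;
    }%%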

My favorite thing about using Ragel is the "write once, run anywhere" workflow. With this machinery in place, tokenizing a new input format is just a matter of declaring a new grammar and running any of your tokenizers in a new language is just a matter of writing a new adapter. I'll leave writing a Javascript version of this machine as an exercise for the reader.

Happy parsing!