[RFC] Tokenizer API

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Mon, 09 Dec 2013 12:00:31 -0700

Hello,

    The promised Tokenizer API proposal is attached. Compared to earlier
proposals, this interface is simplified by focusing on finding tokens
(requires knowledge of allowed and prohibited character sets), not
parsing (requires knowledge of input syntax) and by acknowledging that
real parsing rules are often too complex to be [efficiently] supported
with a single set of delimiters. The parser (rather than a tokenizer) is
a better place to deal with those complexities.

The API supports checkpoints and backtracking by ... copying Tokenizers.
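
A minimal sketch of copy-based checkpointing follows. The toy Tokenizer class, its std::string buffer, remaining(), and the skipCrLf() helper are all assumptions made for illustration and are not part of the attached proposal.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy stand-in for the proposed Tokenizer (hypothetical): it only holds the
// input, a read position, and a single skip(char) method.
class Tokenizer {
public:
    explicit Tokenizer(const std::string &input) : buf_(input) {}

    // Consume one leading character if it equals c.
    bool skip(char c) {
        if (pos_ < buf_.size() && buf_[pos_] == c) {
            ++pos_;
            return true;
        }
        return false;
    }

    // Unparsed input (hypothetical accessor, used here for inspection only).
    std::string remaining() const { return buf_.substr(pos_); }

private:
    std::string buf_;
    std::size_t pos_ = 0;
};

// Try to consume "\r\n"; on failure, leave tk exactly as it was.
inline bool skipCrLf(Tokenizer &tk) {
    const Tokenizer checkpoint = tk; // a checkpoint is a plain copy
    if (tk.skip('\r') && tk.skip('\n'))
        return true;
    tk = checkpoint; // backtrack by assigning the copy back
    return false;
}

inline void checkpointExample() {
    Tokenizer tk("\rX");
    assert(!skipCrLf(tk));           // the lone CR was consumed, then restored
    assert(tk.remaining() == "\rX"); // input is untouched after backtracking
}
```

Copying a std::string costs O(n) in this toy version; the approach is only attractive when copying a Tokenizer is cheap, which is assumed rather than shown here.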

I believe the interface allows for an efficient implementation,
especially if the CharacterSet type is eventually redefined as a boolean
array, giving us constant-time character lookups.
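
One possible shape for such a boolean-array CharacterSet is sketched below. The constructor, add(), addRange(), and contains() names are assumptions for illustration, not the interface from the proposal.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical CharacterSet: one bool per possible octet value.
class CharacterSet {
public:
    // Build a set from a NUL-terminated list of member characters.
    explicit CharacterSet(const char *members) {
        for (; *members; ++members)
            add(*members);
    }

    CharacterSet &add(char c) {
        chars_[static_cast<unsigned char>(c)] = true;
        return *this;
    }

    // Add every character in the inclusive range [first, last].
    CharacterSet &addRange(unsigned char first, unsigned char last) {
        for (unsigned int c = first; c <= last; ++c)
            chars_[c] = true;
        return *this;
    }

    // Constant-time membership test: a single array index.
    bool contains(char c) const {
        return chars_[static_cast<unsigned char>(c)];
    }

private:
    std::array<bool, 256> chars_{}; // value-initialized to all false
};

inline void characterSetExample() {
    CharacterSet digits("");
    digits.addRange('0', '9');
    assert(digits.contains('7'));
    assert(!digits.contains('a'));
}
```

The cast to unsigned char keeps the index non-negative on platforms where char is signed.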

Here is a sketch of how a Tokenizer "tk" might be used to build a
primitive HTTP Request-Line parser (a part of the incremental HTTP
header parser):

> // Looking at something like GET /index.html HTTP/1.0\r\n
>
> SBuf method, uri, proto, vMajor, vMinor;
> if (tk.prefix(method, Http::MethodChars) &&
>     tk.token(uri, Http::HeaderWhitespace) &&
>     tk.prefix(proto, Http::ProtoChars) &&
>     tk.skip('/') &&
>     tk.prefix(vMajor, DecimalDigits) &&
>     tk.skip('.') &&
>     tk.prefix(vMinor, DecimalDigits) &&
>     (tk.skip(Http::Crs) || true) && // optional CRs
>     tk.skip('\n')) {
>     ... validate after successfully parsing the request line
> } else ...
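
To make the sketch above concrete, here is a self-contained toy version that compiles on its own. Everything in it is an assumption for illustration: std::string replaces SBuf, CharacterSet is a plain string of members, the character sets are simplified, and the semantics chosen for prefix(), token(), and skip() (in particular, token() consuming the trailing delimiters) are one guess at what the proposal intends.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-ins; none of these definitions come from the proposal.
struct CharacterSet {
    std::string members;
    bool contains(char c) const { return members.find(c) != std::string::npos; }
};

class Tokenizer {
public:
    explicit Tokenizer(const std::string &input) : buf_(input) {}

    // Extract the longest non-empty leading run of characters found in set.
    bool prefix(std::string &out, const CharacterSet &set) {
        const std::size_t n = run(pos_, set, true);
        if (n == 0)
            return false;
        out = buf_.substr(pos_, n);
        pos_ += n;
        return true;
    }

    // Skip leading delimiters, extract characters up to the next delimiter,
    // then consume the trailing delimiters (at least one is required).
    bool token(std::string &out, const CharacterSet &delims) {
        const std::size_t start = pos_ + run(pos_, delims, true);
        const std::size_t len = run(start, delims, false);
        const std::size_t trailing = run(start + len, delims, true);
        if (len == 0 || trailing == 0)
            return false; // nothing is consumed on failure
        out = buf_.substr(start, len);
        pos_ = start + len + trailing;
        return true;
    }

    // Consume one leading character if it equals c.
    bool skip(char c) {
        if (pos_ >= buf_.size() || buf_[pos_] != c)
            return false;
        ++pos_;
        return true;
    }

    // Consume all leading characters found in set; true if any were consumed.
    bool skip(const CharacterSet &set) {
        const std::size_t n = run(pos_, set, true);
        pos_ += n;
        return n > 0;
    }

private:
    // Length of the run starting at from whose membership in set equals wanted.
    std::size_t run(std::size_t from, const CharacterSet &set, bool wanted) const {
        std::size_t n = 0;
        while (from + n < buf_.size() && set.contains(buf_[from + n]) == wanted)
            ++n;
        return n;
    }

    std::string buf_;
    std::size_t pos_ = 0;
};

struct RequestLine {
    std::string method, uri, proto, vMajor, vMinor;
};

// Same chain of calls as the sketch above, with simplified character sets.
inline bool parseRequestLine(const std::string &input, RequestLine &r) {
    static const CharacterSet MethodChars{"ABCDEFGHIJKLMNOPQRSTUVWXYZ"};
    static const CharacterSet HeaderWhitespace{" \t"};
    static const CharacterSet ProtoChars{"ABCDEFGHIJKLMNOPQRSTUVWXYZ"};
    static const CharacterSet DecimalDigits{"0123456789"};
    static const CharacterSet Crs{"\r"};

    Tokenizer tk(input);
    return tk.prefix(r.method, MethodChars) &&
           tk.token(r.uri, HeaderWhitespace) &&
           tk.prefix(r.proto, ProtoChars) &&
           tk.skip('/') &&
           tk.prefix(r.vMajor, DecimalDigits) &&
           tk.skip('.') &&
           tk.prefix(r.vMinor, DecimalDigits) &&
           (tk.skip(Crs) || true) && // optional CRs
           tk.skip('\n');
}

inline void requestLineExample() {
    RequestLine r;
    assert(parseRequestLine("GET /index.html HTTP/1.0\r\n", r));
    assert(r.method == "GET" && r.uri == "/index.html");
}
```

With these assumed semantics, a line that lacks the protocol part, such as "GET /index.html\r\n", fails inside token() because no whitespace follows the URI.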

And this sketch illustrates the part of squid.conf parser dealing with
quoted strings:

> if (tk.skip('\\')) ...
> else if (tk.skip('"')) ...
> else if (tk.token(word, SquidConfWhitespace)) ...

HTH,

Alex.

Received on Mon Dec 09 2013 - 19:00:57 MST
