Re: [RFC] Tokenizer API

From: Francesco Chemolli <gkinkie_at_gmail.com>
Date: Tue, 10 Dec 2013 06:46:18 +0100

On 09 Dec 2013, at 20:00, Alex Rousskov <rousskov_at_measurement-factory.com> wrote:

> Hello,
>
> The promised Tokenizer API proposal is attached. Compared to earlier
> proposals, this interface is simplified by focusing on finding tokens
> (requires knowledge of allowed and prohibited character sets), not
> parsing (requires knowledge of input syntax) and by acknowledging that
> real parsing rules are often too complex to be [efficiently] supported
> with a single set of delimiters. The parser (rather than a tokenizer) is
> a better place to deal with those complexities.
>
> The API supports checkpoints and backtracking by ... copying Tokenizers.
>
> I believe the interface allows for an efficient implementation,
> especially if the CharacterSet type is eventually redefined as a boolean
> array, providing us a constant time lookup complexity.

Hi,
   SBuf supplies a few find() variants that could help here. They are not constant time, but they rely on lower-level primitives and the optimizations that come with them. My suggestion is to make CharacterSet an SBuf and rely on those variants, at least for now. In any case, making CharacterSet an SBuf promotes better interface decoupling and abstraction.
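
To make the trade-off concrete, here is a rough sketch of the two representations side by side. It is only an illustration: std::string stands in for SBuf, and StringCharSet, TableCharSet and prefixLength are names made up for this example, not part of the proposal.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical: a set stored as a string of its members (the "SBuf" flavour).
struct StringCharSet {
    std::string members;
    bool contains(char c) const {
        // Linear in members.size(), but delegated to the library's find().
        return members.find(c) != std::string::npos;
    }
};

// Hypothetical: a set stored as a 256-entry boolean table.
struct TableCharSet {
    std::array<bool, 256> table{};
    explicit TableCharSet(const std::string &members) {
        for (unsigned char c : members)
            table[c] = true;
    }
    bool contains(char c) const {
        // Constant time: one array lookup per character.
        return table[static_cast<unsigned char>(c)];
    }
};

// Length of the longest prefix of input made only of characters in the set.
template <class Set>
std::size_t prefixLength(const std::string &input, const Set &set) {
    std::size_t n = 0;
    while (n < input.size() && set.contains(input[n]))
        ++n;
    return n;
}

// Both representations give the same answer; only the lookup cost differs.
void compareSets() {
    const StringCharSet s{"0123456789"};
    const TableCharSet t("0123456789");
    assert(prefixLength(std::string("123abc"), s) == 3);
    assert(prefixLength(std::string("123abc"), t) == 3);
}
```

Since both hide behind the same contains() call, starting with the string flavour and switching to a table later should not affect callers.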

> Here is a sketch on how a Tokenizer "tk" might be used to build a
> primitive HTTP Request-Line parser (a part of the incremental HTTP
> header parser):

SBuf was not really designed to be passed by non-const reference, but this sketch is very compelling, so it is worth trying to see how it works out.

>> // Looking at something like GET /index.html HTTP/1.0\r\n
>>
>> SBuf method, uri, proto, vMajor, vMinor;
>> if (tk.prefix(method, Http::MethodChars) &&
>> tk.token(uri, Http::HeaderWhitespace) &&
>> tk.prefix(proto, Http::ProtoChars) &&
>> tk.skip('/') &&
>> tk.prefix(vMajor, DecimalDigits) &&
>> tk.skip('.') &&
>> tk.prefix(vMinor, DecimalDigits) &&
>> (tk.skip(Http::Crs) || true) && // optional CRs
>> tk.skip('\n')) {
>> ... validate after successfully parsing the request line
>> } else ...
>
>
> And this sketch illustrates the part of squid.conf parser dealing with
> quoted strings:
>
>> if (tk.skip('\\')) ...
>> else if (tk.skip('"')) ...
>> else if (tk.token(word, SquidConfWhitespace)) ...
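
To check my reading of the sketches above, here is a toy version of the interface. Everything in it is an assumption made for illustration: std::string and std::bitset stand in for SBuf and CharacterSet, prefix() is assumed to fail on an empty match, token() is assumed to behave like strtok(3) and to require at least one trailing delimiter, and skip(set) is assumed to consume every leading member of the set.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>

using CharSet = std::bitset<256>; // hypothetical stand-in for CharacterSet

CharSet makeSet(const std::string &chars) {
    CharSet s;
    for (unsigned char c : chars)
        s.set(c);
    return s;
}

class Tokenizer {
public:
    explicit Tokenizer(const std::string &input) : buf_(input) {}

    // Extracts the longest run of permitted characters; fails if it is empty.
    bool prefix(std::string &out, const CharSet &permitted) {
        const std::size_t n = span(pos_, permitted, true);
        if (n == 0)
            return false;
        out = buf_.substr(pos_, n);
        pos_ += n;
        return true;
    }

    // Skips leading delimiters, extracts up to the next delimiter, then skips
    // trailing delimiters. Consumes nothing on failure.
    bool token(std::string &out, const CharSet &delims) {
        const std::size_t start = pos_ + span(pos_, delims, true);
        const std::size_t len = span(start, delims, false);
        const std::size_t trailing = span(start + len, delims, true);
        if (len == 0 || trailing == 0)
            return false;
        out = buf_.substr(start, len);
        pos_ = start + len + trailing;
        return true;
    }

    // Skips exactly one occurrence of c.
    bool skip(char c) {
        if (pos_ >= buf_.size() || buf_[pos_] != c)
            return false;
        ++pos_;
        return true;
    }

    // Skips all leading members of the set; true if anything was skipped.
    bool skip(const CharSet &set) {
        const std::size_t n = span(pos_, set, true);
        pos_ += n;
        return n > 0;
    }

    std::string remaining() const { return buf_.substr(pos_); }

private:
    // Counts characters from 'from' whose set membership equals 'member'.
    std::size_t span(std::size_t from, const CharSet &set, bool member) const {
        std::size_t n = 0;
        while (from + n < buf_.size() &&
               set.test(static_cast<unsigned char>(buf_[from + n])) == member)
            ++n;
        return n;
    }

    std::string buf_;
    std::size_t pos_ = 0;
};

// Checkpoint and backtrack by copying the Tokenizer.
bool parseVersion(Tokenizer &tk, std::string &major, std::string &minor) {
    const CharSet digits = makeSet("0123456789");
    const Tokenizer checkpoint = tk;
    if (tk.prefix(major, digits) && tk.skip('.') && tk.prefix(minor, digits))
        return true;
    tk = checkpoint; // restore the position saved before the attempt
    return false;
}

void backtrackDemo() {
    Tokenizer tk("1.x rest");
    std::string major, minor;
    assert(!parseVersion(tk, major, minor));
    assert(tk.remaining() == "1.x rest"); // nothing stays consumed
}
```

With this toy, a failed attempt leaves the tokenizer where it started, which is the property the checkpoint-by-copy idea relies on.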

About the interface itself:

const SBuf &remaining() const

I'd change the signature to
SBuf remaining() const

Copying an SBuf is easy, and returning one by value puts fewer requirements on the caller and is less constraining.
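
A rough illustration of what I mean, assuming that copies of a buffer share storage rather than duplicating bytes. Buf below is a made-up stand-in for SBuf (shared storage plus offset and length), and consume() is a hypothetical helper that only exists to move the position in this example.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical stand-in for SBuf: copies share one immutable store.
class Buf {
public:
    explicit Buf(const std::string &s)
        : store_(std::make_shared<const std::string>(s)), off_(0), len_(s.size()) {}

    // Returns a view of the tail; copies only a pointer and two integers.
    Buf substr(std::size_t from) const {
        Buf b(*this);
        if (from > len_)
            from = len_;
        b.off_ += from;
        b.len_ -= from;
        return b;
    }

    std::string str() const { return store_->substr(off_, len_); }
    long sharers() const { return store_.use_count(); }

private:
    std::shared_ptr<const std::string> store_;
    std::size_t off_, len_;
};

class Tokenizer {
public:
    explicit Tokenizer(const Buf &input) : buf_(input) {}

    void consume(std::size_t n) { pos_ += n; } // hypothetical helper

    // By value: computed on demand, so there is no member to keep in sync,
    // and the result stays valid whatever the Tokenizer does afterwards.
    Buf remaining() const { return buf_.substr(pos_); }

private:
    Buf buf_;
    std::size_t pos_ = 0;
};

void remainingDemo() {
    const Buf input("GET /");
    Tokenizer tk(input);
    tk.consume(4);
    const Buf rest = tk.remaining();
    tk.consume(1);             // the tokenizer moves on...
    assert(rest.str() == "/"); // ...but the caller's copy is unaffected
}
```

With a const reference, the Tokenizer would have to keep a member that always holds the remaining input, and the caller would have to mind its lifetime.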

I'd also add to the interface a few constants describing common character sets such as ALPHA, ALNUM, LOWERALPHA, UPPERALPHA etc. (I'd use the predefined character classes from grep(1) as a reference for common patterns.)
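
Something along these lines, as a sketch only. std::bitset stands in for CharacterSet; range(), contains() and the DIGIT constant are names invented for the example; and the sets are plain ASCII ranges, whereas the grep(1) classes depend on the locale.

```cpp
#include <bitset>
#include <cassert>

using CharSet = std::bitset<256>; // hypothetical stand-in for CharacterSet

// Builds the set of all characters in the inclusive range [first, last].
CharSet range(unsigned char first, unsigned char last) {
    CharSet s;
    for (unsigned c = first; c <= last; ++c)
        s.set(c);
    return s;
}

// Rough ASCII counterparts of [:lower:], [:upper:], [:digit:], [:alpha:], [:alnum:].
const CharSet LOWERALPHA = range('a', 'z');
const CharSet UPPERALPHA = range('A', 'Z');
const CharSet DIGIT = range('0', '9');
const CharSet ALPHA = LOWERALPHA | UPPERALPHA; // set union
const CharSet ALNUM = ALPHA | DIGIT;

inline bool contains(const CharSet &set, char c) {
    return set.test(static_cast<unsigned char>(c));
}

void charSetDemo() {
    assert(contains(ALNUM, '7'));
    assert(!contains(ALPHA, '7'));
    assert(contains(UPPERALPHA, 'Q') && !contains(LOWERALPHA, 'Q'));
}
```

Building the larger sets as unions of the smaller ones keeps the definitions short and makes it obvious how they relate.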

   Kinkie
Received on Tue Dec 10 2013 - 05:46:26 MST
