Re: [RFC] Tokenizer API

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Tue, 10 Dec 2013 19:13:39 +1300

On 10/12/2013 6:51 p.m., Francesco Chemolli wrote:
>> Hi,
>> SBuf supplies a few find() variants which could help which are not constant time but rely on lower-level primitives and related optimizations. My suggestion is to have CharacterSet be a SBuf and rely on them, at least for now. In any case having them be a SBuf promotes better interface decoupling and abstraction.
>
> Oh, one more argument for having the low-level matching primitives in SBuf: it's a pet peeve of mine to use some form of compact tries and/or FSM to do single-pass low-level string matching in SBuf, possibly by lifting code from GNU grep (it's very efficient but complex). Redoing find_first_of() and startsWith() here would duplicate code and undermine that possibility and qualifies as premature optimisation IMO :)
>
> Kinkie
>

The problem with comparing input strings to an SBuf of characters is that
parsing an input of length N against a charset of size M takes O(N*M) time.

Making the charset a boolean array indexed by character value, like Alex
mentioned, turns each membership test into a single lookup and cuts that
down to O(N) parsing time.
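
To make the difference concrete, here is a minimal sketch of the
boolean-array idea. The names (CharacterSet, contains, prefixLength) and
the std::array storage are my assumptions for illustration only, not an
agreed API:

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical sketch: a character set stored as a 256-entry lookup table.
class CharacterSet {
public:
    // Mark every character listed in chars as a member of the set.
    explicit CharacterSet(const std::string &chars) {
        table_.fill(false);
        for (unsigned char c : chars)
            table_[c] = true;
    }

    // Membership test: one array lookup, independent of the set size.
    bool contains(char c) const {
        return table_[static_cast<unsigned char>(c)];
    }

private:
    std::array<bool, 256> table_;
};

// Length of the leading run of buf[0..len) made up of characters in set.
// One table lookup per input byte, so O(N) whatever the size of the set.
inline std::size_t prefixLength(const char *buf, std::size_t len,
                                const CharacterSet &set) {
    std::size_t i = 0;
    while (i < len && set.contains(buf[i]))
        ++i;
    return i;
}
```

With a set built from "0123456789", prefixLength("123abc", 6, digits)
stops at the first non-digit after three lookups. Searching an SBuf of
the same ten characters would instead need up to ten comparisons per
input byte.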

This Tokenizer is also for the cases where the input is a MemBuf or
similar non-SBuf array. The data copy that gets it into an SBuf is the
output of the Tokenizer. If we copy the entire buffer into an SBuf first
and then parse, we face either growing the SBuf more than otherwise
necessary if the input was not all received, or cropping it down after a
useless data copy.
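
Roughly what I have in mind, as a sketch only. RawTokenizer, token() and
consumed() are hypothetical names, std::string stands in for SBuf, and
the delimiter table is a plain std::array assumed for the example:

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical sketch: walks a caller-owned buffer without copying it.
class RawTokenizer {
public:
    RawTokenizer(const char *buf, std::size_t len)
        : buf_(buf), len_(len), pos_(0) {}

    // Extracts the next run of non-delimiter bytes into out.
    // Returns false when no token is left. Only the token bytes are copied.
    // Simplification: end of buffer is treated as end of token, so a caller
    // with partially received input would need to check for that itself.
    bool token(std::string &out, const std::array<bool, 256> &isDelim) {
        while (pos_ < len_ && isDelim[static_cast<unsigned char>(buf_[pos_])])
            ++pos_; // skip leading delimiters
        const std::size_t start = pos_;
        while (pos_ < len_ && !isDelim[static_cast<unsigned char>(buf_[pos_])])
            ++pos_; // scan the token in place
        if (pos_ == start)
            return false;
        out.assign(buf_ + start, pos_ - start); // the only data copy
        return true;
    }

    // Number of input bytes parsed so far.
    std::size_t consumed() const { return pos_; }

private:
    const char *buf_;
    std::size_t len_;
    std::size_t pos_;
};
```

The input stays wherever it was received. The output object only ever
holds bytes that have already been parsed, so there is nothing to grow
speculatively and nothing to crop afterwards.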

Amos
Received on Tue Dec 10 2013 - 06:13:47 MST