Re: [RFC] Tokenizer API

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Tue, 10 Dec 2013 09:06:05 -0700

On 12/09/2013 04:13 PM, Amos Jeffries wrote:

> Two requests for additional scope:

> * can we place this is a separate src/parse/ library please?
> - we have other generic parse code the deserves to all be bundled up
> together instead of spread out. Might as well start that collection
> process now.

No objections.

> * Lets do the charset boolean array earlier rather than later.

No objections.

I hope I will not be the one implementing the proposed API or the above
additions though. :-)

> CharacterSet(const char * const c, size_t len)

You do not want the len argument. These character sets will be nearly
always initialized once, from constant hard-coded c-strings.

For esoteric cases, I would add an add(const char c) method to add a
single character to the set (and use the merge operation to produce more
elaborate sets as needed, see below for a sketch).

It is a good idea to add a public label/name argument/member though, for
debugging/triaging:

  CharacterSet(const char *label, const char *characters);
  ...
  const char *name; ///< a label for debugging

> /// add all characters from the given CharacterSet to this one
> void merge(const CharacterSet &src) const {

Please provide a '+=' operator [that will use merge()]. Here is how I
envision some sets might be created:

const CharacterSet &MySpaces() {
    static CharacterSet cs;
    if (!cs.name) {
        cs = CharacterSet("MySpaces", "\n\t "); // RFC 12345
        cs += OtherProtocolSpaces();
        if (Config.weNeedToBeSpecial)
            cs.add('\r');
    }
    return cs;
}

> private:
> bool match_[256];

I suggest using std::vector<bool> to avoid sprinkling the code with 256
constants or adding another member to store the hard-coded length.

I would also call this member chars_ instead of match_ because some sets
will be used in negative sense and "match" will be a little confusing in
that negative context.

> NP: most of the time we will be wanting to define these CharacterSet as
> global once-off objects. So I'm not sure if the merge() method is
> useful, but shown here for completeness in case we want it for
> generating composite character sets.

Agreed on all counts.

Cheers,

Alex.
Received on Tue Dec 10 2013 - 16:06:33 MST

This archive was generated by hypermail 2.2.0 : Wed Dec 11 2013 - 12:00:11 MST