Re: pseudo-specs for a String class: tokenization

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Fri, 05 Sep 2008 09:02:01 -0600

On Fri, 2008-09-05 at 10:19 +0200, Kinkie wrote:
> On Fri, Sep 5, 2008 at 4:43 AM, Alex Rousskov
> <rousskov_at_measurement-factory.com> wrote:
> > Just like String, the iterator interface is pretty standard. For our
> > Tokenizer, we can simplify it a little unless others think that
> > compatibility with standard library algorithms is worth the trouble.
> > Here is a sketch:
> >
> > class Tokenizer {
> > public:
> > Tokenizer(); // immediately atEnd
>
> I'd avoid the default constructor entirely.

Bad idea. The default constructor does not hurt in this case. It does
help when you want another method to initialize the tokenizer or when
you want to reset the already initialized tokenizer.

> I'd rather add a version which takes the String but not the delimiters.

I would recommend avoiding implicit conversions from String to anything
and I doubt there is a reasonable set of default delimiters.

> > Tokenizer(const String &aString, const String &delimiters);
>
> String arg must be passed by value (which translates to refcounted ref
> to the data). Passing by reference will alias the String, falling back
> into the original problem.

I am sorry, but you are mistaken. String arguments like that should be
passed by reference. I do not know what "alias the String" means, but it
is perfectly safe and noticeably more efficient to pass Strings by
reference in contexts like that. This has been discussed recently
already.

Tokenizer will store a copy of that passed string, of course (but that's
implementation, not API).
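
A minimal sketch of that point, using std::string as a stand-in for our
String (an assumption: std::string copies deeply, while our String would
presumably share refcounted data). The source() accessor and the
demoStoredCopy() helper are illustrative names only:

```cpp
#include <cassert>
#include <string>

class Tokenizer {
public:
    // Arguments arrive by const reference: no copy is made at the call.
    Tokenizer(const std::string &aString, const std::string &delimiters)
        : source_(aString), delimiters_(delimiters) {} // copies happen here

    const std::string &source() const { return source_; }
    const std::string &delimiters() const { return delimiters_; }

private:
    std::string source_;     // the tokenizer's own copy of the input
    std::string delimiters_; // the tokenizer's own copy of the delimiters
};

// Hypothetical helper: the caller changes its string after construction.
void demoStoredCopy() {
    std::string line = "GET /index.html";
    Tokenizer tokenizer(line, " ");
    line.clear(); // does not affect the tokenizer's stored copy
    assert(tokenizer.source() == "GET /index.html");
}
```

The reference only lives for the duration of the constructor call, so the
caller is free to modify or destroy its string afterwards.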

> >
> > // current token, named and STL-like interfaces
> > String token() const;
> > String operator *() const { return token(); }
> >
> > // move to the next token, named and STL-like interfaces
> > Tokenizer &operator ++() { advance(); return *this; }
> > void advance();
>
> I'd add a nextToken interface which combines the two, for convenience.

You can already do that with STL-like interfaces:

    *(++tokenizer)

If we stick with named interfaces, then do this:

    // move to the next token, named and STL-like interfaces
    Tokenizer &operator ++() { return next(); }
    Tokenizer &next();

so that you can write:

    tokenizer.next().token()
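
To make the two call styles concrete, here is a rough sketch, not a
proposed implementation. It assumes std::string in place of our String,
that a freshly constructed tokenizer is already positioned on the first
token, and that runs of delimiters are collapsed; none of that has been
decided. The private findToken() helper and demoChaining() are made-up
names:

```cpp
#include <cassert>
#include <string>

class Tokenizer {
public:
    Tokenizer(const std::string &aString, const std::string &delimiters)
        : buf_(aString), delims_(delimiters), pos_(0), len_(0) { findToken(0); }

    // current token, named and STL-like interfaces
    std::string token() const {
        return atEnd() ? std::string() : buf_.substr(pos_, len_);
    }
    std::string operator *() const { return token(); }

    // move to the next token, named and STL-like interfaces
    Tokenizer &operator ++() { return next(); }
    Tokenizer &next() {
        if (!atEnd())
            findToken(pos_ + len_);
        return *this;
    }

    bool atEnd() const { return pos_ == std::string::npos; }

private:
    // Skip delimiters starting at 'from', then measure the token found.
    void findToken(std::string::size_type from) {
        pos_ = buf_.find_first_not_of(delims_, from);
        if (pos_ == std::string::npos) {
            len_ = 0;
            return;
        }
        const std::string::size_type end = buf_.find_first_of(delims_, pos_);
        len_ = (end == std::string::npos ? buf_.size() : end) - pos_;
    }

    std::string buf_;
    std::string delims_;
    std::string::size_type pos_; // start of current token, npos when atEnd
    std::string::size_type len_; // length of current token
};

// Both spellings walk the same tokens.
void demoChaining() {
    Tokenizer tokenizer("a b c", " ");
    assert(tokenizer.token() == "a");
    assert(*(++tokenizer) == "b");           // STL-like
    assert(tokenizer.next().token() == "c"); // named
    assert(tokenizer.next().atEnd());
}
```

Returning a reference from next() is what makes the chained call possible
without a separate nextToken() method.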

> > // end-of-file condition
> > bool atEnd() const;
> >
> > // current delimiter (optional)
> > String delimiter() const;
>
> And I'd allow changing the delimiters during parsing.

Sure, just add a setDelimiters(const String &delimiters) method.
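
A sketch of how that could behave, under the same assumptions as any
other stand-in here: std::string instead of our String, and a tokenizer
that sits on the first token right after construction. One detail needs
a decision: the delimiter that ended the current token has to be consumed
before the new set applies, otherwise it would leak into the next token.
The resume_ member and demoSetDelimiters() are hypothetical names:

```cpp
#include <cassert>
#include <string>

class Tokenizer {
public:
    Tokenizer(const std::string &aString, const std::string &delimiters)
        : buf_(aString), delims_(delimiters), pos_(0), len_(0), resume_(0) {
        findToken();
    }

    // Takes effect on the next call to next(); the current token is kept.
    void setDelimiters(const std::string &delimiters) { delims_ = delimiters; }

    std::string token() const {
        return atEnd() ? std::string() : buf_.substr(pos_, len_);
    }
    Tokenizer &next() {
        if (!atEnd())
            findToken();
        return *this;
    }
    bool atEnd() const { return pos_ == std::string::npos; }

private:
    void findToken() {
        pos_ = buf_.find_first_not_of(delims_, resume_);
        if (pos_ == std::string::npos) {
            len_ = 0;
            return;
        }
        const std::string::size_type end = buf_.find_first_of(delims_, pos_);
        if (end == std::string::npos) {
            len_ = buf_.size() - pos_;
            resume_ = buf_.size();
        } else {
            len_ = end - pos_;
            resume_ = end + 1; // consume the delimiter that ended this token
        }
    }

    std::string buf_;
    std::string delims_;
    std::string::size_type pos_;    // start of current token, npos when atEnd
    std::string::size_type len_;    // length of current token
    std::string::size_type resume_; // where the next search starts
};

// Split on '=' first, then switch to ',' for the value list.
void demoSetDelimiters() {
    Tokenizer tokenizer("name=a,b", "=");
    assert(tokenizer.token() == "name");
    tokenizer.setDelimiters(",");
    assert(tokenizer.next().token() == "a");
    assert(tokenizer.next().token() == "b");
    assert(tokenizer.next().atEnd());
}
```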

You may also want to add position(), tail(), print(), and
originalString() (or source()?) methods.

You might add a previous() method (or "--" operator) as well, but I
would probably wait until we know it is needed.

Finally, when the data members are known, you may need to add an
assignment operator, a copy constructor, and a destructor. Depending on
the data members, the defaults may work just fine though so I did not
provide them in the initial API sketch.
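
For example, if the data members turn out to be plain value types, the
compiler-generated copy constructor, assignment operator, and destructor
are enough, and the default constructor gives you a cheap reset. A sketch
with guessed data members (std::string stands in for our String; the
member names and demoReset() are made up):

```cpp
#include <cassert>
#include <string>

class Tokenizer {
public:
    Tokenizer() : pos_(std::string::npos) {} // immediately atEnd
    Tokenizer(const std::string &aString, const std::string &delimiters)
        : buf_(aString), delims_(delimiters),
          pos_(buf_.find_first_not_of(delims_)) {}

    bool atEnd() const { return pos_ == std::string::npos; }
    const std::string &source() const { return buf_; }

    // No copy constructor, assignment operator, or destructor declared:
    // every member knows how to copy and destroy itself.

private:
    std::string buf_;
    std::string delims_;
    std::string::size_type pos_; // start of current token, npos when atEnd
};

void demoReset() {
    Tokenizer tokenizer("a b", " ");
    Tokenizer saved(tokenizer);        // compiler-generated copy constructor
    tokenizer = Tokenizer();           // reset via compiler-generated assignment
    assert(tokenizer.atEnd());
    assert(!saved.atEnd() && saved.source() == "a b");
    tokenizer = Tokenizer("c d", " "); // late initialization
    assert(!tokenizer.atEnd());
}
```

If a raw pointer or other manually managed resource ends up among the
members, the explicit versions become necessary.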

> Would anyone think it'd be useful to have non-single-char delimiters?
> It'd complicate the called code quite a bit, but if it's useful and it
> simplifies the calling code...

I think we will eventually have a DelimiterSet or StrFinder class so
that we can support string delimiters, RE delimiters, and arbitrary code
delimiters. I would not change the API though. Currently, our
DelimiterSet or StrFinder is a String class, which is interpreted
internally as a set of chars. Eventually, that interpretation would be
up to the passed finder object...

You can typedef Tokenizer::Finder to String right now. I did not propose
that to keep things simple. It would be easy to change later without any
effect on the caller code.
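
Roughly like this, again with std::string standing in for our String and
with isDelimiter() and demoFinder() as invented names for illustration:

```cpp
#include <cassert>
#include <string>

class Tokenizer {
public:
    // Today the finder is just a string read as a set of delimiter chars.
    // Swapping this typedef for a real finder class later would not change
    // how callers construct a Tokenizer.
    typedef std::string Finder;

    Tokenizer(const std::string &aString, const Finder &finder)
        : buf_(aString), finder_(finder) {}

    // The "set of chars" interpretation lives in one place.
    bool isDelimiter(char c) const {
        return finder_.find(c) != Finder::npos;
    }

    const std::string &source() const { return buf_; }

private:
    std::string buf_;
    Finder finder_;
};

void demoFinder() {
    Tokenizer tokenizer("a,b;c", ",;"); // caller code is unaware of Finder
    assert(tokenizer.isDelimiter(','));
    assert(tokenizer.isDelimiter(';'));
    assert(!tokenizer.isDelimiter('a'));
}
```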

HTH,

Alex.
Received on Fri Sep 05 2008 - 15:02:32 MDT