Re: pseudo-specs for a String class: tokenization

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Thu, 04 Sep 2008 20:43:47 -0600

On Fri, 2008-09-05 at 00:43 +0200, Kinkie wrote:
> On Thu, Sep 4, 2008 at 7:30 PM, Alex Rousskov
> <rousskov_at_measurement-factory.com> wrote:
> > On Thu, 2008-09-04 at 18:00 +0200, Kinkie wrote:
> >
> >> > BTW, this is yet another case where a Tokenizer class would be better
> >> > than let-String-do-everything approach because a tokenizer object can at
> >> > any time return the current token, the current delimiter, and/or both,
> >> > without performance overhead or design complications.
> >>
> >> There is no such thing as "current delimiter"; it's supplied by the
> >> caller each time.
> >
> > According to your documentation the caller supplies a _set_ of delimiter
> > characters. Thus, the current or actual delimiter (i.e., the actual
> > character at the end of the returned token, if any) is unknown to the
> > caller if the caller used a multi-character delimiter set:
>
> Yes.
>
> You convinced me, but for a different reason.
> If the Tokenizer is a separate object, it must hold a reference to the
> KBuf it's parsing.
> If rather than a reference it holds a copy, this will have the
> practical effect of making the KBuf being parsed immutable.

Storing a reference to a String is prohibited!

Tokenizer has to keep a const copy of a String object. The underlying
memory buffer is refcounted and, hence, the buffer is not copied when
the String is.
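
To illustrate the point about copies being cheap, here is a rough sketch. It is not the proposed String: RcString, Holder, and demoConstCopy are hypothetical names, and std::shared_ptr stands in for whatever refcounting the real buffer would use. The only thing it is meant to show is that a const copy shares the buffer and is unaffected when the original is later changed.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for the proposed String: the characters live in a
// shared, refcounted buffer, so copying the object copies only a pointer.
class RcString {
public:
    explicit RcString(const std::string &s)
        : buf_(std::make_shared<const std::string>(s)) {}

    const std::string &str() const { return *buf_; }
    long owners() const { return buf_.use_count(); }
    bool sharesBufferWith(const RcString &other) const { return buf_ == other.buf_; }

    // "Modifying" rebinds this object to a fresh buffer; existing copies
    // keep pointing at the old one.
    void append(const std::string &tail) {
        buf_ = std::make_shared<const std::string>(*buf_ + tail);
    }

private:
    std::shared_ptr<const std::string> buf_;
};

// Hypothetical tokenizer-like holder: keeps a const copy, not a reference.
class Holder {
public:
    explicit Holder(const RcString &s) : input_(s) {}
    const RcString &input() const { return input_; }

private:
    const RcString input_;
};

// Usage sketch: the holder's copy survives changes to the caller's object.
inline void demoConstCopy() {
    RcString s("a,b;c");
    Holder h(s);
    assert(h.input().sharesBufferWith(s)); // no character data was copied
    s.append(",d");                        // caller moves on to a new buffer
    assert(h.input().str() == "a,b;c");    // holder still sees the old text
}
```

With a stored reference instead of a copy, the holder would see the caller's change (or a dangling object), which is what the rule above is meant to prevent.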

> Any preferences for the Tokenizer interface?

As with String, the iterator interface is pretty standard. For our
Tokenizer, we can simplify it a little, unless others think that
compatibility with standard library algorithms is worth the trouble.
Here is a sketch:

 class Tokenizer {
 public:
     Tokenizer(); // immediately atEnd
     Tokenizer(const String &aString, const String &delimiters);

     // current token, named and STL-like interfaces
     String token() const;
     String operator *() const { return token(); }

     // move to the next token, named and STL-like interfaces
     Tokenizer &operator ++() { advance(); return *this; }
     void advance();

     // end-of-input condition: no more tokens
     bool atEnd() const;

     // current delimiter (optional)
     String delimiter() const;
     ...
 };

I would not provide both named and STL-like interfaces. We should pick
one approach as it would simplify code maintenance and documentation. If
there are no strong opinions, let's use the interface with explicit
names.
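
For concreteness, a minimal sketch of the named-interface variant follows. Several things in it are assumptions and not part of the proposal: std::string stands in for String, each delimiter character ends exactly one token (so empty tokens are possible), delimiter() returns an empty string when the token ran to the end of the input, and demoTokenizer is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Sketch only: std::string is a stand-in for the proposed String class.
class Tokenizer {
public:
    Tokenizer() : atEnd_(true) {} // immediately atEnd
    Tokenizer(const std::string &aString, const std::string &delimiters)
        : input_(aString), delims_(delimiters) { scan(0); }

    // current token (empty once atEnd)
    std::string token() const {
        return atEnd_ ? std::string() : input_.substr(begin_, end_ - begin_);
    }

    // the actual character that ended the current token; empty if the
    // token ran to the end of the input (assumed convention)
    std::string delimiter() const {
        if (atEnd_ || end_ >= input_.size())
            return std::string();
        return std::string(1, input_[end_]);
    }

    // move to the next token
    void advance() {
        if (atEnd_)
            return;
        if (end_ >= input_.size())
            atEnd_ = true;   // the last token had no delimiter after it
        else
            scan(end_ + 1);  // skip the one delimiter character
    }

    bool atEnd() const { return atEnd_; }

private:
    // locate the token starting at `from`
    void scan(std::size_t from) {
        begin_ = from;
        end_ = input_.find_first_of(delims_, from);
        if (end_ == std::string::npos)
            end_ = input_.size();
    }

    const std::string input_;  // const copy, not a reference
    const std::string delims_;
    std::size_t begin_ = 0;
    std::size_t end_ = 0;
    bool atEnd_ = false;
};

// Usage sketch: with a multi-character delimiter set, the caller can still
// learn which delimiter actually ended each token.
inline void demoTokenizer() {
    Tokenizer t("a,b;c", ",;");
    assert(t.token() == "a" && t.delimiter() == ",");
    t.advance();
    assert(t.token() == "b" && t.delimiter() == ";");
    t.advance();
    assert(t.token() == "c" && t.delimiter().empty());
    t.advance();
    assert(t.atEnd());
}
```

A caller loop would then read `for (Tokenizer t(s, delims); !t.atEnd(); t.advance())`, with no operator overloads to document.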

HTH,

Alex.
Received on Fri Sep 05 2008 - 02:44:32 MDT
