Re: pseudo-specs for a String class: tokenization

From: Kinkie <gkinkie_at_gmail.com>
Date: Fri, 5 Sep 2008 17:47:40 +0200

On Fri, Sep 5, 2008 at 5:02 PM, Alex Rousskov
<rousskov_at_measurement-factory.com> wrote:
> On Fri, 2008-09-05 at 10:19 +0200, Kinkie wrote:
>> On Fri, Sep 5, 2008 at 4:43 AM, Alex Rousskov
>> <rousskov_at_measurement-factory.com> wrote:
>> > Just like String, the iterator interface is pretty standard. For our
>> > Tokenizer, we can simplify it a little unless others think that
>> > compatibility with standard library algorithms is worth the trouble.
>> > Here is a sketch:
>> >
>> > class Tokenizer {
>> > public:
>> > Tokenizer(); // immediately atEnd
>>
>> I'd avoid the default constructor entirely.
>
> Bad idea. The default constructor does not hurt in this case. It does
> help when you want another method to initialize the tokenizer or when
> you want to reset the already initialized tokenizer.

A tokenizer only has meaning when attached to a KBuf (String,
whatever); that's what I meant by not having a constructor without an
attached KBuf.

>> I'd rather add a version which takes the String but not the delimiters.
>
> I would recommend avoiding implicit conversions from String to anything
> and I doubt there is a reasonable set of default delimiters.

Why would there be an implicit conversion?
And you're right: just as a Tokenizer has no meaning without a KBuf,
it also has none without delimiters.

>> > Tokenizer(const String &aString, const String &delimiters);
>>
>> String arg must be passed by value (which translates to refcounted ref
>> to the data). Passing by reference will alias the String, falling back
>> into the original problem.
>
> I am sorry, but you are mistaken. String arguments like that should be
> passed by reference. I do not know what "alias the String" means, but it
> is perfectly safe and noticeably more efficient to pass Strings by
> reference in contexts like that. This has been discussed recently
> already.

The point is that passing an object by (C++) reference does not create
a new copy. No new copy means that the (KBuf-level) refcounts do not
get increased; the parameter is "just" an alias for the original KBuf,
with all that this implies (no content freezing, it can be appended to
while the Tokenizer is running, etc.).
As long as we're dealing with the KBuf class itself, that's no problem
and is a welcome optimization. But here we must make sure that a copy
of the KBuf object is created somewhere. Granted, it may be done
WITHIN the call; I'll just have to make sure that it gets done.
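
To make the refcount point concrete, here is a rough sketch of "pass by
reference, copy inside the call". Buf is a made-up stand-in for KBuf built
on std::shared_ptr, and refs() and source() are names invented for the
example; none of this is the real API.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for KBuf: a handle to shared, refcounted storage.
// Copying the handle bumps the refcount; it does not copy the bytes.
struct Buf {
    explicit Buf(const char *s) {
        assert(s != nullptr);
        store = std::make_shared<std::string>(s);
    }
    long refs() const { return store.use_count(); }
    std::shared_ptr<std::string> store;
};

class Tokenizer {
public:
    // The arguments arrive by const reference, so the call itself does not
    // touch the refcount. The member initializers copy the handles, and
    // that is where the refcount goes up.
    Tokenizer(const Buf &aBuf, const Buf &delimiters)
        : buf_(aBuf), delims_(delimiters) {}

    const Buf &source() const { return buf_; }

private:
    Buf buf_;    // our own handle, not a reference to the caller's object
    Buf delims_;
};
```

With this layout the Tokenizer holds a second handle to the same storage for
as long as it lives, so a refcount-aware KBuf could tell that the content is
shared. Storing a "const Buf &" member instead would be the aliasing case I
am worried about.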

> Tokenizer will store a copy of that passed string, of course (but that's
> implementation, not API).

Agreed.

>> >
>> > // current token, named and STL-like interfaces
>> > String token() const;
>> > String operator *() const { return token(); }
>> >
>> > // move to the next token, named and STL-like interfaces
>> > Tokenizer &operator ++() { advance(); return *this; }
>> > void advance();
>>
>> I'd add a nextToken interface which combines the two, for convenience.
>
> You can already do that with STL-like interfaces:
>
> *(++tokenizer)

Yes.

>
> If we stick with named interfaces, then do this:
>
> // move to the next token, named and STL-like interfaces
> Tokenizer &operator ++() { return next(); }
> Tokenizer &next();

Agreed.

> so that you can write:
>
> tokenizer.next().token()

No.
If knowing what the actual separator was is important, I'd rather have:
    bool next();       // returns false at end-of-string
    KBuf &token();
    char separator();

so it becomes:
    while (tokenizer.next()) {
        KBuf t = tokenizer.token();
    }
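
A minimal sketch of how that interface could behave, assuming std::string in
place of KBuf and single-char delimiters. The member names and the choice to
return empty tokens between adjacent delimiters are my assumptions for the
example, not settled design. It needs C++11 for the in-class initializers.

```cpp
#include <cassert>
#include <string>

// Sketch only: std::string stands in for KBuf.
class Tokenizer {
public:
    Tokenizer(const std::string &aString, const std::string &delimiters)
        : buf_(aString), delims_(delimiters) {
        // A tokenizer has no meaning without delimiters.
        assert(!delims_.empty());
    }

    // Moves to the next token; returns false at end-of-string.
    bool next() {
        if (pos_ > buf_.size())
            return false;
        const std::string::size_type end = buf_.find_first_of(delims_, pos_);
        if (end == std::string::npos) {
            token_ = buf_.substr(pos_);
            sep_ = '\0';
            pos_ = buf_.size() + 1;   // nothing left after this token
        } else {
            token_ = buf_.substr(pos_, end - pos_);
            sep_ = buf_[end];
            pos_ = end + 1;           // skip the separator itself
        }
        return true;
    }

    // Token found by the last successful next().
    const std::string &token() const { return token_; }

    // Separator that ended the current token; '\0' for the last token.
    char separator() const { return sep_; }

private:
    std::string buf_, delims_, token_;
    std::string::size_type pos_ = 0;
    char sep_ = '\0';
};
```

The caller loop is then exactly the while (tokenizer.next()) form above, with
separator() available inside the loop body when it matters.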

>
>> > // end-of-file condition
>> > bool atEnd() const;
>> >
>> > // current delimiter (optional)
>> > String delimiter() const;
>>
>> And I'd allow changing the delimiters during parsing.
>
> Sure, just add a setDelimiters(const String &delimiters) method.

Agreed.

> You may also want to add position(), tail(), print(), and
> originalString() (or source()?) methods.

I'd avoid originalString(). It's up to the caller to remember it if she
wishes, unless there is some actual case where that information is relevant.

> You might add a previous() method (or "--" operator) as well, but I
> would probably wait until we know it is needed.

I'd leave that task to the caller.

> Finally, when the data members are known, you may need to add an
> assignment operator, a copy constructor, and a destructor. Depending on
> the data members, the defaults may work just fine though so I did not
> provide them in the initial API sketch.
>
>> Would anyone think it'd be useful to have non-single-char delimiters?
>> It'd complicate the called code quite a bit, but if it's useful and it
>> simplifies the calling code...
>
> I think we will eventually have a DelimiterSet or StrFinder class so
> that we can support string delimiters, RE delimiters, and arbitrary code
> delimiters. I would not change the API though. Currently, our
> DelimiterSet or StrFinder is a String class, which is interpreted
> internally as a set of chars. Eventually, that interpretation would be
> up to the passed finder object...
>
> You can typedef Tokenizer::Finder to String right now. I did not propose
> that to keep things simple. It would be easy to change later without any
> effect on the caller code.

Nah. Might as well do that now.
And the class hierarchy makes sense: a pure-virtual class which
abstracts away the actual matching method.
How would KBuf::Tokenizer::Finder sound? It would even avoid the
wart of friend classes.
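
Roughly what I have in mind for the hierarchy, as a sketch. The find()
signature, the CharSetFinder and StringFinder names, and the split() helper
are all invented for illustration, std::string again stands in for KBuf, and
the classes are shown at namespace scope rather than nested. It assumes C++11
for "override".

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical pure-virtual interface: the tokenizer only asks "where is
// the next delimiter and how long is it", not how it was matched.
class Finder {
public:
    virtual ~Finder() {}
    // Returns the offset of the first delimiter at or after `from`, or
    // std::string::npos. On a match, matchLen is the delimiter's length.
    virtual std::string::size_type find(const std::string &buf,
                                        std::string::size_type from,
                                        std::string::size_type &matchLen) const = 0;
};

// Today's behaviour: the string is interpreted as a set of chars.
class CharSetFinder : public Finder {
public:
    explicit CharSetFinder(const std::string &chars) : chars_(chars) {}
    std::string::size_type find(const std::string &buf,
                                std::string::size_type from,
                                std::string::size_type &matchLen) const override {
        matchLen = 1;
        return buf.find_first_of(chars_, from);
    }
private:
    std::string chars_;
};

// Non-single-char delimiters: the whole string must match.
class StringFinder : public Finder {
public:
    explicit StringFinder(const std::string &delim) : delim_(delim) {
        assert(!delim_.empty());   // an empty delimiter would never advance
    }
    std::string::size_type find(const std::string &buf,
                                std::string::size_type from,
                                std::string::size_type &matchLen) const override {
        matchLen = delim_.size();
        return buf.find(delim_, from);
    }
private:
    std::string delim_;
};

// Tokenizing loop written only against the Finder interface.
std::vector<std::string> split(const std::string &buf, const Finder &finder) {
    std::vector<std::string> out;
    std::string::size_type pos = 0;
    for (;;) {
        std::string::size_type len = 0;
        const std::string::size_type hit = finder.find(buf, pos, len);
        if (hit == std::string::npos) {
            out.push_back(buf.substr(pos));   // last token
            return out;
        }
        out.push_back(buf.substr(pos, hit - pos));
        pos = hit + len;
    }
}
```

The loop in split() never changes when a new matching method is added, which
is the part I like. An RE-based finder would be one more subclass.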

-- 
 /kinkie
Received on Fri Sep 05 2008 - 15:47:50 MDT