Re: pseudo-specs for a String class: tokenization

From: Kinkie <gkinkie_at_gmail.com>
Date: Fri, 5 Sep 2008 10:19:28 +0200

On Fri, Sep 5, 2008 at 4:43 AM, Alex Rousskov
<rousskov_at_measurement-factory.com> wrote:
> On Fri, 2008-09-05 at 00:43 +0200, Kinkie wrote:
>> On Thu, Sep 4, 2008 at 7:30 PM, Alex Rousskov
>> <rousskov_at_measurement-factory.com> wrote:
>> > On Thu, 2008-09-04 at 18:00 +0200, Kinkie wrote:
>> >
>> >> > BTW, this is yet another case where a Tokenizer class would be better
>> >> > than let-String-do-everything approach because a tokenizer object can at
>> >> > any time return the current token, the current delimiter, and/or both,
>> >> > without performance overhead or design complications.
>> >>
>> >> There is no such thing as "current delimiter"; it's supplied by the
>> >> caller each time.
>> >
>> > According to your documentation the caller supplies a _set_ of delimiter
>> > characters. Thus, the current or actual delimiter (i.e., the actual
>> > character at the end of the returned token, if any) is unknown to the
>> > caller if the caller used a multi-character delimiter set:
>>
>> Yes.
>>
>> You convinced me, but for a different reason.
>> If the Tokenizer is a separate object, it must hold a reference to the
>> KBuf it's parsing.
>> If rather than a reference it holds a copy, this will have the
>> practical effect of making the KBuf being parsed immutable.
>
> Storing a reference to a String is prohibited!
>
> Tokenizer has to keep a const copy of a String object. The underlying
> memory buffer is refcounted and, hence, the buffer is not copied when
> the String is.

If the tokenizer IS the String object, then it can't really be const.
So I agree, an external tokenizer is needed.
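
To make the refcounting point concrete, here is a minimal sketch, assuming std::shared_ptr stands in for the refcounted buffer. RcString and TokenizerInput are hypothetical names used only for illustration; they are not the proposed String/KBuf API. It shows that a const copy shares the buffer with the caller's string, and that a later write by the caller detaches onto a private buffer instead of changing the text being parsed.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a refcounted String: copies share one buffer,
// and a write detaches the writer onto its own buffer (copy-on-write).
class RcString {
public:
    explicit RcString(const std::string &s)
        : buf_(std::make_shared<std::string>(s)) {}

    // The implicit copy operations only bump the reference count;
    // no character data is copied.

    const std::string &str() const { return *buf_; }

    bool sharesBufferWith(const RcString &other) const {
        return buf_ == other.buf_;
    }

    void append(const std::string &tail) {
        if (buf_.use_count() > 1)                         // buffer is shared
            buf_ = std::make_shared<std::string>(*buf_);  // detach before writing
        buf_->append(tail);
    }

private:
    std::shared_ptr<std::string> buf_;
};

// Hypothetical holder for the tokenizer's input: it keeps a const copy,
// so the caller's later writes cannot change what is being parsed.
class TokenizerInput {
public:
    explicit TokenizerInput(RcString s) : input_(std::move(s)) {}

    const std::string &text() const { return input_.str(); }
    const RcString &input() const { return input_; }

private:
    const RcString input_;
};
```

In this sketch the copy is cheap because only the shared_ptr is copied, and the caller's string stays freely modifiable because append() detaches whenever the buffer is shared.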

>> Any preferences for the Tokenizer interface?
>
> Just like String, the iterator interface is pretty standard. For our
> Tokenizer, we can simplify it a little unless others think that
> compatibility with standard library algorithms is worth the trouble.
> Here is a sketch:
>
> class Tokenizer {
> public:
> Tokenizer(); // immediately atEnd

I'd avoid the default constructor entirely. I'd rather add a version
which takes the String but not the delimiters.

> Tokenizer(const String &aString, const String &delimiters);

The String argument must be passed by value (which amounts to a
refcounted reference to the data). Passing it by reference would alias
the caller's String, bringing us back to the original problem.

>
> // current token, named and STL-like interfaces
> String token() const;
> String operator *() const { return token(); }
>
> // move to the next token, named and STL-like interfaces
> Tokenizer &operator ++() { advance(); return *this; }
> void advance();

I'd add a nextToken interface which combines the two, for convenience.

> // end-of-file condition
> bool atEnd() const;
>
> // current delimiter (optional)
> String delimiter() const;

And I'd allow changing the delimiters during parsing.
Would anyone find it useful to have multi-character delimiters?
It'd complicate the tokenizer code quite a bit, but if it's useful and it
simplifies the calling code...
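
To illustrate the difference between a delimiter set and a multi-character delimiter, here is a minimal sketch, assuming std::string stands in for String. The function names firstTokenBySet and firstTokenBySequence are made up for this example.

```cpp
#include <cstddef>
#include <string>

// Delimiter *set*: the token ends at the first character that is a
// member of `set`.
std::string firstTokenBySet(const std::string &input, const std::string &set) {
    const std::size_t end = input.find_first_of(set);
    return input.substr(0, end);  // npos means "up to the end of input"
}

// Multi-character delimiter: the token ends only where the whole
// sequence `seq` matches.
std::string firstTokenBySequence(const std::string &input, const std::string &seq) {
    // An empty sequence would match at position 0, so treat it as "no delimiter".
    const std::size_t end = seq.empty() ? std::string::npos : input.find(seq);
    return input.substr(0, end);
}
```

With the input "a\rb\r\nc" and the delimiter string "\r\n", the set version stops at the lone "\r" and returns "a", while the sequence version returns "a\rb". A tokenizer supporting both would need to know which meaning the caller intends.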

> ...
> };
>
> I would not provide both named and STL-like interfaces. We should pick
> one approach as it would simplify code maintenance and documentation. If
> there are no strong opinions, let's use the interface with explicit
> names.

I agree.
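
Putting the pieces of this thread together, the named-only interface could look roughly like the sketch below. It assumes std::string as a stand-in for String and single-character delimiters taken from a set. nextToken() and setDelimiters() are hypothetical names for the two additions suggested above, and the handling of empty tokens is an arbitrary choice made for this sketch, not something settled in the discussion.

```cpp
#include <cstddef>
#include <string>
#include <utility>

class Tokenizer {
public:
    // Both arguments are taken by value and stored as copies, so the
    // caller's strings are never aliased.
    Tokenizer(std::string input, std::string delimiters)
        : input_(std::move(input)), delims_(std::move(delimiters)) {
        findEnd();
    }

    // End-of-input condition.
    bool atEnd() const { return start_ >= input_.size(); }

    // Current token; empty once atEnd().
    std::string token() const {
        return atEnd() ? std::string() : input_.substr(start_, end_ - start_);
    }

    // The delimiter character that ended the current token, or an empty
    // string if the token ran to the end of the input.
    std::string delimiter() const {
        return end_ < input_.size() ? std::string(1, input_[end_]) : std::string();
    }

    // Move to the next token. Adjacent delimiters yield empty tokens;
    // a trailing delimiter does not produce a final empty token.
    void advance() {
        if (atEnd())
            return;
        start_ = end_ < input_.size() ? end_ + 1 : input_.size();
        findEnd();
    }

    // Convenience: return the current token, then advance.
    std::string nextToken() {
        std::string t = token();
        advance();
        return t;
    }

    // Change the delimiter set mid-parse; the current token is re-scanned
    // from its start using the new set.
    void setDelimiters(std::string delimiters) {
        delims_ = std::move(delimiters);
        findEnd();
    }

private:
    void findEnd() {
        end_ = input_.find_first_of(delims_, start_);
        if (end_ == std::string::npos)
            end_ = input_.size();
    }

    const std::string input_;  // const copy of the text being parsed
    std::string delims_;
    std::size_t start_ = 0;    // start of the current token
    std::size_t end_ = 0;      // one past the end of the current token
};
```

A caller would typically loop with while (!t.atEnd()), reading token() and delimiter() and then calling advance(), or simply call nextToken() when the delimiter is of no interest.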

-- 
 /kinkie
Received on Fri Sep 05 2008 - 08:25:56 MDT
