Re: pseudo-specs for a String class: tokenization

From: Kinkie <gkinkie_at_gmail.com>
Date: Thu, 4 Sep 2008 11:29:12 +0200

On Thu, Sep 4, 2008 at 6:59 AM, Alex Rousskov
<rousskov_at_measurement-factory.com> wrote:
> On Thu, 2008-09-04 at 13:47 +1200, Amos Jeffries wrote:
>
>> >> On an unrelated issue, since it was of interest to some of us, here's
>> >> a sample of the caller code for tokenization functions (actual live
>> >> code):
>> >>
>> >> KBuf s1;
>> >> cout << "tokenization: \n";
>> >> {
>> >> s1="The quick brown fox jumped over the lazy dog";
>> >> char *needle=" ";
>> >> KBuf cs1(needle);
>> >> while (!s1.isNull()) {
>> >> cout << "token: " << s1.nextToken(cs1) << endl;
>> >> }
>> >> }
>> >> cout << endl;
>> >
>> > FWIW, I still think that tokenization should be a external to the buffer
>> > or string and should not modify them. Please see my earlier posts for
>> > details.
>
>> Alex, the basic buffer is not altered, only where the s1 offset is
>> pointing at.
>
>> From what he mentioned on IRC last night....
>>
>> Making s1 a duplicate reference to another KBuf (ie the actual in put
>> buffer) should show that the base KBuf is unchanged, but the parsing with
>> nextToken() will only spew off a child sub-string and increment the s1
>> start offset one token down the string.
>>
>> I'm in favor, it can be tuned for very efficient Parsing. And in
>> inefficient usage of it can be fixed easily.
>
> I see what you mean.
>
> The above example does not illustrate the usage you propose: It
> obviously does modify s1 and there is no way to reliably recover the
> original buffer after the iteration.

From the documentation:
    /**
     * strtok()-equivalent, without (some of) the brain-damage.
     * Will extract a token from the beginning of the KBuf, containing
     * all chars up to the first occurrence of any of the chars in
     * <i>delim</i> or end-of-KBuf if no delimiter can be found.
     * The tokenized KBuf is modified, by removing off its head
     * the returned token AND the delimiter.
     * \param delim a KBuf containing the characters delimiting the token.
     * \throw nullKBufException the KBuf to be tokenized is NULL
     * \note if the caller wishes to retain a pristine copy of the KBuf,
     * she has to save it BEFORE tokenizing.
     */

This interface is a bit of a hybrid, and it retains most of the basic
design choices of C, including the approach of not distinguishing
between what is a string (char *) and an iterator over the string
(char *).
This is by explicit design choice, because squid is using C string
manipulation functions. Removing some of the brain-damage is good,
especially if doing so also buys performance, but rewriting clients
from scratch something I'd reserve as a second step.
Also it is (at least to me) a very natural way of seeing things,
although it's probably not extremely clean. I took the pragmatic
approach over the "clean design" approach.

> This is a good example why poor design leads to confusing code (and
> other problems). We have lots of other examples in Squid code where an
> API can be used five different ways, only one of them being correct, but
> three can be found spreading through the actual code.

252 lines of the source out of 917 are documentation. If there's
brain-damage, I hope it's well-documented and at least consistently
applied.

> I still think an external class that stores the delimiter and current
> position would be a better tokenization API that would be easier/safer
> to use now and will eventually allow for more optimizations, non-trivial
> or varying delimiters, backing up, etc. And I do not think it introduces
> any performance overheads compared to your API.

Some hit, yes. Huge hit, I agree it doesn't.
Any other opinions?

In the meantime I've implemented the storage change you asked, to use
(offset,len) in place of (char*, len).
The impact on code complexity was unexpectedy low, and so I agree to
it and I'm not going back.

I'll update the wiki documentation as soon as possible.

Thanks

-- 
 /kinkie
Received on Thu Sep 04 2008 - 09:29:17 MDT

This archive was generated by hypermail 2.2.0 : Thu Sep 04 2008 - 12:00:04 MDT