Re: pseudo-specs for a String class: tokenization

From: Kinkie <gkinkie_at_gmail.com>
Date: Thu, 4 Sep 2008 18:00:37 +0200

On Thu, Sep 4, 2008 at 5:24 PM, Alex Rousskov
<rousskov_at_measurement-factory.com> wrote:
> On Thu, 2008-09-04 at 11:29 +0200, Kinkie wrote:
>
>> /**
>> * strtok()-equivalent, without (some of) the brain-damage.
>> * Will extract a token from the beginning of the KBuf, containing
>> * all chars up to the first occurrence of any of the chars in
>> * <i>delim</i> or end-of-KBuf if no delimiter can be found.
>
> What is returned when the string contains nothing but two delimiters?
> Two empty tokens? One empty token? end-of-KBuf? The above needs to be
> polished to make this clear.

Expected result is three zero-length tokens, then end-of-KBuf.

>
>> * The tokenized KBuf is modified, by removing off its head
>
> Do you modify (and hence, usually copy) the underlying shared memory
> buffer or just the string offset? Since your design morphs the two
> classes together, it is difficult to say what "modifying KBuf" actually
> means...

Just the offset. So far the only real modification performed on the
underlying memory is appending, and then only in certain circumstances
(generally speaking, when the KBuf is at the tail of the underlying
memory storage).

>> * the returned token AND the delimiter.
>
> Do not return the delimiter, at least not by default. A typical user
> wants a stream of tokens, not the stuff that separates them. Also, there
> is often no delimiter at the end of the string...

The delimiter is not returned; it's just removed from the "remaining"
part. In other words, it is lost (unless the KBuf is saved before
starting tokenization).

> BTW, this is yet another case where a Tokenizer class would be better
> than let-String-do-everything approach because a tokenizer object can at
> any time return the current token, the current delimiter, and/or both,
> without performance overhead or design complications.

There is no such thing as "current delimiter"; it's supplied by the
caller each time.

>> This is by explicit design choice, because squid is using C string
>> manipulation functions. Removing some of the brain-damage is good,
>> especially if doing so also buys performance, but rewriting clients
>> from scratch something I'd reserve as a second step.
>
> No String APIs discussed so far require "rewriting clients from
> scratch". All APIs will require a lot of mundane client changes. I would
> rather suffer through all those changes once (with a good API) than do
> them twice, first with a partially brain-damaged API and then with a
> damage-free API.
>
>> Also it is (at least to me) a very natural way of seeing things,
>> although it's probably not extremely clean. I took the pragmatic
>> approach over the "clean design" approach.
>
> Clean design can be pragmatic. It is better to avoid such arguments for
> the sake of a constructive debate. It is almost guaranteed that others
> will feel that their approach is more pragmatic and natural than yours.
>
> You can say things like "my design models strtok() API" or "my design
> models standard or boost library APIs" or "my design is a new wheel" but
> do not call your approach "pragmatic" implying that others are not.
>
>> > This is a good example why poor design leads to confusing code (and
>> > other problems). We have lots of other examples in Squid code where an
>> > API can be used five different ways, only one of them being correct, but
>> > three can be found spreading through the actual code.
>>
>> 252 lines of the source out of 917 are documentation. If there's
>> brain-damage, I hope it's well-documented and at least consistently
>> applied.
>
> If we are to suffer through String/char*/MemBuf replacement, I want to
> suffer for the best String API we can get, an API that does not need to
> be replaced for the foreseeable future. I do not care much if other
> replacement APIs are well-documented and internally consistent.
>
>> In the meantime I've implemented the storage change you asked, to use
>> (offset,len) in place of (char*, len).
>> The impact on code complexity was unexpectedy low, and so I agree to
>> it and I'm not going back.
>
> Glad we are making progress, at least with the little internal things.
>
> I must say that this is not a normal review process but more like "let's
> build a String API together" project, where developers design
> preferences differ significantly.
>
> I would recommend freezing implementation until the API is fixed and
> accepted. Otherwise, you are going to waste a lot of time on
> [re]documenting and [re]implementing the throw-away code.
>
> Initially, I was under impression that you want us to review the API.
> The "pseudo-specs" subject of the thread confused me, I guess. Now it
> sure feels like you place a lot of value (and working hard) on the
> internal code, which worries me because I do not want you to waste time
> on redoing the implementation.

I believe I'm almost API-complete now.
I'll update the wiki page with an extract of the doxygen-generated
reference, which can be used as the basis of an actual review. Then
it's collective interface cut 'n paste until we're happy. Then the
real work (integration) starts.

-- 
 /kinkie
Received on Thu Sep 04 2008 - 16:00:45 MDT

This archive was generated by hypermail 2.2.0 : Thu Sep 04 2008 - 12:00:04 MDT