Re: pseudo-specs for a String class: tokenization from Alex Rousskov on 2008-09-05 (squid-dev)

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Fri, 05 Sep 2008 11:44:11 -0600

On Fri, 2008-09-05 at 17:47 +0200, Kinkie wrote:
> On Fri, Sep 5, 2008 at 5:02 PM, Alex Rousskov
> <rousskov_at_measurement-factory.com> wrote:
> > On Fri, 2008-09-05 at 10:19 +0200, Kinkie wrote:
> >> On Fri, Sep 5, 2008 at 4:43 AM, Alex Rousskov
> >> <rousskov_at_measurement-factory.com> wrote:
> >> > Just like String, the iterator interface is pretty standard. For our
> >> > Tokenizer, we can simplify it a little unless others think that
> >> > compatibility with standard library algorithms is worth the trouble.
> >> > Here is a sketch:
> >> >
> >> > class Tokenizer {
> >> > public:
> >> > Tokenizer(); // immediately atEnd
> >>
> >> I'd avoid the default constructor entirely.
> >
> > Bad idea. The default constructor does not hurt in this case. It does
> > help when you want another method to initialize the tokenizer or when
> > you want to reset the already initialized tokenizer.
>
> A tokenizer only has meaning when attached to a KBuf (String,
> whatever), that's what I meant by not having a constructor without an
> attached KBuf.

From a practical point of view, you may not have the right string to
"attach" to at the time of construction, and attaching to the wrong
string is worse than meaningless.

From a design point of view, a basic tokenizer that is atEnd() with or
without the attached buffer is perfectly fine and meaningful because you
cannot do much with an atEnd tokenizer.

As we add more bells and whistles to the Tokenizer class, the meaning of
some methods may indeed become vague for an unattached tokenizer. For
example, what should the originalString() or source() method return if
we have one? For simplicity's sake, we can solve that problem by declaring
that the default constructor has the same visible effect as the
Tokenizer(String(), String()) constructor.
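
To make that rule concrete, here is a minimal sketch. It assumes std::string
as a stand-in for the proposed String class, and the source() accessor is
only the hypothetical method mentioned above, not an agreed interface.

```cpp
#include <string>

// Sketch only: std::string stands in for the proposed String class.
class Tokenizer {
public:
    // Same visible effect as Tokenizer(std::string(), std::string()):
    // nothing attached, no delimiters, immediately atEnd().
    Tokenizer() : Tokenizer(std::string(), std::string()) {}

    Tokenizer(const std::string &aString, const std::string &delimiters)
        : source_(aString), delimiters_(delimiters) {}

    // With an empty source there is nothing left to scan.
    bool atEnd() const { return pos_ >= source_.size(); }

    // Hypothetical accessor: an unattached tokenizer reports an empty string.
    const std::string &source() const { return source_; }

private:
    std::string source_;
    std::string delimiters_;
    std::string::size_type pos_ = 0;
};
```

With this definition, source() on a default-constructed tokenizer has an
obvious answer (the empty string), so no method needs a special case for the
"unattached" state.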

> >> I'd rather add a version which takes the String but not the delimiters.
> >
> > I would recommend avoiding implicit conversions from String to anything
> > and I doubt there is a reasonable set of default delimiters.
>
> Why would there be an implicit conversion?

Ask Amos -- he has suffered enough from it to give an entertaining
answer :-). Or see the attached source file.
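
The attachment is not reproduced in this archive. Independently of it, the
general hazard can be sketched as follows: a constructor callable with a
single String argument and not marked explicit acts as a converting
constructor. The class and function names below are hypothetical, and
std::string stands in for String.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// A one-argument constructor without "explicit" is a converting constructor.
struct LooseTokenizer {
    LooseTokenizer(const std::string &s) : source(s) {}
    std::string source;
};

// Marking the constructor explicit disables the silent conversion.
struct StrictTokenizer {
    explicit StrictTokenizer(const std::string &s) : source(s) {}
    std::string source;
};

// Hypothetical function that expects a tokenizer, not a string.
inline std::size_t sourceLength(const LooseTokenizer &t) {
    return t.source.size();
}

// A plain string is accepted wherever a LooseTokenizer is expected ...
static_assert(std::is_convertible<std::string, LooseTokenizer>::value,
              "std::string converts implicitly to LooseTokenizer");
// ... but not where a StrictTokenizer is expected.
static_assert(!std::is_convertible<std::string, StrictTokenizer>::value,
              "explicit constructor blocks the implicit conversion");
```

A call such as sourceLength(someString) compiles without complaint and
builds a temporary tokenizer behind the caller's back. The two-argument
constructor avoids this for calls that pass a single String.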

> And you're right, just as a Tokenizer has no meaning without a KBuf,
> then it also has none without delimiters.

Tokenizer may have meaning when it has nothing. We could assign some
meaning to a Tokenizer that has a string but not delimiters (e.g., treat
that as an empty delimiter set), but I think such unusual usage should
be explicit: Tokenizer(myString, String()).

> >> > Tokenizer(const String &aString, const String &delimiters);
> >>
> >> String arg must be passed by value (which translates to refcounted ref
> >> to the data). Passing by reference will alias the String, falling back
> >> into the original problem.
> >
> > I am sorry, but you are mistaken. String arguments like that should be
> > passed by reference. I do not know what "alias the String" means, but it
> > is perfectly safe and noticeably more efficient to pass Strings by
> > reference in contexts like that. This has been discussed recently
> > already.
>
> The point is that passing an object by (c++) reference does not create
> a new copy. No new copy means that the (Kbuf-level) refcounts do not
> get increased, it is "just" an alias for the original KBuf, with all
> that it means

That is exactly what we want for a method parameter.

> (no content freezing, can be appended to while the
> Tokenizer is running, etc).

Tokenizer itself cannot append to the string parameter because it is a
const parameter (being also a reference is irrelevant here). Code that
has access to a non-const copy of the same string is free to modify the
string, of course. It all "just works" and is standard practice.

If you are thinking about threads, then references must not be passed
across thread boundaries. However, Tokenizer will never be a thread so,
again, there is no API problem here either.

> As long as we're dealing with the KBuf class itself, that's no problem
> and is a welcome optimization. But here we must make sure that
> somewhere a copy of the KBuf object is created. Granted, it may be
> done WITHIN the call; I'll just have to make sure that it gets done.

If it is not done within the call, then there is still no danger (but no
string to work with either!).

The only danger here is that somebody will declare a Tokenizer class
data member of the reference type and store a reference. That danger
exists regardless of the constructor parameter type and no design can
eliminate it. Hopefully, such bugs will be caught by review (or by the
compiler, if we have code that assigns Tokenizers).
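
A short sketch of both cases, assuming std::string in place of the
refcounted String (so the copy below is a deep copy, where String would only
bump a reference count). DanglingTokenizer is a hypothetical anti-example,
not something anyone proposed.

```cpp
#include <string>
#include <type_traits>

// Parameters arrive by const reference; the members are values, so the
// copy is made within the call, in the initializer list.
class Tokenizer {
public:
    Tokenizer(const std::string &aString, const std::string &delimiters)
        : source_(aString), delimiters_(delimiters) {}

    const std::string &source() const { return source_; }

private:
    std::string source_;      // own copy, unaffected by the caller's string
    std::string delimiters_;
};

// Anti-example: the data member itself is a reference, so it can outlive
// the string it refers to.
struct DanglingTokenizer {
    explicit DanglingTokenizer(const std::string &s) : source(s) {}
    const std::string &source;
};

// A reference member deletes the implicit copy assignment operator, which
// is how code that assigns tokenizers would expose the bug at compile time.
static_assert(!std::is_copy_assignable<DanglingTokenizer>::value,
              "a reference member makes the class non-assignable");
static_assert(std::is_copy_assignable<Tokenizer>::value,
              "value members keep the class assignable");
```

After construction, the caller may append to its own string without
affecting what the first Tokenizer sees.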

> > If we stick with named interfaces, then do this:
> >
> > // move to the next token, named and STL-like interfaces
> > Tokenizer &operator ++() { return next(); }
> > Tokenizer &next();
>
> Agreed.
>
> > so that you can write:
> >
> > tokenizer.next().token()
>
> No.
> If knowing what the actual separator was is important I'd rather:
> bool next(); //returns false @end-of-string
> KBuf& token();
> char separator();
>
> so it becomes:
> while (tokenizer.next()) {
>   KBuf t = tokenizer.token();
> }

The above loop misses the first token and I think you are switching
topics (the original question was how to design nextToken and not how to
loop).

For nextToken, next().token() can hardly be improved. It is not perfect
because next() may end up at the end of the string, but that is the
problem with nextToken idea itself, not the implementation.

For looping, I have already posted the correct looping sketch. Here it
is with named interfaces:

  for (Tokenizer tMaker(str, dels); !tMaker.atEnd(); tMaker.next()) {
    String token = tMaker.token();
    ...
  }
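
For readers who want to run that loop, here is one possible minimal
implementation of the named interface. It is a sketch under assumptions not
settled in this thread: std::string replaces String, delimiters are a set of
single characters, runs of delimiters are skipped (so no empty tokens), and
collectTokens() is a hypothetical helper added only for illustration.

```cpp
#include <string>
#include <vector>

class Tokenizer {
public:
    Tokenizer() = default; // immediately atEnd

    // Positions the tokenizer on the first token, if there is one.
    Tokenizer(const std::string &aString, const std::string &delimiters)
        : source_(aString), delimiters_(delimiters) { findToken(0); }

    bool atEnd() const { return begin_ == std::string::npos; }

    // Current token; empty when atEnd().
    std::string token() const {
        return atEnd() ? std::string() : source_.substr(begin_, end_ - begin_);
    }

    // Move to the next token; named and STL-like interfaces.
    Tokenizer &next() {
        if (!atEnd())
            findToken(end_);
        return *this;
    }
    Tokenizer &operator++() { return next(); }

private:
    // Skip delimiters starting at "from", then record the token bounds.
    void findToken(std::string::size_type from) {
        begin_ = source_.find_first_not_of(delimiters_, from);
        if (begin_ == std::string::npos)
            return; // only delimiters (or nothing) left
        end_ = source_.find_first_of(delimiters_, begin_);
        if (end_ == std::string::npos)
            end_ = source_.size();
    }

    std::string source_;
    std::string delimiters_;
    std::string::size_type begin_ = std::string::npos;
    std::string::size_type end_ = std::string::npos;
};

// Hypothetical helper wrapping the loop from the message.
inline std::vector<std::string> collectTokens(const std::string &str,
                                              const std::string &dels) {
    std::vector<std::string> tokens;
    for (Tokenizer tMaker(str, dels); !tMaker.atEnd(); tMaker.next())
        tokens.push_back(tMaker.token());
    return tokens;
}
```

Because the constructor already stands on the first token, the for loop
visits every token, whereas a loop that calls next() before the first
token() would skip it.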

> >> Would anyone think it'd be useful to have non-single-char delimiters?
> >> It'd complicate the called code quite a bit, but if it's useful and it
> >> simplifies the calling code...
> >
> > I think we will eventually have a DelimiterSet or StrFinder class so
> > that we can support string delimiters, RE delimiters, and arbitrary code
> > delimiters. I would not change the API though. Currently, our
> > DelimiterSet or StrFinder is a String class, which is interpreted
> > internally as a set of chars. Eventually, that interpretation would be
> > up to the passed finder object...
> >
> > You can typedef Tokenizer::Finder to String right now. I did not propose
> > that to keep things simple. It would be easy to change later without any
> > effects on the caller code.
>
> Nah. Might as well do that now.
> And the class hierarchy makes sense. A pure-virtual class which
> ignores the actual matching method.

Are you sure you want to dive into that now? Given how much time it takes
us to agree on trivial/standard things, would it not be better to start with
something simple, well-designed, and immediately useful? And then add
Finder if we need it? Again, given a correct implementation, the callers
will most likely see no difference when we add Finder support.

> How would Kbuf::Tokenizer::Finder sound? It would even avoid the
> wart of friend classes.

If you insist on starting to complicate things now, then:

- Tokenizer and StringFinder should be stand-alone classes. There is no
reason to place them inside a String or buffer class. Classes are not
namespaces. Same for placing StringFinder inside Tokenizer.

- StringFinder will have a virtual find() function that determines where
the matching substring is. It will be used by Tokenizer to find
delimiters (not tokens!). There may be a couple of ways to design this
right, but I would rather not spend time on it now, in the hope that you
will agree that we should focus on the basics first.
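
Purely as an illustration of the second point, and not as a proposed design,
one of those ways might look like the sketch below. The find() signature,
the length out-parameter, and the two concrete finder classes are all
assumptions; std::string again stands in for String.

```cpp
#include <string>

// Stand-alone abstract finder: locates the next delimiter, not the token.
class StringFinder {
public:
    typedef std::string::size_type size_type;

    virtual ~StringFinder() = default;

    // Returns the offset of the next delimiter at or after "from", or
    // std::string::npos if there is none; "length" receives its size.
    virtual size_type find(const std::string &s, size_type from,
                           size_type &length) const = 0;
};

// Today's behaviour: the delimiter String is read as a set of characters.
class CharSetFinder : public StringFinder {
public:
    explicit CharSetFinder(const std::string &chars) : chars_(chars) {}

    size_type find(const std::string &s, size_type from,
                   size_type &length) const override {
        const size_type pos = s.find_first_of(chars_, from);
        length = (pos == std::string::npos) ? 0 : 1;
        return pos;
    }

private:
    std::string chars_;
};

// A multi-character delimiter, matched as a whole substring.
class SubstringFinder : public StringFinder {
public:
    explicit SubstringFinder(const std::string &needle) : needle_(needle) {}

    size_type find(const std::string &s, size_type from,
                   size_type &length) const override {
        // An empty needle never matches (assumption made for this sketch).
        const size_type pos =
            needle_.empty() ? std::string::npos : s.find(needle_, from);
        length = (pos == std::string::npos) ? 0 : needle_.size();
        return pos;
    }

private:
    std::string needle_;
};
```

A Tokenizer built on such an interface would only call find(), so callers
that pass a plain character set today would not need to change.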

Thank you,

Alex.

text/x-c++src attachment: stored

Received on Fri Sep 05 2008 - 17:45:11 MDT
