Re: StringNG merge?

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Sun, 11 Nov 2012 15:21:34 -0700

On 11/03/2012 05:55 AM, Kinkie wrote:

> Feature-branch is at lp:~kinkie/squid/stringng

I still find serious bugs in that branch, unfortunately (but not all of
the comments below are serious bugs, of course):

> === added file 'src/HttpHdrCc.h.moved'
> --- src/HttpHdrCc.h.moved 1970-01-01 00:00:00 +0000
> +++ src/HttpHdrCc.h.moved 2012-08-13 20:56:10 +0000
> @@ -0,0 +1,56 @@
> +/*
> + * $Id$

When I run "bzr send -o /tmp/bundle ../trunk" from the stringng branch
and then look at the resulting bundle, it contains the above blob. If
this is not some kind of bzr fluke, please remove src/HttpHdrCc.h.moved
from your branch or double check that you are not modifying the code
that is no longer there.

> === modified file 'src/Makefile.am'
> --- src/Makefile.am 2012-10-29 23:12:04 +0000
> +++ src/Makefile.am 2012-11-01 21:35:44 +0000
...
> @@ -142,7 +153,7 @@
> UNLINKDSOURCE = unlinkd.h unlinkd.cc
> UNLINKD = unlinkd
> else
> -UNLINKDSOURCE = unlinkd.h
> +UNLINKDSOURCE =
> UNLINKD =
> endif
>
> @@ -259,6 +270,7 @@
> CommCalls.h \
> DescriptorSet.cc \
> DescriptorSet.h \
> + SquidConfig.cc \
> SquidConfig.h \
> SquidConfig.cc

I see these and many other changes to src/Makefile.am, some of which are
clearly buggy (like adding SquidConfig.cc twice above) and some quite
significant and seemingly unrelated to StringNG scope. What is going on
here? Is it possible that something when wrong during merging of
Makefile.am changes? I know they are often difficult to get right.

> MemBlobStats::MemBlobStats(): alloc(0), live(0), append(0)
> {}
>
> +MemBlobStats&
> +MemBlobStats::operator += (const MemBlobStats& s)
> +{
> + alloc+=s.alloc;
> + live+=s.live;
> + append+=s.append;
> + liveBytes+=s.liveBytes;
> +
> + return *this;
> +}
> +

This is not a bug in this patch, but I noticed that MemBlobStats
constructor does not initialize liveBytes (but the increment operator
does increment it). Please fix in trunk.

> +class OutOfBoundsException : public TextException
> +{
> +protected:
> + SBuf _buf;
> + SBuf::size_type _pos;
> +

Please move protected members after public ones (and private after
protected).

> + debugs(SBUF_DEBUGSECTION,DBG_DATA,
> + id << " created, copyfrom="<<S.id);
...
> + debugs(SBUF_DEBUGSECTION,DBG_DATA,
> + id << " created, copyfrom string");
...
> + debugs(SBUF_DEBUGSECTION,DBG_DATA,
> + id << " created from c-string" <<
> + ", pos=" << pos << ", n=" << n);
...
> + debugs(SBUF_DEBUGSECTION,DBG_DATA,
> + id << " destructed");

These and probably others should not use DBG_DATA because the latter is
reserved for a "large data dump" and they do not dump any string data at
all.

> + idx=s1.rfind(s5);
> + rv=s1.scanf("%s , %d , %f",s,&i,&f);
> + s=foxb;
> + s=st.token();
> + sng=sng.toLower();
> + sng=foxb;
> + const char *ref=t.rawContent();
> + const char *match=t.rawContent();
> + ref=match;
> + for (SBufList::iterator i=res.begin(),j=control.begin(); i!=res.end(); i++,
...
> + SBuf& append(const char * S, size_type pos = 0, size_type n = npos);
> + SBuf& append(const std::string &string, size_type pos = 0, size_type n = npos);
> + message = static_cast<char*>(xmalloc(buf.length()+1));
> + for (i = begin(); i != end(); i++)
> + virtual int_type overflow(int_type aChar = traits_type::eof()) {
> *gross_size = pool ? StrPoolsAttrs[i].obj_size : net_size;
> + foo = s1.rawContent();
> + foo = s1.c_str();

The above two groups of randomly selected lines that show different
style of using "=" and other operators. Some of the patch code is using
compact "a=b" style, the other is using readable "a = b" style (same for
other operators). Sometimes a single line mixes two styles! Can you
change it to use the readable style throughout? I understand when old
Squid code contains different styles from different people, but should
not we expect better consistency from a single-person patch?

Same for separating call arguments. Sometimes you use compact "a,b" and
sometimes "a, b".

I think most of these can be fixed with sed or similar, without a lot of
manual work.

> +SBuf::SBuf(const char *S, size_type pos, size_type n)
> + : store_(GetStorePrototype()),off_(0),len_(0)
> +{
> + Must(pos==npos || pos >= 0);
> + Must(n==npos || n >= 0);
> +
> + if (S==NULL)
> + n=0;
> + if (n==npos)
> + n=strlen(S)-pos;
> +
> + const char *actual_start=S+pos;
> + debugs(SBUF_DEBUGSECTION,DBG_DATA,
> + id << " created from c-string" <<
> + ", pos=" << pos << ", n=" << n);
> + store_=new MemBlob(estimateCapacity(n));
> + if (n > 0)
> + store_->append(actual_start,n);
> + len_=n;
> + ++stats.alloc;
> + ++stats.allocFromCString;
> + ++stats.live;

and

> +SBuf&
> +SBuf::append(const char * S, size_type pos, size_type n)
> +{
> + Must(pos==npos || pos >= 0);
> + Must(n==npos || n >= 0);
> +
> + if (S==NULL)
> + return *this;
> + if (n==npos)
> + n=strlen(S)-pos;
> +
> + debugs(SBUF_DEBUGSECTION,7,
> + id << MYNAME << ",S=c-string" <<
> + ",pos=" << pos << ",n=" << n);
> +
> + reserve(n);
> + const char *actual_start=S+pos;
> + store_->append(actual_start,n);
> + len_+=n;

These are almost identical but there are subtle differences (including
difference in problems!). Please make the constructor call
SBuf::append() instead of repeating the same code. I do not think saving
a couple of CPU cycles is worth introducing this difference in behavior
(not to mention code duplication).

The next two comments can be ignored if the above problem is addressed
(because the problematic code will disappear).

> + if (S==NULL)
> + n=0;
> + if (n==npos)
> + n=strlen(S)-pos;
> +
> + const char *actual_start=S+pos;

It is not kosher to add something to a NULL pointer and S may be NULL
when we are calculating actual_start. I recommend removing actual_start
because it is not REused. If you want to keep it, move it to where it is
used.

> + store_=new MemBlob(estimateCapacity(n));
> + if (n > 0)
> + store_->append(actual_start,n);

Let's not allocate new store_ if we are not going to use it.

> + void reserve(size_type minsize);
> +SBuf::reserve(size_type howmuch)

Please use camelCase for minsize and howmuch. You may want to name them
the same.

> + SBuf& trim(const SBuf &toremove, bool atBeginning=true, bool atEnd=true);

Please use camelCase for toRemove, especially if other parameters are
using camelCase.

Please check other method parameter names for similar camelCase
problems. "npos" should stay as is though, for STL compatibility reasons.

> +#else
> + //enabling this code path, we try to release the store without deallocating
> it.
> + // will be lazily reallocated if needed.
> + if (store_->LockCount() == 1)
> + store_->size=0;
> +#endif

Please use store_->clear() instead of modifying the member we "must
treat as read-only" according to the MemBlob API documentation.

> +SBuf&
> +SBuf::append(const std::string &s, SBuf::size_type pos, SBuf::size_type n)
> +{
> + return append(s.data(),pos,(n==npos ? npos : s.length()));
> +}

Why are we overwriting caller-supplied value of "n" (not equal to npos)
with s.length()? It looks like the third parameter calculation is wrong
on many counts. Please fix carefully.

> + debugs(SBUF_DEBUGSECTION,DBG_DATA,HERE << "same memhandle and contents:
> ");

A column at the end implies that something will follow, but nothing does.

Are you going to remove HERE from your code since trunk deprecates it now?

> + debugs(SBUF_DEBUGSECTION,8,
> + MYNAME << "this=" << (void *)this << "other=" << (void *)&S);

Please do not use MYNAME in level-8 debugging -- it is reserved for
levels zero and one. Please check the rest of the patch for this.

"this" and "other=" are missing a space between them.

You usually use "id" and not "this" for debugging. If there is no good
reason to be inconsistent with the above code, please use "id".

> + if (n==npos)
> + n=length();
> + n=min(n,length());

Please add "else" between the above two lines in SBuf::consume().

> +SBuf::size_type
> +SBuf::copy(char *aBuf, SBuf::size_type aBufLen) const
> +{
> + Must(aBufLen>0);
> +

This Must() is too restrictive. Copying an empty string into an empty
buffer should be allowed (to avoid checks in callers for this special
condition). You need aBufLen to be sufficient, not positive.

Also, we usually use "a" prefix for constructor or setter parameters. In
this case, aBuf and aBufLen are neither. Consider using "dest" and "n"
instead.

> + /** Request to extend the SBuf's available store space.
> + *
> + * After the reserve request, the SBuf is guaranteed to have at least minsize
> + * bytes of append-able backing store (on top of the currently-used portion)
> + * \throw SBufTooBigException if the user tries to allocate too big a SBuf
> + */
> + void reserve(size_type minsize);

I think it is a bad idea to make Sbuf::reserve() behave very differently
from std::string::reserve(), especially when we are trying to make other
methods behave the same. I suggest:

1) Add reserveCapacity() that behaves like std::string::reserve(). Do
not change any callers.

2) Rename reserve() method to reserveSpace(). Do not change any callers.

3) Go and replace all reserve() calls with the right one.

4) Optionally, rename reserveCapacity() to reserve(). I would not do
that unless you remove reserveSpace() completely as well.

> + * \param pos start sub-stringing from this byte. If it is
> + * greater than the SBuf length, an empty SBuf is returned

Please s/an empty SBuf is returned/Sbuf is emptied/ or similar to
indicate that we do not simply return an empty string but clear the
buffer itself.

> +SBuf::size_type
> +SBuf::find(char c, SBuf::size_type startpos) const
> +{
> + if (startpos==npos)
> + startpos=length();
> + startpos=min(startpos,length());

If startpos is npos, we can just return npos. If you do not want to do
that for some reason, please add "else" between the above two last
statements.

Same for the other find() method.

> + char *begin=buf()+startpos,*end=buf()+length(), *tmp;
> + char *begin=buf(),*end, *tmp;
etc.

Please do not do that. One declaration per line, especially when
initialization and pointers are involved. It is very difficult to read
the above!

Please do not declare tmp until you need it. Same for other finds() and
rfinds(). As you know, in some cases, you may not need it at all.

> +SBuf::size_type
> +SBuf::find(const SBuf &str, size_type startpos) const
> +{
> + if (startpos==npos)
> + startpos=length();
> + else
> + startpos=min(startpos,length());
> + char *begin=buf()+startpos,*end=buf()+length(), *tmp;
> + char needle=str[0];

What if "str" is empty? We should not call str[0] then.

I do not know what the correct behavior when searching for an empty
string is (0 or npos). I suspect it is zero, but please implement
whatever std::string() does in this case.

Same for rfind(Sbuf...). And please remove Must(not empty) in find
methods that have it (after checking or implementing the right
empty-string/set behavior there).

> +SBuf::size_type
> +SBuf::find(const SBuf &str, size_type startpos) const
> +{
...
> + while (end > begin) {
...
> + begin=tmp+1;

Please rewrite the condition as "begin < end" for clarity sake.

> +SBuf::size_type
> +SBuf::rfind(const SBuf &str, SBuf::size_type endpos) const
> +{
> + char *begin=buf(),*end, *tmp;
> + if (endpos==npos) {
> + end=buf()+length();
> + } else {
> + endpos=min(endpos,length());
> + end=buf()+endpos;
> + }

For consistency sake, please calculate endpos and then set "end",
outside the if-statement. It may also help you address the problem below:

> + if (buf()+length() < tmp+str.length()) { //not enough chars to match the whole needle
> + debugs(SBUF_DEBUGSECTION,8,HERE << "needle too big for haystack");
> + end=tmp-1;
> + continue;
> + }

If I interpret STL documentation correctly, the limit is not buf() +
length() but endpos. We should not match against characters beyond
endpos, even if the match _starts_ before endpos. Please double check me
on that.

Perhaps we should always start our search with endpos-str.length() or
equivalent? Why search where we cannot find anything?

> +SBuf::size_type
> +SBuf::rfind(const SBuf &str, SBuf::size_type endpos) const
> +{
...
> + while (end > begin) {

These names are unfortunate, because you actually start with what you
call "end". Perhaps poor naming explains why the above condition is
buggy? We are missing matches at the very beginning of the string, where
end == begin.

BTW, you probably should add a few unit test cases to cover bugs that
deal with wrong borderline conditions such as a matching string at the
very beginning/end, string with a match ending outside lastpos, empty
string, etc. If I am finding basic bugs by visual analysis after so many
reviews, I suspect there are probably more there.

> + char *cur=buf()+startpos, *end=bufEnd();
> + ++stats.find;
> + while (cur < end) {
> + char *s1=set.buf(), *s2=set.buf()+set.length();
> + while (s1 < s2) {
> + if (*s1==*cur) {
> + debugs(SBUF_DEBUGSECTION,7,HERE << "found at position " << cur-buf());
> + return (cur-buf());
> + }
> + ++s1;
> + }
> + ++cur;
> + }

Should the inner loop be replaced with a memchr() call? It might be
faster than or "manual" loop. Or do you think memchr() is slower when
the buffer is short (a common case here)? I do not know.

If you do not use memchr(), please save "*cur" before the loop in a
const char variable and compute s2 before the outer loop as it does not
depend on "cur".

Out of time again. More to come.

Thank you,

Alex.
Received on Sun Nov 11 2012 - 22:21:41 MST

This archive was generated by hypermail 2.2.0 : Fri Nov 16 2012 - 12:00:10 MST