Re: StringNG merge?

From: Kinkie <gkinkie_at_gmail.com>
Date: Sun, 16 Dec 2012 23:02:24 +0100

>> + char *begin=buf()+startpos,*end=buf()+length(), *tmp;
>> + char *begin=buf(),*end, *tmp;
> etc.
>
> Please do not do that. One declaration per line, especially when
> initialization and pointers are involved. It is very difficult to read
> the above!

Ok, fixed.

> Please do not declare tmp until you need it. Same for other finds() and
> rfinds(). As you know, in some cases, you may not need it at all.

Done.

>> +SBuf::size_type
>> +SBuf::find(const SBuf &str, size_type startpos) const
>> +{
>> + if (startpos==npos)
>> + startpos=length();
>> + else
>> + startpos=min(startpos,length());
>> + char *begin=buf()+startpos,*end=buf()+length(), *tmp;
>> + char needle=str[0];
>
> What if "str" is empty? We should not call str[0] then.

Yup, there'd be an exception thrown here.

> I do not know what the correct behavior when searching for an empty
> string is (0 or npos). I suspect it is zero, but please implement
> whatever std::string() does in this case.

std::string::find() returns 0, rfind() returns length(). Handling explicitly.

> Same for rfind(Sbuf...). And please remove Must(not empty) in find
> methods that have it (after checking or implementing the right
> empty-string/set behavior there).

Must(startPos == npos || startPos >= 0) is an artifact of the fact
that SBuf::npos is -1 and size_type is an int32_t. If we changed
size_type to uint32_t and npos to something like MAXINT, we could do
away with lots of these changes. It should be relatively
straightforward, would it be wise at this stage in development?

>> +SBuf::size_type
>> +SBuf::find(const SBuf &str, size_type startpos) const
>> +{
> ...
>> + while (end > begin) {
> ...
>> + begin=tmp+1;
>
> Please rewrite the condition as "begin < end" for clarity sake.

Done.

>> +SBuf::size_type
>> +SBuf::rfind(const SBuf &str, SBuf::size_type endpos) const
>> +{
>> + char *begin=buf(),*end, *tmp;
>> + if (endpos==npos) {
>> + end=buf()+length();
>> + } else {
>> + endpos=min(endpos,length());
>> + end=buf()+endpos;
>> + }
>
> For consistency sake, please calculate endpos and then set "end",
> outside the if-statement. It may also help you address the problem below:

Done

>> + if (buf()+length() < tmp+str.length()) { //not enough chars to match the whole needle
>> + debugs(SBUF_DEBUGSECTION,8,HERE << "needle too big for haystack");
>> + end=tmp-1;
>> + continue;
>> + }
>
> If I interpret STL documentation correctly, the limit is not buf() +
> length() but endpos. We should not match against characters beyond
> endpos, even if the match _starts_ before endpos. Please double check me
> on that.

I am not 100% sure I understand what you mean, but this simple test:
std::string haystack("foo bar gazonk");
std:;string needle("gazonka");
cout << "find overflow: " << haystack.find(needle) << endl;

returns npos (as I'd expect)

> Perhaps we should always start our search with endpos-str.length() or
> equivalent? Why search where we cannot find anything?

yup.
pity memmem() is not standard (nor reliable on glibc).

>> +SBuf::size_type
>> +SBuf::rfind(const SBuf &str, SBuf::size_type endpos) const
>> +{
> ...
>> + while (end > begin) {
>
> These names are unfortunate, because you actually start with what you
> call "end". Perhaps poor naming explains why the above condition is
> buggy? We are missing matches at the very beginning of the string, where
> end == begin.

Reworked the code, renamed the variable and added specific unit tests.

> BTW, you probably should add a few unit test cases to cover bugs that
> deal with wrong borderline conditions such as a matching string at the
> very beginning/end, string with a match ending outside lastpos, empty
> string, etc. If I am finding basic bugs by visual analysis after so many
> reviews, I suspect there are probably more there.

Yes.
I've tried to highlight those cases as comments in the unit tests.
More are welcome.

>> + char *cur=buf()+startpos, *end=bufEnd();
>> + ++stats.find;
>> + while (cur < end) {
>> + char *s1=set.buf(), *s2=set.buf()+set.length();
>> + while (s1 < s2) {
>> + if (*s1==*cur) {
>> + debugs(SBUF_DEBUGSECTION,7,HERE << "found at position " << cur-buf());
>> + return (cur-buf());
>> + }
>> + ++s1;
>> + }
>> + ++cur;
>> + }
>
> Should the inner loop be replaced with a memchr() call? It might be
> faster than or "manual" loop. Or do you think memchr() is slower when
> the buffer is short (a common case here)? I do not know.

memchr is better (I checked glibc's implementation). Switching to it.

> If you do not use memchr(), please save "*cur" before the loop in a
> const char variable and compute s2 before the outer loop as it does not
> depend on "cur".
>
>
> Out of time again. More to come.

Sorry for the big delay. I had little time to work on this for a
while; for the next 10 days I have plenty, I'd like to try a sprint.

Thanks!

-- 
    /kinkie
Received on Sun Dec 16 2012 - 22:02:32 MST

This archive was generated by hypermail 2.2.0 : Mon Dec 17 2012 - 12:00:11 MST