Re: [PATCH] SBuf c-string comparisons

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Sun, 13 Apr 2014 11:53:13 -0600

On 04/13/2014 05:13 AM, Amos Jeffries wrote:
>>> + /// comparison with a C-string
>>> + int compare(const char *s, SBufCaseSensitive isCaseSensitive, const size_type n) const;
>> ...
>>> + const size_type byteCompareLen = min(n, length());
>>> + ++stats.compareSlow;
>>> + int rv = 0;
>>> + if (isCaseSensitive == caseSensitive) {
>>> + rv = memcmp(buf(), s, byteCompareLen);
>>> + } else {
>>> + rv = memcasecmp(buf(), s, byteCompareLen);
>>> + }
>>
>> The code above may access s bytes beyond s's 0-terminator (if any). The
>> header file claims to provide a C-string [i.e., strncmp()] interface,
>> but the .cc file implements something close to a memcmp() interface instead.
>>
>> The difference between memcmp and strcmp APIs is that the former stops
>> at n while the latter also stops at the first terminator character
>> (whichever comes first). The difference is not important for readable
>> text, of course, but it becomes critical when dealing with "binary" data.
>>
>
> memcmp() is documented at stopping at the byte when the strings differ.

Both memcmp and strncmp do that, of course. The difference is in where
they stop when characters they looked at so far do _not_ differ, as
summarized above.

> The if-statement guaranteeing length() <= n,

Yes.

> will truncate this SBuf so min(n, length()) is never greater than n

Sure (by the definition of min).

> and thus should never be strlen(s)+terminator.

This part (if I interpret it right) is incorrect. The "n" parameter in a
strncmp API may exceed both length() and strlen(s) by an arbitrary
amount. Consider the following for a c-string-based API:

  SBuf: 8192 bytes: "GET" followed by 8189 null characters.
  s: 4 bytes: "GET" followed by a single null character.
  n: 16384

  Expected SBuf.compare(s,, n) result: 0 (per c-string API **)
  Actual SBuf.compare(s,, n) result: random with occasional segfaults

The key here is that n does not tell strncmp() that s has at least n
characters. It only tells strncmp() that it should not go beyond n-th
character when performing a comparison. strncmp() must stop:

  * at the of end of the left c-string
  * at the end of the right c-string
  * at the first different character
  * after n equal characters have been compared

whichever happens first. The proposed code implements a different API.

> NOTE that strn*cmp variants will produce ==0 results in the above broken
> case whereas SBuf comparison will produce ==1 or ==-1 (input sizes differ).

Let's investigate that after strncmp variants are fixed so that we are
not comparing broken code.

>>> + // pretend we have a terminator and check against s.
>>> + return s[length()] == '\0' ? 0 : -1;
>>
>> Similar to the above, what if s[length()] does not exist? For example,
>> in pseudo code:
>>
>> SBuf buf("1234");
>> char s[3] = { '1', '2', '3', }; // s[3] does not exist
>> buf.cmp(s, 3);
>>
>> Please note that the above problem applies to both strCmp() and memCmp()
>> flavors of the code.
>
> The SBuf truncation path is taken on buf. At which point it is
> memcmp("123", s, 3) ==> 0.

You are right. I cannot find an example where s[length()] does not exist
in the current code, sorry.

Please note that this is separate from the "C-string comparison must
stop at the first null-terminator" problem discussed above. That problem
is still open and may result in invalid accesses during memcmp() and
memcasecmp() calls.

>> Also, it is not clear (to me) what the above code tries to accomplish.
>> The comment does not really clarify things [for me], especially since
>> "we" usually refers to the "this" object but the expression does not
>> check the "this" object buffer. If this code stays, please rephrase.
>
> C-string comparisons are including the terminator bytes for their final
> EOS results **.
>
> This handles all the cases where:
> SBuf left = "foo"
> char* right = "foo"
> left.cmp(right, 4);
>
> In c-string comparisions these produce ==0. Using the X-longer-then-Y
> logic pattern it would produce ==1 like "foo" vs "fooz".
>
> Since SBuf has no terminator we cannot compare the terminator byte like
> c-string comparisons do. Accessing buf[length()]==s[length()] would be a
> buffer overrun on buf. So we check '\0' == s[length()] instead.
>
> ie. we are guaranteed to be at the end of 'buf', check if 's' is the
> c-string terminator.

Let's come back to this once the primary search code is fixed. I am sure
we can find a way to clarify this code description if it stays.

Thank you,

Alex.
(**) The exact return value is debatable in this case, but I think that
the c-string-based comparison API should treat SBuf contents as a
c-string. Thus, any characters after the first null terminator (if any)
should be ignored while implementing this API. This is a little messier
than I would prefer, but we are paying the price for using a single SBuf
class for both "bytes-based" and "text-based" operations.
Received on Sun Apr 13 2014 - 17:53:26 MDT

This archive was generated by hypermail 2.2.0 : Mon Apr 14 2014 - 12:00:12 MDT