Re: [PATCH] SBuf c-string comparisons

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Mon, 14 Apr 2014 22:49:52 +1200

On 14/04/2014 5:53 a.m., Alex Rousskov wrote:
> On 04/13/2014 05:13 AM, Amos Jeffries wrote:
>>>> + /// comparison with a C-string
>>>> + int compare(const char *s, SBufCaseSensitive isCaseSensitive, const size_type n) const;
>>> ...
>>>> + const size_type byteCompareLen = min(n, length());
>>>> + ++stats.compareSlow;
>>>> + int rv = 0;
>>>> + if (isCaseSensitive == caseSensitive) {
>>>> + rv = memcmp(buf(), s, byteCompareLen);
>>>> + } else {
>>>> + rv = memcasecmp(buf(), s, byteCompareLen);
>>>> + }
>>>
>>> The code above may access s bytes beyond s's 0-terminator (if any). The
>>> header file claims to provide a C-string [i.e., strncmp()] interface,
>>> but the .cc file implements something close to a memcmp() interface instead.
>>>
>>> The difference between memcmp and strcmp APIs is that the former stops
>>> at n while the latter also stops at the first terminator character
>>> (whichever comes first). The difference is not important for readable
>>> text, of course, but it becomes critical when dealing with "binary" data.
>>>
>>
>> memcmp() is documented at stopping at the byte when the strings differ.
>
> Both memcmp and strncmp do that, of course. The difference is in where
> they stop when characters they looked at so far do _not_ differ, as
> summarized above.
>
>
>> The if-statement guaranteeing length() <= n,
>
> Yes.
>
>
>> will truncate this SBuf so min(n, length()) is never greater than n
>
> Sure (by the definition of min).
>
>
>> and thus should never be strlen(s)+terminator.
>
> This part (if I interpret it right) is incorrect. The "n" parameter in a
> strncmp API may exceed both length() and strlen(s) by an arbitrary
> amount. Consider the following for a c-string-based API:
>
> SBuf: 8192 bytes: "GET" followed by 8189 null characters.
> s: 4 bytes: "GET" followed by a single null character.
> n: 16384
>
> Expected SBuf.compare(s,, n) result: 0 (per c-string API **)
> Actual SBuf.compare(s,, n) result: random with occasional segfaults
>
> The key here is that n does not tell strncmp() that s has at least n
> characters. It only tells strncmp() that it should not go beyond n-th
> character when performing a comparison. strncmp() must stop:
>
> * at the of end of the left c-string
> * at the end of the right c-string
> * at the first different character
> * after n equal characters have been compared
>
> whichever happens first. The proposed code implements a different API.
>
>
>
>> NOTE that strn*cmp variants will produce ==0 results in the above broken
>> case whereas SBuf comparison will produce ==1 or ==-1 (input sizes differ).
>
> Let's investigate that after strncmp variants are fixed so that we are
> not comparing broken code.
>
>
>
>>>> + // pretend we have a terminator and check against s.
>>>> + return s[length()] == '\0' ? 0 : -1;
>>>
>>> Similar to the above, what if s[length()] does not exist? For example,
>>> in pseudo code:
>>>
>>> SBuf buf("1234");
>>> char s[3] = { '1', '2', '3', }; // s[3] does not exist
>>> buf.cmp(s, 3);
>>>
>>> Please note that the above problem applies to both strCmp() and memCmp()
>>> flavors of the code.
>>
>> The SBuf truncation path is taken on buf. At which point it is
>> memcmp("123", s, 3) ==> 0.
>
> You are right. I cannot find an example where s[length()] does not exist
> in the current code, sorry.
>
> Please note that this is separate from the "C-string comparison must
> stop at the first null-terminator" problem discussed above. That problem
> is still open and may result in invalid accesses during memcmp() and
> memcasecmp() calls.
>
>
>
>
>>> Also, it is not clear (to me) what the above code tries to accomplish.
>>> The comment does not really clarify things [for me], especially since
>>> "we" usually refers to the "this" object but the expression does not
>>> check the "this" object buffer. If this code stays, please rephrase.
>>
>> C-string comparisons are including the terminator bytes for their final
>> EOS results **.
>>
>> This handles all the cases where:
>> SBuf left = "foo"
>> char* right = "foo"
>> left.cmp(right, 4);
>>
>> In c-string comparisions these produce ==0. Using the X-longer-then-Y
>> logic pattern it would produce ==1 like "foo" vs "fooz".
>>
>> Since SBuf has no terminator we cannot compare the terminator byte like
>> c-string comparisons do. Accessing buf[length()]==s[length()] would be a
>> buffer overrun on buf. So we check '\0' == s[length()] instead.
>>
>> ie. we are guaranteed to be at the end of 'buf', check if 's' is the
>> c-string terminator.
>
> Let's come back to this once the primary search code is fixed. I am sure
> we can find a way to clarify this code description if it stays.

I've added a unit test to catch the rare \0 mid-string case I spoke of:
 * SBuf("foo\0test").compare("foo", 9);

It works fine up to the point N(4) > strlen("foo"). After that point our
function returns 1 to indicate that the SBuf is a longer string, whereas
strncmp/strncasecmp return 0 up to infinity.

The code as presented earlier copes with the weirdness fine.

NOTE: I can swap memcmp/strncmp and memcasecmp/strncasecmp functions
interchangeably with the same test results. So I'm fine with using any
of them as the backing scan.

Amos
Received on Mon Apr 14 2014 - 10:49:59 MDT

This archive was generated by hypermail 2.2.0 : Mon Apr 14 2014 - 12:00:12 MDT