Re: parsing quoted-string HTTP header fields

From: Tsantilas Christos <chtsanti_at_users.sourceforge.net>
Date: Thu, 26 May 2011 19:23:37 +0300

On 05/26/2011 06:01 PM, Alex Rousskov wrote:
> On 05/26/2011 06:14 AM, Amos Jeffries wrote:
>> On 26/05/11 19:02, Tsantilas Christos wrote:
>>> On 05/26/2011 07:42 AM, Amos Jeffries wrote:
>>>> On Wed, 25 May 2011 20:07:13 +0300, Tsantilas Christos wrote:
>>>>> The second patch does not look slower, but maybe we do not want to be
>>>>> strict for many reasons. One reason is that this function used to
>>>>> "Configurable SSL error details messages" patch, to parse the "descr:"
>>>>> and "detail:" fields :-)
>>>>> The other reason is that it is usual to find web clients or servers
>>>>> which uses a simple "\n" instead of the "\r\n"...
>
>
>>>>> On 05/25/2011 02:51 PM, Amos Jeffries wrote:
>>>>>> On 25/05/11 22:44, Tsantilas Christos wrote:
>>>>>>> The quoted string fields parsing done in squid3 using the
>>>>>>> httpHeaderParseQuotedString function (HttpHeaderTools.cc file)
>>>>>>>
>>>>>>> I found that we do not support multiline quoted string fields, but
>>>>>>> according the rfc2616 multiline quoted string fields should
>>>>>>> supported:
>>>>>>>
>>>>>>> quoted-string = (<"> *(qdtext | quoted-pair )<"> )
>>>>>>> qdtext =<any TEXT except<">>
>>>>>>> TEXT =<any OCTET except CTLs,
>>>>>>> but including LWS>
>>>>>>> LWS = [CRLF] 1*( SP | HT
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I am planning to fix the httpHeaderParseQuotedString function, using
>>>>>>> one
>>>>>>> of the following rules:
>>>>>>>
>>>>>>> 1) Just ignore any "\r" or "\n" character. This is the fastest and
>>>>>>> simpler approach
>>>>>>> 3) Require "\r\n " or "\r\n\t" as line separator and just remove the
>>>>>>> "\r\n"
>>>>>>> 3) Require "\r\n " or "\r\n\t" as line separator and replace it
>>>>>>> with a
>>>>>>> space
>>>>>>> 4) Also support "\n\t" or "\n\n\t" as line separators.
>>>>>>
>>>>>> (4) case #2 "\n\n\t" is also non-compliant.
>>>>>>
>>>>>>>
>>>>>>> Any comments or suggestions?
>>>>>>
>>>>>> RFC states "A recipient MAY replace any linear white space with a
>>>>>> single
>>>>>> SP".
>>>>>>
>>>>>> So (1) and the second (3) {replace with 0x20} are the valid options.
>>>>>>
>>>>>> It may be worth having two quoted-string parsers. One for each option.
>>>>>> The preference being (1) for its higher performance.
>>>>>>
>>>>>> Amos
>>
>>
>> Ah, okay. I forgot the quoted string un-escapes as it goes. So there is
>> no extra penalty from (3). In which case I prefer (3) over (1), it has a
>> lot more error-checking for free.
>>
>>
>> Patch for case (3) looks great and is RFC compliant. Do we want to be
>> lenient and allow the absence of \r ?
>
> Sorry, I really do not like both the policing intent of #3 and #3
> implementation.
>
> Policing intent: RFC 2616 recommends that we ignore [the lack of] CR
> when interpreting CRLF in HTTP headers:
>
>> The line terminator for message-header fields is the sequence CRLF.
>> However, we recommend that applications, when parsing such headers,
>> recognize a single LF as a line terminator and ignore the leading CR.
>
> Granted, the above can be strictly interpreted to apply to the final
> header terminator and not LWS, but I think it is reasonable to extend
> the above logic to header continuation (and to LWS in general).
>
>
> Implementation:
>
>> while (*pos != '"'&& len> (pos-start)) {
>> +
>> + if (*pos =='\r') {
>> + pos ++;
>> + if (*(pos++) != '\n' || (*pos != ' '&& *pos != '\t')) {
>
> Can the above double increment lead to *pos pointing beyond the string
> boundaries?

Yes it can, but we do not care because we do not modifying anything we
only have a double increment (not a "while(..) pos++") and after 4-5
lines there is the check:
if (!*pos || (pos-start) > len) {
...//error
}

>
> Will the above incorrectly accept CR x HT sequence, where "x" is any
> character other than LF?
I think no....

>
> Finally, if we are going to replace LWS with a single SP, we should
> replace the whole LWS and not just the initial CRLF (SP|HT) prefix.

OK but it is faster and safer to not (and does not require a lot of code).
But we can add a small block of code "while((*pos == ' ' || *pos ==
'\t') && pos-start < len) pos++".

>
> Overall, I think we should go for #1 for now because it is likely to be
> more compatible with real traffic and to have fewer bugs.
OK, This is the easiest.
>
> Eventually, we should probably do something close to #3 but without bugs
> and without rejecting the whole header when a field has a quoted string
> with malformed LWS. We should be careful about enabling message
> smuggling attacks when policing messages so it may be tricky to get this
> right for all cases.

Also the question here is if the "\r\n\t" should replaced only for
quoted header values, or for all headers. In the second case we should
move this code in HttpHeader::parse....

>
>
>> + bool parse_error = (*end<= 0x1F&& *end != '\r'&& *end != '\n') || *end == 0x7F;
>
> This variable can be declared const.
>
> I do not fully understand what the above code and the while loop above
> is really doing. If you know what it does, please add a comment.

Well yes. Just "parse" the string until an escaped char, a CTL char, or
the end of string found.

>
>
> Thank you,
>
> Alex.
>
Received on Thu May 26 2011 - 16:23:46 MDT

This archive was generated by hypermail 2.2.0 : Fri May 27 2011 - 12:00:05 MDT