Re: parsing quoted-string HTTP header fields

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Thu, 26 May 2011 09:01:24 -0600

On 05/26/2011 06:14 AM, Amos Jeffries wrote:
> On 26/05/11 19:02, Tsantilas Christos wrote:
>> On 05/26/2011 07:42 AM, Amos Jeffries wrote:
>>> On Wed, 25 May 2011 20:07:13 +0300, Tsantilas Christos wrote:
>>>> The second patch does not look slower, but maybe we do not want to be
>>>> strict for many reasons. One reason is that this function used to
>>>> "Configurable SSL error details messages" patch, to parse the "descr:"
>>>> and "detail:" fields :-)
>>>> The other reason is that it is usual to find web clients or servers
>>>> which uses a simple "\n" instead of the "\r\n"...

>>>> On 05/25/2011 02:51 PM, Amos Jeffries wrote:
>>>>> On 25/05/11 22:44, Tsantilas Christos wrote:
>>>>>> The quoted string fields parsing done in squid3 using the
>>>>>> httpHeaderParseQuotedString function (HttpHeaderTools.cc file)
>>>>>>
>>>>>> I found that we do not support multiline quoted string fields, but
>>>>>> according the rfc2616 multiline quoted string fields should
>>>>>> supported:
>>>>>>
>>>>>> quoted-string = ( <"> *(qdtext | quoted-pair ) <"> )
>>>>>> qdtext = <any TEXT except <">>
>>>>>> TEXT = <any OCTET except CTLs,
>>>>>> but including LWS>
>>>>>> LWS = [CRLF] 1*( SP | HT
>>>>>>
>>>>>>
>>>>>>
>>>>>> I am planning to fix the httpHeaderParseQuotedString function, using
>>>>>> one
>>>>>> of the following rules:
>>>>>>
>>>>>> 1) Just ignore any "\r" or "\n" character. This is the fastest and
>>>>>> simpler approach
>>>>>> 3) Require "\r\n " or "\r\n\t" as line separator and just remove the
>>>>>> "\r\n"
>>>>>> 3) Require "\r\n " or "\r\n\t" as line separator and replace it
>>>>>> with a
>>>>>> space
>>>>>> 4) Also support "\n\t" or "\n\n\t" as line separators.
>>>>>
>>>>> (4) case #2 "\n\n\t" is also non-compliant.
>>>>>
>>>>>>
>>>>>> Any comments or suggestions?
>>>>>
>>>>> RFC states "A recipient MAY replace any linear white space with a
>>>>> single
>>>>> SP".
>>>>>
>>>>> So (1) and the second (3) {replace with 0x20} are the valid options.
>>>>>
>>>>> It may be worth having two quoted-string parsers. One for each option.
>>>>> The preference being (1) for its higher performance.
>>>>>
>>>>> Amos
>
>
> Ah, okay. I forgot the quoted string un-escapes as it goes. So there is
> no extra penalty from (3). In which case I prefer (3) over (1), it has a
> lot more error-checking for free.
>
>
> Patch for case (3) looks great and is RFC compliant. Do we want to be
> lenient and allow the absence of \r ?

Sorry, I really do not like both the policing intent of #3 and #3
implementation.

Policing intent: RFC 2616 recommends that we ignore [the lack of] CR
when interpreting CRLF in HTTP headers:

> The line terminator for message-header fields is the sequence CRLF.
> However, we recommend that applications, when parsing such headers,
> recognize a single LF as a line terminator and ignore the leading CR.

Granted, the above can be strictly interpreted to apply to the final
header terminator and not LWS, but I think it is reasonable to extend
the above logic to header continuation (and to LWS in general).

Implementation:

> while (*pos != '"' && len > (pos-start)) {
> +
> + if (*pos =='\r') {
> + pos ++;
> + if (*(pos++) != '\n' || (*pos != ' ' && *pos != '\t')) {

Can the above double increment lead to *pos pointing beyond the string
boundaries?

Will the above incorrectly accept CR x HT sequence, where "x" is any
character other than LF?

Finally, if we are going to replace LWS with a single SP, we should
replace the whole LWS and not just the initial CRLF (SP|HT) prefix.

Overall, I think we should go for #1 for now because it is likely to be
more compatible with real traffic and to have fewer bugs.

Eventually, we should probably do something close to #3 but without bugs
and without rejecting the whole header when a field has a quoted string
with malformed LWS. We should be careful about enabling message
smuggling attacks when policing messages so it may be tricky to get this
right for all cases.

> + bool parse_error = (*end <= 0x1F && *end != '\r' && *end != '\n') || *end == 0x7F;

This variable can be declared const.

I do not fully understand what the above code and the while loop above
is really doing. If you know what it does, please add a comment.

Thank you,

Alex.
Received on Thu May 26 2011 - 15:01:57 MDT

This archive was generated by hypermail 2.2.0 : Thu May 26 2011 - 12:00:06 MDT