Re: [PATCH] HTTP Parser upgrade

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Sat, 04 Jan 2014 01:28:17 +1300

On 3/01/2014 8:08 a.m., Alex Rousskov wrote:
> On 01/02/2014 06:45 AM, Amos Jeffries wrote:
>>>> Http::Http1ParserPointer
>>>
>>> The proposed naming approach results in awkward names with too many
>>> "HTTP"s in them. I suggest a different arrangement:
>>>
>>> * namespace Http // stuff common to HTTP/1 and HTTP/2 handling
>>> * namespace Http::One // stuff specific to HTTP/1 handling
>>> * namespace Http::Two // stuff specific to future HTTP/2 handling
>>> * No Http prefixes for names inside the above three namespaces.
>>>

<snipped most of the discussion to focus on the outstanding points>

>>>
>>> The final sketch, putting all the naming-related pieces together:
>>>
>>>> namespace Http {
>>>> namespace One {
>>>> class Parser ...
>>>> };
>>>> namespace Two {
>>>> class Parser ...
>>>> };
>>>> };
>>>> namespace Http1 = Http::One;
>>>> namespace Http2 = Http::Two;
>>>
>>

Ah, Understood finally. Thank you for your patience and explanation.

>
>> In order to make the parser incremental and not be reset on every input
>> byte (worst case slow-loris attack) the Parser object is allocated
>> dynamically by ConnStateData when a new request is expected.
>
> Sounds like the right design to me. We need that state to keep track of
> offsets and such.
>
>
>>>> - if (*url != '/')
>>>> + if (hp->requestUri()[0] != '/')
>>>> return; /* already in good shape */
>>>
>>> The old code "worked" for empty URLs (""). The new does not. I do not
>>> know whether empty URLs are possible at this point. If they are not
>>> possible, perhaps add a comment or even a Must()?
>>>
>>
>> The parseFirstLine logics ensure there is always a URL. Either by
>> treating the version tag as the URL, or aborting with a parse error.
>>
>> Do you really think we need to add an unused check for length() ?
>
> If you are confident, a "we assume URL is never empty" comment would
> suffice, I guess, but please check that there are no checks for the URL
> being empty then. Personally, I think I would feel more comfortable with
> a Must().
>

Okay. Added a temporary extra check and TODO about the Must when throw
support is ready.

>
>>>> === renamed file 'src/HttpParser.h' => 'src/http/Http1Parser.h'
>>>
>>>> + /// the request-line URI if this is a request, or an empty string.
>>>> + SBuf requestUri() const {return uri_;}
>>>
>>> Consider returning "const SBuf &" to avoid refcounting overheads. Why is
>>
>>
>> const SBuf &uri() const {retun uri_;}
>>
>> const SBuf uri() const {retun uri_;}
>
> only one should be declared (I am surprised the compiler does not catch
> the ambiguity here, but I am not an expert on these matters).
>

No I meant I tried all these combinations of definition for the one
accessor. All of them had the mentioned compiler error(s).

>
>> SBuf &uri() {retun uri_;}
>
> If you are going to expose a writeable uri_, you can just make a public
> "uri" member. No need for any accessors at all!
>
>
>> ==> error: passing âconst SBufâ as âthisâ argument of âconst char*
>> SBuf::c_str()â discards qualifiers
>
> This is not an error caused by the declaration. It is an error caused by
> the calling code not creating an SBuf copy. The caller should do:
>
> SBuf(parser.uri()).c_str()
>
>
>> Whereas return via copy-construct provides a non-const SBuf using the
>> non-allocating SBuf copy against the shared MemBlob.
>
> ... hiding two performance penalties from the coder: refcounting plus
> buffer-altering c_str. Explicit SBuf creation is a pain, but it exposes
> those performance problems. Pick your poison.
>

Okay this does seem to be the least nasty of the bunch. Added an
explicit copy to the caller(s).

>
>>> this method in the general message parser? Should not it (and the
>>> corresponding data member) be placed in HTTP *request* parser?
>>>
>>
>>
>> The request parsing is in the generic HttpParser. The response parsing
>> is in the generic HttpMsg class but the intention is to move both into
>> here and:
>
> I think we should start with Http1::RequestParser and
> Http1::ResponseParser classes. If, during conversion, you find that they
> need the same code, place that shared code in Http1::Parser.
>

Ack. Okay will do this with the namespace shuffling.

>
>> 1) avoid the current design of 2-pass parsing: a) generic-parse just to
>> determine it is a request-line, then b) re-parse with HttpRequest to get
>> the fields already found by HttpParser.
>
> I do not think we need two passes. The caller knows what it is parsing
> and will use the right parser. That is, the caller will do either
>
> Http1::RequestParser parser(...)
> if (parser.parse(...))
>
> or
>
> Http1::ResponseParser parser(...)
> if (parser.parse(...))
>
> The above sketch omits many details, of course; for example, the
> incremental parsers will be created once per message rather than once
> per parse call. However, there is just one pass in the above sketch.
>

"how" is the problem. The first pass I'm talking of is the discovery of
that information used to run the pass these parsers will do. It may or
may not be a complete pass or just several bytes preview-style into the
data. You may recall we had to add a separate layer of memcmp() to
distinguish between HTTP/1 and HTTP/2 alone.

>> 3) avoid having to allocate a HttpRequest/HttpResponse object just to
>> test parsing for invalid request/responses.
>
> I think there are two valid approaches here:
>
> 1) Always create the right empty HttpR* message object. Always create
> the right parser. Give the message object to the parser so that the
> parser can use it for storage during parsing.
>
> 2) Always create the right parser. Allow the parser to create the right
> empty HttpR* message object if/when needed, so that the parser can use
> the message object for storage during parsing.
>
> Approach (1) is probably simpler: No logic to decide when the message
> object should be created. No code to check whether the message object
> has been created already. It can be modified so that the parser creates
> the right message object unconditionally, to simplify the caller's code.
>
> Approach (2) is more efficient if and only if there are a lot of cases
> where the parser determines that no message object is needed to handle a
> parse() call. Those are the cases where message object creation will be
> delayed or even avoided.
>
> I suspect that there are not enough such cases to warrant the complexity
> of (2), but you probably know this area better than I do.
>

We have enough abort cases (Request parser generates Response object) to
make me pick (2) ealier as the end goal. Avoids replicating validation
outside the parser which has already been done within and has less info
outside.

Amos
Received on Fri Jan 03 2014 - 12:28:26 MST

This archive was generated by hypermail 2.2.0 : Fri Jan 03 2014 - 12:00:10 MST