Re: [PATCH] HTTP Parser upgrade

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Thu, 02 Jan 2014 12:08:34 -0700

On 01/02/2014 06:45 AM, Amos Jeffries wrote:
>>> Http::Http1ParserPointer
>>
>> The proposed naming approach results in awkward names with too many
>> "HTTP"s in them. I suggest a different arrangement:
>>
>> * namespace Http // stuff common to HTTP/1 and HTTP/2 handling
>> * namespace Http::One // stuff specific to HTTP/1 handling
>> * namespace Http::Two // stuff specific to future HTTP/2 handling
>> * No Http prefixes for names inside the above three namespaces.
>>
>> This alternative approach would result in
>>
>> Http::One::Parser
>> Http::Two::ParserPointer
>>
>> instead of
>>
>> Http::Http1Parser
>> Http::Http2ParserPointer
>>
>
> I was wondering about Parser1 / Parser1Pointer , Parser2 /
> Parser2Pointer as an alternative?

I think namespaces is a much cleaner solution, probably because I
disagree with your next statement about this problem being limited to
just four types.

> Only the parser class seems to be needing the version difference.
> Everything else seems to be generic or relevant only to one of the versions.

Even in your patch scope (or very close to it), there should several
parsers for HTTP/1 (message, request, response). Not sure about HTTP/2,
but even if it does not need directional parsers, that is already at
least seven types. We can manage by adding 1s and 2s, but there seems to
be enough type names to warrant a namespace.

And in the future, I would be surprised if no types other than parsers
and their pointers will implement similar concepts for both HTTP
versions, and implement them differently, raising the number of classes
needing version distinction.

Furthermore, even when a class exists only for one HTTP version, it is
often handy to know which version it serves! Namespaces give us that
knowledge. For example, I am pretty sure that by the time you are done,
we will have request line parser, header parser, and possibly other
parsing-related types in HTTP/1 namespace. It would be nice to make it
easy to easily differentiate those classes from HTTP/2 code, even if
HTTP/2 code will not have similar class names.

>> Same length, but better looking names without repetition of information.
>> Moreover, after adding these global namespace aliases (to http/forward.h):
>>
>> namespace Http1 = Http::One;
>> namespace Http2 = Http::Two;
>>
>> we can use
>>
>> Http1::Parser
>> Http2::ParserPointer
>>
>> which are even shorter that the proposed names.
>>
>> Note that declaring Http1 and Http2 namespaces outside of the Http
>> namespace (rather than namespace aliases) is probably a bad idea because
>>
>> * it will force us to use Http:: prefixes inside Http1 and Http2
>> namespaces, which is kind of wrong, OR
>>
>> * force us to add blank "use Http" statements, which opens another can
>> of warms, especially if used in header files.
>>
>
> Um..
>
>>
>> The final sketch, putting all the naming-related pieces together:
>>
>>> namespace Http {
>>> namespace One {
>>> class Parser ...
>>> };
>>> namespace Two {
>>> class Parser ...
>>> };
>>> };
>>> namespace Http1 = Http::One;
>>> namespace Http2 = Http::Two;
>>
>
> You just said that was a bad idea?

No. I tried to say that the following simpler "HttpN outside Http"
design is probably a bad idea:

> namespace Http {
> ...
> }
> namespace Http1 {
> class Parser ...
> }
> namespace Http2 {
> class Parser ...
> }

for the reasons itemized above.

> In order to make the parser incremental and not be reset on every input
> byte (worst case slow-loris attack) the Parser object is allocated
> dynamically by ConnStateData when a new request is expected.

Sounds like the right design to me. We need that state to keep track of
offsets and such.

>>> - if (*url != '/')
>>> + if (hp->requestUri()[0] != '/')
>>> return; /* already in good shape */
>>
>> The old code "worked" for empty URLs (""). The new does not. I do not
>> know whether empty URLs are possible at this point. If they are not
>> possible, perhaps add a comment or even a Must()?
>>
>
> The parseFirstLine logics ensure there is always a URL. Either by
> treating the version tag as the URL, or aborting with a parse error.
>
> Do you really think we need to add an unused check for length() ?

If you are confident, a "we assume URL is never empty" comment would
suffice, I guess, but please check that there are no checks for the URL
being empty then. Personally, I think I would feel more comfortable with
a Must().

>>> === renamed file 'src/HttpParser.h' => 'src/http/Http1Parser.h'
>>
>>> + /// the request-line URI if this is a request, or an empty string.
>>> + SBuf requestUri() const {return uri_;}
>>
>> Consider returning "const SBuf &" to avoid refcounting overheads. Why is
>
>
> const SBuf &uri() const {retun uri_;}
>
> const SBuf uri() const {retun uri_;}

only one should be declared (I am surprised the compiler does not catch
the ambiguity here, but I am not an expert on these matters).

> SBuf &uri() {retun uri_;}

If you are going to expose a writeable uri_, you can just make a public
"uri" member. No need for any accessors at all!

> ==> error: passing âconst SBufâ as âthisâ argument of âconst char*
> SBuf::c_str()â discards qualifiers

This is not an error caused by the declaration. It is an error caused by
the calling code not creating an SBuf copy. The caller should do:

  SBuf(parser.uri()).c_str()

> Whereas return via copy-construct provides a non-const SBuf using the
> non-allocating SBuf copy against the shared MemBlob.

... hiding two performance penalties from the coder: refcounting plus
buffer-altering c_str. Explicit SBuf creation is a pain, but it exposes
those performance problems. Pick your poison.

>> this method in the general message parser? Should not it (and the
>> corresponding data member) be placed in HTTP *request* parser?
>>
>
>
> The request parsing is in the generic HttpParser. The response parsing
> is in the generic HttpMsg class but the intention is to move both into
> here and:

I think we should start with Http1::RequestParser and
Http1::ResponseParser classes. If, during conversion, you find that they
need the same code, place that shared code in Http1::Parser.

> 1) avoid the current design of 2-pass parsing: a) generic-parse just to
> determine it is a request-line, then b) re-parse with HttpRequest to get
> the fields already found by HttpParser.

I do not think we need two passes. The caller knows what it is parsing
and will use the right parser. That is, the caller will do either

  Http1::RequestParser parser(...)
  if (parser.parse(...))

or

  Http1::ResponseParser parser(...)
  if (parser.parse(...))

The above sketch omits many details, of course; for example, the
incremental parsers will be created once per message rather than once
per parse call. However, there is just one pass in the above sketch.

> 2) avoid adding parser overheads to the HttpRequest and HttpResponse
> classes.

Agreed, but unrelated to the issue being discussed IMO. Once you are
done, those two classes should know nothing about parsing.

> 3) avoid having to allocate a HttpRequest/HttpResponse object just to
> test parsing for invalid request/responses.

I think there are two valid approaches here:

1) Always create the right empty HttpR* message object. Always create
the right parser. Give the message object to the parser so that the
parser can use it for storage during parsing.

2) Always create the right parser. Allow the parser to create the right
empty HttpR* message object if/when needed, so that the parser can use
the message object for storage during parsing.

Approach (1) is probably simpler: No logic to decide when the message
object should be created. No code to check whether the message object
has been created already. It can be modified so that the parser creates
the right message object unconditionally, to simplify the caller's code.

Approach (2) is more efficient if and only if there are a lot of cases
where the parser determines that no message object is needed to handle a
parse() call. Those are the cases where message object creation will be
delayed or even avoided.

I suspect that there are not enough such cases to warrant the complexity
of (2), but you probably know this area better than I do.

> 4) make HttpRequest and HttpResponse generic enough for use in both
> protocols and be spawned by the parser(s).

Agreed, but unrelated to the issue being discussed IMO.

> This accessor is the one to retrieve the (1a) request-line data field
> for use in (1b) creating the HttpRequest object without re-parsing. It
> can also be used anywhere the raw-URI is needed before HttpRequest
> exists - which is what you see in the current patch.
>
> (Same goes for the method accessor.)

Sure, but since responses do not have methods and URIs, the response
parser does not need them.

> We also need to work around two other traffic flow properties:
>
> * in HTTP/2 the response parse needs to handle PUSH_PROMISE "requests",
> so the API separation becomes more con than pro.

I do not know much about HTTP/2, but either there is a single
structure/parser for both requests and responses, or there are two (or
more) structures/parsers. The design I propose accommodates both cases
AFAICT. In the former case, you need a single parser. In the latter
case, you need multiple parsers (which may use each other if needed).

For HTTP/1, we know that we need two different parsers because the
message structures depend on the direction.

> * request parsing may result in a response error object being
> instantiated. And thus a status code results from the request-line.

Unrelated to the issue being discussed IMO: If you want a parser to
create an error response object, it can do that, of course. If you want
a parser to store information necessary to create an error response
object, it can do that instead. In either case, the request and response
parsers remain different.

> I figure its better at this stage to have a set of accessors for each of
> the possible fields (request-line, status-line and mime-header areas)
> and two parse methods to be called as appropriate from the client-side
> or server-side as to which message type is to be interpreted.

I do not see the benefits behind such an approach. If you are creating
new parser classes, I think we should do it the right way from the start.

>>> + /// the HTTP method if this is a request method
>>> + const HttpRequestMethodPointer & method() const {return method_;}

>> Finally, either this method should return a method object (rather than a
>> pointer to it) OR we should get rid of METHOD_NONE. Having two ways to
>> signal a missing method is too much IMO. The former is probably easier
>> and requires fewer code changes.
>
> Tryimg. I think the Pointer was there to avoid data copies. The method
> class is relatively heavy to copy/assign with its String internals.

If we must use a pointer for optimization reasons, then let's make sure
it is never nil (and points to METHOD_NONE object when the method is
missing). This will remove all nil-checking code. You can even make that
METHOD_NONE object a constant that you can reuse without [re]creating it
all the time!

Thank you,

Alex.
Received on Thu Jan 02 2014 - 19:08:40 MST

This archive was generated by hypermail 2.2.0 : Fri Jan 03 2014 - 12:00:10 MST