Re: Ip::Address performance from Amos Jeffries on 2010-11-19 (squid-dev)

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Sat, 20 Nov 2010 13:36:25 +1300

On 20/11/10 06:31, Alex Rousskov wrote:
> On 11/19/2010 12:06 AM, Amos Jeffries wrote:
>> On 19/11/10 09:27, Alex Rousskov wrote:
>>> Hello,
>>>
>>> This email summarizes my questions and suggestions related to optimizing
>>> Ip::Address implementation.
>>>
>>> Disclaimer: I am writing from general performance and API principles
>>> point of view. I am not an IP expert and may miss some critical
>>> portability concerns, among other things. I am counting on Amos and
>>> others to audit any ideas worth considering.
>>>
>>>
>>> Ip::Address is actually a socket address. Besides the IP address itself,
>>> the class maintains the socket address family and port (at least).
>>>
>>> Currently, Ip::Address is designed around a single sockaddr_in6 data
>>> member that holds the address family, socket port, the IP address
>>> itself, and some extra info. This is 28 bytes of storage, most of which
>>> are used in case of an IPv6 address.
>>>
>>> If we were to deal with IPv4 addresses only, the corresponding "native"
>>> IPv4 sockaddr_in structure takes 16 bytes, and only the first 8 of which
>>> are normally used.
>>>
>>> Currently, Ip::Address implementation strategy can be summarized as
>>> "convert any input into a full-blown IPv6 address as needed, store that
>>> IPv6 address, and then convert that IPv6 address into whatever the code
>>> actually needs". This is a simple, solid approach that probably helped
>>> eliminate many early IPv6 implementation bugs.
>>>
>>> Unfortunately, it is rather expensive performance-wise because the input
>>> is usually not an IPv6 address and the output is usually not an IPv6
>>> address either. While each single conversion (with the exception of
>>> conversion from and to textual representation) is relatively cheap,
>>> these conversions add up. Besides, the conversions themselves and even
>>> the "what am I really?" tests are often written rather inefficiently.
>>>
>>> For example, to go from an IPv4 socket address to an a.b.c.d textual
>>> representation of that address, the current code may allocate,
>>> initialize, use, and discard several sockaddr_in6 structures, do a dozen
>>> of sockaddr_in6 structure scans for certain values, and copy parts of
>>> sockaddr_in and sockaddr_in6 structures a few times a long the way to
>>> the ntoa() call. After the ntoa() call, scan the resulting string to
>>> find its end.
>>>
>>> The old code equivalent of the above? Initialize one IPv4 sockaddr_in
>>> structure on stack and pass it to ntoa()!
>>>
>>>
>>> I see three ways to optimize this without breaking correctness:
>>>
>>> * one-struct: Keep using a single sockaddr_in6 data member and convert
>>> everything into an IPv6 socket address as we do now. Optimize each
>>> conversion step, remove repeated "who am I?" checks during each step,
>>> and optimize each check itself.
>>
>> All of the storage methods require them to identify the type accessed.
>> The only way to completely avoid them is to revert to v4-only or v6-only
>> compile modes.
>>
>> Yes they seriously need some optimization and reduction.
>
> Yes, some checks are unavoidable, of course. What I was referring to is
> replacing the current
>
> if (I am vX) ... do a little ...
> if (I am vX) ... do a little ...
> if (I am vX) ... do a little ...
> if (I am vX) ... do a little ...
>
> code with something closer to this:
>
> if (I am vX) ... do a lot ...
>
> to the extent possible.

Agree. The NP hard problem though is that "a lot of work" essentially
covers vast swathes of client-side or server-side. Care to duplicate
large sections of that with specific function paramaters? I think not.

The "do a little" are in most cases debugging NtoA() as far as I'm aware.

Note that using the union design and sockaddr family field drops the
v4-mapped type being stored. They MUST be converted to native v4 at
point of storage to avoid a huge amount of pain.

The side effect of this is:
* the memcmp() for ::ffff: is dropped from all the Is*() tests.

* ANY_ADDR for both IPv4 and IPv6 is an identical bitmap in the
address portion. We can store as v4 and use the smaller 32-bit IPv4 test
for both. Whatever gets used for the human-display string.

* same for NO_ADDR.

>
>>> * union: Replace sockaddr_in6 storage with a union of sockaddr_in and
>>> sockaddr_in6 structures (or their wrappers). The Ip::Address class will
>>> check the address family part (that must overlap in all socket
>>> addresses, by design) to know what the address is and then use the right
>>> structure. This will avoid most conversions for the currently dominating
>>> IPv4-to-IPv4 path. If the caller does need a different version of the
>>> address than the one we stored, the "up" or "down" conversion is
>>> unavoidable and is still be handled by the Ip::Address.
>>
>> Otherwise known as the sockaddr_storage.
>> http://www.retran.com/beej/sockaddr_inman.html
>
> Good point. I thought sockaddr_storage is deprecated or unpopular since
> Squid code did not use it :-/. Apparently, that is not the case.
>

It's of the third-generation as I see it. Came it at the same time as
addrinfo which is best used as a linked-list of sockaddr_storage.

> Sockaddr_storage is the right abstraction for this option but it comes
> with a performance penalty:
>
> sizeof(sockaddr_storage) == 128
> sizeof(sockaddr_in6) == 28
> sizeof(sockaddr_in) == 16
>
> I do not care much about pure RAM overhead those 100 extra bytes bring,
> but initializing and copying them is one of those thousand cuts, IMO. We
> may be able to minimize initialization and copying overhead if we use
> custom constructors and assignment operators that only initialize and
> copy what is actually used. We will try to test that theory.

I don't think we need to initialize more than 16 or 32 bits of the
structure. The current code is doing the lot just to be safe for testing
and debugging while the family field is not maintained properly.

The only bit to initialize is:
  * for a nil object - memset the sockaddr (16 or 32 bits).
  * For IPv4 - memcpy/memmove the sockaddr_in space - 16 bytes
  * for IPv6 - memcpy/memmove ahe sockaddr_in6 space - 28 bytes
  * for pipes - memcpy/memmove the sockaddr_un space - 110 bytes

The sockaddr union member needs initializing with the parent object to nil.
The other sockaddr_* only need initializing when set, in which case they
are usually best done with memcpy/memmove from the source data.

>
> I also wanted to wrap version-specific sockaddr_in* union members into
> C++ classes that provide version-specific and version-optimized handling
> of their sockaddr_in* structure. This will, in part, minimize repeated
> "what am I?" checks. It may be possible to do that with a single
> sockaddr_storage member, but it would require placement-new which may
> have performance overheads. We will test that as well.

The problem here is that most of our code is family agnostic by design
and intention. The parts which are specific have to do the type checks
anyway to cast/convert to the right type. Having explicit checks allows
us to produce sane errors or do a (slow) conversion when for example
trying to connect to a WCCP router at [::1].

>
> If placement-new has overheads, we can have a union of three structs:
> sockaddr_storage, sockaddr_in, sockaddr_in6.
>
> There is another reason we may have to deal with large sockaddr
> structures as discussed below.
>
>
>> FWIW: To convert to this the stages are:
>> * change the type to sockaddr_storage
>> * re-add maintenance stages to keep the sa_family field accurate (these
>> were originally intended but dropped to avoid bugs caused by and hidden
>> by premature-optimization).
>> * add casts as required to make the correct sockaddr_* type fields
>> visible.
>>
>> This makes the IsIP*() tests faster single-byte comparisons and the
>> IPv4() output faster.
>
> Looking at the current code, I am not sure single-byte family
> comparisons will be always possible. This is one of the secondary
> questions I had. Currently, we check for "any address", "no address",
> and "mapped address" to answer the "What kind of address am I?" question.
>
> * Can we use zero family for "any address" and FF family for "no
> address"? We can add a data member to store this info if using family is
> wrong, but I hope it is not necessary.

We can't. The family needs to be known. What we can do is make our
ANY_ADDR/NO_ADDR the cheapest to test, right now that is IPv4 32-bits.

>
> * Do we need to check for mapped addresses? Or do we change the code to
> convert as soon as we are given a mapped address, to avoid mapping
> checks? Or do we add a data member to cache the mapping information for
> faster checks?

With the union design no. They die. We may need to convert to mapped for
some things but that should be the minority if any.

>
>>> * two-struct: Similar to union, but keeping two separate structures, not
>>> merged in a union. This costs more memory and whole-object copying
>>> cycles, but allows us to keep both native and converted versions of the
>>> addresses. I do not know whether that is useful.
>>
>> IMO: two-struct is just a waste of memory. We would have to do the same
>> or less maintenance for other methods. We have no situations where the
>> one object has to hold both types.
>>
>>
>> For the record there is a fourth option that could be considered:
>>
>> * addrinfo: This is a descriptor to a dynamically allocated sockaddr_in
>> or sockaddri_in6 with all the what-do-i-hold? details we would need to
>> maintain for a union or two-struct designs anyway.
>>
>> Overall I think this is slightly worse than union though. The size of
>> addrinfo in addition to the sockaddr it points to has little gain over
>> two-struct. With extra maintenance to keep the addrinfo type info
>> accurate in duplicate to the start of sockaddr.
>
> IMO, dynamic allocation and deallocation of addrinfo members is one of
> those thousand cuts. I may be wrong, but I think the current code
> misuses addrinfo and getaddrinfo() function, with measurable performance
> penalties. I hope we can restrict if not eliminate their use.
>

Yes.

>
>>> The "one struct" approach is simpler to implement given the current code
>>> state, but the "union" approach should be faster as long as IPv4-to-IPv4
>>> paths are common. The union approach may also yield overall simpler code
>>> as there will be very few "what am I know?" checks and concerns. The
>>> two-struct approach attempts to minimize back-and-forth conversion costs
>>> by giving access to both v4 and v6 versions at the same time.
>>>
>>> FWIW, on a conceptual level, boost.asio library uses a union approach
>>> (only one address version is valid per Address object) while their
>>> implementation uses the two-struct approach (there is a place to store
>>> both IPv4 and IPv6 versions). This does not surprise me much because
>>> this Boost library does not seem to care about RAM/copying overheads.
>>>
>>> Interestingly enough, boost.asio does not support transparent v4-v6
>>> conversions (AFAICT): To do a conversion, the calling code needs to know
>>> what IP version the generic address is storing, get that
>>> version-specific address, and then ask it to convert to another version.
>>> Perhaps that means that nobody needs a general "give me IPv6 regardless
>>> of what you really have" API?
>>> https://svn.boost.org/trac/boost/browser/trunk/boost/asio/ip/address.hpp
>>>
>>>
>>> There are some secondary optimizations and API changes that I would like
>>> to discuss, but let's first see if there is a consensus which overall
>>> design approach is the way to go. There may be more options than the
>>> three outlined above.
>>>
>>
>> Phew, I was worried you had uncovered some major fatal flaw in the
>> design. So it seems just the expected culprits which were not optimized
>> for testing and development simplicity are the ones to die.
>
> I am glad you do not consider the implied changes significant, although
> I am afraid you underestimate the total amount of the work required to
> optimize the implementation and polish the API. :-)

Oh, it is a lot of work. What I'm relieved about is that you did not
find anything unknown on top of all this work.

>
>> I think we go with union. The sockaddr_storage type and its usage is
>> relatively standard and can be passed directly to the stack, or on some
>> API via an addrinfo wrapper pointing at it.
>
> I too favor the union or sockaddr_storage approach. Let's wait for
> others to chime in though as this is an important decision.
>

Check.

>
>> PS: If you are still wanting to unify the sockaddr_ip* with sockaddr_un
>> addresses we can do it easily with union {sockaddr_storage + padding;
>> sockaddr_un;} and keep all the existing v4/v6 code unchanged due to the
>> way it uses the initial sockaddr* common fields.
>
> This is a very good question; thank you for reminding me of Unix Domain
> Socket needs!
>
> We can easily add sockaddr_un to the proposed union. No extra
> padding/etc is necessary as far as I can tell. However, sockaddr_un is
> 110 bytes long :-(. Is that kind of overhead worth the cleaner code?
> Perhaps micro-level performance tests of optimized
> constructors/assignments will help answer that question.
>
> If we do end up adding sockaddr_un to the union, then we may want to add
> sockaddr_storage as well because it is just a few bytes longer (or just
> use sockaddr_storage alone).

For union we require either sockaddr_storage or sockaddr types as part
anyway to identify the content safely.

Amos

-- 
Please be using
   Current Stable Squid 2.7.STABLE9 or 3.1.9
   Beta testers wanted for 3.2.0.3

Received on Sat Nov 20 2010 - 00:36:32 MST

This archive was generated by hypermail 2.2.0 : Sat Nov 20 2010 - 12:00:05 MST