Re: benchmarking squid on solaris/x86

From: Henrik Nordstrom <hno@dont-contact.us>
Date: Thu, 21 Mar 2002 14:04:23 +0100

On Thursday, 21 March 2002 12:08, Andres Kroonmaa wrote:

> Not only. A socket in TIME_WAIT keeps its ephemeral port from
> being reused (and the related file FD is not fully released imho,
> but I may be wrong here), and this TW table must be scanned every
> time the socket() call is used.

Right.

> Every socket that is closed goes into TIME_WAIT. The Solaris default
> is 240 seconds (as per rfc1122, 4.2.2.13). Way too high.
> Under high loads, like 200 tps with each connection lasting <0.5 secs,
> you flood all your ephemeral ports into TIME_WAIT if you don't lower
> the timer. The time to find a free ephemeral port is what hits
> performance, not so much the background tasks.

Need to look into how Linux finds a free port. See below.
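
As a side note, this kind of port exhaustion is easy to observe from the
application side. Below is a minimal test sketch (my own illustration, not
squid code; 127.0.0.1:80 is only a placeholder destination): it opens and
closes connections in a tight loop, each close leaving a TIME_WAIT entry that
pins one ephemeral port, until connect() starts failing with EADDRNOTAVAIL.

    /* Sketch only, not squid code: open and close connections to one
     * destination in a tight loop.  Each close() leaves the socket in
     * TIME_WAIT, pinning one ephemeral port, until connect() fails
     * with EADDRNOTAVAIL.  127.0.0.1:80 is just a placeholder target. */
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
        struct sockaddr_in dst;
        long count = 0;

        memset(&dst, 0, sizeof(dst));
        dst.sin_family = AF_INET;
        dst.sin_port = htons(80);
        dst.sin_addr.s_addr = inet_addr("127.0.0.1");

        for (;;) {
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            if (fd < 0) {
                perror("socket");
                return 1;
            }
            if (connect(fd, (struct sockaddr *) &dst, sizeof(dst)) < 0) {
                if (errno == EADDRNOTAVAIL)
                    printf("ephemeral ports exhausted after %ld connections\n",
                           count);
                else
                    perror("connect");
                close(fd);
                return 1;
            }
            close(fd);      /* active close: this end goes to TIME_WAIT */
            count++;
        }
    }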

> 2^5 is negligible load. Given the mean session time and the TIME_WAIT
> timer ratio, you can estimate the typical number of sockets in
> TIME_WAIT state based on the number of open sockets. E.g.
> given a 10-second average session time and a 60-sec timer, you'd
> expect to have 6x as many sockets in TIME_WAIT as you have
> open TCP sessions. If you do 200 tps, this ratio will be
> way higher. On Solaris, you start to worry only if the number
> of sockets in TIME_WAIT is in the thousands.

When to worry also depends on the type of application. A server generally
does not have to worry unless there is a risk of running out of memory, but a
client (and a proxy acts as one) has to worry much earlier due to the limited
number of ephemeral ports available.
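
To put rough numbers on the 200 tps example above: with the Solaris default
TIME_WAIT of 240 seconds that works out to about 200 * 240 = 48000 sockets
sitting in TIME_WAIT at steady state, more than the roughly 32000 anonymous
ports (32768 and up) Solaris uses by default; with the timer lowered to 60
seconds the same load leaves about 12000, which is survivable.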

> There are valid tricks to reuse sockets in the TW state, but
> afaik this depends on the number of IP peers that communicate.
> Thus a test between a few client hosts and servers is way
> different from a test between zillions of clients/servers.

Linux had a hack to reuse sockets in the TW state under certain conditions,
but this is currently disabled as some problems were found. It can still be
enabled by the tcp_tw_recycle flag, but that is not recommended.

> But the tcp hash table tunable of the kernel can have a
> direct relation to the matter; in fact Sun is said to up
> it to 256K during web performance tests, so my suggestion
> of 8K was very conservative.

Linux uses two hash tables for this purpose:

a) TCP connection hash table, hashed on the full address of the socket. Used
when processing packets.

b) Bind hash table, hashed on the port number only. Used when assigning port
numbers.

The size of both is tuned at boot time based on the amount of physical memory.
There does not seem to be a tunable that can affect the size of these. Their
sizes are printed at bootup:

TCP: Hash tables configured (established 16384 bind 16384)

These values are from a 256MB machine; larger machines scale linearly. The
count is in number of hash buckets, and the bind hash size has an upper cap of
65536 for obvious reasons.
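
By that scaling a 512MB machine gets 32768 buckets in each table, and at 1GB
the bind hash already hits its 65536 cap.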

And there is a simple algorithm determining the next port to use, attempting
to lessen the risk of lookup collisions.
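
To make the idea concrete, here is a toy model of such a port chooser (my own
simplification for illustration, not the actual kernel code; the range, hash
and data structures are made up): keep a rotating counter over the local port
range and, for each candidate, walk only that port's bind-hash bucket to see
whether it is already taken.

    /* Toy model of ephemeral port selection against a bind hash table.
     * Illustration only; range, hash and structures are made up, not
     * the kernel's. */
    #include <stdio.h>

    #define PORT_LOW   1024
    #define PORT_HIGH  4999
    #define HASH_SIZE  16384              /* number of bind-hash buckets */

    struct bound_port {
        unsigned short port;
        struct bound_port *next;          /* chain within one bucket */
    };

    static struct bound_port *bind_hash[HASH_SIZE];
    static unsigned short rover = PORT_LOW;   /* rotating "next port" hint */

    static int port_in_use(unsigned short port)
    {
        struct bound_port *p;

        for (p = bind_hash[port % HASH_SIZE]; p != NULL; p = p->next)
            if (p->port == port)
                return 1;
        return 0;
    }

    /* Return a free local port, or 0 if the whole range is in use.
     * The caller is expected to insert the returned port into the hash. */
    static unsigned short pick_local_port(void)
    {
        int tries;

        for (tries = 0; tries <= PORT_HIGH - PORT_LOW; tries++) {
            unsigned short candidate = rover;

            rover = (rover == PORT_HIGH) ? PORT_LOW : rover + 1;
            if (!port_in_use(candidate))
                return candidate;
        }
        return 0;
    }

    int main(void)
    {
        printf("first pick: %u\n", pick_local_port());
        printf("second pick: %u\n", pick_local_port());
        return 0;
    }

The rotating counter means a fresh candidate usually sits in a short (often
empty) bucket, which is presumably what is meant by lessening the risk of
lookup collisions.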

Given the relatively large size of these hash tables I don't think lookup
time is much of a problem here.

> On Solaris, I check sometimes with:
> netstat -na | nawk '{print $NF}' | sort | uniq -c | sort -n
> and make sure that the number of sockets in *WAIT is not higher
> than the number of sockets in the ESTABLISHED state.

Depends on the type of server. To make sense of the values, the count you are
really interested in is the number of ports bound by *WAIT sockets.

> tcp_fin_wait_2_flush_interval
> This value seems to describe the (BSD) timer interval which
> prohibits a connection from staying in the FIN_WAIT_2 state forever.
> FIN_WAIT_2 is reached if a connection closes actively. The FIN
> is acknowledged, but the FIN from the passive side didn't arrive
> yet - and maybe never will.

Ok. So this is 60 seconds on Linux, from the fact that FIN_WAIT2 is handled
by the same garbage collection as TIME_WAIT.

> Well, rfc-1122 requires TW to be 240 sec. As you said, Linux fixes
> it at 60 sec, which is more useful, but ignores the RFC. This has
> an immediate impact on perf when comparing Linux to Solaris.

True.

> btw, what does tunable tcp_tw_recycle actually do on linux?

I don't know. It is disabled due to problems and should not be enabled unless
you know what it does.

> tcp_max_orphans should imho also be related to the same matter.

tcp_max_orphans - INTEGER
        Maximal number of TCP sockets not attached to any user file handle,
        held by system. If this number is exceeded orphaned connections are
        reset immediately and warning is printed. This limit exists
        only to prevent simple DoS attacks, you _must_ not rely on this
        or lower the limit artificially, but rather increase it
        (probably, after increasing installed memory),
        if network conditions require more than default value,
        and tune network services to linger and kill such states
        more aggressively. Let me to remind again: each orphan eats
        up to ~64K of unswappable memory.

I.e. the maximum number of sockets abandoned by the application but still
lingering for the TCP session to complete before closing down. The kernel
will complain loudly if this parameter requires attention.

I.e. sockets that still have unacknowledged data at the time they are closed
by the application.
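
As a concrete illustration of the "linger and kill such states more
aggressively" advice, here is a minimal sketch of an abortive close, assuming
the application can afford to throw the queued data away: with SO_LINGER
enabled and a zero timeout, close() resets the connection instead of leaving
an orphan behind.

    /* Sketch: abortive close.  With SO_LINGER enabled and a zero timeout,
     * close() discards any unsent/unacknowledged data and sends an RST,
     * so no orphan is left lingering.  Only use this where losing the
     * queued data is acceptable. */
    #include <sys/socket.h>
    #include <unistd.h>

    static int abortive_close(int fd)
    {
        struct linger lin;

        lin.l_onoff = 1;        /* linger option active...            */
        lin.l_linger = 0;       /* ...but with a zero timeout: abort  */
        if (setsockopt(fd, SOL_SOCKET, SO_LINGER, &lin, sizeof(lin)) < 0)
            return -1;
        return close(fd);
    }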

> > Linux also drops TIME_WAIT sockets before the period has expired in
> > response to matching RST packets.
>
> I guess you mean FW2 here?

No, but it applies to FW2 also.

This is consistent with RST processing in general. If a TCP RST is received,
the connection is immediately destroyed, as it is known that the other
endpoint is no longer there. There is no need to wait out TIME_WAIT in this
case.

Not that I can think of any natural condition where an RST would be seen for
a TIME_WAIT socket, except for stray delayed packets sent by us combined with
an unexpected reset of the TCP state of the peer.

Regards
Henrik