Observations on Performance with squid 3.1.3 - especially why writes to the http socket are only 4096 bytes

From: Martin Sperl <Martin.Sperl_at_amdocs.com>
Date: Tue, 15 Jun 2010 12:44:03 +0100

Hi!

We have done extensive testing with squid 3.1.3 in accelerator mode and found some performance "issues" - especially over WAN links.
Essentially we have seen a "cliff" behavior where the response times logged by squid ("%6tr" in the default squid logformat of the access logs) are a direct function of the number of bytes transferred.

We have seen (from actual usage over an Internet link between the US and Mexico) that all responses smaller than approximately 13k (including headers) are delivered with a "constant" 0.002 sec, while the ones above have an RT of 0.02 seconds (an order of magnitude higher).
We assume that OS-level TCP buffers are involved, which make it possible to deliver the "smaller" files very efficiently, while the bigger files have to wait for acknowledgments from the peer.

Our main point here is that our customer wants us to report SLAs on these numbers, and the bigger responses seem to include Internet latencies, which we need to avoid: we should only report timings that are within our span of control (so without the Internet latencies). I know that such a definition is not optimal, but that is simply what the contract for this project requires us to provide. So we are essentially trying to figure out ways to improve our KPI reporting to exclude the effects that are not within our span of control - the motto is: never trust statistics that you have not faked yourself...

So we have been investigating how this "cliff" can be explained.

Our first thought was that the Linux OS parameters for the TCP buffers were not set to optimal values.
We first investigated the following values:
* net.core.rmem_default
* net.core.wmem_default
* net.core.rmem_max
* net.core.wmem_max

And found that our settings are already at "more aggressive" values - the ones that Oracle recommends for its database.

Only now have we found out that there are other parameters as well that may be affecting TCP performance:
* net.ipv4.tcp_rmem
* net.ipv4.tcp_wmem
(Actually I am surprised that Oracle does not ask for these parameters to be set!)

These have yet to be applied to the production system (because we just realized the situation today), so I cannot tell if this is effective.
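
For reference, both of these take a triplet of min/default/max buffer sizes in bytes, and (as we understand it) the max of net.ipv4.tcp_wmem is what limits the kernel's per-socket auto-tuning, independent of net.core.wmem_max. A sketch of what we intend to try in /etc/sysctl.conf - the numbers are purely illustrative, not a recommendation:

    # min / default / max socket buffer sizes in bytes (illustrative values only)
    net.ipv4.tcp_rmem = 4096 262144 4194304
    net.ipv4.tcp_wmem = 4096 262144 4194304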

So (as we did not know about these parameters yet) we have also been investigating how squid itself behaves at the OS level.

For this we have attached strace to the squid 3.1.3 process and have made some "interesting" observations when running apache bench against the system for requests that squid answers with a "TCP_MEM_HIT" (so served from memory):

* writes to the http socket are 4096 bytes in size
  * this results in bigger files needing more (e)poll loops to get transferred.
    * for a 22k file + headers this means that at least 6 epoll loops with writes are needed to deliver the bytes.
    * depending on the link (say with a GBIT network interface,..) it may be possible that squid is
      not even able to keep data streaming continuously.
  * the question here is why squid only writes 4096 bytes at a time and not as much as the OS will accept with the write call? (a sketch of the pattern we mean follows after this list)
    * this is especially true for sockets that have O_NONBLOCK configured.
  * and reads are limited to 4095 bytes
  * one note: the first write can be bigger than 4096 bytes (I assume it is the header)
  * ICAP sockets, on the other hand, are read 65536 bytes in one go (I have not investigated writes)
* each (e)poll loop takes with strace:
  * 0.000229 seconds when ab is running with a concurrency of 1 (CPU load of 5% (3%u+2%s))
  * 0.001818 seconds when ab is running with a concurrency of 10 (CPU load of 25% (20%u+5%s))
  * this is almost a factor of 8 difference
  * the (e)poll loop duration is, as far as we can tell, a function of the number of file-handles that (e)poll returns as "to handle".
    * so more concurrency means longer time that is spent processing the individual file-handles:
      * reading from the file-handle (if waiting for bytes)
      * figuring out the "state machine" for this file-handle
      * writing to the file-handle (if writing)
    * here again the question arises whether overall latencies would not benefit
      from writing as many bytes as possible per write sys-call instead of just 4096
      * especially the calculation of the "state machine" for the next actions is probably expensive,
        so reducing the number of these calculations (by writing more in one go)
        would benefit the overall system performance.
* it may also be worth thinking about handling actions immediately instead of deferring them to the next loop.
  Examples here are:
  * a GET request that can get fulfilled from memory could get written to the socket immediately
    instead of deferring to the state-machine with the next poll-loop.
    * Here the example of the current situation with a TCP_MEM_HIT case:
      * Loop 1 Get request comes in
      * Loop 2 content is delivered to requestor
    * Here the example for the "shortcut" implementation:
      * Loop 1
        * Get request comes in
        * content is delivered to requestor
  * ICAP request and response modifications could get passed on in the same poll loop instead of waiting for the next poll loop.
    * this could reduce the ICAP introduced latencies from 4 to 2 poll cycles.
    * Here the example of the current situation for a request modification with content in memory:
      * Loop 1 GET request comes in
      * Loop 2 ICAP request is sent to ICAP server
      * Loop 3 ICAP response is received
      * Loop 4 content is delivered to requestor
    * Here is how it could look with the "shortcut":
      * Loop 1:
        * GET request comes in
        * ICAP request is sent to ICAP server (using a connection from an idle pool)
      * Loop 2:
        * ICAP response is received
        * content is delivered to requestor
  * There are probably other examples, but these are the ones I have observed so far.
    Both would definitely reduce latencies for the memory cache hit case.
  * probably this would introduce higher code complexity, but maybe it is worth it...
* at least on Linux a stat("/etc/localtime", ...) is done on each and every incoming request
  * this may not be a very expensive call, but it could still be avoided (a sketch follows after this list).
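
To make the question about the 4096-byte writes more concrete: on a socket with O_NONBLOCK set one can offer the whole pending buffer to a single write() and let the kernel decide how much it accepts; a short write or EAGAIN is then the signal to defer the rest to the next (e)poll cycle. A minimal sketch of that pattern (not squid code, the names are made up, just an illustration of what we mean):

    #include <errno.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Offer the whole pending buffer to the kernel in one go.
     * Returns the number of bytes actually written; *deferred is set to 1
     * when the socket send buffer is full and the remainder has to wait
     * for the next (e)poll cycle. */
    static ssize_t write_as_much_as_accepted(int fd, const char *buf, size_t len, int *deferred)
    {
        size_t done = 0;
        *deferred = 0;
        while (done < len) {
            ssize_t n = write(fd, buf + done, len - done);
            if (n > 0) {
                done += (size_t)n;
                continue;
            }
            if (n < 0 && errno == EINTR)
                continue;                      /* interrupted, just retry */
            if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
                *deferred = 1;                 /* kernel buffer full - defer the rest */
                break;
            }
            return -1;                         /* real error, e.g. ECONNRESET */
        }
        return (ssize_t)done;
    }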
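
Regarding the stat("/etc/localtime", ...) per request: as far as we can tell this comes from glibc re-checking the timezone file on every localtime() call when the TZ environment variable is unset, so setting TZ once at startup should make the repeated stat calls disappear. A sketch, assuming that glibc behaviour (we have not verified where in squid the call is actually triggered):

    #include <stdlib.h>
    #include <time.h>

    /* With TZ set explicitly, glibc caches the timezone instead of
     * stat()ing /etc/localtime on every localtime()/ctime() call. */
    static void avoid_localtime_stat(void)
    {
        setenv("TZ", ":/etc/localtime", 1);    /* or e.g. "UTC" */
        tzset();
    }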

Maybe you have a comment on these ideas for latency improvements. Maybe you know how to change the 4096 limitation (I believe it is a single define that needs to be changed) and whether this triggers any code issues...

Finally we have found that when doing an strace on squid with ab and a concurrency of 10 (-c 10), we can trigger, in at least 50% of all runs, an exit of squid with a "pure virtual method called" error - probably a missing handler.

The end of the strace looks like this:
1276508933.516600 read(11, 0x2b795641e010, 4095) = -1 ECONNRESET (Connection reset by peer)
1276508933.516697 read(12, "", 4095) = 0
1276508933.516773 write(33, "REQMOD icap://127.0.0.1:1344/reqmod ICAP/1.0\r\nHost: 127.0.0.1:1344\r\nDate: Mon, 14 Jun 2010 09:48:53 GMT\r\nEncapsulated: req-hdr=0, null-body=146\r\nAllow: 204\r\n\r\nGET http://XXX.ucpag.com/XXX.qpass.co"..., 305) = 305
1276508933.516939 read(23, "", 4095) = 0
1276508933.517014 read(21, "", 4095) = 0
1276508933.517089 read(17, "", 4095) = 0
1276508933.517173 read(19, "", 4095) = 0
1276508933.517246 read(20, "", 4095) = 0
1276508933.517318 read(22, "", 4095) = 0
1276508933.517391 read(24, "", 4095) = 0
1276508933.517464 read(32, "ICAP/1.0 200 OK\r\nDate: Mon, 14 Jun 2010 09:48:53 GMT\r\nConnection: keep-alive\r\nISTag: \"ReqModService\"\r\nEncapsulated: req-hdr=0,null-body=249\r\n\r\nGET http://XXX.qpass.com/css/wap2/framework.css htt"..., 65535) = 394
1276508933.517661 epoll_ctl(6, EPOLL_CTL_MOD, 32, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=32, u64=32}}) = 0
1276508933.517750 read(18, "", 4095) = 0
1276508933.517832 read(34, "ICAP/1.0 200 OK\r\nDate: Mon, 14 Jun 2010 09:48:53 GMT\r\nConnection: keep-alive\r\nISTag: \"ReqModService\"\r\nEncapsulated: req-hdr=0,null-body=249\r\n\r\nGET http://XXX.qpass.com/css/wap2/framework.css htt"..., 65535) = 394
1276508933.517959 epoll_ctl(6, EPOLL_CTL_MOD, 34, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=34, u64=34}}) = 0
1276508933.518103 read(11, "", 65535) = 0
1276508933.518193 read(12, "", 65535) = 0
1276508933.518266 read(23, "", 65535) = 0
1276508933.518332 read(21, "", 65535) = 0
1276508933.518397 read(17, "", 65535) = 0
1276508933.518461 read(19, "", 65535) = 0
1276508933.518526 read(20, "", 65535) = 0
1276508933.518592 read(22, "", 65535) = 0
1276508933.518657 read(24, "", 65535) = 0
1276508933.518871 read(18, "", 65535) = 0
1276508933.519032 epoll_ctl(6, EPOLL_CTL_DEL, 11, {0, {u32=11, u64=4601595353542688779}}) = 0
1276508933.519108 close(11) = 0
1276508933.519208 epoll_ctl(6, EPOLL_CTL_DEL, 12, {0, {u32=12, u64=39075818617438220}}) = 0
1276508933.519274 close(12) = 0
1276508933.519374 epoll_ctl(6, EPOLL_CTL_DEL, 23, {0, {u32=23, u64=39075818617438231}}) = 0
1276508933.519441 close(23) = 0
1276508933.519528 epoll_ctl(6, EPOLL_CTL_DEL, 21, {0, {u32=21, u64=39075818617438229}}) = 0
1276508933.519593 close(21) = 0
1276508933.519679 epoll_ctl(6, EPOLL_CTL_DEL, 17, {0, {u32=17, u64=39075818617438225}}) = 0
1276508933.519745 close(17) = 0
1276508933.519830 epoll_ctl(6, EPOLL_CTL_DEL, 19, {0, {u32=19, u64=39075818617438227}}) = 0
1276508933.519898 close(19) = 0
1276508933.519984 epoll_ctl(6, EPOLL_CTL_DEL, 20, {0, {u32=20, u64=39075818617438228}}) = 0
1276508933.520050 close(20) = 0
1276508933.520180 epoll_ctl(6, EPOLL_CTL_DEL, 22, {0, {u32=22, u64=39075818617438230}}) = 0
1276508933.520266 close(22) = 0
1276508933.520353 epoll_ctl(6, EPOLL_CTL_DEL, 24, {0, {u32=24, u64=39075818617438232}}) = 0
1276508933.520418 close(24) = 0
1276508933.520509 epoll_ctl(6, EPOLL_CTL_DEL, 18, {0, {u32=18, u64=39075818617438226}}) = 0
1276508933.520591 close(18) = 0
1276508933.520699 write(2, "pure virtual method called\n", 27) = 27
1276508933.520797 write(2, "terminate called without an active exception\n", 45) = 45
1276508933.520888 rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
1276508933.520974 tgkill(8565, 8565, SIGABRT) = 0
1276508933.521030 --- SIGABRT (Aborted) @ 0 (0) ---

So squid receives a "connection reset by peer" on one of the HTTP TCP connections, subsequently closes all 10 sockets (HTTP + ICAP), and then runs into the strange "pure virtual method called" situation that the C++ runtime complains about.
Unfortunately it is not possible to see the output of those 2 writes to STDERR (without strace), as STDOUT and STDERR are redirected to /dev/null. Also no core dump is produced, so we cannot provide a stack-trace either...

You may want to address this as well...

Thanks,
        Martin

 
     
