Some optimisations

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Fri, 15 Sep 2000 23:26:12 +0200

 I have lately been tracing squid and looking at what IO pattern it has.
 To my surprise I discovered that it uses quite a few different
 buffer sizes for different types of IO.
 Here is a sample from Solaris truss, with LWP IDs shown:

/1: 0.2581 0.0019 read(323, "BDCF859519 \BFC7\fFE9FB3".., 65536) = 65536
/13: 0.2609 0.1661 write(122, "83C9 6 '04918EC2 s g UCA".., 8192) = 8192
/1: 0.2637 0.0056 write(350, "01B5C1AAF99AED v yD7 (04".., 4096) = 4096
/13: 0.2653 0.0044 write(122, "CEA481 J r1E U g xECE9 ]".., 8192) = 8192
/13: 0.2677 0.0024 write(122, "F61F BF7 0C4 s YC303 vBE".., 8192) = 8192
/13: 0.2693 0.0016 write(122, "EE 39C _191188 tD3AA -".., 8192) = 8192
/13: 0.2709 0.0016 write(122, "1A . !05\0 $D9 KE1 ;A59F".., 8192) = 8192
/13: 0.2726 0.0017 write(122, "12 RB2B4\t89 S8FA4FDF0 /".., 8192) = 8192
/13: 0.2741 0.0015 write(122, " ) A70F }FDEC9A `91A0A6".., 8192) = 8192
/13: 0.2764 0.0023 write(122, " 6 .94ED aAED698 005D6\t".., 8192) = 8192
/1: 0.2820 0.0024 write(350, "B9 C r ZC7 \ zF8EB A bA8".., 4096) = 4096
/1: 0.2896 0.0010 write(350, " ^ R 4E2\bFEDA < I 2 FC7".., 4096) = 4096
/1: 0.2925 0.0029 write(350, " D1605 IABD9D0\n ]A3 \DA".., 4096) = 4096
/1: 0.2967 0.0042 write(350, " HA9CCC4 H10 \95A0 | )C1".., 4096) = 4096
/1: 0.2992 0.0025 write(350, " +DD\b 1\v\0 2 $811F16 v".., 4096) = 4096
/1: 0.3007 0.0015 write(350, " * wB6FFE3 r 9 >0505D1E5".., 4096) = 4096
/1: 0.3022 0.0015 write(350, "EAD7 K8DCFBF88AF15CE81\v".., 4096) = 4096
/1: 0.3050 0.0028 write(350, "C4 @BEB2ECBFB48D84 .B1 2".., 4096) = 4096
/1: 0.3071 0.0021 write(350, " k ; 6 6C7F0D2 Y /D5C1 ?".., 4096) = 4096
/1: 0.3102 0.0012 write(350, "89 ;A7 EE 8 6A3BE (10D3".., 4096) = 4096
/1: 0.3128 0.0026 write(350, "18A7BD80 WEDBAE5D9A914 i".., 4096) = 4096
/1: 0.3165 0.0024 write(350, "FC ^CF rCBE5 59F y [11F6".., 4096) = 4096
/1: 0.3190 0.0025 write(350, "CF9F 0 I L \ yBEA685C590".., 4096) = 4096

/36: 3.6417 0.1122 read(130, "E5 ' O (9489 k o95E7028A".., 4096) = 4096
/1: 3.6649 0.0014 write(365, "E5 ' O (9489 k o95E7028A".., 4096) = 4096

 Here we can see that the main thread reads 64KB from a server socket, writes
 8KB to disk, writes 4KB to a client socket, and then spends quite some
 time writing the buffered 64KB out to different FDs. Later we can see how squid
 reads from disk in 4KB pieces and writes them to a client socket. This is
 the typical IO pattern in squid.

 Thinking about it, it seems reasonable to make such buffers somewhat
 more uniform in size. For each server socket squid mallocs a 64KB
 buffer that consumes RAM but is very seldom actually used to its full
 extent, and for slow servers parts of such buffers may get paged out and
 later contribute to pagefaults. Also, when there are thousands of open
 sockets to servers, this translates into quite a large amount of RAM used.
 The size is a configure/compile-time define, and on systems that have tuned
 their TCP for SAT communication, squid may decide that 512-1024KB buffers
 for each server socket are appropriate. I'm not sure what the reasoning
 behind such large buffers inside squid is, but I think the TCP stack of the
 OS is buffering enough that we don't need to issue larger reads from servers
 than we plan to write to disks or client sockets.
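
 Below is a minimal sketch of that point in code form. The server_conn
 struct and the server_conn_open/server_conn_read helpers are made-up
 names, not squid's actual code; only the SQUID_TCP_SO_RCVBUF define
 (capped at 16KB by the patch at the end of this mail) is real.

#include <stdlib.h>
#include <unistd.h>

#define SQUID_TCP_SO_RCVBUF 16384       /* capped; effectively 64KB before */

typedef struct {
    int fd;
    char *readbuf;                      /* one buffer per open server socket */
} server_conn;

/* error handling omitted for brevity */
static server_conn *
server_conn_open(int fd)
{
    server_conn *c = malloc(sizeof(*c));
    c->fd = fd;
    /* 1000 idle server sockets: ~16MB here instead of ~64MB at 65536 */
    c->readbuf = malloc(SQUID_TCP_SO_RCVBUF);
    return c;
}

static ssize_t
server_conn_read(server_conn *c)
{
    /* a single read() never exceeds the cap, so downstream writes
     * (disk, client socket) see at most 16KB per event */
    return read(c->fd, c->readbuf, SQUID_TCP_SO_RCVBUF);
}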

 It seems that disk reads are tied to client socket buffering, and at
 4096 these look too small. 8192 would be my suggested minimum, and
 even 16KB would be more appropriate, as this size covers most object
 sizes (90% of web objects fit into 16KB). By upping client socket buffers
 to 16KB we potentially get larger writes, meaning less poll()ing and fewer
 system calls overall.
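
 As a back-of-envelope illustration (my arithmetic, not numbers from the
 trace), the write()/poll() round count per object for 4KB versus 16KB
 client buffers:

#include <stdio.h>

static long
rounds(long object_size, long bufsz)
{
    return (object_size + bufsz - 1) / bufsz;   /* ceiling division */
}

int
main(void)
{
    long sizes[] = { 4096, 16384, 65536, 1048576 };
    int i;
    for (i = 0; i < 4; i++)
        printf("%8ld bytes: %4ld rounds at 4KB, %4ld at 16KB\n",
               sizes[i], rounds(sizes[i], 4096), rounds(sizes[i], 16384));
    return 0;
}

 A 64KB object drops from 16 rounds to 4, and anything up to 16KB (the
 90% case) goes out in a single write.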

 From these observations I decided to limit the server-side buffer size to
 16KB, increase disk buffers from 8KB to 16KB, and also increase client
 buffers from 4KB to 16KB, and watch how much it matters.
 I think the result is better disk utilisation for large objects,
 fewer disk ops for moderately sized objects, and fewer read/write syscalls
 needed. All this seems to translate into lower CPU usage and
 smoother throughput.

/1: 9.3515 0.0012 read(103, "E6\b @84A51E9BECE0 M MC1".., 16384) = 16384
/1: 9.3535 0.0020 write(53, "E6\b @84A51E9BECE0 M MC1".., 16384) = 16384
/1: 9.3576 0.0003 read(103, " ]A40799 0C483EC9FE0 A B".., 16384) = 16384
/1: 9.3595 0.0019 write(53, " ]A40799 0C483EC9FE0 A B".., 16384) = 16384
/1: 9.3611 0.0016 read(103, "BC F0DD ! N C ( aA2 &".., 16384) = 16384
/1: 9.3629 0.0018 write(53, "BC F0DD ! N C ( aA2 &".., 16384) = 15008
/1: 9.5880 0.0060 read(205, "D613 :EF04DC14 s (F8 I m".., 16384) = 4380
/1: 9.5897 0.0017 write(283, "D613 :EF04DC14 s (F8 I m".., 4380) = 4380
/25: 9.6040 0.0208 read(9, "03 z\0\0\00310\0\0\0E6 G".., 16384) = 16384
/1: 9.6072 0.0175 write(122, " H T T P / 1 . 0 2 0 0".., 16331) = 16331
/30: 9.8635 0.0183 read(116, "037F\0\0\00310\0\0\0DA :".., 16384) = 14293
/1: 9.8678 0.0005 write(132, " H T T P / 1 . 0 2 0 0".., 14234) = 14234
/30: 10.1289 0.0101 write(9, " H T T P / 1 . 1 2 0 0".., 16384) = 16384
/1: 10.1311 0.0262 write(202, " AEB J85D6A9 iB4E0A2A5 M".., 4592) = 4592
/1: 10.1322 0.0011 read(224, " ,E2 ` dC7BF ) _B5DCB31B".., 16384) = 16384
/1: 10.1336 0.0014 write(202, " ,E2 ` dC7BF ) _B5DCB31B".., 16384) = 16384
/1: 10.1348 0.0012 read(224, " Q %93 ]A9AA r 5A902 nCB".., 16384) = 16384
/1: 10.1361 0.0013 write(202, " Q %93 ]A9AA r 5A902 nCB".., 16384) = 16384
/1: 10.1373 0.0012 read(224, " JCE 480AA "91E0 _AA HB9".., 16384) = 16384
/1: 10.1388 0.0015 write(202, " JCE 480AA "91E0 _AA HB9".., 16384) = 16384
/1: 10.1400 0.0012 read(224, "9F n10AE1A899C1DC18B HCA".., 16384) = 16384
/1: 10.1413 0.0013 write(202, "9F n10AE1A899C1DC18B HCA".., 16384) = 8576
/30: 10.2385 0.0018 write(166, " H T T P / 1 . 1 2 0 0".., 16384) = 16384
/30: 10.2738 0.0232 read(167, "03 v\0\0\00310\0\0\0 d ?".., 16384) = 16384
/1: 10.2783 0.0024 write(148, " H T T P / 1 . 0 2 0 0".., 16335) = 16335

 A further strangeness in squid is that when it swaps out an object, it first
 writes out the metadata (storeentry+url) with whatever size is needed, and from
 then on writes in disk blocks. This means that _all_ swapouts occur in a badly
 aligned manner:

/13: 24.0067 0.1316 write(94, "03 v\0\0\00310\0\0\0 ,8E".., 118) = 118
/13: 24.0087 0.0020 write(94, " H T T P / 1 . 1 2 0 0".., 8192) = 8192
/29: 24.3986 0.0000 write(94, "D1 S c83 $E1 s93A3C3 415".., 8192) = 8192
/29: 24.6566 0.1370 write(94, "\0\bBE\0F1 J d T 1131114".., 8192) = 8192
/29: 24.7100 0.0534 write(94, "01 DF10FD7B1189F\tA8011C".., 8192) = 8192
/13: 24.9666 0.5914 write(94, " O0E u\bED8C b0F h\fB0 U".., 3334) = 3334

 When writing large files, this is busting the disk subsystem, especially with
 RAID. To further optimise disk write access we could think about buffering
 writes inside squid so that we at least align writes to suitable sizes, and
 ideally, with 16KB buffers, we are able to collapse a full swapout event into
 one shot (open + write + close) for 90% of cases.
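
 A minimal sketch of that buffering idea follows; the swapout_* names are
 hypothetical, only the 16KB size comes from the discussion above, and
 error handling is omitted.

#include <string.h>
#include <unistd.h>

#define SWAP_BUF_SZ 16384

typedef struct {
    int fd;
    size_t used;
    char buf[SWAP_BUF_SZ];
} swapout_state;

static void
swapout_flush(swapout_state *s)
{
    if (s->used > 0) {
        write(s->fd, s->buf, s->used);  /* error handling omitted */
        s->used = 0;
    }
}

/*
 * Append metadata or object data; flush only when the buffer fills, so
 * on-disk writes stay buffer-sized and aligned instead of starting 118
 * bytes off as in the trace above.
 */
static void
swapout_append(swapout_state *s, const char *data, size_t len)
{
    while (len > 0) {
        size_t room = SWAP_BUF_SZ - s->used;
        size_t n = len < room ? len : room;
        memcpy(s->buf + s->used, data, n);
        s->used += n;
        data += n;
        len -= n;
        if (s->used == SWAP_BUF_SZ)
            swapout_flush(s);
    }
}

 For the 90% of objects that fit into 16KB, metadata plus body never fill
 the buffer, and the close path can issue a single write for the whole
 object, which is the one-shot open + write + close case above.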

config.h
@@ -141,6 +141,18 @@
 #if SQUID_UDP_SO_RCVBUF > 16384
 #undef SQUID_UDP_SO_RCVBUF
 #define SQUID_UDP_SO_RCVBUF 16384
+#endif
+
+/* TCP send buffer size */
+#if SQUID_TCP_SO_SNDBUF > 16384
+#undef SQUID_TCP_SO_SNDBUF
+#define SQUID_TCP_SO_SNDBUF 16384
+#endif
+
+/* TCP receive buffer size */
+#if SQUID_TCP_SO_RCVBUF > 16384
+#undef SQUID_TCP_SO_RCVBUF
+#define SQUID_TCP_SO_RCVBUF 16384
 #endif

defines.h
@@ -147,7 +152,8 @@
 #define LOG_DISABLE 0
 
 #define SM_PAGE_SIZE 4096
-#define DISK_PAGE_SIZE 8192
+#define DISK_PAGE_SIZE 16384
 
@@ -270,7 +276,8 @@
  */
 #define PEER_TCP_MAGIC_COUNT 10
 
-#define CLIENT_SOCK_SZ 4096
+#define CLIENT_SOCK_SZ 16384

------------------------------------
 Andres Kroonmaa <andre@online.ee>
 Delfi Online
 Tel: 6501 731, Fax: 6501 708
 Pärnu mnt. 158, Tallinn,
 11317 Estonia