Re: async-io for 2.4

From: Joe Cooper <joe@dont-contact.us>
Date: Fri, 03 Nov 2000 05:10:04 -0600

Henrik Nordstrom wrote:
>
> Joe Cooper wrote:
> >
> > Yes, the code seems pretty solid at this point.
> >
> > > It did blow up under the boiler workload, and at first I didn't see
> > > why, since it wasn't yet being overloaded--though the load was high
> > > (190 reqs/sec), the response time and hit rate were still great.
> > > Never mind. Tracked it down to an error between the chair and the
> > > keyboard (the log device filled up). OK... yep, looks very solid so far.
> >
> > Anyway, I'll have some benchmark results for this version to post in
> > another ~4 hours or so.
>
> Great news.
>
> Will commit the code to 2.4 in a few days unless someone has any reason
> not to.

I'm not going to argue that it shouldn't go in, since this is much more
stable than the current 2.3 async code (which is probably unusable), but
I think there are still some stability issues that need to be addressed.
Now that the cache is filled and I've tried running the rest of the
benchmarks, the box regularly crashes pretty hard. I'm going to try it
on an ext2 partition next to see whether the problem is with ReiserFS
(the box is freezing due to a ReiserFS deadlock condition...but I think
it is triggered by a Squid issue, since I don't see this problem with
2.2STABLE5+hno--though I can't be sure, as I've never run this set of
benchmarks against that Squid before).

Anyway...Here's what happens:

Squid gives these errors in cache.log:

2000/11/03 01:19:46| storeAufsOpenDone: (2) No such file or directory
2000/11/03 01:19:46| /cache1/01/5F/00015F6A
2000/11/03 01:20:15| comm_poll: poll failure: (12) Cannot allocate memory
2000/11/03 01:20:15| Select loop Error. Retry 1
2000/11/03 01:20:15| comm_poll: poll failure: (12) Cannot allocate memory
2000/11/03 01:20:15| Select loop Error. Retry 2
2000/11/03 04:14:49| comm_poll: poll failure: (12) Cannot allocate memory
2000/11/03 04:14:49| Select loop Error. Retry 3
2000/11/03 04:14:49| comm_poll: poll failure: (12) Cannot allocate memory
2000/11/03 04:14:49| Select loop Error. Retry 4
2000/11/03 04:14:49| comm_poll: poll failure: (12) Cannot allocate memory
2000/11/03 04:14:49| Select loop Error. Retry 5
2000/11/03 04:14:49| comm_poll: poll failure: (12) Cannot allocate memory
2000/11/03 04:14:49| Select loop Error. Retry 6
2000/11/03 04:14:49| comm_poll: poll failure: (12) Cannot allocate memory
2000/11/03 04:14:49| Select loop Error. Retry 7
2000/11/03 04:14:49| comm_poll: poll failure: (12) Cannot allocate memory
2000/11/03 04:14:49| Select loop Error. Retry 8
2000/11/03 04:14:49| comm_poll: poll failure: (12) Cannot allocate memory
2000/11/03 04:14:49| Select loop Error. Retry 9
2000/11/03 04:14:49| comm_poll: poll failure: (12) Cannot allocate memory
2000/11/03 04:14:49| Select loop Error. Retry 10
FATAL: Select Loop failed!
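
For what it's worth, the log reads as though poll() itself is failing
with ENOMEM (errno 12), the event loop logs the failure and retries,
and after ten accumulated errors it gives up with the FATAL message.
Here's a minimal standalone sketch of that retry-then-abort pattern --
my own reconstruction from the messages above, not the actual Squid
source, so the names and limits are assumptions:

/*
 * Sketch only: reproduces the retry-then-FATAL behaviour seen in the
 * log.  Not the real comm_poll()/main loop; MAX_LOOP_ERRORS and the
 * function names are assumptions.
 */
#include <errno.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LOOP_ERRORS 10              /* "Retry 10", then FATAL */

static struct pollfd pfds[1];           /* stand-in for the real fd table */
static int nfds = 1;

static int event_loop_once(void)
{
    int num = poll(pfds, nfds, 1000);
    if (num < 0) {
        if (errno == EINTR)
            return 0;                   /* interrupted, not an error */
        /* poll() can fail with ENOMEM (12) when the kernel can't
         * allocate its internal tables -- matching the log above */
        fprintf(stderr, "comm_poll: poll failure: (%d) %s\n",
                errno, strerror(errno));
        return -1;
    }
    /* ... dispatch ready descriptors here ... */
    return 0;
}

int main(void)
{
    int errcount = 0;   /* note: in the log the count persists across
                         * hours (Retry 2 at 01:20, Retry 3 at 04:14),
                         * so it evidently isn't reset on good passes */
    pfds[0].fd = 0;                     /* stdin, just so poll() has an fd */
    pfds[0].events = POLLIN;
    for (;;) {
        if (event_loop_once() < 0) {
            fprintf(stderr, "Select loop Error. Retry %d\n", ++errcount);
            if (errcount >= MAX_LOOP_ERRORS) {
                fprintf(stderr, "FATAL: Select Loop failed!\n");
                exit(1);
            }
        }
    }
}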

ReiserFS also reports memory allocation failures and triggers a
deadlock condition (which I'll report to the ReiserFS guys...Chris Mason
over there is a hero on these kinds of problems). Note that memory is
not exhausted when this condition occurs. There is some data in swap
(136MB worth, actually), but it was put there much earlier and didn't
cause problems. CPU saturation seems to be the culprit here.

Here is a snippet of 'vmstat 1' output as the deadlock happens (when
system CPU usage goes to 100%, you know the box is locked):

   procs                      memory      swap          io      system          cpu
 r  b  w   swpd   free   buff  cache   si  so    bi    bo    in     cs   us  sy  id
 1  0  0    136   3188 114200  61472    0   0    44   587  4601   3519   28  57  15
 1  0  0    136   2184 114960  61612    0   0    23     0  2869   1979   42  57   1
 1  0  0    136   2512 114896  61444    0   0    15     0  2661   2248   32  65   3
 1  0  0    136   2904 114516  61432    0   0    18     0  2807   1938   36  60   4
 1  0  0    136   2196 115108  61520    0   0    24     0  2869   2082   39  55   6
 1 19  1    136   2476 114972  61304    0   0    26   188  2935   2606   33  62   5
 1  6  1    136   2064 115860  59672    0   0     7   270  2272   1202    4  21  75
 1  6  1    136   3136 115564  57664    0   0     0  1253  2023   1062    0   1  99
 6  0  0    136   2868 115564  57664    0   0     4  1243  1318  11745    6  51  43
 7  0  0    136   2864 115564  57664    0   0     0     0   738  22634   16  84   0
10  0  0    136   2824 115564  57664    0   0     4     0   822    504    0 100   0
17  0  0    136   2796 115564  57664    0   0     8     0   508    283    1  99   0
18  0  1    136   2792 115564  57664    0   0     2   690   347    164    0 100   0

This happened during two attempts at a memspeed run, which is simply a
steadily increasing request rate with a workload small enough to be
stored entirely in memory. Why there is so much disk activity just
before the failure, I'm not sure. Also note that the box doesn't
gradually become overloaded as it does with butterfly or the old 2.2
async code...this box falls over within seconds of the response time
climbing over 2.5 seconds (while the other two can usually hang on for
quite a while with response times over 20 seconds). In fact, in this
case the box didn't look overloaded at all. Response times were perfect
(even though it was under very heavy load)...but then it just died and
stopped responding on the network. I had to log in on the console to
fix it.
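
(As an aside, for anyone unfamiliar with it: the memspeed ramp is about
as simple as a workload gets. The sketch below is only an illustration
of the idea -- the URL count, rates, and send_request() stub are made
up; it is not the actual benchmark driver.)

/*
 * Illustration only: a "memspeed"-style ramp -- a small fixed URL set
 * (so the whole working set stays in the cache's memory) hit at a
 * steadily increasing request rate.  All numbers are invented.
 */
#include <stdio.h>
#include <unistd.h>

#define URLS        1000        /* small working set, fits in RAM */
#define START_RATE  50          /* requests per second at the start */
#define STEP        10          /* added to the rate every second */

static void send_request(int url_id)
{
    /* placeholder: a real driver would issue an HTTP GET via the proxy */
    printf("GET http://testhost/obj%d HTTP/1.0\n", url_id);
}

int main(void)
{
    int rate = START_RATE;
    for (int second = 0; ; second++, rate += STEP) {
        for (int i = 0; i < rate; i++)
            send_request((second + i) % URLS);  /* cycle the small URL set */
        sleep(1);               /* crude one-second pacing */
    }
}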

More info when I've got it.
                                  --
                     Joe Cooper <joe@swelltech.com>
                 Affordable Web Caching Proxy Appliances
                        http://www.swelltech.com
Received on Fri Nov 03 2000 - 04:03:20 MST
