Vexing comm_select(): select failure bug

From: Scott Lystig Fritchie <fritchie@dont-contact.us>
Date: Wed, 08 Apr 1998 23:38:13 -0500

Greetings -- I've been lurking more-or-less until now. Grep'ing
through my squid-users-digest archive (going back to Sep '97), I've
found a few people who have reported the same problem, but there have
been no solutions suggested. {sigh}

The behavior logged in "cache.log" is:

98/04/05 15:16:10| comm_select: select failure: (9) Bad file number
98/04/05 15:16:10| examine_select: Examining open file descriptors...
98/04/05 15:16:10| FD 16: (9) Bad file number
98/04/05 15:16:10| WARNING: FD 16 has handlers, but it's invalid.
98/04/05 15:16:10| FD 16 is a Socket called '/home/squid/etc/redirector #1'
98/04/05 15:16:10| lifetm:0 tmout:0 read:361c4 write:0
98/04/05 15:16:10| FD 17: (9) Bad file number
[... iterate sequentially through file descriptors one-by-one ...]
98/04/05 15:16:17| FD 1042: (9) Bad file number
98/04/05 15:16:17| WARNING: FD 1042 has handlers, but it's invalid.
98/04/05 15:16:17| FD 1042 is a Socket called 'http://www.nicisterling.com/mem/m
/interview.jpg'
98/04/05 15:16:17| lifetm:2cde4 tmout:0 read:2d334 write:0
98/04/05 15:16:17| close handler: 34f78
98/04/05 15:16:17| close handler: 29e7c
98/04/05 15:16:17| Select loop Error. Retry 1
FATAL: MAIN: SHUTDOWN from comm_select, but nothing pending.
Squid Cache (Version 1.1.18): Terminated abnormally.
CPU Usage: user 896 sys 1535
Maximum Resident Size: 0 KB
Page faults with physical i/o: 10194
98/04/05 15:18:03| storeWriteCleanLog: Starting...
[...]
98/04/05 15:18:15| 143360 lines written so far.
98/04/05 15:18:16| Finished. Wrote 146801 lines.
98/04/05 15:18:16| Took 13 seconds (11292.4 lines/sec).
[... RunCache restarts the squid process [...]
98/04/05 15:18:56| Starting Squid Cache version 1.1.18 for sparc-sun-solaris2.5.1...
98/04/05 15:18:56| With 2048 file descriptors available

Each time this happens, the cache is unavailable for several minutes.
I've witnessed the behavior under:

        Solaris 2.5, Squid 1.1.18 in standard proxy mode
        Solaris 2.6, Squid 1.1.15, 1.1.18, and 1.1.20 in HTTP accelerator mode

The former seems to bottom out during busier parts of the day, but no
good estimate of how busy it is when it happens. The latter was
acting as front for the GIF server for www.bigcharts.com and was
getting anywhere from 40-70 hits/second when it would freak out.

I don't have a really busy Web server to try the latter scenario
again, unless I try convincing one of our customers (like the
bigcharts.com folks) to allow me to play host routing tricks again.
:-)

I've just recompiled 1.1.21 for the 2.5 box with USE_POLL defined.
Since examine_select() isn't even defined when USE_POLL is defined,
perhaps all of the time spent logging the "bad" file descriptors will
disappear. (I have logs showing it sometimes complains about a small
handful per second). We'll see if this makes a difference or not.

Since I'm not the first to run into this, what would be helpful in
trying to get this fixed? My hack time is, er, limited to automating
Smurf DoS hunting and INN and thus pretty thin, but I'd be willing to
gather tracing/debugging info, test patches, etc.

-Scott

---
Scott Lystig Fritchie, Senior Systems Administrator
fritchie@mr.net, PGP key #152B8725               MRNet, a MEANS Telcom Company
v: 612/362.5820, p: 612/637.9547                 2829 University Ave SE
"We're in the middle of another commercial-"     Minneapolis, MN  55414 USA
free decade."      -- Minnesota Public Radio     http://www.mr.net/
Received on Wed Apr 08 1998 - 21:43:38 MDT

This archive was generated by hypermail pre-2.1.9 : Tue Dec 09 2003 - 16:39:38 MST