Problems with filedescriptor leaks under Solaris

From: Chris Tilbury <cudch@dont-contact.us>
Date: Mon, 18 Aug 1997 18:38:04 +0100

Howdy

File descriptor leaks appear to be a problem for Squid running under
Solaris, at least for people using the NOVM version under reasonable
loads (>250 concurrent requests): the leaks cause the Squid process to
run out of FDs and stop working completely after a while.

Not being intimately familiar with the squid source, I can't say for
certain, but I'm wondering if anyone (developers, I suppose) has
considered the following as a possible cause of the problem.

I gave the NOVM version a try for a couple of days on our web cache
(wwwcache.cov.net), and ditched it rapidly when the file descriptor
leaks caused severe problems on the machine. It appeared to be happy
until it exceeded roughly 200 concurrent connections. Specifically,
using a hacked version of poll.pl, I watched the number of open FDs
every five minutes and logged them.
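For anyone who wants to reproduce the monitoring: the check itself is
nothing special. A rough C sketch of the same idea, probing each
descriptor with fcntl() from inside a process (this is only an
illustration, not what poll.pl actually does):

    #include <fcntl.h>
    #include <unistd.h>

    /* Count open descriptors by probing each candidate with fcntl();
     * any fd that is open answers F_GETFD, a closed one gives -1. */
    int count_open_fds(void)
    {
        long fd, max = sysconf(_SC_OPEN_MAX);
        int n = 0;
        for (fd = 0; fd < max; fd++)
            if (fcntl((int) fd, F_GETFD) != -1)
                n++;
        return n;
    }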

The in-use figure very happily went up past 200, to around 216, and
then dropped down again with no problems, at one of the times of
busiest load. However, on the three occasions that the problems
manifested themselves, the in-use figure rose to around 280. At that
point, in the space of 20 minutes, the figure rocketed up to around
the 800 mark, and Squid stopped working. This is completely
reproducible.

Looking at the file descriptor usage page with cachemgr.cgi, I saw an
awful lot of descriptors stuck writing to files, as if they had never
actually been closed but Squid thought they had been.

There is a limitation in the stdio routines under Solaris: the FILE
structure stores its file descriptor in an 8-bit field. You cannot use
descriptors numbered 256 or above with stdio; if you try, the behaviour
you see (from the stdio routines) is undefined, at best. The limitation
is kept for binary compatibility, so that older applications do not
have to be re-compiled to work under newer versions of the operating
system.
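If anyone wants to see the failure mode in isolation, a rough sketch
like this should show it on Solaris (untested as written; you may need
to raise the limit with "ulimit -n" first, and the file name is
arbitrary):

    #include <stdio.h>
    #include <unistd.h>

    /* Burn descriptors with dup() until we are past 256, then ask
     * stdio for a new stream. With the 8-bit field in FILE, fopen()
     * can no longer be given a descriptor it can represent. */
    int main(void)
    {
        FILE *fp;
        int i;

        for (i = 0; i < 300; i++)
            if (dup(0) < 0)
                break;

        fp = fopen("/etc/motd", "r");
        if (fp == NULL)
            perror("fopen with >256 fds already open");
        else
            printf("fopen got fd %d\n", fileno(fp));
        return 0;
    }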

I've had a brief scan through the source, and squid seems to use
fopen() for certain file operations, although it looks to use some
generic file_open() functions, based around open(), for most of the
core caching operations.
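The distinction matters because a raw descriptor from open() is a
plain int and is safe past 256. A hypothetical wrapper in the spirit
of what file_open() presumably does (the names here are mine, not
squid's):

    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Raw open(): the returned int has no 8-bit restriction. */
    int log_open(const char *path)
    {
        return open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    }

    /* write() takes the int directly; no FILE structure involved. */
    ssize_t log_write(int fd, const char *buf, size_t len)
    {
        return write(fd, buf, len);
    }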

Would anyone (Duane? :-) like to comment on whether or not squid could
be using stdio-based routines (perhaps to open log files, write the
netdb, etc?) at a point at which more than 256 file descriptors were
already open, hence causing this problem? The rule of thumb to be
applied, I'm told, is that it's only safe to use stdio alongside >256
file descriptors if you fopen() _all_ the files you intend to handle
with the stdio routines while the descriptor count is still below 256,
ensuring that each one can be stored in an eight-bit number (i.e. open
them at the start of the program, before anything else, as in the
sketch below).
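In concrete terms, something like this at program start would be safe
(the paths are made up for illustration), because each FILE gets a
descriptor that still fits in eight bits:

    #include <stdio.h>

    /* Hypothetical startup routine: fopen() everything stdio will
     * ever touch while the process still holds only a handful of
     * descriptors, so each FILE's fd is well below 256. */
    static FILE *access_log, *cache_log;

    int init_logs(void)
    {
        access_log = fopen("/var/log/squid/access.log", "a");
        cache_log = fopen("/var/log/squid/cache.log", "a");
        return (access_log && cache_log) ? 0 : -1;
    }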

I'm willing to help debug the problem if anyone who doesn't have
access to a Solaris platform wants to investigate further
(unfortunately, I don't have the coding experience to take this on
myself). I'm going to lower the maximum open file descriptors figure
to below 255 to start with, to see if we can keep the cache up and
running for more than the one day it has survived so far with the NOVM
software.
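For reference, the same cap can be applied from code as well as with
"ulimit -n" in a startup script; an untested sketch using setrlimit(),
called before anything is opened:

    #include <sys/resource.h>
    #include <sys/time.h>

    /* Keep every descriptor this process can receive below 255, so
     * even stdio's 8-bit field can always represent it. */
    int cap_fds(void)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_NOFILE, &rl) < 0)
            return -1;
        rl.rlim_cur = 255;
        return setrlimit(RLIMIT_NOFILE, &rl);
    }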

Cheers,

Chris

-- 
 Chris Tilbury, UNIX Systems Admin, Computing Services, University of Warwick
 EMAIL: cudch@csv.warwick.ac.uk  PHONE: +44 1203 523365(V)/+44 1203 523267(F)
                            URL: http://www.warwick.ac.uk/staff/Chris.Tilbury