Re: Problems with parents under heavy load

From: Christian Grimm <grimm@dont-contact.us>
Date: Wed, 15 Jul 1998 17:49:41 +0200 (MDT)

A few remarks, and some still unsolved mysteries:

Karim.Tripodoro@lrz-muenchen.de wrote:

]cache_host muenchen.www-cache.dfn.de sibling 8080 3130
]...
]cache_host_domain muenchen.www-cache.dfn.de !.com
]
]The problem is that muenchen.www-cache.dfn.de is reachable with ICP under
]heavy load, but TCP requests are rejected and my proxy waits for objects
]from this parent (probably that's the well known filehandle-problem under
]Solaris and the parent is not tuned to use more than 1024 filehandles).

It is *NOT* the well-known filehandle problem, as the DFN caches are
configured with 4096 filehandles and use poll(), of course. Nevertheless,
once more than 2000 FDs are in use, squid responds so sluggishly that it
might as well be dead.

]Although the cachemgr detects that the parent is down, the HTTP requests
]are forwarded to this neighbor. After I configured this parent as a
]sibling, the problem no longer occurs.

It is a well-known (Solaris-related?) and still unsolved problem that an
unusably busy squid still answers ICP requests, but does not handle HTTP.
The connection gets established and sits there forever (forever from a
user's point of view, which means over 2 minutes). Thus, the whole concept
of using ICP to find live neighbours is seriously undermined, because the
neighbour is alive as far as ICP is concerned, but dead as far as HTTP is
concerned. To get rid of this misbehaviour, we need squid to check its own
HTTP connectivity regularly. In case of trouble, squid should stop sending
ICP replies immediately.
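
As a rough sketch of what such a self-check could look like, here is a
small external probe in Python. It is not an existing squid feature; the
function name http_answers, the probe request and the timeout value are
all assumptions made for illustration. It treats the HTTP side as alive
only if some response bytes actually arrive, which is exactly the case a
bare TCP connect cannot detect.

```python
import socket


def http_answers(host, port, timeout=5.0,
                 request=b"GET / HTTP/1.0\r\n\r\n"):
    """Return True if host:port sends back any bytes after an HTTP request.

    Hypothetical helper: a successful connect is NOT enough, because a
    busy cache may accept the connection and then never answer.
    Note that `timeout` applies to each socket operation separately,
    not to the probe as a whole.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(request)
            # An empty result means the peer closed without replying.
            return len(sock.recv(1)) > 0
    except OSError:
        # Covers refused connections, resets and timeouts alike.
        return False
```

A watchdog could run such a probe against the cache's own HTTP port every
few seconds and take the cache out of service when it returns False.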

]2. or is this a Bug in one of the two squids (the parent is running with
]squid 1.1.16 on Solaris 2.5.1)

The parent ran 1.NOVM.18-retry from January and has been running
1.NOVM.20 since April. An upgrade to 1.NOVM.22 is imminent, provided it
turns out to be well-behaved during our tests. Solaris 2.5.1 is still
correct (and will continue to be, we are afraid).

]1. Is it possible to configure squid with a TCP-Timeout for parents, so
]that after this timeout the query is forwarded to another parent (or
]direct)?

The 'connect_timeout' will *not* help, as the connection is properly
established. Tweaking the 'read_timeout', which only fires on an idle
connection, from 15 minutes to 1 minute does not really help either,
because your squid would then return an error along the lines of 'read
timeout' instead of fetching the object elsewhere or itself...
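
For illustration only, the behaviour asked for in question 1 (give up on a
silent parent after a deadline and try the next one) might look roughly
like this in Python. This is not how squid works; first_bytes,
fetch_with_fallback and the 60-second default are hypothetical names and
values chosen for the sketch.

```python
import socket


def first_bytes(addr, request, timeout):
    """Send `request` to addr=(host, port) and return the first reply bytes."""
    with socket.create_connection(addr, timeout=timeout) as sock:
        sock.sendall(request)
        data = sock.recv(4096)  # raises a timeout error if nothing arrives
        if not data:
            raise ConnectionError("peer closed without replying")
        return data


def fetch_with_fallback(candidates, request, timeout=60.0):
    """Try each (host, port) in order; return (addr, data) from the first
    one that replies. A timeout moves on to the next candidate instead of
    being reported as an error."""
    last_error = None
    for addr in candidates:
        try:
            return addr, first_bytes(addr, request, timeout)
        except OSError as err:  # refused, reset, or timed out
            last_error = err
    raise OSError("no candidate answered") from last_error
```

The last entry in the candidate list would play the role of "direct".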

You could always tweak your various Solaris TCP timers so that stalled
connections get dropped on you (not that we would recommend such a thing!).
You might have a look at
http://www.rvs.uni-hannover.de/people/voeckler/tune/EN/tune.html for some
more information.

Le deagh dhùrachd
Dipl.-Ing. Jens-S. Vöckler (voeckler@rvs.uni-hannover.de)
Institute for Computer Networks and Distributed Systems
University of Hanover, Germany; +49 511 762 4726
Received on Wed Jul 15 1998 - 10:05:25 MDT
