HELP: Failure of Squid 1.1.11 under Solaris 2.5.1

From: Armistead, Jason <ARMISTEJ@dont-contact.us>
Date: Tue, 28 Oct 1997 21:46:00 -0500

Hi

First of all, thanks for the excellent package that Squid is. It has
served us faithfully for several months now without problem. Until
recently.......

The scenario:

We're running Squid 1.1.11 with the ACL-Proxy-Authentication patch
(thanks for a great extra piece of code - when will it be part of the
standard distribution ?), compiled by GCC 2.7.1 on Solaris 2.5.1
(Sparc). I made one change a *long* time ago, and that was to change
the proxy authentication so that it would prompt the user each day for
their credentials (even if they used Microsoft Internet Explorer with
the "Save this password" box ticked). I did this by hacking the part in
the code where the Authenticate Realm is returned and appended the day
of month and 3 character month name (plus a PM tag to prevent after
hours use). That works really well, and I'd be happy to share it with
the rest of the group if anyone's interested (in fact if it made it into
the distribution, I wouldn't have to worry when (and if ) I upgrade to
some new version.) There was never a core dump as a result of this
change - it ran OK for several months without a hiccup.

Our server has 256MB RAM and plenty of disk space. The "fundamental"
cache configuration items are listed below. These haven't been changed
in a long time.

VM-Max 64 Maximum hot-vm cache (MB)
VM-High 90 High water mark hot-vm cache (%)
VM-Low 75 Low water mark hot-vm cache (%)
Swap-Max 200 Maximum disk cache (MB)
Swap-High 95 High water mark disk cache (%)
Swap-Low 90 Low water mark disk cache (%)

What has changed is that I have implemented a simple Perl redirector
(just to sort out some IP to FQDN mappings), as follows:

#!/usr/local/bin/perl
 $|=1;
 while (<>) {
# ($_) = split (/ /);
     s@http://alpha/@http://alpha.minto.oeca.otis.com/@;
     s@http://xxx.yyy.7.127@http://alpha@;
     s@http://xxx.yyy.20.246@http://intranet.otis.com@;
     s@http://xxx.yyy.8.102@http://dfx.otis.com@;
     s@http://xxx.yyy.8.20@http://wweintra.otis.com@;
     s@http://xxx.yyy.16.222@http://naointra.otis.com@;
     s@http://xxx.yyy.7.5@http://ozm05@;
# print "$_\n";
     print;
 }

As I understand it, the redirector (as I've implemented it) returns the
whole redirector request line with the URL modified only if it sees any
of the host name portions that match the search and reply operator.
Basically as for the one out of the Release-Notes-1.1.txt file ** HINT
** anyone want to improve the redirector documentation (it's a bit thin)
????

The redirector stats on Squid currently look like

{
{Redirector Statistics:}
{requests: 1981}
{replies: 1980}
{queue length: 0}
{avg service time: 1 msec}
{number of redirectors: 5}
{use histogram:}
{ redirector #1: 1888 (1888 rewrites)}
{ redirector #2: 66 (66 rewrites)}
{ redirector #3: 16 (16 rewrites)}
{ redirector #4: 5 (5 rewrites)}
{ redirector #5: 6 (6 rewrites)}
}

which seem OK to me (I don't understand why there is a difference of one
between the requests and replies - possibly it's the fact that I'm using
the client program to request them, and so it counts as one request ?

Anyhow, I've collected four core dumps so far, and analysed each with
GDB to see if I can work out what's going on. It looks like some sort
of malloc related problem.

Core # 1

(gdb) where
#0 0xef5f4594 in __tbl_10_big_digits ()
#1 0xef5ba62c in atoi ()
#2 0x413c0 in death (sig=10) at tools.c:301
#3 0xef5b8d2c in _libc_sigaction ()
#4 0xef5cb2ac in _malloc_unlocked ()
#5 0xef5cb37c in _malloc_unlocked ()
#6 0xef5ba988 in _catgets ()
#7 0x442c0 in xcalloc (n=1, sz=12) at util.c:387
#8 0x27a48 in hash_insert (hid=5,
    k=0x3b0708
"1207/GET/http://www.download.com/cgi-bin/dl?15055-ftp://ftp.acdsystems.
com/pub/acdsee32/acdc3222.exe", item=0x2cebf0) at hash.c:405
#9 0x3b4b8 in storeHashMemInsert (e=0x2cebf0) at store.c:398
#10 0x3b638 in storeSetMemStatus (e=0x2cebf0, status=2) at store.c:445
#11 0x3db48 in storeComplete (e=0x2cebf0) at store.c:1654
#12 0x28f14 in httpReadReply (fd=26, data=0x3572f8) at http.c:622
#13 0x1ef3c in comm_select (sec=26) at comm.c:1077
#14 0x31008 in main (argc=418816, argv=0x66400) at main.c:705
(gdb)

Core # 2

gdb) where
#0 0xef5f4594 in __tbl_10_big_digits ()
#1 0xef5ba62c in atoi ()
#2 0x413c0 in death (sig=11) at tools.c:301
#3 0xef5b8d2c in _libc_sigaction ()
#4 0xef5cba20 in realfree ()
#5 0xef5cc334 in _mbtowc_gen ()
#6 0xef5cb484 in _malloc_unlocked ()
#7 0xef5cb37c in _malloc_unlocked ()
#8 0x44170 in xmalloc (sz=119) at util.c:286
#9 0x44390 in xstrdup (
    s=0x1598c0 "$name alpha.minto.oeca.otis.com\n$h_name
alpha.minto.oeca.otis.c
om\n$h_len 4\n$ipcount 1\n153.14.7.127\n$aliascount 0\n$end\n") at
util.c:428
#10 0x2e474 in ipcache_parsebuffer (
    inbuf=0x1598c0 "$name alpha.minto.oeca.otis.com\n$h_name
alpha.minto.oeca.ot
is.com\n$h_len 4\n$ipcount 1\n153.14.7.127\n$aliascount 0\n$end\n",
    dnsData=0xf4d70) at ipcache.c:472
#11 0x2ea78 in ipcache_dnsHandleRead (fd=1, dnsData=0xf4d70) at
ipcache.c:593
#12 0x1ef3c in comm_select (sec=7) at comm.c:1077
#13 0x31008 in main (argc=418816, argv=0x66400) at main.c:705

Core # 3

(gdb) where
#0 0xef5f4594 in __tbl_10_big_digits ()
#1 0xef5ba62c in atoi ()
#2 0x413c0 in death (sig=11) at tools.c:301
#3 0xef5b8d2c in _libc_sigaction ()
#4 0xef5cba20 in realfree ()
#5 0xef5cc334 in _mbtowc_gen ()
#6 0xef5cb484 in _malloc_unlocked ()
#7 0xef5cb37c in _malloc_unlocked ()
#8 0x44170 in xmalloc (sz=159) at util.c:286
#9 0x44390 in xstrdup (
    s=0x9f65c "877666942.945 82 153.14.79.102 TCP_MISS/304 85 GET
http://alp
ha.minto.oeca.otis.com/ico/spialpha.minto.oeca.otis.com -
DIRECT/alpha.minto.oec
a.otis.com -\n") at util.c:428
#10 0x3a200 in log_append (obj=0xf24c0,
    url=0x2238468
"http://alpha.minto.oeca.otis.com/icl/spialpha.minto.oeca.otis
.com", caddr={S_un = {S_un_b = {s_b1 = 37 '%', s_b2 = 57 '9', s_b3 = 100
'd',
          s_b4 = 46 '.'}, S_un_w = {s_w1 = 9529, s_w2 = 25646},
        S_addr = 624518190}}, size=85, action=0x4baf0 "TCP_MISS",
    method=0x552c0 "GET", http_code=304, msec=82, ident=0x179c4d0 "",
    hierData=0x4e66b8, content_type=0x224bb18 "") at stat.c:1195
#11 0x2a178 in icpStateFree (fd=22, data=0x179c440) at icp.c:266
#12 0x1e610 in commCallCloseHandlers (fd=22) at comm.c:558
#13 0x1e6f0 in comm_close (fd=22) at comm.c:580
#14 0x2abcc in clientWriteComplete (fd=22, buf=0x0, size=0, errflag=0,
    data=0x179c440) at icp.c:634
#15 0x1db18 in RWStateCallbackAndFree (fd=22, code=0) at comm.c:184
#16 0x1ff90 in commHandleWrite (fd=22, state=0x22384e0) at comm.c:1550
#17 0x1efa4 in comm_select (sec=22) at comm.c:1086
#18 0x31008 in main (argc=418816, argv=0x66400) at main.c:705
(gdb)

Core # 4

(gdb) where
#0 0xef5f4594 in __tbl_10_big_digits ()
#1 0xef5ba62c in atoi ()
#2 0x413c0 in death (sig=10) at tools.c:301
#3 0xef5b8d2c in _libc_sigaction ()
#4 0xef5cb2ac in _malloc_unlocked ()
#5 0xef5cb37c in _malloc_unlocked ()
#6 0x44170 in xmalloc (sz=26) at util.c:286
#7 0x44390 in xstrdup (s=0x20c5890 "alpha.minto.oeca.otis.com") at
util.c:428
#8 0x1df08 in commConnectStart (fd=26,
    host=0x20c5890 "alpha.minto.oeca.otis.com", port=80,
    callback=0x29ba0 <httpConnectDone>, data=0x28f060) at comm.c:326
#9 0x29b98 in httpConnect (fd=26, ia=0x1d8d48, data=0x417508) at
http.c:940
#10 0x2e414 in ipcache_call_pending (i=0x1d8d38) at ipcache.c:458
#11 0x2ee90 in ipcache_nbgethostbyname (
    name=0x20c5890 "alpha.minto.oeca.otis.com", fd=26,
    handler=0x29b2c <httpConnect>, handlerData=0x28f060) at
ipcache.c:704
#12 0x29d48 in httpStart (url=0x1a "", request=0x20c5808,
    req_hdr=0x1e8d428 "If-Modified-Since: Saturday, 30-Aug-97 05:32:04
GMT; leng
th=2260\r\nUser-Agent: Mozilla/3.02Gold (Win95; I)\r\nHost:
153.14.7.127\r\nAcce
pt: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg\r\nVia: 1.0
tho"...,
    req_hdr_sz=288, entry=0x417508) at http.c:1007
#13 0x35bb8 in protoStart (fd=22, entry=0x417508, e=0x0,
request=0x20c5808)
    at proto.c:583
#14 0x353cc in protoDispatchDNSHandle (unused1=22, ia=0x1d8d48,
data=0x3712f8)
    at proto.c:289
#15 0x2e414 in ipcache_call_pending (i=0x1d8d38) at ipcache.c:458
#16 0x2ee90 in ipcache_nbgethostbyname (
    name=0x20c5890 "alpha.minto.oeca.otis.com", fd=22,
    handler=0x34fc4 <protoDispatchDNSHandle>, handlerData=0x3712f8)
    at ipcache.c:704
#17 0x35710 in protoDispatch (fd=0, url=0xbeabc "", entry=0x417508,
    request=0x20c5808) at proto.c:434
#18 0x2b320 in icpProcessMISS (fd=22, icpState=0x17b2758) at icp.c:915
#19 0x2b18c in icpProcessRequest (fd=22, icpState=0x17b2758) at
icp.c:851
#20 0x1d288 in clientRedirectDone (data=0x17b2758,
    result=0x15e8e8 "http://alpha/ico/spinworld.gif") at
client_side.c:302
#21 0x363b0 in redirectHandleRead (fd=12, redirector=0xf1f80) at
redirect.c:218
#22 0x1ef3c in comm_select (sec=12) at comm.c:1077
#23 0x31008 in main (argc=418816, argv=0x66400) at main.c:705

So, the problem seems to be in xmalloc, xcalloc somewhere. There are no
malloc-specific errors in cache.log, but each time a crash occurs, it
records

FATAL: Received Segment Violation...dying.

OR

FATAL: Received Bus Error...dying.

The log is always written with around 17000 (+/- 1000) lines in it. It
might be some sort of limit I'm reaching, but I don't know what it is.

97/10/28 10:21:35| storeWriteCleanLog: Starting...
97/10/28 10:21:36| 4096 lines written so far.
97/10/28 10:21:36| 8192 lines written so far.
97/10/28 10:21:37| 12288 lines written so far.
97/10/28 10:21:37| 16384 lines written so far.
97/10/28 10:21:38| Finished. Wrote 16975 lines.
97/10/28 10:21:38| Took 3 seconds (5658.3 lines/sec).
CPU Usage: user 40 sys 41
Maximum Resident Size: 0 KB
Page faults with physical i/o: 16534

As my limited understanding recalls, a Segment violation (SIGSEGV, sig =
11) has something to do with accessing memory which is not owned by my
process (a kind of Unix GPF), and a Bus Error (SIGBUS, sig = 10) has
something to do with unaligned accesses to memory.

Are there some Solaris values which perhaps limit how much memory a
process can use ? We current have RAM-hungry CAD users who have
experienced no such problems. For the record, here is the output of a
"limit" command

cputime unlimited
filesize unlimited
datasize 2097148 kbytes
stacksize 8192 kbytes
coredumpsize unlimited
descriptors 64
memorysize unlimited

I haven't applied the long-dns-coredump.patch yet (since this problem
doesn't seem to be a manifestation of that), and am open to ideas or
suggestions as to what the cause is. You will notice that in core # 3,
the call to xstrdup shows that the URL has somehow been clobbered - it
should say

http://alpha.minto.oeca.otis.com/ico/spinworld.gif.

Instead it says

http://alpha.minto.oeca.otis.com/ico/spialpha.minto.oeca.otis.com

as if the URL has been overwritten somewhere.

Yes, I know upgrading to a recent version sounds like the first place to
start. My time is scarse now, and re-implementing the changes to the
proxy authentication might be too much of a hassle and take a bit of
time now that we're up to V1.1.18.

What about GNU Malloc - should I try using this (Is Solaris 2.5.1 Sparc
buggy for malloc memory leaks ???)

However I'm open to any help I can get. Come on you Squid-Bug experts
.... give me a hand on this one !!!

Regards

Jason Armistead
Systems Engineer
armistej@oeca.otis.com
Received on Tue Oct 28 1997 - 16:02:06 MST

This archive was generated by hypermail pre-2.1.9 : Tue Dec 09 2003 - 16:37:21 MST