Bad interaction between max_stale and negative caching (2.HEAD)

From: Mark Nottingham <mnot_at_yahoo-inc.com>
Date: Thu, 18 Sep 2008 22:03:29 +1000

I've got a user who's running a pair of peered accelerators, using
both stale-while-revalidate and max_stale.

Occasionally, they see extremely old content being served; e.g., with
Cache-Control: max-age=60, they might see responses go by that are
1000-3000 seconds old (but still within the max_stale window).
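
For context, the relevant setup looks roughly like this -- the exact
values are illustrative, not their actual configuration:

    # squid.conf on the accelerators (illustrative)
    max_stale 1 hour

    # Cache-Control coming back from the origin (illustrative)
    Cache-Control: max-age=60, stale-while-revalidate=3600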

The pattern that appears to trigger this is when a resource with an
in-cache 200 response starts returning 404s; when this happens, Squid
will start returning TCP_NEGATIVE_HIT/200s. E.g. (traffic driven by
squidclient):

1221713703.815 0 127.0.0.1 TCP_STALE_HIT/200 5234 GET http://server1//5012904 - NONE/- application/json
1221713703.979 164 0.0.0.0 TCP_ASYNC_MISS/404 193 GET http://server1/5012904 - FIRST_UP_PARENT/back-end-server1 text/plain
1221713711.431 0 127.0.0.1 TCP_NEGATIVE_HIT/200 5234 GET http://server1/5012904 - NONE/- application/json
1221713720.978 0 127.0.0.1 TCP_NEGATIVE_HIT/200 5234 GET http://server1/5012904 - NONE/- application/json
1221713723.483 0 127.0.0.1 TCP_NEGATIVE_HIT/200 5234 GET http://server1/5012904 - NONE/- application/json

As you can see, stale-while-revalidate kicks in and the async refresh
brings back a 404, but that 404 never replaces the cached 200.
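
For reference, the traffic above is nothing fancy -- just repeated
squidclient requests against the accelerator, along these lines (host
and port are illustrative):

    squidclient -h 127.0.0.1 -p 80 http://server1/5012904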

Looking at the code, I *think* the culprit is storeNegativeCache(),
which will, assuming that max_stale is set (either in squid.conf or
response headers), block the new response from updating the cache --
no matter what its status code is.

It makes sense to do this for 5xx status codes, because they're often
transient, and reflect server-side problems. It doesn't make as much
sense to do this for 4xx status codes, which reflect client-side
issues. In those cases, you always want to update the cache with the
most recent response (and potentially negative cache it, if the server
is silly enough to not put a freshness lifetime on it).
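
In code terms, the distinction I have in mind is just something like
this (the helper name is made up for illustration; http_status and
HTTP_INTERNAL_SERVER_ERROR are the enums from enums.h, if I remember
them right):

    /* Illustrative only: a 5xx usually means a transient server-side
     * problem, so hanging on to the existing (stale) entry is
     * defensible; a 4xx should be allowed to replace it. */
    static int
    replyStatusShouldPreserveStaleEntry(http_status status)
    {
        return status >= HTTP_INTERNAL_SERVER_ERROR;    /* 500 and up */
    }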

The interesting thing, BTW, is that this only happens when collapsed
forwarding is on, because this check in httpProcessReplyHeader:
    if (neighbors_do_private_keys && !Config.onoff.collapsed_forwarding)
        httpMaybeRemovePublic(entry, reply);

masks this behaviour.

Thoughts? I'm not 100% sure of this diagnosis, as the use of peering
and stale-while-revalidate makes things considerably more complex, but
I've had pretty good luck reproducing it... I'm happy to attempt a
fix, but wanted input on which approach people would prefer. Left to
my own devices, I'd add another condition to this test in
storeNegativeCache():

    if (oe && !EBIT_TEST(oe->flags, KEY_PRIVATE) &&
        !EBIT_TEST(oe->flags, ENTRY_REVALIDATE))

to limit it to 5xx responses.
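
Concretely, something along these lines -- the status check is only a
sketch, since I'd have to confirm exactly how the new reply's status is
reachable at that point (I'm guessing e->mem_obj->reply->sline.status):

    /* sketch: only preserve the old entry for server-side errors */
    if (oe && !EBIT_TEST(oe->flags, KEY_PRIVATE) &&
        !EBIT_TEST(oe->flags, ENTRY_REVALIDATE) &&
        e->mem_obj->reply->sline.status >= HTTP_INTERNAL_SERVER_ERROR)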

--
Mark Nottingham       mnot_at_yahoo-inc.com