Modify icap_service_failure_limit option to forget old ICAP errors. A busy or remote ICAP server may produce a steady but shallow stream of errors. Any ICAP server may become nearly unusable in a short period of time, producing a burst of errors. To avoid disabling a generally usable service, it is important to distinguish these two cases. Just counting the number of errors and suspending the service after icap_service_failure_limit is reached often either suspends the service in both cases or never suspends it at all, depending on the option value. One way to distinguish a large burst of errors from a steady but shallow error stream is to forget about old errors. The icap_service_failure_limit option modified to instructs Squid to ignore errors that are "too old" to be counted as a part of a burst. Another way to look at this feature is to say that the modified icap_service_failure_limit limits the ICAP error _rate_. For example, # suspend service usage after 10 failures in 5 seconds: icap_service_failure_limit 10 in 5 seconds Squid does not remember every transaction error that occurred within the allowed "oldest error" time period. That would be result in a precise but too expensive implementation, especially during error bursts on a busy server. Instead, Squid divides the period in ten slots, counts the number of errors that occurred in each slot, and forget the oldest slot(s) as needed. Thus, the algorithm has about 90% precision as far as timing of the failures is concerned. That 90% precision ought to be good enough for any deployment. === added file 'src/FadingCounter.cc' --- src/FadingCounter.cc 1970-01-01 00:00:00 +0000 +++ src/FadingCounter.cc 2010-03-11 22:20:28 +0000 @@ -0,0 +1,68 @@ +/* + * $Id$ + */ + +#include "TextException.h" +#include "SquidTime.h" +#include "FadingCounter.h" + +FadingCounter::FadingCounter(): horizon(-1), precision(10), delta(-1), + lastTime(0), total(0) +{ + counters.reserve(precision); + while (counters.size() < static_cast(precision)) + counters.push_back(0); +} + +void FadingCounter::clear() +{ + for (int i = 0; i < precision; ++i) + counters[i] = 0; + lastTime = current_dtime; + total = 0; +} + +void FadingCounter::configure(double newHorizon) +{ + if (fabs(newHorizon - horizon) >= 1e-3) { // diff exceeds one millisecond + clear(); // for simplicity + horizon = newHorizon; + delta = horizon / precision; // may become negative or zero + } +} + +int FadingCounter::count(int howMany) +{ + Must(howMany >= 0); + + if (delta < 0) + return total += howMany; // forget nothing + + if (horizon < 1e-3) // (e.g., zero) + return howMany; // remember nothing + + const double deltas = (current_dtime - lastTime) / delta; + if (deltas >= precision || current_dtime < lastTime) { + clear(); // forget all values + } else { + // forget stale values, if any + // fmod() or "current_dtime/delta" will overflow int for small deltas + const int lastSlot = static_cast(fmod(lastTime, horizon) / delta); + const int staleSlots = static_cast(deltas); + for (int i = 0, s = lastSlot + 1; i < staleSlots; ++i, ++s) { + const int idx = s % precision; + total -= counters[idx]; + counters[idx] = 0; + Must(total >= 0); + } + } + + // apply new information + lastTime = current_dtime; + const int curSlot = static_cast(fmod(lastTime, horizon) / delta); + counters[curSlot % precision] += howMany; + total += howMany; + Must(total >= 0); + + return total; +} === added file 'src/FadingCounter.h' --- src/FadingCounter.h 1970-01-01 00:00:00 +0000 +++ src/FadingCounter.h 2010-03-11 22:20:28 +0000 @@ -0,0 +1,35 @@ +/* + * $Id$ + */ + +#ifndef SQUID_FADING_COUNTER_H +#define SQUID_FADING_COUNTER_H + +#include "Array.h" + +/// Counts events, forgetting old ones. Usefull for "3 errors/minute" limits. +class FadingCounter { + public: + FadingCounter(); + + /// 0=remember nothing; -1=forget nothing; new value triggers clear() + void configure(double horizonSeconds); + + void clear(); ///< forgets all events + + int count(int howMany); ///< count fresh, return #events remembered + int remembered() const { return total; } ///< possibly stale #events + + /// read-only memory horizon in seconds; older events are forgotten + double horizon; + + private: + const int precision; ///< #counting slots, controls measur. occuracy + double delta; ///< sub-interval duration = horizon/precision + + double lastTime; ///< time of the last update + Vector counters; ///< events per delta (possibly stale) + int total; ///< number of remembered events (possibly stale) +}; + +#endif /* SQUID_FADING_COUNTER_H */ === modified file 'src/Makefile.am' --- src/Makefile.am 2010-01-12 21:16:29 +0000 +++ src/Makefile.am 2010-03-11 22:20:28 +0000 @@ -314,6 +314,8 @@ ExternalACL.h \ ExternalACLEntry.cc \ ExternalACLEntry.h \ + FadingCounter.h \ + FadingCounter.cc \ fd.cc \ fde.cc \ fde.h \ @@ -1226,6 +1228,7 @@ useragent.cc \ wccp2.cc \ whois.cc \ + FadingCounter.cc \ $(WIN32_SOURCE) \ wordlist.cc nodist_tests_testCacheManager_SOURCES = \ @@ -1407,6 +1410,7 @@ useragent.cc \ wccp2.cc \ whois.cc \ + FadingCounter.cc \ $(WIN32_SOURCE) \ wordlist.cc nodist_tests_testEvent_SOURCES = \ @@ -1560,6 +1564,7 @@ useragent.cc \ wccp2.cc \ whois.cc \ + FadingCounter.cc \ $(WIN32_SOURCE) \ wordlist.cc nodist_tests_testEventLoop_SOURCES = \ @@ -1706,6 +1711,7 @@ useragent.cc \ wccp2.cc \ whois.cc \ + FadingCounter.cc \ $(WIN32_SOURCE) \ wordlist.cc \ Packer.cc \ @@ -1861,6 +1867,7 @@ useragent.cc \ wccp2.cc \ whois.cc \ + FadingCounter.cc \ $(WIN32_SOURCE) \ wordlist.cc nodist_tests_testHttpRequest_SOURCES = \ @@ -2229,6 +2236,7 @@ useragent.cc \ wccp2.cc \ whois.cc \ + FadingCounter.cc \ $(WIN32_SOURCE) \ wordlist.cc nodist_tests_testURL_SOURCES = \ === modified file 'src/adaptation/Config.h' --- src/adaptation/Config.h 2010-01-21 10:22:02 +0000 +++ src/adaptation/Config.h 2010-03-11 22:20:28 +0000 @@ -38,6 +38,7 @@ int send_client_ip; int send_client_username; int service_failure_limit; + time_t oldest_service_failure; int service_revival_delay; int icap_uses_indirect_client; === modified file 'src/adaptation/icap/ServiceRep.cc' --- src/adaptation/icap/ServiceRep.cc 2009-11-18 16:31:08 +0000 +++ src/adaptation/icap/ServiceRep.cc 2010-03-11 22:20:28 +0000 @@ -18,7 +18,7 @@ Adaptation::Icap::ServiceRep::ServiceRep(const Adaptation::ServiceConfig &svcCfg): AsyncJob("Adaptation::Icap::ServiceRep"), Adaptation::Service(svcCfg), theOptions(NULL), theOptionsFetcher(0), theLastUpdate(0), - theSessionFailures(0), isSuspended(0), notifying(false), + isSuspended(0), notifying(false), updateScheduled(false), self(NULL), wasAnnouncedUp(true) // do not announce an "up" service at startup {} @@ -53,6 +53,9 @@ writeableCfg().port = 1344; } } + + theSessionFailures.configure(TheConfig.oldest_service_failure > 0 ? + TheConfig.oldest_service_failure : -1); } void Adaptation::Icap::ServiceRep::invalidate() @@ -69,15 +72,16 @@ void Adaptation::Icap::ServiceRep::noteFailure() { - ++theSessionFailures; - debugs(93,4, HERE << " failure " << theSessionFailures << " out of " << - TheConfig.service_failure_limit << " allowed " << status()); + const int failures = theSessionFailures.count(1); + debugs(93,4, HERE << " failure " << failures << " out of " << + TheConfig.service_failure_limit << " allowed in " << + TheConfig.oldest_service_failure << "sec " << status()); if (isSuspended) return; if (TheConfig.service_failure_limit >= 0 && - theSessionFailures > TheConfig.service_failure_limit) + failures > TheConfig.service_failure_limit) suspend("too many failures"); // TODO: Should bypass setting affect how much Squid tries to talk to @@ -235,7 +239,7 @@ delete theOptions; theOptions = newOptions; - theSessionFailures = 0; + theSessionFailures.clear(); isSuspended = 0; theLastUpdate = squid_curtime; @@ -463,8 +467,8 @@ if (notifying) buf.append(",notif", 6); - if (theSessionFailures > 0) - buf.Printf(",fail%d", theSessionFailures); + if (const int failures = theSessionFailures.remembered()) + buf.Printf(",fail%d", failures); buf.append("]", 1); buf.terminate(); === modified file 'src/adaptation/icap/ServiceRep.h' --- src/adaptation/icap/ServiceRep.h 2009-08-23 09:30:49 +0000 +++ src/adaptation/icap/ServiceRep.h 2010-03-11 22:20:28 +0000 @@ -35,6 +35,7 @@ #define SQUID_ICAPSERVICEREP_H #include "cbdata.h" +#include "FadingCounter.h" #include "adaptation/Service.h" #include "adaptation/forward.h" #include "adaptation/Initiator.h" @@ -134,8 +135,7 @@ Adaptation::Initiate *theOptionsFetcher; // pending ICAP OPTIONS transaction time_t theLastUpdate; // time the options were last updated - static const int TheSessionFailureLimit; - int theSessionFailures; + FadingCounter theSessionFailures; const char *isSuspended; // also stores suspension reason for debugging bool notifying; // may be true in any state except for the initial === modified file 'src/adaptation/icap/Xaction.cc' --- src/adaptation/icap/Xaction.cc 2009-08-23 09:30:49 +0000 +++ src/adaptation/icap/Xaction.cc 2010-03-11 22:20:28 +0000 @@ -215,7 +215,6 @@ { debugs(93, 2, HERE << typeName << " failed to connect to " << service().cfg().uri); - theService->noteFailure(); throw TexcHere("cannot connect to the ICAP service"); } @@ -259,8 +258,6 @@ theService->cfg().methodStr() << " " << theService->cfg().uri << status()); reuseConnection = false; - service().noteFailure(); - throw TexcHere(connector != NULL ? "timed out while connecting to the ICAP service" : "timed out while talking to the ICAP service"); @@ -281,6 +278,7 @@ void Adaptation::Icap::Xaction::callException(const std::exception &e) { setOutcome(xoError); + service().noteFailure(); Adaptation::Initiate::callException(e); } === modified file 'src/cache_cf.cc' --- src/cache_cf.cc 2010-03-05 02:07:24 +0000 +++ src/cache_cf.cc 2010-03-11 22:39:03 +0000 @@ -86,6 +86,10 @@ static void free_icap_service_type(Adaptation::Icap::Config *); static void parse_icap_class_type(); static void parse_icap_access_type(); + +static void parse_icap_service_failure_limit(Adaptation::Icap::Config *); +static void dump_icap_service_failure_limit(StoreEntry *, const char *, const Adaptation::Icap::Config &); +static void free_icap_service_failure_limit(Adaptation::Icap::Config *); #endif #if USE_ECAP @@ -3719,3 +3723,56 @@ } #endif /* USE_ECAP */ + +#if ICAP_CLIENT +static void parse_icap_service_failure_limit(Adaptation::Icap::Config *cfg) +{ + char *token; + time_t d; + time_t m; + cfg->service_failure_limit = GetInteger(); + + if ((token = strtok(NULL, w_space)) == NULL) + return; + + if (strcmp(token,"in") != 0) { + debugs(3, 0, "expecting 'in' on'" << config_input_line << "'"); + self_destruct(); + } + + if ((token = strtok(NULL, w_space)) == NULL) { + self_destruct(); + } + + d = static_cast (xatoi(token)); + + m = static_cast (1); + + if (0 == d) + (void) 0; + else if ((token = strtok(NULL, w_space)) == NULL) { + debugs(3, 0, "No time-units on '" << config_input_line << "'"); + self_destruct(); + } + else if ((m = parseTimeUnits(token)) == 0) + self_destruct(); + + cfg->oldest_service_failure = (m * d); +} + +static void dump_icap_service_failure_limit(StoreEntry *entry, const char *name, const Adaptation::Icap::Config &cfg) +{ + storeAppendPrintf(entry, "%s %d", name, cfg.service_failure_limit); + if (cfg.oldest_service_failure > 0) { + storeAppendPrintf(entry, " in %d seconds", (int)cfg.oldest_service_failure); + } + storeAppendPrintf(entry, "\n"); +} + +static void free_icap_service_failure_limit(Adaptation::Icap::Config *cfg) +{ + cfg->oldest_service_failure = 0; + cfg->service_failure_limit = 0; +} + +#endif === modified file 'src/cf.data.depend' --- src/cf.data.depend 2010-02-07 03:38:46 +0000 +++ src/cf.data.depend 2010-03-11 22:20:28 +0000 @@ -32,6 +32,7 @@ icap_access_type icap_class acl icap_class_type icap_service icap_service_type +icap_service_failure_limit ecap_service_type int kb_int64_t === modified file 'src/cf.data.pre' --- src/cf.data.pre 2010-03-06 19:43:02 +0000 +++ src/cf.data.pre 2010-03-11 22:20:28 +0000 @@ -5757,21 +5757,38 @@ DOC_END NAME: icap_service_failure_limit -TYPE: int +COMMENT: limit [in memory-depth time-units] +TYPE: icap_service_failure_limit IFDEF: ICAP_CLIENT -LOC: Adaptation::Icap::TheConfig.service_failure_limit +LOC: Adaptation::Icap::TheConfig DEFAULT: 10 DOC_START The limit specifies the number of failures that Squid tolerates when establishing a new TCP connection with an ICAP service. If the number of failures exceeds the limit, the ICAP service is not used for new ICAP requests until it is time to refresh its - OPTIONS. The per-service failure counter is reset to zero each - time Squid fetches new service OPTIONS. + OPTIONS. A negative value disables the limit. Without the limit, an ICAP service will not be considered down due to connectivity failures between ICAP OPTIONS requests. + + Squid forgets ICAP service failures older than the specified + value of memory-depth. The memory fading algorithm + is approximate because Squid does not remember individual + errors but groups them instead, splitting the option + value into ten time slots of equal length. + + When memory-depth is 0 and by default this option has no + effect on service failure expiration. + + Squid always forgets failures when updating service settings + using an ICAP OPTIONS transaction, regardless of this option + setting. + + For example, + # suspend service usage after 10 failures in 5 seconds: + icap_service_failure_limit 10 in 5 seconds DOC_END NAME: icap_service_revival_delay