[PATCH] Accelerator load balancing #2 from richard@dont-contact.us on 2000-10-21 (squid-dev)

From: <richard@dont-contact.us>
Date: Sun, 22 Oct 2000 16:44:32 +1100 (EST)

Ok, attached is the latest accel mode load balancing patch, a vast
improvement over the previous one, yadda yadda.

I do not recommend including this in distribution squid at this stage,
although I hope one day it'll reach that :)

The patch is against 2.3-STABLE4, and at this stage is believed to be
stable.

Description:

This patch allows squid, in httpd_accel mode, to act as a multiple-machine
load-balancing/failover/caching engine for any number of backend
webservers. The load-balancing algorithm is considerably more advanced
than in previous versions: it now tracks the average time-to-complete over
the last 100 connections to each node, and then calculates which node is
likely to serve the current request first. It keeps track of currently
running requests, thus balancing to a slower machine if the faster machine
is too busy, etc. The engine uses the first AVG_LEN (by default, 100)
requests to calibrate, distributing among the nodes evenly.
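
For illustration, the selection rule described above can be sketched as
follows. This is a simplified stand-alone sketch, not the code in the
patch: struct node, node_cost() and pick_node() are hypothetical names,
and an array stands in for the patch's linked list.

```c
#include <float.h>

#define AVG_LEN 100   /* calibration window, same value as in the patch */

/* Hypothetical, simplified node record (the patch's struct bnode has more). */
struct node {
    const char *name;
    int reqs;       /* requests completed so far */
    int queue;      /* requests currently in progress */
    float timing;   /* approximate total time of the last AVG_LEN requests */
    int dead;       /* non-zero while the node is marked dead */
};

/* Estimated time for this node to finish its queue plus one more request. */
static float node_cost(const struct node *n)
{
    if (n->reqs < AVG_LEN)
        return 0.0f;    /* still calibrating: every node looks equally cheap */
    return (n->timing / AVG_LEN) * (float)(n->queue + 1);
}

/*
 * Scan all nodes, starting just after the previously chosen one, and
 * return the index of the cheapest live node, or -1 if none is alive.
 * Ties go to the first node scanned, which yields an even rotation
 * while every node is still calibrating.
 */
static int pick_node(const struct node *nodes, int count, int previous)
{
    int best = -1;
    float best_cost = FLT_MAX;
    int i;

    for (i = 1; i <= count; i++) {
        int idx = (previous + i) % count;
        float cost;

        if (nodes[idx].dead)
            continue;
        cost = node_cost(&nodes[idx]);
        if (cost < best_cost) {
            best = idx;
            best_cost = cost;
        }
    }
    return best;
}
```

With an average of 0.12s and an empty queue, a node costs 0.12; a faster
node averaging 0.08s with three requests in flight costs 0.32, so the
slower but idle node gets the request.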

Failover is implemented by taking a certain set of errors returned by
squid on a server connection, and marking the node dead for 60 seconds.
The request is then transferred transparently by squid to another node.
Nodes marked dead are not given any requests during their dead time.
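
A minimal sketch of that dead-marking logic, assuming a simplified node
record. The names struct backend, enum sketch_err, report_result() and
is_alive() are hypothetical stand-ins; the patch itself switches on
squid's ERR_CONNECT_FAIL, ERR_SOCKET_FAILURE, ERR_READ_ERROR and
ERR_WRITE_ERROR.

```c
#include <time.h>

#define DEAD_SECONDS 60   /* same value as DEFAULT_DELAY in the patch */

/* Hypothetical stand-ins for the squid error types the patch checks. */
enum sketch_err {
    SK_NONE,
    SK_CONNECT_FAIL,
    SK_SOCKET_FAILURE,
    SK_READ_ERROR,
    SK_WRITE_ERROR,
    SK_OTHER
};

/* Hypothetical, simplified node record. */
struct backend {
    const char *name;
    time_t dead_until;   /* node receives no requests until after this time */
};

/* Only connection-level failures count against the node. */
static int is_fatal(enum sketch_err e)
{
    switch (e) {
    case SK_CONNECT_FAIL:
    case SK_SOCKET_FAILURE:
    case SK_READ_ERROR:
    case SK_WRITE_ERROR:
        return 1;
    default:
        return 0;
    }
}

/* Record the outcome of a request; a fatal error benches the node. */
static void report_result(struct backend *b, enum sketch_err e, time_t now)
{
    if (is_fatal(e))
        b->dead_until = now + DEAD_SECONDS;
}

/* A node is eligible again once its dead time has fully passed. */
static int is_alive(const struct backend *b, time_t now)
{
    return b->dead_until < now;
}
```

The selection loop would call is_alive() for each candidate and skip the
ones that return 0, so a retried request lands on a different node.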

It is possible to dynamically reconfigure the server using the standard
squid reload: in-progress requests will continue, and all new requests
will use the new configuration, allowing you to explicitly remove or add
nodes without downtime.

Configuration:

I currently use the following lines in my accel config:

httpd_accel_host virtual
httpd_accel_port 80
httpd_accel_uses_host_header on
httpd_accel_balance_targets web3.dmz web2.dmz web1.dmz

Note that you may also do:

httpd_accel_balance_targets web3.dmz:80 web3.dmz:90 web2.dmz

and so on. If you do not specify a port, the value of httpd_accel_port is
used as the default.
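
As a rough sketch of how a "host[:port]" target can be split with a
default port: the patch itself uses strtok() and atoi() on the config
string, whereas this stand-alone version uses strchr() and strtol() so the
input is left untouched and bad ports are rejected. The function name
parse_target() is hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Split "host[:port]" into host and port, falling back to default_port
 * when no port is given. Returns 0 on success, -1 on a malformed spec
 * or if the host does not fit into the supplied buffer.
 */
static int parse_target(const char *spec, int default_port,
                        char *host, size_t host_len, int *port)
{
    const char *colon = strchr(spec, ':');
    size_t n = colon ? (size_t)(colon - spec) : strlen(spec);

    if (n == 0 || n >= host_len)
        return -1;
    memcpy(host, spec, n);
    host[n] = '\0';

    if (colon && colon[1] != '\0') {
        char *end;
        long p = strtol(colon + 1, &end, 10);

        if (*end != '\0' || p < 1 || p > 65535)
            return -1;      /* trailing junk or out-of-range port */
        *port = (int)p;
    } else {
        *port = default_port;   /* no port given: use httpd_accel_port */
    }
    return 0;
}
```

Called with "web3.dmz:90" and a default of 80 this yields host web3.dmz
and port 90; called with "web2.dmz" it yields port 80.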

Log information:

This patch dumps information to the cache.log roughly every 30 seconds
(on the first balanced request after the interval has elapsed). This is
not ideal, and it would be nice to incorporate the statistics into the
snmp engine or something, but it works for now. Format:

2000/10/22 16:18:57| balance: web3.dmz queue 0 average 0.076025 total 0.000000 reqs 82 hps 0.117913
2000/10/22 16:18:57| balance: web2.dmz queue 0 average 0.120233 total 0.000000 reqs 83 hps 0.117913
2000/10/22 16:18:57| balance: web1.dmz queue 0 average 0.129913 total 0.000000 reqs 83 hps 0.147391

The queue is the number of currently in-progress requests for that node,
average is the average time taken to serve a request, total is 0 during
calibration, otherwise it is (queue+1)*average (the in-progress requests
plus the one about to be assigned), reqs is the number of requests served
by that node, and hps is the hits-per-sec over the last 30 seconds.
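
To make the fields concrete, here is a small stand-alone sketch of how one
such line could be derived. The helper names hits_per_sec() and
format_balance_line() are hypothetical; the total follows the patch's
code, which multiplies the average by queue+1 to count the request about
to be assigned.

```c
#include <stdio.h>

#define AVG_LEN 100   /* calibration window, same value as in the patch */

/* Hits per second over the interval since the counters were last reset. */
static double hits_per_sec(int hits, long start_sec, long start_usec,
                           long now_sec, long now_usec)
{
    double elapsed = (double)(now_sec - start_sec)
        + (double)(now_usec - start_usec) / 1000000.0;

    if (elapsed <= 0.0)
        return 0.0;     /* avoid dividing by zero on a zero-length interval */
    return (double)hits / elapsed;
}

/* Build one "balance:" line in the format shown above (no timestamp). */
static int format_balance_line(char *buf, size_t len, const char *target,
                               int queue, double average, int reqs,
                               double hps)
{
    /* total stays 0 until the node has served AVG_LEN requests */
    double total = (reqs < AVG_LEN) ? 0.0 : average * (double)(queue + 1);

    return snprintf(buf, len,
                    "balance: %s queue %d average %f total %f reqs %d hps %f",
                    target, queue, average, total, reqs, hps);
}
```

Feeding in the web3.dmz values from the sample (queue 0, average 0.076025,
reqs 82, hps 0.117913) reproduces that line with total 0.000000, since 82
requests is still inside the calibration window.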

Code:

There are various outstanding issues with the quality of the code. While
externally it has moved much closer to the squid look&feel, internally it
still uses my own list code, which I will have to get around to
converting, and there is a bit of commenting out of debugging lines :)
Mostly it's cool, but my natural perfectionist bent will probably force a
cleanup a little later on. If there are any other issues anyone notices,
I'd be happy to take a look at them.

If anyone wants any more information, or has any suggestions, let me know.
The patch is included here for your own interest and use; I ain't taking
any responsibility for people who use it and crash'n'burn. That said, it's
been pretty solid here. It should scale up pretty well, so as long as
squid itself can handle it, things should be fine no matter how many nodes
you add.

Richard.

[ Experienced linux system/network/software designer/engineer ]
[ available contract/perm, http://richard.exorsus.net/cv.html ]
[ c/pl/php/js/opengl/apache/squid/route/secure/mail/dns/fw/++ ]

diff -C 3 -N -r ../squid-2.3.STABLE4/src/Makefile.in src/Makefile.in
*** ../squid-2.3.STABLE4/src/Makefile.in Wed Oct 20 07:28:35 1999
--- src/Makefile.in Mon Oct 9 23:07:38 2000
***************
*** 84,89 ****
--- 84,90 ----
                  asn.o \
                  @ASYNC_OBJS@ \
                  authenticate.o \
+ balance.o \
                  cache_cf.o \
                  CacheDigest.o \
                  cache_manager.o \
diff -C 3 -N -r ../squid-2.3.STABLE4/src/balance.c src/balance.c
*** ../squid-2.3.STABLE4/src/balance.c Thu Jan 1 10:00:00 1970
--- src/balance.c Sun Oct 22 15:37:55 2000
***************
*** 0 ****
--- 1,240 ----
+ #include "squid.h"
+ #include "list.h"
+
+ #define STATE_OK 0
+ #define STATE_DEAD 1
+ /* Delay in seconds until next target access attempt */
+ #define DEFAULT_DELAY 60
+
+ struct bnode;
+ struct balance;
+
+ /*
+ * Concept
+ *
+ * Maintain an average response time for each node, every time we are to distribute a request,
+ * take that request and all other requests queued for that node, multiply them by the average
+ * serving time. The one with the lowest time gets the request.
+ *
+ */
+
+ #define AVG_LEN 100
+
+ struct balance {
+ list_head_struct(balance,bnode);
+ struct bnode *previous; /* Last bnode to be returned */
+ } bal;
+
+ struct balance *b = &bal;
+
+ struct bnode {
+ list_node_struct(balance,bnode);
+ char *target;
+ int port;
+ int dead_until;
+ int reqs;
+ float timing; /* last AVG_LEN requests total time */
+ int queue;
+ int stat_hits;
+ unsigned long stat_sec;
+ unsigned long stat_usec;
+ };
+
+ unsigned long last_sec=0;
+
+
+ /* Initialise the balancing stuff */
+ void balanceInit(void) {
+ struct bnode *bn;
+ struct timeval tv;
+ char *t;
+ wordlist *targets = Config.Accel.targets;
+ if (!targets)
+ return; /* No targets, no work, nothing to do, go away */
+ list_init(b);
+ debug(32, 1) ("balance_init: Kicking up round-robin acceleration balance engine. Phoar!\n");
+ while(targets) {
+ bn = xmalloc(sizeof(struct bnode));
+ bn->dead_until = 0;
+ t = strtok(targets->key,":");
+ bn->target = xstrdup(t);
+ if ((t = strtok(NULL,":"))) {
+ bn->port = atoi(t);
+ } else {
+ bn->port = Config.Accel.port;
+ }
+ bn->reqs=0;
+ bn->queue=0;
+ bn->timing=0;
+ bn->stat_hits=0;
+
+ gettimeofday(&tv,NULL);
+ bn->stat_sec=tv.tv_sec;
+ bn->stat_usec=tv.tv_usec;
+
+ debug(32, 1) ("balance_init: Adding %s:%d to balance list\n", bn->target,bn->port);
+ list_append(b,bn);
+ targets = targets->next;
+ }
+ b->previous = list_first(b);
+ return;
+ };
+
+ /* Get the next target for the balance */
+ void balanceGetTarget(FwdState *fwdState, char **host, unsigned short *port) {
+ struct bnode *bn;
+ struct bnode *fav;
+ struct bnode *first;
+ float fav_timing=0;
+ float bn_timing=0;
+ float bn_av=0;
+ float diff_time;
+ struct timeval tv;
+ int show_timings;
+ float hps=0;
+
+ gettimeofday(&tv,NULL);
+
+ show_timings=0;
+
+ if ((last_sec+30)<tv.tv_sec) {
+ last_sec=tv.tv_sec;
+ show_timings=1;
+ debug(32, 1) ("balance: --\n");
+ }
+
+ /* debug(32, 1) ("balanceGetTarget()\n"); */
+
+ /* If the fwdState already has balance info, clear that up first, we must be
+ retrying or something */
+ if (fwdState->balance.target) {
+ balanceComplete(fwdState);
+ }
+
+ fav=bn=b->previous;
+ fav_timing=4294967296.0; /* 2 to the power 32; in C, 2^32 is XOR and evaluates to 34 */
+
+ gettimeofday(&tv,NULL);
+
+ if (node_next(bn)) { first=bn = node_next(bn); } else { first=bn=list_first(b); }
+
+ while (bn) {
+ if (bn->dead_until<squid_curtime) {
+ /* Ok, this deserves explanation:
+ * its the average response time * number of requests incomplete for that host, plus the
+ * one we're about to add if successful
+ */
+ bn_av = bn->timing/((bn->reqs+1) < AVG_LEN ? (bn->reqs+1) : AVG_LEN);
+ bn_timing = bn_av*(bn->queue+1);
+ /* This is a trial idea, make it balance evenly until AVG_LEN is reached */
+ if (bn->reqs<AVG_LEN)
+ bn_timing=0;
+ /* If the node we're on will respond faster, select that */
+ if (bn_timing<fav_timing) {
+ fav=bn;
+ fav_timing=bn_timing;
+ }
+ if (show_timings) {
+ diff_time = (tv.tv_sec-bn->stat_sec)+(((float) (tv.tv_usec-bn->stat_usec))/1000000);
+ hps = ((float) bn->stat_hits)/diff_time;
+ bn->stat_hits=0;
+ bn->stat_sec=tv.tv_sec;
+ bn->stat_usec=tv.tv_usec;
+ debug(32, 1) ("balance: %s queue %d average %f total %f reqs %d hps %f\n",bn->target,bn->queue,bn_av,bn_timing,bn->reqs,hps);
+ }
+ } else {
+ if (show_timings)
+ debug(32, 1) ("balance: %s currently marked dead\n",bn->target);
+ }
+ if (node_next(bn)) { bn = node_next(bn); } else { bn=list_first(b); }
+ if (bn==first) { break; }
+ }
+ b->previous = fav;
+ *port = fav->port;
+ *host = fav->target;
+ fav->stat_hits++;
+
+ fwdState->balance.target=fav;
+
+ fwdState->balance.sec = tv.tv_sec;
+ fwdState->balance.usec = tv.tv_usec;
+ fav->queue++; /* One more in the queue */
+ };
+
+ /* mark a request complete, updating timings and queues */
+ void balanceComplete(FwdState *fwdState) {
+ struct bnode *bn = NULL;
+ float t;
+ float av;
+ struct timeval tv;
+ int sec; int usec;
+
+ /* debug(32, 1) ("balanceComplete()\n"); */
+
+ gettimeofday(&tv,NULL);
+
+ if (!fwdState) {
+ debug(32, 1) ("balanceComplete() received null fwdState!\n");
+ return;
+ }
+
+ sec = tv.tv_sec-fwdState->balance.sec;
+ usec = tv.tv_usec-fwdState->balance.usec;
+
+ if (usec<0) {
+ sec--;
+ usec = 1000000+usec;
+ }
+
+ /* This is a check to make sure the pointer is in our target
+ list. It can happen that it isn't, if there has been a reload
+ between the start and complete
+ */
+ list_run(b,if (b->op == fwdState->balance.target) { bn = b->op; break; });
+ if (!bn)
+ return;
+
+ /* Update timings */
+
+ if (fwdState->err) {
+ switch(fwdState->err->type) {
+ case ERR_CONNECT_FAIL:
+ case ERR_SOCKET_FAILURE:
+ case ERR_READ_ERROR:
+ case ERR_WRITE_ERROR:
+ debug(32, 1) ("balance_mark_failed: Setting %s as dead\n", bn->target);
+ bn->dead_until = squid_curtime+DEFAULT_DELAY;
+ break;
+ default:
+ t = sec+((float)(((float)usec)/1000000.0));
+ if (bn->reqs>AVG_LEN) {
+ av = bn->timing/AVG_LEN;
+ bn->timing-=av;
+ }
+ bn->timing+=t;
+ /* debug(32, 1) ("balanceComplete:default: req to %s took %f\n", bn->target,t); */
+ };
+ } else {
+ t = sec+((float)(((float)usec)/1000000.0));
+ if (bn->reqs>AVG_LEN) {
+ av = bn->timing/AVG_LEN;
+ bn->timing-=av;
+ }
+ bn->timing+=t;
+ /* debug(32, 1) ("balanceComplete:else: req to %s took %f\n", bn->target,t); */
+ }
+
+ fwdState->balance.target = NULL;
+ fwdState->balance.sec = 0;
+ fwdState->balance.usec = 0;
+
+ bn->queue--;
+ if (bn->queue<0) /* can happen after a reload */
+ bn->queue=0;
+ bn->reqs++;
+ }
+
+ void balanceDestroy(void) {
+ /* run the list in safe mode (copy next pointer) and free all the nodes */
+ list_run_safe(b,xfree(b->op)); list_init(b); /* reset the head so no freed node stays reachable */
+ };
diff -C 3 -N -r ../squid-2.3.STABLE4/src/cf.data.pre src/cf.data.pre
*** ../squid-2.3.STABLE4/src/cf.data.pre Wed Jun 14 00:19:57 2000
--- src/cf.data.pre Mon Oct 9 01:41:49 2000
***************
*** 2049,2054 ****
--- 2049,2065 ----
  httpd_accel_uses_host_header off
  DOC_END

+ NAME: httpd_accel_balance_targets
+ TYPE: wordlist
+ LOC: Config.Accel.targets
+ DEFAULT: none
+ DOC_START
+ A list of servers that all requests will be balanced across. This
+ is a hack. Use at your own risk. Noony noony.
+
+ httpd_accel_balance_targets web1.dmz web2.dmz web3.dmz
+ DOC_END
+
  COMMENT_START
   MISCELLANEOUS
   -----------------------------------------------------------------------------
diff -C 3 -N -r ../squid-2.3.STABLE4/src/defines.h src/defines.h
*** ../squid-2.3.STABLE4/src/defines.h Tue Jul 18 13:48:07 2000
--- src/defines.h Sun Oct 22 00:59:49 2000
***************
*** 38,43 ****
--- 38,45 ----
  #define FALSE 0
  #endif

+ #define BALANCE
+
  #define ACL_NAME_SZ 32
  #define BROWSERNAMELEN 128

diff -C 3 -N -r ../squid-2.3.STABLE4/src/forward.c src/forward.c
*** ../squid-2.3.STABLE4/src/forward.c Wed Feb 23 18:13:23 2000
--- src/forward.c Sun Oct 22 02:24:50 2000
***************
*** 68,73 ****
--- 68,78 ----
      int sfd;
      debug(17, 3) ("fwdStateFree: %p\n", fwdState);
      assert(e->mem_obj);
+
+ #ifdef BALANCE
+ balanceComplete(fwdState); /* This call will catch most closes */
+ #endif
+
      if (e->store_status == STORE_PENDING) {
          if (e->mem_obj->inmem_hi == 0) {
              assert(fwdState->err);
***************
*** 221,227 ****
      int fd;
      ErrorState *err;
      FwdServer *fs = fwdState->servers;
! const char *host;
      unsigned short port;
      time_t ctimeout;
      assert(fs);
--- 226,232 ----
      int fd;
      ErrorState *err;
      FwdServer *fs = fwdState->servers;
! char *host;
      unsigned short port;
      time_t ctimeout;
      assert(fs);
***************
*** 233,240 ****
--- 238,254 ----
          ctimeout = fs->peer->connect_timeout > 0 ? fs->peer->connect_timeout
              : Config.Timeout.peer_connect;
      } else {
+ #ifdef BALANCE
+ if (Config.Accel.targets) {
+ balanceGetTarget(fwdState,&host,&port);
+ } else {
+ host = fwdState->request->host;
+ port = fwdState->request->port;
+ }
+ #else
          host = fwdState->request->host;
          port = fwdState->request->port;
+ #endif
          ctimeout = Config.Timeout.connect;
      }
      hierarchyNote(&fwdState->request->hier, fs->code, host);
***************
*** 471,477 ****
      fwdState->client_fd = fd;
      fwdState->server_fd = -1;
      fwdState->request = requestLink(r);
! fwdState->start = squid_curtime;
      storeLockObject(e);
      EBIT_SET(e->flags, ENTRY_FWD_HDR_WAIT);
      storeRegisterAbort(e, fwdAbort, fwdState);
--- 485,493 ----
      fwdState->client_fd = fd;
      fwdState->server_fd = -1;
      fwdState->request = requestLink(r);
! #ifdef BALANCE
! fwdState->balance.target=NULL;
! #endif
      storeLockObject(e);
      EBIT_SET(e->flags, ENTRY_FWD_HDR_WAIT);
      storeRegisterAbort(e, fwdAbort, fwdState);
***************
*** 520,525 ****
--- 536,544 ----
      if (fwdState->err)
          errorStateFree(fwdState->err);
      fwdState->err = errorState;
+ #ifdef BALANCE
+ balanceComplete(fwdState); /* This particular call will catch errors, to mark dead */
+ #endif
  }

  /*
***************
*** 529,534 ****
--- 548,554 ----
  fwdAbort(void *data)
  {
      FwdState *fwdState = data;
+
      debug(17, 2) ("fwdAbort: %s\n", storeUrl(fwdState->entry));
      fwdStateFree(fwdState);
  }
***************
*** 539,544 ****
--- 559,565 ----
  void
  fwdUnregister(int fd, FwdState * fwdState)
  {
+
      debug(17, 3) ("fwdUnregister: %s\n", storeUrl(fwdState->entry));
      assert(fd = fwdState->server_fd);
      assert(fd > -1);
***************
*** 560,565 ****
--- 581,587 ----
      debug(17, 3) ("fwdComplete: %s\n\tstatus %d\n", storeUrl(e),
          e->mem_obj->reply->sline.status);
      fwdLogReplyStatus(fwdState->n_tries, e->mem_obj->reply->sline.status);
+
      if (fwdReforward(fwdState)) {
          debug(17, 3) ("fwdComplete: re-forwarding %d %s\n",
              e->mem_obj->reply->sline.status,
diff -C 3 -N -r ../squid-2.3.STABLE4/src/list.h src/list.h
*** ../squid-2.3.STABLE4/src/list.h Thu Jan 1 10:00:00 1970
--- src/list.h Mon Oct 9 03:24:54 2000
***************
*** 0 ****
--- 1,27 ----
+ /***
+ *
+ * Rather strange linked list implementation :)
+ *
+ */
+
+ #define list_node_struct(ht, t) struct t *next; struct t *prev; struct ht *head;
+ #define list_head_struct(ht, t) struct t *first; struct t *last; struct t *op; struct t *op2; unsigned long count;
+
+ #define node_next(n) n->next
+ #define node_prev(n) n->prev
+ #define node_head(n) n->head
+ #define list_first(h) h->first
+ #define list_last(h) h->last
+ #define list_count(h) h->count
+ #define list_append(h, n) if (h->last) { h->last->next=n; } n->prev=h->last; n->next=NULL; h->last=n; n->head=h; if (!(h->first)) { h->first=n; n->prev=NULL; }
+ #define list_prepend(h, n) if (h->first) { h->first->prev=n; } n->next=h->first; n->prev=NULL; h->first=n; n->head=h; if (!(h->last)) { h->last=n; n->next=NULL; }
+ #define list_init(h) h->first=NULL; h->last=NULL;h->op = NULL; h->count = 0;
+
+ /* List_run is for iteration, *however* if you're doing a free(); run, you *must* use _safe, slightly
+ less efficient, but it saves the next() pointer so you don't segfault :)
+ */
+
+ #define list_run(h, c...) h->op = h->first; while (h->op) { c; h->op = h->op->next; }
+ #define list_run_safe(h, c...) h->op = h->first; while (h->op) { h->op2 = h->op->next; c; h->op = h->op2; }
+
+
diff -C 3 -N -r ../squid-2.3.STABLE4/src/main.c src/main.c
*** ../squid-2.3.STABLE4/src/main.c Thu Feb 10 10:29:58 2000
--- src/main.c Sun Oct 22 02:25:39 2000
***************
*** 341,346 ****
--- 341,347 ----
      redirectShutdown();
      authenticateShutdown();
      storeDirCloseSwapLogs();
+ balanceDestroy(); /* Clean up balancer */
      errorClean();
      mimeFreeMemory();
      parseConfigFile(ConfigFile);
***************
*** 352,357 ****
--- 353,359 ----
  #if !USE_DNSSERVERS
      idnsInit();
  #endif
+ balanceInit(); /* Initialise balancer */
      redirectInit();
      authenticateInit();
  #if USE_WCCP
***************
*** 467,472 ****
--- 469,475 ----
  #if !USE_DNSSERVERS
      idnsInit();
  #endif
+ balanceInit(); /* Initialise balancer */
      redirectInit();
      authenticateInit();
      useragentOpenLog();
diff -C 3 -N -r ../squid-2.3.STABLE4/src/protos.h src/protos.h
*** ../squid-2.3.STABLE4/src/protos.h Sat Apr 8 06:32:30 2000
--- src/protos.h Sun Oct 22 02:13:03 2000
***************
*** 93,98 ****
--- 93,111 ----
  #endif

  /*
+ * balance.c
+ */
+
+ #ifdef BALANCE
+
+ extern void balanceInit(void);
+ extern void balanceGetTarget(FwdState *fwdState, char **host, unsigned short *port);
+ extern void balanceComplete(FwdState *fwdState);
+ extern void balanceDestroy(void);
+
+ #endif
+
+ /*
   * cache_cf.c
   */
  extern int parseConfigFile(const char *file_name);
diff -C 3 -N -r ../squid-2.3.STABLE4/src/squid.h src/squid.h
*** ../squid-2.3.STABLE4/src/squid.h Thu Feb 10 10:30:00 2000
--- src/squid.h Sun Oct 22 02:11:32 2000
***************
*** 38,43 ****
--- 38,49 ----
  #include "config.h"

  /*
+ * Activate accelerator balancing code
+ */
+
+ #define BALANCE
+
+ /*
   * On some systems, FD_SETSIZE is set to something lower than the
   * actual number of files which can be opened. IRIX is one case,
   * NetBSD is another. So here we increase FD_SETSIZE to our
diff -C 3 -N -r ../squid-2.3.STABLE4/src/structs.h src/structs.h
*** ../squid-2.3.STABLE4/src/structs.h Thu Mar 30 08:56:57 2000
--- src/structs.h Sun Oct 22 02:04:14 2000
***************
*** 327,332 ****
--- 327,333 ----
      struct {
          char *host;
          u_short port;
+ wordlist *targets;
      } Accel;
      char *appendDomain;
      size_t appendDomainLen;
***************
*** 1746,1751 ****
--- 1747,1759 ----
          unsigned int dont_retry:1;
          unsigned int ftp_pasv_failed:1;
      } flags;
+ #ifdef BALANCE
+ struct {
+ time_t sec;
+ time_t usec;
+ void *target;
+ } balance;
+ #endif
  };

  #if USE_HTCP
Received on Sun Oct 22 2000 - 06:14:08 MDT
