Re: Patch to fix comm_incoming (squid-1.2beta22)

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Thu, 4 Jun 1998 17:21:45 +0300 (EETDST)


On 4 Jun 98, at 16:35, Stewart Forster <slf@connect.com.au> wrote:

 Hi Stewart,

> Some problems with the standard beta22 comm_incoming business.
>
> As noted by Michael O'Reilly, if there's stuff on the UDP write
> queue, that never gets sent. I've fixed this by adding a check in
> icpHandleUdp to call and reset the write handler on a UDP socket if
> it's set. That way the queue gets cleared when necessary.

 I personally do not like the idea of calling all special sockets directly
 in a row, and here's why:
 1) what you want to achieve is avoiding an extra syscall in the form of poll()
 2) yet you do at least 2 syscalls in the form of read() instead
 3) you need to modify each handler (pass a pointer to an int to it) just to
    keep track of the number of ready FDs
 4) in case we want to have separate sockets for ICP.client and ICP.server
    and/or several HTTP accept sockets, it will make the needed number of
    syscalls far higher than with poll()ing them all first
 5) you need special handling for them all, e.g. writes, which somewhat
    breaks the overall Squid design.

 alternatively:
 1) using poll() on the special sockets we'd do at worst 1 extra syscall, in
    case some of the sockets are ready, and at best _only_ 1 syscall in case
    there are none (which IMHO is far more likely to happen, especially if
    you want good reaction time to incoming bursts)
 2) poll() returns the number of FDs ready, so you need not hack with &incame
 3) we service only those FDs that are indeed ready
 4) we can use the same procedure to handle ready FDs regardless of which
    sockets they are (a rough sketch follows below).
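
 A rough sketch of that idea (the names incoming_fds[], n_incoming,
 MAX_INCOMING and service_fd() are made up for illustration; they are not
 the real Squid structures or handlers):

#include <poll.h>

static int
check_incoming_sockets(void)
{
    struct pollfd pfds[MAX_INCOMING];
    int i, nready;

    /* ask about all special sockets at once, with zero timeout */
    for (i = 0; i < n_incoming; i++) {
        pfds[i].fd = incoming_fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }
    nready = poll(pfds, n_incoming, 0);
    if (nready <= 0)
        return 0;               /* exactly one syscall spent, nothing ready */

    /* service only the FDs that really are ready, through the same
     * handler path used for any other socket */
    for (i = 0; i < n_incoming; i++)
        if (pfds[i].revents & (POLLIN | POLLHUP | POLLERR))
            service_fd(pfds[i].fd);
    return nready;
}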

> Further, the algorithm was a fair bit of a shambles. It required
> progressively larger and larger `incame' values just to keep the current
> incoming_interval where it was. ie. if the incoming_interval rate dropped,
> then we needed to have even more `incame' just to keep it there. That's
> kinda against what the whole idea was. It was fighting an increasing
> curve.
>
> My new algorithm is much simpler and does away with the horrible
> division array. I've assumed that we want half of our number of incoming
> sockets to have activity on them (on average).
>
> incoming_interval = incoming_interval + (no_incoming_sockets / 2) - incame;
>
> This allows for fairly rapid dropping on the incoming_interval
> when things start blowing out, but only increases fairly slowly. Also,
> since we only want to call the checks when things are probably going to
> happen we avoid excessive calls and also excessive delays.

 Stew, your algorithm has assumptions that may break the idea. For example, if
 the number of incoming sockets is 1, the above algorithm will drive
 incoming_interval towards 0, which is definitely wrong for http-only caches.
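
 Just to illustrate with numbers (assuming an http-only cache with one
 accept socket, i.e. no_incoming_sockets == 1): the adjustment term
 (no_incoming_sockets / 2) is 0 in integer math, so every check becomes

    incoming_interval = incoming_interval + 0 - incame;

 i.e. incoming_interval can never grow back, it only stays put (incame == 0)
 or shrinks (incame > 0), until we end up checking the lone accept socket
 on nearly every loop.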

 Please let me define the goal from a slightly different perspective:
 1) whatever the algorithm, its only purpose is to _predict_ when the next
    incoming event is likely to happen
 2) we try to achieve a reasonable prediction by measuring the average incoming
    event rate (relative to the regular event rate, not real time)
 3) we can NEVER predict the actual next event correctly: our prediction will
    ALWAYS make us repoll either "too-late" or "too-early"
 4) we want to err towards too-late rather than too-early (surprise?),
    BUT only VERY slightly (so as not to poll FDs just before they become ready)
 5) we surely want the algorithm to track the real world as precisely as
    possible with a minimal set of data and assumptions, yet still NOT get
    confused by accidental spikes in the traffic rate
 6) we want it to react to a rate increase of incoming events adequately fast
 7) we want it to react to a lack of incoming events slowly, but still react
 8) whenever any event happens on an incoming socket, all we know is that there
    were some (N > 0) FDs ready at the same time
 9) we should not assume any rate from the ready-FD count, unless we measure
    the number of messages received on each FD individually.
 (That's because all we know is that an event happened; we do not know the exact
  timing, and assuming >1 events in a row to represent any incoming rate is
  largely unwarranted. For example, HTTP and ICP can come in at exactly the same
  rate, but always in pairs; requests from remote ICP peers can come in at the
  same slow rate, but in chunks of 1-3. Counting these may lead to quite wrong
  predictions. Do not confuse prediction with the statistical average incoming rate.)

 From that I'd propose:
 1) not to count the number of ready FDs on all incoming sockets at all
  (if we really want such a count to influence the algorithm, then count the
   messages serviced in a row on each FD and use the max of all per-FD counts)

 2) to take the only input data for the algorithm to be ready_incoming_FDs == false/true

 3) in comm_incoming, to poll() all incoming sockets, service those that are
    ready, and set the next poll time.

 4) to use a variable (next_poll for example) that "predicts" after how many
    ordinary io_events we expect activity on the incoming sockets.

 5) when io_events reaches next_poll, we check comm_incoming:

   1) if the next_poll prediction was too-late (sockets ready), we reduce next_poll
      by reduce_rate, _slightly_ (to avoid over-regulation)

   2) if the next_poll prediction was too-early (no FDs ready), we increase next_poll
      by increase_rate, _slightly_ (to avoid too fast a decay)

   3) if the prediction was too-late >1 consecutive times, then we know that our
      next_poll reduce_rate is too slow; adapt by upping reduce_rate (say, * 2)

   4) if the prediction was too-early >1 consecutive times, then we are checking
      incoming sockets too often, but we might wish not to change the next_poll
      increase_rate, and instead let the algorithm decay slowly and linearly.

      Although we might also implement an exponentially changing next_poll increase_rate.

   5) for both cases 3) and 4), when consecutive errors flip to the reverse error type,
      reset the next_poll change rates to their defaults, thus returning to
      "precision-tracking mode"

 What we'd effectively have is a kind of "phase-locked loop" that tracks the actual
 incoming rate with exponentially increasing reaction speed (glitch-resistant but
 fast), while linearly and slowly (for stability) drifting towards too-late
 (i.e. idle, slow rate).

 Please note that reacting too fast to a changing world will make the algorithm
 jump wildly in wrong directions almost always (over-reaction), resulting mostly
 in unwanted effects (like oscillation in electronic equipment).

 Strictly, you can't track the actual average more precisely than some percentage,
 and only if your reaction speed is equal to or smaller than that percentage. So,
 if we want to be within 5% error in predicting the right next_poll time, we can't
 shift next_poll by more than 1/20 at a time. In this light, your selection of
 1/16 is quite OK.
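
 (For the numbers: a step of 1/16 is 6.25% per adjustment, only slightly
  coarser than the 1/20 == 5% limit above.)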

 In terms of actual code, the algorithm would be pretty simple
 (code not tested, just for illustration):

comm_incoming() {
    ready_fds = poll();      /* poll() over the incoming sockets only */
    ...
    if (ready_fds > 0) {
        /* too-late: shrink next_poll; double the (negative) step on
         * consecutive too-late errors */
        if (change_rate < 0) change_rate *= 2;
        else change_rate = -1;
    } else {
        /* too-early: grow next_poll slowly */
        if (change_rate > 0) change_rate *= 2;  /* or cr++ or cr = 2, etc */
        else change_rate = 1;
    }
    /* keep the adaptation step within sane bounds */
    if (change_rate < cmin) change_rate = cmin;
    if (change_rate > cmax) change_rate = cmax;

    next_poll += change_rate;

    /* and keep the prediction itself within sane bounds */
    if (next_poll > max) next_poll = max;
    if (next_poll < min) next_poll = min;
}

comm_poll() {
 ...
    /* next_poll is kept in 1/16ths of an io_event (hence the >> 4),
     * so change_rate steps of 1 give sub-event adjustment granularity */
    if (++io_events > (next_poll >> 4)) {
        io_events = 0;
        comm_incoming();
    }
 ...
}

 ----------------------------------------------------------------------
  Andres Kroonmaa                      mail: andre@online.ee
  Network Manager                      Tel:  6308 909
  Organization: MicroLink Online       Pho: +372 6308 909
  Tallinn, Sakala 19                   Fax: +372 6308 901
  Estonia, EE0001                      http://www.online.ee
 ----------------------------------------------------------------------
