Drop-in replacement storeMostFreeSwapDir in store_dir.c (1.2beta20)

From: Stewart Forster <slf@dont-contact.us>
Date: Thu, 07 May 1998 15:07:37 +1000

Hiya,

        In investigating why our caches (1.2beta20) get occasional warnings
about the ASYNC queue filling up under heavy load, I discovered that a lot
of the warnings involved opens, reads and writes all to the same
store directory, i.e. there was an imbalance happening. A single cache store
directory was getting all the load, blowing out service times and causing
ASYNC disk IO requests to bank up to the point of triggering warnings.

        The current algorithm simply writes to the least-filled directory.
This has a few drawbacks because of the way squid fetches objects.
Once a directory has been earmarked for writing, the object is still
only starting to come in. Given that even a fairly heavily loaded cache
pulls in little more than 1MB/sec, large cache store directories whose fill
levels differ by relatively small proportions can incur a great deal of
writing before enough data arrives to push the least-used cache store
directory up above the next least.

        This results in cache store directories being blasted with load, one
at a time. Sure other writes are going on to other directories, but these
are residual longer transfers, and the load is severely imbalanced to the
point that the store with the least space can at times get 8-10 times the
write load of the next highest disk store.

        I have coded up, tested and proven the following drop-in replacement
for storeMostFreeSwapDir in the store_dir.c module. The new routine takes
about twice as long (CPU-wise) as the old routine to run, but dramatically
improves disk efficiency. We have seen our disk service times drop from
a daily average of 47ms and a high-load average of 70ms, to an average of
31ms during high-load periods. This has also all but eliminated (except
during startup) the ASYNC queue growing messages, and has meant that
there are now no slowdowns due to the ASYNC library waiting for disk IO to
complete (under the loads we see).

        The routine works by finding (via selection sort) the least-used
3/4 of the set of store directories and allocating writes round-robin
across that set. The set is recalculated each time it is exhausted. This
means that if you have 4 store dirs (on 4 disks, hopefully), disk writes
will go to the 3 least-used store dirs. This still maintains the
space-balancing mechanism the old algorithm used, but balances disk load
much better. Whereas the old algorithm was O(n) for n store directories,
the new algorithm is amortized O(n): although an O(n^2) selection sort is
used, it only runs once every nth call. Basically nothing to worry about
unless you have something like >32 cache store directories, where
switching to qsort() might be an idea to find the least-used directories.

        BTW, only expect good speedups if you are using (mostly) 1 store dir
per disk spindle.

        Cheers,

                Stew.

-------
/* Spread the write load across the least-used 3/4 of the store directories */

static int
storeMostFreeSwapDir(void)
{
    double least_used = 1.0;
    int dirn;
    int i, j;
    SwapDir *SD;
    static int nleast = 0;
    static int nconf = 0;
    static int *dirq = NULL;
    static double *diru = NULL;

    /* Handle the simplest case of a single swap directory immediately */

    if (Config.cacheSwap.n_configured == 1)
        return 0;

    /* Initialise dirq on the first call or when the number of dirs changes */

    if (nconf != Config.cacheSwap.n_configured) {
        nconf = Config.cacheSwap.n_configured;
        nleast = (nconf * 3) / 4;
        if (dirq != NULL)
            xfree(dirq);
        dirq = (int *) xmalloc(sizeof(int) * nleast);
        if (diru != NULL)
            xfree(diru);
        diru = (double *) xmalloc(sizeof(double) * nconf);
        for (j = 0; j < nleast; j++)
            dirq[j] = -1;
    }

    /* Scan for a non-negative dirn in the dirq array and return it */

    dirn = -1;
    for (j = 0; j < nleast; j++) {
        dirn = dirq[j];
        if (dirn < 0)
            continue;
        dirq[j] = -1;           /* consume this queue slot */
        break;
    }

    /* If we found a valid dirn, return it */

    if (dirn >= 0)
        return dirn;

    /* Now for the real guts of the algorithm - rebuilding the dirq array */

    /* Compute each directory's fill ratio; read-only dirs keep 1.1 so
     * they can never be selected as least used */

    for (i = 0; i < nconf; i++) {
        diru[i] = 1.1;
        SD = &Config.cacheSwap.swapDirs[i];
        if (SD->read_only)
            continue;
        diru[i] = (double) SD->cur_size / SD->max_size;
    }

    /* Selection sort: pull out the nleast least-used directories */

    for (j = 0; j < nleast; j++) {
        dirq[j] = -1;
        least_used = 1.0;
        dirn = -1;
        for (i = 0; i < nconf; i++) {
            if (diru[i] < least_used) {
                least_used = diru[i];
                dirn = i;
            }
        }
        if (dirn < 0)
            break;
        dirq[j] = dirn;
        diru[dirn] = 1.1;       /* exclude from subsequent rounds */
    }

    if (dirq[0] < 0)            /* default to dir 0 if no least-used found */
        dirq[0] = 0;

    dirn = dirq[0];
    dirq[0] = -1;
    return dirn;
}

-- 
Stewart Forster (Snr. Development Engineer)
connect.com.au pty ltd, Level 9, 114 Albert Rd, Sth Melbourne, VIC 3205, Aust.
Email: slf@connect.com.au   Phone: +61 3 9251-3684   Fax: +61 3 9251-3666
Received on Tue Jul 29 2003 - 13:15:49 MDT
