Re: Squid performance wish-list

From: Kevin Littlejohn <darius@dont-contact.us>
Date: Fri, 28 Aug 1998 11:12:54 +1000

>
> By writing your own FS, you'd have to use direct io (unbuffered & uncached by the OS),
> as you would write to raw partitions, and to achieve the same numbers you'd need
> to also implement very efficient caching. You'd need to allocate large amounts
> of ram for that task and make sure that this ram is not paged
> out to swap; you'd need to implement clustering, read-ahead, free-behind,
> avoid fragmentation, etc., etc.

I think you'll find a fair portion of your io buffering/caching is being
taken up by inodes. Given that squid already does caching of objects
in memory (or did last I checked :), we can get away with not buffering
data read from disk - this is the assumption behind the novm version, in reverse.
If that's the case, then by getting rid of filename->inode lookups, we do
away with the other large user of the fs buffers. I don't think we _do_
need to implement the io caching you refer to above.

> > > I'd rather focus on changing squid's disk usage patterns in such a way
> > > that it would make it really easy for ufs and OS to do what we need.
> > > I believe there are tons of things we can do in that direction.
> >
> > I believe we're starting to cap out in this area. I understand very well
> > about disk optimisation and Squid is getting close to the end of things
> > it can do with regards to UFS without making big assumptions about what
> > an OS is doing and the disk layout. As for disk usage patterns, what
> > did you have in mind with regard to a system that is fairly random? We
> > already use the temporal and spatial locality inherent in cache operation
> > to make things work better. I'm open to any suggestions as to what more
> > can be done. Remember I'm after quantum leaps in performance, not just
> > 5% here and 10% there from minor fiddling.
>
> OK, I'll give it a try ;)
>
> To optimize disk io, we first need to write down what makes it slow, why,
> and what can be done about it. I guess no one will argue that the
> slowest thing in squid is file open; compared to subsequent reads/writes
> it is about 10 times slower. So this is the area of greatest potential speedup.
> Although unlink is even slower, I believe that it can be avoided and
> should be removed from squid if at all possible.

(Very quick note: The proposed fs design _does_ address unlink, making it
a 'read inode, blatt bitmap' operation - one read for any small-to-medium-sized
file, and a twiddling of bits in a bitmap to mark blocks as free
en masse. I don't know, but I suspect that may even weigh in cheaper than
truncating files under a fully-fledged (or extent-based) fs.)
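
To make that concrete, here's a rough sketch in C of what the 'read inode,
blatt bitmap' unlink could look like - one read to pull in the block list,
then clearing bits in an in-core allocation bitmap. The layout and names
are entirely made up for illustration; the real on-disk format is still
being designed.

    /* Hypothetical sketch only - not the real on-disk layout. */
    #include <stdint.h>
    #include <unistd.h>

    #define SFS_MAX_BLOCKS 16           /* direct block pointers per inode (assumed) */
    #define SFS_BITMAP_BYTES 8192       /* assumed bitmap size */

    struct sfs_inode {
        uint32_t nblocks;
        uint32_t block[SFS_MAX_BLOCKS]; /* block numbers on disk */
    };

    /* In-core copy of the allocation bitmap; flushed back to disk lazily. */
    static uint8_t sfs_bitmap[SFS_BITMAP_BYTES];

    static void sfs_free_block(uint32_t blk)
    {
        sfs_bitmap[blk / 8] &= ~(1 << (blk % 8));   /* mark block free */
    }

    /* "unlink": one inode read, then bit-twiddling - no directory to update. */
    int sfs_unlink(int fd, off_t inode_offset)
    {
        struct sfs_inode ino;

        if (pread(fd, &ino, sizeof(ino), inode_offset) != sizeof(ino))
            return -1;
        for (uint32_t i = 0; i < ino.nblocks && i < SFS_MAX_BLOCKS; i++)
            sfs_free_block(ino.block[i]);
        return 0;
    }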

> The whole issue is the efficiency of caching.
>
> 1) It is clear, that all L1 dirs would get into OS cache, and all
> disk ops (seeks) regarding them are avoided.
>
> 2) It is clear that those L2 directories that have been accessed
> recently would also be accessed from cache, but this is already
> not so certain, and depends heavily on squid dir access patterns.
>
> 3) inodes of files are unrealistic to cache, because it takes too
> much ram, so only recently accessed file inodes are cached.
>
> 4) Squid creates its directory structure upon installation, which
> means that all directory files and their inodes are located
> close to each other (UFS clustering).

It's probably worth finding some tools to actually study the contents
of your fs caches there. Drawing from inn experience (which is somewhat
different in some areas, but shares _some_ characteristics of disk access),
filename -> inode -> data is a long, expensive process on a fs that's
being hammered hard enough to matter. The easiest (and most effective)
way around this is to do away with it altogether - refer to file
data in terms of its position on disk, rather than in terms of a pointer
to it (or a pointer to a pointer to it).
That way, we remove the complexity of filenames, directory structures, etc.

(Incidentally, did anyone out there (MOR?) ever try the old inn-inode
linux patch on a squid box?)
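
As a rough illustration of what "refer to data by its position on disk"
buys us (the names and structures below are purely hypothetical, not squid
code): the store entry carries a raw offset and length into the cache
partition, and a read becomes a single pread() - no name lookup at all.

    /* Hypothetical sketch: read an object straight off a raw cache partition.
     * The (offset, length) pair would live in squid's own in-memory store
     * table, so no filename, directory walk or inode lookup is involved. */
    #include <stdint.h>
    #include <stdlib.h>
    #include <unistd.h>

    struct store_location {
        off_t    offset;   /* byte offset of the object on the cache partition */
        uint32_t length;   /* object size in bytes */
    };

    /* Returns a malloc'd buffer with the object data, or NULL on error. */
    char *read_object(int cache_fd, const struct store_location *loc)
    {
        char *buf = malloc(loc->length);
        if (!buf)
            return NULL;
        if (pread(cache_fd, buf, loc->length, loc->offset) != (ssize_t)loc->length) {
            free(buf);
            return NULL;
        }
        return buf;
    }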

[snip lots of good ideas for optimising standard filesystem-based squids]
   
> - We want to make squid's fetch buffers around 32-64K, to allow
> writes to happen in larger blocks.
> - We want to open/write/close in fast succession. For objects that fit
> fully into squid's internal buffers, it is a "crime" to issue open(),
> then sometimes wait up to a few minutes until the first block (8K) arrives,
> and only then write it out to the OS. By that time, the OS fs cache has
> been reused many times and has had to deal with partially filled pages,
> disk blocks, etc. Better to let the squid buffers get paged out to
> swap, because when you later write out the whole buffer, they are
> neatly together.
> - For reads, there's little we can do, so we might want to assign
> different-sized buffers for reads vs. writes to conserve memory.

Stew commented that apparently squid reads in 8K chunks and writes to
the network in 8K chunks at the moment anyway - I haven't looked at that
yet, but it makes sense, given the size distribution of objects.
It _may_ be that increasing the fetch buffers will improve performance
- but if 90% of your objects are <8K in size (figures plucked from air),
then it's not going to help anywhere near as much as might first appear.
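
For what it's worth, here's a minimal sketch (not squid code - the buffer
size and names are assumptions) of the "accumulate the whole object, then
open/write/close in one burst" pattern suggested above. It only pays off
for objects that don't already fit the existing 8K path.

    /* Hypothetical sketch: accumulate a whole (small) object in memory,
     * then hit the filesystem with a single open/write/close burst instead
     * of dribbling out 8K chunks as they arrive from the network. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define FETCH_BUF_SIZE (64 * 1024)   /* assumed 64K fetch buffer */

    struct object_buf {
        char   data[FETCH_BUF_SIZE];
        size_t used;
    };

    /* Called as network data arrives; returns -1 if the object won't fit. */
    int object_append(struct object_buf *ob, const char *chunk, size_t len)
    {
        if (ob->used + len > FETCH_BUF_SIZE)
            return -1;                    /* too big - fall back to streaming */
        memcpy(ob->data + ob->used, chunk, len);
        ob->used += len;
        return 0;
    }

    /* Called once the object is complete: open/write/close in fast succession. */
    int object_flush(const struct object_buf *ob, const char *path)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        ssize_t n = write(fd, ob->data, ob->used);
        close(fd);
        return (n == (ssize_t)ob->used) ? 0 : -1;
    }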

 
> In theory, if all ufs metadata is constantly cached in ram, then to get
> any object onto or from disk, the OS needs a minimal number of disk ops -
> where clustering is efficient and fragmentation allows, a single
> disk operation.
> We CAN cache all data for the L1 and L2 dirs, but we do need some care
> to make sure they are not kicked out of ram.

Or, we can simply blow away the whole L1/L2 dir and metadata information,
store direct pointers to file data in squid's own tables, and reference
accordingly.
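
Something along these lines (purely illustrative - the field names and hash
scheme are made up, not squid's actual swap index): each object's entry in
squid's own tables carries a direct on-disk extent, so a lookup is a table
probe rather than an L1/L2 path traversal.

    /* Hypothetical sketch of squid-side bookkeeping: each cached object's
     * table entry carries its location on the raw cache device directly,
     * replacing the L1/L2/filename indirection entirely. */
    #include <stdint.h>
    #include <sys/types.h>

    struct swap_extent {
        uint8_t  spindle;     /* which cache device */
        off_t    start;       /* byte offset on that device */
        uint32_t length;      /* object size in bytes */
    };

    struct store_entry {
        uint64_t           key_hash;   /* hash of the object's URL/key */
        struct swap_extent where;      /* direct pointer to the data on disk */
        struct store_entry *next;      /* hash-bucket chain */
    };

    #define HASH_BUCKETS 65536

    /* Look up an object: no open(), no directory walk - just a table probe. */
    struct store_entry *store_lookup(struct store_entry **table, uint64_t key_hash)
    {
        for (struct store_entry *e = table[key_hash % HASH_BUCKETS]; e; e = e->next)
            if (e->key_hash == key_hash)
                return e;
        return NULL;
    }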
 
> This sounds good. But
> - would you implement cache in ram? how'd you lock cache from being paged out to swap?
> - would you implement read-ahead?
> - would you implement free-behind?
> - would you implement delayed-writes,
> - clustering?
> - data caching, its LRU?

The caching we rely on squid to handle, as it's already fairly good at that.

> - fsck?

The final plan allows for a consistency check of blocks vs. the allocation
bitmaps. In the case of inconsistencies, blocks/files are simply dumped.
The fsck should also be a fairly quick operation, as far as we can tell.
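
Roughly speaking (again a hypothetical sketch, reusing an assumed simple
inode layout rather than the real on-disk format), the check amounts to
walking the inodes and comparing their block lists against the stored
bitmap; anything that disagrees just gets dumped.

    /* Hypothetical sketch of the consistency check: verify that every block
     * referenced by an inode is marked allocated in the stored bitmap.
     * Disagreements are resolved by dumping the object. */
    #include <stdint.h>

    #define SFS_MAX_BLOCKS 16

    struct sfs_inode {
        uint32_t nblocks;
        uint32_t block[SFS_MAX_BLOCKS];
    };

    static int bit_set(const uint8_t *map, uint32_t blk)
    {
        return map[blk / 8] & (1 << (blk % 8));
    }

    /* Returns the number of inodes that reference free/invalid blocks;
     * the caller would dump those objects and rewrite the bitmap. */
    int sfs_check(const struct sfs_inode *inodes, uint32_t ninodes,
                  const uint8_t *stored_bitmap, uint32_t nblocks_total)
    {
        int bad = 0;
        for (uint32_t i = 0; i < ninodes; i++) {
            for (uint32_t b = 0; b < inodes[i].nblocks && b < SFS_MAX_BLOCKS; b++) {
                uint32_t blk = inodes[i].block[b];
                if (blk >= nblocks_total || !bit_set(stored_bitmap, blk)) {
                    bad++;          /* inode points at a free/invalid block: dump it */
                    break;
                }
            }
        }
        return bad;
    }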

> - work over network (LAN)?

We don't run our squid caches off a network drive, nor do we plan to
anytime in the near future :) Remember, this FS is designed as a
_specific-purpose_ FS - not to be used for anything other than squid,
and probably of most gain to heavily-hit squids at that.

> - spanning over multiple spindles?

Again, squid already has good support for multiple spindles. In all honesty,
I prefer the approach squid already uses - if handled correctly (and
assuming hot-swappable drives), we should be able to invalidate a spindle
in the case of hardware problems, rip the drive out and replace it, then
(currently) SIGHUP squid to bring the new drive on-line - zero downtime,
zero incidental data loss (except for what's on the dead drive, of course).
It'd be nice to be able to bring the new drive in without interrupting
squid's servicing of requests, but that's a project for another day... :)

KevinL