Re: Squid performance wish-list

From: Stewart Forster <slf@dont-contact.us>
Date: Mon, 24 Aug 1998 12:20:41 +1000

> > From the way I look at it, there are many different kinds of sockets:
>
> Stew means file descriptors I hope, because whatever the priority scheme,
> disk io should be highest priority, shouldn't it?

Yep.

> > Active client connections - connections where we are expecting to receive
> > a request on, or send data down.
> >
> > Idle client connections - connections that are waiting for some data to
> > come back from a server
> >
> > Persistent idle client connections - connections we don't expect to see
> > a request on immediately
> >
> > Active server connections - connections we are about to send a request down
> > or are expecting to receive some data back from
> >
> > Persistent idle server connections - connections to servers that are waiting
> > for a client
>
> I don't quite get why you differentiate them like that; I'd rather look at it this way:
> 1) there are FD's that need immediate action, no matter what else is going on.
> 2) there are classes of FD's that need higher priority than others
> 3) there are classes of sessions that need to be throttled back

I was merely pointing out the various types of FD's that exist in Squid
now and their differences. Of course the next step in an implementation
would be to classify them into the areas that you have mentioned.

> From here, I'd write down priority list as I wish to see it:
> 1) disk io, including logs, perhaps different priority for reads and writes
> 2) DNS lookups and other squid-internal interprocess comms
> --> here we could insert squid internal housekeeping and eventRUN()s
> 3) ICP client of peers (sending ICP's to peers and getting replies)
> -- below could be user-configurable from squid.conf
> 4) HTTP(/ftp) downloads from source
> 5) ICP server for peers
> 6) HTTP server for peers
> -- below could be user configurable by ACL's
> 7) HTTP server for clients
> 7..10) different classes of service for clients
>
> OK, so after we have defined priorities one way or another, we need some way
> to enforce that. How? There are perhaps a few different prioritization schemes
> (taking examples from routers comes first to mind), like absolute priority,
> weighted fair, CIR, CBR, ABR, traffic shaping, etc.
>
> One way or another, we have to deal with real time; we just can't service
> each queue "every N-th event in higher-priority queues". So we'd need additional
> stats for every queue (or even every session), based on which we can determine
> actual "quality of service" and based on that decide how to react.
>
> Then comes queue-run, or scheduling of servicing. It definitely depends on
> the selected queueing scheme, but it must have some parameters that differ
> from those implemented in, say, routers:
> 1) squid can't drop packets ;) so it may face accumulating queues.
> 2) all FD's have to be serviced in some timeframe, no matter how low priority.
> (otherwise timeouts would make the cache service suck)
>
> (squid also can't stop servicing some FD for a prolonged time, then service it for
> some time, ie. it can't be stop-go-stop-go. Why? Because tcp queues in the OS
> would blow out, tcp sessions would stall, needing slow start later on, having a far
> worse impact on the real tcp session than the priority scheme intends)
>
> Speculating, I would imagine _one_ example to implement that could be:
> 1) at FD creation time, determine the priority of FD and assign it to some
> queue (array, linked-list, .. )
> 2) Service (ie. poll()) each queue (collection of FD's) at a predetermined rate.
> - if it is a non-limited queue, service all FD's in a row,
> - if it is a limited queue, service FD's in a row until the queue's max is exceeded
> (be it byte count or time spent, for eg), then save the last serviced FD to be
> able to restart from there next time, and exit
> - while servicing each queue, schedule next run time using mSec accuracy.
> ( - if we are past schedule, rerun higher priority queues before going to next
> lower queue)
> Of course, schedule highest priority queues to be run every time.
>
> This would be like a constant bitrate scheme, not quite suited for squid I guess,
> but it still might be needed for some cases. For eg, a queue or a few could be
> assigned to be CBR queues (eg. to throttle back porn traffic, or some customer
> who has subscribed for a given bitrate)
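
To make that concrete, here's a rough sketch of the per-queue quota and
next-run bookkeeping described above (purely illustrative: the types, names
and the service_fd() helper are made up for the example, they're not Squid
code):

/* Illustrative only: each queue carries a quota and a next-run time. */
#include <sys/select.h>
#include <sys/time.h>

typedef struct {
    int    fds[FD_SETSIZE];    /* FD's assigned to this queue */
    int    nfds;
    int    resume;             /* where to pick up if we hit the quota */
    long   quota;              /* max bytes to service per run, 0 = unlimited */
    long   interval_ms;        /* how often this queue should be run */
    struct timeval next_run;   /* scheduled time of the next service */
} fd_queue;

/* Hypothetical: read/write one FD, return bytes moved. */
extern long service_fd(int fd);

static int
queue_due(fd_queue *q, struct timeval *now)
{
    return !timercmp(now, &q->next_run, <);     /* now >= next_run */
}

static void
queue_service(fd_queue *q, struct timeval *now)
{
    long used = 0;
    int i;

    for (i = 0; i < q->nfds; i++) {
        int slot = (q->resume + i) % q->nfds;
        used += service_fd(q->fds[slot]);
        if (q->quota > 0 && used >= q->quota) {
            q->resume = (slot + 1) % q->nfds;   /* restart here next time */
            break;
        }
    }
    /* reschedule with millisecond accuracy */
    q->next_run = *now;
    q->next_run.tv_usec += q->interval_ms * 1000;
    while (q->next_run.tv_usec >= 1000000) {
        q->next_run.tv_usec -= 1000000;
        q->next_run.tv_sec++;
    }
}

Highest-priority queues would simply have interval_ms = 0 and quota = 0 so
they get serviced on every pass.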
>
> Of course, it would be nice to define "queue" in the first place, but it seems
> to be quite tricky.
>
> 1) You can keep a few separate collections of FD's and poll them separately,
> - but then you can block in the poll() syscall for too long.
> - Reducing the poll() timeout would increase the syscall rate and context switches,
> burning cpu wastefully.
> - You could set the poll() timeout to be the time of the next earliest queue's
> scheduled run.
> - it is still very difficult to continue as "best effort service"; it's too
> bound to be time-driven rather than event-driven.
>
> 2) You could poll all open FD's at once, but only service in priority order.
> - first those that are highest priority and need attention every time
> - then those that meet some criteria defined elsewhere (like the quota and
> next run-time of the queue to which the FD belongs, etc.)
> - then, after a repoll of FD's and servicing the highest priority, continue with
> the lowest priority FD's at leisure.
>
> 3) You could scan all FD's and poll only those appropriate for the current timeframe.
> - include highest priority FD's every time
> - include those FD's whose scheduled time is in the past (missed guys)
> - what else.
> - then service them in whatever order seems nice and reschedule according to
> policy.
>
> Of course, implementing some sort of configurable policies would go way up
> in complexity, but some selection should be in place. It is pretty difficult
> even to select what should be there, let alone how to implement it.
>
> Therefore, it would be really lovely if you guys would express your thoughts
> on whether at all, what at least, what ideally, and perhaps also how you would
> implement queueing?
>
> Personally, I'd like to have these types:
> 1) ACL-based assignment to different classes or queues or priorities (COS)
> 2) ACL-based rate control of those classes or individual IP's (sessions) (QOS)
>
> then:
> 1) ACL-based throttling of a specific IP, URL, etc., that is, per-session bitrate.
> 2) ACL-based rate enforcement for a class that can represent a collection
> of client IP's, URL's, domains, etc.
>
> Eventually, if we can control how we allocate priorities to differing types
> of events, we can optimise and control how squid behaves in case of overload.

An excellent insight into the problem, Andres!

My approach to problems is, as always, not to think too much about what
might not be possible or how hard it is, but just to do it in the simplest
way possible that caters for the most cases. After that we can fine-tune
it with added extras, but for now a simple solution would be as I mentioned
in my previous mail.
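
For what it's worth, the simplest shape I can see for it - classify each FD
into a small set of priority classes when it's created, poll everything once,
and service the results in class order (much like your option 2 above) -
would look roughly like the sketch below. Purely illustrative; the class
names, the fd_class[] table and the fd_handle_event() handler are made up
for the example.

/* Illustrative only: priority classes and one poll() pass serviced in order. */
#include <poll.h>

typedef enum {
    FD_PRI_DISK,        /* disk io, including logs */
    FD_PRI_DNS,         /* DNS lookups and other internal IPC */
    FD_PRI_ICP_PEER,    /* ICP to/from peers */
    FD_PRI_SERVER,      /* HTTP/ftp downloads from the source */
    FD_PRI_PEER_HTTP,   /* ICP/HTTP service for peers */
    FD_PRI_CLIENT,      /* HTTP service for clients */
    FD_PRI_CLASSES
} fd_priority;

extern fd_priority fd_class[];          /* assigned when the FD is created */
extern void fd_handle_event(int fd);    /* hypothetical per-FD handler */

static void
service_by_priority(struct pollfd *pfds, int nfds)
{
    int pri, i;

    /* Service every ready FD, highest-priority class first. */
    for (pri = 0; pri < FD_PRI_CLASSES; pri++) {
        for (i = 0; i < nfds; i++) {
            if (fd_class[pfds[i].fd] != (fd_priority) pri)
                continue;
            if (pfds[i].revents & (POLLIN | POLLOUT | POLLHUP | POLLERR))
                fd_handle_event(pfds[i].fd);
        }
    }
}

The O(classes * nfds) inner scan is the obvious thing to tune later (eg. by
keeping per-class FD lists), but it shows the idea.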

> About a year ago I wrote up a poll version that defined a global FD set
> and that was updated at points where handlers are installed. So comm_select
> didn't need to check for all the handlers - they were already in place.
> With poll() it is especially cute, because it doesn't modify requested
> pollfd.events but only returned pollfd.revents.
> At the time the objection was that scanning for warm FD's in each select loop
> has negligible impact on CPU usage and that maintaining a global fd_set
> in sync with real life adds more unneeded complexity.

This is a good possible approach to the FD poll updating problem.
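
The property Andres points out - poll() never touches the requested
pollfd.events, only the returned pollfd.revents - means the table can live
permanently and only needs touching when a handler is installed or removed.
A minimal sketch of that (names and sizes are assumptions for the example,
not actual Squid code):

/* Illustrative persistent pollfd table, updated at handler install time. */
#include <poll.h>

#define MAX_FD 4096                      /* assumption for the example */

static struct pollfd pfd_table[MAX_FD];
static int pfd_biggest = -1;

static void
pfd_init(void)
{
    int i;
    for (i = 0; i < MAX_FD; i++)
        pfd_table[i].fd = -1;            /* poll() skips negative fd's */
}

/* Call this wherever a read/write handler is installed or removed. */
static void
pfd_set_interest(int fd, int want_read, int want_write)
{
    short events = (want_read ? POLLIN : 0) | (want_write ? POLLOUT : 0);

    pfd_table[fd].fd = events ? fd : -1;
    pfd_table[fd].events = events;
    if (fd > pfd_biggest)
        pfd_biggest = fd;
}

/* comm_select() then just polls the table as-is; only .revents changes. */
static int
pfd_poll(int timeout_ms)
{
    return poll(pfd_table, (nfds_t) (pfd_biggest + 1), timeout_ms);
}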

> > > *) Inlining DNS lookups.
> >
> > Is there a major win to come from this?
>
> Yes there is. For every call to the resolver lib, it first creates a UDP socket,
> sends the query, gets a reply and closes the socket. By keeping only one socket open
> inside squid, we avoid at least 2 syscalls and a bunch of context switches
> due to IPC between processes. It also eliminates the need to worry about
> the number of dnsserver processes needed, and allows us to catch delayed
> replies from the DNS server and update the DNS cache even after we gave up
> waiting for the reply.
>
> Btw, Stew, what code? Way back I suggested to use Darren Reed's arlib found
> in every bind contrib. I even got written permission from Reed to modify his
> code to death if needed for squid, but I never had enough time ;)

I just grabbed the DNS RFC from the net and coded up a minimalistic DNS packet
builder/decoder. It doesn't handle every possible option, but then it doesn't
need to, since it is only sending queries of a single type. I haven't looked
at Darren's arlib, but it had been mentioned to me as a possibility.
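
To give an idea of how little is needed, a query encoder along those lines
boils down to roughly the sketch below (an illustrative cut against RFC 1035,
not the exact code I have here). The reply comes back on the one long-lived
UDP socket inside the main comm loop, which is where the syscall and
context-switch savings come from.

/* Illustrative RFC 1035 A-record query encoder. */
#include <string.h>

/* Build a query for 'name' into buf; returns packet length, or -1 on error. */
static int
dns_build_query(const char *name, unsigned short id,
                unsigned char *buf, int buflen)
{
    unsigned char *p = buf;
    const char *label = name;

    if (buflen < 12 + (int) strlen(name) + 2 + 4)
        return -1;

    /* 12-byte header: ID, flags (RD set), QDCOUNT = 1, everything else 0 */
    memset(p, 0, 12);
    p[0] = id >> 8;
    p[1] = id & 0xff;
    p[2] = 0x01;                /* RD: recursion desired */
    p[5] = 0x01;                /* QDCOUNT = 1 */
    p += 12;

    /* QNAME: each label prefixed by its length, terminated by a 0 byte */
    while (*label) {
        const char *dot = strchr(label, '.');
        int len = dot ? (int) (dot - label) : (int) strlen(label);
        if (len <= 0 || len > 63)
            return -1;
        *p++ = (unsigned char) len;
        memcpy(p, label, len);
        p += len;
        label += len;
        if (*label == '.')
            label++;
    }
    *p++ = 0;

    /* QTYPE = A (1), QCLASS = IN (1) */
    *p++ = 0; *p++ = 1;
    *p++ = 0; *p++ = 1;

    return (int) (p - buf);
}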

> > > *) Squid FS.
>
> I'd be really cautious with that. There are so many things that can go wrong
> and render the whole benefit to nothing. And it may take "years"
> to develop a robust (and efficient for all cases) FS.
> By moving the FS inside squid we take on high risks.

True, there's some risk, but also some big gains. It also takes OS-specific
filesystems out of the loop and out of consideration when coding for
greater disk performance, since we have a fixed system with which to work.

> I'd rather focus on changing squid's disk usage patterns in such a way
> that it would make it really easy for ufs and OS to do what we need.
> I believe there are tons of things we can do in that direction.

I believe we're starting to cap out in this area. I understand disk
optimisation very well, and Squid is getting close to the end of what it
can do with regards to UFS without making big assumptions about what an OS
is doing and about the disk layout. As for disk usage patterns, what did
you have in mind for a system that is fairly random? We already use the
temporal and spatial locality inherent in cache operation to make things
work better. I'm open to any suggestions as to what more can be done.
Remember, I'm after quantum leaps in performance, not just 5% here and 10%
there from minor fiddling.

> > > I've done some research on this. Squid would work well with a 4K
>
> I guess you have a few pointers to some performance comparisons and
> differing FS internals analyses? Could you share?

I ran disk access patterns from Squid through various FS designs and
thought about the problems with regard to the need to keep objects around
for possibly longer than a cyclical FS would allow. This still meant
something along the lines of a UFS-style filesystem. I then decided we need
to throw away the directory structure (Squid already has its own index
file - all it needs is an index straight to the file's inode rather than an
indirect filename).

> > > frag/ 8K chunk filesystem. Basically we want to write a stripped
> > > down version of UFS that allows indexing by inode alone. I have a
> > > design here at Connect that works out to an average of 2.5 disk
> > > accesses per new object pulled in, and 1.7 disk accesses per cache
> > > hit, based on a sample taken from our live caches.
> > >
> > > Compare this to an average of approx 7 disk accesses with UFS on
> > > a new object write, and an average of 3.5 disk accesses per cache hit.
>
> How do you measure that?

I plugged disk access traces into some simulations to arrive at the 2.5/1.7
values. For UFS I just watched the number of disk accesses with OS tools
as opposed to what Squid was doing over the same period. It would seem that
UFS does a great deal more disk accesses than it needs to for Squid.

> > > This means about a 250% gain can be had from a customised UFS filesystem.
>
> Where does this gain come from? Could you give more detail?

Okay, my final design was to do this:

4K fragments, 8K chunks.

Each file starts off with a 4K fragment that also contains that file's inode.
The inode simply lists the length of the file and 63 direct block pointers
and 64 single indirect block pointers, for an inode length of 512 bytes (8 byte
pointers). The inode number is also the location of that inode on the disk,
eg. inode 4856 is 4856 * 4K bytes from the start of the filesystem. Most disks
have 512 byte blocks, so working with block devices we can easily find the inode
and the first 3584 (4096 - 512) bytes of data of the file.
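
In C terms the on-disk shape is roughly the sketch below. The struct is
illustrative only - the pointer counts follow the description above, but the
exact field widths and packing are assumptions, not a definitive layout.

/* Illustrative sketch of the on-disk inode described above. */
#include <stdint.h>
#include <sys/types.h>

#define SQFS_FRAG_SIZE 4096      /* 4K fragments, 8K chunks */

struct sqfs_inode {
    uint64_t length;             /* object length in bytes */
    uint64_t direct[63];         /* direct block pointers */
    uint64_t indirect[64];       /* single indirect block pointers */
};

/* The inode number doubles as its disk location, eg. inode 4856 lives
 * 4856 * 4K bytes from the start of the filesystem. */
static off_t
sqfs_inode_offset(uint64_t ino)
{
    return (off_t) (ino * SQFS_FRAG_SIZE);
}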

50% of all squid disk objects are < 3584 bytes (on our system anyway), so
that means 50% of all squid disk object writes will simply be 1 disk access
to write the data plus the inode.
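
So the common-case write path is just a size test before issuing a single
write covering inode plus data, something like this (illustrative only):

/* Illustrative fast-path test: a 512-byte inode at the front of a 4K
 * fragment leaves 4096 - 512 = 3584 bytes for data in the same write. */
#define SQFS_FRAG_SIZE   4096
#define SQFS_INODE_SIZE  512
#define SQFS_FIRST_DATA  (SQFS_FRAG_SIZE - SQFS_INODE_SIZE)   /* 3584 */

static int
sqfs_fits_first_frag(unsigned long object_len)
{
    return object_len <= SQFS_FIRST_DATA;   /* about 50% of objects in our traces */
}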

Writing more data requires more disk accesses (of course), but since squid's
disk objects aren't vitally important, we can delay updating inode block
data and bitmap info for extended periods for disk optimisation.

Plugging object sizes into a filesystem simulator for this system gave back
the numbers I reported.

Like I said, I hope to have this coded up and running this week.

        Stew.