Re: Squid performance wish-list

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Sat, 22 Aug 1998 15:08:02 +0300 (EETDST)

On 21 Aug 98, at 16:05, Michael O'Reilly <michael@metal.iinet.net.au> wrote:

> Stewart Forster <slf@connect.com.au> writes:
> >
> > *) Priority queues in select() loop:

 This is what I'd really love to have: priority queues. Although I'm not
 sure how to do that. I've been thinking about it and it doesn't seem a
 trivial task.

> From the way I look at it, there are many different kinds of sockets:

 Stew means file descriptors, I hope, because whatever the priority scheme,
 disk I/O should be the highest priority, shouldn't it?

> HTTP receive - Already handled specially
> ICP receive - Already handled specially
 
 I'd almost demand splitting ICP into 2 sockets, one for querying remote
 peers, and the other for replying to remote peers. Currently, you have to
 answer all queries from remote peers just to dig the replies that were sent
 to you out of the UDP queues. This takes time and causes reply timeouts
 when you get bursts of incoming requests.
 I'd also split HTTP into 2 sockets at least, one for clients and the other
 for peers. And I mean not just sockets - there is a difference between
 talking to a client and talking to a peer.
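
 To illustrate the ICP split, here's a rough sketch of what I have in mind
 (all names are made up for illustration, this is not existing squid code):
 the server socket stays bound to the advertised icp_port, while our own
 queries go out over a second, ephemeral-port socket, so that socket's
 receive queue holds nothing but replies addressed to us.

   /* Hypothetical sketch: two separate UDP sockets for ICP, so replies to
    * our own queries are never buried behind a burst of peer queries. */
   #include <string.h>
   #include <unistd.h>
   #include <sys/socket.h>
   #include <netinet/in.h>

   static int
   open_udp(unsigned short port)   /* port in host byte order; 0 = ephemeral */
   {
       struct sockaddr_in sin;
       int fd = socket(AF_INET, SOCK_DGRAM, 0);
       if (fd < 0)
           return -1;
       memset(&sin, 0, sizeof(sin));
       sin.sin_family = AF_INET;
       sin.sin_addr.s_addr = htonl(INADDR_ANY);
       sin.sin_port = htons(port);
       if (bind(fd, (struct sockaddr *) &sin, sizeof(sin)) < 0) {
           close(fd);
           return -1;
       }
       return fd;
   }

   int icp_server_fd;   /* answers peers on the advertised icp_port        */
   int icp_client_fd;   /* carries only our queries and the replies to us  */

   void
   icp_open_sockets(unsigned short icp_port)
   {
       icp_server_fd = open_udp(icp_port);
       icp_client_fd = open_udp(0);
   }

 The client socket can then be drained at high priority without ever having
 to wade through queued queries from peers.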

> Active client connections - connections where we are expecting to receive
> a request on, or send data down.
>
> Idle client connections - connections that are waiting for some data to
> come back from a server
>
> Persistent idle client connections - connections we don't expect to see
> a request on immediately
>
> Active server connections - connections we are about to send a request down
> or are expecting to receive some data back from
>
> Persistent idle server connections - connections to servers that are waiting
> for a client

 I don't quite get why you differentiate them like that; I'd rather look at
 it this way:
  1) there are FD's that need immediate action, no matter what else is going on.
  2) there are classes of FD's that need higher priority than others
  3) there are classes of sessions that need to be throttled back

 From here, I'd write down the priority list as I wish to see it (a rough
 encoding sketch follows the list):
 1) disk I/O, including logs, perhaps with different priorities for reads and writes
 2) DNS lookups and other squid-internal interprocess comms
--> here we could insert squid-internal housekeeping and eventRun()s
 3) ICP client of peers (sending ICPs to peers and getting replies)
-- below could be user-configurable from squid.conf
 4) HTTP(/FTP) downloads from source
 5) ICP server for peers
 6) HTTP server for peers
-- below could be user-configurable by ACLs
 7) HTTP server for clients
 7..10) different classes of service for clients
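
 Just to make the classes concrete, a rough encoding could look like this
 (names and ordering are only illustrative, nothing like this exists in
 squid today):

   /* Hypothetical encoding of the priority classes listed above. */
   typedef enum {
       PRI_DISK_IO = 0,    /* 1) disk reads/writes, log writes            */
       PRI_INTERNAL,       /* 2) DNS lookups, internal IPC, housekeeping  */
       PRI_ICP_CLIENT,     /* 3) our ICP queries to peers and the replies */
       PRI_HTTP_ORIGIN,    /* 4) HTTP/FTP downloads from the source       */
       PRI_ICP_SERVER,     /* 5) answering ICP queries from peers         */
       PRI_HTTP_PEER,      /* 6) serving HTTP to peers                    */
       PRI_HTTP_CLIENT,    /* 7) serving HTTP to our own clients          */
       PRI_CLASSES = PRI_HTTP_CLIENT + 4   /* 7..10) extra client classes */
   } fd_priority;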

 OK, so after we have defined priorities one way or another, we need some way
 to enforce them. How? There are a few different prioritization schemes
 (examples from routers come to mind first), like absolute priority,
 weighted fair queueing, CIR, CBR, ABR, traffic shaping, etc.
 
 One way or another, we have to deal with real time; we just can't service
 each queue "every N-th event in higher-priority queues". So we'd need additional
 stats for every queue (or even every session), based on which we can determine
 the actual "quality of service" and decide how to react.

 Then comes the queue-run, or scheduling of servicing. It definitely depends on
 the selected queueing scheme, but it must have some parameters that differ
 from those implemented in, say, routers:
 1) squid can't drop packets ;) so it may face accumulating queues.
 2) all FD's have to be serviced in some timeframe, no matter how low priority.
    (otherwise timeouts would make the cache service suck)

(squid also can't stop servicing some FD for a prolonged time and then service
 it for a while, i.e. it can't be stop-go-stop-go. Why? Because TCP queues in the
 OS would blow up, and TCP sessions would stall and need slow start later on,
 having a far worse impact on the real TCP session than the priority scheme intends.)

 Speculating, I would imagine _one_ example of implementing that could be
 (a rough sketch follows the list):
 1) at FD creation time, determine the priority of the FD and assign it to some
    queue (array, linked-list, ...)
 2) Service (i.e. poll()) each queue (a collection of FD's) at a predetermined rate.
  - if it is a non-limited queue, service all FD's in a row,
  - if it is a limited queue, service FD's in a row until exceeding the queue's max
    (be it byte count or time spent, for example), then save the last serviced FD
    to be able to restart from there next time, and exit
  - while servicing each queue, schedule the next run time with msec accuracy.
( - if we are past schedule, rerun higher priority queues before going to next
    lower queue)
 Of course, schedule highest priority queues to be run every time.
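
 Here is the rough sketch I promised; every name in it is made up for
 illustration (service_fd() especially is just a placeholder for the existing
 handler dispatch), so take it as a picture of the scheme, not a proposed
 patch:

   /* Hypothetical per-queue scheduled servicing: each queue keeps its own
    * pollfd array, a quota, a restart index and a next-run time in msec. */
   #include <stddef.h>
   #include <poll.h>
   #include <sys/time.h>

   extern long service_fd(int fd, short revents);  /* returns bytes moved */

   typedef struct _fd_queue {
       struct pollfd *pfds;     /* FDs belonging to this priority class    */
       int nfds;
       int restart;             /* last serviced index, to resume next run */
       long quota_bytes;        /* 0 = non-limited queue                   */
       long interval_msec;      /* how often this queue is scheduled       */
       long next_run_msec;      /* absolute time of the next scheduled run */
   } fd_queue;

   static long
   now_msec(void)
   {
       struct timeval tv;
       gettimeofday(&tv, NULL);
       return tv.tv_sec * 1000L + tv.tv_usec / 1000L;
   }

   /* Poll one queue, then service ready FDs starting from the saved restart
    * point until the quota (if any) is used up. */
   static void
   queue_run(fd_queue *q)
   {
       long spent = 0;
       int scanned, i;
       if (poll(q->pfds, q->nfds, 0) > 0) {
           for (scanned = 0; scanned < q->nfds; scanned++) {
               i = (q->restart + scanned) % q->nfds;
               if (!q->pfds[i].revents)
                   continue;
               spent += service_fd(q->pfds[i].fd, q->pfds[i].revents);
               if (q->quota_bytes && spent >= q->quota_bytes) {
                   q->restart = (i + 1) % q->nfds;  /* resume here next time */
                   break;
               }
           }
       }
       q->next_run_msec = now_msec() + q->interval_msec;  /* reschedule */
   }

 A main loop would then call queue_run() for every queue whose next_run_msec
 is in the past, always including the highest-priority queues.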
 
 This would be like a constant bitrate scheme, not quite suited for squid I
 guess, but it might still be needed in some cases. For example, a queue or a
 few could be assigned as CBR queues (e.g. to throttle back porn traffic, or
 some customer who has subscribed for a given bitrate).

 Of course, it would be nice to define "queue" in the first place, but it seems
 to be quite tricky.

 1) You can keep a few separate collections of FD's and poll them separately,
  - but then you can block in the poll() syscall for too long.
  - Reducing the poll() timeout would increase the syscall rate and context
    switches, burning CPU wastefully.
  - You could set the poll() timeout to be the time of the next earliest queue's
    scheduled run.
  - it is still very difficult to continue as "best effort service"; it's too
    bound to be time-driven rather than event-driven.

 2) You could poll all open FD's at once, but only service them in priority
    order (a rough sketch of this one follows after option 3).
  - first those that are highest priority and need attention every time
  - then those that meet some criteria defined elsewhere (like the quota and
    next run-time of the queue to which the FD belongs, etc.)
  - then, after a repoll of FD's and servicing the highest priority, continue
    with the lowest priority FD's at leisure.

 3) You could scan all FD's and poll only those appropriate for the current
    timeframe.
  - include the highest priority FD's every time
  - include those FD's whose scheduled time is in the past (missed guys)
  - whatever else.
  - then service them in whatever order seems nice and reschedule according to
    policy.
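
 A rough sketch of option 2, to show what I mean (every name here is made up;
 fd_priority_of(), quota_allows() and service_fd() are just placeholders for
 whatever dispatch squid already does):

   /* Hypothetical: poll every open FD in one call, then walk the ready FDs
    * in priority order, skipping lower classes once their quota is spent. */
   #include <poll.h>

   #define PRI_CLASSES 10                      /* number of classes, illustrative */

   extern struct pollfd all_pfds[];            /* every open FD, kept up to date  */
   extern int all_nfds;
   extern int fd_priority_of(int fd);          /* maps an FD to its class         */
   extern int quota_allows(int pri, long now); /* per-class quota and schedule    */
   extern long service_fd(int fd, short revents);
   extern long now_msec(void);

   void
   comm_poll_prioritized(int timeout_msec)
   {
       int pri, i;
       long now;
       if (poll(all_pfds, all_nfds, timeout_msec) <= 0)
           return;
       now = now_msec();
       for (pri = 0; pri < PRI_CLASSES; pri++) {   /* 0 = highest priority */
           for (i = 0; i < all_nfds; i++) {
               if (!all_pfds[i].revents)
                   continue;
               if (fd_priority_of(all_pfds[i].fd) != pri)
                   continue;
               if (pri > 0 && !quota_allows(pri, now))
                   break;                    /* this class has had its share    */
               service_fd(all_pfds[i].fd, all_pfds[i].revents);
               all_pfds[i].revents = 0;      /* don't service the same FD twice */
           }
       }
   }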

 Of course, implementing some sort of configurable policies would go way up
 in complexity, but some selection should be in place. It is pretty difficult
 even to select what should be there, not to speak of how to implement it.

 Therefore, it would be really lovely if you guys would express your thoughts
 on whether at all, what at least, what ideally, and perhaps also how you
 would implement queueing.

 Personally, I'd like to have these types:
 1) ACL-based assignment to different classes or queues or priorities (COS)
 2) ACL-based rate control of those classes or of individual IP's (sessions) (QOS)

 then:
 1) ACL-based throttling of a specific IP, URL, etc., that is, per-session bitrate.
 2) ACL-based rate enforcement for a class that can represent a collection
    of client IP's, URL's, domains, etc. (a rough token-bucket sketch follows)
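
 The simplest way I can imagine enforcing a per-class (or per-session) bitrate
 is a token bucket; the sketch below is only an illustration, with made-up
 names, of how the select loop could ask "how much may this class move right
 now":

   /* Hypothetical per-class rate control with a token bucket.  A "class" is
    * whatever an ACL assigns the session to (an IP, a URL pattern, a customer). */
   typedef struct {
       long rate_bytes_sec;   /* configured bitrate for this class        */
       long bucket;           /* bytes currently available                */
       long bucket_max;       /* burst allowance                          */
       long last_msec;        /* when the bucket was last refilled        */
   } class_quota;

   /* Refill according to elapsed time and return how many bytes the class may
    * move right now; the poll loop would skip the FD (or shrink the read size)
    * when this returns 0. */
   long
   class_allowance(class_quota *q, long now_msec)
   {
       q->bucket += (now_msec - q->last_msec) * q->rate_bytes_sec / 1000L;
       if (q->bucket > q->bucket_max)
           q->bucket = q->bucket_max;
       q->last_msec = now_msec;
       return q->bucket;
   }

   /* Charge the class for what was actually read or written. */
   void
   class_consume(class_quota *q, long bytes)
   {
       q->bucket -= bytes;
       if (q->bucket < 0)
           q->bucket = 0;
   }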
 
 Eventually, if we can control how we allocate priorities to differing types
 of events, we can optimise and control how squid behaves in case of overload.

> > Idle connections (while not polled explicitly) still have to be scanned each
> > time through the select() loop to see if they have a handler attached. This
> > seems like a waste. We should only be putting sockets onto the select list
> > if they are on a list of possible sockets, not scanning each time through to
> > find which ones.
> >
> > I predict that with an efficient select() loop mechanism about 10% CPU could
> > be saved.
>
> The huge one here is incremental updates of the select/poll
> structures. It's pretty silly to have nice pretty functions that all
> the handler updates go thru, but then every call to comm_select()
> searching the entire list to find which ones have handlers.
> Adding a couple of global fd_set's (pollfd's) which are updated every
> time a handler is added/deleted, and then having comm_select do
> fd_set tmp_read = global_read;

 About a year ago I wrote up a poll version that defined a global FD set
 which was updated at the points where handlers are installed. So comm_select
 didn't need to check all the handlers - they were already in place.
 With poll() this is especially cute, because it doesn't modify the requested
 pollfd.events but only the returned pollfd.revents.
 At the time the objection was that scanning for warm FD's in each select loop
 has a negligible impact on CPU usage and that keeping a global fd_set
 in sync with real life adds unneeded complexity.
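
 For the record, the idea was roughly this (names below are illustrative, not
 the code I actually wrote; in squid the natural place for the update is
 commSetSelect(), where handlers get installed and cleared):

   /* Hypothetical: keep one global pollfd table in sync at handler-install
    * time, so comm_select()/comm_poll() never rescans fd_table for handlers. */
   #include <poll.h>

   #define MAX_FD 4096                        /* illustrative limit             */

   static struct pollfd global_pfds[MAX_FD];  /* indexed by FD for O(1) update  */
   static int biggest_fd = -1;                /* never shrunk here, for brevity */

   void
   fd_table_init(void)
   {
       int i;
       for (i = 0; i < MAX_FD; i++)
           global_pfds[i].fd = -1;            /* poll() ignores negative FDs    */
   }

   /* Called wherever a read/write handler is installed (enable) or cleared. */
   void
   fd_set_interest(int fd, short bit, int enable)   /* bit: POLLIN or POLLOUT */
   {
       if (enable) {
           global_pfds[fd].fd = fd;
           global_pfds[fd].events |= bit;
           if (fd > biggest_fd)
               biggest_fd = fd;
       } else {
           global_pfds[fd].events &= ~bit;
           if (global_pfds[fd].events == 0)
               global_pfds[fd].fd = -1;       /* drop it from polling entirely  */
       }
   }

   /* The select loop hands the very same array to poll() each pass; only
    * revents gets written back, so nothing needs rebuilding. */
   int
   comm_poll_once(int timeout_msec)
   {
       return poll(global_pfds, biggest_fd + 1, timeout_msec);
   }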

> > *) Capping number of sockets handled per select() loop.
> >
> > With a large select loop, it may be possible that it takes a long time to
> > get to the end and call the next eventRun(). This can hinder squid's cleanup
> > and maintenance mechanisms. Perhaps breaking up LARGE select chunks into a
> > few smaller ones would be good.
>
> What are you trying to optimize for here? This sounds like a latency
> optimization. It would probably be a better idea to have the squid
> cleanup/maintenance in a seperate thread or such like.
 I guess not unless squid is rewritten to be fully threaded...

> Even just
> having the wakeups in a seperate thread, and have the comm_select()
> execution loop poll the global variable.

 We need to make comm_select restartable, i.e. it should remember the last FD
 it was servicing after a poll. Then it can exit at any time and on the next
 run continue from where it left off.
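
 A minimal sketch of what "restartable" could look like (names made up, and
 service_fd() again just stands in for the existing handler dispatch):

   /* Hypothetical: remember the index of the last FD handled so the next
    * pass resumes after it instead of always starting from FD 0. */
   #include <poll.h>

   extern struct pollfd pfds[];
   extern int npfds;
   extern long service_fd(int fd, short revents);

   static int resume_at = 0;        /* survives across calls */

   /* Service at most `budget` ready FDs, then return so housekeeping and
    * eventRun() get their turn; the next call picks up where we stopped. */
   void
   comm_service_some(int budget)
   {
       int done = 0, i, slot;
       if (poll(pfds, npfds, 0) <= 0)
           return;
       for (i = 0; i < npfds && done < budget; i++) {
           slot = (resume_at + i) % npfds;
           if (!pfds[slot].revents)
               continue;
           service_fd(pfds[slot].fd, pfds[slot].revents);
           done++;
           resume_at = (slot + 1) % npfds;   /* remember where we left off */
       }
   }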

 Then, if we have collections of classified FD's and priority queues, we
 already have the needed split-up.
 
> Or better yet, just make the cleanup/maintain routines notice how much
> work is actually needed, rather than assuming they'll be called a
> minimum of X times per time period.

 Or just make it another item for the queues, with a defined priority...

> > *) Moving ICP query responder/receiver to a sub-thread.
> >
> > This will be a BIG win by removing the ICP poll out of the main thread and
> > the associated processing that goes with it. This function has two
> > operations.

 I'd avoid that. Giving it a separate thread means giving it maximum priority.
 I do not want some huge peer bombarding my cache to death; I'd want to control
 the rate at which my cache responds to remote queries, i.e. I want the ICP
 server's priority to be _lower_ than that of servicing my customers.
 Of course, getting replies to your own queries is totally another matter ;)
 we're selfish, aren't we? ;)
 But still, I'd like squid to be more internally controlled rather than
 externally, especially when thinking of possible DoS attacks.

> > 1) Sending ICP queries and receiving replies
> > 2) Responding to ICP queries
> >
> > By breaking this out the parent thread won't need to worry about the ICP
> > socket and can just poll (similar to the way it currently does), but it
> > only needs to do a pointer lookup instead of an actual poll().
> >
> > This could gain another 10%+ CPU based on our current ICP handling loads.
> > This load would also translate well to another CPU.
>
> Am I reading this right? What you'd have is a child thread sleeping
> on select or recvfrom or something, and every time it gets a packet,
>
> This should dramatically reduce the number of system calls
> (automatically reducing CPU usage).

 Yep, this is nice, but as said, it may cause problems when you want to
 prioritize what you service first and what next.
 
> > *) Inlining DNS lookups.
> >
> > I have suitable code for building/sending/receiving/decoding of
> > packets sent to a DNS server. It would just need to have a DNS socket
> > open full-time to send DNS requests out/receive DNS requests and do socket
> > waits on that. Alternately this operation could be pushed into a single
> > DNS thread that handles the multiplexing operation of multiple DNS
> > requests.
>
> Is there a major win to come from this?

 Yes there is. For every call to the resolver lib, it first creates a UDP socket,
 sends the query, gets a reply and closes the socket. By keeping only one socket
 open inside squid, we avoid at least 2 syscalls per lookup and a bunch of
 context switches due to IPC between processes. It also eliminates the need to
 worry about the number of dnsserver processes needed, and allows us to catch
 delayed replies from the DNS server and update the DNS cache even after we
 have given up waiting for the reply.
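
 Roughly, what I picture is something like this (a sketch only; it leans on
 the stock resolver library to build the packet, leaves out reply parsing,
 and the function names are made up):

   /* Hypothetical: one persistent DNS socket for the whole process.  Each
    * lookup then costs one send() instead of a full socket()/sendto()/
    * recvfrom()/close() cycle, or IPC to a dnsserver helper. */
   #include <sys/types.h>
   #include <sys/socket.h>
   #include <netinet/in.h>
   #include <arpa/nameser.h>
   #include <resolv.h>
   #include <unistd.h>

   static int dns_fd = -1;     /* opened once, polled like any other squid FD */

   int
   dns_open(void)
   {
       res_init();
       dns_fd = socket(AF_INET, SOCK_DGRAM, 0);
       if (dns_fd < 0)
           return -1;
       /* connect() to the first configured nameserver so plain send()/recv()
        * can be used and stray datagrams from other sources are rejected. */
       if (connect(dns_fd, (struct sockaddr *) &_res.nsaddr_list[0],
                   sizeof(_res.nsaddr_list[0])) < 0) {
           close(dns_fd);
           dns_fd = -1;
           return -1;
       }
       return dns_fd;
   }

   /* Fire off one A-record query; the reply arrives later on dns_fd and gets
    * matched back to the caller by the query ID in the DNS header. */
   int
   dns_send_query(const char *hostname)
   {
       unsigned char buf[PACKETSZ];
       int len = res_mkquery(QUERY, hostname, C_IN, T_A,
                             NULL, 0, NULL, buf, sizeof(buf));
       if (len < 0)
           return -1;
       return send(dns_fd, buf, len, 0) == len ? 0 : -1;
   }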

 Btw, Stew, what code? Way back I suggested using Darren Reed's arlib, found
 in every bind contrib. I even got written permission from Reed to modify his
 code to death if needed for squid, but I never had enough time ;)

> > *) Squid FS.

 I'd be really cautious with that. There are so many things that can go wrong
 and render the whole benefit to nothing. And it may take "years"
 to develop a robust (and efficient for all cases) FS.
 By moving the FS inside squid we take on high risks.

 I'd rather focus on changing squid's disk usage patterns in such a way
 that it becomes really easy for UFS and the OS to do what we need.
 I believe there are tons of things we can do in that direction.

> > I've done some research on this. Squid would work well with a 4K

 I guess you have a few pointers to some performance comparisons and analyses
 of differing FS internals? Could you share?

> > frag/ 8K chunk filesystem. Basically we want to write a stripped
> > down version of UFS that allows indexing by inode alone. I have a
> > design here at Connect that works out to an average of 2.5 disk
> > accesses per new object pulled in, and 1.7 disk accesses per cache
> > hit, based on a sample taken from our live caches.
> >
> > Compare this to an average of approx 7 disk accesses per second with UFS on
> > a new object write, and an average of 3.5 disk accesses per cache hit.

 How do you measure that?

> > This means about a 250% gain can be had from a customised UFS filesystem.

 Where does this gain come from? Could you give more detail?

> > *) Threading squid completely (!)
> >
> > I'll leave this open.
>
> Grin. I suspect this one would be total-rewrite material. If you're
> going to fully thread, then you'd get rid of all the call-back cruft
> that's around at the moment, shrink the code size by a factor of 3,
> and allow for full use of multi-processor machines.

 Would it then be Squid 1.2 anymore? or would it be Octopus 0.1? ;))

 ----------------------------------------------------------------------
  Andres Kroonmaa mail: andre@online.ee
  Network Manager
  Organization: MicroLink Online Tel: 6308 909
  Tallinn, Sakala 19 Pho: +372 6308 909
  Estonia, EE0001 http://www.online.ee Fax: +372 6308 901
 ----------------------------------------------------------------------