Re: Squid performance wish-list

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Fri, 28 Aug 1998 13:00:16 +0300 (EETDST)

On 28 Aug 98, at 11:47, Stewart Forster <slf@connect.com.au> wrote:

> > I agree that directory structure adds overhead, but it also adds an abstraction
> > layer that allows you to use virtually any future FS type provided by a vendor.
> > By bypassing this abstraction, you limit yourself to your FS only.
>
> And why is that a problem? The FS would be an #ifdef'ed thing controlling its
> use at compile time. If you wanted to try it out, just do it. If you don't
> trust it, that's your choice.

 That's not quite the right attitude. I know that you're more of a practical type,
 that you would simply implement what you have planned, that it most probably
 would work right away, and that I'd most probably trust it.
 What worries me slightly is that IF it doesn't fit all situations and people
 need to stay on a classic fs, then squid would not use that fs very efficiently.
 Giving them the choice of either using the new squidfs or going away is not
 quite right.

 I repeat, I don't object to working on a special fs; I agree that it may (will)
 prove far more efficient. I'd simply like to see squid also behave much more
 efficiently on classic FSes, and I believe it could.

> > about 30 urls/sec my disks are doing about 16-20 reads/sec and about 25-35
> > writes/sec, giving an average of about 1-1.5 disk accesses _per URL_.
> >
> > (output of iostat -x 30 for the same timeframe as squid stats above)
> > extended device statistics
> > device r/s w/s kr/s kw/s wait actv svc_t %w %b
> > sd8 16.0 26.8 110.0 198.7 1.3 0.3 37.9 2 30
> >
> > extended device statistics
> > device r/s w/s kr/s kw/s wait actv svc_t %w %b
> > sd8 18.1 26.1 120.5 196.1 1.4 0.3 39.3 1 30
> >
> > I can see that squid did 20+14=34 opens/sec and disks have done 18 reads and 44
> > ops in total. That's fairly efficient. If you sum together all squid disk ops, then
> > we have to face that for about 185 disk ops issued by squid, the system is doing
> > about 44 disk accesses, and that sounds way better than what you claimed for UFS.
>
> Is that 14GB spread across how many disks? Are you using a single 14GB
> spindle? Is sd8 that spindle? Are there more spindles used for caching?

 I'm using a Compaq SMART-2DH controller with 6 spindles in RAID-5, and the OS sees
 them as a single spindle, sd8. All the caching is on this one volume, although there
 is a separate logical drive defined (in the same stripe) for the OS itself.
 To be honest, the logical drive size is 16GB, but there's something wrong with it, and
 Solaris will not allow squid to fill it past 14G, so it's running with 2G unused.
 On the bright side, the OS has plenty of room to avoid fragmentation.

> 38 URLs/sec over a 30 second period.
> 12 x 4GB disks
> extended disk statistics
> disk r/s w/s Kr/s Kw/s wait actv svc_t %w %b
> sd30 4.7 2.9 31.3 22.3 0.0 0.2 31.6 0 8
> sd31 6.1 11.0 59.4 78.7 0.0 0.9 55.2 0 16
[..]
> sd77 4.7 11.3 32.0 84.8 0.0 1.0 59.6 0 16
>
> We see a total of 57.8 reads/sec, and 110.9 writes/sec.
> That comes out to an average of:
>
> 560/30 = 18.66 object reads/sec for 57.8/18.66 = 3.1 disk ops/object read
> 624/30 = 20.8 object writes/sec for 110.9/20.8 = 5.3 disk ops/object write

 Hmm, very strange, especially in light of your box's exceptional specs.
 In fact I'm not sure this comparison is fair, because we are comparing
 different versions of squid: I'm running a modified NOVM.18 while I believe you
 run 1.2 with threads? Then again, there is not much difference in the task;
 the disk usage pattern is the same anyway.

> This gets worse as load increases to the peaks of about 3.5/7.0 I quoted

 BTW, this example makes me wonder whether there is a big difference between
 using a single logical volume (stripe) and a bunch of separate spindles.
 Indeed, the OS has to cache 12 times the superblocks, 12 times the root directories,
 12 times the L1 and L2 dirs, etc. With squid's default L1/L2 setup you most probably
 have at least twice as many L1/L2 dirs as you need.
 Overall, you seem to need about 12 times more buffer cache than I do...
 Given that you have twice the RAM I have, it seems I'm able to cache
 about 5-6 times the metadata, and thus avoid 5-6 times the disk ops you see?
 This really might prove that having a few huge disks is more efficient than
 using lots of small disks? I used to think the other way around.
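
 To put rough numbers on it (assuming the default 16/256 L1/L2 layout, i.e.
 4096 L2 directories per cache_dir; your actual layout may differ):

      1 cache_dir:    1 x 16 x 256 =  4096 L2 directories to keep hot
     12 cache_dirs:  12 x 16 x 256 = 49152 L2 directories to keep hot

 plus 12 sets of superblocks, cylinder group maps and inode tables, which is
 why the buffer cache requirement blows up with many small filesystems.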

> thrashing. Also I've turned `fastfs' on for these filesystems in the
> meantime which batches directory updates at the expense of possible corruption
> in the event of a system failure, and that's why the disk writes have
> dropped off a bit too.
> `fastfs' is not a desirable long-term solution.

 I'd argue with that. I've lived through several system failures with fastfs on, and I'm
 pretty convinced that the benefits it gives outweigh the risk. Yes, it takes
 time to fsck the damaged volume, and it takes squid a while to rebuild the store
 afterwards, but you really expect the box to run without a failure for at least a year
 or so, and even if repairing the volume is too big a task, for squid it
 might be easier to newfs the thing and go on. With fastfs you avoid about
 2 disk accesses per disk object, and that's a lot.
 
> If your system is performing better, I'm happy for you. We however have
> a genuine need for a filesystem that performs better than what we are seeing.

 I understand. If there is no way to optimise your system without a redesign,
 then you have no other option.
 Still, if you implemented all the optimisations that are possible on a classic
 OS fs, you might find that designing a special fs isn't worth the effort...

> You can't dismiss a need just because you can't see one for yourself.

 I saw awful behaviour way back when squid used to select a random directory
 for the next write. Then it was changed to walk all L2 dirs sequentially, which
 improved things a lot. Then I realised that inode and L2 dir caching is essential,
 and I increased my OS caches for these; things got even better. Then I changed
 the fileno selection algorithm even further, so that the OS only has to cache
 those L2 dirs that are actually in use, which is a much more compact subset of
 all the dirs. This also let me avoid most of the unlink calls easily, and that
 improved my performance by yet another step.
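
 Just to illustrate that last idea (this is not the actual code I run, and the
 constants are made up): hand out filenos sequentially so that one L2 directory
 is filled completely before moving on to the next. The OS then only needs to
 keep a handful of "current" directories hot, and a fileno freed by replacement
 can be reused in place instead of unlink()ed:

    #include <stdio.h>

    #define L1_DIRS       16        /* assumed layout, adjust to taste */
    #define L2_DIRS       64
    #define FILES_PER_DIR 256
    #define MAX_FILENO    (L1_DIRS * L2_DIRS * FILES_PER_DIR)

    static int next_fileno = 0;

    /* next fileno to write; consecutive calls stay in the same L2 dir */
    int select_fileno(void)
    {
        int fn = next_fileno;
        next_fileno = (next_fileno + 1) % MAX_FILENO;
        return fn;
    }

    /* map a fileno to its path, filling one L2 dir before touching the next */
    void fileno_to_path(int fn, char *buf, size_t len)
    {
        int dir = fn / FILES_PER_DIR;   /* which L2 dir, globally   */
        int l1  = dir / L2_DIRS;        /* L1 dir it belongs to     */
        int l2  = dir % L2_DIRS;
        snprintf(buf, len, "/cache/%02X/%02X/%08X", l1, l2, fn);
    }

 With something like this the working set of directory blocks and inodes stays
 tiny, and the replacement policy can hand recently freed filenos straight back
 for reuse.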

 Then I decided to go RAID5 instead of several separate spindles, and in the
 light of OS caching this now seems to have been the right decision. Although I
 lose the concurrency of several disks, I seem to win a lot in reduced memory
 need, and I'm able to cache, and thus avoid, many more disk ops...

 So, I'd say I question the need for a new FS not because I can't see the problems,
 but because I can see other ways to solve them...

> > In my view, there is too much work to overcome something that can be and
> > should be fixed by efficient caching. And what needs to be done around squid,
>
> OSes do general-purpose optimisations. For squid at very high loads, OSes
> break.

 Here I disagree.
 It's about the same as a compiler doing general-purpose optimisations: if you use
 weak algorithms, it cannot make things right, and if you want the compiler to do
 cute optimisations, you try not to confuse it.

> I'd rather have a specially designed FS thats consistent for everyone,
> than one I can only get at by poking OS dependant variables.

 Here I agree ;)
 If you come up with a stable squidFS, you'll deserve a monument ;) And if it proves
 more efficient than what I can achieve with OS tuning, then I see no reason
 why I wouldn't switch.

> > This sounds good. But
> > - would you implement cache in ram? how'd you lock cache from being paged out to swap?
>
> Disk buffer cache RAM would be mlock()ed. The amount of this is user
> definable.

 Please note that mlock() needs root privileges, and forcing squid to run as
 root is not a very good idea...
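
 For reference, this is roughly what the locking would look like (just a sketch,
 the function name is mine); without superuser privilege the call simply fails
 with EPERM:

    #include <sys/mman.h>
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>

    /* try to pin the disk-buffer arena into RAM; returns 0 on success */
    int pin_buffer_cache(void *base, size_t len)
    {
        if (mlock(base, len) < 0) {
            /* EPERM here means squid lacks the privilege to lock memory */
            fprintf(stderr, "mlock: %s\n", strerror(errno));
            return -1;
        }
        return 0;
    }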

 I'd rather suggest installing a special background task (thread?) that "touches"
 all the pages every, say, 15 secs. This should avoid pageouts in most cases. BTW,
 why not even implement some sort of "registration" with that task, so that squid
 could "mark" other structures to be kept in RAM as well? This would be a pretty
 lightweight task.
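
 Something along these lines, say (pthreads assumed; all the names are made up):
 structures get registered once, and a low-priority thread reads one byte per
 page every 15 seconds so the pager keeps seeing them as recently used:

    #include <pthread.h>
    #include <unistd.h>
    #include <stddef.h>

    #define MAX_REGIONS 64
    #define PAGE_SIZE   4096            /* better: sysconf(_SC_PAGESIZE) */

    struct region { volatile char *base; size_t len; };
    static struct region regions[MAX_REGIONS];
    static int nregions = 0;

    /* squid calls this once for each structure it wants kept in RAM */
    void keepalive_register(void *base, size_t len)
    {
        if (nregions < MAX_REGIONS) {
            regions[nregions].base = base;
            regions[nregions].len  = len;
            nregions++;
        }
    }

    static void *toucher(void *arg)
    {
        for (;;) {
            int i;
            size_t off;
            for (i = 0; i < nregions; i++)
                for (off = 0; off < regions[i].len; off += PAGE_SIZE)
                    (void) regions[i].base[off];  /* touch one byte per page */
            sleep(15);
        }
        return NULL;
    }

    void keepalive_start(void)
    {
        pthread_t tid;
        pthread_create(&tid, NULL, toucher, NULL);
    }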

> > - work over network (LAN)?
>
> Not relevant. If someone wanted to write a user-level NFS style access to
> the filesystem they could. Do you foresee a need for this?

 Perhaps clusters? But not seriously ;)

 In the end, I'm definitely waiting for your squidfs to come out, and most probably
 I'll have 2 identical squid boxes running side by side. I'd install your work on one
 and try to beat its performance by tuning the other ;) Only then will we really see
 whether it was worth the effort.

 ----------------------------------------------------------------------
  Andres Kroonmaa mail: andre@online.ee
  Network Manager
  Organization: MicroLink Online Tel: 6308 909
  Tallinn, Sakala 19 Pho: +372 6308 909
  Estonia, EE0001 http://www.online.ee Fax: +372 6308 901
 ----------------------------------------------------------------------