Re: System Requirements

From: Karn.k <karn.k@dont-contact.us>
Date: Mon, 7 Aug 2000 09:35:30 +0700

----- Original Message -----
From: Kevin Ruggiero <kdr@cse.Buffalo.EDU>
To: <squid-users@ircache.net>
Sent: Friday, August 04, 2000 9:10 PM
Subject: System Requirements

>
> I'm currently looking into proxying solutions, and I noticed that the
> recommended configuration on the squid site is PII 300 with SCSI
> hard drives (that was listed as of 1998). When I look at the requirements
> for Microsoft Proxy Server, they are actually very low (P133 w/ 64MB RAM
> suggested for supporting 0-300 PCs).
>
> I know sys requirements can be very subjective, and the squid site
> didn't mention how many desktops that would support, but I'm wondering why
> there's such a large gap. Can anyone offer some insight here? What would
> I really need on squid to get good performance while supporting about 100
> PCs?
>
> Thanks.
>
>
> -----
> Kevin Ruggiero -- SUNY at Buffalo
> E-mail: kdr@Buffalo.EDU
> URL : http://www.cse.buffalo.edu/~kdr

Dear Kevin,

I think it depends on how often your users access Squid. My Squid runs on a
486/66 with 64 MB of RAM, and the cache size is about 1.5 GB. It serves
about 300 PCs at roughly 1.5 requests/second, with an average CPU load of
30%. According to the user guide it does not require a lot of RAM, which
matches my experience. I have attached a copy of the Squid hardware
requirements, FYI.

Karn Kanjanarat.
----------------------------------------------------------------------------
Chapter 3. Installing Squid
Table of Contents
Hardware Requirements
Choosing an Operating System
Basic System Setup
Getting Squid
Compiling Squid
Hardware Requirements
Caching stresses certain hardware subsystems more than others. Although the
key to good cache performance is good overall system performance, the
following list is arranged in order of decreasing importance:

Disk random seek time

Amount of system memory

Sustained disk throughput

CPU power

Do not drastically underpower any one subsystem, or performance will suffer.
In case of catastrophic hardware failure you must have a ready supply of
spare parts. When your cache is critical, you should have a (working!)
standby machine with the operating system and Squid installed, kept ready
for nearly instantaneous swap-out. This will, of course, increase your
costs, something you may want to take into account. Chapter 13 covers
standby procedures in detail.

Gathering statistics
When deciding on your cache's horsepower, many factors must be taken into
account. To choose a machine, you need an idea of the load it will have to
sustain: the peak number of requests per minute. This number indicates how
many 'objects' clients download in a minute, and gives a good idea of your
cache load.

Computing the peak number of requests is difficult, since it depends on the
browsing habits of users. This, in turn, makes deciding on the required
hardware difficult. If you don't have many statistics about your Internet
usage, it is probably worth your while installing a test cache server (on
any machine that you have handy) and pointing some of your staff at it.
Using ratios you can then estimate the number of requests for a larger user
base.
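
As a rough sketch of that ratio calculation (all the pilot numbers here are
assumptions, not measurements):

  # Scale a pilot cache's peak request rate up to the full user base.
  pilot_users = 20          # staff pointed at the test cache (assumed)
  pilot_peak_rps = 0.5      # peak requests/second during the pilot (assumed)
  total_users = 300         # eventual user base (assumed)

  estimated_peak = pilot_peak_rps * (float(total_users) / pilot_users)
  print("Estimated peak load: %.1f requests/second" % estimated_peak)  # 7.5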

When gathering statistics, make sure that you measure the 'peak' number of
requests, rather than an average value. You shouldn't simply take the number
of requests per day and divide it down to a per-second rate, since your peak
(during, for example, lunch hour) can be many times your average number of
requests.
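
For example, given a (made-up) set of hourly request counts, the peak hour
can dwarf the average; a quick Python sketch:

  # Peak vs. average requests per hour; the counts below are assumptions.
  hourly_requests = [200, 150, 100, 3000, 9000, 4000, 500, 250]
  peak = max(hourly_requests)
  average = sum(hourly_requests) / float(len(hourly_requests))
  print("peak: %d/hour, average: %.0f/hour" % (peak, average))
  # peak: 9000/hour, average: 2150/hour - size for the peak, not the mean.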

It's a very good idea to over-estimate hardware requirements. Stay ahead of
the growth curve too, since an overloaded cache can spiral out of control
due to a transient network problem. If a cache cannot deal with incoming
requests for some reason (say a DNS outage), it still continues to accept
incoming requests, in the hope that it can deal with them. If no requests
can be handled, the number of concurrent connections will increase at the
rate that new requests arrive.

If your cache runs close to capacity, a temporary glitch can increase the
number of concurrent, waiting requests tremendously. If your cache can't
cope with this number of established connections, it may never be able to
recover, with current connections never being cleared while it tries to deal
with a huge backlog.

Squid 2.0 may be configured to use threads to perform asynchronous
Input/Output on operating systems that support POSIX threads. Including
async-IO can dramatically reduce your cache latency, allowing you to use a
less powerful machine. Unfortunately not all systems implement POSIX threads
correctly, so your choice of hardware can depend on the abilities of your
operating system. Your choice of operating system is discussed in the next
section; check there whether your system supports threads.
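
If your platform does support threads, async-IO is selected when Squid is
compiled. In Squid 2.x this is typically done with a configure flag along
these lines (check ./configure --help for your version, since the exact
option can vary):

  ./configure --enable-async-io
  make
  make install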

Hard Disks
There are numerous things to consider when buying disks. Earlier on we
mentioned the importance of disks with a fast random-seek time, and with
high sustained-throughput. Having the world's fastest drive is not useful,
though, if it holds a tiny amount of data. To cache effectively you need
disks that can hold a significant amount of downloaded data, but that are
fast enough to not slow your cache to a crawl.

Seek time is one of the most important considerations if your cache is going
to be loaded. If you have a look at a disk's documentation there is normally
a random seek time figure. The smaller this value the better: it is the
average time the disk's heads take to move from one random track to another
(in milliseconds). Operating systems do all sorts of interesting things
(which are not covered here) to attempt to speed up disk access times:
waiting for disks can slow a machine down dramatically. These operating
system features make it difficult to estimate how many requests per second
your cache can handle before being slowed by disk access times (rather than
by network speed). In the next few paragraphs we ignore operating system
readahead, inode update seeks and more: it's a back-of-the-envelope
approximation for your use.

If your cache does not use asynchronous Input-Output (described in the
Operating system section shortly) then your cache loses a lot of the
advantage gained from multiple disks. If your cache is going to be loaded
(or is running anywhere near capacity according to the formulae below) you
must ensure that your operating system supports POSIX threads!

A cache with one disk has to seek at least once per request (ignoring RAM
caching of the disk and inode update times). If you have only one disk, the
formula for working out seeks per second (and hence requests per second) is
quite simple:

requests per second = 1000/seek time

Squid load-balances writes between multiple cache disks, so if you have more
than one data disk your seeks-per-second per disk will be lower. Almost all
operating systems spread seeks across multiple disks well, so aggregate seek
capacity increases in a semi-linear fashion as you add more disks, though
some systems impose a small performance penalty. If you add more disks to
the equation, the requests-per-second value becomes even more approximate!
To simplify things in the meantime, we are going to assume that you use only
disks with the same seek time. Our formula thus becomes:

theoretical requests per second = 1000 / ((seek time) / (number of disks))

Let's consider a less theoretical example: I have three disks - all have
12ms seek times. I can thus (theoretically, as always) handle:

requests per second = 1000/(12/3) = 1000/4 = 250 requests per second
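
The same arithmetic as a throwaway Python helper (it ignores everything the
text above ignores: OS caching, readahead, inode updates):

  # Theoretical upper bound on requests/second from disk seeks alone.
  def theoretical_requests_per_second(seek_time_ms, number_of_disks=1):
      return 1000.0 / (seek_time_ms / float(number_of_disks))

  print(theoretical_requests_per_second(12, 3))  # three 12ms disks -> 250.0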

While we are on this topic: many people question the use of IDE disks in
caches. These days IDE disks generally have seek times very similar to SCSI
disks, and (with DMA-capable IDE controllers) can transfer data at
comparable rates without slowing the whole machine down.

Deciding how much disk space to allocate to Squid is difficult. For the
pilot project you can simply allocate a few megabytes, but this is unlikely
to be useful on a production cache.

The amount of disk space required depends on quite a few factors.

Assume that you were to run a cache just for yourself. If you were to
allocate 1 gig of disk, and you browse pages at a rate of 10 megabytes per
day, it will take at least 100 days for you to fill the cache.

You can thus see that the rate of incoming cache queries influences the
amount of disk to allocate.
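
Spelled out, with the figures from the single-user example above:

  # Days to fill a cache: disk size divided by daily browsing volume.
  disk_mb = 1000             # 1 GB allocated to the cache
  browsed_mb_per_day = 10
  print("%d days to fill" % (disk_mb // browsed_mb_per_day))  # -> 100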

If you examine the other end of the scale, where you have 10 megabytes of
disk, and 10 incoming queries per second, you will realize that at this rate
your disk space will not last very long. Objects are likely to be pushed out
of the cache as they arrive, so getting a hit would require two people to be
downloading the object at almost exactly the same time. Note that the latter
is definitely not impossible, but it happens only occasionally on loaded
caches.

The above certainly appears simple, but many people do not extrapolate. The
same relationships govern the expulsion of objects from your cache at larger
cache store sizes. When deciding on the amount of disk space to allocate,
you should determine approximately how much data will pass through the cache
each day. If you are unable to determine this, you could simply use the
theoretical maximum transfer rate of your line as a basis. A 1 megabit per
second line can transfer about 125 000 bytes per second. If all clients were
set up to access the cache, disk would be used at about 125 KB per second,
which translates to about 450 megabytes per hour. If the bulk of your
traffic is transferred during the day, that works out to roughly 3.6
gigabytes over an eight-hour working day. If your line were 100% used,
however, you would probably have upgraded it a while ago, so let's assume
you transfer 2 gigabytes per day. If you wanted to keep ALL data for a day,
you would need 2 gigabytes of disk for Squid.
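
A sketch of that sizing calculation in Python (the 55% utilization figure is
an assumption chosen to land near the 2 GB/day used above):

  # Estimate daily cache disk consumption from line speed.
  line_speed_bits = 1000000                  # 1 megabit/second line
  bytes_per_second = line_speed_bits / 8.0   # ~125 000 bytes/second
  busy_hours = 8                             # bulk of traffic in the workday
  utilization = 0.55                         # assumed: line is not 100% used

  mb_per_hour = bytes_per_second * 3600 / 1e6           # ~450 MB/hour at 100%
  gb_per_day = mb_per_hour * busy_hours * utilization / 1000.0
  print("Roughly %.1f GB/day through the cache" % gb_per_day)  # ~2.0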

The feasibility of caching depends on two or more users visiting the same
page while the object is still on disk. This is quite likely to happen with
large sites (search engines, and browsers' default home pages), but the
chances of a user visiting the same obscure page are slim, simply due to the
volume of pages. In many cases the obscure pages are on the slowest links,
frustrating users. The more users you have requesting pages, the longer you
should keep pages, so that the chance of different users accessing the same
page twice is higher. Determining this value, however, is difficult, since
it also depends on the average object size, which, in turn, depends on user
habits.

Some people use RAID systems on their caches. This can dramatically increase
availability, but a RAID-5 system can reduce disk throughput significantly.
If you are really concerned with uptime, you may find a RAID system useful.
Since the actual data in the cache store is not vital, though, you may
prefer to manually fail-over the cache, simply re-formatting or replacing
drives. Sure, your cache may have a lower hit ratio for a short while, but
you can easily balance this small cost against what the hardware for
automatic failover would have cost you.

You should probably base your purchase on the bandwidth description above,
and use the data discussed in chapter 11 to decide when to add more disk.

RAM requirements
Squid keeps an index of all on-disk objects in RAM. Because of the way that
Squid checks whether objects are in the file store, fast access to this
table is very important. Squid slows down dramatically when parts of the
table are swapped out.

Since Squid is one large process, swapping is particularly bad. If the
operating system has to page in part of Squid's memory, the whole process
blocks on the disk while it waits, and cannot service other established
connections in the meantime.

Each object stored on disk uses about 75 bytes of RAM in the index. The
average size of an object on the Internet is about 13 KB, so if you have a
gigabyte of disk space you will probably store around 80 000 objects.

At 75 bytes of RAM per object, 80 000 objects require about six megabytes of
RAM. If you have 8 GB of disk you will need 48 MB of RAM just for the object
index. It is important to note that this excludes memory for your operating
system, the Squid binary, in-transit objects, and spare RAM for the
operating system's disk cache.
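
The same estimate as a small Python helper (75 bytes per object and the
13 KB average object size are the figures quoted above):

  # Index RAM needed for a given amount of cache disk.
  OBJECTS_PER_GB = 80000        # ~1 GB / 13 KB average object size
  BYTES_PER_INDEX_ENTRY = 75    # approximate in-memory cost per object

  def index_ram_mb(cache_disk_gb):
      return cache_disk_gb * OBJECTS_PER_GB * BYTES_PER_INDEX_ENTRY / 1e6

  print(index_ram_mb(1))   # -> 6.0 MB
  print(index_ram_mb(8))   # -> 48.0 MB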

So, what should the sustained throughput of your disks be? Squid tends to
read in small blocks, so throughput is of lesser importance than random seek
times. Generally disks with fast seek times also have high throughput, and
most disks (even IDE disks these days) can transfer data faster than clients
can download it from you. Don't blow a year's budget on really high-speed
disks; go for lower seek times instead, or add more disks.

CPU Power
Squid is not generally CPU-intensive. On startup Squid can use a lot of CPU
while it works out what is in the cache, and a slow CPU can slow down access
to the cache for the first few minutes after startup. A Pentium 133 machine
generally runs fairly idle while handling 7 TCP requests a second. A
multiprocessor machine generally doesn't increase speed dramatically: only
certain portions of the Squid code are threaded, and those sections are not
processor-intensive either; they are the code paths where Squid is waiting
for the operating system to complete something. A multiprocessor machine
generally does not reduce these wait times: more memory (for caching of
data) and more disks may help more.

----------------------------------------------------------------------------