From: "Andres Kroonmaa" Date: Fri, 10 Jan 1997 23:36:36 +0300 (EETDST) Subject: Threading and Squid ============================================================================== Squid - current problems, future considerations. (for free reading and thinking and comments) Writing of this is inspired by a wish to make squid better even more. Squid is good, but there are always ways to make it better. I have used squid for quite a long time and it serves here quite a big population of users, which totally depend on it, love it, damn it, depending on how it performs. There are some design considerations which if implemented could bring squid to a new level of functionality performance, and robustness. Then again, it could prove to be too much of rework to make it useful. To better express by thoughts, I'd like to start from some general mumble, some of which is nothing new to you, some of which may be new or scary, some of which may be also simply wrong. * * * Currently more and more vendors are starting to provide multiprocessor systems with competetive pricing. More servers are built with 2 or more CPU-s to cope with more complex tasks faster. Just adding more CPU-s gives reasonable performance upgrade if there are lots of independant and separate tasks, but no performance shift for any singlethreaded process. To take advantage of many CPU's, programs must be written to be more parallel. Simply splitting into several proccesses may prove to be quite inefficient in terms of memory requirements, request service times and complexity. Until recently most unix services used fork() to service each separate request, consider classical httpd as an example. Such a design appeares to be very inefficient when pushed by very many short requests per second or many simultaneous requests. Fork is slow, memory consuming copying itself and reserving memory of 1MB average for small process, to only serve a request of few hundred bytes. 100 simultaneous proccesses would need at least 100MB of RAM to serve them without noticable swapping activity. Slow fork times reduces considerably serviced request count per any time unit. Of Web servers many vendors have made a long step forward to reduce the time needed to service short requests, apache preforking httpd as an example. This design saves time on forking but do not in any way reduce memory requirements of running identical proccesses. The more clients one need to service at a time, the more memory is needed, and in fact, wasted. Writing fully non-forking service can give incredible performance boost, as the Harvest/Squid project has shown. Good reading about Harvest cache design, from which Squid is derivered can be found at http://excalibur.usc.edu/cache-html/cache.html http://netweb.usc.edu/danzig/cache-description/harvest_desc.html Strongly recommended to understand Squid internal principles. Squid's principle design is a huge select() loop that waits for incoming request and does all the job. Non-blocking IO is used and main loop dispatches all the data flow between client/cachedisk/httpserver. There is a huge amount of possiblities to take into account, every session must be tracked through its states and this makes Squid VERY complex. If forking code simply relies on OS to do the multitasking and all needed state data is contained with forked proccess, then squid has to keep track and do almost all multitasking within itself. 
Squid tries to minimize time spent in the OS, uses dynamically allocated
memory to keep state data, and "hopes" that it can serve all requests in
the best manner possible. Any single job (thread?) that is run is
expected to finish shortly, or to redispatch its further run for some
later time and return to the select loop. If some job blocks for longer,
the whole process sits doing nothing, and severe delays in serving
requests are introduced. As an example, requesting VM Objects or Objects
via the cachemgr interface makes Squid very busy, and during that time
virtually no traffic occurs between client and cache, or cache and
remote server. This can cause connections to drop or TCP windows to blow
up; in any case it causes trouble.

Unix is a preemptive multitasking system. Squid, viewed as threaded
code, is a cooperatively multitasking process. The problem with that has
long been known: any single piece of broken code can bring the whole
process down, and with it possibly hundreds of sessions. MS Windows and
Netware are designed in this manner, for example. We all know how a
single task in Win3 can bring the whole PC down, and it is hated for
that. Netware is the same; it is only so stable because its usage
patterns are very monotonous and well debugged. In that sense forking
code is more bug tolerant - a bug that crashes a service process will
not affect other sessions (if, of course, the bug is not of a severe
kind, but appears only in special circumstances and rarely enough to
consider the code usable overall). So, for speed one must pay a price -
a Squid crash may mean hundreds of half-finished sessions broken,
wasting possibly hours of retrieval time. Restarting Squid also takes a
considerable amount of time, during which no clients are served.

Still, Squid has some of the same problems as forking services - it uses
external ftpget processes to retrieve ftp objects. Each new ftp request
means forking another ftpget that actually retrieves the needed object.
A busy cache serving possibly thousands of simultaneous retrievals
simply cannot provide enough memory - if we consider 500K average per
process, we would need hundreds of MBs of RAM to do the job without
swapping. Worse yet, one can never fix Squid's memory needs at some
amount; they depend totally on the usage pattern. Our cache has gone
into swap several times because of numerous users retrieving ftp objects
from some cool but slow sites. Being on swap means disastrous disk
paging on almost every request and an incredible increase in service
times, up to tens of seconds to serve a nearby http server. Another
forked part of Squid is dnsserver, which is less of a problem, as it is
preforked a fixed number of times and never changes.

So Squid as a whole actually represents all the design models:
 1. non-forking select   (squid main process)
 2. preforking           (dnsservers)
 3. fork per request     (ftpget)
Quite a nice academic piece of software. ;)

Current Squid problems.

Fork-per-request ftpget. Now what's wrong with that, and what can be
done? First, forking code is quite a big problem because of its memory
usage. Squid itself needs a lot of memory to run fast, keeping the most
used objects and hash tables in RAM. Typical Squid memory usage is well
over 50MB, far more on busy caches. On our relatively small cache, with
about 15000 req/hour and 200 concurrent sessions at peak times, we need
about 100 MB of RAM for Squid alone, plus lots more to retrieve ftp
objects; we have 128MB of RAM now and it is far too little, so we are
going to upgrade to 256MB.
And this box does absolutely nothing else but caching. Maybe some sites
use half a GIG of RAM, but to me memory usage is still a big problem.
Anyway, keeping hundreds of megs of RAM in a cache box just in case is
always hard to accept. But you never know when your users will push your
cache to its limits, and with Squid these limits are not controllable.
If it runs out of memory, it simply exits on the very first error from
the OS, or, if swap space is huge, it slows down to a crawl. Either way,
when this happens you are in deep soup.

It is highly desirable to integrate ftpget into the main Squid process,
making it a non-forking model. It is currently kept separate because
implementing ftp inside Squid would make it even more complex and prone
to severe bugs. Then again, as a separate process it duplicates all its
code in vain instead of reusing it. Implementing ftpget as in Squid,
with a select loop, but as a single separate process, could give both
independence of Squid from ftpget and resource conservation, but it has
the same level of complexity to deal with as integrating it into Squid.
Keeping track of all possible states during an ftp session is not easy,
as http is stateless while ftp is connection oriented. Of course,
anything is possible, but looking at the current ftpget code, which is a
plain retriever, it is hard to believe it would be an easy task.

What about multithreading?

Then again, several unixes have support for multithreading, that is,
multiple virtual processes within a single one. All resources are shared
between threads, as is the actual code. Starting a new thread is
(almost) as fast as calling a subfunction, and yet threads can be made
reasonably independent of each other and very parallel. Scaling on
multi-CPU systems is at its best: switching between processes needs a
rather costly (in time) context switch, while switching between threads
is about as fast as a procedure call. Creating a thread is like making a
virtual fork into a virtual process, so this is really a kind of virtual
fork-per-request model - the model that used to be easy to program,
remember?

Sadly, not many OSes have standardized multithreading support, and
writing Squid with threads may become a porting nightmare. I have had
some experience with SUN Solaris threads. SUN is following the POSIX
threads standard development and already ships its OSes with support for
posix threads, although the standard itself hasn't been finalized yet.
There are also some free posix thread libraries that can be built and
may be useful on other OSes. One that might be of interest can be found
at ftp://sipb.mit.edu/pub/pthreads/

Writing a multithreaded program is in theory very simple. Make one
thread accept connections; on each request create a new thread that will
service it, passing a single argument - the socket descriptor - and wait
for the next request. The classical accept - fork - service - exit, but
using fast thread creation instead of fork, and shared code. When the
created thread is done, it exits. The whole process stays quite small,
as only thread-tracking data and the data used by the threads themselves
are in use; all the code is reentrant and reused by many virtual
processes. All we need is to make sure that no thread interferes with
others, never overwrites data some other thread might need or is using,
and that no thread blocks in an OS library call that cannot be used
reentrantly.
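In posix threads that accept-thread model looks roughly like the sketch
below. Error handling is stripped down and service_request() stands for
whatever per-request work must be done; it is an illustration of the
model, nothing more.

    /* Minimal sketch of the thread-per-request model: classical
     * accept - fork - service - exit, with pthread_create() in
     * place of fork(). Illustrative only. */
    #include <pthread.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <unistd.h>

    extern void service_request(int fd);  /* the actual per-request work */

    static void *request_thread(void *arg)
    {
        int fd = (int)(long)arg;          /* the only argument: the socket */
        service_request(fd);
        close(fd);
        return NULL;                      /* thread exits when done */
    }

    void accept_loop(int listen_fd)
    {
        pthread_t tid;
        int fd;

        for (;;) {
            fd = accept(listen_fd, NULL, NULL);
            if (fd < 0)
                continue;
            /* "fork" a virtual process: cheap compared to fork() */
            if (pthread_create(&tid, NULL, request_thread,
                               (void *)(long)fd) != 0)
                close(fd);                /* could not create a thread */
            else
                pthread_detach(tid);      /* no join; let it clean up */
        }
    }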
If threads are scheduled by the OS and the OS supports reentrant socket
operations, it is lovely to write threads that simply block on read or
write, giving away their cpu time - no need for a select loop, no need
for select's huge overhead with thousands of open files, and no need to
fear select's inability to handle more than 1024 files. Thread
housekeeping is totally different from usual programming and introduces
many new and bizarre situations; it is much harder to debug, but it is
also much faster, and much more interesting possibilities arise.

Writing ftpget as a multithreaded process could probably be done, as it
was already written with possible future integration into the main
Squid process in mind. Then again, if it works as a standalone MT
process, why not integrate it into Squid as a separate thread? The main
logic of Squid is a select loop; all interfacing is done through sockets
or file descriptors. It would definitely be possible to interact with
this thread through yet another FD - in a first attempt just slightly
rewriting ftpget to meet MT-Safe standards and replacing the fork
algorithm with thread creation. But then we end up with a process that
connects to itself, accepts connections from itself and transfers data
from itself to itself. Not very beautiful.

At this point I'm still not very confident with all of Squid's
internals, but it seems to me that making Squid as a whole into a
multithreaded process would require quite a lot of rewriting, as it is
currently not suited for reentrant code. Of course, I may be wrong.

Ideally, some day, Squid could be totally MT (multithreaded) software,
having each request serviced by a separate thread from beginning to end,
with all threads sharing such crucial structures as the metadata and
updating them in a controlled fashion. Only system-imposed limits on
thread count and open files would limit the number of simultaneous
sessions; it would appear more event-driven, and no single session or
event could block the whole service for long. (A bug, of course, still
could.)

The most beneficial aspect of multithreading would be the possibility of
dividing all kinds of tasks into separate, self-contained parts that are
easier to read. Many parts of the code would become simpler to write,
read and debug. Teamwork would be easier. Currently all parts of Squid
are tightly dependent on each other, yet the code is very dispersed and
quite hard to follow. Ironically, Squid, while technically
single-threaded, is already multithreaded in spirit, using its own
dispatching mechanism, which is very hard to follow when reading the
code. Written in MT style, each request would be a virtual process that
runs from a definite beginning to its end, and it should be much simpler
to track and understand its logical flow. Each maintenance job inside
Squid could be made a separate thread, all running in parallel where
possible, making the most of any multi-CPU system. Yeah, dreaming too
much...

Having seen the pros, what about the cons? I'll try to show the main
problems with multithreading; surely there are better gurus who have
written more about MT, maybe they'll speak up - here I give my own
experience. The main problem with MT is communication between threads in
an effective and safe way. Shared data should be updated only within
locks, as should reading it. Threads can be either OS scheduled (bound
to the system timer) or scheduled by the threads library - bound to who
knows what, but usually passing through the thread scheduler whenever
system calls hooked by the library are made.
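For the record, "updated only within locks" means something like the
sketch below, assuming posix mutexes; the "meta" structure is made up
for illustration. Note that the locked region is kept as short as
possible - that is the whole art.

    /* Minimal sketch of lock-protected access to shared data.
     * The metadata structure is hypothetical. */
    #include <pthread.h>

    struct metadata {
        long objects;
        long bytes;
    };

    static struct metadata meta;
    static pthread_mutex_t meta_lock = PTHREAD_MUTEX_INITIALIZER;

    void meta_add_object(long size)
    {
        pthread_mutex_lock(&meta_lock);   /* keep the locked region short! */
        meta.objects++;
        meta.bytes += size;
        pthread_mutex_unlock(&meta_lock);
    }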
The need for extensive locking arises from the totally asynchronous
nature of threads: one never knows exactly when a thread switch occurs.
As there is lots of data that needs locking, considerable overhead from
the thread library itself is to be expected. The code must be designed
from the start to minimize the locking needed, and at the same time to
minimize the code regions where any data is held locked. MT introduces
plenty of trouble with thread deadlocks when designed the wrong way.
Zombie threads, if locked up in particularly bad regions, can even block
the whole process from accessing some system resource. I've seen lockups
that made it impossible to kill -9 the offending process. Although the
process was dead, another thread kept accepting connections, holding the
incoming socket bound and preventing a restart. Bad news for the regular
user: the only way out is to reboot the whole box. And it can take weeks
to find out where the hell the deadlock occurs.

So the main task in designing an MT process is careful access to shared
resources. If that care is taken, MT begins to show its benefits. The MT
library overhead can become quite noticeable; it can be compared in a
way to large database systems, which seem clumsy and slow on small tasks
where much cheaper and smaller systems do better, but which sustain much
more load when pushed to their limits, where the overhead diminishes.

There are lots of system calls that cannot be used concurrently by many
threads inside one process, as the system allocates some resources on a
per-process basis. This can lead to many sections of code that threads
must lock before entering the OS/library, reducing concurrency. Usually
the threads library takes care of serializing access to such resources,
blocking running threads and making sure that only one thread at a time
touches them. There are many calls that are totally unsafe to reenter
from other threads and that are not protected by the threads lib, so
when writing MT code one must know exactly whether a call is safe, and
safe on all supported platforms. The writer must also know whether a
protected library call can provide any concurrency at all; if not, the
algorithm may need a redesign, or there may be no benefit in threading
overall. To illustrate this, consider:

As an exercise, I tried to integrate the dnsservers into Squid as
threads, simply because it was the easiest thing to try. Instead of
forking several separate dnsserver processes, I rewrote the dnsserver
main loop as a procedure and started it off as a thread. All other
principles stayed the same. During initialization the dnsserver thread
waited on accept while the Squid main thread connected to it; they did
the handshake and established a socket connection just as with the
original dnsserver. I started up to 32 dnsservers, and there was not
much difference between 2 and 32 of them in either startup speed or
memory usage; each thread added only a few pages to the total. But when
stress-testing the dns resolving, it soon appeared that dns service
times actually went way up - to as much as 100 secs. Investigation
showed that the resolver library uses static storage, which makes it
unsafe in threaded code. Reentrant alternatives come with the threads
lib, but alas, they provided no concurrency: all 32 threads were making
resolver calls one at a time while the others were suspended in the run
queue, giving me effectively 1 single dnsserver. So integrating
dnsserver using threads proved to be "not so good idea (tm)", at least
with the straightforward approach.
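The core of that experiment looked roughly like the sketch below,
assuming the Solaris-style gethostbyname_r() (signatures differ between
systems); read_name() and write_answer() are hypothetical helpers for
the socket protocol to the main thread. The call is reentrant - no
static storage - yet the library may still serialize the actual lookups
internally, which is exactly the lack of concurrency I observed.

    /* Minimal sketch of the dnsserver-as-thread experiment.
     * Illustrative only; helpers and error handling omitted. */
    #include <netdb.h>
    #include <pthread.h>

    extern int  read_name(int fd, char *name, int len);  /* hypothetical */
    extern void write_answer(int fd, struct hostent *h,
                             int h_err);                 /* hypothetical */

    static void *dnsserver_thread(void *arg)
    {
        int fd = (int)(long)arg;       /* socket to the main thread */
        char name[256], buf[2048];
        struct hostent he, *result;
        int h_err;

        while (read_name(fd, name, sizeof(name)) > 0) {
            /* reentrant, but possibly serialized inside the library */
            result = gethostbyname_r(name, &he, buf, sizeof(buf), &h_err);
            write_answer(fd, result, h_err);
        }
        return NULL;
    }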
The dnsserver was moved to separate processes because of its blocking
nature, which Squid tries to avoid so it can do other tasks; bringing it
back in with threading did not block Squid's main loop, but introduced
blocking calls of a slightly different nature. This sort of gotcha must
be constantly expected.

Actually, there are solutions to the resolver problems. There are free
libraries that implement non-blocking resolver calls (arlib in the
bind-4.9.5 contrib, by Darren Reed) while knowing (and caring) nothing
about multithreading. The idea is simple: for each request the library
returns a handle (actually an open socket descriptor on which the
request has been sent to the nearest dns server), internally keeping
track of all pending requests. It is the coder's responsibility to wait
in a select loop until the dns server responds or the timeout is
reached. When data is received, another library call takes that FD,
grabs the data and parses it. Thus many concurrent dns lookups can be
implemented without blocking in gethostby*(). But hey, this would be
fairly easy to fit into Squid's current select model - so no need to
mess around with threads and passing dns data around through many
sockets and FDs. Right. But Squid's complexity grows and grows. A thread
would not need a select; it would simply block on read and wait for the
socket timeout.

There are other implications of using threads. Threads are not protected
from each other: a bug in one thread can bring down the whole service,
while separate processes are immune from each other. That is the nice
feature of forking code. If some user constantly requests something
nasty that breaks Squid, with forking code he suffers alone, as only the
process serving his request dies; with threads it might blow away
hundreds of sessions by bringing down the only process. But this problem
is present in the current Squid design too.

Squid's current select model has its limits. With ever-growing
functionality, object database and user base, it has to cope with more
and more processing between select sleeps. At some point, if requests
come in faster than it can process the data in between, Squid hits its
speed limit quite hard, and its current design will not benefit from
multiple-CPU systems.

In conclusion, all three models have their positive and negative sides.
Maybe it would be best to mix and match them to get the most benefit
from each. Whether Squid becomes multithreaded or not is surely not the
main point of this writing; it may simply prove to be so much rework
that a separate project would be more appropriate. But threading is
becoming a standard, it has lots of benefits, and hell, it's fun to play
with. So just consider the possibility; at least don't burn your bridges
when writing code, and keep in mind that some day Squid may want to
become multithreaded.

think about it,

-------------------------------------------------------------------
 Andres Kroonmaa                       Telefon: 6308 909
 Network administrator                 E-mail: andre@ml.ee
 Organization: MicroLink Online        Phone: (+372) 6308 909
 EE0001, Estonia, Tallinn, Sakala 19   Fax: (+372) 6308 901
-------------------------------------------------------------------