From: "Andres Kroonmaa" Date: Fri, 10 Jan 1997 23:36:36 +0300 (EETDST) Subject: Threading and Squid ============================================================================== Squid - current problems, future considerations. (for free reading and thinking and comments) Writing of this is inspired by a wish to make squid better even more. Squid is good, but there are always ways to make it better. I have used squid for quite a long time and it serves here quite a big population of users, which totally depend on it, love it, damn it, depending on how it performs. There are some design considerations which if implemented could bring squid to a new level of functionality performance, and robustness. Then again, it could prove to be too much of rework to make it useful. To better express by thoughts, I'd like to start from some general mumble, some of which is nothing new to you, some of which may be new or scary, some of which may be also simply wrong. * * * Currently more and more vendors are starting to provide multiprocessor systems with competetive pricing. More servers are built with 2 or more CPU-s to cope with more complex tasks faster. Just adding more CPU-s gives reasonable performance upgrade if there are lots of independant and separate tasks, but no performance shift for any singlethreaded process. To take advantage of many CPU's, programs must be written to be more parallel. Simply splitting into several proccesses may prove to be quite inefficient in terms of memory requirements, request service times and complexity. Until recently most unix services used fork() to service each separate request, consider classical httpd as an example. Such a design appeares to be very inefficient when pushed by very many short requests per second or many simultaneous requests. Fork is slow, memory consuming copying itself and reserving memory of 1MB average for small process, to only serve a request of few hundred bytes. 100 simultaneous proccesses would need at least 100MB of RAM to serve them without noticable swapping activity. Slow fork times reduces considerably serviced request count per any time unit. Of Web servers many vendors have made a long step forward to reduce the time needed to service short requests, apache preforking httpd as an example. This design saves time on forking but do not in any way reduce memory requirements of running identical proccesses. The more clients one need to service at a time, the more memory is needed, and in fact, wasted. Writing fully non-forking service can give incredible performance boost, as the Harvest/Squid project has shown. Good reading about Harvest cache design, from which Squid is derivered can be found at http://excalibur.usc.edu/cache-html/cache.html http://netweb.usc.edu/danzig/cache-description/harvest_desc.html Strongly recommended to understand Squid internal principles. Squid's principle design is a huge select() loop that waits for incoming request and does all the job. Non-blocking IO is used and main loop dispatches all the data flow between client/cachedisk/httpserver. There is a huge amount of possiblities to take into account, every session must be tracked through its states and this makes Squid VERY complex. If forking code simply relies on OS to do the multitasking and all needed state data is contained with forked proccess, then squid has to keep track and do almost all multitasking within itself. 
Squid tries to minimize time spent in the OS, uses dynamically allocated
memory to keep state data, and "hopes" that it can serve all requests in
the best manner possible. Any single job (thread?) that is run is
expected to finish shortly, or to redispatch its further run for some
later time and return to the select loop. If some job blocks for longer,
the whole process sits doing nothing, and severe delays in serving
requests are introduced. As an example, requesting VM Objects or Objects
via the cachemgr interface makes Squid very busy, and during that time
virtually no traffic occurs between client and cache, or cache and
remote server. This can cause connections to drop or TCP windows to blow
up; in any case it causes trouble.

Unix is a preemptive multitasking system. Squid, viewed as threaded
code, is a cooperatively multitasking process. The problem with that has
long been known: any single piece of broken code can bring the whole
process down, and with it possibly hundreds of sessions. MS Windows and
Netware are designed in this manner, for example. We all know how a
single task in Win3 can bring the whole PC down, and it is hated for
that. Netware is the same; it is only so stable because its usage
patterns are very monotonous and well debugged. In that sense forking
code is more bug tolerant - a bug that crashes a service process will
not affect other sessions (if, of course, the bug is not of a severe
kind, but appears only in special circumstances and rarely enough to
consider the code usable overall). So, for speed one must pay a price -
a Squid crash may mean hundreds of half-finished sessions broken,
wasting possibly hours of retrieval time. Restarting Squid also takes a
considerable amount of time, during which no clients are served.

Still, Squid has some of the same problems as forking services - it uses
external ftpget processes to retrieve ftp objects. Each new ftp request
means forking another ftpget that actually retrieves the needed object.
A busy cache serving possibly thousands of simultaneous retrievals
simply cannot provide enough memory - if we consider 500K average per
process, we would need hundreds of MBs of RAM to do the job without
swapping. Worse yet, one can never fix Squid's memory needs at some
amount; they depend totally on the usage pattern. Our cache has gone
into swap several times because of numerous users retrieving ftp objects
from some cool but slow sites. Being on swap means disastrous disk
paging on almost every request and an incredible increase in service
times, up to tens of seconds to serve a nearby http server. Another
forked part of Squid is dnsserver, which is less of a problem, as it is
preforked a fixed number of times and never changes.

So Squid as a whole actually represents all the design models:
 1. non-forking select   (squid main process)
 2. preforking           (dnsservers)
 3. fork per request     (ftpget)
Quite a nice academic piece of software. ;)

Current Squid problems.

Fork-per-request ftpget. Now what's wrong with that, and what can be
done? First, forking code is quite a big problem because of its memory
usage. Squid itself needs a lot of memory to run fast, keeping the most
used objects and hash tables in RAM. Typical Squid memory usage is well
over 50MB, far more on busy caches. On our relatively small cache, with
about 15000 req/hour and 200 concurrent sessions at peak times, we need
about 100 MB of RAM for Squid alone, plus lots more to retrieve ftp
objects; we have 128MB of RAM now and it is far too little, so we are
going to upgrade to 256MB.
And this box does absolutely nothing else but caching. Maybe some sites
use half a GIG of RAM, but to me memory usage is still a big problem.
Anyway, keeping hundreds of megs of RAM in a cache box just in case is
always hard to accept. But you never know when your users will push your
cache to its limits, and with Squid these limits are not controllable.
If it runs out of memory, it simply exits on the very first error from
the OS, or, if swap space is huge, it slows down to a crawl. Either way,
when this happens you are in deep soup.

It is highly desirable to integrate ftpget into the main Squid process,
making it a non-forking model. It is currently kept separate because
implementing ftp inside Squid would make it even more complex and prone
to severe bugs. Then again, as a separate process it duplicates all its
code in vain instead of reusing it. Implementing ftpget as in Squid,
with a select loop, but as a single separate process, could give both
independence of Squid from ftpget and resource conservation, but it has
the same level of complexity to deal with as integrating it into Squid.
Keeping track of all possible states during an ftp session is not easy,
as http is stateless while ftp is connection oriented. Of course,
anything is possible, but looking at the current ftpget code, which is a
plain retriever, it is hard to believe it would be an easy task.

What about multithreading?

Then again, several unixes have support for multithreading, that is,
multiple virtual processes within a single one. All resources are shared
between threads, as is the actual code. Starting a new thread is
(almost) as fast as calling a subfunction, and yet threads can be made
reasonably independent of each other and very parallel. Scaling on
multi-CPU systems is at its best: switching between processes needs a
rather costly (in time) context switch, while switching between threads
is about as fast as a procedure call. Creating a thread is like making a
virtual fork into a virtual process, so this is really a kind of virtual
fork-per-request model - the model that used to be easy to program,
remember?

Sadly, not many OSes have standardized multithreading support, and
writing Squid with threads may become a porting nightmare. I have had
some experience with SUN Solaris threads. SUN is following the POSIX
threads standard development and already ships its OSes with support for
posix threads, although the standard itself hasn't been finalized yet.
There are also some free posix thread libraries that can be built and
may be useful on other OSes. One that might be of interest can be found
at ftp://sipb.mit.edu/pub/pthreads/

Writing a multithreaded program is in theory very simple. Make one
thread accept connections; on each request create a new thread that will
service it, passing a single argument - the socket descriptor - and wait
for the next request. The classical accept - fork - service - exit, but
using fast thread creation instead of fork, and shared code. When the
created thread is done, it exits. The whole process stays quite small,
as only thread-tracking data and the data used by the threads themselves
are in use; all the code is reentrant and reused by many virtual
processes. All we need is to make sure that no thread interferes with
others, never overwrites data some other thread might need or is using,
and that no thread blocks in an OS library call that cannot be used
reentrantly.
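In posix threads that accept-thread model looks roughly like the sketch
below. Error handling is stripped down and service_request() stands for
whatever per-request work must be done; it is an illustration of the
model, nothing more.

    /* Minimal sketch of the thread-per-request model: classical
     * accept - fork - service - exit, with pthread_create() in
     * place of fork(). Illustrative only. */
    #include <pthread.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <unistd.h>

    extern void service_request(int fd);  /* the actual per-request work */

    static void *request_thread(void *arg)
    {
        int fd = (int)(long)arg;          /* the only argument: the socket */
        service_request(fd);
        close(fd);
        return NULL;                      /* thread exits when done */
    }

    void accept_loop(int listen_fd)
    {
        pthread_t tid;
        int fd;

        for (;;) {
            fd = accept(listen_fd, NULL, NULL);
            if (fd < 0)
                continue;
            /* "fork" a virtual process: cheap compared to fork() */
            if (pthread_create(&tid, NULL, request_thread,
                               (void *)(long)fd) != 0)
                close(fd);                /* could not create a thread */
            else
                pthread_detach(tid);      /* no join; let it clean up */
        }
    }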
If threads are scheduled by the OS and the OS supports reentrant socket
operations, it is lovely to write threads that simply block on read or
write, giving away their cpu time - no need for a select loop, no need
for select's huge overhead with thousands of open files, and no need to
fear select's inability to handle more than 1024 files. Thread
housekeeping is totally different from usual programming and introduces
many new and bizarre situations; it is much harder to debug, but it is
also much faster, and much more interesting possibilities arise.

Writing ftpget as a multithreaded process could probably be done, as it
was already written with possible future integration into the main
Squid process in mind. Then again, if it works as a standalone MT
process, why not integrate it into Squid as a separate thread? The main
logic of Squid is a select loop; all interfacing is done through sockets
or file descriptors. It would definitely be possible to interact with
this thread through yet another FD - in a first attempt just slightly
rewriting ftpget to meet MT-Safe standards and replacing the fork
algorithm with thread creation. But then we end up with a process that
connects to itself, accepts connections from itself and transfers data
from itself to itself. Not very beautiful.

At this point I'm still not very confident with all of Squid's
internals, but it seems to me that making Squid as a whole into a
multithreaded process would require quite a lot of rewriting, as it is
currently not suited for reentrant code. Of course, I may be wrong.

Ideally, some day, Squid could be totally MT (multithreaded) software,
having each request serviced by a separate thread from beginning to end,
with all threads sharing such crucial structures as the metadata and
updating them in a controlled fashion. Only system-imposed limits on
thread count and open files would limit the number of simultaneous
sessions; it would appear more event-driven, and no single session or
event could block the whole service for long. (A bug, of course, still
could.)

The most beneficial aspect of multithreading would be the possibility of
dividing all kinds of tasks into separate, self-contained parts that are
easier to read. Many parts of the code would become simpler to write,
read and debug. Teamwork would be easier. Currently all parts of Squid
are tightly dependent on each other, yet the code is very dispersed and
quite hard to follow. Ironically, Squid, while technically
single-threaded, is already multithreaded in spirit, using its own
dispatching mechanism, which is very hard to follow when reading the
code. Written in MT style, each request would be a virtual process that
runs from a definite beginning to its end, and it should be much simpler
to track and understand its logical flow. Each maintenance job inside
Squid could be made a separate thread, all running in parallel where
possible, making the most of any multi-CPU system. Yeah, dreaming too
much...

Having seen the pros, what about the cons? I'll try to show the main
problems with multithreading; surely there are better gurus who have
written more about MT, maybe they'll speak up - here I give my own
experience. The main problem with MT is communication between threads in
an effective and safe way. Shared data should be updated only within
locks, as should reading it. Threads can be either OS scheduled (bound
to the system timer) or scheduled by the threads library - bound to who
knows what, but usually passing through the thread scheduler whenever
system calls hooked by the library are made.
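For the record, "updated only within locks" means something like the
sketch below, assuming posix mutexes; the "meta" structure is made up
for illustration. Note that the locked region is kept as short as
possible - that is the whole art.

    /* Minimal sketch of lock-protected access to shared data.
     * The metadata structure is hypothetical. */
    #include <pthread.h>

    struct metadata {
        long objects;
        long bytes;
    };

    static struct metadata meta;
    static pthread_mutex_t meta_lock = PTHREAD_MUTEX_INITIALIZER;

    void meta_add_object(long size)
    {
        pthread_mutex_lock(&meta_lock);   /* keep the locked region short! */
        meta.objects++;
        meta.bytes += size;
        pthread_mutex_unlock(&meta_lock);
    }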
The need for extensive locking arises from the totally asynchronous
nature of threads: one never knows exactly when a thread switch occurs.
As there is lots of data that needs locking, considerable overhead from
the thread library itself is to be expected. The code must be designed
from the start to minimize the locking needed, and at the same time to
minimize the code regions where any data is held locked. MT introduces
plenty of trouble with thread deadlocks when designed the wrong way.
Zombie threads, if locked up in particularly bad regions, can even block
the whole process from accessing some system resource. I've seen lockups
that made it impossible to kill -9 the offending process. Although the
process was dead, another thread kept accepting connections, holding the
incoming socket bound and preventing a restart. Bad news for the regular
user: the only way out is to reboot the whole box. And it can take weeks
to find out where the hell the deadlock occurs.

So the main task in designing an MT process is careful access to shared
resources. If that care is taken, MT begins to show its benefits. The MT
library overhead can become quite noticeable; it can be compared in a
way to large database systems, which seem clumsy and slow on small tasks
where much cheaper and smaller systems do better, but which sustain much
more load when pushed to their limits, where the overhead diminishes.

There are lots of system calls that cannot be used concurrently by many
threads inside one process, as the system allocates some resources on a
per-process basis. This can lead to many sections of code that threads
must lock before entering the OS/library, reducing concurrency. Usually
the threads library takes care of serializing access to such resources,
blocking running threads and making sure that only one thread at a time
touches them. There are many calls that are totally unsafe to reenter
from other threads and that are not protected by the threads lib, so
when writing MT code one must know exactly whether a call is safe, and
safe on all supported platforms. The writer must also know whether a
protected library call can provide any concurrency at all; if not, the
algorithm may need a redesign, or there may be no benefit in threading
overall. To illustrate this, consider:

As an exercise, I tried to integrate the dnsservers into Squid as
threads, simply because it was the easiest thing to try. Instead of
forking several separate dnsserver processes, I rewrote the dnsserver
main loop as a procedure and started it off as a thread. All other
principles stayed the same. During initialization the dnsserver thread
waited on accept while the Squid main thread connected to it; they did
the handshake and established a socket connection just as with the
original dnsserver. I started up to 32 dnsservers, and there was not
much difference between 2 and 32 of them in either startup speed or
memory usage; each thread added only a few pages to the total. But when
stress-testing the dns resolving, it soon appeared that dns service
times actually went way up - to as much as 100 secs. Investigation
showed that the resolver library uses static storage, which makes it
unsafe in threaded code. Reentrant alternatives come with the threads
lib, but alas, they provided no concurrency: all 32 threads were making
resolver calls one at a time while the others were suspended in the run
queue, giving me effectively 1 single dnsserver. So integrating
dnsserver using threads proved to be "not so good idea (tm)", at least
with the straightforward approach.
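The core of that experiment looked roughly like the sketch below,
assuming the Solaris-style gethostbyname_r() (signatures differ between
systems); read_name() and write_answer() are hypothetical helpers for
the socket protocol to the main thread. The call is reentrant - no
static storage - yet the library may still serialize the actual lookups
internally, which is exactly the lack of concurrency I observed.

    /* Minimal sketch of the dnsserver-as-thread experiment.
     * Illustrative only; helpers and error handling omitted. */
    #include <netdb.h>
    #include <pthread.h>

    extern int  read_name(int fd, char *name, int len);  /* hypothetical */
    extern void write_answer(int fd, struct hostent *h,
                             int h_err);                 /* hypothetical */

    static void *dnsserver_thread(void *arg)
    {
        int fd = (int)(long)arg;       /* socket to the main thread */
        char name[256], buf[2048];
        struct hostent he, *result;
        int h_err;

        while (read_name(fd, name, sizeof(name)) > 0) {
            /* reentrant, but possibly serialized inside the library */
            result = gethostbyname_r(name, &he, buf, sizeof(buf), &h_err);
            write_answer(fd, result, h_err);
        }
        return NULL;
    }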
The dnsserver was moved to separate processes because of its blocking
nature, which Squid tries to avoid so it can do other tasks; bringing it
back in with threading did not block Squid's main loop, but introduced
blocking calls of a slightly different nature. This sort of gotcha must
be constantly expected.

Actually, there are solutions to the resolver problems. There are free
libraries that implement non-blocking resolver calls (arlib in the
bind-4.9.5 contrib, by Darren Reed) while knowing (and caring) nothing
about multithreading. The idea is simple: for each request the library
returns a handle (actually an open socket descriptor on which the
request has been sent to the nearest dns server), internally keeping
track of all pending requests. It is the coder's responsibility to wait
in a select loop until the dns server responds or the timeout is
reached. When data is received, another library call takes that FD,
grabs the data and parses it. Thus many concurrent dns lookups can be
implemented without blocking in gethostby*(). But hey, this would be
fairly easy to fit into Squid's current select model - so no need to
mess around with threads and passing dns data around through many
sockets and FDs. Right. But Squid's complexity grows and grows. A thread
would not need a select; it would simply block on read and wait for the
socket timeout.

There are other implications of using threads. Threads are not protected
from each other: a bug in one thread can bring down the whole service,
while separate processes are immune from each other. That is the nice
feature of forking code. If some user constantly requests something
nasty that breaks Squid, with forking code he suffers alone, as only the
process serving his request dies; with threads it might blow away
hundreds of sessions by bringing down the only process. But this problem
is present in the current Squid design too.

Squid's current select model has its limits. With ever-growing
functionality, object database and user base, it has to cope with more
and more processing between select sleeps. At some point, if requests
come in faster than it can process the data in between, Squid hits its
speed limit quite hard, and its current design will not benefit from
multiple-CPU systems.

In conclusion, all three models have their positive and negative sides.
Maybe it would be best to mix and match them to get the most benefit
from each. Whether Squid becomes multithreaded or not is surely not the
main point of this writing; it may simply prove to be so much rework
that a separate project would be more appropriate. But threading is
becoming a standard, it has lots of benefits, and hell, it's fun to play
with. So just consider the possibility; at least don't burn your bridges
when writing code, and keep in mind that some day Squid may want to
become multithreaded.

think about it,

-------------------------------------------------------------------
 Andres Kroonmaa                       Telefon: 6308 909
 Network administrator                 E-mail: andre@ml.ee
 Organization: MicroLink Online        Phone: (+372) 6308 909
 EE0001, Estonia, Tallinn, Sakala 19   Fax: (+372) 6308 901
-------------------------------------------------------------------