Re: eventio and thoughts

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Thu, 20 Sep 2001 20:28:00 +0200

On 17 Sep 2001, at 23:07, Henrik Nordstrom <hno@squid-cache.org> wrote:

> dequeueing is also relatively lightweight.. there is no schedule point
>
> However, independent measurements have shown that the CPU burn
> difference between returning one at a time and all at once is quite
> minimal for a normal but quite optimized application. Exactly why I
> don't know as I don't know the exact measurements being made.

 Probably simply because the time between the two syscalls is so large
 that its overhead is unnoticeable. Or the time between two syscalls is
 so small that the effects of CPU caching don't show up (caches warm).

> To get some perspective on the time involved here, Linux syscalls
> measure as having slightly more than one order of magnitude extra
> overhead compared to a plain function call, or about 0.7 usec on my
> laptop. This is exactly (within +-5%) the time of a 1K memcopy.
>
> (Pentium III mobile, 450 MHz, 256 Kb cache)

 That's pretty good: 315 CPU ticks or 70 FSB (100MHz) ticks, most
 probably bound by FSB throughput. That sets a limit of about 1.4M
 syscalls/sec if doing nothing else. Looks like a lot. But that's pure
 overhead, and I'm not sure how it was derived (tight-loop measurement
 is a bit misleading). Checking for the nonblocking flag or a timeout
 spec only adds to it. Any checks, taken in relation to such a small
 base overhead, add up to considerable extra overhead.
 Also, not all OSes are that good at syscall overhead.

 It's very difficult to measure the total overhead of one-at-a-time
 versus many. I've been playing around while doing the cpuProfiling
 branch and could literally see how wildly things depend on CPU caches.

 I just did a few tests, see the bottom for your pleasure.
 Measurements are in CPU ticks, taken with the rdtsc opcode. Probing is
 done with 2 nested loops. The inner loop is very tight, only repeating
 iterations; the outer loop gathers stats and picks samples.
 I ran the tests with 3 samples, 1 iteration each, so that the effects
 of cold/warm caches can be seen. Note how wildly the max time differs
 from the best time for very many operations. That's because the first
 pass is mostly cache misses, while subsequent passes have warm caches.
 The difference is typically 10 times(!), and typically for calls
 that need to touch some (kernel) memory. To be honest, this
 surprised me more than I expected.
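
 To give an idea of the probe, a stripped-down sketch of this kind of
 rdtsc harness is below (simplified and illustrative, not the actual
 cpuProfiling code; the printf format just mimics the output further
 down):

    #include <stdio.h>
    #include <poll.h>

    /* read the CPU timestamp counter (x86, gcc inline asm) */
    static unsigned long long rdtsc(void)
    {
        unsigned int lo, hi;
        __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
        return ((unsigned long long) hi << 32) | lo;
    }

    int main(void)
    {
        int s, i, samples = 3, iterations = 1;
        unsigned long long t0, dt, best = ~0ULL, worst = 0, total = 0;

        for (s = 0; s < samples; s++) {          /* outer loop: stats  */
            t0 = rdtsc();
            for (i = 0; i < iterations; i++)     /* inner loop: probe  */
                poll(0, 0, 0);
            dt = (rdtsc() - t0) / iterations;
            if (dt < best)  best = dt;
            if (dt > worst) worst = dt;
            total += dt;
        }
        printf("poll(0,0,0) time = %llu ct best, %llu ct avg, %llu ct max, %d cnt\n",
               best, total / samples, worst, samples);
        return 0;
    }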

 So I expect that the 0.7usec pure syscall overhead you mention is
 actually heavily dependent on when any given syscall is made. If the
 CPU caches are flooded between two syscalls, then this time will be
 way higher. This is the effect of locality. That's also why I think
 that we should stay longer in either kernel or userland -
 to exploit locality. The examples below also illustrate very well why
 high-MHz FSB systems, wide datapaths and very large L2/L3 caches
 easily help huge applications like Squid, where a high miss rate is
 the normal and expected thing.

 If we take a 1000 req/sec target and look at how many syscalls Squid
 does per request and how many interrupts have to be handled per TCP
 session, this 1.4M/sec budget doesn't seem so high any more.
 We are doing roughly 70-80 syscalls per TCP miss of an average 11K-size
 transfer, with no disk I/O. On a large scale we can say that we'd do
 about 80K syscalls/sec under a 1K req/sec load. At 0.7 usec each that
 is roughly 56 ms of every second, i.e. about 5-6% of the CPU, which
 seems not too much at all. But if our working set is such that the
 CPU caches are flooded, then consult the tests below and try to
 imagine what the actual hidden overhead might be.

 We can do 2 things: make sure that the caches are not flooded between
 2 successive syscalls of the same type, which is what we are counting
 on to date, or, if that is impossible, try to batch stuff together
 so that we create the suitable conditions knowingly.
 In essence, it is more efficient to sort work by type than to do it
 in random order - like in a typical factory pipeline. ;)

 I admit that batching jobs for the kernel might be slight overkill,
 as it expects too much. Although exactly this seems quite natural,
 probably very nearly the same goal can be achieved by making sure a
 batch of jobs is given to the kernel one call at a time but in fast
 succession.
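
 To illustrate the "one at a time but in fast succession" idea, a rough
 sketch (the pending_write structure and names are made up for
 illustration; errors and short writes are ignored here):

    #include <unistd.h>
    #include <stddef.h>

    struct pending_write {
        int         fd;
        const char *buf;
        size_t      len;
    };

    /* Instead of interleaving heavy userland work with individual
     * write() calls, collect the sockets that became writable and
     * flush them in one tight pass, so the kernel's socket-write path
     * stays cache-warm across the whole batch. */
    static void flush_batch(struct pending_write *p, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            (void) write(p[i].fd, p[i].buf, p[i].len);
    }

 The point is not the code itself but the ordering: all the writes land
 in the kernel back to back, while the caches are still warm from the
 first one.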

> It does seem to be true that the faster the CPU, the higher the
> proportional syscall overhead, but I do not have any conclusive
> measurements on this (too many different variables between my two
> computers, different kernel version and compiler for one thing)

 I'm actually worried by upcoming CPUs: 2-3GHz while the FSB is still
 at 100-400MHz, pipelines getting deeper, and CPU caches still at a
 bare 256K...

> > This all is sensed only when syscall rate is very high, when code
> > leaves process very often doing only very little work at a time in
> > either kernel or userspace. We should stay longer in userspace,
> > preparing several sockets for IO, and then stay longer in kernel,
> > handling IO.
>
> This reasoning I do not buy. The "overhead" of the syscalls should be
> linear to the amount of processing you are doing. There is not less
> amount of syscall per unit of work done in total at low rates than there
> is at high rates. But I buy that the more optimized the code is, the
> more noticeable the per-syscall overhead will be unless you optimize
> these in the same level.

 Yes, but I also try to point out that "overhead" splits into two
 parts: easily accountable, and hardly accountable. The latter is
 mostly due to CPU pipelines being blown, prefetch queues cleared, and
 the L1/L2 caches polluted.
 It's easy to make a test for any given syscall, doing it a few million
 times and measuring the time it takes on average. But this would be
 misleading. The real time depends heavily on the previous code trace,
 right down to how stuff happens to be located in RAM.
 All that we can do about it is to try to increase locality - both
 data locality and code-trace locality.

> > Imagine we had to loop through all FD's in poll_array and poll()
> > each FD individually. This is where we are today with IO queueing.
>
> Which would cost us about 7ms for 10K filedescriptors of raw syscall
> overhead on my laptop, discounting the work poll() needs to do which in
> non-blocking mode should be quite linear to the amount of

 Yes, this seems low. But poll is far from the only thing we'd need
 to do 10K times.
 Besides, if we just poll(), then it's OK. But if we do enough work
 between 2 polls that the caches can no longer be reused from the
 previous poll, then the poll overhead will be a few times higher.

> > dunno, maybe this packing is a job for actual io-model behind the api.
> > It just seems to me that it would be nice if we can pass array of
> > such IOCB's to the api.
>
> Kernel or userspace API?

 Kernel, but I also thought it might make sense for the user API too.

> For userspace it does not make much sense as there generally only is one
> operation known at a time.

 Imho it makes it easy to extend the eventio framework without needing
 to change existing code so much. So to say, backwards compatibility,
 by only adding fields into the IOCB and adding commands. The IOCB is
 also a good candidate for mempools, and passing a single pointer means
 less stack usage.
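
 For illustration, an IOCB along these lines might look roughly like
 the sketch below (field and command names are invented here, not the
 actual eventio definitions):

    #include <stddef.h>

    /* hypothetical IOCB: new commands are new cmd values, new
     * behaviour is new fields appended at the end, so existing
     * callers keep working */
    typedef struct iocb iocb_t;
    struct iocb {
        int      cmd;                /* IOCB_READ, IOCB_WRITE, ...     */
        int      fd;
        void    *buf;
        size_t   len;
        void   (*callback)(iocb_t *done, int status);
        void    *cbdata;             /* caller context, returned as-is */
        iocb_t  *next;               /* lets IOCBs be chained in a batch */
    };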

> For kernel API it makes sense to cut down on the per-syscall overhead
> when per-syscall overhead becomes a major problem. Build a queue of IO
> requests and send them in batches to the kernel which then splits the
> batch again into parallel processing, and returning batches of
> completed events.

 Exactly. This has the potential to vastly reduce the syscall rate.
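
 In terms of the IOCB sketched above, such a kernel interface could be
 imagined as a pair of calls like these (purely hypothetical, no such
 syscalls exist on these platforms):

    /* hand the kernel a whole batch of requests in one trap;
     * returns how many were accepted */
    int nio_submit(iocb_t *requests[], int nreq);

    /* collect up to maxdone completed requests in one trap,
     * waiting at most timeout_ms for the first one */
    int nio_getevents(iocb_t *completed[], int maxdone, int timeout_ms);

 One trap per batch of descriptors, instead of one per descriptor, is
 where the syscall-rate saving would come from.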

> > pthreads. all DNS lookups could be done by one worker-thread, which
> > might setup and handle multiple dnsservers, whatever. ACL-checks
> > that eat CPU could also be done on separate cpu-threads.
>
> inter-thread communication unfortunately isn't very lightweight if you
> need synchronisation between them.

 I think we don't need synchronous synchronisation; we need mutexless
 queueing and dequeueing. In this context, a thread is no different
 from sending batches of requests to the kernel and getting batches of
 completions back some time later. Any syscall is inter-thread comm if
 you like.
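
 A minimal sketch of what I mean by mutexless queueing - a
 single-producer/single-consumer ring where each index has exactly one
 writer, so no mutex is needed (illustrative only; a real SMP version
 needs memory barriers):

    #define QSIZE 256                    /* power of two */

    typedef struct {
        void *slot[QSIZE];
        volatile unsigned head;          /* written only by producer */
        volatile unsigned tail;          /* written only by consumer */
    } spsc_queue;

    static int spsc_push(spsc_queue *q, void *item)  /* producer side */
    {
        unsigned h = q->head;
        if (h - q->tail == QSIZE)
            return 0;                    /* full: caller retries later */
        q->slot[h % QSIZE] = item;
        q->head = h + 1;
        return 1;
    }

    static void *spsc_pop(spsc_queue *q)             /* consumer side */
    {
        unsigned t = q->tail;
        void *item;
        if (t == q->head)
            return 0;                    /* empty */
        item = q->slot[t % QSIZE];
        q->tail = t + 1;
        return item;
    }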

> Because of this I prefer to distribute
> all processing on all CPUs, where all processing of a single request is
> contained on the same CPU / thread. Or put in other words, one main
> thread per processing unit

 Maybe indeed. I'm not trying to push some particular threading model.
 It simply seems to me that if we dedicate threads to specific tasks,
 the way aufs does, we can get away with far fewer headaches than with
 concurrent equal threads all handling whole requests and needing
 synchronised access to shared structs. I hope we could do with less
 redesign. The eventio framework idea makes that possible, but the
 current API does not.

> There are a few areas that need to be synchronised.
> Of these the disk cache and index is the biggest challenge, but it
> should be doable I think.

 Sure. It just might add unexpected overhead, needing a deeper redesign.

> > Suppose that we read from /dev/poll 10 results that notify about 10
> > accepted sockets. We can now check all ACL's in main thread, or we
> > can queue all 10 to separate thread, freeing main thread. If we also
> > had 20 sockets ready for write, we could either handle them in main
> > thread, or enqueue-append (gather) to a special worker-thread.
>
> Or you could have N main threads, each caring for its own
> filedescriptors, and a supervisor directing the main threads to even
> out the load between them. Requires far less syncronisation and
> inter-thread communication and also allows the work done by the main
> thread to scale.

 Agreed, but this needs a full rewrite...

> > Yes. Just seems that it would be easier to start using threads
> > for limited tasks, at least meanwhile.
>
> Not so sure about this..
> What I can tell is that your idea with worker-threads fits very well
> into the eventio API, so once eventio is in place things like this can
> very easily be tried.

 This idea depends on eventio being mature. In fact eventio is what
 sparked my thoughts in this direction. It's why I wonder if we should
 try to generalise the eventio framework a bit further, so that we can
 plug into it stuff not directly related to network or disk I/O.
 Some self-contained functions, like DNS resolving, could be hidden
 behind the API, for example.
 First we need eventio; only then can we start using threads more
 extensively.
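
 As a thought experiment in the IOCB terms sketched earlier (again
 purely illustrative, not a proposal for the actual API), a DNS lookup
 hidden behind the same interface could look like:

    /* hypothetical: a resolver request expressed as just another IOCB;
     * a worker thread behind the API fills in the answer and fires the
     * callback like any other completion */
    static void dns_done(iocb_t *done, int status)
    {
        /* status carries success/failure, done->buf the answer */
    }

    static void lookup_example(void)
    {
        static char   name[] = "www.example.com";
        static iocb_t dns;
        iocb_t       *batch[1] = { &dns };

        dns.cmd      = IOCB_DNS_LOOKUP;  /* invented command value */
        dns.buf      = name;
        dns.len      = sizeof(name);
        dns.callback = dns_done;
        nio_submit(batch, 1);            /* queued with any other I/O */
    }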

 ----

 Syscall timing examples.
 All compiled with -O3 and PPro optimisations.

1) Solaris7, PII-450/100:
We measure 3 samples, 1 iterations each:
             For Loop overhead time = 34 ct best, 84 ct avg, 85 ct max, 3 cnt
                   Gethrtime() time = 65 ct best, 83 ct avg, 119 ct max, 3 cnt
                gettimeofday() time = 2067 ct best, 67519 ct avg, 192831 ct max, 3 cnt
          memset(tobj,0,16384) time = 2942 ct best, 37384 ct avg, 93796 ct max, 3 cnt
   memcpy(obj,obj+16384,16384) time = 27106 ct best, 86359 ct avg, 201909 ct max, 3 cnt
                       clock() time = 697 ct best, 8185 ct avg, 23107 ct max, 3 cnt
                times(&tmsbuf) time = 612 ct best, 624 ct avg, 649 ct max, 3 cnt
                   poll(0,0,0) time = 589 ct best, 1133 ct avg, 2206 ct max, 3 cnt
               poll(fdset,0,0) time = 589 ct best, 791 ct avg, 1196 ct max, 3 cnt
               poll(fdset,1,0) time = 1649 ct best, 5953 ct avg, 12458 ct max, 3 cnt
              poll(fdset,11,0) time = 3782 ct best, 7159 ct avg, 13410 ct max, 3 cnt
              poll(fdset,21,0) time = 5885 ct best, 8841 ct avg, 14523 ct max, 3 cnt
              poll(fdset,31,0) time = 7881 ct best, 10824 ct avg, 16464 ct max, 3 cnt
 fd[fd++]=open(/aaa,O_NONBLOCK, 0)) = 9443 ct best, 19410 ct avg, 39025 ct max, 3 cnt
                    close(fd[fd++]) = 1540 ct best, 6573 ct avg, 16494 ct max, 3 cnt
    close(open(/aaa,O_NONBLOCK, 0)) = 11212 ct best, 18615 ct avg, 32962 ct max, 3 cnt
           fd[fd++]=socket(AF_INET) = 29294 ct best, 42012 ct avg, 66279 ct max, 3 cnt
      getsockopt(sock,SO_ERROR,&on) = 2162 ct best, 8269 ct avg, 20192 ct max, 3 cnt
      setsockopt(sock,NO_DELAY,&on) = 7587 ct best, 14553 ct avg, 27127 ct max, 3 cnt
               fcntl(sock, F_GETFL) = 3479 ct best, 8809 ct avg, 19224 ct max, 3 cnt
        fcntl(sock,SET, O_NONBLOCK) = 4778 ct best, 7657 ct avg, 13069 ct max, 3 cnt
                  close(sock[fd++]) = 18303 ct best, 53224 ct avg, 104455 ct max, 3 cnt
             close(socket(AF_INET)) = 40700 ct best, 55503 ct avg, 84502 ct max, 3 cnt
          lockf(fd, fop, 512 ) time = 2677 ct best, 11889 ct avg, 27089 ct max, 3 cnt
     lseek(fd, 1234, SEEK_CUR) time = 706 ct best, 2563 ct avg, 6240 ct max, 3 cnt
              read(fd, buf, 1) time = 2598 ct best, 7137 ct avg, 16025 ct max, 3 cnt
              read(fd, buf, 0) time = 612 ct best, 734 ct avg, 978 ct max, 3 cnt
             write(fd, buf, 1) time = 3938 ct best, 7853 ct avg, 15462 ct max, 3 cnt
             write(fd, buf, 0) time = 618 ct best, 709 ct avg, 891 ct max, 3 cnt
            read(sock, buf, 1) time = 1084 ct best, 2290 ct avg, 4690 ct max, 3 cnt
            read(sock, buf, 0) time = 1065 ct best, 1652 ct avg, 2798 ct max, 3 cnt
           write(sock, buf, 1) time = 902 ct best, 1496 ct avg, 2633 ct max, 3 cnt
           write(sock, buf, 0) time = 896 ct best, 1307 ct avg, 2108 ct max, 3 cnt
      sigtimedwait(set,info,nowait) = 774 ct best, 4778 ct avg, 12730 ct max, 3 cnt

2) Solaris8, PIII-800/100, SMP:
We measure 3 samples, 1 iterations each:
             For Loop overhead time = 34 ct best, 74 ct avg, 97 ct max, 3 cnt
                   Gethrtime() time = 65 ct best, 91 ct avg, 145 ct max, 3 cnt
                gettimeofday() time = 772 ct best, 3747 ct avg, 9683 ct max, 3 cnt
          memset(tobj,0,16384) time = 2962 ct best, 34359 ct avg, 94953 ct max, 3 cnt
   memcpy(obj,obj+16384,16384) time = 8691 ct best, 79419 ct avg, 219878 ct max, 3 cnt
                       clock() time = 705 ct best, 13299 ct avg, 38469 ct max, 3 cnt
                times(&tmsbuf) time = 617 ct best, 633 ct avg, 666 ct max, 3 cnt
                   poll(0,0,0) time = 526 ct best, 838 ct avg, 1460 ct max, 3 cnt
               poll(fdset,0,0) time = 525 ct best, 598 ct avg, 743 ct max, 3 cnt
               poll(fdset,1,0) time = 1941 ct best, 9793 ct avg, 21912 ct max, 3 cnt
              poll(fdset,11,0) time = 4179 ct best, 7791 ct avg, 14271 ct max, 3 cnt
              poll(fdset,21,0) time = 6395 ct best, 9669 ct avg, 15898 ct max, 3 cnt
              poll(fdset,31,0) time = 8618 ct best, 11658 ct avg, 17589 ct max, 3 cnt
 fd[fd++]=open(/aaa,O_NONBLOCK, 0)) = 2736 ct best, 40975 ct avg, 117129 ct max, 3 cnt
                    close(fd[fd++]) = 722 ct best, 13709 ct avg, 39590 ct max, 3 cnt
    close(open(/aaa,O_NONBLOCK, 0)) = 3874 ct best, 13444 ct avg, 32407 ct max, 3 cnt
           fd[fd++]=socket(AF_INET) = 30086 ct best, 75075 ct avg, 163329 ct max, 3 cnt
      getsockopt(sock,SO_ERROR,&on) = 2166 ct best, 10557 ct avg, 27132 ct max, 3 cnt
      setsockopt(sock,NO_DELAY,&on) = 8347 ct best, 18820 ct avg, 39502 ct max, 3 cnt
               fcntl(sock, F_GETFL) = 3710 ct best, 10358 ct avg, 23397 ct max, 3 cnt
        fcntl(sock,SET, O_NONBLOCK) = 5041 ct best, 7043 ct avg, 10720 ct max, 3 cnt
                  close(sock[fd++]) = 34915 ct best, 70985 ct avg, 138205 ct max, 3 cnt
             close(socket(AF_INET)) = 36927 ct best, 74142 ct avg, 147073 ct max, 3 cnt
          lockf(fd, fop, 512 ) time = 798 ct best, 11171 ct avg, 31855 ct max, 3 cnt
     lseek(fd, 1234, SEEK_CUR) time = 641 ct best, 2652 ct avg, 6665 ct max, 3 cnt
              read(fd, buf, 1) time = 688 ct best, 3005 ct avg, 7630 ct max, 3 cnt
              read(fd, buf, 0) time = 677 ct best, 964 ct avg, 1514 ct max, 3 cnt
             write(fd, buf, 1) time = 665 ct best, 2706 ct avg, 6776 ct max, 3 cnt
             write(fd, buf, 0) time = 665 ct best, 831 ct avg, 1163 ct max, 3 cnt
            read(sock, buf, 1) time = 1166 ct best, 2321 ct avg, 4630 ct max, 3 cnt
            read(sock, buf, 0) time = 1158 ct best, 1510 ct avg, 2194 ct max, 3 cnt
           write(sock, buf, 1) time = 994 ct best, 1382 ct avg, 2118 ct max, 3 cnt
           write(sock, buf, 0) time = 998 ct best, 1251 ct avg, 1744 ct max, 3 cnt
      sigtimedwait(set,info,nowait) = 1105 ct best, 4965 ct avg, 12659 ct max, 3 cnt

3) Linux 2.4.5, PIII-800/100:
We measure 3 samples, 1 iterations each:
             For Loop overhead time = 33 ct best, 36 ct avg, 43 ct max, 3 cnt
                   Gethrtime() time = 66 ct best, 73 ct avg, 89 ct max, 3 cnt
                gettimeofday() time = 454 ct best, 2118 ct avg, 5312 ct max, 3 cnt
          memset(tobj,0,16384) time = 2981 ct best, 16657 ct avg, 43906 ct max, 3 cnt
   memcpy(obj,obj+16384,16384) time = 5040 ct best, 34672 ct avg, 93486 ct max, 3 cnt
                       clock() time = 481 ct best, 6309 ct avg, 17948 ct max, 3 cnt
                times(&tmsbuf) time = 372 ct best, 408 ct avg, 480 ct max, 3 cnt
                   poll(0,0,0) time = 500 ct best, 1067 ct avg, 2191 ct max, 3 cnt
               poll(fdset,0,0) time = 467 ct best, 543 ct avg, 682 ct max, 3 cnt
               poll(fdset,1,0) time = 1389 ct best, 2419 ct avg, 4394 ct max, 3 cnt
              poll(fdset,11,0) time = 3614 ct best, 3884 ct avg, 4373 ct max, 3 cnt
              poll(fdset,21,0) time = 5779 ct best, 6074 ct avg, 6656 ct max, 3 cnt
              poll(fdset,31,0) time = 7926 ct best, 8193 ct avg, 8729 ct max, 3 cnt
 fd[fd++]=open(/aaa,O_NONBLOCK, 0)) = 1163 ct best, 11167 ct avg, 31074 ct max, 3 cnt
                    close(fd[fd++]) = 370 ct best, 4763 ct avg, 13532 ct max, 3 cnt
    close(open(/aaa,O_NONBLOCK, 0)) = 1558 ct best, 5892 ct avg, 14487 ct max, 3 cnt
           fd[fd++]=socket(AF_INET) = 5540 ct best, 16032 ct avg, 36960 ct max, 3 cnt
      getsockopt(sock,SO_ERROR,&on) = 1386 ct best, 3970 ct avg, 9114 ct max, 3 cnt
      setsockopt(sock,NO_DELAY,&on) = 650 ct best, 2211 ct avg, 5322 ct max, 3 cnt
               fcntl(sock, F_GETFL) = 427 ct best, 2332 ct avg, 6122 ct max, 3 cnt
        fcntl(sock,SET, O_NONBLOCK) = 461 ct best, 649 ct avg, 992 ct max, 3 cnt
                  close(sock[fd++]) = 4251 ct best, 8914 ct avg, 17646 ct max, 3 cnt
             close(socket(AF_INET)) = 4840 ct best, 15015 ct avg, 34981 ct max, 3 cnt
          lockf(fd, fop, 512 ) time = 444 ct best, 4964 ct avg, 13986 ct max, 3 cnt
     lseek(fd, 1234, SEEK_CUR) time = 394 ct best, 1999 ct avg, 5207 ct max, 3 cnt
              read(fd, buf, 1) time = 386 ct best, 1405 ct avg, 3421 ct max, 3 cnt
              read(fd, buf, 0) time = 385 ct best, 445 ct avg, 565 ct max, 3 cnt
             write(fd, buf, 1) time = 399 ct best, 1409 ct avg, 3420 ct max, 3 cnt
             write(fd, buf, 0) time = 385 ct best, 460 ct avg, 612 ct max, 3 cnt
            read(sock, buf, 1) time = 730 ct best, 2277 ct avg, 5370 ct max, 3 cnt
            read(sock, buf, 0) time = 450 ct best, 527 ct avg, 674 ct max, 3 cnt
           write(sock, buf, 1) time = 857 ct best, 3172 ct avg, 7781 ct max, 3 cnt
           write(sock, buf, 0) time = 455 ct best, 528 ct avg, 669 ct max, 3 cnt
      sigtimedwait(set,info,nowait) = 523 ct best, 2285 ct avg, 5793 ct max, 3 cnt

 Small differences are due to gcc 2.95.2 on the Solaris boxes and gcc 2.96 on Linux.

------------------------------------
 Andres Kroonmaa <andre@online.ee>
 CTO, Microlink Online
 Tel: 6501 731, Fax: 6501 725
 Pärnu mnt. 158, Tallinn,
 11317 Estonia