Re: memory mapped store entries

From: Stephen R. van den Berg <srb@dont-contact.us>
Date: Tue, 25 Aug 1998 10:43:57 +0200

Michael O'Reilly wrote:
>"Stephen R. van den Berg" <srb@cuci.nl> writes:
>> Stewart Forster wrote:
>> >Stephen van den Berg wrote:
>> How would it do this? The kernel doesn't really have a way of arresting
>> a process if it writes to a mmapped page it currently writes back to disk.

>I'll talk about linux because that's what you use, and that's what I
>know.

>Disk writes are ALWAYS done in a process context, and that process is
>blocked for the duration.

What happens if several processes/threads hold the mmap? In which
context is the write done? Are the others allowed to continue?

>> Also, why would the kernel want to do this?
>> The kernel simply sees a dirty page, schedules a write, and is done with
>> it. The write will take effect in the background, grabbing whatever
>> data is in there at that moment. If the user program scribbled into
>> the page in the meantime, then so be it, the new scribbles jump over
>> to the disk.

>See above. Consider the case when squid is doing lots of writing, a
>portion of the mmap is paged out, a page fault is taken, it needs a
>page to page it back in, but all the available blocks are dirty....
>This happens more often than it sounds, particularly on boxes that are
>very busy.

Hmmm..., so you're telling me that when squid writes a lot and fills
up the buffer cache faster than it can be flushed to disk, the kernel
might then decide to nibble off one of our mmapped pages, and the
process might block for an extended period when that page is accessed
and has to be paged back in?

So, how is this going to be worse than the same scenario without the
mmapped file? I.e. now the memory is part of squid's userspace, so the
buffer cache is effectively smaller (previously the mmapped file was
part of the buffer cache); worse, some extra buffers are going to be
needed to hold new logfile entries (in the mmapped case, we'd simply
be reusing freed entries in the non-growing file).
Squid is still writing a lot, will fill up the buffer cache, and will
start stalling because some part of the buffer cache needs to be
flushed first. The kernel can no longer steal pages from the mmapped
file, but it can decide to swap out a page here and there, creating
effectively the same problems as before. Only now the kernel has been
forced to use swap and has less memory to work with than in the
mmapped case, where it could still make its own decisions.

>> > What about consistency in
>> >the event of a crash?

>> Works like a charm. I can "kill -9" my squid any time I want, it
>> will come back smooth and fast with *no* false URLs and no race conditions.
>> As to kernel-crash consistency, I have to admit that I don't have
>> much experience with that due to the simple fact that my proxy server's
>> kernel (Linux) has not crashed in more than a year.
>> If you're concerned, you could build in checkpoints, where squid tells
>> the kernel to sync mmap and disk content *now* (this will happen
>> asynchronously again).
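
Such a checkpoint boils down to a single msync() call. A minimal
sketch, with map_base/map_len standing in for the actual mapping
(the names are illustrative):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Ask the kernel to schedule a write-back of the mmapped log
     * *now*.  MS_ASYNC queues the flush and returns immediately,
     * so squid itself is not blocked. */
    int checkpoint_swaplog(void *map_base, size_t map_len)
    {
        return msync(map_base, map_len, MS_ASYNC);
    }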

>I think you've been lucky so far... There are two cases here: SIGKILL
>when a structure is only 1/2 written,

This has been catered for by putting the swap_file_number field at the
*end* of the struct, and by making sure that the swap_file_number field
is updated first upon entry deletion and last upon entry creation.
This should make it 100% SIGKILL (or SEGV :-) proof.
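
Roughly like this (a sketch only; the field and function names are
illustrative, not squid's actual structs):

    #include <stddef.h>
    #include <time.h>

    /* Illustrative mmapped log entry: the validating field goes
     * *last*, so a half-written entry is never seen as valid. */
    struct mmap_entry {
        time_t timestamp;
        size_t object_size;
        /* ... other metadata ... */
        int    swap_file_number;    /* -1 == slot free/invalid */
    };

    /* Deletion: invalidate first, so a SIGKILL halfway through
     * leaves at worst a free slot, never a corrupt entry. */
    static void entry_delete(struct mmap_entry *e)
    {
        e->swap_file_number = -1;
    }

    /* Creation: fill in everything else, then make the entry
     * visible with the very last store. */
    static void entry_create(struct mmap_entry *e,
                             time_t ts, size_t sz, int sfn)
    {
        e->timestamp = ts;
        e->object_size = sz;
        e->swap_file_number = sfn;
    }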

> and something fatal when a structure
>crosses a page boundary and only one page gets written to disk.

Yes, when the kernel crashes. Normally not a very frequent event.

>Preventing structures crossing page boundaries would make me more
>comfortable,

It could easily be done by *not* using the slots that straddle a page
boundary, of course.
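
For instance (a sketch, assuming fixed-size entries at known byte
offsets in the mapping):

    #include <stddef.h>
    #include <unistd.h>

    /* Return nonzero if an entry of `sz' bytes at byte offset `off'
     * would straddle a page boundary; such slots are left unused. */
    static int crosses_page(size_t off, size_t sz)
    {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        return (off / page) != ((off + sz - 1) / page);
    }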

> and paying attention to the order in which structures are
>filled would probably eliminate the other.

Ah, yes, did that.

>> Not much of a problem. I preallocate a bit (adaptively), then when
>> the limit is reached, a simple unmap and new mmap will do nicely.
>> This will repeat a few times while your cache is filling the first
>> time. After having reached a steady state, there is no fluctuation
>> there anymore.

>When you blithely talk about unmapping and remapping 200 and 300 meg
>structures I get a bit worried. :) This would want to be a VERY rare
>occurrence.

On production caches that have filled up, this *is* an event that no
longer occurs.
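
The grow step itself, portably, is an munmap() followed by extending
the file and mapping it again (Linux's mremap() could avoid the
unmap); a sketch, with illustrative names:

    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Grow the mapped swap log: drop the old mapping, extend the
     * file, and map the larger region.  Any pointers into the old
     * mapping must be recomputed afterwards. */
    static void *remap_swaplog(int fd, void *old, size_t old_len,
                               size_t new_len)
    {
        if (old != NULL && munmap(old, old_len) != 0)
            return MAP_FAILED;
        if (ftruncate(fd, (off_t)new_len) != 0)
            return MAP_FAILED;
        return mmap(NULL, new_len, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
    }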

>> > Are we
>> >happy to mlock() pages into RAM to ensure mmap()'s performance?

>> Like I said, I don't think this will make much of a difference.

>See above. I would think it's essential.

Ok, using mlock() changes the odds a bit in the write-overflow case:
it would force the kernel to make do with a smaller buffer cache, and
it might even make the kernel page out some of squid's other data
structures (the ones not part of the mmap), unless we mlock() those
too. We end up with the same problem here, though. We give the kernel
fewer choices; is that good? Won't it stall squid regardless, only
this time on a buffer write?
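
The call itself would be trivial; a sketch, again with map_base and
map_len standing in for the actual mapping:

    #include <stddef.h>
    #include <stdio.h>
    #include <sys/mman.h>

    /* Pin the mapped index in RAM so page-outs can never stall squid
     * on a fault.  Needs privilege (CAP_IPC_LOCK on Linux), and it
     * removes that much memory from what the kernel may reclaim. */
    static int pin_swaplog(void *map_base, size_t map_len)
    {
        if (mlock(map_base, map_len) != 0) {
            perror("mlock");
            return -1;
        }
        return 0;
    }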

>> >How about under extremely high loads where disks are busy and take up to
>> >100ms to get a page in/page out and that 100ms means 10 requests that you
>> >can't service while that happens?

>> This will be just as bad in the case of the occasional swapped out page.

>Yes, but the squid executable is MUCH smaller, and more frequently
>accessed than the index data. Also the swap is on a separate disk, not
>on the VERY busy cache disks.

The squid swaplog file we're mmapping *should* be on a separate disk
as well (it is here: on the log disk; it could also be put on the
swap disk).

>Hmm. I'm doing over 8000 requests/min, on a Pentium-II 300, with
>512Meg of ram. I'm running at over 80% CPU. Using async-io with 32
>threads (16 won't keep up) with a 40Gig cache over a 6 disk array.

Interesting; what is your byte-hit ratio?

-- 
Sincerely,                                                          srb@cuci.nl
           Stephen R. van den Berg (AKA BuGless).
"To err is human, to debug ... divine."