Re: memory mapped store entries

From: Michael O'Reilly <michael@dont-contact.us>
Date: 25 Aug 1998 12:44:34 +0800

"Stephen R. van den Berg" <srb@cuci.nl> writes:
> Stewart Forster wrote:
> >Stephen van den Berg wrote:
> >> The process shouldn't be waiting on this, since the kernel would be
> >> doing this in the background (if at all; since we're the only one opening
> >> the file, the system could keep all pages in memory until we terminate
> >> and then write them back).
>
> > I'm unclear as to how the kernel handles the flushing of memory
> >pages out to disk. If this is in progress will the kernel block the
> >current page access until that flush is complete? I thought it did,
>
> How would it do this? The kernel doesn't really have a way of arresting
> a process if it writes to an mmapped page that it is currently writing back to disk.

I'll talk about Linux because that's what you use, and that's what I
know.

Disk writes are ALWAYS done in a process context, and that process is
blocked for the duration.

Most of the time, this will be update(8) or bdflush(8) or at a pinch
kswapd(8) depending on your vintage.

The exceptions are whenever any process does any VM activity that
would result in a page out (such as shrinking the buffer cache). In
this case, the process will sleep waiting for the buffer to become
unlocked => process is stopped dead.

This can have some very nasty effects. Something I see fairly commonly
is a machine with ~250meg of disk cache, and a process that writes a
large file ( > 250 megs). At some point the process will want another
buffer to dirty (so it can fill it with data), but because it's
writing much faster than the disk can handle, the entire buffer cache
is filled with dirty pages. So it picks a page, and sleeps waiting for
it to become available (note that it's NOT waiting for just _any_ page to
become available, but for the page it considered the 'best' one to free up).

This frequently results in the process sleeping on disk I/O for 20 seconds
or more, while it waits for a large portion of the disk cache to be
written out before that particular page is unlocked.

(The process BTW is normally squid; it's writing out the swap logs).

> Also, why would the kernel want to do this?
> The kernel simply sees a dirty page, schedules a write, and is done with
> it. The write will take effect in the background, grabbing whatever
> data is in there at that moment. If the userprogram scribbled into
> the page in the meantime, then so be it, the new scribbles jump over
> to the disk.

See above. Consider the case where squid is doing lots of writing, a
portion of the mmap has been paged out, a page fault is taken, and the
kernel needs a free page to bring it back in, but all the available
blocks are dirty.... This happens more often than you'd think,
particularly on boxes that are very busy.

[ .. ]
> > What about consistency in
> >the event of a crash?
>
> Works like a charm. I can "kill -9" my squid any time I want, it
> will come back smooth and fast with *no* false URLs and no race conditions.
> As to kernel-crash consistency, I have to admit that I don't have
> much experience with that due to the simple fact that my proxy server's
> kernel (Linux) has not crashed in more than a year.
> If you're concerned, you could build in checkpoints, where squid tells
> the kernel to sync mmap and disk content *now* (this will happen
> asynchronously again).

I think you've been lucky so far... There are two cases here: a SIGKILL
when a structure is only half written, and a fatal crash when a structure
crosses a page boundary and only one of the pages gets written to disk.

Preventing structures from crossing page boundaries would make me more
comfortable, and paying attention to the order in which structures are
filled would probably eliminate the other.
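
Roughly what I have in mind, as a sketch only (the struct and the names
are made up, not squid's actual storeSwapLogData): keep whole entries
inside a single page, and flip a "valid" flag last so a torn write
leaves an entry invisible rather than half-written.

    #include <string.h>
    #include <unistd.h>

    struct log_entry {
        char key[60];           /* stand-in for the real fields        */
        unsigned int valid;     /* written LAST, after the rest        */
    };

    /* Lay entries out page by page so no entry ever straddles a page
     * boundary (assumes sizeof(struct log_entry) <= page size). */
    static struct log_entry *entry_at(void *map, size_t idx)
    {
        size_t page     = (size_t)sysconf(_SC_PAGESIZE);
        size_t per_page = page / sizeof(struct log_entry);

        return (struct log_entry *)((char *)map
                + (idx / per_page) * page
                + (idx % per_page) * sizeof(struct log_entry));
    }

    static void fill_entry(void *map, size_t idx, const char *key)
    {
        struct log_entry *e = entry_at(map, idx);

        e->valid = 0;                           /* mark invalid first     */
        strncpy(e->key, key, sizeof e->key);    /* fill the body          */
        e->valid = 1;                           /* only now make it valid */
    }

And the checkpoints Stephen mentions would presumably just be an
occasional msync(map, len, MS_ASYNC) over the region, which schedules
the write-out and returns immediately.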
 
> > What about needing to extend the file which mmap()
> >can't do transparently? (Sure we can just pre-allocate that one)
>
> Not much of a problem. I preallocate a bit (adaptively), then when
> the limit is reached, a simple unmap and new mmap will do nicely.
> This will repeat a few times while your cache is filling the first
> time. After having reached a steady state, there is no fluctuation
> there anymore.

When you blithely talk about unmapping and remapping 200 and 300 meg
structures I get a bit worried. :) This would want to be a VERY rare
occurrence.
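
To be fair, the unmap-and-remap step itself is only a few lines; here's
a rough sketch (nothing squid-specific, and it assumes the file is
grown with ftruncate() before the bigger mapping is made):

    #include <sys/mman.h>
    #include <unistd.h>

    /* Drop the old mapping, extend the backing file, then map it again
     * at the new size.  Returns MAP_FAILED on any error. */
    static void *grow_mapping(int fd, void *old, size_t old_len, size_t new_len)
    {
        if (old != NULL && munmap(old, old_len) != 0)
            return MAP_FAILED;
        if (ftruncate(fd, (off_t)new_len) != 0)
            return MAP_FAILED;
        return mmap(NULL, new_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    }

(On Linux there's also mremap(2), which can grow a mapping without the
unmap/mmap pair.)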
 
> > Are we
> >happy to mlock() pages into RAM to ensure mmap()'s performance?
>
> Like I said, I don't think this will make much of a difference.

See above. I would think it's essential.
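
Concretely, pinning the index would be a single mlock() over the mapped
region, done right after the mmap(). Sketch only: idx_map/idx_len here
stand for whatever mmap() returned for the index file, and on Linux
this needs root (and the whole index has to fit in physical memory).

    #include <stdio.h>
    #include <sys/mman.h>

    /* Pin the mmapped index so a fault on it never has to wait behind
     * the busy cache disks. */
    static int pin_index(void *idx_map, size_t idx_len)
    {
        if (mlock(idx_map, idx_len) != 0) {
            perror("mlock");        /* not privileged, or doesn't fit */
            return -1;
        }
        return 0;
    }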
 
> >How about under extremely high loads where disks are busy and take up to
> >100ms to get a page in/page out and that 100ms means 10 requests that you
> >can't service while that happens?
>
> This will be just as bad in the case of the occasional swapped out page.

Yes, but the squid executable is MUCH smaller, and more frequently
accessed than the index data. Also the swap is on a separate disk, not
on the VERY busy cache disks.
 
> >My concern is always at the bleeding edge of performance and I'm ever happy
> >to sacrifice some low-end speed by 5-10% if it means high end speed is 10%
> >faster.
>
> I'm servicing 400 requests per minute average at peak time, using a
> Pentium 133 and 192MB of RAM
> Available buffercache: 118MB
> Resident size of squid: 29MB+46MB
> Mmapped storeSwapLogData + shared libs: 29MB

Hmm. I'm doing over 8000 requests/min, on a Pentium-II 300, with
512MB of RAM. I'm running at over 80% CPU, using async-io with 32
threads (16 won't keep up) with a 40Gig cache over a 6-disk array.

[..]
> I'm not sure if this qualifies as low-load or high-load. I can tell you
> that it works, and I can also tell you that it also works with 96MB of RAM
> probably, but not without the mmap patch.

Low load, very low load. :) Stew is doing about the same load as I am
(higher on some of his boxes I think), but he's doing it on a Solaris
kernel so he cares more about kernel performance (cos it's so low to
start with ;) ;)

Michael.