(Fwd) Re: memory mapped store entries

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Tue, 25 Aug 1998 22:48:54 +0300 (EETDST)

 I'm terribly sorry for reposting, but I managed to screw up my mailer
 to the extent that the message was hardly readable. Apologies.
  
------- Forwarded Message Follows -------
From: Self <ENTER1/ANDRE>
To: Stewart Forster <slf@connect.com.au>
Subject: Re: memory mapped store entries
Copies to: squid-dev@ircache.net
Date sent: Tue, 25 Aug 1998 22:14:03 +0300 (EETDST)

On 25 Aug 98, at 10:33, Stewart Forster <slf@connect.com.au> wrote:

 hi,

 mmap() is NOT as bad as it is usually feared to be, but of course it
 should be used where it fits best.

> > The process shouldn't be waiting on this, since the kernel would be
> > doing this in the background (if at all; since we're the only one opening
> > the file, the system could keep all pages in memory until we terminate
> > and then write them back).
>
> I'm unclear as to how the kernel handles the flushing of memory
> pages out to disk. If this is in progress will the kernel block the
> current page access until that flush is complete? I thought it did,
> since to do otherwise would require the kernel to copy the data and then
> write that out. However mmap() states that it doesn't do this and just
> writes from the same page copy in RAM as the process has access to. So
> my point still stands that the main thread will stall while these kinds
> of flushes happen. I suppose we could pre-empt this flushing by calling
> fsync(), but then we're still stalling the main thread. If we push fsync()
> into a sub-thread the main thread will still stall while the fsync() completes.
>
> I'm happy to be corrected on the above point if you can prove otherwise.
 
 There seems to be lots of confusion regarding how the OS manages the page
 cache. To make this clear, and to compare the relative merits of mmap versus
 read/write, you really need to know more about the VM system of any given OS.
 I can talk about Solaris, but I believe these points stand for many others
 as well. For a good reference on Solaris, read the paper
   http://www.sun.com/sun-on-net/performance/vmsizing.ps.Z
 (for anyone wishing to really understand the Solaris VM, this is a must-read)

 (also some others:
   http://www.sunworld.com/sunworldonline/swol-05-1997/swol-05-perf.html
   http://opcom.sun.ca/white-papers/white-papers.html
   http://www.sunworld.com/swol-12-1997/swol-12-insidesolaris.html
 )

 For those who won't read the PostScript file, here is a short overview.
 The VM operates on pages. There are 3 kinds of pages:
  1) attached page, accessed recently
  2) attached page, dirty or modified and not yet flushed
  3) unattached page, on the free list

 The kernel constantly maintains a free list, so that it has a pool from
 which it can allocate real memory. To keep that free list populated, the
 kernel scans all pages in physical memory every now and then. It does this
 with 2 pointers (hands): one goes ahead and marks pages as not-accessed,
 then the 2nd comes along later and checks whether those bits have been set
 again by the MMU hardware. If a page has not been touched, it is idle and a
 candidate for pageout or reuse (but only after that second verification).
 By noting how long any page has been idle, the OS implements LRU.
 The page scanner runs 4 times/sec and only activates if the free list runs
 short. It then scans only a fraction of all available pages. If the shortage
 persists, the fraction of pages scanned is increased, as is the frequency of
 the scan (up to scanning the max number of pages on every clock tick when
 there is a desperate shortage of RAM).
 The total time needed to pass over all pages in the system depends on RAM
 usage and the shortage of it, and is a highly dynamic figure.
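 To make the two-hand scheme concrete, here is a minimal toy sketch of it in
 C. This is only an illustration of the algorithm described above, not the
 actual Solaris code; NPAGES, HANDSPREAD and the page struct are invented,
 and the "MMU" is simulated by re-marking two hot pages on every tick:

    #include <stdio.h>

    #define NPAGES     16
    #define HANDSPREAD 4        /* distance between the two hands */

    struct page {
        int referenced;         /* set by the MMU when the page is touched */
        int freed;              /* reclaimed onto the free list */
    };

    static struct page mem[NPAGES];

    static void clock_tick(int front)
    {
        int back = (front + NPAGES - HANDSPREAD) % NPAGES;

        mem[front].referenced = 0;      /* front hand: clear the ref bit */

        /* back hand: if the bit is still clear, the page sat idle for a
         * whole hand-spread and can be reclaimed */
        if (!mem[back].referenced && !mem[back].freed) {
            mem[back].freed = 1;
            printf("page %d reclaimed\n", back);
        }
    }

    int main(void)
    {
        for (int tick = 0; tick < 2 * NPAGES; tick++) {
            mem[0].referenced = 1;      /* hot pages keep getting touched */
            mem[1].referenced = 1;
            clock_tick(tick % NPAGES);
        }
        return 0;                       /* pages 0 and 1 survive, rest freed */
    }

 Run it and every page except the two hot ones gets reclaimed; shrink
 HANDSPREAD or tick faster and the scanner gets more aggressive, which is
 exactly the shortage behaviour described above.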

 Basically, if there is lots of RAM, the OS tends to scan very slowly (if at
 all), and thus pages are marked for reuse only after a relatively long time
 (up to hours). If there is very little RAM, then pages are scanned very fast
 and only the most active ones are left out of the pageout selection. There
 are some exceptions. For example, pages that are known to be (map_)shared
 between processes are paged out only if there are no other pages to free and
 each shared page has been found idle many times (8) in a row by the page
 scanner. This allows process text and libraries to remain in RAM in
 preference to other pages.

 If a page is dirty, it cannot be added to the free list until it is flushed
 to backing store, be it either a swapfile or an mmaped file. After a page is
 flushed, it "is touched" and becomes a candidate for pageout only after the
 page scanner has again detected it to be idle.

 So, dirty pages are less likely to get paged out than RO idle pages.

 Dirty pages are tracked by how long they have been dirty. They are given
 time to "accumulate changes", i.e. they are not scheduled to disk until the
 maximum allowable time passes. (For example, Solaris on an idle system would
 not page out a page until it has been dirty for 30 secs, dictated by the run
 times of the special fsflush process.) Under memory pressure the flush can
 happen within the time it takes to scan all pages in the system, i.e.
 possibly much sooner. Of course, fsync() or the like can force flushes, but
 you know what that means for the calling thread.
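 To make that last point concrete, here is a hedged little example of forcing
 the flush yourself with msync() on an mmaped page; the file name is made up
 for the demo. The calling thread stalls inside msync() until the data is on
 disk, which is exactly the stall discussed above:

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/mmap-demo", O_RDWR | O_CREAT, 0600);
        if (fd < 0 || ftruncate(fd, 4096) < 0)
            return 1;

        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return 1;

        memcpy(p, "dirty data", 10);    /* the page is now dirty */

        /* Without this call, fsflush / the page scanner would write the
         * page back within ~30 secs anyway; with it, we block right here
         * until the write completes. */
        msync(p, 4096, MS_SYNC);

        munmap(p, 4096);
        close(fd);
        return 0;
    }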

 Whenever the kernel "looks" at VM pages, all processes are waiting; there
 are no concurrent accesses to the page translation tables. The kernel
 basically looks at the system's whole VM-to-physical mapping. When the
 kernel decides to schedule some pages to disk, it most probably (I'm not
 dead sure here) marks them copy-on-write and simply lets pageout write them
 out. With copy-on-write, the kernel avoids the need to copy the page aside
 and also guarantees the integrity of the page data, as modifying it while
 the pageout is pending would make a copy of it in user space.

 So, basically, when you use mmaped files and modify them, their pages tend
 to stay in memory compared to the usual RO file cache pages, but they are
 paged out before text pages would be. Also, dirty pages are flushed every
 30 secs by default, without any need for a special flush call.

 To give a comparison to the usual read/write, every page in squid's memory
 map (like anonymous malloced memory) competes equally with the system's
 file system cache pages, and in case of prolonged memory shortage squid's
 malloced memory is made a candidate for pageout and paged out to the swap
 device. When using an mmaped file, the file itself is the swap for its
 memory, so there is no need to page out an idle page: it is simply reused,
 avoiding physical disk io.

 Also, as the same page scanner and flusher run over the fs cache pages,
 those are flushed to disk every 30 secs, and for that time most of the
 system is stalled waiting for the flush to finish. (On Solaris you can see
 this using perfmeter: there are disk activity spikes about every 25-30 secs
 and, at the same time, holes in CPU utilisation, most probably because of
 waiting on io.)

 One more thing. Squid's current store database is very scattered over
 memory; it is large and lots of its parts are mostly idle, so they are
 eventually paged out to swap. With a much more compact index kept in an
 mmap, it is much easier to arrange that most of its pages are constantly
 "touched" and thus not paged out. To achieve this, I think it is right to
 arrange the Store index as an array of fixed-length items that is always
 filled from the ground up. This way the whole structure is as hot as
 possible and less prone to pageouts.
 The same goes for file allocation on disks - the more tightly files are
 packed together, the less impact there is on the directory and inode caches.
 It is okay to allocate the mmap storage for all possible files (storage
 entries) on disk as a fixed-length file, because if the slots are used from
 the ground up, then the unused pages of the mmap are simply reclaimed, there
 is no waste of RAM, and there is always a guaranteed amount of disk space
 for the storage database.
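 As a rough sketch of what I mean, in C: the index as a fixed-size mmaped
 file of fixed-length slots, filled from slot 0 upward. The struct fields,
 names and sizes here are invented for illustration; this is not squid's
 actual StoreEntry:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <time.h>
    #include <unistd.h>

    #define MAX_ENTRIES (1 << 20)   /* fixed upper bound, decided up front */

    struct store_slot {             /* hypothetical fixed-length entry */
        unsigned char key[16];      /* e.g. MD5 of the URL */
        int           file_no;      /* which on-disk file holds the object */
        int           object_size;
        time_t        last_ref;
    };

    static struct store_slot *index_map;
    static size_t next_free;        /* ground-up allocation pointer */

    int store_index_open(const char *path)
    {
        size_t len = MAX_ENTRIES * sizeof(struct store_slot);
        int fd = open(path, O_RDWR | O_CREAT, 0600);

        if (fd < 0 || ftruncate(fd, len) < 0)
            return -1;
        index_map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        close(fd);                  /* the mapping stays valid after close */
        return index_map == MAP_FAILED ? -1 : 0;
    }

    struct store_slot *store_slot_alloc(void)
    {
        if (next_free >= MAX_ENTRIES)
            return NULL;            /* index full */
        return &index_map[next_free++];
    }

 Because slots are handed out densely from the bottom, the live part of the
 index stays packed into as few pages as possible, and the pages beyond
 next_free are never touched, so they cost neither RAM nor pageout io.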

 When speaking of the usability of mmap, they say it should be used for
 long-lived, large, randomly accessed data. For short-lived, sequentially
 accessed data, read/write is much preferred. One reason is that modern OSes
 use read-ahead and free-behind algorithms that speed up reads and reduce the
 memory usage of rare sequential reads; mmap disables such algorithms. And of
 course, mmap is a pretty expensive call.

 Taking all that into account, I'd vote for using mmap for store index data,
 but NOT for actual object data.

 ----------------------------------------------------------------------
  Andres Kroonmaa                       mail: andre@online.ee
  Network Manager
  Organization: MicroLink Online        Tel: 6308 909
  Tallinn, Sakala 19                    Pho: +372 6308 909
  Estonia, EE0001  http://www.online.ee Fax: +372 6308 901
 ----------------------------------------------------------------------