Re: memory mapped store entries

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Tue, 25 Aug 1998 22:14:03 +0300 (EETDST)

On 25 Aug 98, at 10:33, Stewart Forster <slf@connect.com.au> wrote:

 hi,

 mmap() is NOT as bad as people usually fear, but of course it should
 be used where it fits best.

> > The process shouldn't be waiting on this, since the kernel would be
> > doing this in the background (if at all; since we're the only one opening
> > the file, the system could keep all pages in memory until we terminate
> > and then write them back).
>
> I'm unclear as to how the kernel handles the flushing of memory
> pages out to disk. If this is in progress will the kernel block the
> current page access until that flush is complete? I thought it did,
> since to do otherwise would require the kernel to copy the data and then
> write that out. However mmap() states that it doesn't do this and just
> writes from the same page copy in RAM as the process has access to. So
> my point still stands that the main thread will stall while these kinds
> of flushes happen. I suppose we could pre-empt this flushing by calling
> fsync(), but then we're still stalling the main thread. If we push fsync()
> into a sub-thread the main thread will still stall while the fsync() completes.
>
> I'm happy to be corrected on the above point if you can prove otherwise.

 

 There seems to be lots of confusion regarding how the OS manages the
 page cache. To make this clear, and to compare the relative merits of
 mmap() versus read()/write(), you really need to know more about the VM
 system of the given OS. I can talk about Solaris, but I believe these
 points also hold for many others. For a good reference on Solaris, read
 the paper

   http://www.sun.com/sun-on-net/performance/vmsizing.ps.Z

 (for anyone wishing to really understand the Solaris VM, this is "a must" read)

 (also some others:

   http://www.sunworld.com/sunworldonline/swol-05-1997/swol-05-perf.html

   http://opcom.sun.ca/white-papers/white-papers.html

   http://www.sunworld.com/swol-12-1997/swol-12-insidesolaris.html

 )

 For those who won't read the PostScript file, here is a short overview.

 The VM operates on pages. There are 3 kinds of pages:
  1) attached page, accessed recently
  2) attached page, dirty or modified and not yet flushed
  3) unattached page, on the free list
 The kernel constantly maintains a free list so that it has a pool from
 which it can allocate real memory. To build that free list, the kernel
 scans all pages in physical memory every now and then. It does this with
 2 pointers (hands): one goes ahead and marks pages as not-accessed, then
 the 2nd comes along later and checks whether those bits have been set
 again by the MMU hardware. If a page has not been touched in between, it
 is idle and a candidate for pageout or reuse (but only after a second
 verify). By noting how long each page has been idle, the OS implements
 LRU.

 The page scanner runs 4 times/sec and activates only when the free list
 runs short. It then scans only a fraction of all pages. If the shortage
 persists, both the fraction of pages scanned and the scan frequency are
 increased (up to scanning the maximum number of pages on every clock
 tick when there is a desperate shortage of RAM). The total time needed
 to pass over all pages in the system depends on RAM usage and the
 shortage of it, and is a highly dynamic figure.

 Basically, if there is lots of RAM, the OS tends to scan very slowly (if
 at all), so pages are marked for reuse only after a relatively long time
 (up to hours). If there is very little RAM, pages are scanned very fast
 and only the most active ones escape pageout selection. There are some
 exceptions. For example, pages known to be (MAP_)shared between
 processes are paged out only if there are no other pages to free and
 each shared page has been found idle many times (8) in a row by the page
 scanner. This lets process text and libraries stay in RAM in preference
 to other pages.

 If a page is dirty, it cannot be added to the free list until it has
 been flushed to backing store, be it a swap file or an mmap()ed file.
 After a page is flushed it counts as "touched", and it becomes a pageout
 candidate again only after the page scanner has found it idle. So dirty
 pages are less likely to be paged out than read-only idle pages. Dirty
 pages are timestamped when they become dirty and are then given time to
 "accumulate changes", i.e. they are not scheduled to disk until the
 maximum allowable time passes (for example, Solaris will not page out a
 page until it has been dirty for 30 secs on an idle system, dictated by
 the run times of the special fsflush process) or until the page scanner
 has passed over all pages in the system, which is possibly much sooner.
 Of course fsync() and the like can force flushes, but you know what that
 means.

 Whenever the kernel "looks" at VM pages, all processes wait; there are
 no concurrent accesses to the page translation tables. The kernel
 basically examines the system's whole VM-to-physical mapping. When it
 decides to schedule some pages to disk, it most probably (I'm not dead
 sure here) marks them copy-on-write and simply lets pageout write them
 out. With copy-on-write the kernel avoids having to copy the page aside,
 and it also guarantees the integrity of the page data: modifying the
 page while a pageout is pending would create a copy of it in user space.

 So, basically, when you use mmap()ed files and modify them, their pages
 tend to stay in memory longer than the usual read-only file-cache pages,
 but they are paged out before text pages would be. Also, dirty pages are
 flushed every 30 secs by default, without any need for a special flush
 call.

 For comparison with the usual read()/write(): every page in Squid's
 memory map (such as anonymous malloc()ed memory) competes equally with
 the system's file-system cache pages, and in a prolonged memory shortage
 the malloc()ed memory becomes a pageout candidate and is paged out to
 the swap device. With an mmap()ed file, the file itself is the swap for
 its memory: there is no need to page out an idle page, it is simply
 reused, avoiding physical disk I/O.

 Also, as the same page scanner and flusher run over the fs cache pages,
 those are flushed to disk every 30 secs, and for that time most of the
 system stalls waiting for the flush to finish. (On Solaris you can see
 this with perfmeter: there are disk-activity spikes about every 25-30
 secs, and at the same time there are holes in CPU utilisation, most
 probably because of waiting on I/O.)

 One more thing. Squid's current store database is scattered widely over
 memory; it is large, and many of its parts are mostly idle, so they are
 eventually paged out to swap. With a much more compact mmap()ed index it
 is much easier to arrange that most of its pages are constantly
 "touched" and thus not paged out. To achieve this, I think it is right
 to arrange the store index as an array of fixed-length items that is
 always filled from the ground up. That way the whole structure stays as
 hot as possible and is less prone to pageouts. The same goes for file
 allocation on disk: the more tightly the files are packed together, the
 less impact there is on the directory and inode caches.

 It is okay to allocate mmap() storage for all possible files (storage
 entries) on disk as one fixed-length file, because if the entries are
 used from the ground up, the unused pages of the mmap are simply
 reclaimed. There is no waste of RAM, and there is always a guaranteed
 amount of disk space for the storage database.

 As for when mmap() is appropriate: the usual advice is to use it for
 long-lived, large, randomly accessed data. For short-lived, sequentially
 accessed data, read()/write() is much preferred. One reason is that
 modern OSes use read-ahead and free-behind algorithms that speed up
 reads and reduce the memory footprint of rare sequential reads; mmap()
 disables those algorithms. And of course mmap() itself is a pretty
 expensive call.

 Taking all that into account, I'd vote for using mmap() for the store
 index data, but NOT for the actual object data.

 ----------------------------------------------------------------------
  Andres Kroonmaa mail: andre@online.ee
  Network Manager
  Organization: MicroLink Online Tel: 6308 909
  Tallinn, Sakala 19 Pho: +372 6308 909
  Estonia, EE0001 http://www.online.ee Fax: +372 6308 901
 ----------------------------------------------------------------------
Received on Tue Jul 29 2003 - 13:15:52 MDT

This archive was generated by hypermail pre-2.1.9 : Tue Dec 09 2003 - 16:11:53 MST