Re: memory mapped store entries

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Tue, 25 Aug 1998 22:14:03 +0300 (EETDST)

On 25 Aug 98, at 10:33, Stewart Forster <slf@connect.com.au> wrote:

 hi,

 mmap() is NOT as bad as people usually fear, but of course it should
 be used where it fits best.

> > The process shouldn't be waiting on this, since the kernel would be
> > doing this in the background (if at all; since we're the only one opening
> > the file, the system could keep all pages in memory until we terminate
> > and then write them back).
>
> I'm unclear as to how the kernel handles the flushing of memory
> pages out to disk. If this is in progress will the kernel block the
> current page access until that flush is complete? I thought it did,
> since to do otherwise would require the kernel to copy the data and then
> write that out. However mmap() states that it doesn't do this and just
> writes from the same page copy in RAM as the process has access to. So
> my point still stands that the main thread will stall while these kinds
> of flushes happen. I suppose we could pre-empt this flushing by calling
> fsync(), but then we're still stalling the main thread. If we push fsync()
> into a sub-thread the main thread will still stall while the fsync() completes.
>
> I'm happy to be corrected on the above point if you can prove otherwise.

 

 There seems to be lots of confusion regarding how the OS manages the
 page cache. To make this clear, and to compare the relative merits of
 mmap() versus read()/write(), you really need to know more about the VM
 system of the given OS. I can talk about Solaris, but I believe these
 points also hold for many others. For a good reference on Solaris, read
 the paper

   http://www.sun.com/sun-on-net/performance/vmsizing.ps.Z

 (for anyone wishing to really understand the Solaris VM, this is "a must" read)

 (also some others:

   http://www.sunworld.com/sunworldonline/swol-05-1997/swol-05-perf.html

   http://opcom.sun.ca/white-papers/white-papers.html

   http://www.sunworld.com/swol-12-1997/swol-12-insidesolaris.html

 )

 For those who won't read the PostScript file, here is a short overview.

 The VM operates on pages. There are 3 kinds of pages:
  1) attached page, accessed recently
  2) attached page, dirty or modified and not yet flushed
  3) unattached page, on the free list
 The kernel constantly maintains a free list so that it has a pool from
 which it can allocate real memory. To build that free list, the kernel
 scans all pages in physical memory every now and then. It does this with
 2 pointers (hands): one goes ahead and marks pages as not-accessed, then
 the 2nd comes along later and checks whether those bits have been set
 again by the MMU hardware. If a page has not been touched in between, it
 is idle and a candidate for pageout or reuse (but only after a second
 verify). By noting how long each page has been idle, the OS implements
 LRU.

 The page scanner runs 4 times/sec and activates only when the free list
 runs short. It then scans only a fraction of all pages. If the shortage
 persists, both the fraction of pages scanned and the scan frequency are
 increased (up to scanning the maximum number of pages on every clock
 tick when there is a desperate shortage of RAM). The total time needed
 to pass over all pages in the system depends on RAM usage and the
 shortage of it, and is a highly dynamic figure.

 Basically, if there is lots of RAM, the OS tends to scan very slowly (if
 at all), so pages are marked for reuse only after a relatively long time
 (up to hours). If there is very little RAM, pages are scanned very fast
 and only the most active ones escape pageout selection. There are some
 exceptions. For example, pages known to be (MAP_)shared between
 processes are paged out only if there are no other pages to free and
 each shared page has been found idle many times (8) in a row by the page
 scanner. This lets process text and libraries stay in RAM in preference
 to other pages.

 If a page is dirty, it cannot be added to the free list until it has
 been flushed to backing store, be it a swap file or an mmap()ed file.
 After a page is flushed it counts as "touched", and it becomes a pageout
 candidate again only after the page scanner has found it idle. So dirty
 pages are less likely to be paged out than read-only idle pages. Dirty
 pages are timestamped when they become dirty and are then given time to
 "accumulate changes", i.e. they are not scheduled to disk until the
 maximum allowable time passes (for example, Solaris will not page out a
 page until it has been dirty for 30 secs on an idle system, dictated by
 the run times of the special fsflush process) or until the page scanner
 has passed over all pages in the system, which is possibly much sooner.
 Of course fsync() and the like can force flushes, but you know what that
 means.

 Whenever the kernel "looks" at VM pages, all processes wait; there are
 no concurrent accesses to the page translation tables. The kernel
 basically examines the system's whole VM-to-physical mapping. When it
 decides to schedule some pages to disk, it most probably (I'm not dead
 sure here) marks them copy-on-write and simply lets pageout write them
 out. With copy-on-write the kernel avoids having to copy the page aside,
 and it also guarantees the integrity of the page data: modifying the
 page while a pageout is pending would create a copy of it in user space.

 So, basically, when you use mmap()ed files and modify them, their pages
 tend to stay in memory longer than the usual read-only file-cache pages,
 but they are paged out before text pages would be. Also, dirty pages are
 flushed every 30 secs by default, without any need for a special flush
 call.

 For comparison with the usual read()/write(): every page in Squid's
 memory map (such as anonymous malloc()ed memory) competes equally with
 the system's file-system cache pages, and in a prolonged memory shortage
 the malloc()ed memory becomes a pageout candidate and is paged out to
 the swap device. With an mmap()ed file, the file itself is the swap for
 its memory: there is no need to page out an idle page, it is simply
 reused, avoiding physical disk I/O.

 Also, as the same page scanner and flusher run over the fs cache pages,
 those are flushed to disk every 30 secs, and for that time most of the
 system stalls waiting for the flush to finish. (On Solaris you can see
 this with perfmeter: there are disk-activity spikes about every 25-30
 secs, and at the same time there are holes in CPU utilisation, most
 probably because of waiting on I/O.)

 One more thing. Squid's current store database is scattered widely over
 memory; it is large, and many of its parts are mostly idle, so they are
 eventually paged out to swap. With a much more compact mmap()ed index it
 is much easier to arrange that most of its pages are constantly
 "touched" and thus not paged out. To achieve this, I think it is right
 to arrange the store index as an array of fixed-length items that is
 always filled from the ground up. That way the whole structure stays as
 hot as possible and is less prone to pageouts. The same goes for file
 allocation on disk: the more tightly the files are packed together, the
 less impact there is on the directory and inode caches.

 It is okay to allocate mmap() storage for all possible files (storage
 entries) on disk as one fixed-length file, because if the entries are
 used from the ground up, the unused pages of the mmap are simply
 reclaimed. There is no waste of RAM, and there is always a guaranteed
 amount of disk space for the storage database.

 As for when mmap() is appropriate: the usual advice is to use it for
 long-lived, large, randomly accessed data. For short-lived, sequentially
 accessed data, read()/write() is much preferred. One reason is that
 modern OSes use read-ahead and free-behind algorithms that speed up
 reads and reduce the memory footprint of rare sequential reads; mmap()
 disables those algorithms. And of course mmap() itself is a pretty
 expensive call.

 Taking all that into account, I'd vote for using mmap() for the store
 index data, but NOT for the actual object data.

 ----------------------------------------------------------------------
  Andres Kroonmaa mail: andre@online.ee
  Network Manager
  Organization: MicroLink Online Tel: 6308 909
  Tallinn, Sakala 19 Pho: +372 6308 909
  Estonia, EE0001 http://www.online.ee Fax: +372 6308 901
 ----------------------------------------------------------------------
Received on Tue Jul 29 2003 - 13:15:52 MDT

This archive was generated by hypermail pre-2.1.9 : Tue Dec 09 2003 - 16:11:53 MST