(Fwd) Re: memory mapped store entries

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Tue, 25 Aug 1998 22:48:54 +0300 (EETDST)

 I'm terribly sorry for reposting, but I managed to screw up my mailer
 to the extent that the message was hardly readable. Apologies.
  
------- Forwarded Message Follows -------
From: Self <ENTER1/ANDRE>
To: Stewart Forster <slf@connect.com.au>
Subject: Re: memory mapped store entries
Copies to: squid-dev@ircache.net
Date sent: Tue, 25 Aug 1998 22:14:03 +0300 (EETDST)

On 25 Aug 98, at 10:33, Stewart Forster <slf@connect.com.au> wrote:

 hi,

 mmap() is NOT as bad as it is usually feared to be, but of course it
 should be used where it fits best.

> > The process shouldn't be waiting on this, since the kernel would be
> > doing this in the background (if at all; since we're the only one opening
> > the file, the system could keep all pages in memory until we terminate
> > and then write them back).
>
> I'm unclear as to how the kernel handles the flushing of memory
> pages out to disk. If this is in progress will the kernel block the
> current page access until that flush is complete? I thought it did,
> since to do otherwise would require the kernel to copy the data and then
> write that out. However mmap() states that it doesn't do this and just
> writes from the same page copy in RAM as the process has access to. So
> my point still stands that the main thread will stall while these kinds
> of flushes happen. I suppose we could pre-empt this flushing by calling
> fsync(), but then we're still stalling the main thread. If we push fsync()
> into a sub-thread the main thread will still stall while the fsync() completes.
>
> I'm happy to be corrected on the above point if you can prove otherwise.
 
 There seems to be lots of confusion regarding how the OS manages the page
 cache. To make this clear, and to compare the relative merits of mmap versus
 read/write, you really need to know more about the VM system of any given OS.
 I can talk about Solaris, but I believe these points stand for many others
 as well. For a good reference on Solaris, read the paper
   http://www.sun.com/sun-on-net/performance/vmsizing.ps.Z
 (for anyone wishing to really understand the Solaris VM, this is a must-read)

 (also some others:
   http://www.sunworld.com/sunworldonline/swol-05-1997/swol-05-perf.html
   http://opcom.sun.ca/white-papers/white-papers.html
   http://www.sunworld.com/swol-12-1997/swol-12-insidesolaris.html
 )

 For those who won't read the PostScript file, here is a short overview.
 The VM operates on pages. There are 3 kinds of pages:
  1) attached page, accessed recently
  2) attached page, dirty or modified and not yet flushed
  3) unattached page, on the free list

 The kernel constantly maintains a free list, so that it has a pool from
 which it can allocate real memory. To keep that free list populated, the
 kernel scans all pages in physical memory every now and then. It does this
 with 2 pointers (hands): one goes ahead and marks pages as not-accessed,
 then the 2nd comes along later and checks whether those bits have been set
 again by the MMU hardware. If a page has not been touched, it is idle and a
 candidate for pageout or reuse (but only after that second verification).
 By noting how long any page has been idle, the OS implements LRU.
 The page scanner runs 4 times/sec and only activates if the free list runs
 short. It then scans only a fraction of all available pages. If the shortage
 persists, the fraction of pages scanned is increased, as is the frequency of
 the scan (up to scanning the max number of pages on every clock tick when
 there is a desperate shortage of RAM).
 The total time needed to pass over all pages in the system depends on RAM
 usage and the shortage of it, and is a highly dynamic figure.
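 To make the two-hand scheme concrete, here is a minimal toy sketch of it in
 C. This is only an illustration of the algorithm described above, not the
 actual Solaris code; NPAGES, HANDSPREAD and the page struct are invented,
 and the "MMU" is simulated by re-marking two hot pages on every tick:

    #include <stdio.h>

    #define NPAGES     16
    #define HANDSPREAD 4        /* distance between the two hands */

    struct page {
        int referenced;         /* set by the MMU when the page is touched */
        int freed;              /* reclaimed onto the free list */
    };

    static struct page mem[NPAGES];

    static void clock_tick(int front)
    {
        int back = (front + NPAGES - HANDSPREAD) % NPAGES;

        mem[front].referenced = 0;      /* front hand: clear the ref bit */

        /* back hand: if the bit is still clear, the page sat idle for a
         * whole hand-spread and can be reclaimed */
        if (!mem[back].referenced && !mem[back].freed) {
            mem[back].freed = 1;
            printf("page %d reclaimed\n", back);
        }
    }

    int main(void)
    {
        for (int tick = 0; tick < 2 * NPAGES; tick++) {
            mem[0].referenced = 1;      /* hot pages keep getting touched */
            mem[1].referenced = 1;
            clock_tick(tick % NPAGES);
        }
        return 0;                       /* pages 0 and 1 survive, rest freed */
    }

 Run it and every page except the two hot ones gets reclaimed; shrink
 HANDSPREAD or tick faster and the scanner gets more aggressive, which is
 exactly the shortage behaviour described above.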

 Basically, if there is lots of RAM, the OS tends to scan very slowly (if at
 all), and thus pages are marked for reuse only after a relatively long time
 (up to hours). If there is very little RAM, then pages are scanned very fast
 and only the most active ones are left out of the pageout selection. There
 are some exceptions. For example, pages that are known to be (map_)shared
 between processes are paged out only if there are no other pages to free and
 each shared page has been found idle many times (8) in a row by the page
 scanner. This allows process text and libraries to remain in RAM in
 preference to other pages.

 If a page is dirty, it cannot be added to the free list until it is flushed
 to backing store, be it either a swapfile or an mmaped file. After a page is
 flushed, it "is touched" and becomes a candidate for pageout only after the
 page scanner has again detected it to be idle.

 So, dirty pages are less likely to get paged out than RO idle pages.

 Dirty pages are tracked by how long they have been dirty. They are given
 time to "accumulate changes", i.e. they are not scheduled to disk until the
 maximum allowable time passes. (For example, Solaris on an idle system would
 not page out a page until it has been dirty for 30 secs, dictated by the run
 times of the special fsflush process.) Under memory pressure the flush can
 happen within the time it takes to scan all pages in the system, i.e.
 possibly much sooner. Of course, fsync() or the like can force flushes, but
 you know what that means for the calling thread.
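 To make that last point concrete, here is a hedged little example of forcing
 the flush yourself with msync() on an mmaped page; the file name is made up
 for the demo. The calling thread stalls inside msync() until the data is on
 disk, which is exactly the stall discussed above:

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/mmap-demo", O_RDWR | O_CREAT, 0600);
        if (fd < 0 || ftruncate(fd, 4096) < 0)
            return 1;

        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return 1;

        memcpy(p, "dirty data", 10);    /* the page is now dirty */

        /* Without this call, fsflush / the page scanner would write the
         * page back within ~30 secs anyway; with it, we block right here
         * until the write completes. */
        msync(p, 4096, MS_SYNC);

        munmap(p, 4096);
        close(fd);
        return 0;
    }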

 Whenever the kernel "looks" at VM pages, all processes are waiting; there
 are no concurrent accesses to the page translation tables. The kernel
 basically looks at the system's whole VM-to-physical mapping. When the
 kernel decides to schedule some pages to disk, it most probably (I'm not
 dead sure here) marks them copy-on-write and simply lets pageout write them
 out. With copy-on-write, the kernel avoids the need to copy the page aside
 and also guarantees the integrity of the page data, as modifying it while
 the pageout is pending would make a copy of it in user space.

 So, basically, when you use mmaped files and modify them, their pages tend
 to stay in memory compared to the usual RO file cache pages, but they are
 paged out before text pages would be. Also, dirty pages are flushed every
 30 secs by default, without any need for a special flush call.

 To give a comparison to the usual read/write, every page in squid's memory
 map (like anonymous malloced memory) competes equally with the system's
 file system cache pages, and in case of prolonged memory shortage squid's
 malloced memory is made a candidate for pageout and paged out to the swap
 device. When using an mmaped file, the file itself is the swap for its
 memory, so there is no need to page out an idle page: it is simply reused,
 avoiding physical disk io.

 Also, as the same page scanner and flusher run over the fs cache pages,
 those are flushed to disk every 30 secs, and for that time most of the
 system is stalled waiting for the flush to finish. (On Solaris you can see
 this using perfmeter: there are disk activity spikes about every 25-30 secs
 and, at the same time, holes in CPU utilisation, most probably because of
 waiting on io.)

 One more thing. Squid's current store database is very scattered over
 memory; it is large and lots of its parts are mostly idle, so they are
 eventually paged out to swap. With a much more compact index kept in an
 mmap, it is much easier to arrange that most of its pages are constantly
 "touched" and thus not paged out. To achieve this, I think it is right to
 arrange the Store index as an array of fixed-length items that is always
 filled from the ground up. This way the whole structure is as hot as
 possible and less prone to pageouts.
 The same goes for file allocation on disks - the more tightly files are
 packed together, the less impact there is on the directory and inode caches.
 It is okay to allocate the mmap storage for all possible files (storage
 entries) on disk as a fixed-length file, because if the slots are used from
 the ground up, then the unused pages of the mmap are simply reclaimed, there
 is no waste of RAM, and there is always a guaranteed amount of disk space
 for the storage database.
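 As a rough sketch of what I mean, in C: the index as a fixed-size mmaped
 file of fixed-length slots, filled from slot 0 upward. The struct fields,
 names and sizes here are invented for illustration; this is not squid's
 actual StoreEntry:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <time.h>
    #include <unistd.h>

    #define MAX_ENTRIES (1 << 20)   /* fixed upper bound, decided up front */

    struct store_slot {             /* hypothetical fixed-length entry */
        unsigned char key[16];      /* e.g. MD5 of the URL */
        int           file_no;      /* which on-disk file holds the object */
        int           object_size;
        time_t        last_ref;
    };

    static struct store_slot *index_map;
    static size_t next_free;        /* ground-up allocation pointer */

    int store_index_open(const char *path)
    {
        size_t len = MAX_ENTRIES * sizeof(struct store_slot);
        int fd = open(path, O_RDWR | O_CREAT, 0600);

        if (fd < 0 || ftruncate(fd, len) < 0)
            return -1;
        index_map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        close(fd);                  /* the mapping stays valid after close */
        return index_map == MAP_FAILED ? -1 : 0;
    }

    struct store_slot *store_slot_alloc(void)
    {
        if (next_free >= MAX_ENTRIES)
            return NULL;            /* index full */
        return &index_map[next_free++];
    }

 Because slots are handed out densely from the bottom, the live part of the
 index stays packed into as few pages as possible, and the pages beyond
 next_free are never touched, so they cost neither RAM nor pageout io.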

 When speaking of the usability of mmap, they say it should be used for
 long-lived, large, randomly accessed data. For short-lived, sequentially
 accessed data, read/write is much preferred. One reason is that modern OSes
 use read-ahead and free-behind algorithms that speed up reads and reduce the
 memory usage of rare sequential reads; mmap disables such algorithms. And of
 course, mmap is a pretty expensive call.

 Taking all that into account, I'd vote for using mmap for store index data,
 but NOT for actual object data.

 ----------------------------------------------------------------------
  Andres Kroonmaa                       mail: andre@online.ee
  Network Manager
  Organization: MicroLink Online        Tel: 6308 909
  Tallinn, Sakala 19                    Pho: +372 6308 909
  Estonia, EE0001  http://www.online.ee Fax: +372 6308 901
 ----------------------------------------------------------------------