@@ -1763,3 +1763,254 @@ message can get through to the mailing list cleanly
17631763
17641764
17651765
1766+ From pgsql-committers-owner+M9273=maillist=candle.pha.pa.us@postgresql.org Thu Mar 6 19:37:25 2003
1767+ Return-path: <pgsql-committers-owner+M9273=maillist=candle.pha.pa.us@postgresql.org>
1768+ Received: from relay2.pgsql.com (relay2.pgsql.com [64.49.215.143])
1769+ by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id h270bM624923
1770+ for <maillist@candle.pha.pa.us>; Thu, 6 Mar 2003 19:37:24 -0500 (EST)
1771+ Received: from postgresql.org (postgresql.org [64.49.215.8])
1772+ by relay2.pgsql.com (Postfix) with ESMTP id 4D5CDEE0411
1773+ for <maillist@candle.pha.pa.us>; Thu, 6 Mar 2003 19:37:23 -0500 (EST)
1774+ X-Original-To: pgsql-committers@postgresql.org
1775+ Received: from perrin.int.nxad.com (internal.ext.nxad.com [69.1.70.251])
1776+ by postgresql.org (Postfix) with ESMTP
1777+ id 3120E47646F; Thu, 6 Mar 2003 19:36:58 -0500 (EST)
1778+ Received: by perrin.int.nxad.com (Postfix, from userid 1001)
1779+ id 9CBE42105B; Thu, 6 Mar 2003 16:36:40 -0800 (PST)
1780+ Date: Thu, 6 Mar 2003 16:36:40 -0800
1781+ From: Sean Chittenden <sean@chittenden.org>
1782+ To: Tom Lane <tgl@sss.pgh.pa.us>
1783+ cc: Christopher Kings-Lynne <chriskl@familyhealth.com.au>,
1784+ pgsql-committers@postgresql.org, pgsql-performance@postgresql.org
1785+ Subject: Re: [COMMITTERS] pgsql-server/ /configure /configure.in rc/incl ...
1786+ Message-ID: <20030307003640.GF79234@perrin.int.nxad.com>
1787+ References: <20030306031656.1876F4762E0@postgresql.org> <032f01c2e390$b1842b20$6500a8c0@fhp.internal> <11077.1046921667@sss.pgh.pa.us> <033f01c2e392$71476570$6500a8c0@fhp.internal> <12228.1046922471@sss.pgh.pa.us> <20030306094117.GA79234@perrin.int.nxad.com> <15071.1046964336@sss.pgh.pa.us>
1788+ MIME-Version: 1.0
1789+ Content-Type: multipart/signed; micalg=pgp-sha1;
1790+ protocol="application/pgp-signature"; boundary="HjNkcEWJ4DMx36DP"
1791+ Content-Disposition: inline
1792+ In-Reply-To: <15071.1046964336@sss.pgh.pa.us>
1793+ User-Agent: Mutt/1.4i
1794+ X-PGP-Key: finger seanc@FreeBSD.org
1795+ X-PGP-Fingerprint: 3849 3760 1AFE 7B17 11A0 83A6 DD99 E31F BC84 B341
1796+ X-Web-Homepage: http://sean.chittenden.org/
1797+ Precedence: bulk
1798+ Sender: pgsql-committers-owner@postgresql.org
1799+ Status: OR
1800+
1801+ --HjNkcEWJ4DMx36DP
1802+ Content-Type: text/plain; charset=us-ascii
1803+ Content-Disposition: inline
1804+ Content-Transfer-Encoding: quoted-printable
1805+
1806+ [moving to -performance, please drop -committers from replies]
1807+
1808+ > > I've toyed with the idea of adding this because it is monstrously more
1809+ > > efficient than select()/poll() in basically every way, shape, and
1810+ > > form.
1811+ >
1812+ > From what I've looked at, kqueue only wins when you are watching a
1813+ > large number of file descriptors at the same time; which is an
1814+ > operation done nowhere in Postgres. I think the above would be a
1815+ > complete waste of effort.
1816+
1817+ It scales very well to many thousands of descriptors, but it also
1818+ works well on small numbers. kqueue is about 5x faster than
1819+ select() or poll() at the low end of the fd count. As I said
1820+ earlier, I don't think there is _much_ to gain in this regard; I
1821+ do think it would be a speed improvement, but only on one OS
1822+ supported by PostgreSQL. I think that there are bigger speed
1823+ improvements to be had elsewhere in the code.
1824+
1825+ > > Is this one of the areas of PostgreSQL that just needs to get
1826+ > > slowly migrated to use mmap() or are there any gaping reasons why
1827+ > > to not use the family of system calls?
1828+ >
1829+ > There has been much speculation on this, and no proof that it
1830+ > actually buys us anything to justify the portability hit.
1831+
1832+ Actually, I think that it wouldn't be that big of a portability hit
1833+ because you still would read() and write() as always, but in
1834+ performance sensitive areas, an #ifdef HAVE_MMAP section would have
1835+ the appropriate mmap() calls. If the system doesn't have mmap(),
1836+ there isn't much to lose and we're in the same position we're in now.
1837+
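A minimal sketch of that #ifdef arrangement, for illustration only. HAVE_MMAP is assumed to be a configure-defined macro, and read_block() is a hypothetical helper, not a function from the PostgreSQL tree. With the macro undefined the code is plain lseek()/read(); with it defined the mmap() path is tried first and the portable path remains the fallback.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#ifdef HAVE_MMAP                /* assumed to come from configure */
#include <sys/mman.h>
#endif

/*
 * read_block (hypothetical): copy len bytes at file offset off into buf.
 * Returns 0 on success, -1 on failure.  In the mmap() branch the caller
 * is assumed to pass a page-aligned off and a range that lies entirely
 * inside the file.
 */
int
read_block(int fd, off_t off, void *buf, size_t len)
{
#ifdef HAVE_MMAP
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);

    if (p != MAP_FAILED)
    {
        memcpy(buf, p, len);
        munmap(p, len);
        return 0;
    }
    /* mapping failed: fall through to the portable path */
#endif
    if (lseek(fd, off, SEEK_SET) == (off_t) -1)
        return -1;
    return read(fd, buf, len) == (ssize_t) len ? 0 : -1;
}
```

Building with -DHAVE_MMAP selects the mmap() branch; without it the behavior is the same as today's read() path.
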
1838+ > There would be some nontrivial problems to solve, such as the
1839+ > mechanics of accessing a large number of files from a large number
1840+ > of backends without running out of virtual memory. Also, is it
1841+ > guaranteed that multiple backends mmap'ing the same block will
1842+ > access the very same physical buffer, and not multiple copies?
1843+ > Multiple copies would be fatal. See the archives for more
1844+ > discussion.
1845+
1846+ Have read through the archives. Making a call to madvise() will speed
1847+ up access to the pages as it gives hints to the VM about what order
1848+ the pages are accessed/used. Here are a few bits from the BSD mmap()
1849+ and madvise() man pages:
1850+
1851+ mmap(2):
1852+ MAP_NOSYNC Causes data dirtied via this VM map to be flushed to
1853+ physical media only when necessary (usually by the
1854+ pager) rather than gratuitously. Typically this prevents
1855+ the update daemons from flushing pages dirtied
1856+ through such maps and thus allows efficient sharing of
1858+ memory across unassociated processes using a file-backed
1859+ shared memory map. Without this option any VM
1860+ pages you dirty may be flushed to disk every so often
1861+ (every 30-60 seconds usually) which can create performance
1863+ problems if you do not need that to occur (such
1864+ as when you are using shared file-backed mmap regions
1865+ for IPC purposes). Note that VM/filesystem coherency
1866+ is maintained whether you use MAP_NOSYNC or not. This
1868+ option is not portable across UNIX platforms (yet),
1869+ though some may implement the same behavior by default.
1871+
1872+ WARNING! Extending a file with ftruncate(2), thus creating
1874+ a big hole, and then filling the hole by modifying
1876+ a shared mmap() can lead to severe file fragmentation.
1878+ In order to avoid such fragmentation you should
1880+ always pre-allocate the file's backing store by
1881+ write()ing zeros into the newly extended area prior to
1883+ modifying the area via your mmap(). The fragmentation
1885+ problem is especially sensitive to MAP_NOSYNC pages,
1886+ because pages may be flushed to disk in a totally random
1888+ order.
1889+
1890+ The same applies when using MAP_NOSYNC to implement a
1891+ file-based shared memory store. It is recommended that
1893+ you create the backing store by write()ing zeros to
1894+ the backing file rather than ftruncate()ing it. You
1895+ can test file fragmentation by observing the KB/t
1896+ (kilobytes per transfer) results from an ``iostat 1''
1897+ while reading a large file sequentially, e.g. using
1898+ ``dd if=filename of=/dev/null bs=32k''.
1899+
1900+ The fsync(2) function will flush all dirty data and
1901+ metadata associated with a file, including dirty NOSYNC
1903+ VM data, to physical media. The sync(8) command and
1904+ sync(2) system call generally do not flush dirty NOSYNC
1906+ VM data. The msync(2) system call is obsolete since
1907+ BSD implements a coherent filesystem buffer cache.
1908+ However, it may be used to associate dirty VM pages
1909+ with filesystem buffers and thus cause them to be
1910+ flushed to physical media sooner rather than later.
1911+
1912+ madvise(2):
1913+ MADV_NORMAL Tells the system to revert to the default paging behavior.
1916+
1917+ MADV_RANDOM Is a hint that pages will be accessed randomly, and
1918+ prefetching is likely not advantageous.
1919+
1920+ MADV_SEQUENTIAL Causes the VM system to depress the priority of pages
1921+ immediately preceding a given page when it is faulted
1922+ in.
1923+
1924+ mprotect(2):
1925+ The mprotect() system call changes the specified pages to have protection
1927+ prot. Not all implementations will guarantee protection on a page basis;
1929+ the granularity of protection changes may be as large as an entire
1930+ region. A region is the virtual address space defined by the start and
1931+ end addresses of a struct vm_map_entry.
1932+
1933+ Currently these protection bits are known, which can be combined, OR'd
1934+ together:
1935+
1936+ PROT_NONE No permissions at all.
1937+
1938+ PROT_READ The pages can be read.
1939+
1940+ PROT_WRITE The pages can be written.
1941+
1942+ PROT_EXEC The pages can be executed.
1943+
1944+ msync(2):
1945+ The msync() system call writes any modified pages back to the filesystem
1947+ and updates the file modification time. If len is 0, all modified pages
1949+ within the region containing addr will be flushed; if len is non-zero,
1950+ only those pages containing addr and len-1 succeeding locations will be
1951+ examined. The flags argument may be specified as follows:
1952+
1953+ MS_ASYNC Return immediately
1954+ MS_SYNC Perform synchronous writes
1955+ MS_INVALIDATE Invalidate all cached data
1956+
1957+
1958+ A few thoughts come to mind:
1959+
1960+ 1) backends could share buffers by mmap()'ing shared regions of data.
1961+ While I haven't seen any numbers to reflect this, I'd wager that
1962+ mmap() is a faster interface than ipc.
1963+
1964+ 2) It looks like while there are various file IO schemes scattered all
1965+ over the place, the bulk of the critical routines that would need
1966+ to be updated are in backend/storage/file/fd.c, more specifically:
1967+
1968+ *) fileNameOpenFile() would need the appropriate mmap() call made
1969+ to it.
1970+
1971+ *) FileTruncate() would need some attention to avoid fragmentation.
1972+
1973+ *) a new "sync" GUC would have to be introduced to handle msync
1974+ (affects only pg_fsync() and pg_fdatasync()).
1975+
1976+ 3) There's a bit of code in pgsql/src/backend/storage/smgr that could
1977+ be gutted/removed. Which of those storage types are even used any
1978+ more? There's a reference in the code to PostgreSQL 3.0. :)
1979+
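Point 1 above, and the earlier question about whether two backends mapping the same block see the same data, can be probed with a small experiment. This is only a sketch: shared_write_visible(), map_region() and REGION_LEN are names made up for the example, and a positive result on one machine says nothing about platforms without a unified buffer cache.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define REGION_LEN 4096             /* arbitrary size for the experiment */

/* Open path and map its first REGION_LEN bytes shared; NULL on failure. */
static char *
map_region(const char *path)
{
    int   fd = open(path, O_RDWR);
    char *p;

    if (fd < 0)
        return NULL;
    p = mmap(NULL, REGION_LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping outlives the descriptor */
    return p == MAP_FAILED ? NULL : p;
}

/*
 * Returns 1 if a byte stored by a child through its own mapping is seen
 * by the parent through a separate mapping, 0 if not, -1 on error.
 */
int
shared_write_visible(const char *path)
{
    char  zeros[REGION_LEN] = {0};
    char *mine;
    pid_t pid;
    int   status;
    int   seen;
    int   fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    /* allocate the backing store with write(), not ftruncate() */
    if (write(fd, zeros, sizeof(zeros)) != (ssize_t) sizeof(zeros))
    {
        close(fd);
        return -1;
    }
    close(fd);

    if ((mine = map_region(path)) == NULL)
        return -1;

    pid = fork();
    if (pid == 0)
    {
        char *theirs = map_region(path);    /* independent mapping */

        if (theirs == NULL)
            _exit(1);
        theirs[0] = 'X';
        _exit(0);
    }
    if (pid < 0 || waitpid(pid, &status, 0) < 0 ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0)
    {
        munmap(mine, REGION_LEN);
        return -1;
    }
    seen = (mine[0] == 'X');
    munmap(mine, REGION_LEN);
    return seen;
}
```
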
1980+ And I think that'd be it. The LRU code could be used if necessary to
1981+ help manage the amount of mmap()'ed data in the VM at any one time; at the
1982+ very least that could be handled by a shm var that various backends
1983+ would increment/decrement as files are open()'ed/close()'ed.
1984+
1985+ I didn't spend too long looking at this, but I _think_ that'd cover
1986+ 80% of PostgreSQL's disk access needs. The next bit to possibly add
1987+ would be passing a flag on FileOpen operations that'd act as a hint to
1988+ madvise(), so that the VM could proactively react to PostgreSQL's
1989+ needs.
1990+
1991+ I don't have my copy of Stevens handy (it's some 700mi away atm
1992+ otherwise I'd cite it), but if Tom or someone else has it handy, look
1993+ up the example re: the performance gain from read()'ing an mmap()'ed
1994+ file versus a non-mmap()'ed file. The difference is non-trivial and
1995+ _WELL_ worth the time given the speed increase. The same speed
1996+ benefit held true for writes as well, iirc. It's been a while, but I
1997+ think it was around page 330. The index has it listed and it's not
1998+ that hard of an example to find. -sc
1999+
2000+ -- 
2001+ Sean Chittenden
2002+
2003+ --HjNkcEWJ4DMx36DP
2004+ Content-Type: application/pgp-signature
2005+ Content-Disposition: inline
2006+
2007+ -----BEGIN PGP SIGNATURE-----
2008+ Comment: Sean Chittenden <sean@chittenden.org>
2009+
2010+ iD8DBQE+Z+mY3ZnjH7yEs0ERAjVkAJwMI1V7+HvMAA5ODadD5znsekI8TQCgvH0C
2011+ KwvG7YLsJ+xpsTUS67KD+4M=
2012+ =w8/7
2013+ -----END PGP SIGNATURE-----
2014+
2015+ --HjNkcEWJ4DMx36DP--
2016+