Commit 2e6887d

Add to mmap emails.

1 parent b7e089f

1 file changed: +251 -0

doc/TODO.detail/mmap (251 additions, 0 deletions)
@@ -1763,3 +1763,254 @@ message can get through to the mailing list cleanly
From pgsql-committers-owner+M9273=maillist=candle.pha.pa.us@postgresql.org Thu Mar 6 19:37:25 2003
Return-path: <pgsql-committers-owner+M9273=maillist=candle.pha.pa.us@postgresql.org>
Received: from relay2.pgsql.com (relay2.pgsql.com [64.49.215.143])
    by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id h270bM624923
    for <maillist@candle.pha.pa.us>; Thu, 6 Mar 2003 19:37:24 -0500 (EST)
Received: from postgresql.org (postgresql.org [64.49.215.8])
    by relay2.pgsql.com (Postfix) with ESMTP id 4D5CDEE0411
    for <maillist@candle.pha.pa.us>; Thu, 6 Mar 2003 19:37:23 -0500 (EST)
X-Original-To: pgsql-committers@postgresql.org
Received: from perrin.int.nxad.com (internal.ext.nxad.com [69.1.70.251])
    by postgresql.org (Postfix) with ESMTP
    id 3120E47646F; Thu, 6 Mar 2003 19:36:58 -0500 (EST)
Received: by perrin.int.nxad.com (Postfix, from userid 1001)
    id 9CBE42105B; Thu, 6 Mar 2003 16:36:40 -0800 (PST)
Date: Thu, 6 Mar 2003 16:36:40 -0800
From: Sean Chittenden <sean@chittenden.org>
To: Tom Lane <tgl@sss.pgh.pa.us>
cc: Christopher Kings-Lynne <chriskl@familyhealth.com.au>,
    pgsql-committers@postgresql.org, pgsql-performance@postgresql.org
Subject: Re: [COMMITTERS] pgsql-server/ /configure /configure.in rc/incl ...
Message-ID: <20030307003640.GF79234@perrin.int.nxad.com>
References: <20030306031656.1876F4762E0@postgresql.org> <032f01c2e390$b1842b20$6500a8c0@fhp.internal> <11077.1046921667@sss.pgh.pa.us> <033f01c2e392$71476570$6500a8c0@fhp.internal> <12228.1046922471@sss.pgh.pa.us> <20030306094117.GA79234@perrin.int.nxad.com> <15071.1046964336@sss.pgh.pa.us>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
    protocol="application/pgp-signature"; boundary="HjNkcEWJ4DMx36DP"
Content-Disposition: inline
In-Reply-To: <15071.1046964336@sss.pgh.pa.us>
User-Agent: Mutt/1.4i
X-PGP-Key: finger seanc@FreeBSD.org
X-PGP-Fingerprint: 3849 3760 1AFE 7B17 11A0 83A6 DD99 E31F BC84 B341
X-Web-Homepage: http://sean.chittenden.org/
Precedence: bulk
Sender: pgsql-committers-owner@postgresql.org
Status: OR

--HjNkcEWJ4DMx36DP
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

[moving to -performance, please drop -committers from replies]

> > I've toyed with the idea of adding this because it is monstrously more
> > efficient than select()/poll() in basically every way, shape, and
> > form.
>
> From what I've looked at, kqueue only wins when you are watching a
> large number of file descriptors at the same time; which is an
> operation done nowhere in Postgres. I think the above would be a
> complete waste of effort.

It scales very well to many thousands of descriptors, but it also
works well on small numbers. kqueue is about 5x faster than
select() or poll() on the low end of number of fd's. As I said
earlier, I don't think there is _much_ to gain in this regard, but I
do think that it would be a speed improvement, though only on one OS
supported by PostgreSQL. I think that there are bigger speed
improvements to be had elsewhere in the code.

> > Is this one of the areas of PostgreSQL that just needs to get
> > slowly migrated to use mmap() or are there any gaping reasons why
> > to not use the family of system calls?
>
> There has been much speculation on this, and no proof that it
> actually buys us anything to justify the portability hit.

Actually, I think that it wouldn't be that big of a portability hit
because you still would read() and write() as always, but in
performance sensitive areas, an #ifdef HAVE_MMAP section would have
the appropriate mmap() calls. If the system doesn't have mmap(),
there isn't much to lose and we're in the same position we're in now.
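
As a rough illustration of the #ifdef HAVE_MMAP idea, the sketch below
reads one block through mmap() when HAVE_MMAP is defined and uses plain
pread() otherwise. read_block() is a hypothetical helper, not an existing
fd.c routine, and HAVE_MMAP is assumed to be set by configure. Copying out
of the mapping gives up much of what mmap() could buy; the copy is only
there so both paths fit behind one signature.

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>
#ifdef HAVE_MMAP
#include <string.h>
#include <sys/mman.h>
#endif

/*
 * Hypothetical helper: copy len bytes found at file offset off into buf.
 * On the mmap() path, off must be a multiple of the page size and the
 * range must lie inside the file.  Returns 0 on success, -1 on failure.
 */
int
read_block(int fd, off_t off, void *buf, size_t len)
{
#ifdef HAVE_MMAP
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);

    if (p != MAP_FAILED)
    {
        memcpy(buf, p, len);
        munmap(p, len);
        return 0;
    }
    /* mapping failed: fall through to the portable path */
#endif
    return pread(fd, buf, len, off) == (ssize_t) len ? 0 : -1;
}
```

A caller would use read_block(fd, 0, buf, 8192) the same way on either
kind of system; only the build-time define changes which path runs.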

> There would be some nontrivial problems to solve, such as the
> mechanics of accessing a large number of files from a large number
> of backends without running out of virtual memory. Also, is it
> guaranteed that multiple backends mmap'ing the same block will
> access the very same physical buffer, and not multiple copies?
> Multiple copies would be fatal. See the archives for more
> discussion.

Have read through the archives. Making a call to madvise() can speed
up access to the pages as it gives hints to the VM about what order
the pages are accessed/used. Here are a few bits from the BSD mmap(),
madvise(), mprotect(), and msync() man pages:

mmap(2):
     MAP_NOSYNC        Causes data dirtied via this VM map to be flushed to
                       physical media only when necessary (usually by the
                       pager) rather than gratuitously.  Typically this
                       prevents the update daemons from flushing pages
                       dirtied through such maps and thus allows efficient
                       sharing of memory across unassociated processes using
                       a file-backed shared memory map.  Without this option
                       any VM pages you dirty may be flushed to disk every so
                       often (every 30-60 seconds usually) which can create
                       performance problems if you do not need that to occur
                       (such as when you are using shared file-backed mmap
                       regions for IPC purposes).  Note that VM/filesystem
                       coherency is maintained whether you use MAP_NOSYNC or
                       not.  This option is not portable across UNIX
                       platforms (yet), though some may implement the same
                       behavior by default.

                       WARNING!  Extending a file with ftruncate(2), thus
                       creating a big hole, and then filling the hole by
                       modifying a shared mmap() can lead to severe file
                       fragmentation.  In order to avoid such fragmentation
                       you should always pre-allocate the file's backing
                       store by write()ing zeros into the newly extended
                       area prior to modifying the area via your mmap().
                       The fragmentation problem is especially sensitive to
                       MAP_NOSYNC pages, because pages may be flushed to disk
                       in a totally random order.

                       The same applies when using MAP_NOSYNC to implement a
                       file-based shared memory store.  It is recommended
                       that you create the backing store by write()ing zeros
                       to the backing file rather than ftruncate()ing it.
                       You can test file fragmentation by observing the KB/t
                       (kilobytes per transfer) results from an ``iostat 1''
                       while reading a large file sequentially, e.g. using
                       ``dd if=filename of=/dev/null bs=32k''.

                       The fsync(2) function will flush all dirty data and
                       metadata associated with a file, including dirty
                       NOSYNC VM data, to physical media.  The sync(8)
                       command and sync(2) system call generally do not flush
                       dirty NOSYNC VM data.  The msync(2) system call is
                       obsolete since BSD implements a coherent filesystem
                       buffer cache.  However, it may be used to associate
                       dirty VM pages with filesystem buffers and thus cause
                       them to be flushed to physical media sooner rather
                       than later.
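
The pre-allocation advice in the WARNING above could look roughly like
the sketch below: the file is grown by writing zeros instead of calling
ftruncate(), and only then mapped. extend_and_map() is a hypothetical
helper name. MAP_NOSYNC is BSD-specific, so the sketch defines it to 0
(no effect) where the platform lacks it; that fallback is an assumption
made here for portability, not something the man page prescribes.

```c
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef MAP_NOSYNC
#define MAP_NOSYNC 0            /* BSD-only flag; harmless no-op elsewhere */
#endif

/*
 * Hypothetical helper: grow the file behind fd by len bytes, starting at
 * old_size (which must be page-aligned), by write()ing zeros.  Then map
 * the new area shared and writable.  Returns the mapping or MAP_FAILED.
 */
void *
extend_and_map(int fd, off_t old_size, size_t len)
{
    char    zeros[8192];
    size_t  done = 0;

    memset(zeros, 0, sizeof(zeros));
    while (done < len)
    {
        size_t  chunk = len - done < sizeof(zeros) ? len - done : sizeof(zeros);
        ssize_t n = pwrite(fd, zeros, chunk, old_size + (off_t) done);

        if (n <= 0)
            return MAP_FAILED;  /* backing store could not be allocated */
        done += (size_t) n;
    }
    return mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_NOSYNC, fd, old_size);
}
```

Because the blocks are allocated in file order by the write loop, later
out-of-order page flushes land in space that already exists on disk.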

madvise(2):
     MADV_NORMAL       Tells the system to revert to the default paging
                       behavior.

     MADV_RANDOM       Is a hint that pages will be accessed randomly, and
                       prefetching is likely not advantageous.

     MADV_SEQUENTIAL   Causes the VM system to depress the priority of pages
                       immediately preceding a given page when it is faulted
                       in.

mprotect(2):
     The mprotect() system call changes the specified pages to have
     protection prot.  Not all implementations will guarantee protection on
     a page basis; the granularity of protection changes may be as large as
     an entire region.  A region is the virtual address space defined by the
     start and end addresses of a struct vm_map_entry.

     Currently these protection bits are known, which can be combined, OR'd
     together:

     PROT_NONE     No permissions at all.

     PROT_READ     The pages can be read.

     PROT_WRITE    The pages can be written.

     PROT_EXEC     The pages can be executed.

msync(2):
     The msync() system call writes any modified pages back to the
     filesystem and updates the file modification time.  If len is 0, all
     modified pages within the region containing addr will be flushed; if
     len is non-zero, only those pages containing addr and len-1 succeeding
     locations will be examined.  The flags argument may be specified as
     follows:

     MS_ASYNC         Return immediately
     MS_SYNC          Perform synchronous writes
     MS_INVALIDATE    Invalidate all cached data
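
For reference, a minimal sketch of calling msync() with the flags listed
above. flush_mapping() is a hypothetical wrapper; it always passes an
explicit length rather than relying on the len == 0 behavior described in
the BSD page, since that behavior is not assumed to hold elsewhere.

```c
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical wrapper: write the dirty pages of [addr, addr + len) back
 * to the file.  addr should be page-aligned.  With wait != 0 the call
 * blocks until the writes finish (MS_SYNC); otherwise it only schedules
 * them (MS_ASYNC).  Returns msync()'s result: 0 on success, -1 on error.
 */
int
flush_mapping(void *addr, size_t len, int wait)
{
    return msync(addr, len, wait ? MS_SYNC : MS_ASYNC);
}
```

After flush_mapping(p, len, 1) returns 0, a read() or pread() on the same
file descriptor should observe the bytes stored through the mapping.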


A few thoughts come to mind:

1) backends could share buffers by mmap()'ing shared regions of data.
   While I haven't seen any numbers to reflect this, I'd wager that
   mmap() is a faster interface than ipc.
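
On the question of whether separate processes mapping the same block see
the same data: with MAP_SHARED, a store made through one mapping is
supposed to be visible through every other shared mapping of that file
region. The sketch below is one way to check this on a given platform;
child_write_visible() is a hypothetical test helper, and the parent and
child each create their own mapping rather than sharing an inherited one.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical check: the parent maps the first page of fd, the child
 * maps it again on its own and stores a byte, and the parent then looks
 * for that byte.  The file must be at least one page long.
 * Returns 1 if the store was seen, 0 if not, -1 on error.
 */
int
child_write_visible(int fd)
{
    size_t  len = (size_t) sysconf(_SC_PAGESIZE);
    char   *mine = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    pid_t   pid;
    int     status;
    int     seen;

    if (mine == MAP_FAILED)
        return -1;

    pid = fork();
    if (pid == 0)
    {
        /* child: a separately established mapping of the same page */
        char *theirs = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);

        if (theirs == MAP_FAILED)
            _exit(1);
        theirs[0] = 'X';
        _exit(0);
    }

    if (pid < 0 || waitpid(pid, &status, 0) < 0 ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        seen = -1;
    else
        seen = (mine[0] == 'X');

    munmap(mine, len);
    return seen;
}
```

This only demonstrates data visibility between two mappings; it says
nothing about the virtual-memory exhaustion concern raised in the quoted
text.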

2) It looks like while there are various file IO schemes scattered all
   over the place, the bulk of the critical routines that would need
   to be updated are in backend/storage/file/fd.c, more specifically:

   *) fileNameOpenFile() would need the appropriate mmap() call made
      to it.

   *) FileTruncate() would need some attention to avoid fragmentation.

   *) a new "sync" GUC would have to be introduced to handle msync
      (affects only pg_fsync() and pg_fdatasync()).

3) There's a bit of code in pgsql/src/backend/storage/smgr that could
   be gutted/removed. Which of those storage types are even used any
   more? There's a reference in the code to PostgreSQL 3.0. :)

And I think that'd be it. The LRU code could be used if necessary to
help manage the amount of mmap()'ed memory in the VM at any one time; at
the very least that could be handled by a shm var that various backends
would increment/decrement as files are open()'ed/close()'ed.

I didn't spend too long looking at this, but I _think_ that'd cover
80% of PostgreSQL's disk access needs. The next bit to possibly add
would be passing a flag on FileOpen operations that'd act as a hint to
madvise(); that way the VM could proactively react to PostgreSQL's
needs.

I don't have my copy of Stevens handy (it's some 700mi away atm
otherwise I'd cite it), but if Tom or someone else has it handy, look
up the example re: the performance gain from read()'ing an mmap()'ed
file versus a non-mmap()'ed file. The difference is non-trivial and
_WELL_ worth the time given the speed increase. The same speed
benefit held true for writes as well, iirc. It's been a while, but I
think it was around page 330. The index has it listed and it's not
that hard of an example to find. -sc

-- 
Sean Chittenden

--HjNkcEWJ4DMx36DP
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Comment: Sean Chittenden <sean@chittenden.org>

iD8DBQE+Z+mY3ZnjH7yEs0ERAjVkAJwMI1V7+HvMAA5ODadD5znsekI8TQCgvH0C
KwvG7YLsJ+xpsTUS67KD+4M=
=w8/7
-----END PGP SIGNATURE-----

--HjNkcEWJ4DMx36DP--
