Commit 91c3e49

CFS branch for PGPROEE9_6

1 parent 89ace65

3 files changed, +1309 additions, 0 deletions

doc/src/sgml/cfs.sgml

Lines changed: 300 additions & 0 deletions

@@ -0,0 +1,300 @@
<!-- doc/src/sgml/cfs.sgml -->

<chapter id="cfs">
 <title>Compressed file system</title>

 <para>
  This chapter explains page-level compression and encryption in the
  <productname>PostgreSQL</> database system.
 </para>

 <sect1 id="cfs-overview">
  <title>Why database compression/encryption may be useful</title>

  <para>
   Databases typically store large amounts of text and duplicated information. This is why compression of most databases
   can be quite efficient and reduce the used storage size by a factor of 3 to 5. Postgres performs compression of TOAST data, but small
   text fields which fit in a page are not compressed. Also, not only heap pages can be compressed: indexes on text keys
   or indexes with a large number of duplicate values are also good candidates for compression.
  </para>

  <para>
   Postgres works with disk data through a buffer pool which caches the most frequently used buffers.
   The interface between the buffer manager and the file system is the most natural place for performing compression.
   Buffers are stored on the disk in compressed form to reduce disk usage and minimize the amount of data to be read,
   while the in-memory buffer pool contains uncompressed buffers, providing access to the records at the same speed as without
   compression. Since modern servers have a large enough amount of RAM, a substantial part of the database can be cached in
   memory and accessed without any compression overhead penalty.
  </para>

  <para>
   Besides the obvious advantage of saving disk space, compression can also improve system performance.
   There are two main reasons for this:
  </para>

  <variablelist>
   <varlistentry>
    <term>Reducing the amount of disk IO</term>
    <listitem>
     <para>
      Compression helps to reduce the amount of data which has to be written to the disk or read from it.
      A compression ratio of 3 means that you need to read 3 times less data, or that the same number of records can be fetched
      3 times faster.
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term>Improving locality</term>
    <listitem>
     <para>
      When modified buffers are flushed from the buffer pool to the disk, they are written to random locations
      on the disk. The Postgres cache replacement algorithm decides which buffer to evict from the pool
      based on its access frequency, ignoring its location on the disk. So two subsequently written buffers can be
      located in completely different parts of the disk. For an HDD, seek time is quite large - about 10 msec, which corresponds
      to 100 random writes per second. The speed of sequential writes can be about 100Mb/sec, which corresponds to
      10000 buffers per second (100 times faster). For an SSD the gap between sequential and random write speed is smaller,
      but sequential writes are still more efficient. How does this relate to data compression?
      The size of a buffer in PostgreSQL is fixed (8kb by default). The size of a compressed buffer depends on the content of the buffer,
      so an updated buffer cannot always fit in its old location on the disk. This is why we cannot access pages directly
      by their address. Instead, we have to use a map which translates the logical address of a page to its physical location
      on the disk. Certainly, this extra level of indirection adds overhead. But in most cases this map fits in memory,
      so a page lookup is nothing more than accessing an array element. The presence of this map also has a positive effect:
      we can now write updated pages sequentially, just updating their map entries.
      Postgres does a lot to avoid a "write storm" - intensive flushing of data to the disk when buffer pool space is
      exhausted. Compression allows the disk load to be significantly reduced.
     </para>
    </listitem>
   </varlistentry>
  </variablelist>

  <para>
   Another useful feature which can be combined with compression is database encryption.
   Encryption protects your database from unintended access: if somebody steals your notebook or hard drive, or makes
   a copy of it, the thief will not be able to extract information from your database if it is encrypted.
   Postgres provides the contrib module pgcrypto, allowing you to encrypt particular types/columns.
   But a safer and more convenient way is to encrypt all data in the database. Encryption can be combined with compression:
   data is stored on disk in encrypted form and decrypted when a page is loaded into the buffer pool.
   It is essential that compression is performed before encryption, otherwise encryption eliminates regularities in the
   data and the compression rate will be close to 1.
  </para>

  <para>
   Why do we need to perform compression/encryption in Postgres instead of using the corresponding features of the underlying file
   system? The first answer is that there are not many file systems supporting compression and encryption on all OSes.
   And even when such file systems are available, it is not always possible/convenient to install such a file system just
   to compress/protect your database. The second reason is that performing compression at the database level can be more efficient,
   because here we can use knowledge about the size of a database page and perform compression more effectively.
  </para>

 </sect1>

 <sect1 id="cfs-implementation">
  <title>How compression/encryption are integrated in Postgres</title>

  <para>
   To improve the efficiency of disk IO, Postgres works with files through the buffer manager, which pins in memory the
   most frequently used pages. Each page has a fixed size (8kb by default). But if we compress a page, then
   its size will depend on its content. So an updated page can require more (or less) space than the original page,
   and we cannot always perform an in-place update of the page. Instead, we have to locate new space for the page and somehow release the
   old space. There are two main approaches to solving this problem:
  </para>

  <variablelist>
   <varlistentry>
    <term>Memory allocator</term>
    <listitem>
     <para>
      We can implement our own allocator of file space. Usually, to reduce fragmentation, a fixed size block allocator is used.
      It means that we allocate space using some fixed quantum. For example, if the compressed page size is 932 bytes, then we will
      allocate a 1024-byte block for it in the file.
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term>Garbage collector</term>
    <listitem>
     <para>
      We can always allocate space for the pages sequentially at the end of the file and periodically do
      compactification (defragmentation) of the file, moving all used pages to the beginning of the file.
      Such a garbage collection process can be performed in the background.
      As was explained in the previous section, sequential writing of the flushed pages can significantly
      increase IO speed and thus improve performance. This is why we have used this approach in CFS.
     </para>
    </listitem>
   </varlistentry>
  </variablelist>

  <para>
   As the page location is not fixed and a page can be moved, we can no longer access a page directly by its address and need
   an extra level of indirection to map the logical address of the page to its physical location on the disk.
   This is done using memory-mapped files. In most cases this map will be kept in memory (the size of the map is 1000 times smaller than the size
   of the file) and address translation adds almost no overhead to page access time.
   But we need to maintain these extra files: flush them during checkpoint, remove them when the table is dropped, include them in backups, and
   so on...
  </para>

  <para>
   Postgres stores a relation in a set of files, where the size of each file does not exceed 2Gb. A separate page map is constructed for each file.
   Garbage collection in CFS is done by several background workers. The number of these workers and the pauses in their work can be
   configured by the database administrator. These workers split the work based on the inode hash, so they do not conflict with each other.
   Each file is processed separately. The file is blocked for access at the time of garbage collection, but the complete relation is not
   blocked. To ensure data consistency, GC creates copies of the original data and map files. Once they are flushed to the disk, the
   new version of the data file is atomically renamed to the original file name. Then the new page map data is copied to the memory-mapped file
   and the backup file for the page map is removed. In case of recovery after a crash, we first check whether there is a backup of the data file.
   If such a file exists, then the original file has not yet been updated and we can safely remove the backup files. If such a file doesn't exist,
   then we check for the presence of a map file backup. If it is present, then defragmentation of this file was not completed
   because of the crash, and we complete this operation by copying the map from the backup file.
  </para>

  <para>
   CFS can be built with several compression libraries: Postgres lz, zlib, lz4, snappy, lzfse...
   But this is a build time choice: it is currently not possible to dynamically choose the compression algorithm.
   CFS stores information about the used compression algorithm in the tablespace and produces an error if Postgres is built with a different
   library.
  </para>

  <para>
   Encryption is performed using the RC4 algorithm. The cipher key is obtained from the <varname>PG_CIPHER_KEY</varname> environment variable.
   Please notice that catalog relations are not encrypted, and neither are non-main forks of a relation.
  </para>

 </sect1>

 <sect1 id="cfs-usage">
  <title>Using compression/encryption</title>

  <para>
   Compression can be enabled for particular tablespaces. System relations are never compressed.
   It is currently not possible to alter the tablespace compression option, i.e. it is not possible to compress an existing tablespace
   or vice versa - decompress a compressed tablespace.
  </para>

  <para>
   So to use compression/encryption you need to create a tablespace with the <varname>compression=true</varname> option.
   You can make this tablespace the default tablespace - in this case all tables in this database will be implicitly created in it:
  </para>

<programlisting>
postgres=# create tablespace zfs location '/var/data/cfs' with (compression=true);
postgres=# set default_tablespace=zfs;
</programlisting>
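
  <para>
   After that, tables are created in the compressed tablespace as usual; no special syntax is needed.
   The table and data below are purely illustrative:
  </para>

<programlisting>
postgres=# create table account(id bigint, branch bigint, amount bigint);
postgres=# insert into account values (generate_series(1,100000), 0, 0);
</programlisting>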

  <para>
   Encryption can currently only be combined with compression: it is not possible to use encryption without compression.
   To enable encryption you should set the <varname>cfs_encryption</varname> parameter to true and provide the cipher key by setting the
   <varname>PG_CIPHER_KEY</varname> environment variable.
  </para>
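
  <para>
   A possible sequence of steps is sketched below, assuming <varname>cfs_encryption</varname> can be changed
   like an ordinary configuration parameter via <command>ALTER SYSTEM</command>. The server must then be
   restarted with the key (a placeholder value here) present in its environment, e.g.
   <literal>PG_CIPHER_KEY=my_secret_key pg_ctl -D $PGDATA restart</literal>:
  </para>

<programlisting>
postgres=# alter system set cfs_encryption = on;
</programlisting>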

  <para>
   CFS provides the following configuration parameters:
  </para>

  <variablelist>

<varlistentry id="cfs-encryption" xreflabel="cfs_encryption">
197+
<term><varname>cfs_encryption</varname> (<type>boolean</type>)
198+
<indexterm>
199+
<primary><varname>cfs_encryption</> configuration parameter</primary>
200+
</indexterm>
201+
</term>
202+
<listitem>
203+
<para>
204+
Enables encryption of compressed pages. Switched off by default.
205+
</para>
206+
</listitem>
207+
</varlistentry>
208+
209+
<varlistentry id="cfs-gc-workers" xreflabel="cfs_gc_workers">
210+
<term><varname>cfs_gc_workers</varname> (<type>integer</type>)
211+
<indexterm>
212+
<primary><varname>cfs_gc_workers</> configuration parameter</primary>
213+
</indexterm>
214+
</term>
215+
<listitem>
216+
<para>
217+
Number of CFS background garbage collection workers (default: 1).
218+
</para>
219+
</listitem>
220+
</varlistentry>
221+
222+
<varlistentry id="cfs-gc-threshold" xreflabel="cfs_gc_threshold">
223+
<term><varname>cfs_gc_threshold</varname> (<type>integer</type>)
224+
<indexterm>
225+
<primary><varname>cfs_gc_threshold</> configuration parameter</primary>
226+
</indexterm>
227+
</term>
228+
<listitem>
229+
<para>
230+
Percent of garbage in file after which file should be compactified (default: 50%).
231+
</para>
232+
</listitem>
233+
</varlistentry>
234+
235+
<varlistentry id="cfs-gc-period" xreflabel="cfs_gc_period">
236+
<term><varname>cfs_gc_period</varname> (<type>integer</type>)
237+
<indexterm>
238+
<primary><varname>cfs_gc_period</> configuration parameter</primary>
239+
</indexterm>
240+
</term>
241+
<listitem>
242+
<para>
243+
Interval in milliseconds between CFS garbage collection iterations (default: 5 seconds)
244+
</para>
245+
</listitem>
246+
</varlistentry>
247+
248+
<varlistentry id="cfs-gc-delay" xreflabel="cfs_gc_delay">
249+
<term><varname>cfs_gc_delay</varname> (<type>integer</type>)
250+
<indexterm>
251+
<primary><varname>cfs_gc_delay</> configuration parameter</primary>
252+
</indexterm>
253+
</term>
254+
<listitem>
255+
<para>
256+
Delay in milliseconds between files defragmentation (default: 0)
257+
</para>
258+
</listitem>
259+
</varlistentry>
260+
261+
<varlistentry id="cfs-level" xreflabel="cfs_level">
262+
<term><varname>cfs_level</varname> (<type>integer</type>)
263+
<indexterm>
264+
<primary><varname>cfs_level</> configuration parameter</primary>
265+
</indexterm>
266+
</term>
267+
<listitem>
268+
<para>
269+
CFS compression level (default: 1). 0 is no compression, 1 is fastest compression.
270+
Maximal compression level depends on particular compression algorithm: 9 for zlib, 19 for zstd...
271+
</para>
272+
</listitem>
273+
</varlistentry>
274+
275+
</variablelist>
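
  <para>
   As an illustration, assuming these parameters can be changed like ordinary configuration parameters,
   the GC settings could be adjusted with <command>ALTER SYSTEM</command> (the values below are arbitrary,
   and some of these parameters may require a server restart rather than a reload):
  </para>

<programlisting>
postgres=# alter system set cfs_gc_workers = 2;
postgres=# alter system set cfs_gc_threshold = 40;
postgres=# select pg_reload_conf();
</programlisting>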

  <para>
   By default CFS is configured with one background worker performing garbage collection.
   The garbage collector traverses the tablespace directory, locating map files in it and checking the percentage of garbage in each file.
   When it exceeds the <varname>cfs_gc_threshold</> threshold, the file is defragmented.
   The file is locked for the period of defragmentation, preventing any access to this part of the relation.
   When defragmentation is completed, garbage collection waits <varname>cfs_gc_delay</varname> milliseconds and continues the directory traversal.
   After the end of the traversal, GC waits <varname>cfs_gc_period</varname> milliseconds and starts a new GC iteration.
   If there is more than one GC worker, then they split the work based on the hash of the file inode.
  </para>

  <para>
   It is also possible to initiate GC manually using the <varname>cfs_start_gc(n_workers)</varname> function.
   This function returns the number of workers which were actually started. Please notice that if the <varname>cfs_gc_workers</varname>
   parameter is non-zero, then GC is performed in the background, and the <varname>cfs_start_gc</varname> function does nothing and returns 0.
  </para>
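
  <para>
   For example, to run a manual GC pass with two workers (assuming <varname>cfs_gc_workers</varname> is set
   to 0, so that background GC does not preempt the manual run):
  </para>

<programlisting>
postgres=# select cfs_start_gc(2);
</programlisting>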

  <para>
   It is possible to estimate the effect of table compression using the <varname>cfs_estimate(relation)</varname> function.
   This function takes the first ten blocks of the relation, tries to compress them and returns the average compression ratio.
   So if the returned value is 7.8, then the compressed table occupies about eight times less space than the original table.
  </para>
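
  <para>
   For example, using the illustrative <literal>account</literal> table created earlier in this section:
  </para>

<programlisting>
postgres=# select cfs_estimate('account');
</programlisting>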

 </sect1>
</chapter>

0 commit comments