Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Commitcfd4804

Browse files
committed
Improve documentation for contrib/bloom.
Michael Paquier, David Johnston, Tom LaneDiscussion: <CAB7nPqQB8dcFmY1uodmiJOSZdhBFOx-us-uW6rfYrzhpEiBR2g@mail.gmail.com>
1 parente7880e5 commitcfd4804

File tree

1 file changed

+144
-99
lines changed

1 file changed

+144
-99
lines changed

‎doc/src/sgml/bloom.sgml

Lines changed: 144 additions & 99 deletions
Original file line numberDiff line numberDiff line change
@@ -8,51 +8,51 @@
88
</indexterm>
99

1010
<para>
11-
<literal>bloom</> is a module that implements an index access method. It comes
12-
as an example of custom access methods and generic WAL record usage. But it
13-
is also useful in itself.
11+
<literal>bloom</> provides an index access method based on
12+
<ulink url="http://en.wikipedia.org/wiki/Bloom_filter">Bloom filters</ulink>.
1413
</para>
1514

16-
<sect2>
17-
<title>Introduction</title>
15+
<para>
16+
A Bloom filter is a space-efficient data structure that is used to test
17+
whether an element is a member of a set. In the case of an index access
18+
method, it allows fast exclusion of non-matching tuples via signatures
19+
whose size is determined at index creation.
20+
</para>
1821

19-
<para>
20-
The implementation of a
21-
<ulink url="http://en.wikipedia.org/wiki/Bloom_filter">Bloom filter</ulink>
22-
allows fast exclusion of non-candidate tuples via signatures.
23-
Since a signature is a lossy representation of all indexed attributes,
24-
search results must be rechecked using heap information.
25-
The user can specify signature length in bits (default 80, maximum 4096)
26-
and the number of bits generated for each index column (default 2,
27-
maximum 4095).
28-
</para>
22+
<para>
23+
A signature is a lossy representation of the indexed attribute(s), and as
24+
such is prone to reporting false positives; that is, it may be reported
25+
that an element is in the set, when it is not. So index search results
26+
must always be rechecked using the actual attribute values from the heap
27+
entry. Larger signatures reduce the odds of a false positive and thus
28+
reduce the number of useless heap visits, but of course also make the index
29+
larger and hence slower to scan.
30+
</para>
2931

30-
<para>
31-
This index is useful if a table has many attributes and queries include
32-
arbitrary combinations of them. A traditional <literal>btree</> index is
33-
faster than a bloom index, but it can require many indexes to support all
34-
possible queries where one needs only a single bloom index. A Bloom index
35-
supports only equality comparison. Since it's a signature file, and not a
36-
tree, it always must be read fully, but sequentially, so that index search
37-
performance is constant and doesn't depend on a query.
38-
</para>
39-
</sect2>
32+
<para>
33+
This type of index is most useful when a table has many attributes and
34+
queries test arbitrary combinations of them. A traditional btree index is
35+
faster than a bloom index, but it can require many btree indexes to support
36+
all possible queries where one needs only a single bloom index. Note
37+
however that bloom indexes only support equality queries, whereas btree
38+
indexes can also perform inequality and range searches.
39+
</para>
4040

4141
<sect2>
4242
<title>Parameters</title>
4343

4444
<para>
45-
<literal>bloom</> indexes accept the following parameters in the
46-
<literal>WITH</>
47-
clause.
45+
A <literal>bloom</> index accepts the following parameters in its
46+
<literal>WITH</> clause:
4847
</para>
4948

5049
<variablelist>
5150
<varlistentry>
5251
<term><literal>length</></term>
5352
<listitem>
5453
<para>
55-
Length of signature in bits
54+
Length of each signature (index entry) in bits. The default
55+
is <literal>80</> bits and maximum is <literal>4096</>.
5656
</para>
5757
</listitem>
5858
</varlistentry>
@@ -62,7 +62,10 @@
6262
<term><literal>col1 &mdash; col32</></term>
6363
<listitem>
6464
<para>
65-
Number of bits generated for each index column
65+
Number of bits generated for each index column. Each parameter's name
66+
refers to the number of the index column that it controls. The default
67+
is <literal>2</> bits and maximum is <literal>4095</>. Parameters for
68+
index columns not actually used are ignored.
6669
</para>
6770
</listitem>
6871
</varlistentry>
@@ -73,7 +76,7 @@
7376
<title>Examples</title>
7477

7578
<para>
76-
An example ofanindex definition is given below.
79+
This isanexample of creating a bloom index:
7780
</para>
7881

7982
<programlisting>
@@ -82,92 +85,135 @@ CREATE INDEX bloomidx ON tbloom USING bloom (i1,i2,i3)
8285
</programlisting>
8386

8487
<para>
85-
Here, we created a bloom index with a signature length of 80 bits,
86-
and attributes i1 and i2 mapped to 2 bits, and attribute i3 mapped to 4 bits.
88+
The index is created with a signature length of 80 bits, with attributes
89+
i1 and i2 mapped to 2 bits, and attribute i3 mapped to 4 bits. We could
90+
have omitted the <literal>length</>, <literal>col1</>,
91+
and <literal>col2</> specifications since those have the default values.
8792
</para>
8893

8994
<para>
90-
Here is a fuller example of index definition and usage:
95+
Here is a more complete example of bloom index definition and usage, as
96+
well as a comparison with equivalent btree indexes. The bloom index is
97+
considerably smaller than the btree index, and can perform better.
9198
</para>
9299

93100
<programlisting>
94-
CREATE TABLE tbloom AS
95-
SELECT
96-
random()::int as i1,
97-
random()::int as i2,
98-
random()::int as i3,
99-
random()::int as i4,
100-
random()::int as i5,
101-
random()::int as i6,
102-
random()::int as i7,
103-
random()::int as i8,
104-
random()::int as i9,
105-
random()::int as i10,
106-
random()::int as i11,
107-
random()::int as i12,
108-
random()::int as i13
109-
FROM
110-
generate_series(1,1000);
111-
CREATE INDEX bloomidx ON tbloom USING
112-
bloom (i1, i2, i3, i4, i5, i6, i7, i8, i9, i10, i11, i12);
113-
SELECT pg_relation_size('bloomidx');
114-
CREATE index btree_idx ON tbloom(i1,i2,i3,i4,i5,i6,i7,i8,i9,i10,i11,i12);
115-
SELECT pg_relation_size('btree_idx');
101+
=# CREATE TABLE tbloom AS
102+
SELECT
103+
(random() * 1000000)::int as i1,
104+
(random() * 1000000)::int as i2,
105+
(random() * 1000000)::int as i3,
106+
(random() * 1000000)::int as i4,
107+
(random() * 1000000)::int as i5,
108+
(random() * 1000000)::int as i6
109+
FROM
110+
generate_series(1,10000000);
111+
SELECT 10000000
112+
=# CREATE INDEX bloomidx ON tbloom USING bloom (i1, i2, i3, i4, i5, i6);
113+
CREATE INDEX
114+
=# SELECT pg_size_pretty(pg_relation_size('bloomidx'));
115+
pg_size_pretty
116+
----------------
117+
153 MB
118+
(1 row)
119+
=# CREATE index btreeidx ON tbloom (i1, i2, i3, i4, i5, i6);
120+
CREATE INDEX
121+
=# SELECT pg_size_pretty(pg_relation_size('btreeidx'));
122+
pg_size_pretty
123+
----------------
124+
387 MB
125+
(1 row)
116126
</programlisting>
117127

128+
<para>
129+
A sequential scan over this large table takes a long time:
118130
<programlisting>
119-
=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 =20 ANDi10 =15;
120-
QUERY PLAN
121-
-----------------------------------------------------------------------------------------------------------------
122-
Bitmap HeapScan on tbloom (cost=1.50..5.52 rows=1 width=52) (actual time=0.057..0.057 rows=0 loops=1)
123-
Recheck Cond: ((i2 =20) AND (i10 =15))
124-
-> Bitmap Index Scan on bloomidx (cost=0.00..1.50 rows=1 width=0) (actual time=0.041..0.041 rows=9 loops=1)
125-
Index Cond: ((i2 = 20) AND (i10 = 15))
126-
Total runtime: 0.081 ms
131+
=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 =898732 ANDi5 =123451;
132+
QUERY PLAN
133+
------------------------------------------------------------------------------------------------------------
134+
SeqScan on tbloom (cost=0.00..213694.08 rows=1 width=24) (actual time=1445.438..1445.438 rows=0 loops=1)
135+
Filter: ((i2 =898732) AND (i5 =123451))
136+
Rows Removed by Filter: 10000000
137+
Planning time: 0.177 ms
138+
Execution time: 1445.473 ms
127139
(5 rows)
128140
</programlisting>
141+
</para>
129142

130143
<para>
131-
Seqscan is slow.
144+
So the planner will usually select an index scan if possible.
145+
With a btree index, we get results like this:
146+
<programlisting>
147+
=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
148+
QUERY PLAN
149+
--------------------------------------------------------------------------------------------------------------------------------
150+
Index Only Scan using btreeidx on tbloom (cost=0.56..298311.96 rows=1 width=24) (actual time=445.709..445.709 rows=0 loops=1)
151+
Index Cond: ((i2 = 898732) AND (i5 = 123451))
152+
Heap Fetches: 0
153+
Planning time: 0.193 ms
154+
Execution time: 445.770 ms
155+
(5 rows)
156+
</programlisting>
132157
</para>
133158

159+
<para>
160+
Bloom is better than btree in handling this type of search:
134161
<programlisting>
135-
=# SET enable_bitmapscan = off;
136-
=# SET enable_indexscan = off;
137-
=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 20 AND i10 = 15;
138-
QUERY PLAN
139-
--------------------------------------------------------------------------------------------------
140-
Seq Scan on tbloom (cost=0.00..25.00 rows=1 width=52) (actual time=0.162..0.162 rows=0 loops=1)
141-
Filter: ((i2 = 20) AND (i10 = 15))
142-
Total runtime: 0.181 ms
143-
(3 rows)
162+
=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
163+
QUERY PLAN
164+
---------------------------------------------------------------------------------------------------------------------------
165+
Bitmap Heap Scan on tbloom (cost=178435.39..178439.41 rows=1 width=24) (actual time=76.698..76.698 rows=0 loops=1)
166+
Recheck Cond: ((i2 = 898732) AND (i5 = 123451))
167+
Rows Removed by Index Recheck: 2439
168+
Heap Blocks: exact=2408
169+
-&gt; Bitmap Index Scan on bloomidx (cost=0.00..178435.39 rows=1 width=0) (actual time=72.455..72.455 rows=2439 loops=1)
170+
Index Cond: ((i2 = 898732) AND (i5 = 123451))
171+
Planning time: 0.475 ms
172+
Execution time: 76.778 ms
173+
(8 rows)
144174
</programlisting>
175+
Note the relatively large number of false positives: 2439 rows were
176+
selected to be visited in the heap, but none actually matched the
177+
query. We could reduce that by specifying a larger signature length.
178+
In this example, creating the index with <literal>length=200</>
179+
reduced the number of false positives to 55; but it doubled the index size
180+
(to 306 MB) and ended up being slower for this query (125 ms overall).
181+
</para>
145182

146-
<para>
147-
A btree index will be not used for this query.
148-
</para>
149-
183+
<para>
184+
Now, the main problem with the btree search is that btree is inefficient
185+
when the search conditions do not constrain the leading index column(s).
186+
A better strategy for btree is to create a separate index on each column.
187+
Then the planner will choose something like this:
150188
<programlisting>
151-
=# DROP INDEX bloomidx;
152-
=# CREATE INDEX btree_idx ON tbloom(i1, i2, i3, i4, i5, i6, i7, i8, i9, i10, i11, i12);
153-
=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 20 AND i10 = 15;
154-
QUERY PLAN
155-
--------------------------------------------------------------------------------------------------
156-
Seq Scan on tbloom (cost=0.00..25.00 rows=1 width=52) (actual time=0.210..0.210 rows=0 loops=1)
157-
Filter: ((i2 = 20) AND (i10 = 15))
158-
Total runtime: 0.250 ms
159-
(3 rows)
189+
=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
190+
QUERY PLAN
191+
------------------------------------------------------------------------------------------------------------------------------
192+
Bitmap Heap Scan on tbloom (cost=9.29..13.30 rows=1 width=24) (actual time=0.148..0.148 rows=0 loops=1)
193+
Recheck Cond: ((i5 = 123451) AND (i2 = 898732))
194+
-&gt; BitmapAnd (cost=9.29..9.29 rows=1 width=0) (actual time=0.145..0.145 rows=0 loops=1)
195+
-&gt; Bitmap Index Scan on tbloom_i5_idx (cost=0.00..4.52 rows=11 width=0) (actual time=0.089..0.089 rows=10 loops=1)
196+
Index Cond: (i5 = 123451)
197+
-&gt; Bitmap Index Scan on tbloom_i2_idx (cost=0.00..4.52 rows=11 width=0) (actual time=0.048..0.048 rows=8 loops=1)
198+
Index Cond: (i2 = 898732)
199+
Planning time: 2.049 ms
200+
Execution time: 0.280 ms
201+
(9 rows)
160202
</programlisting>
203+
Although this query runs much faster than with either of the single
204+
indexes, we pay a large penalty in index size. Each of the single-column
205+
btree indexes occupies 214 MB, so the total space needed is over 1.2GB,
206+
more than 8 times the space used by the bloom index.
207+
</para>
161208
</sect2>
162209

163210
<sect2>
164-
<title>Opclass interface</title>
211+
<title>Operator Class Interface</title>
165212

166213
<para>
167-
The Bloom opclass interface is simple. It requires 1 supporting function:
168-
a hash function for the indexing datatype. It provides 1 search operator:
169-
the equality operator. The example below shows <literal>opclass</>
170-
definition for <literal>text</> datatype.
214+
An operator class for bloom indexes requires only a hash function for the
215+
indexed datatype and an equality operator for searching. This example
216+
shows the opclass definition for the <type>text</> data type:
171217
</para>
172218

173219
<programlisting>
@@ -179,22 +225,21 @@ DEFAULT FOR TYPE text USING bloom AS
179225
</sect2>
180226

181227
<sect2>
182-
<title>Limitation</title>
228+
<title>Limitations</title>
183229
<para>
184-
185230
<itemizedlist>
186231
<listitem>
187232
<para>
188-
For now, only opclassesfor <literal>int4</>, <literal>text</>come
189-
with the module. However, users may define more of them.
233+
Only operator classesfor <type>int4</> and <type>text</>are
234+
includedwith the module.
190235
</para>
191236
</listitem>
192237

193238
<listitem>
194239
<para>
195-
Only the <literal>=</literal> operator is supported for search at the
196-
moment. But it'spossible to add support for arrays withcontains and
197-
intersectionoperations in the future.
240+
Only the <literal>=</literal> operator is supported for search. But
241+
it ispossible to add support for arrays withunion and intersection
242+
operations in the future.
198243
</para>
199244
</listitem>
200245
</itemizedlist>

0 commit comments

Comments
 (0)

[8]ページ先頭

©2009-2025 Movatter.jp