Commit bc0f080

Fix misuse of Lossy Counting (LC) algorithm in compute_tsvector_stats().
We must filter out hashtable entries with frequencies less than those specified by the algorithm, else we risk emitting junk entries whose actual frequency is much less than other lexemes that did not get tabulated. This is bad enough by itself, but even worse is that tsquerysel() believes that the minimum frequency seen in pg_statistic is a hard upper bound for lexemes not included, and was thus underestimating the frequency of non-MCEs.

Also, set the threshold frequency to something with a little bit of theory behind it, to wit assume that the input distribution is approximately Zipfian. This might need adjustment in future, but some preliminary experiments suggest that it's not too unreasonable.

Back-patch to 8.4, where this code was introduced.

Jan Urbanski, with some editorialization by Tom
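For concreteness, here is a minimal standalone sketch of the Zipf-based arithmetic behind the new threshold, assuming the default statistics target of 100 (so K = 1000 MCELEM entries); the variable names are illustrative and not taken from the patch.

/*
 * Sketch only (not part of the patch): the arithmetic behind the new
 * threshold frequency and bucket width.
 */
#include <stdio.h>

int
main(void)
{
	int			K = 100 * 10;			/* target number of MCELEM entries */
	double		s = 0.07 / (K + 10);	/* est. frequency of the K'th word */
	double		epsilon = s / 10.0;		/* allowed error in frequency */
	int			bucket_width = (K + 10) * 1000 / 7;	/* = 1/epsilon */

	printf("s = %g, epsilon = %g, bucket_width = %d\n",
		   s, epsilon, bucket_width);
	/* prints s ~= 6.93e-05, epsilon ~= 6.93e-06, bucket_width = 144285 */
	return 0;
}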
1 parent b12b7a9 commit bc0f080

File tree

1 file changed: +97, -45 lines changed


src/backend/tsearch/ts_typanalyze.c

Lines changed: 97 additions & 45 deletions
@@ -7,7 +7,7 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/tsearch/ts_typanalyze.c,v 1.8 2010/01/02 16:57:53 momjian Exp $
+ *	  $PostgreSQL: pgsql/src/backend/tsearch/ts_typanalyze.c,v 1.9 2010/05/30 21:59:02 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -92,21 +92,49 @@ ts_typanalyze(PG_FUNCTION_ARGS)
  *	http://www.vldb.org/conf/2002/S10P03.pdf
  *
  *	The Lossy Counting (aka LC) algorithm goes like this:
- *	Let D be a set of triples (e, f, d), where e is an element value, f is
- *	that element's frequency (occurrence count) and d is the maximum error in
- *	f.  We start with D empty and process the elements in batches of size
- *	w. (The batch size is also known as "bucket size".) Let the current batch
- *	number be b_current, starting with 1. For each element e we either
- *	increment its f count, if it's already in D, or insert a new triple into D
- *	with values (e, 1, b_current - 1). After processing each batch we prune D,
- *	by removing from it all elements with f + d <= b_current. Finally, we
- *	gather elements with largest f. The LC paper proves error bounds on f
- *	dependent on the batch size w, and shows that the required table size
- *	is no more than a few times w.
+ *	Let s be the threshold frequency for an item (the minimum frequency we
+ *	are interested in) and epsilon the error margin for the frequency. Let D
+ *	be a set of triples (e, f, delta), where e is an element value, f is that
+ *	element's frequency (actually, its current occurrence count) and delta is
+ *	the maximum error in f. We start with D empty and process the elements in
+ *	batches of size w. (The batch size is also known as "bucket size" and is
+ *	equal to 1/epsilon.) Let the current batch number be b_current, starting
+ *	with 1. For each element e we either increment its f count, if it's
+ *	already in D, or insert a new triple into D with values (e, 1, b_current
+ *	- 1). After processing each batch we prune D, by removing from it all
+ *	elements with f + delta <= b_current. After the algorithm finishes we
+ *	suppress all elements from D that do not satisfy f >= (s - epsilon) * N,
+ *	where N is the total number of elements in the input. We emit the
+ *	remaining elements with estimated frequency f/N. The LC paper proves
+ *	that this algorithm finds all elements with true frequency at least s,
+ *	and that no frequency is overestimated or is underestimated by more than
+ *	epsilon. Furthermore, given reasonable assumptions about the input
+ *	distribution, the required table size is no more than about 7 times w.
  *
- *	We use a hashtable for the D structure and a bucket width of
- *	statistics_target * 10, where 10 is an arbitrarily chosen constant,
- *	meant to approximate the number of lexemes in a single tsvector.
+ *	We set s to be the estimated frequency of the K'th word in a natural
+ *	language's frequency table, where K is the target number of entries in
+ *	the MCELEM array plus an arbitrary constant, meant to reflect the fact
+ *	that the most common words in any language would usually be stopwords
+ *	so we will not actually see them in the input.  We assume that the
+ *	distribution of word frequencies (including the stopwords) follows Zipf's
+ *	law with an exponent of 1.
+ *
+ *	Assuming Zipfian distribution, the frequency of the K'th word is equal
+ *	to 1/(K * H(W)) where H(n) is 1/2 + 1/3 + ... + 1/n and W is the number of
+ *	words in the language.  Putting W as one million, we get roughly 0.07/K.
+ *	Assuming top 10 words are stopwords gives s = 0.07/(K + 10).  We set
+ *	epsilon = s/10, which gives bucket width w = (K + 10)/0.007 and
+ *	maximum expected hashtable size of about 1000 * (K + 10).
+ *
+ *	Note: in the above discussion, s, epsilon, and f/N are in terms of a
+ *	lexeme's frequency as a fraction of all lexemes seen in the input.
+ *	However, what we actually want to store in the finished pg_statistic
+ *	entry is each lexeme's frequency as a fraction of all rows that it occurs
+ *	in.  Assuming that the input tsvectors are correctly constructed, no
+ *	lexeme occurs more than once per tsvector, so the final count f is a
+ *	correct estimate of the number of input tsvectors it occurs in, and we
+ *	need only change the divisor from N to nonnull_cnt to get the number we
+ *	want.
  */
 static void
 compute_tsvector_stats(VacAttrStats *stats,
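To make the description in the new header comment concrete, here is a small self-contained sketch of the same Lossy Counting loop over toy integer data. It uses a flat array rather than PostgreSQL's dynahash, and every name, constant, and input value below is made up for illustration; it is not the patched code.

#include <stdio.h>
#include <stdlib.h>

#define MAX_TRACKED 1024

typedef struct
{
	int			element;		/* e */
	int			frequency;		/* f: current occurrence count */
	int			delta;			/* maximum error in f */
	int			in_use;
} Triple;

static Triple D[MAX_TRACKED];

/* Find e in D, or insert (e, 0, b_current - 1); the caller bumps f to 1. */
static Triple *
lookup_or_insert(int e, int b_current)
{
	int			free_slot = -1;

	for (int i = 0; i < MAX_TRACKED; i++)
	{
		if (D[i].in_use && D[i].element == e)
			return &D[i];
		if (!D[i].in_use && free_slot < 0)
			free_slot = i;
	}
	if (free_slot < 0)
	{
		fprintf(stderr, "tracking table full\n");
		exit(1);
	}
	D[free_slot].element = e;
	D[free_slot].frequency = 0;
	D[free_slot].delta = b_current - 1;
	D[free_slot].in_use = 1;
	return &D[free_slot];
}

int
main(void)
{
	int			input[] = {1, 2, 1, 3, 1, 2, 4, 1, 2, 1, 5, 1};
	int			N = (int) (sizeof(input) / sizeof(input[0]));
	int			w = 4;			/* bucket width = 1/epsilon */
	double		epsilon = 1.0 / w;
	double		s = 0.4;		/* threshold frequency of interest */
	int			b_current = 1;

	for (int n = 0; n < N; n++)
	{
		lookup_or_insert(input[n], b_current)->frequency++;

		/* prune D after each full bucket: drop entries with f + delta <= b_current */
		if ((n + 1) % w == 0)
		{
			for (int i = 0; i < MAX_TRACKED; i++)
				if (D[i].in_use && D[i].frequency + D[i].delta <= b_current)
					D[i].in_use = 0;
			b_current++;
		}
	}

	/* final cutoff: emit only elements with f >= (s - epsilon) * N */
	for (int i = 0; i < MAX_TRACKED; i++)
		if (D[i].in_use && D[i].frequency >= (s - epsilon) * N)
			printf("element %d: estimated frequency %.2f\n",
				   D[i].element, (double) D[i].frequency / N);
	return 0;
}

Running this reports only element 1 (true frequency 0.50, which is at least s), while element 2 at true frequency 0.25 falls below the threshold and is correctly suppressed.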
@@ -133,19 +161,23 @@ compute_tsvector_stats(VacAttrStats *stats,
 	LexemeHashKey hash_key;
 	TrackItem  *item;
 
-	/* We want statistics_target * 10 lexemes in the MCELEM array */
+	/*
+	 * We want statistics_target * 10 lexemes in the MCELEM array.  This
+	 * multiplier is pretty arbitrary, but is meant to reflect the fact that
+	 * the number of individual lexeme values tracked in pg_statistic ought
+	 * to be more than the number of values for a simple scalar column.
+	 */
 	num_mcelem = stats->attr->attstattarget * 10;
 
 	/*
-	 * We set bucket width equal to the target number of result lexemes. This
-	 * is probably about right but perhaps might need to be scaled up or down
-	 * a bit?
+	 * We set bucket width equal to (num_mcelem + 10) / 0.007 as per the
+	 * comment above.
 	 */
-	bucket_width = num_mcelem;
+	bucket_width = (num_mcelem + 10) * 1000 / 7;
 
 	/*
 	 * Create the hashtable. It will be in local memory, so we don't need to
-	 * worry about initial size too much. Also we don't need to pay any
+	 * worry about overflowing the initial size. Also we don't need to pay any
 	 * attention to locking and memory management.
 	 */
 	MemSet(&hash_ctl, 0, sizeof(hash_ctl));
@@ -155,13 +187,13 @@ compute_tsvector_stats(VacAttrStats *stats,
 	hash_ctl.match = lexeme_match;
 	hash_ctl.hcxt = CurrentMemoryContext;
 	lexemes_tab = hash_create("Analyzed lexemes table",
-							  bucket_width * 4,
+							  bucket_width * 7,
 							  &hash_ctl,
 					HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);
 
 	/* Initialize counters. */
 	b_current = 1;
-	lexeme_no = 1;
+	lexeme_no = 0;
 
 	/* Loop over the tsvectors. */
 	for (vector_no = 0; vector_no < samplerows; vector_no++)
@@ -232,6 +264,9 @@ compute_tsvector_stats(VacAttrStats *stats,
 				item->delta = b_current - 1;
 			}
 
+			/* lexeme_no is the number of elements processed (ie N) */
+			lexeme_no++;
+
 			/* We prune the D structure after processing each bucket */
 			if (lexeme_no % bucket_width == 0)
 			{
@@ -240,7 +275,6 @@ compute_tsvector_stats(VacAttrStats *stats,
 			}
 
 			/* Advance to the next WordEntry in the tsvector */
-			lexeme_no++;
 			curentryptr++;
 		}
 	}
@@ -252,6 +286,7 @@ compute_tsvector_stats(VacAttrStats *stats,
 		int			i;
 		TrackItem **sort_table;
 		int			track_len;
+		int			cutoff_freq;
 		int			minfreq,
 					maxfreq;
 
@@ -264,34 +299,51 @@ compute_tsvector_stats(VacAttrStats *stats,
 		stats->stadistinct = -1.0;
 
 		/*
-		 * Determine the top-N lexemes by simply copying pointers from the
-		 * hashtable into an array and applying qsort()
+		 * Construct an array of the interesting hashtable items, that is,
+		 * those meeting the cutoff frequency (s - epsilon)*N.  Also identify
+		 * the minimum and maximum frequencies among these items.
+		 *
+		 * Since epsilon = s/10 and bucket_width = 1/epsilon, the cutoff
+		 * frequency is 9*N / bucket_width.
 		 */
-		track_len = hash_get_num_entries(lexemes_tab);
+		cutoff_freq = 9 * lexeme_no / bucket_width;
 
-		sort_table = (TrackItem **) palloc(sizeof(TrackItem *) * track_len);
+		i = hash_get_num_entries(lexemes_tab);	/* surely enough space */
+		sort_table = (TrackItem **) palloc(sizeof(TrackItem *) * i);
 
 		hash_seq_init(&scan_status, lexemes_tab);
-		i = 0;
+		track_len = 0;
+		minfreq = lexeme_no;
+		maxfreq = 0;
 		while ((item = (TrackItem *) hash_seq_search(&scan_status)) != NULL)
 		{
-			sort_table[i++] = item;
+			if (item->frequency > cutoff_freq)
+			{
+				sort_table[track_len++] = item;
+				minfreq = Min(minfreq, item->frequency);
+				maxfreq = Max(maxfreq, item->frequency);
+			}
 		}
-		Assert(i == track_len);
+		Assert(track_len <= i);
 
-		qsort(sort_table, track_len, sizeof(TrackItem *),
-			  trackitem_compare_frequencies_desc);
+		/* emit some statistics for debug purposes */
+		elog(DEBUG3, "tsvector_stats: target # mces = %d, bucket width = %d, "
+			 "# lexemes = %d, hashtable size = %d, usable entries = %d",
+			 num_mcelem, bucket_width, lexeme_no, i, track_len);
 
-		/* Suppress any single-occurrence items */
-		while (track_len > 0)
+		/*
+		 * If we obtained more lexemes than we really want, get rid of
+		 * those with least frequencies.  The easiest way is to qsort the
+		 * array into descending frequency order and truncate the array.
+		 */
+		if (num_mcelem < track_len)
 		{
-			if (sort_table[track_len - 1]->frequency > 1)
-				break;
-			track_len--;
+			qsort(sort_table, track_len, sizeof(TrackItem *),
+				  trackitem_compare_frequencies_desc);
+			/* reset minfreq to the smallest frequency we're keeping */
+			minfreq = sort_table[num_mcelem - 1]->frequency;
 		}
-
-		/* Determine the number of most common lexemes to be stored */
-		if (num_mcelem > track_len)
+		else
 			num_mcelem = track_len;
 
 		/* Generate MCELEM slot entry */
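As a quick sanity check of the cutoff arithmetic in the hunk above, the snippet below plugs in made-up numbers (the default bucket width from the earlier sketch and one million lexemes seen in the sample); it is illustrative only, not part of the patch.

#include <stdio.h>

int
main(void)
{
	int			bucket_width = 144285;	/* (1000 + 10) * 1000 / 7 */
	int			lexeme_no = 1000000;	/* N: lexemes seen in the sample (example) */
	int			cutoff_freq = 9 * lexeme_no / bucket_width;	/* = (s - epsilon) * N */

	/* an entry must satisfy frequency > cutoff_freq to be emitted */
	printf("cutoff_freq = %d\n", cutoff_freq);	/* prints 62 */
	return 0;
}

With these numbers a lexeme must have been counted at least 63 times among the million sampled lexemes to make it into the MCELEM candidate array.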
@@ -301,10 +353,6 @@ compute_tsvector_stats(VacAttrStats *stats,
 			Datum	   *mcelem_values;
 			float4	   *mcelem_freqs;
 
-			/* Grab the minimal and maximal frequencies that will get stored */
-			minfreq = sort_table[num_mcelem - 1]->frequency;
-			maxfreq = sort_table[0]->frequency;
-
 			/*
 			 * We want to store statistics sorted on the lexeme value using
 			 * first length, then byte-for-byte comparison. The reason for
@@ -334,6 +382,10 @@ compute_tsvector_stats(VacAttrStats *stats,
 			mcelem_values = (Datum *) palloc(num_mcelem * sizeof(Datum));
 			mcelem_freqs = (float4 *) palloc((num_mcelem + 2) * sizeof(float4));
 
+			/*
+			 * See comments above about use of nonnull_cnt as the divisor
+			 * for the final frequency estimates.
+			 */
 			for (i = 0; i < num_mcelem; i++)
 			{
 				TrackItem  *item = sort_table[i];
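Finally, a tiny illustration (with invented numbers) of the divisor choice the new header comment describes: the stored MCELEM frequency is a fraction of sampled rows, not of all lexemes seen.

#include <stdio.h>

int
main(void)
{
	int			item_frequency = 40;	/* tsvectors the lexeme occurred in (example) */
	int			nonnull_cnt = 100;		/* sampled rows with a non-null tsvector (example) */
	float		mcelem_freq = (float) item_frequency / nonnull_cnt;

	printf("stored MCELEM frequency = %.2f\n", mcelem_freq);	/* 0.40 */
	return 0;
}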

Comments (0)
