NotificationsYou must be signed in to change notification settings
Fork6
Star31

Commit8f0ca62

committed

I finally got the time to put together some stuff for fti for

inclusion in pgsql. I have included a README which should be enoughto start using it, plus a BENCH file that describes some timingsI have done.Please have a look at it, and if you think everything is OK, Iwould like it seen included in the contrib-section of pgsql.I don't think I will do any more work in this, but maybe it inspiressomebody else to improve on it.Maarten Boekhold

1 parentaac1633 commit8f0ca62Copy full SHA for 8f0ca62

File tree

5 files changed

+947

-0

lines changed

contrib/fulltextindex

5 files changed

+947

-0

lines changed

`‎contrib/fulltextindex/BENCH`

Lines changed: 97 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,97 @@`
	`1`	`+The following data was generated by the 'timings.sh' script included`
	`2`	`+in this directory. It uses a very large table with music-related`
	`3`	`+articles as a source for the fti-table. The tables used are:`
	`4`	`+`
	`5`	`+product : contains product information : 540.429 rows`
	`6`	`+artist_fti : fti table for product : 4.501.321 rows`
	`7`	`+clustered : same as above, only clustered : 4.501.321 rows`
	`8`	`+`
	`9`	`+A sequential scan of the artist_fti table (and thus also the clustered table)`
	`10`	`+takes around 6:16 minutes....`
	`11`	`+`
	`12`	`+Unfortunately I cannot probide anybody else with this test-date, since I`
	`13`	`+am not allowed to redistribute the data (it's a database being sold by`
	`14`	`+a couple of wholesale companies). Anyways, it's megabytes, so you probably`
	`15`	`+wouldn't want it in this distribution anyways.`
	`16`	`+`
	`17`	`+I haven't tested this with less data.`
	`18`	`+`
	`19`	`+The test-machine is a Pentium 133, 64 MB, Linux 2.0.32 with the database`
	`20`	`+on a 'QUANTUM BIGFOOT_CY4320A, 4134MB w/67kB Cache, CHS=8960/15/63'. This`
	`21`	`+is a very slow disk.`
	`22`	`+`
	`23`	`+The postmaster was running with:`
	`24`	`+`
	`25`	`+ postmaster -i -b /usr/local/pgsql/bin/postgres -S 1024 -B 256 \`
	`26`	`+ -o -o /usr/local/pgsql/debug-output -F -d 1`
	`27`	`+`
	`28`	`+('trashing' means a 'select count(*) from artist_fti' to completely trash`
	`29`	`+any disk-caches and buffers....)`
	`30`	`+`
	`31`	`+TESTING ON UNCLUSTERED FTI`
	`32`	`+trashing`
	`33`	`+1: ^lapton and ^ric : 0.050u 0.000s 5m37.484s 0.01%`
	`34`	`+2: ^lapton and ^ric : 0.050u 0.030s 5m32.447s 0.02%`
	`35`	`+3: ^lapton and ^ric : 0.030u 0.020s 5m28.822s 0.01%`
	`36`	`+trashing`
	`37`	`+1: ^lling and ^tones : 0.020u 0.030s 0m54.313s 0.09%`
	`38`	`+2: ^lling and ^tones : 0.040u 0.030s 0m5.057s 1.38%`
	`39`	`+3: ^lling and ^tones : 0.010u 0.050s 0m2.072s 2.89%`
	`40`	`+trashing`
	`41`	`+1: ^aughan and ^evie : 0.020u 0.030s 0m26.241s 0.19%`
	`42`	`+2: ^aughan and ^evie : 0.050u 0.010s 0m1.316s 4.55%`
	`43`	`+3: ^aughan and ^evie : 0.030u 0.020s 0m1.029s 4.85%`
	`44`	`+trashing`
	`45`	`+1: ^lling : 0.040u 0.010s 0m55.104s 0.09%`
	`46`	`+2: ^lling : 0.030u 0.030s 0m4.716s 1.27%`
	`47`	`+3: ^lling : 0.040u 0.010s 0m2.157s 2.31%`
	`48`	`+trashing`
	`49`	`+1: ^stev and ^ray and ^vaugh : 0.040u 0.000s 1m5.630s 0.06%`
	`50`	`+2: ^stev and ^ray and ^vaugh : 0.050u 0.020s 1m3.561s 0.11%`
	`51`	`+3: ^stev and ^ray and ^vaugh : 0.050u 0.010s 1m5.923s 0.09%`
	`52`	`+trashing`
	`53`	`+1: ^lling (no join) : 0.050u 0.020s 0m24.139s 0.28%`
	`54`	`+2: ^lling (no join) : 0.040u 0.040s 0m1.087s 7.35%`
	`55`	`+3: ^lling (no join) : 0.020u 0.030s 0m0.772s 6.48%`
	`56`	`+trashing`
	`57`	`+1: ^vaughan (no join) : 0.040u 0.030s 0m9.075s 0.77%`
	`58`	`+2: ^vaughan (no join) : 0.030u 0.010s 0m0.609s 6.56%`
	`59`	`+3: ^vaughan (no join) : 0.040u 0.010s 0m0.503s 9.94%`
	`60`	`+trashing`
	`61`	`+1: ^rol (no join) : 0.020u 0.030s 0m49.898s 0.10%`
	`62`	`+2: ^rol (no join) : 0.030u 0.020s 0m3.136s 1.59%`
	`63`	`+3: ^rol (no join) : 0.030u 0.020s 0m1.231s 4.06%`
	`64`	`+`
	`65`	`+TESTING ON CLUSTERED FTI`
	`66`	`+trashing`
	`67`	`+1: ^lapton and ^ric : 0.020u 0.020s 2m17.120s 0.02%`
	`68`	`+2: ^lapton and ^ric : 0.030u 0.020s 2m11.767s 0.03%`
	`69`	`+3: ^lapton and ^ric : 0.040u 0.010s 2m8.128s 0.03%`
	`70`	`+trashing`
	`71`	`+1: ^lling and ^tones : 0.020u 0.030s 0m18.179s 0.27%`
	`72`	`+2: ^lling and ^tones : 0.030u 0.010s 0m1.897s 2.10%`
	`73`	`+3: ^lling and ^tones : 0.040u 0.010s 0m1.619s 3.08%`
	`74`	`+trashing`
	`75`	`+1: ^aughan and ^evie : 0.070u 0.010s 0m11.765s 0.67%`
	`76`	`+2: ^aughan and ^evie : 0.040u 0.010s 0m1.198s 4.17%`
	`77`	`+3: ^aughan and ^evie : 0.030u 0.020s 0m0.872s 5.73%`
	`78`	`+trashing`
	`79`	`+1: ^lling : 0.040u 0.000s 0m28.623s 0.13%`
	`80`	`+2: ^lling : 0.030u 0.010s 0m2.339s 1.70%`
	`81`	`+3: ^lling : 0.030u 0.010s 0m1.975s 2.02%`
	`82`	`+trashing`
	`83`	`+1: ^stev and ^ray and ^vaugh : 0.020u 0.010s 0m17.667s 0.16%`
	`84`	`+2: ^stev and ^ray and ^vaugh : 0.030u 0.010s 0m3.745s 1.06%`
	`85`	`+3: ^stev and ^ray and ^vaugh : 0.030u 0.020s 0m3.439s 1.45%`
	`86`	`+trashing`
	`87`	`+1: ^lling (no join) : 0.020u 0.040s 0m2.218s 2.70%`
	`88`	`+2: ^lling (no join) : 0.020u 0.020s 0m0.506s 7.90%`
	`89`	`+3: ^lling (no join) : 0.030u 0.030s 0m0.510s 11.76%`
	`90`	`+trashing`
	`91`	`+1: ^vaughan (no join) : 0.040u 0.050s 0m2.048s 4.39%`
	`92`	`+2: ^vaughan (no join) : 0.030u 0.020s 0m0.332s 15.04%`
	`93`	`+3: ^vaughan (no join) : 0.040u 0.010s 0m0.318s 15.72%`
	`94`	`+trashing`
	`95`	`+1: ^rol (no join) : 0.020u 0.030s 0m2.384s 2.09%`
	`96`	`+2: ^rol (no join) : 0.020u 0.030s 0m0.676s 7.39%`
	`97`	`+3: ^rol (no join) : 0.020u 0.030s 0m0.697s 7.17%`

`‎contrib/fulltextindex/Makefile`

Lines changed: 24 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,24 @@`
	`1`	`+SRCDIR= ../../src`
	`2`	`+`
	`3`	`+include$(SRCDIR)/Makefile.global`
	`4`	`+`
	`5`	`+CONTRIBDIR=$(LIBDIR)/contrib`
	`6`	`+`
	`7`	`+CFLAGS+=$(CFLAGS_SL) -I$(SRCDIR)/include`
	`8`	`+`
	`9`	`+TARGETS= fti$(DLSUFFIX)`
	`10`	`+CLEANFILES+=$(TARGETS)`
	`11`	+CURDIR=`pwd`
	`12`	`+`
	`13`	`+all::$(TARGETS)`
	`14`	`+`
	`15`	`+%.sql:%.source`
	`16`	`+rm -f$@;\`
	`17`	`+sed -e"s:_CURRENTDIR_:$(CURDIR):g"\`
	`18`	`+-e"s:_DLSUFFIX_:$(DLSUFFIX):g"<$<>$@`
	`19`	`+`
	`20`	`+clean:`
	`21`	`+rm -f$(TARGETS)*.o`
	`22`	`+`
	`23`	`+dist:`
	`24`	`+tar cf fti.tar README BENCH Makefile fti.c timings.sh`

`‎contrib/fulltextindex/README`

Lines changed: 95 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,95 @@`
	`1`	`+An attempt at some sort of Full Text Indexing for PostgreSQL.`
	`2`	`+`
	`3`	`+The included software is an attempt to add some sort of Full Text Indexing`
	`4`	`+support to PostgreSQL. I mean by this that we can ask questions like:`
	`5`	`+`
	`6`	`+Give me all rows that have 'still' and 'nash' in the 'artist' field.`
	`7`	`+`
	`8`	`+Ofcourse we can write this as:`
	`9`	`+`
	`10`	`+select * from cds where artist ~* 'stills' and artist ~* 'nash';`
	`11`	`+`
	`12`	`+But this does not use any indices, and therefore, if your database`
	`13`	`+gets very large, it will not have very high performance (the above query`
	`14`	`+requires at least one sequential scan, it probably takes 2 due to the`
	`15`	`+self-join).`
	`16`	`+`
	`17`	`+The approach used by this add-on is to define a trigger on the table and`
	`18`	`+column you want to do this queries on. On every insert in the table, it`
	`19`	`+takes the value in the specified column, breaks the text in this column`
	`20`	`+up into pieces, and stores all sub-strings into another table, together`
	`21`	`+with a reference to the row in the original table that contained this`
	`22`	`+sub-string (it uses the oid of that row).`
	`23`	`+`
	`24`	`+By now creating an index over the 'fti-table', we can search for`
	`25`	`+substrings that occur in the original table. By making a join between`
	`26`	`+the fti-table and the orig-table, we can get the actual rows we want`
	`27`	`+(this can also be done by using subselects, and maybe there're other`
	`28`	`+ways too).`
	`29`	`+`
	`30`	`+As an example we take the previous query, where we assume we have all`
	`31`	`+sub-strings in the table 'cds-fti':`
	`32`	`+`
	`33`	`+select c.*`
	`34`	`+from cds c, cds-fti f1, cds-fti f2`
	`35`	`+wheref1.string ~ '^stills' and`
	`36`	`+f2.string ~ '^nash' and`
	`37`	`+f1.id = c.oid and`
	`38`	`+f2.id = c.oid ;`
	`39`	`+`
	`40`	`+We can use the ~ (case-sensitive regular expression) here, because of`
	`41`	`+the way sub-strings are built: from right to left, ie. house -> 'se' +`
	`42`	`+'use' + 'ouse' + 'house'. If a ~ search starts with a ^ (match start of`
	`43`	`+string), btree indices can be used by PostgreSQL.`
	`44`	`+`
	`45`	`+Now, how do we create the trigger that maintains the fti-table? First: the`
	`46`	`+fti-table should have the following schema:`
	`47`	`+`
	`48`	`+create cds-fti ( string varchar(N), id oid );`
	`49`	`+`
	`50`	`+Don't change the names of the columns, the varchar() can in fact also`
	`51`	`+be of text-type. If you do use varchar, make sure the largest possible`
	`52`	`+sub-string will fit.`
	`53`	`+`
	`54`	`+The create the function that contains the trigger::`
	`55`	`+`
	`56`	`+create function fti() returns opaque as '/path/to/fti.so' language 'C';`
	`57`	`+`
	`58`	`+And finally define the trigger on the 'cds' table:`
	`59`	`+`
	`60`	`+create trigger cds-fti-trigger after update or insert or delete on cds`
	`61`	`+ for each row execute procedure fti(cds-fti, artist);`
	`62`	`+`
	`63`	`+Here, the trigger will be defined on table 'cds', it will create`
	`64`	`+sub-strings from the field 'artist', and it will place those sub-strings`
	`65`	`+in the table 'cds-fti'.`
	`66`	`+`
	`67`	`+Now populate the table 'cds'. This will also populate the table 'cds-fti'.`
	`68`	`+It's fastest to populate the table before you create the indices.`
	`69`	`+`
	`70`	`+Before you start using the system, you should at least have the following`
	`71`	`+indices:`
	`72`	`+`
	`73`	`+create index cds-fti-idx on cds-fti (string, id);`
	`74`	`+create index cds-oid-idx on cds (oid);`
	`75`	`+`
	`76`	`+To get the most performance out of this, you should have 'cds-fti'`
	`77`	`+clustered on disk, ie. all rows with the same sub-strings should be`
	`78`	`+close to each other. There are 3 ways of doing this:`
	`79`	`+`
	`80`	`+1. After you have created the indices, execute 'cluster cds-fti-idx on cds-fti'.`
	`81`	`+2. Do a 'select * into tmp-table from cds-fti order by string' before`
	`82`	`+ you create the indices, then 'drop table cds-fti' and`
	`83`	`+ 'alter table tmp-table rename to cds-fti'`
	`84`	`+3. Before creating indices, dump the contents of the cds-fti table using`
	`85`	`+ 'pg_dump -a -t cds-fti dbase-name', remove the \connect`
	`86`	`+ from the beginning and the \. from the end, and sort it using the`
	`87`	`+ UNIX 'sort' program, and reload the data.`
	`88`	`+`
	`89`	`+Method 1 is very slow, 2 a lot faster, and for very large tables, 3 is`
	`90`	`+preferred.`
	`91`	`+`
	`92`	`+`
	`93`	`+`
	`94`	`+`
	`95`	`+`

0 commit comments

Comments

(0)

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Commit8f0ca62

File tree

5 files changed

5 files changed

`‎contrib/fulltextindex/BENCH`

`‎contrib/fulltextindex/Makefile`

`‎contrib/fulltextindex/README`

0 commit comments