I want to create a database using any available RDBMS. It will have a table with approximately 150 columns. The objective is to perform nearest neighbor search (NNS) against some other objects, so it is NNS in 150-dimensional space.
I have already tried some obvious methods like L1 and L2 distances, but of course they take a lot of time on tables with many rows. I also looked at KD-trees (note: I did not test them) and PG-Strom, but neither is a good solution for data with many dimensions.
Can I somehow improve the speed of the described search using mathematical methods (like KD-trees) or technical methods (like PG-Strom)?
I am willing to use any RDBMS that allows speeding up the NNS, but MySQL and PostgreSQL are the most suitable for me.
- These are other problems. Simply ask another question @don-prog. – Evan Carroll, Aug 10, 2017
4 Answers
PostgreSQL 9.6 using `cube`
First install the `cube` extension:
```sql
CREATE EXTENSION cube;
```
Now we will create an n-dimensional space with 100,000 points in 50 dimensions. In addition we'll add a GiST index.
```sql
CREATE TEMP TABLE space_nd AS
  SELECT i, cube(array_agg(random()::float)) AS c
  FROM generate_series(1,1e5) AS i
  CROSS JOIN LATERAL generate_series(1,50) AS x
  GROUP BY i;

CREATE INDEX ON space_nd USING gist ( c );
ANALYZE space_nd;
```
Now we will generate a single point and use the `<->` operator to find the nearest point using Euclidean distance.
```sql
WITH points AS (
  SELECT cube(array_agg(random()::float)) AS c
  FROM generate_series(1,50) AS x
)
SELECT i,
       pg_typeof(space_nd.c),
       pg_typeof(points.c),
       cube_distance(space_nd.c, points.c)
FROM space_nd
CROSS JOIN points
ORDER BY space_nd.c <-> points.c
LIMIT 5;
```
PostgreSQL 9.6+ supports other distance operators over `cube`, all of which can use the GiST index we created. Namely,
- `a <-> b` returns `float8`: Euclidean distance between a and b.
- `a <#> b` returns `float8`: Taxicab (L-1 metric) distance between a and b.
- `a <=> b` returns `float8`: Chebyshev (L-inf metric) distance between a and b.

That said, there is one caveat:
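For instance, reusing the `space_nd` table created above, the same top-5 query can be run under the taxicab metric simply by swapping the operator (a sketch, not benchmarked here):

```sql
-- Same nearest-neighbor query as before, but ranked by taxicab (L1)
-- distance via <#>; the GiST index on space_nd.c can still be used.
WITH points AS (
  SELECT cube(array_agg(random()::float)) AS c
  FROM generate_series(1,50) AS x
)
SELECT i
FROM space_nd
CROSS JOIN points
ORDER BY space_nd.c <#> points.c
LIMIT 5;
```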
> To make it harder for people to break things, there is a limit of 100 on the number of dimensions of cubes. This is set in cubedata.h if you need something bigger.
You ask for 150 dimensions. That may present a minor complication.
- The edit to `cubedata.h` doesn't work past 130 dimensions in my experience. Maybe you can also change all the `double`s or `float8`s in the extension to `float4`, since Postgres has a limit on per-row index size that you can stay away from by halving how many bytes you use on each number. I did some testing and got more dimensions that way, and IIRC I got past 150, but I'm not totally sure. – sudo, Nov 3, 2018
- I had the same problem with the limit on dimensions and created a Docker image with a 2048 limit: hub.docker.com/r/expert/postgresql-large-cube – expert, Apr 29, 2019
Consider performing dimensionality reduction first (e.g. Principal Component Analysis).
Then you are doing NNS in a small number of dimensions, with higher performance.
You can use PL/R to perform PCA inside Postgres if needed.
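As a rough illustration (the function name and shape here are hypothetical, and a production setup would persist the rotation matrix so that new query points can be projected consistently), a PL/R function returning the first k principal component scores of a 2-D `float8[]` might look like:

```sql
-- Hypothetical sketch: PL/R maps a 2-D float8[] argument to an R matrix.
-- prcomp() centers the data and exposes the component scores in p$x;
-- we keep only the first k columns, returned as a 2-D array.
CREATE OR REPLACE FUNCTION pca_scores(m float8[], k int)
RETURNS float8[] AS $$
  p <- prcomp(m, center = TRUE)
  p$x[, 1:k]
$$ LANGUAGE plr;
```

The reduced vectors could then be stored in a `cube` column and indexed with GiST as in the answer above.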
Have a look at FLANN and OpenCV.
Unfortunately I am not aware of an integration of either into an RDBMS. But there is, for example, integration of chemical structure information with Postgres, so in principle it can be done.
Take a look at https://github.com/a-mma/AquilaDB. It is a vector database for storing feature vectors along with JSON metadata. Keep it alongside your RDBMS and use the metadata to maintain cross references between the data.
