19

I want to create a database using any of the possible RDBMS. It will have a table with approximately 150 columns. The objective is to perform nearest neighbor search of some other objects. So it's a NNS in the 150-dimensional space.

I already tried to use some obvious methods like L1 or L2 distances but of course it takes a lot of time for tables with many rows. Also I tried to look at the KD-tree (note I did not test it) and PG-Strom but they are not a good solution for data with a many dimensions.

Can I somehow improve the speed of the described search using math methods (like KD-tree) or tech methods (like PG-Strom)?

I will try to use any RDBMS which allow to improve speed of the NNS. But MySQL and PostgreSQL are the most appropriate DBMS for me.

Evan Carroll's user avatar
Evan Carroll
65.8k50 gold badges263 silver badges513 bronze badges
askedFeb 4, 2017 at 5:35
konstantin_doncov's user avatar
1
  • 1
    These are other problems. Simply ask another question @don-progCommentedAug 10, 2017 at 16:02

4 Answers4

23

PostgreSQL 9.6 usingcube

First install thecube extension

CREATE EXTENSION cube;

Now we will create some n-dimensional space with 100,000 points in 50 dimensions. In addition we'll add a GIST index.

CREATE TEMP TABLE space_ndAS  SELECT i, cube(array_agg(random()::float)) AS c  FROM generate_series(1,1e5) AS i  CROSS JOIN LATERAL generate_series(1,50)    AS x  GROUP BY i;CREATE INDEX ON space_nd USING gist ( c );ANALYZE space_nd;

Now we will generate a single point and use the<-> operater to find the nearest point using Eucledian distance.

WITH points AS (  SELECT cube(array_agg(random()::float)) AS c  FROM generate_series(1,50)    AS x)SELECT i,  pg_typeof(space_nd.c),  pg_typeof(points.c),  cube_distance(space_nd.c, points.c)FROM space_ndCROSS JOIN pointsORDER BY space_nd.c <-> points.cLIMIT 5;

PostgreSQL 9.6+ supports other distance operators overcube. All of which can use the GIST index we created. Namely,

a <-> b float8  Euclidean distance between a and b.a <#> b float8  Taxicab (L-1 metric) distance between a and b.a <=> b float8  Chebyshev (L-inf metric) distance between a and b.

That said there is one caveat,

To make it harder for people to break things,there is a limit of 100 on the number of dimensions of cubes. This is set in cubedata.h if you need something bigger.

You ask for 150 dimensions. That may present a minor complication.

answeredFeb 10, 2017 at 18:25
Evan Carroll's user avatar
2
  • 1
    The edit tocubedata.h doesn't work past 130 dimensions in my experience. Maybe you can also change all thedoubles orfloat8s in the extension tofloat4, since Postgres has a limit on per-row index size that you can stay away from by halving how many bytes you use on each number. I did some testing and got more dimensions in that way, and IIRC I got past 150, but I'm not totally sure.CommentedNov 3, 2018 at 2:33
  • 2
    I had same problem with limit on dimensions and created docker image with 2048 limit:hub.docker.com/r/expert/postgresql-large-cubeCommentedApr 29, 2019 at 9:37
2

Consider performing dimension reduction first (eg. Principle Component Analysis).

Then your are doing NN in small number of dimensions with higher performance.

You can use Pl/R to perform PCA inside postgres if needed.

answeredFeb 21, 2017 at 19:51
Robin Chauhan's user avatar
1

Have a look atFLANN and OpenCV.

Unfortunately I am not aware of an integration of that into a RDBMS system. But there is for exampleintegration of chemical structure information with Posgres. So in principle this can be done.

answeredFeb 21, 2017 at 20:51
Grimaldi's user avatar
1

Take a look athttps://github.com/a-mma/AquilaDB it is a vector database to store Feature Vectors along with JSON Metadata. Keep it along with your RDBMS and do use metadata to maintain cross reference between data.

answeredAug 2, 2019 at 5:27
a_മ്മ's user avatar

Your Answer

Sign up orlog in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

By clicking “Post Your Answer”, you agree to ourterms of service and acknowledge you have read ourprivacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.