I want to create a database using any available RDBMS. It will have a table with approximately 150 columns. The objective is to perform nearest neighbor search (NNS) against some other objects, so it is NNS in 150-dimensional space.
I have already tried some obvious methods like L1 and L2 distances, but of course they take a lot of time on tables with many rows. I also looked at KD-trees (note: I did not test them) and PG-Strom, but neither is a good solution for data with many dimensions.
Can I somehow improve the speed of the described search using mathematical methods (like KD-trees) or technical methods (like PG-Strom)?
I am willing to use any RDBMS that allows speeding up the NNS, but MySQL and PostgreSQL are the most suitable for me.
- These are other problems. Simply ask another question @don-prog. – Evan Carroll, Aug 10, 2017
4 Answers
PostgreSQL 9.6 using `cube`
First install the `cube` extension:
```sql
CREATE EXTENSION cube;
```
Now we will create an n-dimensional space with 100,000 points in 50 dimensions. In addition we'll add a GiST index.
```sql
CREATE TEMP TABLE space_nd AS
  SELECT i, cube(array_agg(random()::float)) AS c
  FROM generate_series(1,1e5) AS i
  CROSS JOIN LATERAL generate_series(1,50) AS x
  GROUP BY i;

CREATE INDEX ON space_nd USING gist ( c );
ANALYZE space_nd;
```
Now we will generate a single point and use the `<->` operator to find the nearest point using Euclidean distance.
```sql
WITH points AS (
  SELECT cube(array_agg(random()::float)) AS c
  FROM generate_series(1,50) AS x
)
SELECT i,
       pg_typeof(space_nd.c),
       pg_typeof(points.c),
       cube_distance(space_nd.c, points.c)
FROM space_nd
CROSS JOIN points
ORDER BY space_nd.c <-> points.c
LIMIT 5;
```
PostgreSQL 9.6+ supports other distance operators over `cube`, all of which can use the GiST index we created. Namely,
- `a <-> b` returns `float8`: Euclidean distance between a and b.
- `a <#> b` returns `float8`: Taxicab (L-1 metric) distance between a and b.
- `a <=> b` returns `float8`: Chebyshev (L-inf metric) distance between a and b.

That said, there is one caveat:
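For instance, reusing the `space_nd` table created above, the same top-5 query can be run under the taxicab metric simply by swapping the operator (a sketch, not benchmarked here):

```sql
-- Same nearest-neighbor query as before, but ranked by taxicab (L1)
-- distance via <#>; the GiST index on space_nd.c can still be used.
WITH points AS (
  SELECT cube(array_agg(random()::float)) AS c
  FROM generate_series(1,50) AS x
)
SELECT i
FROM space_nd
CROSS JOIN points
ORDER BY space_nd.c <#> points.c
LIMIT 5;
```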
> To make it harder for people to break things, there is a limit of 100 on the number of dimensions of cubes. This is set in cubedata.h if you need something bigger.
You ask for 150 dimensions. That may present a minor complication.
- The edit to `cubedata.h` doesn't work past 130 dimensions in my experience. Maybe you can also change all the `double`s or `float8`s in the extension to `float4`, since Postgres has a limit on per-row index size that you can stay away from by halving how many bytes you use on each number. I did some testing and got more dimensions that way, and IIRC I got past 150, but I'm not totally sure. – sudo, Nov 3, 2018
- I had the same problem with the limit on dimensions and created a Docker image with a 2048 limit: hub.docker.com/r/expert/postgresql-large-cube – expert, Apr 29, 2019
Consider performing dimensionality reduction first (e.g. Principal Component Analysis).
Then you are doing NNS in a small number of dimensions, with higher performance.
You can use PL/R to perform PCA inside Postgres if needed.
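As a rough illustration (the function name and shape here are hypothetical, and a production setup would persist the rotation matrix so that new query points can be projected consistently), a PL/R function returning the first k principal component scores of a 2-D `float8[]` might look like:

```sql
-- Hypothetical sketch: PL/R maps a 2-D float8[] argument to an R matrix.
-- prcomp() centers the data and exposes the component scores in p$x;
-- we keep only the first k columns, returned as a 2-D array.
CREATE OR REPLACE FUNCTION pca_scores(m float8[], k int)
RETURNS float8[] AS $$
  p <- prcomp(m, center = TRUE)
  p$x[, 1:k]
$$ LANGUAGE plr;
```

The reduced vectors could then be stored in a `cube` column and indexed with GiST as in the answer above.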
Have a look at FLANN and OpenCV.
Unfortunately I am not aware of an integration of either into an RDBMS. But there is, for example, integration of chemical structure information with Postgres, so in principle it can be done.
Take a look at https://github.com/a-mma/AquilaDB. It is a vector database for storing feature vectors along with JSON metadata. Keep it alongside your RDBMS and use the metadata to maintain cross references between the data.
