Graph-based similarity search

Graph-based methods use proximity graphs, where nodes represent data vectors and two nodes are connected if they fulfilla defined property or neighborhood criterion, building on the structure inherent in the data. Search involves startingat a designated entry point and traversing the graph to get closer and closer to the nearest neighbor with each hop. Wefollow the Vamana[SDSK19] algorithm for graph building and search.

How does the graph search work?

The simplest way to traverse the graph to find 1 approximate nearest neighbor is to do agreedy search. At each hop,the distances from the query to all the neighbors of the current node (i.e., vectors in the current node’s adjacencylist) are computed and the closest point is chosen as the next point to be explored. The search ends when the distance tothe query cannot be further reduced by jumping to any of the neighbors of the current node.

How do we find k neighbors? To improve the search accuracy and be able to find k nearest neighbors, this greedy search is combined with apriorityqueue. While traversing the graph, we keep track of the distance from the query to thesearch_window_sizeclosest points seen so far (wheresearch_window_size is the length of the priority queue). At each hop, we choose toexplore next the closest point in the priority queue that has not been visited yet. The search ends when all theneighbors of the current node are further from the query than the furthest point in the priority queue. This preventsthe search path to diverge too far from the query. A largersearch_window_size implies exploring a larger volume,improving the accuracy at the cost of a longer search path.

How does graph building work?

First, weset the hyper-parameters required to build the graph:alpha,graph_max_degree,window_size,max_candidate_pool_size (seesvs.VamanaBuildParameters andsvs::index::vamana::VamanaBuildParameters for more details).

Then, the graph is built following the Vamana indexing algorithm[SDSK19] as follows:

  1. Start from an uninitialized graphG.

  2. Iterate through all nodes in a random order.

    1. Run the search for nodex on the currentG, with the search window size set towindow_size, and save the list of visited nodesC.

    2. UpdateG bypruningC to determine the new set ofx’s neighbors.

    3. Add backward edges (x,x*) for allx* inx’s out neighbors and prunex*’ edges.

  3. Make two passes over the dataset, the first one with the pruning parameteralpha =1 and the second one withalpha =alpha.

  4. Return graphG to be used by the search algorithm.

Thepruning rule limitsx’s out-neighborsN to a maximum ofgraph_max_degree as follows:

  1. Set the list of neighbors candidatesC =C UN \ {x }

  2. SortC in ascending distance fromx, and limitC to the closestmax_candidate_pool_size neighbors.

  3. InitializeN to null

  4. WhileC is not empty do:

    1. Findx* the closest point tox inC.

    2. Addx* tox’s out-neighbors listN.

    3. Iflength (N ) >graph_max_degree then break; else remove all points fromC that are closer tox* thanx by a factoralpha.