Java Algorithms#

Arrow’s Java library provides algorithms for some commonly-usedfunctionalities. The algorithms are provided in theorg.apache.arrow.algorithmpackage of thealgorithm module.

Comparing Vector Elements#

Comparing vector elements is the basic for many algorithms. Vectorelements can be compared in one of the two ways:

1.Equality comparison: there are two possible results for this type of comparisons:equal andunequal.Currently, this type of comparison is supported through theorg.apache.arrow.vector.compare.VectorValueEqualizerinterface.

2.Ordering comparison: there are three possible results for this type of comparisons:lessthan,equaltoandgreaterthan. This comparison is supported by the abstract classorg.apache.arrow.algorithm.sort.VectorValueComparator.

We provide default implementations to compare vector elements. However, users can also define waysfor customized comparisons.

Vector Element Search#

A search algorithm tries to find a particular value in a vector. When successful, a vector index isreturned; otherwise, a-1 is returned. The following search algorithms are provided:

1.Linear search: this algorithm simply traverses the vector from the beginning, until a match isfound, or the end of the vector is reached. So it takesO(n) time, wheren is the number of elementsin the vector. This algorithm is implemented inorg.apache.arrow.algorithm.search.VectorSearcher#linearSearch.

2.Binary search: this represents a more efficient search algorithm, as it runs inO(log(n)) time.However, it is only applicable to sorted vectors. To get a sorted vector,one can use one of our sorting algorithms, which will be discussed in the next section. This algorithmis implemented inorg.apache.arrow.algorithm.search.VectorSearcher#binarySearch.

3.Parallel search: when the vector is large, it takes a long time to traverse the elements to searchfor a value. To make this process faster, one can split the vector into multiple partitions, and perform thesearch for each partition in parallel. This is supported byorg.apache.arrow.algorithm.search.ParallelSearcher.

4.Range search: for many scenarios, there can be multiple matching values in the vector.If the vector is sorted, the matching values reside in a contiguous region in the vector. Therange search algorithm tries to find the upper/lower bound of the region inO(log(n)) time.An implementation is provided inorg.apache.arrow.algorithm.search.VectorRangeSearcher.

Vector Sorting#

Given a vector, a sorting algorithm turns it into a sorted one. The sorting criteria mustbe specified by some ordering comparison operation. The sorting algorithms can beclassified into the following categories:

1.In-place sorter: an in-place sorter performs the sorting by manipulating the originalvector, without creating any new vector. So it just returns the original vector after the sorting operations.Currently, we haveorg.apache.arrow.algorithm.sort.FixedWidthInPlaceVectorSorter for in-placesorting inO(nlog(n)) time. As the name suggests, it only supports fixed width vectors.

2.Out-of-place sorter: an out-of-place sorter does not mutate the original vector. Instead,it copies vector elements to a new vector in sorted order, and returns the new vector.We haveorg.apache.arrow.algorithm.sort.FixedWidthInPlaceVectorSorter.FixedWidthOutOfPlaceVectorSorterandorg.apache.arrow.algorithm.sort.FixedWidthInPlaceVectorSorter.VariableWidthOutOfPlaceVectorSorterfor fixed width and variable width vectors, respectively. Both algorithms run inO(nlog(n)) time.

3.Index sorter: this sorter does not actually sort the vector. Instead, it returns an integervector, which correspond to indices of vector elements in sorted order. With the index vector, one caneasily construct a sorted vector. In addition, some other tasks can be easily achieved, like finding thek thsmallest value in the vector. Index sorting is supported byorg.apache.arrow.algorithm.sort.IndexSorter,which runs inO(nlog(n)) time. It is applicable to vectors of any type.

Other Algorithms#

Other algorithms include vector deduplication, dictionary encoding, etc., in thealgorithm module.