- Notifications
You must be signed in to change notification settings - Fork10
Easy-to-use Java library for similarity checking of strings or numeric-series
License
EdDuarte/similarity-search-java
Folders and files
| Name | Name | Last commit message | Last commit date | |
|---|---|---|---|---|
Repository files navigation
This library contains easy-to-use and high-performant nearest-neighbor-searchalgorithms (as specified in "Mining of Massive Datasets", Cambridge UniversityPress, Rajaraman, A., & Ullman, J. D.) implemented in Java, which can be used todetermine the similarity between text Strings or sets of numbers (any Collectionthat contains inheritors of java.lang.Number).
To achieve a higher performance, all these algorithms are implemented so thatmultiple operations (counters, hashing, etc...) are executed in parallel (usingthreads spawned by an ExecutorService).
<dependency> <groupId>com.edduarte</groupId> <artifactId>similarity-search-java</artifactId> <version>0.0.5</version></dependency>dependencies { compile 'com.edduarte:similarity-search-java:0.0.5'}If you just need an easy way to figure out the similarity between two strings ornumber-sets, use this:
// for stringsdoublesimilarity =Similarity.jaccard().of(string1,string2);// for number setsdoublesimilarity =Similarity.jaccard().of(set1,set2);
This will return a similarity coefficient, a value between 0 and 1 where 1 meansthe two strings or sets are exactly equal and where 0 means they are disjoint.
UsingofAsync() instead ofof() will return a CompletableFuturethat will hold the similarity coefficient once finalized. This way, multipleconversions of shingles and signatures can be executed in parallel.
In the case of string similarity, you might need to set an appropriate shinglelength based on the size of your strings. A shingle length of 2 is used bydefault, which works well for short strings. The shingle length should be 5 ifyour strings have the size of an email (e.g. 430 characters) or 8 if they havethe size of an average Wikipedia article (e.g. 7800 characters):
// only needed for stringsdoublesimilarity =Similarity.jaccard() .withShingleLength(5) .of(string1,string2);
Jaccard is an exact approach, so this will always return the most accurate,gold-standard result but at a slower speed than other approaches. Jaccardsimilarity should be enough for most use-cases, but if you're dealing with amassive number of operations in parallel (big data, data-streams), you should gofor a probabilistic approach like Minhashing or LSH, detailed in the "Advanced"section below.
Every Similarity operation is available through the simple fluent / builderinterface in the Similarity class.
// optional parametersdoublesimilarity =Similarity.jaccard()// Length of n-gram shingles that are used for// comparison (used for strings only). .withShingleLength(5)// An executor where the kshingling and signature// processing tasks are spawned. If nothing is// provided then it launches a new executor with// the cached thread pool. .withExecutor(executorService) .of(string1,string2);
// for stringsdoublesimilarity =Similarity.minhash().of(string1,string2);// for number setsdoublesimilarity =Similarity.minhash().of(set1,set2);// optional parametersdoublesimilarity =Similarity.minhash()// Length of n-gram shingles that are used when// generating signatures (used for strings only). .withShingleLength(5)// The size of the generated signatures, which are// compared to determine similarity. .withSignatureSize(100)// The hashing algorithm used to hash shingles to// signatures (used for strings only). .withHashMethod(HashMethod.Murmur3)// Number of unique elements in both sets (used for// sets only). For example, if set1=[4, 5, 6, 7, 8]// and set2=[7, 8, 9, 10], this value should be 7. If// nothing is provided, this value is determined in// pre-processing. .withNumberOfElements(14)// An executor where the kshingling and signature// processing tasks are spawned. If nothing is// provided then it launches a new executor with the// cached thread pool. .withExecutor(executorService) .of(string1,string2);
Minhashing is the fastest of the implemented approaches, but returns anon-deterministic index that can be worsened when using inappropriate shinglelengths or signature sizes. If you need the performance of a probabilisticapproach but the deterministic index values of Jaccard similarity, you can useMinhashing with Locality-Sensitive Hashing:
// for stringsdoublesimilarity =Similarity.lsh().of(string1,string2);// for number setsdoublesimilarity =Similarity.lsh().of(set1,set2);// optional parametersdoublesimilarity =Similarity.lsh()// Length of n-gram shingles that are used when// generating signatures (used for strings only). .withShingleLength(5)// The number of bands and rows where the minhash// signatures will be organized. .withNumberOfBands(20).withNumberOfRows(5)// A threshold S that balances the number of false// positives and false negatives. .withThreshold(0.5)// The hashing algorithm used to hash shingles to// signatures (used for strings only). .withHashMethod(HashMethod.Murmur3)// Number of unique elements in both sets (used for// sets only). For example, if set1=[4, 5, 6, 7, 8]// and set2=[7, 8, 9, 10], this value should be 7. If// nothing is provided, this value is determined in// pre-processing. .withNumberOfElements(14)// An executor where the kshingling and signature// processing tasks are spawned. If nothing is// provided then it launches a new executor with the// cached thread pool. .withExecutor(executorService) .of(string1,string2);
This will return the Jaccard similarity coefficient for strings / sets that areconsidered to be candidate pairs, or return 0 if they are not candidate pairs.In other words, this will potentially save up on resources by only performingthe expensive Jaccard coefficient measurement for a subset of a dataset andignoring elements that are too dissimilar. This also means that the result forcandidate pairs will be deterministic.
So far the code samples have shown how to use the builder pattern available inthe Similarity interface. However, you can instead instantiate the internalclasses that implement the Similarity interface.
// example valuesintshingleLength =2;intsignatureSize =100;HashMethodhashMethod =HashMethod.Murmur3;intn =5;intbands =20;introws =5;doublethreshold =0.5;// string similarityStringSimilaritystringSimilarity =newJaccardStringSimilarity(s1,s2,shingleLength,executor);StringSimilaritystringSimilarity =newMinHashStringSimilarity(s1,s2,shingleLength,signatureSize,hashMethod,executor);StringSimilaritystringSimilarity =newLSHStringSimilarity(s1,s2,shingleLength,bands,rows,threshold,hashMethod,executor);doublesimilarity =stringSimilarity.getAsDouble();booleanisSimilar =stringSimilarity.getAsBoolean();// true if getAsDouble() returns a value greater than the threshold (0.5)// set similaritySetSimilaritysetSimilarity =newJaccardSetSimilarity(c1,c2);SetSimilaritysetSimilarity =newMinHashSetSimilarity(c1,c2,n,signatureSize,executor);SetSimilaritysetSimilarity =newLSHSetSimilarity(c1,c2,n,bands,rows,threshold,executor);doublesimilarity =setSimilarity.getAsDouble();
Do note that while the Similarity builder pattern ensures that the providednumber sets are copied, so that the computation of the similarity index is notaffected by external threads that perform changes to the original setsin parallel, these internal classes doNOT, and you should ensure thiscondition yourself. This is intentional, so that these internal classes providea high-performance variant of Similarity instantiation.
You can also use a number of classes that correspond to each step of theimplemented similarity search algorithms, and use them in your application atyour own accord. Each of the converter classes below returns a result thatcould, for example, be stored in a database / cache for later use.
// base valuesStringexampleString ="example string";Set<Integer>exampleSet =Set.of(1,2,3,4,5);intshingleLength =2;intsignatureSize =100;HashMethodhashMethod =HashMethod.Murmur3;intn =5;intbands =20;introws =5;// generate shingles for stringKShinglerc1 =newKShingler(shingleLength);List<CharSequence>shingles =c1.apply(exampleString).call();// get jaccard similarity coefficient for the shingles abovedoublestringSimilarity =Similarity.jaccardIndex(shingles1,shingles2);// get jaccard similarity coefficient for the number-setdoublesetSimilarity =Similarity.jaccardIndex(exampleSet1,exampleSet2);// get signatures from shinglesKShinglesToSignatureConverterc2 =newKShinglesToSignatureConverter(hashMethod,signatureSize);int[]stringSignature =c2.apply(shingles).call();// generate a universal-hash signature for setsSetToSignatureConverterc3 =newSetToSignatureConverter(n,signatureSize);int[]setSignature =c3.apply(exampleSet).call();// get minhash similarity coefficientdoublestringSimilarity =Similarity.signatureIndex(stringSignature1,stringSignature2);doublesetSimilarity =Similarity.signatureIndex(setSignature1,setSignature2);// convert signatures to bandsSignatureToBandsConverterc4 =newSignatureToBandsConverter(bands,rows);int[]stringBands =c4.apply(stringSignature).call();int[]setBands =c4.apply(setSignature).call();// determine if there are any candidate pairsbooleanisCandidatePair =Similarity.isCandidatePair(stringBands1,stringBands2);booleanisCandidatePair =Similarity.isCandidatePair(setBands1,setBands2);
Note that all of the Converter classes above return a Callable, which can besubmitted into any Future or Executor in order to trigger multiple conversioncalls in parallel. Below is an example of how to obtain the string similaritybetween two strings, with k-shingling for both strings being forked as paralleltasks and then joined to compute the Jaccard index:
KShinglerkShingler =newKShingler(shingleLength);Future<List<CharSequence>>shingles1 =exec.submit(kShingler.apply("example string 1"));Future<List<CharSequence>>shingles2 =exec.submit(kShingler.apply("example string 2"));doublestringSimilarity =Similarity.jaccardIndex(shingles1.get(),shingles2.get());
You can see this library in use athttps://github.com/vokter/vokter.
Copyright 2018 Eduardo DuarteLicensed under the Apache License, Version 2.0 (the "License");you may not use this file except in compliance with the License.You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0Unless required by applicable law or agreed to in writing, softwaredistributed under the License is distributed on an "AS IS" BASIS,WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.See the License for the specific language governing permissions andlimitations under the License.About
Easy-to-use Java library for similarity checking of strings or numeric-series
Topics
Resources
License
Uh oh!
There was an error while loading.Please reload this page.
Stars
Watchers
Forks
Packages0
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Contributors3
Uh oh!
There was an error while loading.Please reload this page.