- Notifications
You must be signed in to change notification settings - Fork3
Finding all pairs of similar documents time- and memory-efficiently
License
Apache-2.0, MIT licenses found
Licenses found
daac-tools/find-simdoc
Folders and files
| Name | Name | Last commit message | Last commit date | |
|---|---|---|---|---|
Repository files navigation
This software provides time- and memory-efficient all pairs similarity searches in documents.
- Input
- List of documents
$D = (d_1, d_2, \dots, d_n)$ - Distance function
$\delta: D \times D \rightarrow [0,1]$ - Radius threshold
$r \in [0,1]$
- List of documents
- Output
- All pairs of similar document ids
$R = \{ (i,j): i < j, \delta(d_i, d_j) \leq r \}$
- All pairs of similar document ids
This software supports all essential steps of document similarity search,from feature extraction to output of similar pairs.Therefore, you can immediately try the fast all pairs similarity search using your document files.
You can specify any delimiter when splitting words in tokenization for feature extraction.This can be useful in languages where multiple definitions of words exist, such as Japanese or Chinese.
The time and memory complexities arelinear over the numbers of input documents and output resultson the basis of the ideas behind the locality sensitive hashing (LSH) andsketch sorting approach.
LSH allows tuning of performance in accuracy, time, and memory, through a manual parameter specifying search dimensions.You can flexibly perform searches depending on your dataset and machine environment.
- Specifying lower dimensions allows for faster and rougher searches with less memory usage.
- Specifying higher dimensions allows for more accurate searches with more memory usage.
This software is implemented in Rust, achieving safe and fast performance.
Here, we describe the basic usage of this software through an example of running the CLI tool.
First of all, installrustc andcargo following theofficial instructions since this software is implemented in Rust.
You have to prepare a text file containing documents line by line (NOT including empty lines).
To produce an example file used throughout this description, you can usescripts/load_nltk_dataset.py that downloads the Reuters Corpus provided by NLTK.Run the following command.
$ ./scripts/load_nltk_dataset.py reutersreuters.txt will be output.
$ head reuters.txthre properties & lt ; hre > 1st qtr jan 31 net shr 38 cts vs 47 cts net 2 , 253 , 664 vs 2 , 806 , 820 gross income 5 , 173 , 318 vs 5 , 873 , 904 note : net includes gains on sale of real estate of 126 , 117 dlrs vs 29 , 812 dlrs .the firm , however , is supplying temporary financing , and sources close to the transaction disputed the claim that the firm will not end up paying for its equity position . conoco , which has completed geological prospecting for the tunisian government , has transferred one third of its option rights in the region to ina , it said ." willis faber ' s stake in morgan grenfell has been a very successful investment ," it said .china reports 700 mln dlr two - month trade deficit china ' s trade deficit totalled 700 mln dlrs in the first two months of this year , according to figures released by the state statistics bureau .the treasury said baker and stoltenberg " are consulting with their g - 7 colleagues and are confident that this will enable them to foster exchange rate stability around current levels ."u . s . tariffs are due to take effect on april 17 .some dealers said there were growing signs the united states wanted the dollar to fall further .since last august smart has been leading talks to open up japan to purchases of more u . s .- made automotive parts .the resulting association will operate under the name of charter and will be based in bristol .Fully-duplicate documents inreuters.txt are removed because they are noisy in evaluation of similarity searches.To do this, the output lines are shuffled, and your file will not be the identical to the example.
The workspacefind-simdoc-cli provides CLI tools for fast all pairs similarity searches in documents.The approach consists of three steps:
- Extract features from documents
- Set representation of character or word ngrams
- Tfidf-weighted vector representation of character or word ngrams
- Convert the features into binary sketches through locality sensitive hashing (LSH)
- 1-bit minwise hashing for the Jaccard similarity
- Simplified simhash for the Cosine similarity
- Search for similar sketches in the Hamming space using a modified variant of thesketch sorting approach
The executablejaccard provides a similarity search in theJaccard space.You can check the arguments with the following command.
$ cargo run --release -p find-simdoc-cli --bin jaccard -- --helpRun the following command if you want to search forreuters.txt with
- search radius
0.1, - tokens of character
5-grams, and 15*64=960dimensions in the Hamming space.
$ cargo run --release -p find-simdoc-cli --bin jaccard -- -i reuters.txt -r 0.1 -w 5 -c 15 > result-jaccard.csvArgument-c indicates the number of dimensions in the Hamming space,a trade-off parameter between approximation accuracy and search speed.The larger this value, the higher the accuracy, but the longer the search takes.This section describes how to examine the approximation accuracy for the number of dimensions.
Pairs of similar documents (indicated by zero-origin line numbers) and their distances are reported.
$ head result-jaccard.csvi,j,dist191,29637,0.07291666666666667199,38690,0.0375274,10048,0.07083333333333333294,27675,0.04791666666666667311,13812,0.04583333333333333361,50938,0.08958333333333333469,6360,0.035416666666666666546,10804,0.0875690,28281,0.0875The executablecosine provides a similarity search in theCosine space.You can check the arguments with the following command.
$ cargo run --release -p find-simdoc-cli --bin cosine -- --helpRun the following command if you want to search forreuters.txt with
- search radius
0.1, - tokens of word
3-grams, - word delimiter
" "(i.e., a space), 10*64=640dimensions in the Hamming space, and- weighting using the standard TF and the smoothed IDF.
$ cargo run --release -p find-simdoc-cli --bin cosine -- -i reuters.txt -r 0.1 -d " " -w 3 -c 10 -T standard -I smooth > result-cosine.csvPairs of similar documents (indicated by zero-origin line numbers) and their distances are reported.
$ head result-cosine.csvi,j,dist542,49001,0.084375964,24198,0.093751872,3024,0.08593751872,6823,0.0906251872,8462,0.09531251872,11402,0.0906251872,18511,0.08593751872,41491,0.08751872,48344,0.0859375The executabledump prints similar documents from an output CSV file.
If you want to print similar documents inreuters.txt with the resultresult-jaccard.csv,run the following command.
$ cargo run --release -p find-simdoc-cli --bin dump -- -i reuters.txt -s result-jaccard.csv[i=191,j=29637,dist=0.07291666666666667]pending its deliberations , harper and row ' s board has postponed indefinitely a special meeting of stockholders that had been scheduled for april 2 to discuss a proposal to recapitalize the company ' s stock to create two classes of shares with different voting rights .pending its deliberations , harper and row ' s board has postponed indefinitely a special meeting of stockholders that had been scheduled for april 2 to discuss a proposal to recapitalize the company ' s stock in order to create two classes of shares with different votinmg rights .[i=199,j=38690,dist=0.0375]government officials had no immediate comment on the report , which advised a reduction in the overall size of the public investment programme and greater emphasis on the preservation of peru ' s export potential .government officials had no immediate comment on the report , which advised a reduction in the overall size of the public investment program and greater emphasis on the preservation of peru ' s export potential .[i=274,j=10048,dist=0.07083333333333333]the measure was adopted as part of a wide - ranging trade bill that will be considered by the full house in april before it moves on to the senate .the measure was adopted as part of a wide - ranging trade bill that will be considered by the full house in april before it moves onto the senate .[i=294,j=27675,dist=0.04791666666666667]the company said the start - up was necessitated by continuing strong demand for aluminum and dwindling worldwide inventories , and that the metal is needed to supply reynolds ' various fabricating businesses .the company said the start up was necessitated by continuing strong demand for aluminum and dwindling worldwide inventories , and that the metal is needed to supply reynolds ' various fabricating businesses .[i=311,j=13812,dist=0.04583333333333333]he said in an interview with reuter that after a few years it was likely south korea would drop barriers to foreign goods and move toward a more balanced trade position .he said in an interview with reuters that after a few years it was likely south korea would drop barriers to foreign goods and move toward a more balanced trade position .[i=361,j=50938,dist=0.08958333333333333]hog and cattle slaughter guesstimates chicago mercantile exchange floor traders and commission house representatives are guesstimating today ' s hog slaughter at about 295 , 000 to 305 , 000 head versus 307 , 000 week ago and 311 , 000 a year ago .hog and cattle slaughter guesstimates chicago mercantile exchange floor traders and commission house representatives are guesstimating today ' s hog slaughter at about 295 , 000 to 308 , 000 head versus 305 , 000 week ago and 308 , 000 a year ago .[i=469,j=6360,dist=0.035416666666666666]the national planning department forecast that in 1987 coffee , colombia ' s traditional major export , will account for only one - third of total exports , or about 1 . 5 billion dlrs .the national planning department forecast that in 1987 coffee , colombia ' s traditional major export , will account for only one third of total exports , or about 1 . 5 billion dlrs ....LSH is an approximate solution, and you may want to know the accuracy.The executableminhash_acc allows you to examine
- the mean absolute error that is the averaged gap between the normalized Hamming distance and the actual Jaccard distance; and
- the number of true results, precisions, recalls, and F1-scores for search radii {0.01, 0.02, 0.05, 0.1, 0.2, 0.5}.
To use this executable, we recommend extracting a small subset from your datasetbecause it exactly computes distances for all possible pairs (although the computation is accelerated with parallelization).
$ head -5000 reuters.txt > reuters.5k.txtYou can test the number of Hamming dimensions from 64 to 6400(i.e., the number of chunks from 1 to 100 indicated with-c)with the following command.The arguments for feature extraction are the same as those ofjaccard.
$ cargo run --release -p find-simdoc-cli --bin minhash_acc -- -i reuters.5k.txt -w 5 > acc.csvLSH is an approximate solution, and the number of dimensions in the Hamming space(indicated with the command line argument-c) is related to the approximation accuracy.As a hint for choosing a parameter of-c, we show experimental results obtained fromreuters.txt of 51,535 documents when setting-w 5.
The following figure shows MAEs while varying the number of Hamming dimensions from 64 to 6400 (i.e., the number of chunks from 1 to 100 indicated with -c).
As expected, the larger the number, the higher the accuracy. For example, when the number of dimensions is 1024 (setting the argument-c 16), we achieve the MAE around 2.5%.
Of the precision, recall, and F1 score, the most interesting would be the recall.This is because false positives can be filtered out in post processing.
The following figure shows recalls in search with radii 0.05, 0.1, and 0.2 (indicated with the argument-r).
For radii 0.1 and 0.2, over 90% recalls are achieved in most cases.For smaller radius 0.05, 75-90% recalls are obtained because the MAE becomes larger relative to the radius.
By the way, the numbers of true results are 50, 201, and 626 for radii 0.05, 0.1, and 0.2, respecitvely.
The following figure shows F1 scores in search with radii 0.05, 0.1, and 0.2 (indicated with the argument-r).
- For radius 0.05, over 90% scores are achieved from 3520 dimensions (setting
-c 55). - For radius 0.1, over 90% scores are achieved from 704 dimensions (setting
-c 11). - For radius 0.2, over 90% scores are achieved from 448 dimensions (setting
-c 7).
This software is developed by LegalForce, Inc.,but not an officially supported LegalForce product.
Licensed under either of
- Apache License, Version 2.0(LICENSE-APACHE orhttp://www.apache.org/licenses/LICENSE-2.0)
- MIT license(LICENSE-MIT orhttp://opensource.org/licenses/MIT)
at your option.
Unless you explicitly state otherwise, any contribution intentionally submittedfor inclusion in the work by you, as defined in the Apache-2.0 license, shall bedual licensed as above, without any additional terms or conditions.
About
Finding all pairs of similar documents time- and memory-efficiently
Topics
Resources
License
Apache-2.0, MIT licenses found
Licenses found
Contributing
Uh oh!
There was an error while loading.Please reload this page.
Stars
Watchers
Forks
Packages0
Uh oh!
There was an error while loading.Please reload this page.
Contributors2
Uh oh!
There was an error while loading.Please reload this page.