SemHash: Fast Semantic Text Deduplication

SemHash is a lightweight and flexible tool for deduplicating datasets using semantic similarity. It combines fast embedding generation from Model2Vec with efficient ANN-based similarity search through Vicinity.
SemHash supports both single-dataset deduplication (e.g., cleaning up a train set) and multi-dataset deduplication (e.g., ensuring no overlap between a test set and a train set). It works with simple datasets, such as text lists, and more complex ones, like multi-column QA datasets. Additionally, it includes functions to inspect deduplication results, making it easier to understand and refine your data cleaning process.
Install the package with:

```bash
pip install semhash
```
Deduplicate a single dataset with the following code (note: the examples assume you have `datasets` installed, which you can install with `pip install datasets`):
```python
from datasets import load_dataset
from semhash import SemHash

# Load a dataset to deduplicate
texts = load_dataset("ag_news", split="train")["text"]

# Initialize a SemHash instance
semhash = SemHash.from_records(records=texts)

# Deduplicate the texts
deduplicated_texts = semhash.self_deduplicate().deduplicated
```
Or, deduplicate across two datasets with the following code (e.g., eliminating train/test leakage):
```python
from datasets import load_dataset
from semhash import SemHash

# Load two datasets to deduplicate
train_texts = load_dataset("ag_news", split="train")["text"]
test_texts = load_dataset("ag_news", split="test")["text"]

# Initialize a SemHash instance with the training data
semhash = SemHash.from_records(records=train_texts)

# Deduplicate the test data against the training data, optionally with a specific threshold
deduplicated_test_texts = semhash.deduplicate(records=test_texts, threshold=0.9).deduplicated
```
Or, deduplicate multi-column datasets with the following code (e.g., deduplicating a QA dataset):
```python
from datasets import load_dataset
from semhash import SemHash

# Load the dataset
dataset = load_dataset("squad_v2", split="train")

# Convert the dataset to a list of dictionaries
records = [dict(row) for row in dataset]

# Initialize SemHash with the columns to deduplicate
semhash = SemHash.from_records(records=records, columns=["question", "context"])

# Deduplicate the records
deduplicated_records = semhash.self_deduplicate().deduplicated
```
The `deduplicate` and `self_deduplicate` functions return a `DeduplicationResult`. This object stores the deduplicated corpus, a set of duplicate objects (along with the objects that caused the duplication), and several useful functions to further inspect the deduplication result. Examples of how these functions can be used can be found in the usage section.
- Fast: SemHash uses model2vec to embed texts and vicinity to perform similarity search, making it extremely fast.
- Scalable: SemHash can deduplicate large datasets with millions of records thanks to the ANN backends in Vicinity.
- Flexible: SemHash can be used to deduplicate a single dataset or across two datasets, and can also be used to deduplicate multi-column datasets (such as QA datasets).
- Lightweight: SemHash is a lightweight package with minimal dependencies, making it easy to install and use.
- Explainable: Easily inspect the duplicates and what caused them with the `DeduplicationResult` object. You can also view the lowest-similarity duplicates to find the right deduplication threshold for your dataset.
The following examples show the various ways you can use SemHash to deduplicate datasets. These examples assume you have the `datasets` library installed, which you can install with `pip install datasets`.
Deduplicate a single dataset
The following code snippet shows how to deduplicate a single dataset using SemHash (in this example, the train split of the AG News dataset):
```python
from datasets import load_dataset
from semhash import SemHash

# Load a dataset to deduplicate
texts = load_dataset("ag_news", split="train")["text"]

# Initialize a SemHash instance
semhash = SemHash.from_records(records=texts)

# Deduplicate the texts
deduplicated_texts = semhash.self_deduplicate()
```
Deduplicate across two datasets
The following code snippet shows how to deduplicate across two datasets using SemHash (in this example, the train/test split of the AG News dataset):
```python
from datasets import load_dataset
from semhash import SemHash

# Load two datasets to deduplicate
train_texts = load_dataset("ag_news", split="train")["text"]
test_texts = load_dataset("ag_news", split="test")["text"]

# Initialize a SemHash instance with the training data
semhash = SemHash.from_records(records=train_texts)

# Deduplicate the test data against the training data
deduplicated_test_texts = semhash.deduplicate(records=test_texts)
```
Deduplicate multi-column datasets
The following code snippet shows how to deduplicate multi-column datasets using SemHash (in this example, the train split of the QA dataset SQuAD 2.0, which consists of questions, contexts, and answers):
```python
from datasets import load_dataset
from semhash import SemHash

# Load the dataset
dataset = load_dataset("squad_v2", split="train")

# Convert the dataset to a list of dictionaries
records = [dict(row) for row in dataset]

# Initialize SemHash with the columns to deduplicate
semhash = SemHash.from_records(records=records, columns=["question", "context"])

# Deduplicate the records
deduplicated_records = semhash.self_deduplicate().deduplicated
```
DeduplicationResult functionality
The `DeduplicationResult` object returned by the `deduplicate` and `self_deduplicate` functions contains several useful functions to inspect the deduplication result. The following code snippet shows how to use these functions:
```python
from datasets import load_dataset
from semhash import SemHash

# Load a dataset to deduplicate
texts = load_dataset("ag_news", split="train")["text"]

# Initialize a SemHash instance
semhash = SemHash.from_records(records=texts)

# Deduplicate the texts
deduplication_result = semhash.self_deduplicate()

# Check the deduplicated texts
deduplication_result.deduplicated

# Check the duplicates
deduplication_result.duplicates

# See what percentage of the texts were duplicates
deduplication_result.duplicate_ratio

# See what percentage of the texts were exact duplicates
deduplication_result.exact_duplicate_ratio

# Get the least similar text from the duplicates.
# This is useful for finding the right threshold for deduplication.
least_similar = deduplication_result.get_least_similar_from_duplicates()

# Rethreshold the duplicates. This allows you to instantly rethreshold the duplicates
# with a new threshold without having to re-deduplicate the texts.
deduplication_result.rethreshold(0.95)
```
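As a rough usage sketch, these helpers can be combined to explore thresholds. This assumes, as the in-place call above suggests, that `rethreshold()` updates the result object rather than returning a new one; check the semhash documentation for the exact behavior:

```python
# Inspect how much was removed and look at the most borderline duplicate
print(f"Duplicate ratio: {deduplication_result.duplicate_ratio:.3f}")
print(f"Least similar duplicate: {deduplication_result.get_least_similar_from_duplicates()}")

# Tighten the threshold without re-embedding or re-indexing the texts
# (assumption: rethreshold() updates this result in place)
deduplication_result.rethreshold(0.95)
print(f"Duplicate ratio at 0.95: {deduplication_result.duplicate_ratio:.3f}")
```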
Using custom encoders
The following code snippet shows how to use a custom encoder with SemHash:
```python
from datasets import load_dataset
from model2vec import StaticModel
from semhash import SemHash

# Load a dataset to deduplicate
texts = load_dataset("ag_news", split="train")["text"]

# Load an embedding model (in this example, a multilingual model)
model = StaticModel.from_pretrained("minishlab/M2V_multilingual_output")

# Initialize a SemHash instance with the model as a custom encoder
semhash = SemHash.from_records(records=texts, model=model)

# Deduplicate the texts
deduplicated_texts = semhash.self_deduplicate()
```
Any encoder can be used that adheres to our encoder protocol. For example, any sentence-transformers model can be used as an encoder:
```python
from datasets import load_dataset
from semhash import SemHash
from sentence_transformers import SentenceTransformer

# Load a dataset to deduplicate
texts = load_dataset("ag_news", split="train")["text"]

# Load a sentence-transformers model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Initialize a SemHash instance with the model as a custom encoder
semhash = SemHash.from_records(records=texts, model=model)

# Deduplicate the texts
deduplicated_texts = semhash.self_deduplicate()
```
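To illustrate what the encoder protocol roughly looks like, here is a minimal sketch of a hand-rolled encoder. The `HashingEncoder` class is hypothetical, and the sketch assumes the protocol only requires an `encode` method that maps a list of strings to a 2D numpy array (one row per text); check the encoder protocol definition in the semhash source for the exact signature:

```python
import numpy as np
from datasets import load_dataset
from semhash import SemHash


class HashingEncoder:
    """Toy encoder sketch: hashed bag-of-words vectors (illustrative only)."""

    def __init__(self, dim: int = 256) -> None:
        self.dim = dim

    def encode(self, sentences: list[str], **kwargs) -> np.ndarray:
        # One row per sentence, one column per hash bucket
        embeddings = np.zeros((len(sentences), self.dim), dtype=np.float32)
        for i, sentence in enumerate(sentences):
            for token in sentence.lower().split():
                embeddings[i, hash(token) % self.dim] += 1.0
        return embeddings


texts = load_dataset("ag_news", split="train")["text"]

# Plug the custom encoder in the same way as the models above
semhash = SemHash.from_records(records=texts, model=HashingEncoder())
deduplicated_texts = semhash.self_deduplicate().deduplicated
```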
Using Pandas DataFrames
You can easily use Pandas DataFrames with SemHash. The following code snippet shows how to deduplicate a Pandas DataFrame:
```python
import pandas as pd
from datasets import load_dataset
from semhash import SemHash

# Load a dataset as a pandas dataframe
dataframe = load_dataset("ag_news", split="train").to_pandas()

# Convert the dataframe to a list of dictionaries
dataframe = dataframe.to_dict(orient="records")

# Initialize a SemHash instance with the columns to deduplicate
semhash = SemHash.from_records(records=dataframe, columns=["text"])

# Deduplicate the texts
deduplicated_records = semhash.self_deduplicate().deduplicated

# Convert the deduplicated records back to a pandas dataframe
deduplicated_dataframe = pd.DataFrame(deduplicated_records)
```
NOTE: By default, we use the ANN (approximate nearest neighbors) backend for deduplication. We recommend keeping this setting, since recall for smaller datasets is ~100%, and it's needed for larger datasets (>1M samples), which would take too long to deduplicate without ANN. If you want to use the flat/exact-matching backend, you can set `use_ann=False` in the SemHash constructor:

```python
semhash = SemHash.from_records(records=texts, use_ann=False)
```
We've benchmarked SemHash on a variety of datasets to measure the deduplication performance and speed. The benchmarks were run with the following setup:
- The benchmarks were all run on CPU.
- The benchmarks were all run with use_ann=True.
- The encoder used is the default encoder (potion-base-8M).
- The timings include the encoding time, index building time, and deduplication time.
Train deduplication benchmark:

Dataset | Original Train Size | Deduplicated Train Size | % Removed | Deduplication Time (s) |
---|---|---|---|---|
bbc | 1225 | 1144 | 6.61 | 0.57 |
senteval_cr | 3012 | 2990 | 0.73 | 0.14 |
tweet_sentiment_extraction | 27481 | 26695 | 2.86 | 1.77 |
emotion | 16000 | 15695 | 1.91 | 0.77 |
amazon_counterfactual | 5000 | 4992 | 0.16 | 0.33 |
ag_news | 120000 | 106921 | 10.90 | 5.20 |
enron_spam | 31716 | 20540 | 35.24 | 2.03 |
subj | 8000 | 7990 | 0.12 | 0.63 |
sst5 | 8544 | 8526 | 0.21 | 0.58 |
20_newgroups | 11314 | 10684 | 5.57 | 0.73 |
hatespeech_offensive | 22783 | 22090 | 3.04 | 0.92 |
ade | 17637 | 15718 | 10.88 | 0.73 |
imdb | 25000 | 24830 | 0.68 | 1.76 |
massive_scenario | 11514 | 9366 | 18.66 | 0.47 |
student | 117519 | 63856 | 45.66 | 8.80 |
squad_v2 | 130319 | 109698 | 15.82 | 8.81 |
wikitext | 1801350 | 884645 | 50.89 | 83.53 |
Train/test deduplication benchmark:

Dataset | Train Size | Test Size | Deduplicated Test Size | % Removed | Deduplication Time (s) |
---|---|---|---|---|---|
bbc | 1225 | 1000 | 870 | 13.00 | 0.71 |
senteval_cr | 3012 | 753 | 750 | 0.40 | 0.13 |
tweet_sentiment_extraction | 27481 | 3534 | 3412 | 3.45 | 1.53 |
emotion | 16000 | 2000 | 1926 | 3.70 | 0.65 |
amazon_counterfactual | 5000 | 5000 | 4990 | 0.20 | 0.51 |
ag_news | 120000 | 7600 | 6198 | 18.45 | 3.74 |
enron_spam | 31716 | 2000 | 1060 | 47.00 | 1.94 |
subj | 8000 | 2000 | 1999 | 0.05 | 0.62 |
sst5 | 8544 | 2210 | 2205 | 0.23 | 0.59 |
20_newgroups | 11314 | 7532 | 7098 | 5.76 | 2.25 |
hatespeech_offensive | 22783 | 2000 | 1925 | 3.75 | 0.77 |
ade | 17637 | 5879 | 4952 | 15.77 | 0.81 |
imdb | 25000 | 25000 | 24795 | 0.82 | 2.81 |
massive_scenario | 11514 | 2974 | 2190 | 26.36 | 0.46 |
student | 117519 | 5000 | 2393 | 52.14 | 3.78 |
squad_v2 | 130319 | 11873 | 11863 | 0.08 | 7.13 |
wikitext | 1801350 | 4358 | 2139 | 50.92 | 40.32 |
As can be seen, SemHash is extremely fast, and scales to large datasets with millions of records. There are some notable examples of train/test leakage, such as enron_spam and student, where the test dataset contains a significant amount of semantic overlap with the training dataset.
To run the benchmarks yourself, you can use the following command (assuming you have the `datasets` library installed):

```bash
python -m benchmarks.run_benchmarks
```

Optionally, the datasets can be updated in the datasets.py file.
License

MIT
If you use SemHash in your research, please cite the following:
```bibtex
@software{minishlab2025semhash,
  author = {Thomas van Dongen and Stephan Tulkens},
  title = {SemHash: Fast Semantic Text Deduplication},
  year = {2025},
  url = {https://github.com/MinishLab/semhash}
}
```