SemHash: Fast Semantic Text Deduplication

SemHash is a lightweight and flexible tool for deduplicating datasets using semantic similarity. It combines fast embedding generation from Model2Vec with efficient ANN-based similarity search through Vicinity.
SemHash supports both single-dataset deduplication (e.g., cleaning up a train set) and multi-dataset deduplication (e.g., ensuring no overlap between a test set and a train set). It works with simple datasets, such as text lists, and more complex ones, like multi-column QA datasets. Additionally, it includes functions to inspect deduplication results, making it easier to understand and refine your data cleaning process.
Install the package with:

```bash
pip install semhash
```
Deduplicate a single dataset with the following code (note: the examples assume you have `datasets` installed, which you can install with `pip install datasets`):
```python
from datasets import load_dataset
from semhash import SemHash

# Load a dataset to deduplicate
texts = load_dataset("ag_news", split="train")["text"]

# Initialize a SemHash instance
semhash = SemHash.from_records(records=texts)

# Deduplicate the texts
deduplicated_texts = semhash.self_deduplicate().deduplicated
```
Or, deduplicate across two datasets with the following code (e.g., eliminating train/test leakage):
```python
from datasets import load_dataset
from semhash import SemHash

# Load two datasets to deduplicate
train_texts = load_dataset("ag_news", split="train")["text"]
test_texts = load_dataset("ag_news", split="test")["text"]

# Initialize a SemHash instance with the training data
semhash = SemHash.from_records(records=train_texts)

# Deduplicate the test data against the training data, optionally with a specific threshold
deduplicated_test_texts = semhash.deduplicate(records=test_texts, threshold=0.9).deduplicated
```
Or, deduplicate multi-column datasets with the following code (e.g., deduplicating a QA dataset):
```python
from datasets import load_dataset
from semhash import SemHash

# Load the dataset
dataset = load_dataset("squad_v2", split="train")

# Convert the dataset to a list of dictionaries
records = [dict(row) for row in dataset]

# Initialize SemHash with the columns to deduplicate
semhash = SemHash.from_records(records=records, columns=["question", "context"])

# Deduplicate the records
deduplicated_records = semhash.self_deduplicate().deduplicated
```
The `deduplicate` and `self_deduplicate` functions return a `DeduplicationResult`. This object stores the deduplicated corpus, a set of duplicate objects (along with the objects that caused the duplication), and several useful functions to further inspect the deduplication result. Examples of how these functions can be used can be found in the usage section.
- Fast: SemHash uses model2vec to embed texts and vicinity to perform similarity search, making it extremely fast.
- Scalable: SemHash can deduplicate large datasets with millions of records thanks to the ANN backends in Vicinity.
- Flexible: SemHash can be used to deduplicate a single dataset or across two datasets, and can also be used to deduplicate multi-column datasets (such as QA datasets).
- Lightweight: SemHash is a lightweight package with minimal dependencies, making it easy to install and use.
- Explainable: Easily inspect the duplicates and what caused them with the `DeduplicationResult` object. You can also view the lowest-similarity duplicates to find the right deduplication threshold for your dataset.
The following examples show the various ways you can use SemHash to deduplicate datasets. These examples assume you have the `datasets` library installed, which you can install with `pip install datasets`.
Deduplicate a single dataset
The following code snippet shows how to deduplicate a single dataset using SemHash (in this example, the train split of the AG News dataset):
```python
from datasets import load_dataset
from semhash import SemHash

# Load a dataset to deduplicate
texts = load_dataset("ag_news", split="train")["text"]

# Initialize a SemHash instance
semhash = SemHash.from_records(records=texts)

# Deduplicate the texts
deduplicated_texts = semhash.self_deduplicate()
```
Deduplicate across two datasets
The following code snippet shows how to deduplicate across two datasets using SemHash (in this example, the train/test split of the AG News dataset):
```python
from datasets import load_dataset
from semhash import SemHash

# Load two datasets to deduplicate
train_texts = load_dataset("ag_news", split="train")["text"]
test_texts = load_dataset("ag_news", split="test")["text"]

# Initialize a SemHash instance with the training data
semhash = SemHash.from_records(records=train_texts)

# Deduplicate the test data against the training data
deduplicated_test_texts = semhash.deduplicate(records=test_texts)
```
Deduplicate multi-column datasets
The following code snippet shows how to deduplicate multi-column datasets using SemHash (in this example, the train split of the QA dataset SQuAD 2.0, which consists of questions, contexts, and answers):
```python
from datasets import load_dataset
from semhash import SemHash

# Load the dataset
dataset = load_dataset("squad_v2", split="train")

# Convert the dataset to a list of dictionaries
records = [dict(row) for row in dataset]

# Initialize SemHash with the columns to deduplicate
semhash = SemHash.from_records(records=records, columns=["question", "context"])

# Deduplicate the records
deduplicated_records = semhash.self_deduplicate().deduplicated
```
DeduplicationResult functionality
The `DeduplicationResult` object returned by the `deduplicate` and `self_deduplicate` functions contains several useful functions to inspect the deduplication result. The following code snippet shows how to use these functions:
```python
from datasets import load_dataset
from semhash import SemHash

# Load a dataset to deduplicate
texts = load_dataset("ag_news", split="train")["text"]

# Initialize a SemHash instance
semhash = SemHash.from_records(records=texts)

# Deduplicate the texts
deduplication_result = semhash.self_deduplicate()

# Check the deduplicated texts
deduplication_result.deduplicated

# Check the duplicates
deduplication_result.duplicates

# See what percentage of the texts were duplicates
deduplication_result.duplicate_ratio

# See what percentage of the texts were exact duplicates
deduplication_result.exact_duplicate_ratio

# Get the least similar text from the duplicates.
# This is useful for finding the right threshold for deduplication.
least_similar = deduplication_result.get_least_similar_from_duplicates()

# Rethreshold the duplicates. This allows you to instantly rethreshold the duplicates
# with a new threshold without having to re-deduplicate the texts.
deduplication_result.rethreshold(0.95)
```
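As a rough usage sketch, these helpers can be combined to explore thresholds. This assumes, as the in-place call above suggests, that `rethreshold()` updates the result object rather than returning a new one; check the semhash documentation for the exact behavior:

```python
# Inspect how much was removed and look at the most borderline duplicate
print(f"Duplicate ratio: {deduplication_result.duplicate_ratio:.3f}")
print(f"Least similar duplicate: {deduplication_result.get_least_similar_from_duplicates()}")

# Tighten the threshold without re-embedding or re-indexing the texts
# (assumption: rethreshold() updates this result in place)
deduplication_result.rethreshold(0.95)
print(f"Duplicate ratio at 0.95: {deduplication_result.duplicate_ratio:.3f}")
```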
Using custom encoders
The following code snippet shows how to use a custom encoder with SemHash:
```python
from datasets import load_dataset
from model2vec import StaticModel
from semhash import SemHash

# Load a dataset to deduplicate
texts = load_dataset("ag_news", split="train")["text"]

# Load an embedding model (in this example, a multilingual model)
model = StaticModel.from_pretrained("minishlab/M2V_multilingual_output")

# Initialize a SemHash instance with the model as a custom encoder
semhash = SemHash.from_records(records=texts, model=model)

# Deduplicate the texts
deduplicated_texts = semhash.self_deduplicate()
```
Any encoder can be used that adheres to our encoder protocol. For example, any sentence-transformers model can be used as an encoder:
```python
from datasets import load_dataset
from semhash import SemHash
from sentence_transformers import SentenceTransformer

# Load a dataset to deduplicate
texts = load_dataset("ag_news", split="train")["text"]

# Load a sentence-transformers model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Initialize a SemHash instance with the model as a custom encoder
semhash = SemHash.from_records(records=texts, model=model)

# Deduplicate the texts
deduplicated_texts = semhash.self_deduplicate()
```
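To illustrate what the encoder protocol roughly looks like, here is a minimal sketch of a hand-rolled encoder. The `HashingEncoder` class is hypothetical, and the sketch assumes the protocol only requires an `encode` method that maps a list of strings to a 2D numpy array (one row per text); check the encoder protocol definition in the semhash source for the exact signature:

```python
import numpy as np
from datasets import load_dataset
from semhash import SemHash


class HashingEncoder:
    """Toy encoder sketch: hashed bag-of-words vectors (illustrative only)."""

    def __init__(self, dim: int = 256) -> None:
        self.dim = dim

    def encode(self, sentences: list[str], **kwargs) -> np.ndarray:
        # One row per sentence, one column per hash bucket
        embeddings = np.zeros((len(sentences), self.dim), dtype=np.float32)
        for i, sentence in enumerate(sentences):
            for token in sentence.lower().split():
                embeddings[i, hash(token) % self.dim] += 1.0
        return embeddings


texts = load_dataset("ag_news", split="train")["text"]

# Plug the custom encoder in the same way as the models above
semhash = SemHash.from_records(records=texts, model=HashingEncoder())
deduplicated_texts = semhash.self_deduplicate().deduplicated
```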
Using Pandas DataFrames
You can easily use Pandas DataFrames with SemHash. The following code snippet shows how to deduplicate a Pandas DataFrame:
```python
import pandas as pd
from datasets import load_dataset
from semhash import SemHash

# Load a dataset as a pandas dataframe
dataframe = load_dataset("ag_news", split="train").to_pandas()

# Convert the dataframe to a list of dictionaries
dataframe = dataframe.to_dict(orient="records")

# Initialize a SemHash instance with the columns to deduplicate
semhash = SemHash.from_records(records=dataframe, columns=["text"])

# Deduplicate the texts
deduplicated_records = semhash.self_deduplicate().deduplicated

# Convert the deduplicated records back to a pandas dataframe
deduplicated_dataframe = pd.DataFrame(deduplicated_records)
```
NOTE: By default, we use the ANN (approximate nearest neighbors) backend for deduplication. We recommend keeping this setting, since recall for smaller datasets is ~100%, and it's needed for larger datasets (>1M samples), which would take too long to deduplicate without ANN. If you want to use the flat/exact-matching backend, you can set `use_ann=False` in the SemHash constructor:

```python
semhash = SemHash.from_records(records=texts, use_ann=False)
```
We've benchmarked SemHash on a variety of datasets to measure the deduplication performance and speed. The benchmarks were run with the following setup:
- The benchmarks were all run on CPU.
- The benchmarks were all run with use_ann=True.
- The encoder used is the default encoder (potion-base-8M).
- The timings include the encoding time, index building time, and deduplication time.
Train deduplication benchmark:

Dataset | Original Train Size | Deduplicated Train Size | % Removed | Deduplication Time (s) |
---|---|---|---|---|
bbc | 1225 | 1144 | 6.61 | 0.57 |
senteval_cr | 3012 | 2990 | 0.73 | 0.14 |
tweet_sentiment_extraction | 27481 | 26695 | 2.86 | 1.77 |
emotion | 16000 | 15695 | 1.91 | 0.77 |
amazon_counterfactual | 5000 | 4992 | 0.16 | 0.33 |
ag_news | 120000 | 106921 | 10.90 | 5.20 |
enron_spam | 31716 | 20540 | 35.24 | 2.03 |
subj | 8000 | 7990 | 0.12 | 0.63 |
sst5 | 8544 | 8526 | 0.21 | 0.58 |
20_newgroups | 11314 | 10684 | 5.57 | 0.73 |
hatespeech_offensive | 22783 | 22090 | 3.04 | 0.92 |
ade | 17637 | 15718 | 10.88 | 0.73 |
imdb | 25000 | 24830 | 0.68 | 1.76 |
massive_scenario | 11514 | 9366 | 18.66 | 0.47 |
student | 117519 | 63856 | 45.66 | 8.80 |
squad_v2 | 130319 | 109698 | 15.82 | 8.81 |
wikitext | 1801350 | 884645 | 50.89 | 83.53 |
Train/test deduplication benchmark:

Dataset | Train Size | Test Size | Deduplicated Test Size | % Removed | Deduplication Time (s) |
---|---|---|---|---|---|
bbc | 1225 | 1000 | 870 | 13.00 | 0.71 |
senteval_cr | 3012 | 753 | 750 | 0.40 | 0.13 |
tweet_sentiment_extraction | 27481 | 3534 | 3412 | 3.45 | 1.53 |
emotion | 16000 | 2000 | 1926 | 3.70 | 0.65 |
amazon_counterfactual | 5000 | 5000 | 4990 | 0.20 | 0.51 |
ag_news | 120000 | 7600 | 6198 | 18.45 | 3.74 |
enron_spam | 31716 | 2000 | 1060 | 47.00 | 1.94 |
subj | 8000 | 2000 | 1999 | 0.05 | 0.62 |
sst5 | 8544 | 2210 | 2205 | 0.23 | 0.59 |
20_newgroups | 11314 | 7532 | 7098 | 5.76 | 2.25 |
hatespeech_offensive | 22783 | 2000 | 1925 | 3.75 | 0.77 |
ade | 17637 | 5879 | 4952 | 15.77 | 0.81 |
imdb | 25000 | 25000 | 24795 | 0.82 | 2.81 |
massive_scenario | 11514 | 2974 | 2190 | 26.36 | 0.46 |
student | 117519 | 5000 | 2393 | 52.14 | 3.78 |
squad_v2 | 130319 | 11873 | 11863 | 0.08 | 7.13 |
wikitext | 1801350 | 4358 | 2139 | 50.92 | 40.32 |
As can be seen, SemHash is extremely fast, and scales to large datasets with millions of records. There are some notable examples of train/test leakage, such as enron_spam and student, where the test dataset contains a significant amount of semantic overlap with the training dataset.
To run the benchmarks yourself, you can use the following command (assuming you have the `datasets` library installed):

```bash
python -m benchmarks.run_benchmarks
```

Optionally, the datasets can be updated in the datasets.py file.
License

MIT
If you use SemHash in your research, please cite the following:
```bibtex
@software{minishlab2025semhash,
  author = {Thomas van Dongen and Stephan Tulkens},
  title = {SemHash: Fast Semantic Text Deduplication},
  year = {2025},
  url = {https://github.com/MinishLab/semhash}
}
```