fastdup is a powerful, free tool designed to rapidly generate valuable insights from image and video datasets. It helps enhance the quality of both images and labels, while significantly reducing data operation costs, all with unmatched scalability.


A powerful open-source tool for analyzing image and video datasets, founded by the authors of XGBoost, Apache TVM & Turi Create: Danny Bickson, Carlos Guestrin and Amir Alush.

Documentation · Features · Report Bug · Blog · Quickstart · Visual Layer Cloud

Getting Started

Install fastdup from PyPI:

pip install fastdup

More installation options are available here.

Initialize and run fastdup:

import fastdup

fd = fastdup.create(input_dir="IMAGE_FOLDER/")
fd.run()
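The two calls above can be wrapped in a small helper that checks the image folder exists before starting a run. This is a minimal sketch, not part of fastdup: the helper name `run_fastdup` is hypothetical, and only `fastdup.create(input_dir=...)` and `fd.run()` come from this README.

```python
from pathlib import Path


def run_fastdup(input_dir):
    """Validate the image folder, then create and run a fastdup analysis.

    Hypothetical convenience wrapper; only create() and run() are fastdup calls.
    """
    image_dir = Path(input_dir)
    if not image_dir.is_dir():
        # Fail early with a clear message instead of starting a run on a bad path.
        raise NotADirectoryError(f"Image folder not found: {image_dir}")

    # Imported lazily so the path check works even before fastdup is installed.
    import fastdup

    fd = fastdup.create(input_dir=str(image_dir))
    fd.run()
    return fd
```

Calling `run_fastdup("IMAGE_FOLDER/")` returns the same `fd` object used in the rest of this section.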

Explore the results in an interactive web UI:

fd.explore()

Alternatively, visualize the result in a static gallery:

fd.vis.duplicates_gallery()    # gallery of duplicates
fd.vis.outliers_gallery()      # gallery of outliers
fd.vis.component_gallery()     # gallery of connected components
fd.vis.stats_gallery()         # gallery of image statistics (e.g. blur, brightness, etc.)
fd.vis.similarity_gallery()    # gallery of similar images
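To build several galleries in one pass, the calls can be driven from a list of method names. The sketch below is illustrative only: `build_galleries` and `GALLERY_METHODS` are hypothetical names, and the code assumes each gallery method can be called with no arguments, as shown in this README.

```python
# Gallery method names exactly as they appear in this README.
GALLERY_METHODS = (
    "duplicates_gallery",
    "outliers_gallery",
    "component_gallery",
    "stats_gallery",
    "similarity_gallery",
)


def build_galleries(fd, names=GALLERY_METHODS):
    """Call each named gallery method on fd.vis and return the names built.

    Hypothetical helper: `fd` is expected to be an object on which run()
    has already completed.
    """
    built = []
    for name in names:
        # Look the method up by name, then call it without arguments.
        getattr(fd.vis, name)()
        built.append(name)
    return built
```

Passing a shorter list, for example `build_galleries(fd, ["outliers_gallery"])`, builds only the galleries you need.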

Check this quickstart tutorial for more info.

Features & Advantages

fastdup handles labeled/unlabeled datasets in image or video format, providing a range of features:

What sets fastdup apart from other similar tools:

Quality: High-quality analysis to identify duplicates/near-duplicates, outliers, mislabels, broken images, and low-quality images.
Scale: Highly scalable, capable of processing 400M images on a single CPU machine. Scales up to billions of images.
Speed: Optimized C++ engine enables high performance even on low-resource CPU machines.
Privacy: Runs locally or on your cloud infrastructure. Your data stays where it is.
Ease of use: Works on labeled or unlabeled datasets in image or video format, with support for major operating systems such as macOS, Linux and Windows.

Learn from Examples

Learn the basics of fastdup through interactive examples. View the notebooks on GitHub or nbviewer. Even better, run them on Google Colab or Kaggle, for free.

	⚡ Quickstart: Learn how to install fastdup, load a dataset and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, dark/bright/blurry images, and view visually similar image clusters. If you're new, start here! 📌 Dataset: Oxford-IIIT Pet.

	🧹 Finding and Removing Duplicates: Learn how to analyze an image dataset for duplicates and near-duplicates. 📌 Dataset: Oxford-IIIT Pet.

	🖼 Finding and Removing Mislabels: Learn how to analyze an image dataset for potential image mislabels and export the list of mislabeled images for further inspection. 📌 Dataset: Food-101.

	🎁 Image Similarity Search: Perform image search in a large dataset of images. 📌 Dataset: Shopee Product Matching.

	🤗 Hugging Face Datasets: Load and analyze datasets from Hugging Face Datasets. Perfect if you already have a dataset hosted on the Hugging Face hub.

	🧠 TIMM Embeddings: Compute dataset embeddings using TIMM (PyTorch Image Models) and run fastdup over them to surface dataset issues. Runs on CPU and GPU.

	🦖 ONNX Embeddings: Bring your own ONNX model. This example extracts feature vectors from your images using the DINOv2 model. Runs on CPU.

See more examples.

Join the Community

Get help from the fastdup team or community members via the following channels:

Community-contributed blog posts on fastdup:

	Deploying AWS Lambda functions with Docker Container by using Custom Base Image 🖋️ atahan bulus • 🗓 16 September 2023
	Renumics: Cleaning Image Classification Datasets With fastdup and Renumics Spotlight 🖋️ Daniel Klitzke • 🗓 4 September 2023
	Roboflow: How to Reduce Dataset Size Without Losing Accuracy 🖋️ Arty Ariuntuya • 🗓 9 August 2023
	The weighty significance of data cleanliness — or as I like to call it, “cleanliness is next to model-ness” — cannot be overstated. 🖋️ Alexander Lan • 🗓 9 March 2023
	Clean Up Your Digital Life: How I Found 1929 Fully Identical Images, Dark, Bright and Blurry Shots in Minutes, For Free. 🖋️ Dickson Neoh • 🗓 23 February 2023
	fastdup: A Powerful Tool to Manage, Clean & Curate Visual Data at Scale on Your CPU - For Free. 🖋️ Dickson Neoh • 🗓 3 January 2023
	Master Data Integrity to Clean Your Computer Vision Datasets. 🖋️ Paul lusztin • 🗓 19 December 2022

Visual Layer Cloud

Visual Layer offers commercial services for managing, cleaning, and curating visual data at scale.

Sign up for free.

Not convinced? Interact with a Visual Layer Cloud public dataset, with no sign-up required.

Disclaimer

Usage Tracking

We have added an experimental crash report collection using Sentry.

We DO NOT collect user-specific information such as folder names, user names, image names, or image content. We do collect data related to fastdup's internal operations and performance statistics, such as total number of images, average runtime per image, total free memory, total free disk space, and number of cores.

This helps us identify and resolve stability issues, improving the overall reliability of fastdup. The code for the data collection is found here. On macOS we use Google Crashpad to report crashes.

Users can opt out of the experimental crash reporting system in either of the following ways:

Define an environment variable called SENTRY_OPT_OUT
or call run() with turi_param='run_sentry=0'
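From Python, the environment-variable opt-out could look like the sketch below. This README only says the variable must be defined, so the value "1" is an assumption, as is setting it before fastdup is imported. The helper name `opt_out_of_crash_reports` is hypothetical.

```python
import os


def opt_out_of_crash_reports(environ=os.environ):
    """Define SENTRY_OPT_OUT if it is not already set and return its value.

    The value "1" is an assumption; the README only requires that the
    variable be defined.
    """
    return environ.setdefault("SENTRY_OPT_OUT", "1")


# Safest ordering (an assumption): opt out first, then import and run fastdup.
opt_out_of_crash_reports()
```

The second method needs no environment change: pass the parameter from this README directly, as in `fd.run(turi_param='run_sentry=0')`.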

License

fastdup is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License.

For more information or inquiries regarding the license, please contact us at info@visual-layer.com or see the LICENSE file.
