Image Search in Large Datasets¶
With the ever-increasing amount of data generated every day, it's important to have efficient ways to search through large image datasets and find the images you need.
If you have a CPU-only machine and want to search a large dataset using images as queries, this tutorial is for you.
We will walk you through how to use fastdup to search through thousands of images and find images that look similar to your query image.
Installation¶
import sys

if "google.colab" in sys.modules:
    # Running in Google Colab
    !pip install --force-reinstall --no-cache-dir numpy==1.26.4 scipy fastdup
else:
    # Running outside Colab
    !pip install -Uq fastdup
import fastdup
fastdup.__version__
/home/dnth/anaconda3/envs/fastdup2021/lib/python3.10/site-packages/requests/__init__.py:109: RequestsDependencyWarning: urllib3 (2.2.1) or chardet (5.2.0)/charset_normalizer (2.0.4) doesn't match a supported version! warnings.warn(
'2.0.21'
Shopee Product Match Dataset¶
In this notebook we will use a dataset from the Shopee Product Match Kaggle competition. In this competition, participants must determine whether two products are the same based on their images.
Head to Kaggle and download the dataset into your local directory. You should have a folder named shopee-product-matching in your current working directory.
With the dataset downloaded, let's randomly pick a few images and preview them.
sample_images = !find shopee-product-matching/ -name '*.jpg'
ret = fastdup.generate_sprite_image(sample_images, 55, ".")[0]

from IPython.display import Image
Image(filename=ret)
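The cell above relies on the shell `find` command, which lists every matching file rather than a random subset and is only available through IPython's `!` syntax. As a minimal sketch of a portable alternative, the helper below draws a random sample using only the standard library. The function name `sample_image_paths` and its `seed` parameter are illustrative assumptions, not part of fastdup.

```python
import random
from pathlib import Path


def sample_image_paths(root, n, seed=None, pattern="*.jpg"):
    """Return up to n randomly chosen file paths matching pattern under root."""
    # Sort first so that a fixed seed gives a reproducible sample.
    paths = sorted(str(p) for p in Path(root).rglob(pattern))
    rng = random.Random(seed)
    return rng.sample(paths, min(n, len(paths)))


# Hypothetical usage with the dataset folder from this tutorial:
# sample_images = sample_image_paths("shopee-product-matching", 55, seed=0)
```

The resulting list of paths could then be passed to `fastdup.generate_sprite_image` in place of the output of `find`.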
Run fastdup¶
Point input_dir to the location where you store the images.
input_dir = "./shopee-product-matching"
work_dir = "./my-fastdup-workdir"
fastdup.run(input_dir, work_dir)
fastdup By Visual Layer, Inc. 2024. All rights reserved.
Initializing data [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] 100% Estimated: 0 Minutes
0
Restart Runtime¶
Once the run is complete, you can terminate the session and use the generated artifacts to run an image search.
Let's restart the kernel to simulate a different session.
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)
{'status': 'ok', 'restart': True}
Initialize Search Parameters¶
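Because the search runs in a fresh session, it can help to confirm that the index written by the earlier run is still on disk before loading it. The sketch below assumes the index file is named `nnf.index` inside the work directory, which is the path reported in the `init_search` log of this notebook; the helper name `index_is_ready` is hypothetical.

```python
from pathlib import Path


def index_is_ready(work_dir, index_name="nnf.index"):
    """Return True if the index file from a previous run exists and is non-empty."""
    index_path = Path(work_dir) / index_name
    return index_path.is_file() and index_path.stat().st_size > 0


# Hypothetical usage before initializing the search:
# if not index_is_ready("./my-fastdup-workdir"):
#     raise FileNotFoundError("Run fastdup on the dataset first to build the index.")
```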
To start searching we must first initialize the search parameters.
The first positional argument is k, the number of nearest neighbors to search for.
In this case we want to search for the 10 nearest neighbors. Feel free to experiment with your own value of k.
import fastdup

work_dir = "./my-fastdup-workdir"
fastdup.init_search(10, work_dir=work_dir)
2024-06-12 22:04:36 [INFO] 38) Finished load_index() NN model, num_images 32415
2024-06-12 22:04:36 [INFO] Read nnf index file from ./my-fastdup-workdir/nnf.index
2024-06-12 22:04:36 [INFO] Read NNF index with 32415 images
init_search() initialized OK.
0
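To make the role of k concrete, here is a toy brute-force nearest-neighbor search over a handful of made-up feature vectors. It is only a sketch of the general idea: it assumes cosine similarity as the score, and the vectors, file names, and function names are invented for illustration. fastdup's own index and scoring are not shown here and may work differently.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, vectors, k):
    """Return the k (name, similarity) pairs most similar to query, best first."""
    scored = ((name, cosine_similarity(query, vec)) for name, vec in vectors.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Made-up 2-D "embeddings"; real image embeddings have many more dimensions.
toy_index = {
    "a.jpg": [1.0, 0.0],
    "b.jpg": [0.9, 0.1],
    "c.jpg": [0.0, 1.0],
}
print(top_k([1.0, 0.0], toy_index, k=2))
```

With k=2 only the two closest entries are returned, so a larger k widens the result list at the cost of including less similar images.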
Search with a Query Image¶
Let's use our own image and find out whether there are matches in the Shopee dataset.
from IPython.display import Image
Image(filename="shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg")
Specify the query image filename and search for similar images in the images directory.
df = fastdup.search("shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg")
0 False
2024-06-12 22:05:21 [INFO] Total time took 63 ms
2024-06-12 22:05:21 [INFO] Found a total of 1 fully identical images (d>0.990), which are 0.00 % of total graph edges
2024-06-12 22:05:21 [INFO] Found a total of 0 nearly identical images(d>0.980), which are 0.00 % of total graph edges
2024-06-12 22:05:21 [INFO] Found a total of 10 above threshold images (d>0.700), which are 0.00 % of total graph edges
2024-06-12 22:05:21 [INFO] Found a total of 1 outlier images (d<0.000), which are 0.00 % of total graph edges
2024-06-12 22:05:21 [INFO] Min similarity found 0.822 max similarity 1.000
Inspect the search result.
The distance value indicates how similar your query image is to the other image. Despite its name, it behaves like a similarity score: higher means more similar.
A distance of 1.0 indicates the images are exact duplicates. The lower the value, the less similar the images are.
df

| | from | to | distance |
|---|---|---|---|
| 0 | shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg | shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg | 1.000000 |
| 1 | shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg | shopee-product-matching/train_images/b1b0ef712ae90ecc8d1ec7bc5d11485a.jpg | 0.842375 |
| 2 | shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg | shopee-product-matching/train_images/182ef6021d6b2118fb9915156cff50e6.jpg | 0.841552 |
| 3 | shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg | shopee-product-matching/train_images/5235cbbdfd70272503647694730424c4.jpg | 0.836889 |
| 4 | shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg | shopee-product-matching/train_images/4cd0ef616259eac109212b2f2e5f7136.jpg | 0.830368 |
| 5 | shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg | shopee-product-matching/train_images/4851da5e4b570ab7147566c85b3fabc2.jpg | 0.829968 |
| 6 | shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg | shopee-product-matching/train_images/c29d3d0821e9e3b0188c005fd95bf424.jpg | 0.828526 |
| 7 | shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg | shopee-product-matching/train_images/086b2dcda1059ba3fd0365a42277b743.jpg | 0.828210 |
| 8 | shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg | shopee-product-matching/train_images/60abf69848da6bc126f31c880a6372ca.jpg | 0.825624 |
| 9 | shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg | shopee-product-matching/train_images/0ae01a272a94a019759bc2a3b4813ee2.jpg | 0.822049 |
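The first row of the result is the query image matching itself, and the remaining rows vary in how close they are. A small post-processing step can drop the self-match and keep only rows above a chosen score. The sketch below works on a list of dictionaries with the `from`, `to`, and `distance` keys shown in the table; it assumes the result is a pandas DataFrame so that `df.to_dict("records")` produces that list. The function name, the 0.83 cut-off, and the short file names are illustrative only.

```python
def filter_matches(records, min_similarity=0.83, drop_self=True):
    """Keep rows whose 'distance' meets the threshold, optionally dropping self-matches."""
    kept = []
    for row in records:
        if drop_self and row["from"] == row["to"]:
            continue  # the query image matched itself
        if row["distance"] >= min_similarity:
            kept.append(row)
    return kept


# Placeholder rows with hypothetical file names; with a real result one could use
# records = df.to_dict("records")
records = [
    {"from": "q.jpg", "to": "q.jpg", "distance": 1.000000},
    {"from": "q.jpg", "to": "a.jpg", "distance": 0.842375},
    {"from": "q.jpg", "to": "b.jpg", "distance": 0.822049},
]
print([row["to"] for row in filter_matches(records)])  # prints ['a.jpg']
```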
You can repeat the search as many times as you wish as long as the model is loaded in memory.
Let's try to search using another query image.
Image(filename="shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg")
df2 = fastdup.search("shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg")
0 False
2024-06-12 22:05:47 [INFO] Total time took 20 ms
2024-06-12 22:05:47 [INFO] Found a total of 1 fully identical images (d>0.990), which are 0.00 % of total graph edges
2024-06-12 22:05:47 [INFO] Found a total of 0 nearly identical images(d>0.980), which are 0.00 % of total graph edges
2024-06-12 22:05:47 [INFO] Found a total of 10 above threshold images (d>0.700), which are 0.00 % of total graph edges
2024-06-12 22:05:47 [INFO] Found a total of 1 outlier images (d<0.000), which are 0.00 % of total graph edges
2024-06-12 22:05:47 [INFO] Min similarity found 0.751 max similarity 0.999
df2

| | from | to | distance |
|---|---|---|---|
| 0 | shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg | shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg | 0.998664 |
| 1 | shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg | shopee-product-matching/train_images/0728167ad63d954828ead460c34a18f1.jpg | 0.787696 |
| 2 | shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg | shopee-product-matching/train_images/4babdac92bac8ed1b489e08f4b753772.jpg | 0.786759 |
| 3 | shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg | shopee-product-matching/train_images/93171da404992e3dae9607f4fcdab48c.jpg | 0.786458 |
| 4 | shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg | shopee-product-matching/train_images/91307a90a0307d3766feb3df475b9e61.jpg | 0.786357 |
| 5 | shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg | shopee-product-matching/train_images/db87686fb0a0f181074478dc688146cd.jpg | 0.770760 |
| 6 | shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg | shopee-product-matching/train_images/3df1c072d8ff94fe0f6309be4ba8b6e7.jpg | 0.767405 |
| 7 | shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg | shopee-product-matching/train_images/7b298b7675c5e3e313d2bd0c8ea9a9f9.jpg | 0.767389 |
| 8 | shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg | shopee-product-matching/train_images/2165534b91330cf04b567ef8d51ba17a.jpg | 0.757656 |
| 9 | shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg | shopee-product-matching/train_images/46d63b8d474480dae4f519c00c3119d0.jpg | 0.751443 |
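The log lines printed by each search mention three cut-offs: d>0.990 for fully identical images, d>0.980 for nearly identical images, and d>0.700 for images above the threshold. As a rough sketch, the same cut-offs can be applied to individual rows to label each match. The function name and the label strings are assumptions made for illustration, and the cut-offs are simply copied from the log output above.

```python
def similarity_bucket(distance):
    """Label a score using the cut-offs that appear in the search log output."""
    if distance > 0.99:
        return "fully identical"
    if distance > 0.98:
        return "nearly identical"
    if distance > 0.70:
        return "above threshold"
    return "below threshold"


# Scores taken from the second search result above.
for score in (0.998664, 0.787696, 0.751443):
    print(score, similarity_bucket(score))
```

For the second query, this labels the self-match (0.998664) as fully identical and every other row as above threshold, which agrees with the counts reported in the log.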
Visualize Results¶
This step is optional. fastdup provides a convenient way to visualize your search results for duplicate and similar looking images.
fastdup.create_duplicates_gallery(df, work_dir, input_dir="./shopee-product-matching")
Generating gallery: 0%| | 0/10 [00:00<?, ?it/s]
Stored similarity visual view in ./my-fastdup-workdir/duplicates.html
0
from IPython.display import HTML
HTML(filename="./my-fastdup-workdir/duplicates.html")
Duplicates Report
| Distance | From | To |
|---|---|---|
| 1.0 | shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg | shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg |
| 0.842375 | shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg | shopee-product-matching/train_images/b1b0ef712ae90ecc8d1ec7bc5d11485a.jpg |
| 0.841552 | shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg | shopee-product-matching/train_images/182ef6021d6b2118fb9915156cff50e6.jpg |
| 0.836889 | shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg | shopee-product-matching/train_images/5235cbbdfd70272503647694730424c4.jpg |
| 0.830368 | shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg | shopee-product-matching/train_images/4cd0ef616259eac109212b2f2e5f7136.jpg |
| 0.829968 | shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg | shopee-product-matching/train_images/4851da5e4b570ab7147566c85b3fabc2.jpg |
| 0.828526 | shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg | shopee-product-matching/train_images/c29d3d0821e9e3b0188c005fd95bf424.jpg |
| 0.82821 | shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg | shopee-product-matching/train_images/086b2dcda1059ba3fd0365a42277b743.jpg |
| 0.825624 | shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg | shopee-product-matching/train_images/60abf69848da6bc126f31c880a6372ca.jpg |
| 0.822049 | shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg | shopee-product-matching/train_images/0ae01a272a94a019759bc2a3b4813ee2.jpg |
fastdup.create_similarity_gallery(df, work_dir, input_dir="./shopee-product-matching", min_items=3)
HTML(filename="./my-fastdup-workdir/similarity.html")
Warning: you are running create_similarity_gallery() without providing get_label_func so similarities are not computed between different classes. It is recommended to run this report with labels. Without labels this report output is similar to create_duplicate_gallery()
Generating gallery: 0%| | 0/1 [00:00<?, ?it/s]
Stored similar images visual view in ./my-fastdup-workdir/similarity.html
Similarity Report
Query image: shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg

| Distance | Similar image |
|---|---|
| 1.0 | shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg |
| 0.842375 | shopee-product-matching/train_images/b1b0ef712ae90ecc8d1ec7bc5d11485a.jpg |
| 0.841552 | shopee-product-matching/train_images/182ef6021d6b2118fb9915156cff50e6.jpg |
| 0.836889 | shopee-product-matching/train_images/5235cbbdfd70272503647694730424c4.jpg |
| 0.830368 | shopee-product-matching/train_images/4cd0ef616259eac109212b2f2e5f7136.jpg |
| 0.829968 | shopee-product-matching/train_images/4851da5e4b570ab7147566c85b3fabc2.jpg |
| 0.828526 | shopee-product-matching/train_images/c29d3d0821e9e3b0188c005fd95bf424.jpg |
| 0.82821 | shopee-product-matching/train_images/086b2dcda1059ba3fd0365a42277b743.jpg |
| 0.825624 | shopee-product-matching/train_images/60abf69848da6bc126f31c880a6372ca.jpg |
| 0.822049 | shopee-product-matching/train_images/0ae01a272a94a019759bc2a3b4813ee2.jpg |
Feel free to repeat the search using other images and visualize them.
Interactive Exploration¶
In addition to the static visualizations presented above, fastdup also offers interactive exploration of the dataset.
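To explore the dataset and its issues interactively in a browser, call explore() on a fastdup object. The name fd is not defined earlier in this tutorial; it is assumed to be such an object, for example one created with fastdup.create(input_dir=input_dir, work_dir=work_dir):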
fd.explore()
🗒 Note - This currently requires you to sign up (for free) to view the interactive exploration. Alternatively, you can visualize fastdup results in a non-interactive way using fastdup's built-in galleries shown in the cells above.
You'll be presented with a web interface that lets you conveniently view, filter, and curate your dataset.

Wrap up¶
Congratulations! You've made it to the end of the tutorial!
Image similarity search is an incredibly powerful tool to have in your arsenal as a machine learning practitioner.
For example, if your model is not performing well on a particular category of images, you could use image search to find more examples of that category and add them to your training data.
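As a sketch of that workflow, the results of several searches (one per example image from the weak category) could be merged into a single de-duplicated candidate list. The code below assumes each result has been converted to a list of dictionaries with `from`, `to`, and `distance` keys, as in the tables earlier in this tutorial. The function name, the 0.8 cut-off, and the short file names are hypothetical.

```python
def collect_candidates(result_sets, min_similarity=0.8):
    """Merge several search results into one list of unique matched paths, best score first."""
    best_score = {}
    for records in result_sets:
        for row in records:
            if row["from"] == row["to"] or row["distance"] < min_similarity:
                continue  # skip self-matches and weak matches
            previous = best_score.get(row["to"], 0.0)
            best_score[row["to"]] = max(previous, row["distance"])
    return sorted(best_score, key=best_score.get, reverse=True)


# Placeholder results for two query images with hypothetical file names.
results_q1 = [
    {"from": "q1.jpg", "to": "q1.jpg", "distance": 1.0},
    {"from": "q1.jpg", "to": "x.jpg", "distance": 0.84},
]
results_q2 = [
    {"from": "q2.jpg", "to": "x.jpg", "distance": 0.90},
    {"from": "q2.jpg", "to": "y.jpg", "distance": 0.75},
]
print(collect_candidates([results_q1, results_q2]))  # prints ['x.jpg']
```

The candidates would still need a manual review before being added to the training data, since visual similarity does not guarantee that an image belongs to the same category.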
Next, feel free to check out the other tutorials:
- ⚡ Quickstart: Learn how to install fastdup, load a dataset, and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, and dark/bright/blurry images, and view visually similar image clusters. If you're new, start here!
- 🧹 Clean Image Folder: Learn how to analyze a folder of images, clean it of potential issues, and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start.
- 🖼 Analyze Image Classification Dataset: Learn how to load a labeled image classification dataset and analyze it for potential issues. If you have a labeled ImageNet-style folder structure, have a go!
- 🎁 Analyze Object Detection Dataset: Learn how to load bounding box annotations for object detection and analyze them for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try.








