
Image Search in Large Datasets


With ever more data generated every day, it is important to have efficient ways to search large image datasets for the images you need.

If you have a CPU-only machine and want to search a large dataset using images as queries, this tutorial is for you.

We will walk you through how to use fastdup to search through thousands of images and find images that look similar to your query image.

Installation

In [ ]:
import sys

if "google.colab" in sys.modules:
    # Running in Google Colab
    !pip install --force-reinstall --no-cache-dir numpy==1.26.4 scipy fastdup
else:
    # Running outside Colab
    !pip install -Uq fastdup
In [1]:
import fastdup

fastdup.__version__
/home/dnth/anaconda3/envs/fastdup2021/lib/python3.10/site-packages/requests/__init__.py:109: RequestsDependencyWarning: urllib3 (2.2.1) or chardet (5.2.0)/charset_normalizer (2.0.4) doesn't match a supported version!  warnings.warn(
Out[1]:
'2.0.21'

Shopee Product Match Dataset

In this notebook we will use a dataset from the Shopee Product Match Kaggle competition. In this competition, participants must determine whether two products are the same based on their images.

Head to Kaggle and download the dataset into your local directory. You should have a folder named shopee-product-matching in your current working directory.

With the dataset downloaded, let's randomly pick a few images and preview them.

In [2]:
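# List the JPEG files in the dataset folder (IPython shell capture)
sample_images = !find shopee-product-matching/ -name '*.jpg'

# Build a sprite image from the list and keep the path of the generated file
ret = fastdup.generate_sprite_image(sample_images, 55, ".")[0]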
In [3]:
from IPython.display import Image

Image(filename=ret)
Out[3]:
[Image: sprite sheet previewing the sampled product images]
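The `find` command above lists every JPEG under the folder. If you want explicit, reproducible control over which files are sampled before building a preview, a minimal sketch using only the standard library could look like this. The helper name `sample_image_paths` and its parameters are illustrative and not part of fastdup.

```python
import random
from pathlib import Path


def sample_image_paths(root, n, seed=None, pattern="*.jpg"):
    """Return up to n randomly chosen file paths matching pattern under root."""
    # Sort first so the same seed gives the same sample on every run.
    paths = sorted(str(p) for p in Path(root).rglob(pattern))
    rng = random.Random(seed)
    return rng.sample(paths, min(n, len(paths)))
```

For example, `sample_image_paths("shopee-product-matching", 55, seed=0)` returns a list of paths that can be passed on in place of `sample_images`.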

Run fastdup

Point input_dir to the location where you store the images.

In [4]:
input_dir = "./shopee-product-matching"
work_dir = "./my-fastdup-workdir"

fastdup.run(input_dir, work_dir)
fastdup By Visual Layer, Inc. 2024. All rights reserved.
Initializing data [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] 100% Estimated: 0 Minutes
Out[4]:
0

Restart Runtime

Once the run is complete, you can terminate the session and use the generated artifacts to run an image search.

Let's restart the kernel to simulate a different session.

In [5]:
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)
Out[5]:
{'status': 'ok', 'restart': True}
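After a restart, it can be useful to confirm that the work directory still holds the index before trying to search it. The sketch below checks for the file named `nnf.index`, which is the name that appears in the `init_search` log of this run; other fastdup versions may name it differently, so treat the default as an assumption. The helper `index_exists` is illustrative and not part of fastdup.

```python
from pathlib import Path


def index_exists(work_dir, index_name="nnf.index"):
    """Return True if the nearest-neighbor index file is present in work_dir."""
    # index_name is taken from the log output of this run (an assumption).
    return (Path(work_dir) / index_name).is_file()
```

Calling `index_exists("./my-fastdup-workdir")` should return `True` once the run above has finished writing its artifacts.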

Initialize Search Parameters

To start searching, we must first initialize the search parameters.

The first positional argument is k, the number of nearest neighbors to search for.

In this case we want to search for the 10 nearest neighbors. Feel free to experiment with your own value of k.

In [2]:
import fastdup

work_dir = "./my-fastdup-workdir"
fastdup.init_search(10, work_dir=work_dir)
2024-06-12 22:04:36 [INFO] 38) Finished load_index() NN model, num_images 32415
2024-06-12 22:04:36 [INFO] Read nnf index file from ./my-fastdup-workdir/nnf.index
2024-06-12 22:04:36 [INFO] Read NNF index with 32415 images
init_search() initialized OK.
Out[2]:
0
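To make the role of k concrete, here is a brute-force sketch of a k-nearest-neighbor lookup over a toy set of feature vectors. It is not how fastdup works internally (fastdup loads a prebuilt index, as the log shows), and the use of cosine similarity as the score is an assumption made for illustration. The vectors, file names, and helper functions are all hypothetical.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, index, k):
    """Return the k (name, similarity) pairs most similar to query, best first."""
    scored = ((name, cosine_similarity(query, vec)) for name, vec in index.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy 2-D "embeddings"; real image features have many more dimensions.
toy_index = {
    "a.jpg": [1.0, 0.0],
    "b.jpg": [0.8, 0.6],
    "c.jpg": [0.0, 1.0],
}
neighbors = top_k([1.0, 0.0], toy_index, k=2)
```

With k=2 only the two best-scoring entries are returned. Asking for more neighbors than the index contains simply returns every entry.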

Search with a Query Image

Let's pick a query image and find out if there are matches in the Shopee dataset.

In [4]:
from IPython.display import Image

Image(filename="shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg")
Out[4]:
[Image: the query image]

Specify the query image filename and search for similar images in the images directory.

In [5]:
df = fastdup.search("shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg")
0 False
2024-06-12 22:05:21 [INFO] Total time took 63 ms
2024-06-12 22:05:21 [INFO] Found a total of 1 fully identical images (d>0.990), which are 0.00 % of total graph edges
2024-06-12 22:05:21 [INFO] Found a total of 0 nearly identical images(d>0.980), which are 0.00 % of total graph edges
2024-06-12 22:05:21 [INFO] Found a total of 10 above threshold images (d>0.700), which are 0.00 % of total graph edges
2024-06-12 22:05:21 [INFO] Found a total of 1 outlier images         (d<0.000), which are 0.00 % of total graph edges
2024-06-12 22:05:21 [INFO] Min similarity found 0.822 max similarity 1.000

Inspect the search result.

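The distance value indicates how similar your query image is to the other image. Despite its name, it behaves like a similarity score (the search log reports the same values as "similarity"), so higher means more alike.

A distance of 1.0 indicates the images are exact duplicates. The lower the value, the less similar the images are.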

In [6]:
df
Out[6]:
   from                                                                      to                                                                         distance
0  shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg  shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg   1.000000
1  shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg  shopee-product-matching/train_images/b1b0ef712ae90ecc8d1ec7bc5d11485a.jpg  0.842375
2  shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg  shopee-product-matching/train_images/182ef6021d6b2118fb9915156cff50e6.jpg  0.841552
3  shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg  shopee-product-matching/train_images/5235cbbdfd70272503647694730424c4.jpg  0.836889
4  shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg  shopee-product-matching/train_images/4cd0ef616259eac109212b2f2e5f7136.jpg  0.830368
5  shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg  shopee-product-matching/train_images/4851da5e4b570ab7147566c85b3fabc2.jpg  0.829968
6  shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg  shopee-product-matching/train_images/c29d3d0821e9e3b0188c005fd95bf424.jpg  0.828526
7  shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg  shopee-product-matching/train_images/086b2dcda1059ba3fd0365a42277b743.jpg  0.828210
8  shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg  shopee-product-matching/train_images/60abf69848da6bc126f31c880a6372ca.jpg  0.825624
9  shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg  shopee-product-matching/train_images/0ae01a272a94a019759bc2a3b4813ee2.jpg  0.822049
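The first row is the query matching itself, and the remaining rows differ only in score. A small post-processing sketch can drop the self-match, keep rows above a chosen score, and attach a coarse label. The cut-offs 0.99, 0.98, and 0.70 are the ones printed in the search log above; the function names and the list-of-dicts input are assumptions made so the example runs without fastdup or pandas.

```python
def label_match(score, identical=0.99, near_identical=0.98, similar=0.70):
    """Map a score to a coarse label using the cut-offs printed in the search log."""
    if score > identical:
        return "identical"
    if score > near_identical:
        return "near-identical"
    if score > similar:
        return "similar"
    return "below threshold"


def summarize_matches(records, min_score=0.0):
    """Drop self-matches, keep rows scoring at least min_score, and add a label."""
    kept = []
    for row in records:
        if row["from"] == row["to"]:
            continue  # the query matched itself
        if row["distance"] < min_score:
            continue
        kept.append({**row, "label": label_match(row["distance"])})
    return sorted(kept, key=lambda r: r["distance"], reverse=True)


# Three rows copied from the result table above.
query = "shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg"
records = [
    {"from": query, "to": query, "distance": 1.000000},
    {"from": query,
     "to": "shopee-product-matching/train_images/b1b0ef712ae90ecc8d1ec7bc5d11485a.jpg",
     "distance": 0.842375},
    {"from": query,
     "to": "shopee-product-matching/train_images/0ae01a272a94a019759bc2a3b4813ee2.jpg",
     "distance": 0.822049},
]
matches = summarize_matches(records, min_score=0.83)
```

If the search result is a pandas DataFrame, as the output above suggests, `df.to_dict("records")` produces a list in this shape.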

You can repeat the search as many times as you wish as long as the model is loaded in memory.

Let's try to search using another query image.

In [7]:
Image(filename="shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg")
Out[7]:
[Image: the second query image]
In [8]:
df2 = fastdup.search("shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg")
0 False
2024-06-12 22:05:47 [INFO] Total time took 20 ms
2024-06-12 22:05:47 [INFO] Found a total of 1 fully identical images (d>0.990), which are 0.00 % of total graph edges
2024-06-12 22:05:47 [INFO] Found a total of 0 nearly identical images(d>0.980), which are 0.00 % of total graph edges
2024-06-12 22:05:47 [INFO] Found a total of 10 above threshold images (d>0.700), which are 0.00 % of total graph edges
2024-06-12 22:05:47 [INFO] Found a total of 1 outlier images         (d<0.000), which are 0.00 % of total graph edges
2024-06-12 22:05:47 [INFO] Min similarity found 0.751 max similarity 0.999
In [9]:
df2
Out[9]:
   from                                                                      to                                                                         distance
0  shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg  shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg   0.998664
1  shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg  shopee-product-matching/train_images/0728167ad63d954828ead460c34a18f1.jpg  0.787696
2  shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg  shopee-product-matching/train_images/4babdac92bac8ed1b489e08f4b753772.jpg  0.786759
3  shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg  shopee-product-matching/train_images/93171da404992e3dae9607f4fcdab48c.jpg  0.786458
4  shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg  shopee-product-matching/train_images/91307a90a0307d3766feb3df475b9e61.jpg  0.786357
5  shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg  shopee-product-matching/train_images/db87686fb0a0f181074478dc688146cd.jpg  0.770760
6  shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg  shopee-product-matching/train_images/3df1c072d8ff94fe0f6309be4ba8b6e7.jpg  0.767405
7  shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg  shopee-product-matching/train_images/7b298b7675c5e3e313d2bd0c8ea9a9f9.jpg  0.767389
8  shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg  shopee-product-matching/train_images/2165534b91330cf04b567ef8d51ba17a.jpg  0.757656
9  shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg  shopee-product-matching/train_images/46d63b8d474480dae4f519c00c3119d0.jpg  0.751443
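Because the index stays loaded between calls, several query images can be searched in a loop. The sketch below takes the search function as a parameter so it runs without an index; the helper `search_many` and the stand-in `fake_search` are hypothetical and not part of fastdup. It also records the wall-clock time per query, similar in spirit to the "Total time took" line in the log.

```python
import time


def search_many(query_paths, search_fn):
    """Run search_fn on every query path; return {path: (result, elapsed_ms)}."""
    results = {}
    for path in query_paths:
        start = time.perf_counter()
        result = search_fn(path)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        results[path] = (result, elapsed_ms)
    return results


def fake_search(path):
    """Stand-in for a real search call so the sketch runs without an index."""
    return [{"from": path, "to": path, "distance": 1.0}]


timed = search_many(["a.jpg", "b.jpg"], fake_search)
```

In the notebook, passing `fastdup.search` as `search_fn` after `init_search` has succeeded should give one result per query image, assuming each call behaves like the single-query calls shown above.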

Visualize Results

This step is optional. fastdup provides a convenient way to visualize your search results for duplicate and similar looking images.

In [10]:
fastdup.create_duplicates_gallery(df, work_dir, input_dir="./shopee-product-matching")
Generating gallery:   0%|          | 0/10 [00:00<?, ?it/s]
Stored similarity visual view in  ./my-fastdup-workdir/duplicates.html
Out[10]:
0
In [12]:
from IPython.display import HTML

HTML(filename="./my-fastdup-workdir/duplicates.html")
Out[12]:
[Rendered Duplicates Report: an HTML gallery showing, for each of the 10 rows in df, the query image next to the matched image together with its Distance, From, and To values, as listed in Out[6].]
In [14]:
fastdup.create_similarity_gallery(df, work_dir, input_dir="./shopee-product-matching", min_items=3)
HTML(filename="./my-fastdup-workdir/similarity.html")
Warning: you are running create_similarity_gallery() without providing get_label_func so similarities are not computed between different classes. It is recommended to run this report with labels. Without labels this report output is similar to create_duplicate_gallery()
Generating gallery:   0%|          | 0/1 [00:00<?, ?it/s]
Stored similar images visual view in  ./my-fastdup-workdir/similarity.html
Out[14]:
[Rendered Similarity Report: the query image shown next to its 10 matches from df, each listed with its distance value, from 1.0 down to 0.822049.]

Feel free to repeat the search using other images and visualize them.

Interactive Exploration

In addition to the static visualizations presented above, fastdup also offers interactive exploration of the dataset.

To explore the dataset and issues interactively in a browser, run:

In [ ]:
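# `fd` is not defined earlier in this notebook. It is assumed to be a fastdup
# object, for example one returned by fastdup.create(work_dir=..., input_dir=...).
fd.explore()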

🗒 Note: This currently requires you to sign up (for free) to view the interactive exploration. Alternatively, you can visualize fastdup results in a non-interactive way using fastdup's built-in galleries shown in the previous cells.

You'll be presented with a web interface that lets you conveniently view, filter, and curate your dataset.

Wrap up

Congratulations! You've made it to the end of the tutorial!

Image similarity search is an incredibly powerful tool to have in your arsenal as a machine learning practitioner.

For example, if your model is not performing well on a particular category of images, you could use image search to find more examples of that category and add them to your training data.

Next, feel free to check out other tutorials:

  • Quickstart: Learn how to install fastdup, load a dataset and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, dark/bright/blurry images, and view visually similar image clusters. If you're new, start here!
  • 🧹Clean Image Folder: Learn how to analyze and clean a folder of images from potential issues and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start.
  • 🖼Analyze Image Classification Dataset: Learn how to load a labeled image classification dataset and analyze for potential issues. If you have labeled ImageNet-style folder structure, have a go!
  • 🎁Analyze Object Detection Dataset: Learn how to load bounding box annotations for object detection and analyze for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try.
Copyright © 2024 Visual Layer. All rights reserved.