
FiftyOne

FiftyOne is an open source toolkit that enables users to curate better data and build better models. It includes tools for data exploration, visualization, and management, as well as features for collaboration and sharing.

Any developers, data scientists, and researchers who work with computer vision and machine learning can use FiftyOne to improve the quality of their datasets and deliver insights about their models.


FiftyOne provides an API to create LanceDB tables and run similarity queries, both programmatically in Python and via point-and-click in the App.

Let's get started and see how to use LanceDB to create a similarity index on your FiftyOne datasets.

Overview

Embeddings are foundational to all of the vector search features. In FiftyOne, embeddings are managed by the FiftyOne Brain, which provides powerful machine learning techniques designed to transform how you curate your data from an art into a measurable science.

Have you ever wanted to find the images most similar to an image in your dataset?

The FiftyOne Brain makes computing visual similarity really easy. You can compute the similarity of samples in your dataset using an embedding model and store the results in the brain key.

You can then sort your samples by similarity or use this information to find potential duplicate images.

Here we will do the following:

  1. Create Index - In order to run similarity queries against our media, we need to index the data. We can do this via the compute_similarity() function.

    • In the function, specify the model you want to use to generate the embedding vectors, and which vector search engine you want to use on the backend (here LanceDB).

    Tip

    You can also give the similarity index a name (brain_key), which is useful if you want to run vector searches against multiple indexes.

  2. Query - Once you have generated your similarity index, you can query your dataset with sort_by_similarity(). The query can be any of the following:

    • An ID (sample or patch)
    • A query vector of the same dimension as the index
    • A list of IDs (samples or patches)
    • A text prompt (search semantically)
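
As a rough mental model of what a similarity query does, the sketch below ranks a few stored vectors against a query using brute-force cosine similarity in plain Python. It is not how FiftyOne or LanceDB implement the search; the function names, sample IDs, and embedding values are all made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_similar(index, query, k):
    """Rank the indexed IDs by similarity to `query`, most similar first."""
    scored = [(cosine_similarity(vec, query), sid) for sid, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sid for _, sid in scored[:k]]


# Hypothetical 3-dimensional embeddings keyed by made-up sample IDs
index = {
    "cat_1": [0.9, 0.1, 0.0],
    "cat_2": [0.8, 0.2, 0.1],
    "car_1": [0.0, 0.1, 0.9],
}

# Query by vector: it must have the same dimension as the indexed vectors
by_vector = top_k_similar(index, [1.0, 0.0, 0.0], k=2)

# Query by ID: look up the stored vector and use it as the query
by_id = top_k_similar(index, index["car_1"], k=1)

print(by_vector)
print(by_id)
```

A query by a list of IDs or by a text prompt follows the same pattern once the input has been turned into a single query vector; how that conversion happens depends on the model and backend.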

Prerequisites: install necessary dependencies

  1. Create and activate a virtual environment

    Run the following command in your project directory to create a virtual environment with Python's venv module.

    python -m venv fiftyone_

    From inside the project directory, run the following to activate the virtual environment.

    Windows:

    fiftyone_\Scripts\activate

    macOS/Linux:

    source fiftyone_/bin/activate

  2. Install the following packages in the virtual environment

    To install FiftyOne, ensure you have activated any virtual environment that you are using, then run

    pip install fiftyone

Understand basic workflow

The basic workflow shown below uses LanceDB to create a similarity index on your FiftyOne datasets:

  1. Load a dataset into FiftyOne.

  2. Compute embedding vectors for samples or patches in your dataset, or select a model to use to generate embeddings.

  3. Use the compute_similarity() method to generate a LanceDB table for the sample or object patch embeddings in a dataset by setting the parameter backend="lancedb" and specifying a brain_key of your choice.

  4. Use this LanceDB table to query your data with sort_by_similarity().

  5. If desired, delete the table.

Quick Example

Let's jump into a quick example that demonstrates this workflow.

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

# Step 1: Load your data into FiftyOne
dataset = foz.load_zoo_dataset("quickstart")

Make sure you install torch (guide here) before proceeding.

# Steps 2 and 3: Compute embeddings and create a similarity index
lancedb_index = fob.compute_similarity(
    dataset,
    model="clip-vit-base32-torch",
    brain_key="lancedb_index",
    backend="lancedb",
)

Note

Running the code above will download the CLIP model (2.6 GB).

Once the similarity index has been generated, we can query our data in FiftyOne by specifying the brain_key:

# Step 4: Query your data
query = dataset.first().id  # query by sample ID
view = dataset.sort_by_similarity(
    query,
    brain_key="lancedb_index",
    k=10,  # limit to 10 most similar samples
)

The returned result is of type DatasetView.

Note

DatasetView does not hold its contents in-memory. Views simply store the rule(s) that are applied to extract the content of interest from the underlying Dataset when the view is iterated/aggregated on.

This means, for example, that the contents of a DatasetView may change as the underlying Dataset is modified.
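
The idea of a view that stores rules rather than contents can be sketched in a few lines of plain Python. The LazyView class below is a hypothetical stand-in, not FiftyOne's DatasetView, and the sample dictionaries are invented; it only illustrates why a view's contents can change after the view is created.

```python
class LazyView:
    """Hypothetical stand-in for a view: stores a rule, not the matching samples."""

    def __init__(self, samples, rule):
        self._samples = samples  # reference to the underlying collection, not a copy
        self._rule = rule  # predicate applied every time the view is iterated

    def __iter__(self):
        return (s for s in self._samples if self._rule(s))


# Made-up samples; the "label" field is illustrative only
samples = [{"id": 1, "label": "cat"}, {"id": 2, "label": "dog"}]
cats = LazyView(samples, lambda s: s["label"] == "cat")

before = [s["id"] for s in cats]

# Modify the underlying collection after the view was created
samples.append({"id": 3, "label": "cat"})

after = [s["id"] for s in cats]

print(before, after)
```

Because the rule is re-applied on every iteration, the second pass picks up the newly added sample without the view being rebuilt.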

Can you query a view instead of a dataset?

Yes, you can also query a view.

Performing a similarity search on a DatasetView will only return results from the view; if the view contains samples that were not included in the index, they will never be included in the result.

This means that you can index an entire Dataset once and then perform searches on subsets of the dataset by constructing views that contain the images of interest.
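
The behaviour described above, index once and search within a subset, can be pictured with the following minimal sketch. The search function, its allowed_ids parameter, and the toy vectors are assumptions made for illustration and do not reflect the actual FiftyOne or LanceDB internals.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(index, query, k, allowed_ids=None):
    """Return up to k nearest indexed IDs, optionally restricted to allowed_ids."""
    if allowed_ids is None:
        candidates = index
    else:
        # Only IDs that are both in the view and in the index can be returned
        candidates = {i: v for i, v in index.items() if i in allowed_ids}
    ranked = sorted(candidates, key=lambda i: squared_distance(candidates[i], query))
    return ranked[:k]


# The index is built once over the full (toy) dataset
index = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [5.0, 5.0]}

# A view containing "b", "c", and "d"; "d" was never indexed
view_ids = {"b", "c", "d"}

full = search(index, [0.0, 0.0], k=2)
restricted = search(index, [0.0, 0.0], k=2, allowed_ids=view_ids)

print(full)
print(restricted)
```

Here "a" is the closest match overall but is excluded from the restricted search because it is not in the view, and "d" never appears because it has no entry in the index.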

# Step 5 (optional): Cleanup

# Delete the LanceDB table
lancedb_index.cleanup()

# Delete run record from FiftyOne
dataset.delete_brain_run("lancedb_index")

Using LanceDB backend

By default, calling compute_similarity() or sort_by_similarity() will use an sklearn backend.

To use the LanceDB backend, simply set the optional backend parameter of compute_similarity() to "lancedb":

import fiftyone.brain as fob

# ... rest of the code
fob.compute_similarity(..., backend="lancedb", ...)

Alternatively, you can configure FiftyOne to use the LanceDB backend by setting the following environment variable.

In your terminal, set the environment variable using:

PowerShell:

$Env:FIFTYONE_BRAIN_DEFAULT_SIMILARITY_BACKEND="lancedb"

Windows cmd:

set FIFTYONE_BRAIN_DEFAULT_SIMILARITY_BACKEND=lancedb

macOS/Linux:

export FIFTYONE_BRAIN_DEFAULT_SIMILARITY_BACKEND=lancedb

Note

This setting only lasts for the current terminal session. Once the terminal is closed, the environment variable is discarded.
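
To confirm from Python that the variable is visible in the current session, a small check like the one below may help. The configured_backend helper is hypothetical and is not part of FiftyOne; the "sklearn" fallback simply mirrors the default stated earlier in this section.

```python
import os

ENV_VAR = "FIFTYONE_BRAIN_DEFAULT_SIMILARITY_BACKEND"


def configured_backend(environ=os.environ, fallback="sklearn"):
    """Return the backend named in `environ`, or `fallback` if the variable is unset."""
    return environ.get(ENV_VAR, fallback)


# Plain dicts are used here so the example does not touch the real environment
print(configured_backend({}))
print(configured_backend({ENV_VAR: "lancedb"}))
```

Calling configured_backend() with no arguments reads the real environment of the running process.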

Alternatively, you can permanently configure FiftyOne to use the LanceDB backend by creating a brain_config.json at ~/.fiftyone/brain_config.json. The JSON file may contain any desired subset of config fields that you wish to customize.

{
    "default_similarity_backend": "lancedb"
}

This will override the default brain_config and set it according to your customization. You can check the configuration by running the following code:

import fiftyone.brain as fob

# Print your current brain config
print(fob.brain_config)
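
If you prefer to create or update the JSON file from a script rather than by hand, something along these lines could work. The update_brain_config helper is a hypothetical convenience function, not a FiftyOne API, and the example writes to a temporary directory so that it does not modify your real configuration.

```python
import json
import tempfile
from pathlib import Path


def update_brain_config(path, updates):
    """Merge `updates` into the JSON file at `path`, creating the file if needed."""
    path = Path(path)
    config = json.loads(path.read_text()) if path.exists() else {}
    config.update(updates)  # shallow merge: top-level keys in `updates` win
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=4))
    return config


# Demonstrated against a temporary directory; the real file would live at
# Path.home() / ".fiftyone" / "brain_config.json"
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / ".fiftyone" / "brain_config.json"
    update_brain_config(config_path, {"default_similarity_backend": "lancedb"})
    saved = json.loads(config_path.read_text())

print(saved)
```

Merging into the existing file, rather than overwriting it, keeps any other fields you have already customized.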

LanceDB config parameters

The LanceDB backend supports query parameters that can be used to customize your similarity queries. These parameters include:

Name | Purpose | Default
table_name | The name of the LanceDB table to use. If none is provided, a new table will be created. | None
metric | The embedding distance metric to use when creating a new table. The supported values are "cosine" and "euclidean". | "cosine"
uri | The database URI to use. Tables will be created in this database. | "/tmp/lancedb"
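
The choice of metric can change which result counts as the nearest neighbor. The sketch below compares the two metrics on invented 2-dimensional vectors using plain Python; it illustrates the general difference between the metrics and is not LanceDB's own distance code.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity; small when the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance; small when the vectors are close in space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 1.0]
vectors = {
    "same_direction": [10.0, 10.0],  # parallel to the query but far away
    "nearby": [1.0, 2.0],  # close to the query but pointing elsewhere
}

nearest_cosine = min(vectors, key=lambda name: cosine_distance(vectors[name], query))
nearest_euclidean = min(vectors, key=lambda name: euclidean_distance(vectors[name], query))

print(nearest_cosine)
print(nearest_euclidean)
```

Cosine distance ignores vector length and compares direction only, whereas Euclidean distance is sensitive to magnitude, so the two metrics pick different neighbors here.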

There are two ways to specify/customize the parameters:

  1. Using the brain_config.json file

    {
        "similarity_backends": {
            "lancedb": {
                "table_name": "your-table",
                "metric": "euclidean",
                "uri": "/tmp/lancedb"
            }
        }
    }

  2. Directly passing them to compute_similarity() to configure a specific new index:

    lancedb_index = fob.compute_similarity(
        ...
        backend="lancedb",
        brain_key="lancedb_index",
        table_name="your-table",
        metric="euclidean",
        uri="/tmp/lancedb",
    )
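
One way to think about how the two approaches might combine is as layered settings: built-in defaults, then values from the config file, then arguments passed for a specific index. The sketch below assumes that precedence order, which this guide does not state explicitly, and uses a hypothetical resolve_params helper that is not part of FiftyOne.

```python
# Defaults taken from the parameter table above
DEFAULTS = {"table_name": None, "metric": "cosine", "uri": "/tmp/lancedb"}


def resolve_params(file_config, **overrides):
    """Layer settings: defaults, then config-file values, then per-call overrides.

    The precedence order is an assumption made for illustration.
    """
    backend_cfg = file_config.get("similarity_backends", {}).get("lancedb", {})
    return {**DEFAULTS, **backend_cfg, **overrides}


# Contents a brain_config.json might hold, expressed as a Python dict
file_config = {"similarity_backends": {"lancedb": {"metric": "euclidean"}}}

params = resolve_params(file_config, table_name="your-table")
print(params)
```

In this sketch, metric comes from the file, table_name from the call, and uri falls back to the default.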

For a much more in-depth walkthrough of the integration, visit the LanceDB x Voxel51 docs page.

