Deploy an ML vision analytics solution with Dataflow and Cloud Vision API

Last reviewed 2024-05-16 UTC

This deployment document describes how to deploy a Dataflow pipeline to process image files at scale with Cloud Vision API. The pipeline stores the results of the processed files in BigQuery. You can use the results for analytical purposes or to train BigQuery ML models.

The Dataflow pipeline you create in this deployment can process millions of images per day. The only limit is your Vision API quota. You can increase your Vision API quota based on your scale requirements.

These instructions are intended for data engineers and data scientists. This document assumes you have basic knowledge of building Dataflow pipelines using Apache Beam's Java SDK, GoogleSQL for BigQuery, and basic shell scripting. It also assumes that you are familiar with Vision API.

Architecture

The following diagram illustrates the system flow for building an ML vision analytics solution.

An architecture showing the flow of information for ingest and trigger, processing, and store and analyze processes.

In the preceding diagram, information flows through the architecture as follows:

  1. A client uploads image files to a Cloud Storage bucket.
  2. Cloud Storage sends a message about the data upload to Pub/Sub.
  3. Pub/Sub notifies Dataflow about the upload.
  4. The Dataflow pipeline sends the images to Vision API.
  5. Vision API processes the images and then returns the annotations.
  6. The pipeline sends the annotated files to BigQuery for you to analyze.

Objectives

  • Create an Apache Beam pipeline for image analysis of the images loaded in Cloud Storage.
  • Use Dataflow Runner v2 to run the Apache Beam pipeline in streaming mode to analyze the images as soon as they're uploaded.
  • Use Vision API to analyze images for a set of feature types.
  • Analyze annotations with BigQuery.

Costs

In this document, you use the following billable components of Google Cloud:

  • Cloud Storage
  • Pub/Sub
  • Dataflow
  • Cloud Vision
  • BigQuery

To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

When you finish building the example application, you can avoid continued billing by deleting the resources you created. For more information, see Clean up.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
    Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. In the Google Cloud console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

  5. Clone the GitHub repository that contains the source code of the Dataflow pipeline:

        git clone https://github.com/GoogleCloudPlatform/dataflow-vision-analytics.git

  6. Go to the root folder of the repository:

        cd dataflow-vision-analytics

  7. Follow the instructions in the Getting started section of the dataflow-vision-analytics repository in GitHub to accomplish the following tasks:
    • Enable several APIs.
    • Create a Cloud Storage bucket.
    • Create a Pub/Sub topic and subscription.
    • Create a BigQuery dataset.
    • Set up several environment variables for this deployment.

Running the Dataflow pipeline for all implemented Vision API features

The Dataflow pipeline requests and processes a specific set of Vision API features and attributes within the annotated files.

The parameters listed below are specific to the Dataflow pipeline in this deployment. For the complete list of standard Dataflow execution parameters, see Set Dataflow pipeline options.

  • batchSize: The number of images to include in a request to Vision API. The default is 1. You can increase this value to a maximum of 16.
  • datasetName: The name of the output BigQuery dataset.
  • features: A list of image-processing features. The pipeline supports the label, landmark, logo, face, crop hint, and image properties features.
  • keyRange: The parameter that defines the maximum number of parallel calls to Vision API. The default is 1.
  • labelAnnotationTable, landmarkAnnotationTable, logoAnnotationTable, faceAnnotationTable, imagePropertiesTable, cropHintAnnotationTable, errorLogTable: String parameters with table names for various annotations. Default values are provided for each table (for example, label_annotation).
  • maxBatchCompletionDurationInSecs: The length of time to wait before processing images when there is an incomplete batch of images. The default is 30 seconds.
  • subscriberId: The ID of the Pub/Sub subscription that receives input Cloud Storage notifications.
  • visionApiProjectId: The project ID to use for Vision API.

Note: The code samples in step 1 and step 5 don't specify a dedicated service account to run the deployment pipeline. The pipeline uses the default Compute Engine service account for the project that launches it.
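To make the batching parameters concrete, the following Python sketch mimics how batchSize and maxBatchCompletionDurationInSecs interact: a request batch is sent when it fills up, or when the wait deadline passes for an incomplete batch. This is an illustrative sketch, not the pipeline's actual Java implementation; the function and its names are invented for this example.

```python
import time

VISION_MAX_BATCH = 16  # Vision API accepts at most 16 images per request

def batch_images(image_events, batch_size=1, max_wait_secs=30.0, now=time.monotonic):
    """Group incoming image events into Vision API request batches.

    A batch is emitted when it reaches batch_size, or when max_wait_secs
    elapses while a non-empty batch is still incomplete (mirroring the
    batchSize and maxBatchCompletionDurationInSecs parameters).
    """
    batch_size = min(batch_size, VISION_MAX_BATCH)  # clamp to the API limit
    batch, deadline = [], None
    for event in image_events:
        if not batch:
            deadline = now() + max_wait_secs  # start the wait timer on the first item
        batch.append(event)
        if len(batch) >= batch_size or now() >= deadline:
            yield batch
            batch, deadline = [], None
    if batch:  # flush whatever is left at the end of the stream
        yield batch
```

For example, 40 images with batch_size=16 yield batches of 16, 16, and 8; values above 16 are clamped to the Vision API maximum.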
  1. In Cloud Shell, run the following command to process images for all feature types supported by the Dataflow pipeline:

    ./gradlew run --args=" \
      --jobName=test-vision-analytics \
      --streaming \
      --runner=DataflowRunner \
      --enableStreamingEngine \
      --diskSizeGb=30 \
      --project=${PROJECT} \
      --datasetName=${BIGQUERY_DATASET} \
      --subscriberId=projects/${PROJECT}/subscriptions/${GCS_NOTIFICATION_SUBSCRIPTION} \
      --visionApiProjectId=${PROJECT} \
      --features=IMAGE_PROPERTIES,LABEL_DETECTION,LANDMARK_DETECTION,LOGO_DETECTION,CROP_HINTS,FACE_DETECTION"

    The dedicated service account needs to have read access to the bucket containing the images. In other words, that account must have the roles/storage.objectViewer role granted on that bucket.

    For more information about using a dedicated service account, see Dataflow security and permissions.

  2. Open the displayed URL in a new browser tab, or go to the Dataflow Jobs page and select the test-vision-analytics pipeline.

    After a few seconds, the graph for the Dataflow job appears:

    Workflow diagram for the Dataflow job.

    The Dataflow pipeline is now running and waiting to receive input notifications from the Pub/Sub subscription.

  3. Trigger Dataflow image processing by uploading the six sample files into the input bucket:

    gcloud storage cp data-sample/* gs://${IMAGE_BUCKET}
  4. In the Google Cloud console, locate the Custom Counters panel and use it to review the custom counters in Dataflow and to verify that Dataflow has processed all six images. You can use the filter functionality of the panel to navigate to the correct metrics. To display only the counters that start with the numberOf prefix, type numberOf in the filter.

    List of counters filtered to show only those counters that start with `numberof`.

  5. In Cloud Shell, validate that the tables were automatically created:

    bq query --nouse_legacy_sql "SELECT table_name FROM ${BIGQUERY_DATASET}.INFORMATION_SCHEMA.TABLES ORDER BY table_name"

    The output is as follows:

    +----------------------+
    |      table_name      |
    +----------------------+
    | crop_hint_annotation |
    | face_annotation      |
    | image_properties     |
    | label_annotation     |
    | landmark_annotation  |
    | logo_annotation      |
    +----------------------+
  6. View the schema for the landmark_annotation table. The LANDMARK_DETECTION feature captures the attributes returned from the API call.

    bq show --schema --format=prettyjson ${BIGQUERY_DATASET}.landmark_annotation

    The output is as follows:

    [
      {"name": "gcs_uri", "type": "STRING"},
      {"name": "feature_type", "type": "STRING"},
      {"name": "transaction_timestamp", "type": "STRING"},
      {"name": "mid", "type": "STRING"},
      {"name": "description", "type": "STRING"},
      {"name": "score", "type": "FLOAT"},
      {
        "fields": [
          {
            "fields": [
              {"name": "x", "type": "INTEGER"},
              {"name": "y", "type": "INTEGER"}
            ],
            "mode": "REPEATED",
            "name": "vertices",
            "type": "RECORD"
          }
        ],
        "name": "boundingPoly",
        "type": "RECORD"
      },
      {
        "fields": [
          {
            "fields": [
              {"name": "latitude", "type": "FLOAT"},
              {"name": "longitude", "type": "FLOAT"}
            ],
            "name": "latLon",
            "type": "RECORD"
          }
        ],
        "mode": "REPEATED",
        "name": "locations",
        "type": "RECORD"
      }
    ]
  7. View the annotation data produced by the API by running the following bq query command to see all the landmarks found in these six images ordered by the most likely score:

    bq query --nouse_legacy_sql "SELECT SPLIT(gcs_uri, '/')[OFFSET(3)] file_name, description, score, locations FROM ${BIGQUERY_DATASET}.landmark_annotation ORDER BY score DESC"

    The output is similar to the following:

    +------------------+-------------------+------------+---------------------------------+
    |    file_name     |    description    |   score    |            locations            |
    +------------------+-------------------+------------+---------------------------------+
    | eiffel_tower.jpg | Eiffel Tower      |  0.7251996 | ["POINT(2.2944813 48.8583701)"] |
    | eiffel_tower.jpg | Trocadéro Gardens | 0.69601923 | ["POINT(2.2892823 48.8615963)"] |
    | eiffel_tower.jpg | Champ De Mars     |  0.6800974 | ["POINT(2.2986304 48.8556475)"] |
    +------------------+-------------------+------------+---------------------------------+

    For detailed descriptions of all the columns that are specific to annotations, see AnnotateImageResponse.
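The SPLIT(gcs_uri, '/')[OFFSET(3)] expression in the preceding query extracts the file name from the Cloud Storage URI. Because OFFSET is zero-based, element 3 of the split URI is the object name for files stored at the bucket root, as this Python sketch shows (the bucket name here is hypothetical):

```python
# SQL: SPLIT(gcs_uri, '/')[OFFSET(3)] -- OFFSET is zero-based, like Python indexing.
# Splitting a root-level object URI yields: ['gs:', '', '<bucket>', '<object>']
gcs_uri = "gs://my-image-bucket/eiffel_tower.jpg"  # hypothetical bucket name
parts = gcs_uri.split("/")
file_name = parts[3]  # 'eiffel_tower.jpg'
```

Note that for objects nested in folders, element 3 would be the first folder name instead of the file name.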

  8. To stop the streaming pipeline, run the following command. The pipeline otherwise continues to run even though there are no more Pub/Sub notifications to process.

    gcloud dataflow jobs cancel --region ${REGION} \
      $(gcloud dataflow jobs list --region ${REGION} \
        --filter="NAME:test-vision-analytics AND STATE:Running" \
        --format="get(JOB_ID)")

    The following section contains more sample queries that analyze different features of the images.

Analyzing a Flickr30K dataset

In this section, you detect labels and landmarks in the public Flickr30k image dataset hosted on Kaggle.

  1. In Cloud Shell, change the Dataflow pipeline parameters so that the pipeline is optimized for a large dataset. To allow higher throughput, also increase the batchSize and keyRange values. Dataflow scales the number of workers as needed:

    ./gradlew run --args=" \
      --runner=DataflowRunner \
      --jobName=vision-analytics-flickr \
      --streaming \
      --enableStreamingEngine \
      --diskSizeGb=30 \
      --autoscalingAlgorithm=THROUGHPUT_BASED \
      --maxNumWorkers=5 \
      --project=${PROJECT} \
      --region=${REGION} \
      --subscriberId=projects/${PROJECT}/subscriptions/${GCS_NOTIFICATION_SUBSCRIPTION} \
      --visionApiProjectId=${PROJECT} \
      --features=LABEL_DETECTION,LANDMARK_DETECTION \
      --datasetName=${BIGQUERY_DATASET} \
      --batchSize=16 \
      --keyRange=5"

    Because the dataset is large, you can't use Cloud Shell to retrieve the images from Kaggle and send them to the Cloud Storage bucket. You must use a VM with a larger disk size to do that.

  2. To retrieve Kaggle-based images and send them to the Cloud Storage bucket, follow the instructions in the Simulate the images being uploaded to the storage bucket section in the GitHub repository.

  3. To observe the progress of the copying process, look at the custom metrics available in the Dataflow UI: navigate to the Dataflow Jobs page and select the vision-analytics-flickr pipeline. The custom counters should change periodically until the Dataflow pipeline processes all the files.

    The output is similar to the following screenshot of the Custom Counters panel. One of the files in the dataset is of the wrong type, and the rejectedFiles counter reflects that. These counter values are approximate. You might see higher numbers. Also, the number of annotations will most likely change due to increased accuracy of the processing by Vision API.

    List of counters associated with processing the Kaggle-based images.

    To determine whether you're approaching or exceeding the available resources, see the Vision API quota page.

    In our example, the Dataflow pipeline used only roughly 50% of its quota. Based on the percentage of the quota you use, you can decide to increase the parallelism of the pipeline by increasing the value of the keyRange parameter.
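If you decide to raise keyRange, a rough way to size it is to assume throughput scales approximately linearly with the number of parallel calls. The heuristic below is an assumption for illustration, not an official sizing formula, and the function name is invented:

```python
def suggest_key_range(current_key_range, quota_used_fraction, target_fraction=0.8):
    """Rough sizing heuristic (an assumption, not an official formula):
    if throughput scales roughly linearly with keyRange, scale keyRange
    by the ratio of the target to the observed quota utilization."""
    scale = target_fraction / quota_used_fraction
    return max(1, int(current_key_range * scale))  # never suggest less than 1
```

For example, at 50% quota utilization with keyRange=5, targeting 80% utilization suggests a keyRange of 8. Always verify the resulting call rate against your actual Vision API quota before applying it.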

  4. Shut down the pipeline:

    gcloud dataflow jobs cancel --region ${REGION} \
      $(gcloud dataflow jobs list --region ${REGION} \
        --filter="NAME:vision-analytics-flickr AND STATE:Running" \
        --format="get(JOB_ID)")

Analyze annotations in BigQuery

In this deployment, you've processed more than 30,000 images for label and landmark annotation. In this section, you gather statistics about those files. You can run these queries in the GoogleSQL for BigQuery workspace, or you can use the bq command-line tool.

Be aware that the numbers that you see can vary from the sample query results in this deployment. Vision API constantly improves the accuracy of its analysis; it can produce richer results by analyzing the same image after you initially test the solution.

  1. In the Google Cloud console, go to the BigQuery Query editor page and run the following command to view the top 20 labels in the dataset:

    Go to Query editor

    SELECT description, COUNT(*) AS count
    FROM vision_analytics.label_annotation
    GROUP BY description
    ORDER BY count DESC
    LIMIT 20

    The output is similar to the following:

    +------------------+-------+
    |   description    | count |
    +------------------+-------+
    | Leisure          |  7663 |
    | Plant            |  6858 |
    | Event            |  6044 |
    | Sky              |  6016 |
    | Tree             |  5610 |
    | Fun              |  5008 |
    | Grass            |  4279 |
    | Recreation       |  4176 |
    | Shorts           |  3765 |
    | Happy            |  3494 |
    | Wheel            |  3372 |
    | Tire             |  3371 |
    | Water            |  3344 |
    | Vehicle          |  3068 |
    | People in nature |  2962 |
    | Gesture          |  2909 |
    | Sports equipment |  2861 |
    | Building         |  2824 |
    | T-shirt          |  2728 |
    | Wood             |  2606 |
    +------------------+-------+
  2. Determine which other labels are present on an image with a particular label, ranked by frequency:

    DECLARE label STRING DEFAULT 'Plucked string instruments';

    WITH other_labels AS (
      SELECT description, COUNT(*) count
      FROM vision_analytics.label_annotation
      WHERE gcs_uri IN (
        SELECT gcs_uri FROM vision_analytics.label_annotation WHERE description = label
      )
      AND description != label
      GROUP BY description
    )
    SELECT description, count, RANK() OVER (ORDER BY count DESC) rank
    FROM other_labels
    ORDER BY rank
    LIMIT 20;

    The output is as follows. For the Plucked string instruments label used in the preceding command, you should see:

    +------------------------------+-------+------+
    |         description          | count | rank |
    +------------------------------+-------+------+
    | String instrument            |   397 |    1 |
    | Musical instrument           |   236 |    2 |
    | Musician                     |   207 |    3 |
    | Guitar                       |   168 |    4 |
    | Guitar accessory             |   135 |    5 |
    | String instrument accessory  |    99 |    6 |
    | Music                        |    88 |    7 |
    | Musical instrument accessory |    72 |    8 |
    | Guitarist                    |    72 |    8 |
    | Microphone                   |    52 |   10 |
    | Folk instrument              |    44 |   11 |
    | Violin family                |    28 |   12 |
    | Hat                          |    23 |   13 |
    | Entertainment                |    22 |   14 |
    | Band plays                   |    21 |   15 |
    | Jeans                        |    17 |   16 |
    | Plant                        |    16 |   17 |
    | Public address system        |    16 |   17 |
    | Artist                       |    16 |   17 |
    | Leisure                      |    14 |   20 |
    +------------------------------+-------+------+
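The co-occurrence logic of the preceding query can be mirrored in a few lines of Python, which may help when sanity-checking results on a small sample. The rows below are toy data, not the actual dataset:

```python
from collections import Counter

# Toy rows of (gcs_uri, description), standing in for the label_annotation table.
rows = [
    ("gs://b/1.jpg", "Plucked string instruments"),
    ("gs://b/1.jpg", "Guitar"),
    ("gs://b/1.jpg", "Musician"),
    ("gs://b/2.jpg", "Plucked string instruments"),
    ("gs://b/2.jpg", "Guitar"),
    ("gs://b/3.jpg", "Plant"),
]

def co_occurring_labels(rows, label):
    # Images that carry the target label (the inner SELECT in the SQL query).
    matching_images = {uri for uri, desc in rows if desc == label}
    # Count every other label on those images, ranked by frequency.
    counts = Counter(
        desc for uri, desc in rows
        if uri in matching_images and desc != label
    )
    return counts.most_common()

co_occurring_labels(rows, "Plucked string instruments")
# [('Guitar', 2), ('Musician', 1)]
```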
  3. View the top 10 detected landmarks:

    SELECT description, COUNT(description) AS count
    FROM vision_analytics.landmark_annotation
    GROUP BY description
    ORDER BY count DESC
    LIMIT 10

    The output is as follows:

    +--------------------+-------+
    |    description     | count |
    +--------------------+-------+
    | Times Square       |    55 |
    | Rockefeller Center |    21 |
    | St. Mark's Square  |    16 |
    | Bryant Park        |    13 |
    | Millennium Park    |    13 |
    | Ponte Vecchio      |    13 |
    | Tuileries Garden   |    13 |
    | Central Park       |    12 |
    | Starbucks          |    12 |
    | National Mall      |    11 |
    +--------------------+-------+

  4. Determine the images that most likely contain waterfalls:

    SELECT SPLIT(gcs_uri, '/')[OFFSET(3)] file_name, description, score
    FROM vision_analytics.landmark_annotation
    WHERE LOWER(description) LIKE '%fall%'
    ORDER BY score DESC
    LIMIT 10

    The output is as follows:

    +----------------+----------------------------+------------+
    |   file_name    |        description         |   score    |
    +----------------+----------------------------+------------+
    | 895502702.jpg  | Waterfall Carispaccha      |  0.6181358 |
    | 3639105305.jpg | Sahalie Falls Viewpoint    | 0.44379658 |
    | 3672309620.jpg | Gullfoss Falls             | 0.41680416 |
    | 2452686995.jpg | Wahclella Falls            | 0.39005348 |
    | 2452686995.jpg | Wahclella Falls            |  0.3792498 |
    | 3484649669.jpg | Kodiveri Waterfalls        | 0.35024035 |
    | 539801139.jpg  | Mallela Thirtham Waterfall | 0.29260656 |
    | 3639105305.jpg | Sahalie Falls              |  0.2807213 |
    | 3050114829.jpg | Kawasan Falls              | 0.27511594 |
    | 4707103760.jpg | Niagara Falls              | 0.18691841 |
    +----------------+----------------------------+------------+
  5. Find images of landmarks within 3 kilometers of the Colosseum in Rome (the ST_GEOGPOINT function uses the Colosseum's longitude and latitude):

    WITH landmarksWithDistances AS (
      SELECT
        gcs_uri,
        description,
        location,
        ST_DISTANCE(location, ST_GEOGPOINT(12.492231, 41.890222)) distance_in_meters,
      FROM `vision_analytics.landmark_annotation` landmarks
      CROSS JOIN UNNEST(landmarks.locations) AS location
    )
    SELECT
      SPLIT(gcs_uri, "/")[OFFSET(3)] file,
      description,
      ROUND(distance_in_meters) distance_in_meters,
      location,
      CONCAT("https://storage.cloud.google.com/", SUBSTR(gcs_uri, 6)) AS image_url
    FROM landmarksWithDistances
    WHERE distance_in_meters < 3000
    ORDER BY distance_in_meters
    LIMIT 100

    When you run the query, you see that there are multiple images of the Colosseum, but also images of the Arch of Constantine, the Palatine Hill, and a number of other frequently photographed places.

    You can visualize the data in BigQuery Geo Viz by pasting in the previous query. Select a point on the map to see its details. The image_url attribute contains a link to the image file.

    Map of locations and their distance from the Colosseum.

One note on query results: location information is usually present for landmarks. The same image can contain multiple locations of the same landmark. This functionality is described in the AnnotateImageResponse type.

Because one location can indicate the location of the scene in the image, multiple LocationInfo elements can be present. Another location can indicate where the image was taken.
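Given the landmark_annotation schema shown earlier, the repeated locations field of a single row can therefore hold more than one coordinate pair. The following Python sketch shows how such a row could be read; the row values here are fabricated for illustration:

```python
# A landmark annotation row shaped like the BigQuery schema shown earlier:
# "locations" is REPEATED, so one landmark row can carry several lat/lon pairs.
row = {  # sample values for illustration only
    "description": "Eiffel Tower",
    "locations": [
        {"latLon": {"latitude": 48.8583701, "longitude": 2.2944813}},
        {"latLon": {"latitude": 48.8615963, "longitude": 2.2892823}},
    ],
}

# Flatten the repeated field into (latitude, longitude) tuples,
# analogous to the CROSS JOIN UNNEST(landmarks.locations) in the SQL query.
points = [
    (loc["latLon"]["latitude"], loc["latLon"]["longitude"])
    for loc in row["locations"]
]
```

This mirrors what the UNNEST in the Colosseum query does: each element of locations becomes its own output row.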

Clean up

To avoid incurring charges to your Google Cloud account for the resourcesused in this guide, either delete the project that contains the resources, orkeep the project and delete the individual resources.

Delete the Google Cloud project

The easiest way to eliminate billing is to delete the Google Cloud project you created for the tutorial.

    Caution: Deleting a project has the following effects:
    • Everything in the project is deleted. If you used an existing project for the tasks in this document, when you delete it, you also delete any other work you've done in the project.
    • Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

    If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

If you decide to delete resources individually, follow the steps in the Clean up section of the GitHub repository.


Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2024-05-16 UTC.