Deploy an ML vision analytics solution with Dataflow and Cloud Vision API

Last reviewed 2024-05-16 UTC

This deployment document describes how to deploy a Dataflow pipeline to process image files at scale with Cloud Vision API. The pipeline stores the results of the processed files in BigQuery. You can use the results for analytical purposes or to train BigQuery ML models.

The Dataflow pipeline you create in this deployment can process millions of images per day. The only limit is your Vision API quota. You can increase your Vision API quota based on your scale requirements.

These instructions are intended for data engineers and data scientists. This document assumes you have basic knowledge of building Dataflow pipelines using Apache Beam's Java SDK, GoogleSQL for BigQuery, and basic shell scripting. It also assumes that you are familiar with Vision API.

Architecture

The following diagram illustrates the system flow for building an ML vision analytics solution.

An architecture showing the flow of information for ingest and trigger, processing, and store and analyze processes.

In the preceding diagram, information flows through the architecture as follows:

  1. A client uploads image files to a Cloud Storage bucket.
  2. Cloud Storage sends a message about the data upload to Pub/Sub.
  3. Pub/Sub notifies Dataflow about the upload.
  4. The Dataflow pipeline sends the images to Vision API.
  5. Vision API processes the images and then returns the annotations.
  6. The pipeline sends the annotated files to BigQuery for you to analyze.

Objectives

  • Create an Apache Beam pipeline for image analysis of the images loaded in Cloud Storage.
  • Use Dataflow Runner v2 to run the Apache Beam pipeline in streaming mode to analyze the images as soon as they're uploaded.
  • Use Vision API to analyze images for a set of feature types.
  • Analyze annotations with BigQuery.

Costs

In this document, you use the following billable components of Google Cloud:

  • Cloud Storage
  • Pub/Sub
  • Dataflow
  • Cloud Vision
  • BigQuery

To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

When you finish building the example application, you can avoid continued billing by deleting the resources you created. For more information, see Clean up.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
    Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. In the Google Cloud console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

  5. Clone the GitHub repository that contains the source code of the Dataflow pipeline:

        git clone https://github.com/GoogleCloudPlatform/dataflow-vision-analytics.git

  6. Go to the root folder of the repository:

        cd dataflow-vision-analytics

  7. Follow the instructions in the Getting started section of the dataflow-vision-analytics repository in GitHub to accomplish the following tasks:
    • Enable several APIs.
    • Create a Cloud Storage bucket.
    • Create a Pub/Sub topic and subscription.
    • Create a BigQuery dataset.
    • Set up several environment variables for this deployment.

Running the Dataflow pipeline for all implemented Vision API features

The Dataflow pipeline requests and processes a specific set of Vision API features and attributes within the annotated files.

The parameters listed below are specific to the Dataflow pipeline in this deployment. For the complete list of standard Dataflow execution parameters, see Set Dataflow pipeline options.

  • batchSize: The number of images to include in a request to Vision API. The default is 1. You can increase this value to a maximum of 16.
  • datasetName: The name of the output BigQuery dataset.
  • features: A list of image-processing features. The pipeline supports the label, landmark, logo, face, crop hint, and image properties features.
  • keyRange: The parameter that defines the maximum number of parallel calls to Vision API. The default is 1.
  • labelAnnotationTable, landmarkAnnotationTable, logoAnnotationTable, faceAnnotationTable, imagePropertiesTable, cropHintAnnotationTable, errorLogTable: String parameters with table names for various annotations. Default values are provided for each table (for example, label_annotation).
  • maxBatchCompletionDurationInSecs: The length of time to wait before processing images when there is an incomplete batch of images. The default is 30 seconds.
  • subscriberId: The ID of the Pub/Sub subscription that receives input Cloud Storage notifications.
  • visionApiProjectId: The project ID to use for Vision API.

Note: The code samples in step 1 and step 5 don't specify a dedicated service account to run the deployment pipeline. The pipeline uses the default Compute Engine service account for the project that launches it.
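To make the batching parameters concrete, the following Python sketch mimics how batchSize and maxBatchCompletionDurationInSecs interact: a request batch is sent when it fills up, or when the wait deadline passes for an incomplete batch. This is an illustrative sketch, not the pipeline's actual Java implementation; the function and its names are invented for this example.

```python
import time

VISION_MAX_BATCH = 16  # Vision API accepts at most 16 images per request

def batch_images(image_events, batch_size=1, max_wait_secs=30.0, now=time.monotonic):
    """Group incoming image events into Vision API request batches.

    A batch is emitted when it reaches batch_size, or when max_wait_secs
    elapses while a non-empty batch is still incomplete (mirroring the
    batchSize and maxBatchCompletionDurationInSecs parameters).
    """
    batch_size = min(batch_size, VISION_MAX_BATCH)  # clamp to the API limit
    batch, deadline = [], None
    for event in image_events:
        if not batch:
            deadline = now() + max_wait_secs  # start the wait timer on the first item
        batch.append(event)
        if len(batch) >= batch_size or now() >= deadline:
            yield batch
            batch, deadline = [], None
    if batch:  # flush whatever is left at the end of the stream
        yield batch
```

For example, 40 images with batch_size=16 yield batches of 16, 16, and 8; values above 16 are clamped to the Vision API maximum.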
  1. In Cloud Shell, run the following command to process images for all feature types supported by the Dataflow pipeline:

    ./gradlew run --args=" \
      --jobName=test-vision-analytics \
      --streaming \
      --runner=DataflowRunner \
      --enableStreamingEngine \
      --diskSizeGb=30 \
      --project=${PROJECT} \
      --datasetName=${BIGQUERY_DATASET} \
      --subscriberId=projects/${PROJECT}/subscriptions/${GCS_NOTIFICATION_SUBSCRIPTION} \
      --visionApiProjectId=${PROJECT} \
      --features=IMAGE_PROPERTIES,LABEL_DETECTION,LANDMARK_DETECTION,LOGO_DETECTION,CROP_HINTS,FACE_DETECTION"

    The dedicated service account needs to have read access to the bucket containing the images. In other words, that account must have the roles/storage.objectViewer role granted on that bucket.

    For more information about using a dedicated service account, see Dataflow security and permissions.

  2. Open the displayed URL in a new browser tab, or go to the Dataflow Jobs page and select the test-vision-analytics pipeline.

    After a few seconds, the graph for the Dataflow job appears:

    Workflow diagram for the Dataflow job.

    The Dataflow pipeline is now running and waiting to receive input notifications from the Pub/Sub subscription.

  3. Trigger Dataflow image processing by uploading the six sample files into the input bucket:

    gcloud storage cp data-sample/* gs://${IMAGE_BUCKET}
  4. In the Google Cloud console, locate the Custom Counters panel and use it to review the custom counters in Dataflow and to verify that Dataflow has processed all six images. You can use the filter functionality of the panel to navigate to the correct metrics. To display only the counters that start with the numberOf prefix, type numberOf in the filter.

    List of counters filtered to show only those counters that start with `numberof`.

  5. In Cloud Shell, validate that the tables were automatically created:

    bq query --nouse_legacy_sql "SELECT table_name FROM ${BIGQUERY_DATASET}.INFORMATION_SCHEMA.TABLES ORDER BY table_name"

    The output is as follows:

    +----------------------+
    |      table_name      |
    +----------------------+
    | crop_hint_annotation |
    | face_annotation      |
    | image_properties     |
    | label_annotation     |
    | landmark_annotation  |
    | logo_annotation      |
    +----------------------+
  6. View the schema for the landmark_annotation table. The LANDMARK_DETECTION feature captures the attributes returned from the API call.

    bq show --schema --format=prettyjson ${BIGQUERY_DATASET}.landmark_annotation

    The output is as follows:

    [
      {"name": "gcs_uri", "type": "STRING"},
      {"name": "feature_type", "type": "STRING"},
      {"name": "transaction_timestamp", "type": "STRING"},
      {"name": "mid", "type": "STRING"},
      {"name": "description", "type": "STRING"},
      {"name": "score", "type": "FLOAT"},
      {
        "fields": [
          {
            "fields": [
              {"name": "x", "type": "INTEGER"},
              {"name": "y", "type": "INTEGER"}
            ],
            "mode": "REPEATED",
            "name": "vertices",
            "type": "RECORD"
          }
        ],
        "name": "boundingPoly",
        "type": "RECORD"
      },
      {
        "fields": [
          {
            "fields": [
              {"name": "latitude", "type": "FLOAT"},
              {"name": "longitude", "type": "FLOAT"}
            ],
            "name": "latLon",
            "type": "RECORD"
          }
        ],
        "mode": "REPEATED",
        "name": "locations",
        "type": "RECORD"
      }
    ]
  7. View the annotation data produced by the API by running the following bq query command to see all the landmarks found in these six images ordered by the most likely score:

    bq query --nouse_legacy_sql "SELECT SPLIT(gcs_uri, '/')[OFFSET(3)] file_name, description, score, locations FROM ${BIGQUERY_DATASET}.landmark_annotation ORDER BY score DESC"

    The output is similar to the following:

    +------------------+-------------------+------------+---------------------------------+
    |    file_name     |    description    |   score    |            locations            |
    +------------------+-------------------+------------+---------------------------------+
    | eiffel_tower.jpg | Eiffel Tower      |  0.7251996 | ["POINT(2.2944813 48.8583701)"] |
    | eiffel_tower.jpg | Trocadéro Gardens | 0.69601923 | ["POINT(2.2892823 48.8615963)"] |
    | eiffel_tower.jpg | Champ De Mars     |  0.6800974 | ["POINT(2.2986304 48.8556475)"] |
    +------------------+-------------------+------------+---------------------------------+

    For detailed descriptions of all the columns that are specific to annotations, see AnnotateImageResponse.
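The SPLIT(gcs_uri, '/')[OFFSET(3)] expression in the preceding query extracts the file name from the Cloud Storage URI. Because OFFSET is zero-based, element 3 of the split URI is the object name for files stored at the bucket root, as this Python sketch shows (the bucket name here is hypothetical):

```python
# SQL: SPLIT(gcs_uri, '/')[OFFSET(3)] -- OFFSET is zero-based, like Python indexing.
# Splitting a root-level object URI yields: ['gs:', '', '<bucket>', '<object>']
gcs_uri = "gs://my-image-bucket/eiffel_tower.jpg"  # hypothetical bucket name
parts = gcs_uri.split("/")
file_name = parts[3]  # 'eiffel_tower.jpg'
```

Note that for objects nested in folders, element 3 would be the first folder name instead of the file name.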

  8. To stop the streaming pipeline, run the following command. The pipeline otherwise continues to run even though there are no more Pub/Sub notifications to process.

    gcloud dataflow jobs cancel --region ${REGION} \
      $(gcloud dataflow jobs list --region ${REGION} \
        --filter="NAME:test-vision-analytics AND STATE:Running" \
        --format="get(JOB_ID)")

    The following section contains more sample queries that analyze different features of the images.

Analyzing a Flickr30K dataset

In this section, you detect labels and landmarks in the public Flickr30k image dataset hosted on Kaggle.

  1. In Cloud Shell, change the Dataflow pipeline parameters so that the pipeline is optimized for a large dataset. To allow higher throughput, also increase the batchSize and keyRange values. Dataflow scales the number of workers as needed:

    ./gradlew run --args=" \
      --runner=DataflowRunner \
      --jobName=vision-analytics-flickr \
      --streaming \
      --enableStreamingEngine \
      --diskSizeGb=30 \
      --autoscalingAlgorithm=THROUGHPUT_BASED \
      --maxNumWorkers=5 \
      --project=${PROJECT} \
      --region=${REGION} \
      --subscriberId=projects/${PROJECT}/subscriptions/${GCS_NOTIFICATION_SUBSCRIPTION} \
      --visionApiProjectId=${PROJECT} \
      --features=LABEL_DETECTION,LANDMARK_DETECTION \
      --datasetName=${BIGQUERY_DATASET} \
      --batchSize=16 \
      --keyRange=5"

    Because the dataset is large, you can't use Cloud Shell to retrieve the images from Kaggle and send them to the Cloud Storage bucket. You must use a VM with a larger disk size to do that.

  2. To retrieve Kaggle-based images and send them to the Cloud Storage bucket, follow the instructions in the Simulate the images being uploaded to the storage bucket section in the GitHub repository.

  3. To observe the progress of the copying process, look at the custom metrics available in the Dataflow UI: navigate to the Dataflow Jobs page and select the vision-analytics-flickr pipeline. The custom counters should change periodically until the Dataflow pipeline processes all the files.

    The output is similar to the following screenshot of the Custom Counters panel. One of the files in the dataset is of the wrong type, and the rejectedFiles counter reflects that. These counter values are approximate. You might see higher numbers. Also, the number of annotations will most likely change due to increased accuracy of the processing by Vision API.

    List of counters associated with processing the Kaggle-based images.

    To determine whether you're approaching or exceeding the available resources, see the Vision API quota page.

    In our example, the Dataflow pipeline used only roughly 50% of its quota. Based on the percentage of the quota you use, you can decide to increase the parallelism of the pipeline by increasing the value of the keyRange parameter.
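If you decide to raise keyRange, a rough way to size it is to assume throughput scales approximately linearly with the number of parallel calls. The heuristic below is an assumption for illustration, not an official sizing formula, and the function name is invented:

```python
def suggest_key_range(current_key_range, quota_used_fraction, target_fraction=0.8):
    """Rough sizing heuristic (an assumption, not an official formula):
    if throughput scales roughly linearly with keyRange, scale keyRange
    by the ratio of the target to the observed quota utilization."""
    scale = target_fraction / quota_used_fraction
    return max(1, int(current_key_range * scale))  # never suggest less than 1
```

For example, at 50% quota utilization with keyRange=5, targeting 80% utilization suggests a keyRange of 8. Always verify the resulting call rate against your actual Vision API quota before applying it.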

  4. Shut down the pipeline:

    gcloud dataflow jobs cancel --region ${REGION} \
      $(gcloud dataflow jobs list --region ${REGION} \
        --filter="NAME:vision-analytics-flickr AND STATE:Running" \
        --format="get(JOB_ID)")

Analyze annotations in BigQuery

In this deployment, you've processed more than 30,000 images for label and landmark annotation. In this section, you gather statistics about those files. You can run these queries in the GoogleSQL for BigQuery workspace, or you can use the bq command-line tool.

Be aware that the numbers that you see can vary from the sample query results in this deployment. Vision API constantly improves the accuracy of its analysis; it can produce richer results by analyzing the same image after you initially test the solution.

  1. In the Google Cloud console, go to the BigQuery Query editor page and run the following command to view the top 20 labels in the dataset:

    Go to Query editor

    SELECT description, COUNT(*) AS count
    FROM vision_analytics.label_annotation
    GROUP BY description
    ORDER BY count DESC
    LIMIT 20

    The output is similar to the following:

    +------------------+-------+
    |   description    | count |
    +------------------+-------+
    | Leisure          |  7663 |
    | Plant            |  6858 |
    | Event            |  6044 |
    | Sky              |  6016 |
    | Tree             |  5610 |
    | Fun              |  5008 |
    | Grass            |  4279 |
    | Recreation       |  4176 |
    | Shorts           |  3765 |
    | Happy            |  3494 |
    | Wheel            |  3372 |
    | Tire             |  3371 |
    | Water            |  3344 |
    | Vehicle          |  3068 |
    | People in nature |  2962 |
    | Gesture          |  2909 |
    | Sports equipment |  2861 |
    | Building         |  2824 |
    | T-shirt          |  2728 |
    | Wood             |  2606 |
    +------------------+-------+
  2. Determine which other labels are present on an image with a particular label, ranked by frequency:

    DECLARE label STRING DEFAULT 'Plucked string instruments';

    WITH other_labels AS (
      SELECT description, COUNT(*) count
      FROM vision_analytics.label_annotation
      WHERE gcs_uri IN (
        SELECT gcs_uri FROM vision_analytics.label_annotation WHERE description = label
      )
      AND description != label
      GROUP BY description
    )
    SELECT description, count, RANK() OVER (ORDER BY count DESC) rank
    FROM other_labels
    ORDER BY rank
    LIMIT 20;

    The output is as follows. For the Plucked string instruments label used in the preceding command, you should see:

    +------------------------------+-------+------+
    |         description          | count | rank |
    +------------------------------+-------+------+
    | String instrument            |   397 |    1 |
    | Musical instrument           |   236 |    2 |
    | Musician                     |   207 |    3 |
    | Guitar                       |   168 |    4 |
    | Guitar accessory             |   135 |    5 |
    | String instrument accessory  |    99 |    6 |
    | Music                        |    88 |    7 |
    | Musical instrument accessory |    72 |    8 |
    | Guitarist                    |    72 |    8 |
    | Microphone                   |    52 |   10 |
    | Folk instrument              |    44 |   11 |
    | Violin family                |    28 |   12 |
    | Hat                          |    23 |   13 |
    | Entertainment                |    22 |   14 |
    | Band plays                   |    21 |   15 |
    | Jeans                        |    17 |   16 |
    | Plant                        |    16 |   17 |
    | Public address system        |    16 |   17 |
    | Artist                       |    16 |   17 |
    | Leisure                      |    14 |   20 |
    +------------------------------+-------+------+
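The co-occurrence logic of the preceding query can be mirrored in a few lines of Python, which may help when sanity-checking results on a small sample. The rows below are toy data, not the actual dataset:

```python
from collections import Counter

# Toy rows of (gcs_uri, description), standing in for the label_annotation table.
rows = [
    ("gs://b/1.jpg", "Plucked string instruments"),
    ("gs://b/1.jpg", "Guitar"),
    ("gs://b/1.jpg", "Musician"),
    ("gs://b/2.jpg", "Plucked string instruments"),
    ("gs://b/2.jpg", "Guitar"),
    ("gs://b/3.jpg", "Plant"),
]

def co_occurring_labels(rows, label):
    # Images that carry the target label (the inner SELECT in the SQL query).
    matching_images = {uri for uri, desc in rows if desc == label}
    # Count every other label on those images, ranked by frequency.
    counts = Counter(
        desc for uri, desc in rows
        if uri in matching_images and desc != label
    )
    return counts.most_common()

co_occurring_labels(rows, "Plucked string instruments")
# [('Guitar', 2), ('Musician', 1)]
```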
  3. View the top 10 detected landmarks:

    SELECT description, COUNT(description) AS count
    FROM vision_analytics.landmark_annotation
    GROUP BY description
    ORDER BY count DESC
    LIMIT 10

    The output is as follows:

    +--------------------+-------+
    |    description     | count |
    +--------------------+-------+
    | Times Square       |    55 |
    | Rockefeller Center |    21 |
    | St. Mark's Square  |    16 |
    | Bryant Park        |    13 |
    | Millennium Park    |    13 |
    | Ponte Vecchio      |    13 |
    | Tuileries Garden   |    13 |
    | Central Park       |    12 |
    | Starbucks          |    12 |
    | National Mall      |    11 |
    +--------------------+-------+

  4. Determine the images that most likely contain waterfalls:

    SELECT SPLIT(gcs_uri, '/')[OFFSET(3)] file_name, description, score
    FROM vision_analytics.landmark_annotation
    WHERE LOWER(description) LIKE '%fall%'
    ORDER BY score DESC
    LIMIT 10

    The output is as follows:

    +----------------+----------------------------+------------+
    |   file_name    |        description         |   score    |
    +----------------+----------------------------+------------+
    | 895502702.jpg  | Waterfall Carispaccha      |  0.6181358 |
    | 3639105305.jpg | Sahalie Falls Viewpoint    | 0.44379658 |
    | 3672309620.jpg | Gullfoss Falls             | 0.41680416 |
    | 2452686995.jpg | Wahclella Falls            | 0.39005348 |
    | 2452686995.jpg | Wahclella Falls            |  0.3792498 |
    | 3484649669.jpg | Kodiveri Waterfalls        | 0.35024035 |
    | 539801139.jpg  | Mallela Thirtham Waterfall | 0.29260656 |
    | 3639105305.jpg | Sahalie Falls              |  0.2807213 |
    | 3050114829.jpg | Kawasan Falls              | 0.27511594 |
    | 4707103760.jpg | Niagara Falls              | 0.18691841 |
    +----------------+----------------------------+------------+
  5. Find images of landmarks within 3 kilometers of the Colosseum in Rome (the ST_GEOGPOINT function uses the Colosseum's longitude and latitude):

    WITH landmarksWithDistances AS (
      SELECT
        gcs_uri,
        description,
        location,
        ST_DISTANCE(location, ST_GEOGPOINT(12.492231, 41.890222)) distance_in_meters,
      FROM `vision_analytics.landmark_annotation` landmarks
      CROSS JOIN UNNEST(landmarks.locations) AS location
    )
    SELECT
      SPLIT(gcs_uri, "/")[OFFSET(3)] file,
      description,
      ROUND(distance_in_meters) distance_in_meters,
      location,
      CONCAT("https://storage.cloud.google.com/", SUBSTR(gcs_uri, 6)) AS image_url
    FROM landmarksWithDistances
    WHERE distance_in_meters < 3000
    ORDER BY distance_in_meters
    LIMIT 100

    When you run the query, you see that there are multiple images of the Colosseum, but also images of the Arch of Constantine, the Palatine Hill, and a number of other frequently photographed places.

    You can visualize the data in BigQuery Geo Viz by pasting in the previous query. Select a point on the map to see its details. The image_url attribute contains a link to the image file.

    Map of locations and their distance from the Colosseum.

One note on query results: location information is usually present for landmarks. The same image can contain multiple locations of the same landmark. This functionality is described in the AnnotateImageResponse type.

Because one location can indicate the location of the scene in the image, multiple LocationInfo elements can be present. Another location can indicate where the image was taken.
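Given the landmark_annotation schema shown earlier, the repeated locations field of a single row can therefore hold more than one coordinate pair. The following Python sketch shows how such a row could be read; the row values here are fabricated for illustration:

```python
# A landmark annotation row shaped like the BigQuery schema shown earlier:
# "locations" is REPEATED, so one landmark row can carry several lat/lon pairs.
row = {  # sample values for illustration only
    "description": "Eiffel Tower",
    "locations": [
        {"latLon": {"latitude": 48.8583701, "longitude": 2.2944813}},
        {"latLon": {"latitude": 48.8615963, "longitude": 2.2892823}},
    ],
}

# Flatten the repeated field into (latitude, longitude) tuples,
# analogous to the CROSS JOIN UNNEST(landmarks.locations) in the SQL query.
points = [
    (loc["latLon"]["latitude"], loc["latLon"]["longitude"])
    for loc in row["locations"]
]
```

This mirrors what the UNNEST in the Colosseum query does: each element of locations becomes its own output row.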

Clean up

To avoid incurring charges to your Google Cloud account for the resourcesused in this guide, either delete the project that contains the resources, orkeep the project and delete the individual resources.

Delete the Google Cloud project

The easiest way to eliminate billing is to delete the Google Cloud project you created for the tutorial.

    Caution: Deleting a project has the following effects:
    • Everything in the project is deleted. If you used an existing project for the tasks in this document, when you delete it, you also delete any other work you've done in the project.
    • Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

    If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

If you decide to delete resources individually, follow the steps in the Clean up section of the GitHub repository.


Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2024-05-16 UTC.