Introduction to Cloud Bigtable

    1. Introduction

    In this codelab, you'll get introduced to using Cloud Bigtable with the Java HBase client.

    You'll learn how to

    • Avoid common mistakes with schema design
    • Import data in a sequence file
    • Query your data

    When you're done, you'll have several maps that show New York City bus data. For example, you'll create this heatmap of bus trips in Manhattan:

    [Image: heatmap of bus trips in Manhattan]

    2. About the dataset

    You'll be looking at a dataset about New York City buses. There are more than 300 bus routes and 5,800 vehicles following those routes. Our dataset is a log that includes destination name, vehicle ID, latitude, longitude, expected arrival time, and scheduled arrival time. The dataset is made up of snapshots taken around every 10 minutes for June 2017.

    3. Schema design

    To get the best performance from Cloud Bigtable, you have to be thoughtful when you design your schema. Data in Cloud Bigtable is automatically sorted lexicographically, so if you design your schema well, querying for related data is very efficient. Cloud Bigtable allows for queries using point lookups by row key or row-range scans that return a contiguous set of rows. However, if your schema isn't well thought out, you might find yourself piecing together multiple row lookups, or worse, doing full table scans, which are extremely slow operations.

    It is common for tables to have billions of rows, so doing a full table scan could take several minutes and take away resources for other queries.
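    To make the effect of sorted row keys concrete, here is a minimal sketch that uses a Java TreeMap as a stand-in for a table. It is not the Cloud Bigtable or HBase API. The class name, the M15 row, and the cell values are made up for illustration, and Java string ordering only matches byte ordering here because the keys are plain ASCII.

```java
import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustration only: a TreeMap stands in for a table whose rows are kept
// sorted by key. This is not the Cloud Bigtable or HBase API.
final class SortedRowsSketch {

    // Builds a tiny table. Keys follow the codelab's row key layout; the M15
    // row and the cell values are made up for this example.
    static NavigableMap<String, String> sampleTable() {
        NavigableMap<String, String> table = new TreeMap<>();
        table.put("MTA/M86-SBS/1496275200000/NYCT_5840", "b");
        table.put("MTA/M15/1496275200000/NYCT_0001", "c");
        table.put("MTA/M86-SBS/1496278800000/NYCT_5824", "d");
        table.put("MTA/M86-SBS/1496275200000/NYCT_5824", "a");
        return table;
    }

    // Returns the contiguous slice of rows whose key starts with the prefix.
    // Character.MAX_VALUE sorts after any character used in these keys.
    static SortedMap<String, String> scanPrefix(
            NavigableMap<String, String> table, String prefix) {
        return table.subMap(prefix, true, prefix + Character.MAX_VALUE, false);
    }

    public static void main(String[] args) {
        // Prints the two M86-SBS rows for the hour 1496275200000, in key order.
        scanPrefix(sampleTable(), "MTA/M86-SBS/1496275200000/")
            .keySet()
            .forEach(System.out::println);
    }
}
```

    Because related rows share a prefix, they sit next to each other and one range read returns them. A query that cannot be expressed as a key prefix or key range would have to visit every row instead.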

    Plan out the queries

    Our data has a variety of information, but for this codelab, you will use the location and destination of the bus.

    With that information, you could perform these queries:

    • Get the location of a single bus over a given hour.
    • Get a day's worth of data for a bus line or specific bus.
    • Find all the buses in a rectangle on a map.
    • Get the current locations of all the buses (if you were ingesting this data in real time).

    This set of queries can't all be done together optimally. For example, if you are sorting by time, you can't do a scan based on a location without doing a full table scan. You need to prioritize based on the queries you most commonly run.

    For this codelab, you'll focus on optimizing and executing the following set of queries:

    • Get the locations of a specific vehicle over an hour.
    • Get the locations of an entire bus line over an hour.
    • Get the locations of all buses in Manhattan in an hour.
    • Get the most recent locations of all buses in Manhattan in an hour.
    • Get the locations of an entire bus line over the month.
    • Get the locations of an entire bus line with a certain destination over an hour.

    Design the row key

    For this codelab, you will be working with a static dataset, but you will design a schema for scalability. You'll design a schema that allows you to stream more bus data into the table and still perform well.

    Here is the proposed schema for the row key:

    [Bus company/Bus line/Timestamp rounded down to the hour/Vehicle ID]. Each row has an hour of data, and each cell holds multiple time-stamped versions of the data.
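    As a rough sketch of how a key in this layout could be assembled, the following helper rounds an epoch-millisecond timestamp down to the hour and joins the four parts with slashes. The class and method names are hypothetical, and the input timestamp 1496277174000 is just an arbitrary instant inside the hour that starts at 1496275200000.

```java
// Illustration only: builds row keys in the layout
// Bus company/Bus line/Timestamp rounded down to the hour/Vehicle ID.
final class RowKeySketch {
    private static final long HOUR_MILLIS = 60L * 60L * 1000L;

    // Rounds an epoch-millisecond timestamp down to the start of its hour.
    static long floorToHour(long epochMillis) {
        return Math.floorDiv(epochMillis, HOUR_MILLIS) * HOUR_MILLIS;
    }

    // Joins the four key parts with '/' separators.
    static String rowKey(String company, String line, long epochMillis, String vehicleId) {
        return String.join(
            "/", company, line, Long.toString(floorToHour(epochMillis)), vehicleId);
    }

    public static void main(String[] args) {
        // Prints MTA/M86-SBS/1496275200000/NYCT_5824
        System.out.println(rowKey("MTA", "M86-SBS", 1496277174000L, "NYCT_5824"));
    }
}
```

    Every reading taken during the same hour maps to the same row key, which is why each row ends up holding an hour of data as cell versions.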

    For this codelab, you will use one column family to keep things simple. Here is an example view of what the data looks like. The data is sorted by row key.

    Row key | cf:VehicleLocation.Latitude | cf:VehicleLocation.Longitude | ...
    MTA/M86-SBS/1496275200000/NYCT_5824 | 40.781212@20:52:54.00, 40.776163@20:43:19.00, 40.778714@20:33:46.00 | -73.961942@20:52:54.00, -73.946949@20:43:19.00, -73.953731@20:33:46.00 | ...
    MTA/M86-SBS/1496275200000/NYCT_5840 | 40.780664@20:13:51.00, 40.788416@20:03:40.00 | -73.958357@20:13:51.00, -73.976748@20:03:40.00 | ...
    MTA/M86-SBS/1496275200000/NYCT_5867 | 40.780281@20:51:45.00, 40.779961@20:43:15.00, 40.788416@20:33:44.00 | -73.946890@20:51:45.00, -73.959465@20:43:15.00, -73.976748@20:33:44.00 | ...
    ... | ... | ... | ...

    Common mistake: You might think that making time the first value in the row key would be ideal, because you probably care about more recent data, and would want to run queries mainly around certain times. Doing this causes hotspots in the data, however, so you compromise by putting time third. This makes some of your queries more difficult, but you need to do this in order to get the full performance Cloud Bigtable has to offer. Also, you probably don't need to get all buses for a certain time at once. Check out this talk by Twitter for information about how they optimized their schema.
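    The hotspot problem can be illustrated without a database. The sketch below compares where new keys fall in sorted order for a time-first layout and for the line-first layout used in this codelab. The class, the method names, and the three bus lines are assumptions chosen for the example. The underlying idea is that rows are stored in key order, so writes that all sort to the same end of the key space tend to land on the same part of the cluster.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustration only: compares where newly written keys fall in sorted order.
final class HotspotSketch {
    static final String[] LINES = {"M1", "M15", "M86-SBS"}; // illustrative subset
    static final long HOUR_0 = 1496275200000L;
    static final long HOUR_1 = HOUR_0 + 3_600_000L;

    // Builds one key per bus line for the given hour, in either layout.
    static List<String> keysForHour(long hour, boolean timeFirst) {
        List<String> keys = new ArrayList<>();
        for (String line : LINES) {
            keys.add(timeFirst ? hour + "/MTA/" + line : "MTA/" + line + "/" + hour);
        }
        return keys;
    }

    // True when every new key sorts after every existing key, meaning all new
    // writes pile up at one end of the sorted key space.
    static boolean allNewKeysSortLast(List<String> existing, List<String> incoming) {
        return Collections.max(existing).compareTo(Collections.min(incoming)) < 0;
    }

    public static void main(String[] args) {
        // Prints "time first: true"
        System.out.println("time first: "
            + allNewKeysSortLast(keysForHour(HOUR_0, true), keysForHour(HOUR_1, true)));
        // Prints "line first: false"
        System.out.println("line first: "
            + allNewKeysSortLast(keysForHour(HOUR_0, false), keysForHour(HOUR_1, false)));
    }
}
```

    With time first, each new hour's rows all sort after everything already stored. With the bus line ahead of the timestamp, new rows are interleaved across the key space.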

    4. Create instance, table, and family

    Next, you'll create a Cloud Bigtable table.

    First, create a new project. Use the built-in Cloud Shell, which you can open by clicking the "Activate Cloud Shell" button in the upper-right corner.

    [Image: the "Activate Cloud Shell" button]

    Set the following environment variables to make copying and pasting the codelab commands easier:

    INSTANCE_ID="bus-instance"
    CLUSTER_ID="bus-cluster"
    TABLE_ID="bus-data"
    CLUSTER_NUM_NODES=3
    CLUSTER_ZONE="us-central1-c"

    Cloud Shell comes with the tools that you'll use in this codelab already installed: the gcloud command-line tool, the cbt command-line interface, and Maven.

    Enable the Cloud Bigtable APIs by running this command.

    gcloud services enable bigtable.googleapis.com bigtableadmin.googleapis.com

    Create an instance by running the following command:

    gcloud bigtable instances create $INSTANCE_ID \
        --cluster=$CLUSTER_ID \
        --cluster-zone=$CLUSTER_ZONE \
        --cluster-num-nodes=$CLUSTER_NUM_NODES \
        --display-name=$INSTANCE_ID

    Make sure you delete the instance when you are done with the codelab to prevent recurring charges.

    After you create the instance, populate the cbt configuration file and then create a table and column family by running the following commands:

    echo project = $GOOGLE_CLOUD_PROJECT > ~/.cbtrc
    echo instance = $INSTANCE_ID >> ~/.cbtrc
    cbt createtable $TABLE_ID
    cbt createfamily $TABLE_ID cf

    5. Import data

    Import a set of sequence files for this codelab from gs://cloud-bigtable-public-datasets/bus-data with the following steps:

    Enable the Cloud Dataflow API by running this command.

    gcloud services enable dataflow.googleapis.com

    Run the following commands to import the table.

    NUM_WORKERS=$(expr 3 \* $CLUSTER_NUM_NODES)
    gcloud beta dataflow jobs run import-bus-data-$(date +%s) \
        --gcs-location gs://dataflow-templates/latest/GCS_SequenceFile_to_Cloud_Bigtable \
        --num-workers=$NUM_WORKERS --max-workers=$NUM_WORKERS \
        --parameters bigtableProject=$GOOGLE_CLOUD_PROJECT,bigtableInstanceId=$INSTANCE_ID,bigtableTableId=$TABLE_ID,sourcePattern=gs://cloud-bigtable-public-datasets/bus-data/*

    Monitor the import

    You can monitor the job in the Cloud Dataflow UI. You can also view the load on your Cloud Bigtable instance with its monitoring UI. The entire import should take about 5 minutes.

    6. Get the code

    git clone https://github.com/googlecodelabs/cbt-intro-java.git
    cd cbt-intro-java

    Change to Java 11 by running the following commands:

    sudo update-java-alternatives -s java-1.11.0-openjdk-amd64 && export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/

    7. Perform a lookup

    The first query you'll perform is a simple row lookup. You'll get the data for a bus on the M86-SBS line on June 1, 2017 from 12:00 am to 1:00 am (UTC). A vehicle with ID NYCT_5824 is on the bus line then.

    With that information, and knowing the schema design (Bus company/Bus line/Timestamp rounded down to the hour/Vehicle ID), you can deduce that the row key is:

    MTA/M86-SBS/1496275200000/NYCT_5824

    BusQueries.java

    private static final byte[] COLUMN_FAMILY_NAME = Bytes.toBytes("cf");
    private static final byte[] LAT_COLUMN_NAME = Bytes.toBytes("VehicleLocation.Latitude");
    private static final byte[] LONG_COLUMN_NAME = Bytes.toBytes("VehicleLocation.Longitude");

    String rowKey = "MTA/M86-SBS/1496275200000/NYCT_5824";
    Result getResult =
        table.get(
            new Get(Bytes.toBytes(rowKey))
                .addColumn(COLUMN_FAMILY_NAME, LAT_COLUMN_NAME)
                .addColumn(COLUMN_FAMILY_NAME, LONG_COLUMN_NAME));

    The result should contain the most recent location of the bus within that hour. But you want to see all the locations, so set the maximum number of versions on the get request.

    BusQueries.java

    Result getResult =
        table.get(
            new Get(Bytes.toBytes(rowKey))
                .setMaxVersions(Integer.MAX_VALUE)
                .addColumn(COLUMN_FAMILY_NAME, LAT_COLUMN_NAME)
                .addColumn(COLUMN_FAMILY_NAME, LONG_COLUMN_NAME));
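    The difference between the two requests comes down to how many versions of each cell are returned. The following in-memory model is only a sketch of that idea and not how the client or the service is implemented. The class name is hypothetical, the latitude values are taken from the example table earlier in this codelab, and the timestamps are illustrative offsets.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Illustration only: one cell that keeps several timestamped versions.
final class CellVersionsSketch {
    // Versions keyed by timestamp, newest first.
    private final TreeMap<Long, String> versions = new TreeMap<>(Comparator.reverseOrder());

    void put(long timestampMillis, String value) {
        versions.put(timestampMillis, value);
    }

    // Returns up to maxVersions values, newest first.
    List<String> read(int maxVersions) {
        List<String> out = new ArrayList<>();
        for (String value : versions.values()) {
            if (out.size() >= maxVersions) {
                break;
            }
            out.add(value);
        }
        return out;
    }

    public static void main(String[] args) {
        CellVersionsSketch latitude = new CellVersionsSketch();
        latitude.put(2026000L, "40.778714"); // illustrative timestamps
        latitude.put(2599000L, "40.776163");
        latitude.put(3174000L, "40.781212");

        System.out.println(latitude.read(1));                 // [40.781212]
        System.out.println(latitude.read(Integer.MAX_VALUE)); // all three, newest first
    }
}
```

    Reading one version corresponds to the first request, which returns only the most recent location. Raising the limit corresponds to setting the maximum number of versions.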

    In the Cloud Shell, run the following command to get a list of latitudes and longitudes for that bus over the hour:

    mvn package exec:java -Dbigtable.projectID=$GOOGLE_CLOUD_PROJECT \
        -Dbigtable.instanceID=$INSTANCE_ID -Dbigtable.table=$TABLE_ID \
        -Dquery=lookupVehicleInGivenHour

    You can copy and paste the latitudes and longitudes into MapMaker App to visualize the results. After a few layers, it will tell you to create a free account. You can create an account or just delete the existing layers you have. This codelab includes a visualization for each step, if you just want to follow along. Here is the result for this first query:

    [Image: map of the vehicle's locations over the hour]

    8. Perform a scan

    Now, let's view all the data for the bus line for that hour. The scan code looks pretty similar to the get code. You give the scanner a starting position and then indicate you only want rows for the M86-SBS bus line within the hour denoted by the timestamp 1496275200000.

    BusQueries.java

    Scan scan;
    scan = new Scan();
    scan.setMaxVersions(Integer.MAX_VALUE)
        .addColumn(COLUMN_FAMILY_NAME, LAT_COLUMN_NAME)
        .addColumn(COLUMN_FAMILY_NAME, LONG_COLUMN_NAME)
        .withStartRow(Bytes.toBytes("MTA/M86-SBS/1496275200000"))
        .setRowPrefixFilter(Bytes.toBytes("MTA/M86-SBS/1496275200000"));
    ResultScanner scanner = table.getScanner(scan);
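    A prefix scan can be thought of as a range scan from the prefix up to the first key that no longer starts with it. The sketch below shows one common way to compute that exclusive end key, by incrementing the last byte of the prefix. It illustrates the idea only and is not the HBase client's code. The class and method names are hypothetical, and Arrays.compareUnsigned requires Java 9 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustration only: turns a row key prefix into a [start, end) key range.
final class PrefixRangeSketch {

    // Returns the smallest key that sorts after every key with this prefix.
    // An empty array means "no upper bound" (the prefix was all 0xFF bytes).
    static byte[] exclusiveEnd(byte[] prefix) {
        int last = prefix.length - 1;
        while (last >= 0 && prefix[last] == (byte) 0xFF) {
            last--; // a 0xFF byte cannot be incremented, so drop it
        }
        if (last < 0) {
            return new byte[0];
        }
        byte[] end = Arrays.copyOf(prefix, last + 1);
        end[last]++;
        return end;
    }

    // Convenience wrapper for ASCII prefixes.
    static String endOf(String prefix) {
        byte[] end = exclusiveEnd(prefix.getBytes(StandardCharsets.UTF_8));
        return new String(end, StandardCharsets.UTF_8);
    }

    // True when rowKey falls inside the range derived from the prefix.
    static boolean inPrefixRange(String rowKey, String prefix) {
        byte[] key = rowKey.getBytes(StandardCharsets.UTF_8);
        byte[] start = prefix.getBytes(StandardCharsets.UTF_8);
        byte[] end = exclusiveEnd(start);
        return Arrays.compareUnsigned(key, start) >= 0
            && (end.length == 0 || Arrays.compareUnsigned(key, end) < 0);
    }

    public static void main(String[] args) {
        // Prints MTA/M86-SBS/1496275200001
        System.out.println(endOf("MTA/M86-SBS/1496275200000"));
    }
}
```

    For the prefix used in this scan, the computed end key is MTA/M86-SBS/1496275200001, so only rows for that bus line and hour fall inside the range.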

    Run the following command to get the results.

    mvn package exec:java -Dbigtable.projectID=$GOOGLE_CLOUD_PROJECT \
        -Dbigtable.instanceID=$INSTANCE_ID -Dbigtable.table=$TABLE_ID \
        -Dquery=scanBusLineInGivenHour

    [Image: map of the M86-SBS bus line's locations over the hour]

    The Map Maker app can display multiple lists at once, so you can see which of the points belong to the vehicle from the first query you ran.

    [Image: map of the bus line's locations with the first query's vehicle shown as a separate layer]

    An interesting modification to this query is to view the entire month of data for the M86-SBS bus line. To do this, remove the timestamp from the start row and prefix filter.

    BusQueries.java

    scan.withStartRow(Bytes.toBytes("MTA/M86-SBS/"))
        .setRowPrefixFilter(Bytes.toBytes("MTA/M86-SBS/"));
    // Optionally, reduce the results to receive one version per column
    // since there are so many data points.
    scan.setMaxVersions(1);

    Run the following command to get the results. (There will be a long list of results.)

    mvn package exec:java -Dbigtable.projectID=$GOOGLE_CLOUD_PROJECT \
        -Dbigtable.instanceID=$INSTANCE_ID -Dbigtable.table=$TABLE_ID \
        -Dquery=scanEntireBusLine

    If you copy the results into MapMaker, you can view a heatmap of the bus route. The orange blobs indicate the stops, and the bright red blobs are the start and end of the route.

    [Image: heatmap of the M86-SBS bus route over the month]

    9. Introduce filters

    Next, you will filter on buses heading east and buses heading west, and create a separate heatmap for each.

    Note: This filter will only check the latest version, so you will set the max versions to one to ensure the results match the filter. Read more about configuring your filter.

    BusQueries.java

    Scan scan;
    ResultScanner scanner;
    scan = new Scan();
    SingleColumnValueFilter valueFilter =
        new SingleColumnValueFilter(
            COLUMN_FAMILY_NAME,
            Bytes.toBytes("DestinationName"),
            CompareOp.EQUAL,
            Bytes.toBytes("Select Bus Service Yorkville East End AV"));
    scan.setMaxVersions(1)
        .addColumn(COLUMN_FAMILY_NAME, LAT_COLUMN_NAME)
        .addColumn(COLUMN_FAMILY_NAME, LONG_COLUMN_NAME);
    scan.withStartRow(Bytes.toBytes("MTA/M86-SBS/"))
        .setRowPrefixFilter(Bytes.toBytes("MTA/M86-SBS/"));
    scan.setFilter(valueFilter);
    scanner = table.getScanner(scan);
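    To see why the note about versions matters, consider this small in-memory model of a filter that only looks at the latest version of the DestinationName column. It is a sketch of the behavior described in the note, not the HBase filter itself. The class name is hypothetical, and the destination history assigned to each vehicle is made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: a value filter that checks the newest version of a column.
final class LatestVersionFilterSketch {
    static final String EAST = "Select Bus Service Yorkville East End AV";
    static final String WEST = "Select Bus Service Westside West End AV";

    static final class Row {
        final String key;
        final List<String> destinationsNewestFirst;

        Row(String key, String... destinationsNewestFirst) {
            this.key = key;
            this.destinationsNewestFirst = Arrays.asList(destinationsNewestFirst);
        }
    }

    // Hypothetical data: NYCT_5824 turned around within the hour.
    static List<Row> sampleRows() {
        return Arrays.asList(
            new Row("MTA/M86-SBS/1496275200000/NYCT_5824", EAST, WEST),
            new Row("MTA/M86-SBS/1496275200000/NYCT_5840", WEST),
            new Row("MTA/M86-SBS/1496275200000/NYCT_5867", EAST));
    }

    // Keeps the keys of rows whose newest destination equals the wanted value.
    static List<String> keysWhereLatestDestinationIs(List<Row> rows, String wanted) {
        List<String> keys = new ArrayList<>();
        for (Row row : rows) {
            List<String> versions = row.destinationsNewestFirst;
            if (!versions.isEmpty() && versions.get(0).equals(wanted)) {
                keys.add(row.key);
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        keysWhereLatestDestinationIs(sampleRows(), EAST).forEach(System.out::println);
    }
}
```

    In this made-up data, the row for NYCT_5824 passes the eastbound filter even though it also holds an older westbound version. Returning only one version of each location column keeps the locations consistent with the value the filter actually checked.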

    Run the following command to get the results for buses going east.

    mvn package exec:java -Dbigtable.projectID=$GOOGLE_CLOUD_PROJECT \
        -Dbigtable.instanceID=$INSTANCE_ID -Dbigtable.table=$TABLE_ID \
        -Dquery=filterBusesGoingEast

    To get the buses going west, change the string in the valueFilter:

    BusQueries.java

    SingleColumnValueFilter valueFilter =
        new SingleColumnValueFilter(
            COLUMN_FAMILY_NAME,
            Bytes.toBytes("DestinationName"),
            CompareOp.EQUAL,
            Bytes.toBytes("Select Bus Service Westside West End AV"));

    Run the following command to get the results for buses going west.

    mvn package exec:java -Dbigtable.projectID=$GOOGLE_CLOUD_PROJECT \
        -Dbigtable.instanceID=$INSTANCE_ID -Dbigtable.table=$TABLE_ID \
        -Dquery=filterBusesGoingWest

    Buses heading east

    [Image: heatmap of buses heading east]

    Buses heading west

    [Image: heatmap of buses heading west]

    By comparing the two heatmaps, you can see the differences in the routes as well as differences in the pacing. One interpretation of the data is that on the route heading west, the buses are getting stopped more, especially when entering Central Park. On the route heading east, you don't see many choke points.

    10. Perform a multi-range scan

    For the final query, you'll address the case when you care about many bus lines in an area:

    BusQueries.java

    private static final String[] MANHATTAN_BUS_LINES = {"M1", "M2", "M3", ...

    Scan scan;
    ResultScanner scanner;
    List<RowRange> ranges = new ArrayList<>();
    for (String busLine : MANHATTAN_BUS_LINES) {
      ranges.add(
          new RowRange(
              Bytes.toBytes("MTA/" + busLine + "/1496275200000"), true,
              Bytes.toBytes("MTA/" + busLine + "/1496275200001"), false));
    }
    Filter filter = new MultiRowRangeFilter(ranges);
    scan = new Scan();
    scan.setFilter(filter);
    scan.setMaxVersions(Integer.MAX_VALUE)
        .addColumn(COLUMN_FAMILY_NAME, LAT_COLUMN_NAME)
        .addColumn(COLUMN_FAMILY_NAME, LONG_COLUMN_NAME);
    scan.withStartRow(Bytes.toBytes("MTA/M")).setRowPrefixFilter(Bytes.toBytes("MTA/M"));
    scanner = table.getScanner(scan);
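    The ranges built in the loop above each cover one bus line for one hour: inclusive at the hour's timestamp and exclusive at that timestamp plus one. The following stdlib-only sketch mimics that construction so you can check which keys a set of ranges would accept. The Range class, the method names, the three-line subset, and the NYCT_0001 vehicle ID are assumptions for illustration, and string comparison stands in for byte comparison because the keys are ASCII.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: per-line, per-hour key ranges and a membership check.
final class MultiRangeSketch {

    static final class Range {
        final String startInclusive;
        final String endExclusive;

        Range(String startInclusive, String endExclusive) {
            this.startInclusive = startInclusive;
            this.endExclusive = endExclusive;
        }

        boolean contains(String rowKey) {
            return rowKey.compareTo(startInclusive) >= 0
                && rowKey.compareTo(endExclusive) < 0;
        }
    }

    // One range per bus line, mirroring the start and end keys used above.
    static List<Range> hourRanges(String[] busLines, long hourMillis) {
        List<Range> ranges = new ArrayList<>();
        for (String busLine : busLines) {
            ranges.add(new Range(
                "MTA/" + busLine + "/" + hourMillis,
                "MTA/" + busLine + "/" + (hourMillis + 1)));
        }
        return ranges;
    }

    static boolean matchesAny(List<Range> ranges, String rowKey) {
        for (Range range : ranges) {
            if (range.contains(rowKey)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Range> ranges = hourRanges(new String[] {"M1", "M2", "M3"}, 1496275200000L);
        System.out.println(matchesAny(ranges, "MTA/M1/1496275200000/NYCT_0001"));  // true
        System.out.println(matchesAny(ranges, "MTA/M15/1496275200000/NYCT_0001")); // false
        System.out.println(matchesAny(ranges, "MTA/M1/1496278800000/NYCT_0001"));  // false
    }
}
```

    Because each start key includes the slash after the bus line, the range for M1 does not accidentally include lines such as M15.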

    Run the following command to get the results.

    mvn package exec:java -Dbigtable.projectID=$GOOGLE_CLOUD_PROJECT \
        -Dbigtable.instanceID=$INSTANCE_ID -Dbigtable.table=$TABLE_ID \
        -Dquery=scanManhattanBusesInGivenHour

    [Image: heatmap of bus trips in Manhattan]

    11. Finish up

    Clean up to avoid charges

    To avoid incurring charges to your Google Cloud Platform account for the resources used in this codelab, delete your instance.

    gcloud bigtable instances delete $INSTANCE_ID

    What we've covered

    • Schema design
    • Setting up an instance, table, and family
    • Importing sequence files with Dataflow
    • Querying with a lookup, a scan, a scan with a filter, and a multi-range scan
