Google Cloud BigQuery
Google BigQuery enables super-fast, SQL-like queries against massive datasets, using the processing power of Google's infrastructure. To learn more, read What is BigQuery?
The goal of google-cloud is to provide an API that is comfortable to Rubyists. Your authentication credentials are detected automatically in Google Cloud Platform (GCP), including Google Compute Engine (GCE), Google Kubernetes Engine (GKE), Google App Engine (GAE), Google Cloud Functions (GCF), and Cloud Run. In other environments you can configure authentication easily, either directly in your code or via environment variables. Read more about the options for connecting in the Authentication Guide.
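For example, outside of GCP you can pass a project ID and a service account key file directly to the constructor. This is a minimal sketch: the project ID and key file path are placeholders, and recent versions of the gem accept `project_id:` and `credentials:` (older releases used `project:` and `keyfile:`).

```ruby
require "google/cloud/bigquery"

# Placeholder project ID and key file path; substitute your own values.
bigquery = Google::Cloud::Bigquery.new project_id:  "my-project",
                                       credentials: "/path/to/keyfile.json"
```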
To help you get started quickly, the first few examples below use a public dataset provided by Google. As soon as you have signed up to use BigQuery, and provided that you stay in the free tier for queries, you should be able to run these first examples without the need to set up billing or to load data (although we'll show you how to do that too).
Listing Datasets and Tables
A BigQuery project contains datasets, which in turn contain tables. Assuming that you have not yet created datasets or tables in your own project, let's connect to Google's `bigquery-public-data` project, and see what we find.
require"google/cloud/bigquery"bigquery=Google::Cloud::Bigquery.newproject:"bigquery-public-data"bigquery.datasets.count#=> 1bigquery.datasets.first.dataset_id#=> "samples"dataset=bigquery.datasets.firsttables=dataset.tablestables.count#=> 7tables.map&:table_id#=> [..., "shakespeare", "trigrams", "wikipedia"]
In addition to listing all datasets and tables in the project, you can also retrieve individual datasets and tables by ID. Let's look at the structure of the `shakespeare` table, which contains an entry for every word in every play written by Shakespeare.
require"google/cloud/bigquery"bigquery=Google::Cloud::Bigquery.newproject:"bigquery-public-data"dataset=bigquery.dataset"samples"table=dataset.table"shakespeare"table.headers#=> [:word, :word_count, :corpus, :corpus_date]table.rows_count#=> 164656
Now that you know the column names for the Shakespeare table, let's write andrun a few queries against it.
Running queries
BigQuery supports two SQL dialects: standard SQL and the older legacy SQL (BigQuery SQL), as discussed in the guide Migrating from legacy SQL.
Standard SQL
Standard SQL is the preferred SQL dialect for querying data stored in BigQuery. It is compliant with the SQL 2011 standard, and has extensions that support querying nested and repeated data. This is the default syntax. It has several advantages over legacy SQL, including:
- Composability using `WITH` clauses and SQL functions
- Subqueries in the `SELECT` list and `WHERE` clause
- Correlated subqueries
- `ARRAY` and `STRUCT` data types
- Inserts, updates, and deletes
- `COUNT(DISTINCT <expr>)` is exact and scalable, providing the accuracy of `EXACT_COUNT_DISTINCT` without its limitations
- Automatic predicate push-down through `JOIN`s
- Complex `JOIN` predicates, including arbitrary expressions
For examples that demonstrate some of these features, see Standard SQL highlights.
As shown in this example, standard SQL is the library default:
require"google/cloud/bigquery"bigquery=Google::Cloud::Bigquery.newsql="SELECT word, SUM(word_count) AS word_count "\"FROM `bigquery-public-data.samples.shakespeare`"\"WHERE word IN ('me', 'I', 'you') GROUP BY word"data=bigquery.querysql
Notice that in standard SQL, a fully-qualified table name uses the following format: `` `my-dashed-project.dataset1.tableName` ``.
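As a quick illustration of a couple of the features listed above, the following sketch (not taken from the official samples; only `bigquery.query` from this library is assumed) combines a `WITH` clause with an `ARRAY` literal expanded by `UNNEST`:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

# A WITH clause and an ARRAY literal, both standard SQL features.
sql = "WITH pronouns AS (SELECT * FROM UNNEST(['me', 'I', 'you']) AS word) " \
      "SELECT s.word, SUM(s.word_count) AS word_count " \
      "FROM `bigquery-public-data.samples.shakespeare` AS s " \
      "JOIN pronouns AS p ON s.word = p.word " \
      "GROUP BY s.word"
data = bigquery.query sql
data.each { |row| puts "#{row[:word]}: #{row[:word_count]}" }
```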
Legacy SQL (formerly BigQuery SQL)
Before version 2.0, BigQuery executed queries using a non-standard SQL dialect known as BigQuery SQL. This variant is optional, and can be enabled by passing the flag `legacy_sql: true` with your query. (If you get an SQL syntax error with a query that may be written in legacy SQL, be sure that you are passing this option.)
To use legacy SQL, pass the option `legacy_sql: true` with your query:
require"google/cloud/bigquery"bigquery=Google::Cloud::Bigquery.newsql="SELECT TOP(word, 50) as word, COUNT(*) as count "\"FROM [bigquery-public-data:samples.shakespeare]"data=bigquery.querysql,legacy_sql:true
Notice that in legacy SQL, a fully-qualified table name uses brackets instead of back-ticks, and a colon instead of a dot to separate the project and the dataset: `[my-dashed-project:dataset1.tableName]`.
Query parameters
With standard SQL, you can use positional or named query parameters. This example shows the use of named parameters:
require"google/cloud/bigquery"bigquery=Google::Cloud::Bigquery.newsql="SELECT word, SUM(word_count) AS word_count "\"FROM `bigquery-public-data.samples.shakespeare`"\"WHERE word IN UNNEST(@words) GROUP BY word"data=bigquery.querysql,params:{words:['me','I','you']}
As demonstrated above, passing the `params` option will automatically set `standard_sql` to `true`.
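Positional parameters follow the same pattern, with `?` placeholders in the SQL and the values passed as an array (a brief sketch modeled on the named-parameter example above):

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

# Positional parameters are referenced with "?" and supplied in order.
sql = "SELECT word, SUM(word_count) AS word_count " \
      "FROM `bigquery-public-data.samples.shakespeare` " \
      "WHERE word = ? GROUP BY word"
data = bigquery.query sql, params: ["hamlet"]
```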
Data types
BigQuery standard SQL supports simple data types such as integers, as well as more complex types such as `ARRAY` and `STRUCT`.
The BigQuery data types are converted to and from Ruby types as follows:
| BigQuery | Ruby | Notes |
|---|---|---|
| `BOOL` | `true`/`false` | |
| `INT64` | `Integer` | |
| `FLOAT64` | `Float` | |
| `NUMERIC` | `BigDecimal` | `BigDecimal` values will be rounded to scale 9. |
| `BIGNUMERIC` | converted to `BigDecimal` | Pass data as `String` and map query param values in `types`. |
| `STRING` | `String` | |
| `DATETIME` | `DateTime` | `DATETIME` does not support time zone. |
| `DATE` | `Date` | |
| `TIMESTAMP` | `Time` | |
| `TIME` | `Google::Cloud::BigQuery::Time` | |
| `BYTES` | `File`, `IO`, `StringIO`, or similar | |
| `ARRAY` | `Array` | Nested arrays and `nil` values are not supported. |
| `STRUCT` | `Hash` | Hash keys may be strings or symbols. |
See Data Types for an overview of each BigQuery data type, including allowed values.
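To see a few of these conversions in action, here is a small sketch (not part of the original guide; only `bigquery.query` is assumed) that selects literal values and inspects the Ruby objects that come back:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

# Literal values exercise several of the conversions in the table above.
row = bigquery.query(
  "SELECT 42 AS num, 3.14 AS pi, [1, 2, 3] AS arr, STRUCT('a' AS letter) AS rec"
).first

row[:num].class #=> Integer
row[:pi].class  #=> Float
row[:arr]       #=> [1, 2, 3]
row[:rec]       #=> {:letter=>"a"}
```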
Running Queries
Let's start with the simplest way to run a query. Notice that this time you are connecting using your own default project. It is necessary to have write access to the project for running a query, since queries need to create tables to hold results.
require"google/cloud/bigquery"bigquery=Google::Cloud::Bigquery.newsql="SELECT APPROX_TOP_COUNT(corpus, 10) as title, "\"COUNT(*) as unique_words "\"FROM `bigquery-public-data.samples.shakespeare`"data=bigquery.querysqldata.next?#=> falsedata.first#=> {:title=>[{:value=>"hamlet", :count=>5318}, ...}
The `APPROX_TOP_COUNT` function shown above is just one of a variety of functions offered by BigQuery. See the Query Reference (standard SQL) for a full listing.
Query Jobs
It is usually best not to block for most BigQuery operations, including querying as well as importing, exporting, and copying data. Therefore, the BigQuery API provides facilities for managing longer-running jobs. With this approach, an instance of `Google::Cloud::Bigquery::QueryJob` is returned, rather than an instance of `Google::Cloud::Bigquery::Data`.
require"google/cloud/bigquery"bigquery=Google::Cloud::Bigquery.newsql="SELECT APPROX_TOP_COUNT(corpus, 10) as title, "\"COUNT(*) as unique_words "\"FROM `bigquery-public-data.samples.shakespeare`"job=bigquery.query_jobsqljob.wait_until_done!if!job.failed?job.data.first#=> {:title=>[{:value=>"hamlet", :count=>5318}, ...}end
Once you have determined that the job is done and has not failed, you can obtain an instance of `Google::Cloud::Bigquery::Data` by calling `data` on the job instance. The query results for both of the above examples are stored in temporary tables with a lifetime of about 24 hours. See the final example below for a demonstration of how to store query results in a permanent table.
Creating Datasets and Tables
The first thing you need to do in a new BigQuery project is to create a `Google::Cloud::Bigquery::Dataset`. Datasets hold tables and control access to them.
require"google/cloud/bigquery"bigquery=Google::Cloud::Bigquery.newdataset=bigquery.create_dataset"my_dataset"
Now that you have a dataset, you can use it to create a table. Every table is defined by a schema that may contain nested and repeated fields. The example below shows a schema with a repeated record field named `cities_lived`. (For more information about nested and repeated fields, see Using Nested and Repeated Fields.)
require"google/cloud/bigquery"bigquery=Google::Cloud::Bigquery.newdataset=bigquery.dataset"my_dataset"table=dataset.create_table"people"do|schema|schema.string"first_name",mode::requiredschema.record"cities_lived",mode::repeateddo|nested_schema|nested_schema.string"place",mode::requirednested_schema.integer"number_of_years",mode::requiredendend
Because of the repeated field in this schema, we cannot use the CSV format toload data into the table.
Loading records
To follow along with these examples, you will need to set up billing on the Google Developers Console.
In addition to CSV, data can be imported from files that are formatted as Newline-delimited JSON, Avro, ORC, or Parquet, or from a Google Cloud Datastore backup. It can also be "streamed" into BigQuery.
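For example, because the `people` table defined above includes a repeated record, you could load it from a newline-delimited JSON file rather than CSV. This is a sketch assuming a local file named `people.json` whose lines each contain one JSON object matching the table schema:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset"
table = dataset.table "people"

# Each line of people.json is one JSON object matching the table schema.
file = File.open "people.json"
table.load file, format: "json"
```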
Streaming records
For situations in which you want new data to be available for querying as soon as possible, inserting individual records directly from your Ruby application is a great approach.
require"google/cloud/bigquery"bigquery=Google::Cloud::Bigquery.newdataset=bigquery.dataset"my_dataset"table=dataset.table"people"rows=[{"first_name"=>"Anna","cities_lived"=>[{"place"=>"Stockholm","number_of_years"=>2}]},{"first_name"=>"Bob","cities_lived"=>[{"place"=>"Seattle","number_of_years"=>5},{"place"=>"Austin","number_of_years"=>6}]}]table.insertrows
To avoid making RPCs (network requests) to retrieve the dataset and table resources when streaming records, pass the `skip_lookup` option. This creates local objects without verifying that the resources exist on the BigQuery service.
require"google/cloud/bigquery"bigquery=Google::Cloud::Bigquery.newdataset=bigquery.dataset"my_dataset",skip_lookup:truetable=dataset.table"people",skip_lookup:truerows=[{"first_name"=>"Anna","cities_lived"=>[{"place"=>"Stockholm","number_of_years"=>2}]},{"first_name"=>"Bob","cities_lived"=>[{"place"=>"Seattle","number_of_years"=>5},{"place"=>"Austin","number_of_years"=>6}]}]table.insertrows
There are some trade-offs involved with streaming, so be sure to read the discussion of data consistency in Streaming Data Into BigQuery.
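One such trade-off is that de-duplication of retried inserts is only best-effort. Recent versions of the gem let you supply your own insert IDs to help with this; the following is a sketch assuming the `insert_ids` option is available in your version of the library:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset"
table = dataset.table "people"

rows = [
  { "first_name" => "Carla", "cities_lived" => [] },
  { "first_name" => "Dave",  "cities_lived" => [] }
]

# One ID per row; BigQuery uses these for best-effort de-duplication
# if the same insert is retried. (Assumes insert_ids is supported.)
table.insert rows, insert_ids: ["carla-0001", "dave-0001"]
```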
Uploading a file
To follow along with this example, please download the names.zip archive from https://www.ssa.gov/OACT/babynames/names.zip, sourced from the U.S. Social Security Administration. Inside the archive you will find over 100 files containing baby name records since the year 1880.
require"google/cloud/bigquery"bigquery=Google::Cloud::Bigquery.newdataset=bigquery.dataset"my_dataset"table=dataset.create_table"baby_names"do|schema|schema.string"name",mode::requiredschema.string"gender",mode::requiredschema.integer"count",mode::requiredendfile=File.open"names/yob2014.txt"table.loadfile,format:"csv"
Because the names data, although formatted as CSV, is distributed in files with a `.txt` extension, this example explicitly passes the `format` option in order to demonstrate how to handle such situations. Because CSV is the default format for load operations, the option is not actually necessary. For JSON saved with a `.txt` extension, however, it would be.
Exporting query results to Google Cloud Storage
The example below shows how to pass the `table` option with a query in order to store results in a permanent table. It also shows how to export the result data to a Google Cloud Storage file. In order to follow along, you will need to enable the Google Cloud Storage API in addition to setting up billing.
require"google/cloud/bigquery"bigquery=Google::Cloud::Bigquery.newdataset=bigquery.dataset"my_dataset"source_table=dataset.table"baby_names"result_table=dataset.create_table"baby_names_results"sql="SELECT name, count "\"FROM baby_names "\"WHERE gender = 'M' "\"ORDER BY count ASC LIMIT 5"query_job=dataset.query_jobsql,table:result_tablequery_job.wait_until_done!if!query_job.failed?require"google/cloud/storage"storage=Google::Cloud::Storage.newbucket_id="bigquery-exports-#{SecureRandom.uuid}"bucket=storage.create_bucketbucket_idextract_url="gs://#{bucket.id}/baby-names.csv"result_table.extractextract_url# Download to local filesystembucket.files.first.download"baby-names.csv"end
If a table you wish to export contains a large amount of data, you can pass a wildcard URI to export to multiple files (for sharding), or an array of URIs (for partitioning), or both. See Exporting Data for details.
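For instance, appending a `*` wildcard to the destination URI tells BigQuery to shard the export across as many files as it needs. This is a brief sketch; the bucket name `my-export-bucket` is a placeholder:

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new
dataset = bigquery.dataset "my_dataset"
table = dataset.table "baby_names_results"

# The "*" wildcard is replaced with a file number so the export
# can be split across multiple CSV files.
table.extract "gs://my-export-bucket/baby-names-*.csv"
```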
Configuring retries and timeout
You can configure how many times API requests may be automatically retried. When an API request fails, the response will be inspected to see if the request meets criteria indicating that it may succeed on retry, such as `500` and `503` status codes or a specific internal error code such as `rateLimitExceeded`. If it meets the criteria, the request will be retried after a delay. If another error occurs, the delay will be increased before a subsequent attempt, until the retries limit is reached.
You can also set the request `timeout` value in seconds.
require"google/cloud/bigquery"bigquery=Google::Cloud::Bigquery.newretries:10,timeout:120
See the BigQuery error table for a list of error conditions.
Additional information
Google BigQuery can be configured to use logging. To learn more, see the Logging guide.
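For instance, since this library issues its API calls through the google-api-client gem, one way to surface request logging is to configure that gem's logger. This is a sketch based on that assumption; consult the Logging guide for the supported configuration:

```ruby
require "logger"
require "google/cloud/bigquery"

# Route the underlying API client's logging to STDOUT at DEBUG level.
# (Assumes the gem's HTTP layer honors Google::Apis.logger.)
Google::Apis.logger = Logger.new $stdout
Google::Apis.logger.level = Logger::DEBUG

bigquery = Google::Cloud::Bigquery.new
```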