# sparkbq

Sparklyr extension package to connect to Google BigQuery
`sparkbq` is a `sparklyr` extension package providing an integration with Google BigQuery. It builds on top of `spark-bigquery`, which provides a Google BigQuery data source to Apache Spark.
You can install the released version of `sparkbq` from CRAN via

``` r
install.packages("sparkbq")
```

or the latest development version through

``` r
devtools::install_github("miraisolutions/sparkbq", ref = "develop")
```
The following table provides an overview of the supported versions of Apache Spark, Scala, and Google Dataproc:
| sparkbq | spark-bigquery | Apache Spark | Scala | Google Dataproc |
|---|---|---|---|---|
| 0.1.x | 0.1.0 | 2.2.x and 2.3.x | 2.11 | 1.2.x and 1.3.x |
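To put the compatibility table into practice, the local Spark installation used by `sparklyr` can be pinned to a release supported by `sparkbq` 0.1.x. A minimal sketch; the specific version `2.3.0` below is an assumption matching the 2.3.x row above:

``` r
library(sparklyr)

# Install and connect to a Spark release compatible with sparkbq 0.1.x
# (2.2.x or 2.3.x according to the table above)
spark_install(version = "2.3.0")
sc <- spark_connect(master = "local[*]", version = "2.3.0")
```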
`sparkbq` is based on the Spark package `spark-bigquery`, which is available in a separate GitHub repository.
``` r
library(sparklyr)
library(sparkbq)
library(dplyr)

config <- spark_config()

sc <- spark_connect(master = "local[*]", config = config)

# Set Google BigQuery default settings
bigquery_defaults(
  billingProjectId = "<your_billing_project_id>",
  gcsBucket = "<your_gcs_bucket>",
  datasetLocation = "US",
  serviceAccountKeyFile = "<your_service_account_key_file>",
  type = "direct"
)

# Reading the public shakespeare data table
# https://cloud.google.com/bigquery/public-data/
# https://cloud.google.com/bigquery/sample-tables
hamlet <- spark_read_bigquery(
  sc,
  name = "hamlet",
  projectId = "bigquery-public-data",
  datasetId = "samples",
  tableId = "shakespeare"
) %>%
  filter(corpus == "hamlet") # NOTE: predicate pushdown to BigQuery!

# Retrieve results into a local tibble
hamlet %>% collect()

# Write result into "mysamples" dataset in our BigQuery (billing) project
spark_write_bigquery(
  hamlet,
  datasetId = "mysamples",
  tableId = "hamlet",
  mode = "overwrite"
)
```
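Once the table is registered, further dplyr verbs are translated by sparklyr and executed on Spark rather than locally. A sketch building on the example above; it assumes the connection `sc` and the `word`/`word_count` columns of the public shakespeare sample table:

``` r
library(dplyr)

# Count total occurrences of each word in Hamlet; the pipeline is
# translated to a Spark query and only evaluated on collect()
word_counts <- hamlet %>%
  group_by(word) %>%
  summarise(total = sum(word_count, na.rm = TRUE)) %>%
  arrange(desc(total))

# Retrieve the ten most frequent words into a local tibble
word_counts %>% head(10) %>% collect()
```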
When running outside of Google Cloud it is necessary to specify a service account JSON key file. The service account key file can be passed as parameter `serviceAccountKeyFile` to `bigquery_defaults` or directly to `spark_read_bigquery` and `spark_write_bigquery`.
Alternatively, an environment variable

``` sh
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/service_account_keyfile.json
```

can be set (see https://cloud.google.com/docs/authentication/getting-started for more information). Make sure the variable is set before starting the R session.
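When exporting the variable in the shell is inconvenient, it can also be set from within R, as long as this happens before the Spark connection is established. A sketch; the key file path is a placeholder:

``` r
# Set the Google application credentials from R, before spark_connect()
Sys.setenv(
  GOOGLE_APPLICATION_CREDENTIALS = "/path/to/your/service_account_keyfile.json"
)
```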
When running on Google Cloud, e.g. Google Cloud Dataproc, application default credentials (ADC) may be used in which case it is not necessary to specify a service account key file.
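In that case the defaults can be set without a key file. A minimal sketch, assuming ADC is available on the cluster and that `serviceAccountKeyFile` may simply be omitted from `bigquery_defaults`:

``` r
# On Google Cloud (e.g. Dataproc), application default credentials
# are picked up automatically; no serviceAccountKeyFile is passed
bigquery_defaults(
  billingProjectId = "<your_billing_project_id>",
  gcsBucket = "<your_gcs_bucket>",
  datasetLocation = "US",
  type = "direct"
)
```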