Connect to Apache Spark

As a BigQuery administrator, you can create a connection to enable data analysts to run stored procedures for Apache Spark.

Before you begin

Location considerations

When you choose a location for your data, consider the following:

Multi-regions

You must specify Google Cloud resources located in the same large geographic area:

  • A connection in the BigQuery US multi-region can reference a Spark History Server or a Dataproc Metastore in any single region in the US geographic area, such as us-central1, us-east4, or us-west2.

  • A connection in the BigQuery EU multi-region can reference a Spark History Server or a Dataproc Metastore in member states of the European Union, such as europe-north1 or europe-west3.

Single regions

A connection in a single region can only reference Google Cloud resources in the same region. For example, a connection in the single region us-east4 can only reference a Spark History Server or a Dataproc Metastore in us-east4.
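As an illustration of that constraint, the following sketch pairs a connection in us-east4 with a metastore in us-east4, using the bq mk command that is described in detail later on this page. The project, metastore, and connection names are hypothetical placeholders:

    # Hypothetical names; replace with your own project, metastore service, and connection ID.
    bq mk --connection --connection_type='SPARK' \
        --properties='{"metastoreServiceConfig": {"metastoreService": "projects/my-project/locations/us-east4/services/my-metastore"}}' \
        --project_id=my-project \
        --location=us-east4 \
        spark_connection_east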

Create connections

Select one of the following options:

Console

  1. Go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, click Add data.

    The Add data dialog opens.

  3. In the Filter By pane, in the Data Source Type section, select Business Applications.

    Alternatively, in the Search for data sources field, you can enter Spark.

  4. In the Featured data sources section, click Apache Spark.

  5. Click the Apache Spark: BigQuery Federation solution card.

  6. In the External data source pane, enter the following information:

    • In the Connection type list, select Apache Spark.

    • In the Connection ID field, enter a name for your connection, for example, spark_connection.

    • In the Data location list, select a region.

    You can create a connection in regions and multi-regions that support BigQuery. For more information, see Location considerations.

  7. Click Create connection.

  8. Click Go to connection.

  9. In the Connection info pane, copy the service account ID for use in a following step.

bq

  1. In a command-line environment, use the bq mk command to create a connection:

    bq mk --connection --connection_type='SPARK' \
        --properties=PROPERTIES \
        --project_id=PROJECT_ID \
        --location=LOCATION \
        CONNECTION_ID

    Replace the following:

    • PROPERTIES: a key-value pair to provide connection-specific parameters in JSON format

      For example:

      --properties='{"metastoreServiceConfig": {"metastoreService": "METASTORE_SERVICE_NAME"}, "sparkHistoryServerConfig": {"dataprocCluster": "DATAPROC_CLUSTER_NAME"}}'

      Replace the following:

      • METASTORE_SERVICE_NAME: the name of your Dataproc Metastore service
      • DATAPROC_CLUSTER_NAME: the name of the Dataproc cluster that runs your Spark History Server

    • PROJECT_ID: your Google Cloud project ID

    • LOCATION: the location where you want to store the connection, for example, US

    • CONNECTION_ID: the connection ID, for example, myconnection

      When you view the connection details in the Google Cloud console, the connection ID is the value in the last section of the fully qualified connection ID that is shown in Connection ID, for example, projects/.../locations/.../connections/myconnection

  2. Retrieve and copy the service account ID because you need it in another step:

    bq show --location=LOCATION --connection PROJECT_ID.LOCATION.CONNECTION_ID

    The output is similar to the following:

    Connection myproject.us.myconnection

               name             type                         properties
    --------------------------- ------- ---------------------------------------------------
     myproject.us.myconnection   SPARK   {"serviceAccountId": "bqserver@example.iam.gserviceaccount.com"}
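If you want to capture the service account ID in a script instead of copying it from the table, you can request JSON output from bq and parse it. The following sketch assumes that jq is installed and that a Spark connection reports the ID under the spark.serviceAccountId field; inspect the prettyjson output first to confirm the field layout for your connection:

    # Print the full connection resource as JSON to confirm the field names.
    bq show --format=prettyjson --location=LOCATION --connection PROJECT_ID.LOCATION.CONNECTION_ID

    # Extract only the service account ID (assumed JSON path for a Spark connection).
    bq show --format=json --location=LOCATION --connection PROJECT_ID.LOCATION.CONNECTION_ID \
        | jq -r '.spark.serviceAccountId'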

For information about how to manage connections, see Manage connections.

Grant access to the service account

To let a stored procedure for Apache Spark access your Google Cloud resources, you need to grant the service account that's associated with the stored procedure's connection the necessary IAM permissions. Alternatively, you can use your custom service account for data access. A gcloud sketch of these grants follows the list below.

  • To read and write data from and to BigQuery, you need to give the service account the following IAM permissions:

    • bigquery.tables.* on your BigQuery tables
    • bigquery.readsessions.* on your project

    The roles/bigquery.admin IAM role includes the permissions that the service account needs in order to read and write data from and to BigQuery.

    Note: If your stored procedure writes data to a temporary Cloud Storage bucket and then loads the Cloud Storage data to BigQuery, then you need to give the service account the bigquery.jobs.create permission on your project. For more information about IAM roles and permissions in BigQuery, see Access control with IAM.
  • To read and write data from and to Cloud Storage, you need to give the service account the storage.objects.* permission on your Cloud Storage objects.

    The roles/storage.objectAdmin IAM role includes the permissions that the service account needs in order to read and write data from and to Cloud Storage.

  • If you specify Dataproc Metastore when you create a connection, then for BigQuery to retrieve details about the metastore configuration, you need to give the service account the metastore.services.get permission on your Dataproc Metastore.

    The predefined roles/metastore.metadataViewer role includes the permission that the service account needs in order to retrieve details about the metastore configuration.

    You also need to grant the service account the roles/storage.objectAdmin role on the Cloud Storage bucket so that your stored procedure can access the Hive warehouse directory of your Dataproc Metastore (hive.metastore.warehouse.dir). If your stored procedure performs operations on the metastore, you might need to give additional permissions. For more information about IAM roles and permissions in Dataproc Metastore, see Dataproc Metastore predefined roles and permissions.

  • If you specify a Dataproc Persistent History Server when you create a connection, then you need to grant the service account the following roles:

    • The roles/dataproc.viewer role on your Dataproc Persistent History Server that contains the dataproc.clusters.get permission.
    • The roles/storage.objectAdmin role on the Cloud Storage bucket that you specify for the property spark:spark.history.fs.logDirectory when you create the Dataproc Persistent History Server.

    For more information, see Dataproc Persistent History Server and Dataproc roles and permissions.
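The following sketch shows these grants with the gcloud CLI. The service account, project, bucket, metastore service, and region are hypothetical placeholders, and the broad predefined roles come straight from the list above, so narrow them to fit your own security requirements:

    # Service account ID copied from the connection details (hypothetical value).
    SA="bqserver@example.iam.gserviceaccount.com"

    # Read and write BigQuery data (covers bigquery.tables.*, bigquery.readsessions.*,
    # and bigquery.jobs.create).
    gcloud projects add-iam-policy-binding my-project \
        --member="serviceAccount:${SA}" \
        --role="roles/bigquery.admin"

    # Read and write objects in the Cloud Storage bucket that the procedure uses.
    gcloud storage buckets add-iam-policy-binding gs://my-spark-bucket \
        --member="serviceAccount:${SA}" \
        --role="roles/storage.objectAdmin"

    # Let BigQuery read the Dataproc Metastore configuration, if you specified one.
    gcloud metastore services add-iam-policy-binding my-metastore \
        --location=us-central1 \
        --member="serviceAccount:${SA}" \
        --role="roles/metastore.metadataViewer"

    # For a Dataproc Persistent History Server, also grant roles/dataproc.viewer on
    # the cluster, for example in the Google Cloud console.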

Share connections with users

You can grant the following roles to let users query data and manage connections:

  • roles/bigquery.connectionUser: enables users to use connections to connect with external data sources and run queries on them.

  • roles/bigquery.connectionAdmin: enables users to manage connections.

For more information about IAM roles and permissions in BigQuery, see Predefined roles and permissions.
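You can also grant these roles at the project level, which applies them to every connection in the project, instead of sharing a single connection as the following options describe. A sketch with a hypothetical project and group:

    # Hypothetical project and group; replace with your own values.
    gcloud projects add-iam-policy-binding my-project \
        --member="group:example-analyst-group@google.com" \
        --role="roles/bigquery.connectionUser"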

Select one of the following options:

Console

  1. Go to the BigQuery page.

    Go to BigQuery

    Connections are listed in your project, in a group called Connections.

  2. In the left pane, click Explorer.

    If you don't see the left pane, click Expand left pane to open the pane.

  3. Click your project, click Connections, and then select a connection.

  4. In the Details pane, click Share to share a connection. Then do the following:

    1. In the Connection permissions dialog, share the connection with other principals by adding or editing principals.

    2. Click Save.

bq

You cannot share a connection with the bq command-line tool. To share a connection, use the Google Cloud console or the BigQuery Connections API.

API

Use the projects.locations.connections.setIamPolicy method in the BigQuery Connections REST API reference section, and supply an instance of the policy resource.
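As a sketch of what that call can look like from the command line, the following assumes the v1 endpoint of the BigQuery Connection API and uses gcloud for the access token; the group is a hypothetical placeholder. Note that setIamPolicy replaces the existing policy, so in practice read the current policy first and modify it:

    # Hypothetical group; setIamPolicy overwrites the connection's existing policy.
    curl -X POST \
        -H "Authorization: Bearer $(gcloud auth print-access-token)" \
        -H "Content-Type: application/json" \
        "https://bigqueryconnection.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/connections/CONNECTION_ID:setIamPolicy" \
        -d '{"policy": {"bindings": [{"role": "roles/bigquery.connectionUser", "members": ["group:example-analyst-group@google.com"]}]}}'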

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import com.google.api.resourcenames.ResourceName;
import com.google.cloud.bigquery.connection.v1.ConnectionName;
import com.google.cloud.bigqueryconnection.v1.ConnectionServiceClient;
import com.google.iam.v1.Binding;
import com.google.iam.v1.Policy;
import com.google.iam.v1.SetIamPolicyRequest;
import java.io.IOException;

// Sample to share connections
public class ShareConnection {

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "MY_PROJECT_ID";
    String location = "MY_LOCATION";
    String connectionId = "MY_CONNECTION_ID";
    shareConnection(projectId, location, connectionId);
  }

  static void shareConnection(String projectId, String location, String connectionId)
      throws IOException {
    try (ConnectionServiceClient client = ConnectionServiceClient.create()) {
      ResourceName resource = ConnectionName.of(projectId, location, connectionId);
      Binding binding =
          Binding.newBuilder()
              .addMembers("group:example-analyst-group@google.com")
              .setRole("roles/bigquery.connectionUser")
              .build();
      Policy policy = Policy.newBuilder().addBindings(binding).build();
      SetIamPolicyRequest request =
          SetIamPolicyRequest.newBuilder()
              .setResource(resource.toString())
              .setPolicy(policy)
              .build();
      client.setIamPolicy(request);
      System.out.println("Connection shared successfully");
    }
  }
}

What's next
