Create a Dataproc cluster by using client libraries

The sample code listed below shows you how to use the Cloud Client Libraries to create a Dataproc cluster, run a job on the cluster, and then delete the cluster.

You can also perform these tasks by using the Google Cloud console or the Google Cloud CLI.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
    Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. Enable the Dataproc API.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    Enable the API

Run the code

Try the walkthrough: Click Open in Cloud Shell to run a Python Cloud Client Libraries walkthrough that creates a cluster, runs a PySpark job, and then deletes the cluster.

Open in Cloud Shell

Go

  1. Install the client library. For more information, see Setting up your development environment.
  2. Set up authentication.
  3. Clone and run the sample GitHub code. Selecting a PySpark input file: The Google Client Library code samples below run a PySpark job that you specify as an input parameter. You can pass in the simple "Hello World" PySpark app located in Cloud Storage at gs://dataproc-examples/pyspark/hello-world/hello-world.py or your own PySpark app. See Uploading objects to learn more about uploading files to Cloud Storage.
  4. View the output. The code outputs the job driver log to the default Dataproc staging bucket in Cloud Storage. You can view job driver output from the Google Cloud console in your project's Dataproc Jobs section. Click the Job ID to view job output on the Job details page.

// This quickstart shows how you can use the Dataproc Client library to create a
// Dataproc cluster, submit a PySpark job to the cluster, wait for the job to finish
// and finally delete the cluster.
//
// Usage:
//
//	go build
//	./quickstart --project_id <PROJECT_ID> --region <REGION> \
//	    --cluster_name <CLUSTER_NAME> --job_file_path <GCS_JOB_FILE_PATH>
package main

import (
	"context"
	"flag"
	"fmt"
	"io"
	"log"
	"regexp"

	dataproc "cloud.google.com/go/dataproc/apiv1"
	"cloud.google.com/go/dataproc/apiv1/dataprocpb"
	"cloud.google.com/go/storage"
	"google.golang.org/api/option"
)

func main() {
	var projectID, clusterName, region, jobFilePath string
	flag.StringVar(&projectID, "project_id", "", "Cloud Project ID, used for creating resources.")
	flag.StringVar(&region, "region", "", "Region that resources should be created in.")
	flag.StringVar(&clusterName, "cluster_name", "", "Name of Cloud Dataproc cluster to create.")
	flag.StringVar(&jobFilePath, "job_file_path", "", "Path to job file in GCS.")
	flag.Parse()

	ctx := context.Background()

	// Create the cluster client.
	endpoint := fmt.Sprintf("%s-dataproc.googleapis.com:443", region)
	clusterClient, err := dataproc.NewClusterControllerClient(ctx, option.WithEndpoint(endpoint))
	if err != nil {
		log.Fatalf("error creating the cluster client: %s\n", err)
	}

	// Create the cluster config.
	createReq := &dataprocpb.CreateClusterRequest{
		ProjectId: projectID,
		Region:    region,
		Cluster: &dataprocpb.Cluster{
			ProjectId:   projectID,
			ClusterName: clusterName,
			Config: &dataprocpb.ClusterConfig{
				MasterConfig: &dataprocpb.InstanceGroupConfig{
					NumInstances:   1,
					MachineTypeUri: "n1-standard-2",
				},
				WorkerConfig: &dataprocpb.InstanceGroupConfig{
					NumInstances:   2,
					MachineTypeUri: "n1-standard-2",
				},
			},
		},
	}

	// Create the cluster.
	createOp, err := clusterClient.CreateCluster(ctx, createReq)
	if err != nil {
		log.Fatalf("error submitting the cluster creation request: %v\n", err)
	}

	createResp, err := createOp.Wait(ctx)
	if err != nil {
		log.Fatalf("error creating the cluster: %v\n", err)
	}

	// Defer cluster deletion.
	defer func() {
		dReq := &dataprocpb.DeleteClusterRequest{
			ProjectId:   projectID,
			Region:      region,
			ClusterName: clusterName,
		}
		deleteOp, err := clusterClient.DeleteCluster(ctx, dReq)
		deleteOp.Wait(ctx)
		if err != nil {
			fmt.Printf("error deleting cluster %q: %v\n", clusterName, err)
			return
		}
		fmt.Printf("Cluster %q successfully deleted\n", clusterName)
	}()

	// Output a success message.
	fmt.Printf("Cluster created successfully: %q\n", createResp.ClusterName)

	// Create the job client.
	jobClient, err := dataproc.NewJobControllerClient(ctx, option.WithEndpoint(endpoint))

	// Create the job config.
	submitJobReq := &dataprocpb.SubmitJobRequest{
		ProjectId: projectID,
		Region:    region,
		Job: &dataprocpb.Job{
			Placement: &dataprocpb.JobPlacement{
				ClusterName: clusterName,
			},
			TypeJob: &dataprocpb.Job_PysparkJob{
				PysparkJob: &dataprocpb.PySparkJob{
					MainPythonFileUri: jobFilePath,
				},
			},
		},
	}

	submitJobOp, err := jobClient.SubmitJobAsOperation(ctx, submitJobReq)
	if err != nil {
		fmt.Printf("error with request to submitting job: %v\n", err)
		return
	}

	submitJobResp, err := submitJobOp.Wait(ctx)
	if err != nil {
		fmt.Printf("error submitting job: %v\n", err)
		return
	}

	re := regexp.MustCompile("gs://(.+?)/(.+)")
	matches := re.FindStringSubmatch(submitJobResp.DriverOutputResourceUri)

	if len(matches) < 3 {
		fmt.Printf("regex error: %s\n", submitJobResp.DriverOutputResourceUri)
		return
	}

	// Dataproc job output gets saved to a GCS bucket allocated to it.
	storageClient, err := storage.NewClient(ctx)
	if err != nil {
		fmt.Printf("error creating storage client: %v\n", err)
		return
	}

	obj := fmt.Sprintf("%s.000000000", matches[2])
	reader, err := storageClient.Bucket(matches[1]).Object(obj).NewReader(ctx)
	if err != nil {
		fmt.Printf("error reading job output: %v\n", err)
		return
	}

	defer reader.Close()

	body, err := io.ReadAll(reader)
	if err != nil {
		fmt.Printf("could not read output from Dataproc Job: %v\n", err)
		return
	}

	fmt.Printf("Job finished successfully: %s", body)
}

Java

  1. Install the client library. For more information, see Setting Up a Java Development Environment.
  2. Set up authentication.
  3. Clone and run the sample GitHub code. Selecting a PySpark input file: The Google Client Library code samples below run a PySpark job that you specify as an input parameter. You can pass in the simple "Hello World" PySpark app located in Cloud Storage at gs://dataproc-examples/pyspark/hello-world/hello-world.py or your own PySpark app. See Uploading objects to learn more about uploading files to Cloud Storage.
  4. View the output. The code outputs the job driver log to the default Dataproc staging bucket in Cloud Storage. You can view job driver output from the Google Cloud console in your project's Dataproc Jobs section. Click the Job ID to view job output on the Job details page.

/* This quickstart sample walks a user through creating a Cloud Dataproc
 * cluster, submitting a PySpark job from Google Cloud Storage to the
 * cluster, reading the output of the job and deleting the cluster, all
 * using the Java client library.
 *
 * Usage:
 *     mvn clean package -DskipTests
 *
 *     mvn exec:java -Dexec.args="<PROJECT_ID> <REGION> <CLUSTER_NAME> <GCS_JOB_FILE_PATH>"
 *
 *     You can also set these arguments in the main function instead of providing them via the CLI.
 */

import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.dataproc.v1.Cluster;
import com.google.cloud.dataproc.v1.ClusterConfig;
import com.google.cloud.dataproc.v1.ClusterControllerClient;
import com.google.cloud.dataproc.v1.ClusterControllerSettings;
import com.google.cloud.dataproc.v1.ClusterOperationMetadata;
import com.google.cloud.dataproc.v1.InstanceGroupConfig;
import com.google.cloud.dataproc.v1.Job;
import com.google.cloud.dataproc.v1.JobControllerClient;
import com.google.cloud.dataproc.v1.JobControllerSettings;
import com.google.cloud.dataproc.v1.JobMetadata;
import com.google.cloud.dataproc.v1.JobPlacement;
import com.google.cloud.dataproc.v1.PySparkJob;
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import com.google.protobuf.Empty;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Quickstart {

  public static void quickstart(
      String projectId, String region, String clusterName, String jobFilePath)
      throws IOException, InterruptedException {
    String myEndpoint = String.format("%s-dataproc.googleapis.com:443", region);

    // Configure the settings for the cluster controller client.
    ClusterControllerSettings clusterControllerSettings =
        ClusterControllerSettings.newBuilder().setEndpoint(myEndpoint).build();

    // Configure the settings for the job controller client.
    JobControllerSettings jobControllerSettings =
        JobControllerSettings.newBuilder().setEndpoint(myEndpoint).build();

    // Create both a cluster controller client and job controller client with the
    // configured settings. The client only needs to be created once and can be reused for
    // multiple requests. Using a try-with-resources closes the client, but this can also be done
    // manually with the .close() method.
    try (ClusterControllerClient clusterControllerClient =
            ClusterControllerClient.create(clusterControllerSettings);
        JobControllerClient jobControllerClient =
            JobControllerClient.create(jobControllerSettings)) {
      // Configure the settings for our cluster.
      InstanceGroupConfig masterConfig =
          InstanceGroupConfig.newBuilder()
              .setMachineTypeUri("n1-standard-2")
              .setNumInstances(1)
              .build();
      InstanceGroupConfig workerConfig =
          InstanceGroupConfig.newBuilder()
              .setMachineTypeUri("n1-standard-2")
              .setNumInstances(2)
              .build();
      ClusterConfig clusterConfig =
          ClusterConfig.newBuilder()
              .setMasterConfig(masterConfig)
              .setWorkerConfig(workerConfig)
              .build();
      // Create the cluster object with the desired cluster config.
      Cluster cluster =
          Cluster.newBuilder().setClusterName(clusterName).setConfig(clusterConfig).build();

      // Create the Cloud Dataproc cluster.
      OperationFuture<Cluster, ClusterOperationMetadata> createClusterAsyncRequest =
          clusterControllerClient.createClusterAsync(projectId, region, cluster);
      Cluster clusterResponse = createClusterAsyncRequest.get();
      System.out.println(
          String.format("Cluster created successfully: %s", clusterResponse.getClusterName()));

      // Configure the settings for our job.
      JobPlacement jobPlacement = JobPlacement.newBuilder().setClusterName(clusterName).build();
      PySparkJob pySparkJob = PySparkJob.newBuilder().setMainPythonFileUri(jobFilePath).build();
      Job job = Job.newBuilder().setPlacement(jobPlacement).setPysparkJob(pySparkJob).build();

      // Submit an asynchronous request to execute the job.
      OperationFuture<Job, JobMetadata> submitJobAsOperationAsyncRequest =
          jobControllerClient.submitJobAsOperationAsync(projectId, region, job);
      Job jobResponse = submitJobAsOperationAsyncRequest.get();

      // Print output from Google Cloud Storage.
      Matcher matches =
          Pattern.compile("gs://(.*?)/(.*)").matcher(jobResponse.getDriverOutputResourceUri());
      matches.matches();

      Storage storage = StorageOptions.getDefaultInstance().getService();
      Blob blob = storage.get(matches.group(1), String.format("%s.000000000", matches.group(2)));

      System.out.println(
          String.format("Job finished successfully: %s", new String(blob.getContent())));

      // Delete the cluster.
      OperationFuture<Empty, ClusterOperationMetadata> deleteClusterAsyncRequest =
          clusterControllerClient.deleteClusterAsync(projectId, region, clusterName);
      deleteClusterAsyncRequest.get();
      System.out.println(String.format("Cluster \"%s\" successfully deleted.", clusterName));

    } catch (ExecutionException e) {
      System.err.println(String.format("quickstart: %s ", e.getMessage()));
    }
  }

  public static void main(String... args) throws IOException, InterruptedException {
    if (args.length != 4) {
      System.err.println(
          "Insufficient number of parameters provided. Please make sure a "
              + "PROJECT_ID, REGION, CLUSTER_NAME and JOB_FILE_PATH are provided, in this order.");
      return;
    }

    String projectId = args[0]; // project-id of project to create the cluster in
    String region = args[1]; // region to create the cluster
    String clusterName = args[2]; // name of the cluster
    String jobFilePath = args[3]; // location in GCS of the PySpark job
    quickstart(projectId, region, clusterName, jobFilePath);
  }
}

Node.js

  1. Install the client library. For more information, see Setting up a Node.js development environment.
  2. Set up authentication.
  3. Clone and run the sample GitHub code. Selecting a PySpark input file: The Google Client Library code samples below run a PySpark job that you specify as an input parameter. You can pass in the simple "Hello World" PySpark app located in Cloud Storage at gs://dataproc-examples/pyspark/hello-world/hello-world.py or your own PySpark app. See Uploading objects to learn more about uploading files to Cloud Storage.
  4. View the output. The code outputs the job driver log to the default Dataproc staging bucket in Cloud Storage. You can view job driver output from the Google Cloud console in your project's Dataproc Jobs section. Click the Job ID to view job output on the Job details page.

// This quickstart sample walks a user through creating a Dataproc
// cluster, submitting a PySpark job from Google Cloud Storage to the
// cluster, reading the output of the job and deleting the cluster, all
// using the Node.js client library.

'use strict';

function main(projectId, region, clusterName, jobFilePath) {
  const dataproc = require('@google-cloud/dataproc');
  const {Storage} = require('@google-cloud/storage');

  // Create a cluster client with the endpoint set to the desired cluster region
  const clusterClient = new dataproc.v1.ClusterControllerClient({
    apiEndpoint: `${region}-dataproc.googleapis.com`,
    projectId: projectId,
  });

  // Create a job client with the endpoint set to the desired cluster region
  const jobClient = new dataproc.v1.JobControllerClient({
    apiEndpoint: `${region}-dataproc.googleapis.com`,
    projectId: projectId,
  });

  async function quickstart() {
    // Create the cluster config
    const cluster = {
      projectId: projectId,
      region: region,
      cluster: {
        clusterName: clusterName,
        config: {
          masterConfig: {
            numInstances: 1,
            machineTypeUri: 'n1-standard-2',
          },
          workerConfig: {
            numInstances: 2,
            machineTypeUri: 'n1-standard-2',
          },
        },
      },
    };

    // Create the cluster
    const [operation] = await clusterClient.createCluster(cluster);
    const [response] = await operation.promise();

    // Output a success message
    console.log(`Cluster created successfully: ${response.clusterName}`);

    const job = {
      projectId: projectId,
      region: region,
      job: {
        placement: {
          clusterName: clusterName,
        },
        pysparkJob: {
          mainPythonFileUri: jobFilePath,
        },
      },
    };

    const [jobOperation] = await jobClient.submitJobAsOperation(job);
    const [jobResponse] = await jobOperation.promise();

    const matches = jobResponse.driverOutputResourceUri.match('gs://(.*?)/(.*)');

    const storage = new Storage();

    const output = await storage
      .bucket(matches[1])
      .file(`${matches[2]}.000000000`)
      .download();

    // Output a success message.
    console.log(`Job finished successfully: ${output}`);

    // Delete the cluster once the job has terminated.
    const deleteClusterReq = {
      projectId: projectId,
      region: region,
      clusterName: clusterName,
    };

    const [deleteOperation] = await clusterClient.deleteCluster(deleteClusterReq);
    await deleteOperation.promise();

    // Output a success message
    console.log(`Cluster ${clusterName} successfully deleted.`);
  }

  quickstart();
}

const args = process.argv.slice(2);

if (args.length !== 4) {
  console.log(
    'Insufficient number of parameters provided. Please make sure a ' +
      'PROJECT_ID, REGION, CLUSTER_NAME and JOB_FILE_PATH are provided, in this order.'
  );
}

main(...args);

Python

  1. Install the client library. For more information, see Setting Up a Python Development Environment.
  2. Set up authentication. A quick way to confirm that your credentials are available to the client libraries is shown in the sketch after this list.
  3. Clone and run the sample GitHub code. Selecting a PySpark input file: The Google Client Library code samples below run a PySpark job that you specify as an input parameter. You can pass in the simple "Hello World" PySpark app located in Cloud Storage at gs://dataproc-examples/pyspark/hello-world/hello-world.py or your own PySpark app (a minimal sketch of such an app appears after the sample code below). See Uploading objects to learn more about uploading files to Cloud Storage.
  4. View the output. The code outputs the job driver log to the default Dataproc staging bucket in Cloud Storage. You can view job driver output from the Google Cloud console in your project's Dataproc Jobs section. Click the Job ID to view job output on the Job details page.
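
Before running the sample, you can optionally confirm that the client libraries can find your credentials with a quick check like the following. This is a minimal sketch; it assumes you have already configured Application Default Credentials, for example by running gcloud auth application-default login or by pointing GOOGLE_APPLICATION_CREDENTIALS at a service account key file.

import google.auth

# Load Application Default Credentials (ADC). This assumes ADC has already been
# configured, for example with `gcloud auth application-default login` or by
# setting GOOGLE_APPLICATION_CREDENTIALS to a service account key file.
credentials, project_id = google.auth.default()

# project_id can be None if the credentials don't carry a project; the
# quickstart below takes the project explicitly via --project_id, so this is
# only a sanity check that credentials resolve at all.
print(f"Found Application Default Credentials for project: {project_id}")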

"""This quickstart sample walks a user through creating a Cloud Dataproccluster, submitting a PySpark job from Google Cloud Storage to thecluster, reading the output of the job and deleting the cluster, allusing the Python client library.Usage:    python quickstart.py --project_id <PROJECT_ID> --region <REGION> \        --cluster_name <CLUSTER_NAME> --job_file_path <GCS_JOB_FILE_PATH>"""importargparseimportrefromgoogle.cloudimportdataproc_v1asdataprocfromgoogle.cloudimportstoragedefquickstart(project_id,region,cluster_name,job_file_path):# Create the cluster client.cluster_client=dataproc.ClusterControllerClient(client_options={"api_endpoint":"{}-dataproc.googleapis.com:443".format(region)})# Create the cluster config.cluster={"project_id":project_id,"cluster_name":cluster_name,"config":{"master_config":{"num_instances":1,"machine_type_uri":"n1-standard-2","disk_config":{"boot_disk_size_gb":100},},"worker_config":{"num_instances":2,"machine_type_uri":"n1-standard-2","disk_config":{"boot_disk_size_gb":100},},},}# Create the cluster.operation=cluster_client.create_cluster(request={"project_id":project_id,"region":region,"cluster":cluster})result=operation.result()print("Cluster created successfully:{}".format(result.cluster_name))# Create the job client.job_client=dataproc.JobControllerClient(client_options={"api_endpoint":"{}-dataproc.googleapis.com:443".format(region)})# Create the job config.job={"placement":{"cluster_name":cluster_name},"pyspark_job":{"main_python_file_uri":job_file_path},}operation=job_client.submit_job_as_operation(request={"project_id":project_id,"region":region,"job":job})response=operation.result()# Dataproc job output gets saved to the Google Cloud Storage bucket# allocated to the job. Use a regex to obtain the bucket and blob info.matches=re.match("gs://(.*?)/(.*)",response.driver_output_resource_uri)output=(storage.Client().get_bucket(matches.group(1)).blob(f"{matches.group(2)}.000000000").download_as_bytes().decode("utf-8"))print(f"Job finished successfully:{output}")# Delete the cluster once the job has terminated.operation=cluster_client.delete_cluster(request={"project_id":project_id,"region":region,"cluster_name":cluster_name,})operation.result()print("Cluster{} successfully deleted.".format(cluster_name))if__name__=="__main__":parser=argparse.ArgumentParser(description=__doc__,formatter_class=argparse.RawDescriptionHelpFormatter,)parser.add_argument("--project_id",type=str,required=True,help="Project to use for creating resources.",)parser.add_argument("--region",type=str,required=True,help="Region where the resources should live.",)parser.add_argument("--cluster_name",type=str,required=True,help="Name to use for creating a cluster.",)parser.add_argument("--job_file_path",type=str,required=True,help="Job in GCS to execute against the cluster.",)args=parser.parse_args()quickstart(args.project_id,args.region,args.cluster_name,args.job_file_path)

What's next
