Transfer from HDFS to Cloud Storage

Storage Transfer Service supports transfers from cloud and on-premises Hadoop Distributed File System (HDFS) sources.

Transfers from HDFS must use Cloud Storage as the destination.

Use cases include migrating from on-premises storage to Cloud Storage, archiving data to free up on-premises storage space, replicating data to Google Cloud for business continuity, or transferring data to Google Cloud for analysis and processing.

Configure permissions

Before creating a transfer, you must configure permissions for the following entities:

The user account being used to create the transfer. This is the account that is signed in to the Google Cloud console, or the account that is specified when authenticating to the `gcloud` CLI. The user account can be a regular user account, or a user-managed service account.
The Google-managed service account, also known as the service agent, used by Storage Transfer Service. This account is generally identified by its email address, which uses the format project-PROJECT_NUMBER@storage-transfer-service.iam.gserviceaccount.com.
The transfer agent account that provides Google Cloud permissions for transfer agents. Transfer agent accounts use the credentials of the user installing them, or the credentials of a user-managed service account, to authenticate.

See Agent-based transfer permissions for instructions.
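
For example, if you use a user-managed service account as the transfer agent account, the IAM binding might look like the following sketch. This is an illustration only, not the full permissions setup from the guide linked above; the project ID and service account email are placeholders.

# Hypothetical example: grant the Storage Transfer Agent role to a
# user-managed service account that will run the transfer agents.
# Replace my-project and the service account email with your own values.
gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:transfer-agent@my-project.iam.gserviceaccount.com" \
    --role="roles/storagetransfer.transferAgent"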

Install agents into an agent pool

Agent-based transfers use software agents to orchestrate transfers. These agents must be installed on one or more machines with access to your file system. Agents must have access to the namenode, all datanodes, the Hadoop Key Management Server (KMS), and the Kerberos Key Distribution Center (KDC).
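
Before installing agents, it can be useful to confirm that the agent machine can actually reach the cluster. The following is a quick, illustrative check only; the hostname and port are placeholders for your own namenode.

# Hypothetical connectivity check from the machine where the agent will run.
# Replace my-namenode and 8020 with your namenode host and RPC port.
nc -zv my-namenode 8020
# If the Hadoop client is installed, a listing also confirms HDFS access:
hdfs dfs -ls hdfs://my-namenode:8020/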

Transfer agents work together in an agent pool. Increasing the number of agents can increase overall job performance, but this is dependent on several factors.

  • Adding more agents can help, up to about half the number of nodes in your HDFS cluster. For example, with a 30-node cluster, increasing from 5 to 15 agents should improve performance, but beyond 15 is unlikely to make much difference. (A rough sizing sketch follows this list.)

  • For a small HDFS cluster, one agent may be sufficient.

  • Additional agents tend to have a larger impact on performance when a transfer includes a large number of small files. Storage Transfer Service achieves high throughput by parallelizing transfer tasks among multiple agents. The more files in the workload, the more benefit there is to adding more agents.
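
As a rough illustration of the sizing guidance above, and not an official formula, you might start at about half the HDFS node count and adjust from there. The node count below is a hypothetical value.

# Illustrative starting point only: cap agents at roughly half the HDFS node count.
HDFS_NODES=30            # hypothetical cluster size
AGENTS=$(( HDFS_NODES / 2 ))
(( AGENTS < 1 )) && AGENTS=1
echo "Start with up to ${AGENTS} agents and adjust based on observed throughput"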

Don't include sensitive information such as personally identifiable information (PII) or security data in your agent pool name or agent ID prefix. Resource names may be propagated to the names of other Google Cloud resources and may be exposed to Google-internal systems outside of your project.

Create an agent pool

Create an agent pool. Use your user account for this action.

Install agents

Install agents into the agent pool. Use your transfer agent account for this action.

Google Cloud console

  1. In the Google Cloud console, go to the Agent pools page.

    Go to Agent pools

  2. Select the agent pool to which to add the new agent.

  3. Click Install agent.

  4. Follow the instructions to install and run the agent.

    For more information about the agent's command-line options, see Agent command-line options.

gcloud CLI

To install one or more agents using the gcloud CLI, run gcloud transfer agents install:

gcloud transfer agents install --pool=POOL_NAME \
  --count=NUM_AGENTS \
  --mount-directories=MOUNT_DIRECTORIES \
  --hdfs-namenode-uri=HDFS_NAMENODE_URI \
  --hdfs-username=HDFS_USERNAME \
  --hdfs-data-transfer-protection=HDFS_DATA_TRANSFER_PROTECTION \
  --kerberos-config-file=KERBEROS_CONFIG_FILE \
  --kerberos-keytab-file=KERBEROS_KEYTAB_FILE \
  --kerberos-user-principal=KERBEROS_USER_PRINCIPAL \
  --kerberos-service-principal=KERBEROS_SERVICE_PRINCIPAL

Replace the following:

  • --hdfs-namenode-uri specifies an HDFS cluster including a schema, namenode, and port, in URI format. For example:

    • rpc://my-namenode:8020
    • http://my-namenode:9870

    Use HTTP or HTTPS for WebHDFS. If no schema is provided, we assume RPC. If no port is provided, the default is 8020 for RPC, 9870 for HTTP, and 9871 for HTTPS. For example, the input my-namenode becomes rpc://my-namenode:8020.

    If your cluster is configured with multiple namenodes, specify the current primary node. See Clusters with multiple namenodes for more information.

  • --hdfs-username is the username for connecting to an HDFS cluster with simple auth. Omit this flag if you're authenticating with Kerberos, or if you're connecting without any authentication.

  • --hdfs-data-transfer-protection (optional) is the client-side quality of protection (QOP) setting for Kerberized clusters. The value cannot be more restrictive than the server-side QOP value. Valid values are: authentication, integrity, and privacy.

If you're authenticating with Kerberos, also include the following flags:

  • --kerberos-config-file is the path to a Kerberos configuration file. For example, --kerberos-config-file=/etc/krb5.conf.

  • --kerberos-user-principal is the Kerberos user principal to use. For example, --kerberos-user-principal=user1.

  • --kerberos-keytab-file is the path to a Keytab file containing the user principal specified with the --kerberos-user-principal flag. For example, --kerberos-keytab-file=/home/me/kerberos/user1.keytab. (See the keytab check after this list.)

  • --kerberos-service-principal is the Kerberos service principal to use, of the form <primary>/<instance>. Realm is mapped from your Kerberos configuration file; any supplied realm is ignored. If this flag is not specified, the default is hdfs/<namenode_fqdn> where <namenode_fqdn> is the fully qualified domain name specified in the configuration file.

    For example, --kerberos-service-principal=hdfs/my-namenode.a.example.com.
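
Before installing the agent, you can confirm that the keytab file actually contains the user principal you plan to pass with --kerberos-user-principal. These are standard Kerberos client commands, not Storage Transfer Service commands; the path and principal reuse the example values above.

# List the principals stored in the keytab (example path from above).
klist -kt /home/me/kerberos/user1.keytab
# Optionally verify that the keytab can obtain a ticket for the user principal.
kinit -kt /home/me/kerberos/user1.keytab user1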

The tool walks you through any required steps to install the agent(s). This command installs NUM_AGENTS agent(s) on your machine, mapped to the pool name specified as POOL_NAME, and authenticates the agent using your gcloud credentials. The pool name must exist, or an error is returned.

The --mount-directories flag is optional but is strongly recommended. Its value is a comma-separated list of directories on the file system to which to grant the agent access. Omitting this flag mounts the entire file system to the agent container. See the gcloud reference for more details.
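
For example, a simple-auth installation restricted to two directories might look like the following sketch. The pool name, directories, namenode URI, and username are placeholders.

# Hypothetical example: two agents with access limited to specific directories.
gcloud transfer agents install --pool=my-hdfs-pool \
    --count=2 \
    --mount-directories=/data/exports,/data/archive \
    --hdfs-namenode-uri=rpc://my-namenode:8020 \
    --hdfs-username=hdfs-reader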

docker run

Before using docker run to install agents, follow the instructions to install Docker.

The docker run command installs one agent. To increase the number of agents in your pool, re-run this command as many times as required.

The required command flags depend on the authentication type you're using.

Kerberos

To authenticate to your file system using Kerberos, use the following command:

sudo docker run -d --ulimit memlock=64000000 --rm \
  --network=host \
  -v /:/transfer_root \
  gcr.io/cloud-ingest/tsop-agent:latest \
  --enable-mount-directory \
  --project-id=${PROJECT_ID} \
  --hostname=$(hostname) \
  --creds-file="service_account.json" \
  --agent-pool=${AGENT_POOL_NAME} \
  --hdfs-namenode-uri=cluster-namenode \
  --kerberos-config-file=/etc/krb5.conf \
  --kerberos-user-principal=user \
  --kerberos-keytab-file=/path/to/folder.keytab

Replace the following:

  • --network=host should be omitted if you're running more than one agent on this machine.
  • --hdfs-namenode-uri: A schema, namenode, and port, in URI format, representing an HDFS cluster. For example:

    • rpc://my-namenode:8020
    • http://my-namenode:9870

Use HTTP or HTTPS for WebHDFS. If no schema is provided, we assume RPC. If no port is provided, the default is 8020 for RPC, 9870 for HTTP, and 9871 for HTTPS. For example, the input my-namenode becomes rpc://my-namenode:8020.

If your cluster is configured with multiple namenodes, specify the current primary node. See Clusters with multiple namenodes for more information.

  • --kerberos-config-file: Path to a Kerberos configuration file. Default is /etc/krb5.conf.
  • --kerberos-user-principal: The Kerberos user principal.
  • --kerberos-keytab-file: Path to a Keytab file containing the user principal specified with --kerberos-user-principal.
  • --kerberos-service-principal: Kerberos service principal to use, of the form 'service/instance'. Realm is mapped from your Kerberos configuration file; any supplied realm is ignored. If this flag is not specified, the default is hdfs/<namenode_fqdn>, where <namenode_fqdn> is the fully qualified domain name.

Simple auth

To authenticate to your file system using simple auth:

sudo docker run -d --ulimit memlock=64000000 --rm \
  --network=host \
  -v /:/transfer_root \
  gcr.io/cloud-ingest/tsop-agent:latest \
  --enable-mount-directory \
  --project-id=${PROJECT_ID} \
  --hostname=$(hostname) \
  --creds-file="${CREDS_FILE}" \
  --agent-pool="${AGENT_POOL_NAME}" \
  --hdfs-namenode-uri=cluster-namenode \
  --hdfs-username="${USERNAME}"

Replace the following:

  • --hdfs-username: Username to use when connecting to an HDFS cluster using simple auth.
  • --hdfs-namenode-uri: A schema, namenode, and port, in URI format, representing an HDFS cluster. For example:
    • rpc://my-namenode:8020
    • http://my-namenode:9870

Use HTTP or HTTPS for WebHDFS. If no schema is provided, we assume RPC. If no port is provided, the default is 8020 for RPC, 9870 for HTTP, and 9871 for HTTPS. For example, the input my-namenode becomes rpc://my-namenode:8020.

If your cluster is configured with multiple namenodes, specify the current primary node. See Clusters with multiple namenodes for more information.

No auth

To connect to your file system without any authentication:

sudo docker run -d --ulimit memlock=64000000 --rm \
  --network=host \
  -v /:/transfer_root \
  gcr.io/cloud-ingest/tsop-agent:latest \
  --enable-mount-directory \
  --project-id=${PROJECT_ID} \
  --hostname=$(hostname) \
  --creds-file="${CREDS_FILE}" \
  --agent-pool="${AGENT_POOL_NAME}" \
  --hdfs-namenode-uri=cluster-namenode

Replace the following:

  • --hdfs-namenode-uri: A schema, namenode, and port, in URI format, representing an HDFS cluster. For example:
    • rpc://my-namenode:8020
    • http://my-namenode:9870

Use HTTP or HTTPS for WebHDFS. If no schema is provided, we assume RPC. If no port is provided, the default is 8020 for RPC, 9870 for HTTP, and 9871 for HTTPS. For example, the input my-namenode becomes rpc://my-namenode:8020.

If your cluster is configured with multiple namenodes, specify the current primary node. See Clusters with multiple namenodes for more information.

Transfer options

The following Storage Transfer Service features are available for transfers from HDFS to Cloud Storage.

Files transferred from HDFS do not retain their metadata.

Create a transfer

Don't include sensitive information such as personally identifiable information (PII) or security data in your transfer job name. Resource names may be propagated to the names of other Google Cloud resources and may be exposed to Google-internal systems outside of your project.

Storage Transfer Service provides multiple interfaces through which to create a transfer.

Google Cloud console

  1. Go to the Storage Transfer Service page in the Google Cloud console.

    Go to Storage Transfer Service

  2. Click Create transfer job. The Create a transfer job page is displayed.

  3. Select Hadoop Distributed File System as the Source type. The destination must be Google Cloud Storage.

    Click Next step.

Configure your source

  1. Specify the required information for this transfer:

    1. Select the agent pool you configured for this transfer.

    2. Enter the Path to transfer from, relative to the root directory.

  2. Optionally, specify any filters to apply to the source data.

  3. Click Next step.

Configure your sink

  1. In the Bucket or folder field, enter the destination bucket and (optionally) folder name, or click Browse to select a bucket from a list of existing buckets in your current project. To create a new bucket, click Create new bucket.

  2. Click Next step.

Schedule the transfer

You can schedule your transfer to run one time only, or configure a recurring transfer.

Click Next step.

Choose transfer settings

  1. In the Description field, enter a description of the transfer. As a best practice, enter a description that is meaningful and unique so that you can tell jobs apart.

  2. Under Metadata options, select your Cloud Storage storage class, and whether to save each object's creation time. See Metadata preservation for details.

  3. Under When to overwrite, select one of the following:

    • Never: Do not overwrite destination files. If a file exists with the same name, it will not be transferred.

    • If different: Overwrites destination files if the source file with the same name has different ETags or checksum values.

    • Always: Always overwrites destination files when the source file has the same name, even if they're identical.

  4. Under When to delete, select one of the following:

    • Never: Never delete files from either the source or destination.

    • Delete files from destination if they're not also at source: If files in the destination Cloud Storage bucket aren't also in the source, then delete the files from the Cloud Storage bucket.

      This option ensures that the destination Cloud Storage bucket exactly matches your source.

  5. Select whether to enable transfer logging and/or Pub/Sub notifications.

Click Create to create the transfer job.

gcloud CLI

To create a new transfer job, use the gcloud transfer jobs create command. Creating a new job initiates the specified transfer, unless a schedule or --do-not-run is specified.

gcloud transfer jobs create \
  hdfs:///PATH/ gs://BUCKET_NAME/PATH/ \
  --source-agent-pool=AGENT_POOL_NAME

Replace the following:

  • PATH is an absolute path from the root of the HDFS cluster. The cluster namenode and port are configured at the agent level, so the job creation command only needs to specify the (optional) path and the agent pool.

  • --source-agent-pool specifies the source agent pool to use for thistransfer.

Additional options include:

  • --do-not-run prevents Storage Transfer Service from running the job upon submission of the command. To run the job, update it to add a schedule, or use jobs run to start it manually.

  • --manifest-file specifies the path to a CSV file in Cloud Storage containing a list of files to transfer from your source. For manifest file formatting, see Transfer specific files or objects using a manifest.

  • Job information: You can specify --name and --description.

  • Schedule: Specify --schedule-starts, --schedule-repeats-every, and --schedule-repeats-until, or --do-not-run. (A combined example follows this list.)

  • Object conditions: Use conditions to determine which objects are transferred. These include --include-prefixes and --exclude-prefixes, and the time-based conditions in --include-modified-[before | after]-[absolute | relative]. If you specified a folder with your source, prefix filters are relative to that folder. See Filter source objects by prefix for more information.

  • Transfer options: Specify whether to overwrite destination files (--overwrite-when=different or always) and whether to delete certain files during or after the transfer (--delete-from=destination-if-unique or source-after-transfer); and optionally set a storage class on transferred objects (--custom-storage-class).

  • Notifications: Configure Pub/Sub notifications for transfers with --notification-pubsub-topic, --notification-event-types, and --notification-payload-format.
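
As an illustration of how several of these options combine, the following sketch creates a recurring, filtered transfer. All paths, pool and bucket names, and the Pub/Sub topic are placeholders; adapt the values to your environment.

# Hypothetical example combining schedule, prefix filters, overwrite behavior,
# and Pub/Sub notifications. Replace all placeholder values with your own.
gcloud transfer jobs create \
    hdfs:///user/data/ gs://my-archive-bucket/hdfs-data/ \
    --source-agent-pool=my-hdfs-pool \
    --description="Nightly HDFS archive" \
    --schedule-repeats-every=1d \
    --include-prefixes=logs/,reports/ \
    --overwrite-when=different \
    --notification-pubsub-topic=projects/my-project/topics/transfer-events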

To view all options, run gcloud transfer jobs create --help or refer to the gcloud reference documentation.

REST API

To create a transfer from an HDFS source using the REST API, create a JSON object similar to the following example.

POST https://storagetransfer.googleapis.com/v1/transferJobs
{
  ...
  "transferSpec": {
    "source_agent_pool_name": "POOL_NAME",
    "hdfsDataSource": {
      "path": "/mount"
    },
    "gcsDataSink": {
      "bucketName": "SINK_NAME"
    },
    "transferOptions": {
      "deleteObjectsFromSourceAfterTransfer": false
    }
  }
}
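
One way to send this request is with curl and a gcloud access token, as in the following sketch. The file name transfer-job.json is a placeholder for a file containing the completed request body from the example above.

# Hypothetical example: submit the JSON request body saved in transfer-job.json.
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @transfer-job.json \
    "https://storagetransfer.googleapis.com/v1/transferJobs"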

Refer to the transferJobs.create reference for details about additional supported fields.

Clusters with multiple namenodes

Storage Transfer Service agents can only be configured with a single namenode. If your HDFS cluster is configured with multiple namenodes ("high availability"), and there's a failover event that results in a new primary namenode, you must re-install your agents with the correct namenode.

To delete the old agents, refer to Delete an agent.

Your cluster's active namenode can be retrieved by running:

hdfs haadmin -getAllServiceState
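
The command prints the state of each configured namenode; you can filter the output for the active one, as in this sketch (the service IDs shown depend on your cluster configuration):

# Show only the namenode(s) currently reporting the active state.
hdfs haadmin -getAllServiceState | grep -i active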
