File Format Conversion (Avro, Parquet, CSV) template

The File Format Conversion template is a batch pipeline that converts files stored on Cloud Storage from one supported format to another.

The following format conversions are supported:

  • CSV to Avro
  • CSV to Parquet
  • Avro to Parquet
  • Parquet to Avro

Pipeline requirements

  • The output Cloud Storage bucket must exist before running the pipeline.

Template parameters

ParameterDescription
inputFileFormatThe input file format. Must be one of[csv, avro, parquet].
outputFileFormatThe output file format. Must be one of[avro, parquet].
inputFileSpecThe Cloud Storage path pattern for input files. For example,gs://bucket-name/path/*.csv
outputBucketThe Cloud Storage folder to write output files. This path must end with a slash. For example,gs://bucket-name/output/
schemaThe Cloud Storage path to the Avro schema file. For example,gs://bucket-name/schema/my-schema.avsc
containsHeaders(Optional) The input CSV files contain a header record (true/false). The default value isfalse. Only required when reading CSV files.
csvFormat(Optional) The CSV format specification to use for parsing records. The default value isDefault. SeeApache Commons CSV Format for more details.
delimiter(Optional) The field delimiter used by the input CSV files.
outputFilePrefix(Optional) The output file prefix. The default value isoutput.
numShards(Optional) The number of output file shards.

Run the template

Console

  1. Go to the DataflowCreate job from template page.
  2. Go to Create job from template
  3. In theJob name field, enter a unique job name.
  4. Optional: ForRegional endpoint, select a value from the drop-down menu. The default region isus-central1.

    For a list of regions where you can run a Dataflow job, seeDataflow locations.

  5. From theDataflow template drop-down menu, select theConvert file formats template.
  6. In the provided parameter fields, enter your parameter values.
  7. ClickRun job.

gcloud

Note: To use the Google Cloud CLI to run flex templates, you must haveGoogle Cloud CLI version 284.0.0 or later.

In your shell or terminal, run the template:

gclouddataflowflex-templaterunJOB_NAME\--project=PROJECT_ID\--region=REGION_NAME\--template-file-gcs-location=gs://dataflow-templates-REGION_NAME/VERSION/flex/File_Format_Conversion\--parameters\inputFileFormat=INPUT_FORMAT,\outputFileFormat=OUTPUT_FORMAT,\inputFileSpec=INPUT_FILES,\schema=SCHEMA,\outputBucket=OUTPUT_FOLDER

Replace the following:

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, seeprojects.templates.launch.

POSThttps://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch{"launch_parameter":{"jobName":"JOB_NAME","parameters":{"inputFileFormat":"INPUT_FORMAT","outputFileFormat":"OUTPUT_FORMAT","inputFileSpec":"INPUT_FILES","schema":"SCHEMA","outputBucket":"OUTPUT_FOLDER"},"containerSpecGcsPath":"gs://dataflow-templates-LOCATION/VERSION/flex/File_Format_Conversion",}}

Replace the following:

Template source code

Java

/* * Copyright (C) 2019 Google LLC * * Licensed under the Apache License, Version 2.0 (the "License"); you may not * use this file except in compliance with the License. You may obtain a copy of * the License at * *   http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the * License for the specific language governing permissions and limitations under * the License. */packagecom.google.cloud.teleport.v2.templates;importcom.google.cloud.teleport.metadata.Template;importcom.google.cloud.teleport.metadata.TemplateCategory;importcom.google.cloud.teleport.metadata.TemplateParameter;importcom.google.cloud.teleport.metadata.TemplateParameter.TemplateEnumOption;importcom.google.cloud.teleport.v2.common.UncaughtExceptionLogger;importcom.google.cloud.teleport.v2.templates.FileFormatConversion.FileFormatConversionOptions;importcom.google.cloud.teleport.v2.transforms.AvroConverters.AvroOptions;importcom.google.cloud.teleport.v2.transforms.CsvConverters.CsvPipelineOptions;importcom.google.cloud.teleport.v2.transforms.ParquetConverters.ParquetOptions;importjava.util.EnumMap;importorg.apache.beam.sdk.Pipeline;importorg.apache.beam.sdk.PipelineResult;importorg.apache.beam.sdk.options.PipelineOptions;importorg.apache.beam.sdk.options.PipelineOptionsFactory;importorg.apache.beam.sdk.options.Validation.Required;importorg.slf4j.Logger;importorg.slf4j.LoggerFactory;/** * The {@link FileFormatConversion} pipeline takes in an input file, converts it to a desired format * and saves it to Cloud Storage. Supported file transformations are: * * <ul> *   <li>Csv to Avro *   <li>Csv to Parquet *   <li>Avro to Parquet *   <li>Parquet to Avro * </ul> * * <p><b>Pipeline Requirements</b> * * <ul> *   <li>Input file exists in Google Cloud Storage. *   <li>Google Cloud Storage output bucket exists. * </ul> * * <p>Check out <a * href="https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/file-format-conversion/README_File_Format_Conversion.md">README</a> * for instructions on how to use or modify this template. */@Template(name="File_Format_Conversion",category=TemplateCategory.UTILITIES,displayName="Convert file formats between Avro, Parquet & CSV",description={"The File Format Conversion template is a batch pipeline that converts files stored on Cloud Storage from one supported format to another.\n","The following format conversions are supported:\n"+"- CSV to Avro\n"+"- CSV to Parquet\n"+"- Avro to Parquet\n"+"- Parquet to Avro"},optionsClass=FileFormatConversionOptions.class,optionalOptions={"deadletterTable"},flexContainerName="file-format-conversion",documentation="https://cloud.google.com/dataflow/docs/guides/templates/provided/file-format-conversion",contactInformation="https://cloud.google.com/support",requirements={"The output Cloud Storage bucket must exist before running the pipeline."})publicclassFileFormatConversion{/** Logger for class. */privatestaticfinalLoggerLOG=LoggerFactory.getLogger(FileFormatConversionFactory.class);privatestaticEnumMap<ValidFileFormats,String>validFileFormats=newEnumMap<ValidFileFormats,String>(ValidFileFormats.class);/**   * The {@link FileFormatConversionOptions} provides the custom execution options passed by the   * executor at the command-line.   */publicinterfaceFileFormatConversionOptionsextendsPipelineOptions,CsvPipelineOptions,AvroOptions,ParquetOptions{@TemplateParameter.Enum(order=1,enumOptions={@TemplateEnumOption("avro"),@TemplateEnumOption("csv"),@TemplateEnumOption("parquet")},description="File format of the input files.",helpText="File format of the input files. Needs to be either avro, parquet or csv.")@RequiredStringgetInputFileFormat();voidsetInputFileFormat(StringinputFileFormat);@TemplateParameter.Enum(order=2,enumOptions={@TemplateEnumOption("avro"),@TemplateEnumOption("parquet")},description="File format of the output files.",helpText="File format of the output files. Needs to be either avro or parquet.")@RequiredStringgetOutputFileFormat();voidsetOutputFileFormat(StringoutputFileFormat);}/** The {@link ValidFileFormats} enum contains all valid file formats. */publicenumValidFileFormats{CSV,AVRO,PARQUET}/**   * Main entry point for pipeline execution.   *   * @param args Command line arguments to the pipeline.   */publicstaticvoidmain(String[]args){UncaughtExceptionLogger.register();FileFormatConversionOptionsoptions=PipelineOptionsFactory.fromArgs(args).withValidation().as(FileFormatConversionOptions.class);run(options);}/**   * Runs the pipeline to completion with the specified options.   *   * @param options The execution options.   * @return The pipeline result.   * @throws RuntimeException thrown if incorrect file formats are passed.   */publicstaticPipelineResultrun(FileFormatConversionOptionsoptions){StringinputFileFormat=options.getInputFileFormat().toUpperCase();StringoutputFileFormat=options.getOutputFileFormat().toUpperCase();validFileFormats.put(ValidFileFormats.CSV,"CSV");validFileFormats.put(ValidFileFormats.AVRO,"AVRO");validFileFormats.put(ValidFileFormats.PARQUET,"PARQUET");if(!validFileFormats.containsValue(inputFileFormat)){thrownewIllegalArgumentException("Invalid input file format.");}if(!validFileFormats.containsValue(outputFileFormat)){thrownewIllegalArgumentException("Invalid output file format.");}if(inputFileFormat.equals(outputFileFormat)){thrownewIllegalArgumentException("Input and output file format cannot be the same.");}// Create the pipelinePipelinepipeline=Pipeline.create(options);pipeline.apply(inputFileFormat+" to "+outputFileFormat,FileFormatConversionFactory.FileFormat.newBuilder().setOptions(options).setInputFileFormat(inputFileFormat).setOutputFileFormat(outputFileFormat).build());returnpipeline.run();}}

What's next

Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2026-02-19 UTC.