Dataform overview
This document introduces you to Dataform concepts and processes.
Dataform is a service that lets data analysts develop, test, version control, and schedule complex workflows for data transformation in BigQuery.
Dataform lets you manage data transformation in the Extraction, Loading, and Transformation (ELT) process for data integration. After raw data is extracted from source systems and loaded into BigQuery, Dataform helps you to transform it into a well-defined, tested, and documented suite of data tables.
Dataform lets you perform the following data transformation actions:
- Develop and run workflows for data transformation.
- Collaborate with team members on workflow development through Git.
- Manage a large number of tables and their dependencies.
- Declare source data and manage table dependencies.
- View a visualization of the dependency tree of your workflow.
- Manage data with SQL code in a central repository.
- Reuse code with JavaScript.
- Test data correctness with quality tests on source and output tables.
- Version control SQL code.
- Document data tables inside SQL code.
Data transformation processes in Dataform
The data transformation workflow for Dataform is as follows:
- Dataform lets you create repositories to manage your code.
- Dataform lets you create workspaces for development.
- Dataform lets you develop workflows in a development workspace.
- Dataform compiles Dataform core into SQL.
- Dataform runs the dependency tree.
Dataform lets you create repositories to manage your code
In a Dataform repository, you use Dataform core, an extension of SQL, to write SQLX files in which you define your workflow. Dataform repositories support version control. You can link a Dataform repository to a third-party Git provider.
Dataform lets you create workspaces for development
You can create development workspaces inside a Dataform repository for Dataform core development. In a development workspace, you can make changes to the repository, compile, test, and push them to the main repository through Git.
Dataform lets you develop Dataform core in a development workspace
In a development workspace, you can define and document tables, their dependencies, and transformation logic to build your workflow. You can also configure actions in JavaScript.
Dataform compiles Dataform core
During compilation, Dataform performs the following tasks:
- Compiles Dataform core into a workflow of Standard SQL.
- Adds boilerplate SQL statements, such as CREATE TABLE or INSERT, to the code in line with your query configuration.
- Transpiles (compiles source-to-source) JavaScript into SQL.
- Resolves dependencies and checks for errors, including missing or circular dependencies.
- Builds the dependency tree of all actions to be run in BigQuery.
Dataform compilation is hermetic to verify compilation consistency, meaning that the same code compiles to the same SQL compilation result every time. Dataform compiles your code in a sandbox environment with no internet access. No additional actions, such as calling external APIs, are available during compilation.
To debug in real time, you can inspect the compiled workflow of your project in an interactive graph in your development workspace.
Dataform runs the dependency tree
In BigQuery, Dataform performs the following tasks:
- Runs SQL commands, following the order of the dependency tree.
- Runs assertion queries against your tables and views to check data correctness.
- Runs other SQL operations that you defined.
After the execution, you can use your tables and views for all your analytics purposes.
You can view logs to see which tables were created, whether assertions passed or failed, how long each action took to complete, and other information. You can also view the exact SQL code that was run in BigQuery.
Dataform features
With Dataform, you can develop and deploy tables, incremental tables, or views to BigQuery. Dataform offers a web environment for the following activities:
- Workflow development
- Connection with GitHub, GitLab, Azure DevOps Services, and Bitbucket
- Continuous integration and continuous deployment
- Workflow execution
The following sections describe the main features of Dataform.
Repositories
Each Dataform project is stored in a repository. A Dataform repository houses a collection of JSON configuration files, SQLX files, and JavaScript files.
Note: The Dataform repositories list includes only repositories created for Dataform workflow development. Repositories for BigQuery Studio assets, such as notebooks or saved queries, are managed separately and don't appear in this list.

Dataform repositories contain the following types of files:
Config files
Config JSON or SQLX files let you configure your workflows. They contain general configuration, execution schedules, or schema for creating new tables and views.
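As an illustrative sketch, a repository-level workflow settings file might look like the following; the project and dataset names here are placeholder values, not real defaults:

```yaml
# workflow_settings.yaml (hypothetical values for illustration)
defaultProject: my-gcp-project   # Google Cloud project where tables are created
defaultDataset: dataform         # default BigQuery dataset (schema) for output
defaultLocation: US              # BigQuery location for created assets
dataformCoreVersion: 3.0.0       # Dataform core framework version to compile with
```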
Definitions
Definitions are SQLX and JavaScript files that define new tables, views, and additional SQL operations to run in BigQuery.
Includes
Includes are JavaScript files where you can define variables and functions to use in your project.
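For illustration, an include file could define a constant and a helper function that other files in the repository reference; the file name and identifiers below (constants.js, countryFilter, START_DATE) are hypothetical:

```javascript
// includes/constants.js (hypothetical include file for illustration)
// Values exported here can be referenced from SQLX files elsewhere in
// the repository, for example as ${constants.START_DATE}.

const START_DATE = "2023-01-01"; // assumed project start date

// Builds a SQL predicate restricting rows to the given country codes.
function countryFilter(...countries) {
  return countries.map((c) => `country = '${c}'`).join(" OR ");
}

module.exports = { START_DATE, countryFilter };
```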
Each Dataform repository is connected to the default Dataform service agent or a custom service account. You can connect new repositories only to custom service accounts. You select the custom service account when you create a repository, or edit the service account later.
By default, Dataform uses a service agent or service account derived from your project number in the following format:
service-YOUR_PROJECT_NUMBER@gcp-sa-dataform.iam.gserviceaccount.com

Version control
Dataform uses the Git version control system to maintain a record ofeach change made to project files and to manage file versions.
Each Dataform repository can manage its own Git repository, or beconnected to a remote third-party Git repository. You canconnect a Dataform repository to a GitHub, GitLab, Azure DevOps Services, or Bitbucket repository.
Users version control their workflow code inside Dataform workspaces. In a Dataform workspace, you can pull changes from the repository, commit all or selected changes, and push them to Git branches of the repository.
Workflow development
In Dataform, you make changes to files and directories inside a development workspace. A development workspace is a virtual, editable copy of the contents of a Git repository. Dataform preserves the state of files in your development workspace between sessions.
In a development workspace, you can develop workflow actions by using Dataform core with SQLX and JavaScript, or exclusively with JavaScript. You can automatically format your Dataform core or JavaScript code.
Each element of a Dataform workflow, such as a table or assertion,corresponds to an action that Dataform performs in BigQuery.For example, a table definition file is an action of creating or updating thetable in BigQuery.
In a Dataform workspace, you can develop the followingworkflow actions:
- Source data declarations
- Tables and views
- Incremental tables
- Table partitions and clusters
- Dependencies between actions
- Documentation of tables
- Custom SQL operations
- BigQuery labels
- BigQuery policy tags
- Dataform tags
- Data quality tests, called assertions
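For example, a source data declaration is a small SQLX file that contains only a config block; the project, dataset, and table names below are hypothetical:

```sqlx
config {
  type: "declaration",
  database: "my-gcp-project",  // assumed Google Cloud project ID
  schema: "raw_data",          // assumed BigQuery dataset
  name: "store_raw"            // assumed existing source table
}
```

With this declaration in place, other files can reference the source table as ${ref("store_raw")} instead of hard-coding its location.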
You can use JavaScript to reuse your Dataform workflow codein the following ways:
- Across a file with code encapsulation
- Across a repository with includes
- Across repositories with packages
Dataform compiles the workflow code in your workspace in real time. In your workspace, you can view the compiled queries and details of actions in each file. You can also view the compilation status and errors in the edited file or in the repository.
To test the output of a compiled SQL query before you run it in BigQuery, you can preview the query in your Dataform workspace.
To inspect the entire workflow defined in your workspace, you can view an interactive compiled graph that shows all compiled actions in your workflow and the relationships between them.
Workflow compilation
Dataform uses default compilation settings, configured in the workflow settings file, to compile the workflow code in your workspace to SQL in real time, creating a compilation result of the workspace.
You can override compilation settings to customize how Dataformcompiles your workflow into a compilation result.
With workspace compilation overrides, you can configure compilation overrides for all workspaces in a repository. You can set dynamic workspace overrides to create custom compilation results for each workspace, turning workspaces into isolated development environments. You can override the Google Cloud project in which Dataform runs the contents of a workspace, add a prefix to the names of all compiled tables, and add a suffix to the default schema.
With release configurations, you can configure templates of compilation settings for creating compilation results of a Dataform repository. In a release configuration, you can override the Google Cloud project in which Dataform runs the compilation results, add a prefix to the names of all compiled tables, add a suffix to the default schema, and add compilation variables. You can also set the frequency of creating compilation results. To schedule runs of compilation results created in a selected release configuration, you can create a workflow configuration.
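As a sketch of how compilation variables are consumed, SQLX code can read a variable set in a release configuration through the project config; the variable name executionSetting and the table names here are hypothetical:

```sqlx
config { type: "view" }

SELECT *
FROM ${ref("orders")}
-- Reads a compilation variable whose value is set per release configuration.
WHERE order_status = '${dataform.projectConfig.vars.executionSetting}'
```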
Workflow run
During a workflow run, Dataform runs the compilation resultsof workflows to create or update assets in BigQuery.
To create or refresh the tables and views defined in your workflow in BigQuery, you can start a workflow run manually in a development workspace or schedule runs.
You can schedule Dataform runs in BigQuery in thefollowing ways:
- Create workflow configurations to schedule runs of compilation results created in release configurations
- Schedule runs with Cloud Composer
- Schedule runs with Workflows and Cloud Scheduler
You can also automate runs with Cloud Build triggers.
To debug errors, you can monitor runs in the following ways:
- View detailed Dataform execution logs
- View audit logs for Dataform
- View Cloud Logging logs for Dataform
Dataform core
Dataform core is an open source meta-language to create SQL tables andworkflows. Dataform core extends SQL by providing a dependencymanagement system, automated data quality testing, and data documentation.
You can use Dataform core for the following purposes:
- Defining tables, views, materialized views, or incremental tables.
- Defining data transformation logic.
- Declaring source data and managing table dependencies.
- Documenting table and column descriptions inside code.
- Reusing functions and variables across different queries.
- Writing data assertions to verify data consistency.
In Dataform, you use Dataform core to develop workflowsand deploy assets to BigQuery.
Dataform core is part of the open-source Dataform data modeling framework that also includes the Dataform CLI. You can compile and run Dataform core locally through the Dataform CLI, outside of Google Cloud.
To use Dataform core, you write SQLX files. Each SQLX file contains a query that defines a database relation that Dataform creates and updates inside BigQuery.
Dataform compiles your Dataform core code in real time to create a SQL compilation result that you can run in BigQuery.
Dataform compilation is hermetic to verify compilation consistency, meaning that the same code compiles to the same SQL compilation result every time. Dataform compiles your code in a sandbox environment with no internet access. No additional actions, such as calling external APIs, are available during compilation.
SQLX file config block
A SQLX file consists of a config block and a body. All config properties, and the config block itself, are optional. Given this, any plain SQL file is a valid SQLX file that Dataform runs as-is.
In the config block, you can perform the following actions:
Specify query metadata
Using the config metadata, you can configure how Dataform materializes queries into BigQuery, for example, the output table type, the target database, or labels.
Document data
You can document your tables and their fields directly in the config block. Documentation of your tables is pushed directly to BigQuery. You can parse this documentation and push it out to other tools.
Define data quality tests
You can define data quality tests, called assertions, to check for uniqueness, null values, or a custom condition. Dataform adds assertions defined in the config block to your workflow dependency tree after table creation. You can also define assertions outside the config block, in a separate SQLX file.
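As an illustrative sketch of an assertion defined in its own SQLX file, the query returns rows only when the data is invalid, which makes the assertion fail; the table and column names below are hypothetical:

```sqlx
config { type: "assertion" }

-- The assertion fails if this query returns any rows:
-- here, orders with a negative amount or a missing customer ID.
SELECT *
FROM ${ref("orders")}
WHERE amount < 0 OR customer_id IS NULL
```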
The following code sample shows you how to define the output table type,document the table, and define a quality test in a config block of a SQLX file.
```sqlx
config {
  type: "table",
  description: "This table joins orders information from OnlineStore & payment information from PaymentApp",
  columns: {
    order_date: "The date when a customer placed their order",
    id: "Order ID as defined by OnlineStore",
    order_status: "The status of an order e.g. sent, delivered",
    customer_id: "Unique customer ID",
    payment_status: "The status of a payment e.g. pending, paid",
    payment_method: "How the customer chose to pay",
    item_count: "The number of items the customer ordered",
    amount: "The amount the customer paid"
  },
  assertions: {
    uniqueKey: ["id"]
  }
}
```

SQLX file body
In the body of a SQLX file, you can perform the following actions:
- Define a table and its dependencies.
- Define additional SQL operations to run in BigQuery.
- Generate SQL code with JavaScript.
Define a table
To define a new table, you can use SQL SELECT statements and the ref function.
The ref function is a SQLX built-in function that is critical to dependency management in Dataform. The ref function lets you reference tables defined in your Dataform project instead of hard-coding the schema and table names of your data table.
Dataform uses the ref function to build a dependency tree of all the tables to be created or updated. After compiling, Dataform adds boilerplate statements like CREATE, REPLACE, or INSERT.
The following code sample shows you how to reference a table in a SQLX file with the ref function.
```sqlx
config { type: "table" }

SELECT
  order_date AS date,
  order_id AS order_id,
  order_status AS order_status,
  SUM(item_count) AS item_count,
  SUM(amount) AS revenue
FROM ${ref("store_clean")}
GROUP BY 1, 2, 3
```

The output is similar to the following:
```sql
CREATE OR REPLACE TABLE Dataform.orders AS
SELECT
  order_date AS date,
  order_id AS order_id,
  order_status AS order_status,
  SUM(item_count) AS item_count,
  SUM(amount) AS revenue
FROM Dataform_stg.store_clean
GROUP BY 1, 2, 3
```

For more information on additional dependency management, for example, executing code conditionally or using other Dataform core built-in functions, see the Dataform core reference.
Define additional SQL operations
To configure Dataform to run one or more SQL statements before or after creating a table or view, you can specify pre-query and post-query operations.
The following code sample shows you how to configure table or view access permissions in a post-query operation.
```sqlx
SELECT * FROM ...

post_operations {
  GRANT `roles/bigquery.dataViewer`
  ON TABLE ${self()}
  TO "group:someusers@dataform.co"
}
```

Encapsulate SQL code
To define reusable functions that generate repetitive parts of SQL code, you can use JavaScript blocks. You can reuse code defined in a JavaScript block only inside the SQLX file where the block is defined. To reuse code across your entire repository, you can create includes.
To dynamically modify a query, you can use inline JavaScript anywhere in the body.
The following code sample shows how to define a JavaScript block in a SQLX fileand use it inline inside a query:
```sqlx
js {
  const columnName = "foo";
}

SELECT 1 AS ${columnName} FROM "..."
```

Limitations
Dataform has the following known limitations:
- Dataform in Google Cloud runs on a plain V8 runtime and does not support additional capabilities and modules provided by Node.js. If your existing codebase requires any Node.js modules, you need to remove these dependencies.
- Projects without a name field in package.json generate diffs on package-lock.json every time packages are installed. To avoid this outcome, you need to add a name property in package.json.
- git+https:// URLs for dependencies in package.json are not supported. Convert such URLs to plain https:// archive URLs. For example, convert git+https://github.com/dataform-co/dataform-segment.git#1.5 to https://github.com/dataform-co/dataform-segment/archive/1.5.tar.gz.
- Manually running unit tests is not available.
- Searching for file content in development workspaces is not available.
- As of Dataform core 3.0.0, Dataform doesn't distribute a Docker image. You can build your own Docker image of Dataform, which you can use to run the equivalent of Dataform CLI commands. To build your own Docker image, see Containerize an application in the Docker documentation.
- The following Dataform API methods don't comply with the AIP.134 guidelines by treating the * wildcard entry as a bad request and by updating all fields instead of set fields when field_mask is omitted:
- If a scheduled workflow configuration run doesn't finish before the start of the next scheduled run, the next scheduled run is skipped and marked with an error.

Note: For the list of egress IP address ranges to use when allow-listing Dataform IPs with third-party remote repositories, see Dataform locations.
What's next
- To learn more about the code lifecycle in Dataform, see Introduction to code lifecycle in Dataform.
- To learn more about Dataform repositories, see Introduction to repositories.
- To learn more about Dataform workspaces, see Create a Dataform development workspace.
- To learn more about developing workflows in Dataform, see Overview of workflows.
- To learn more about the Dataform CLI, see Use the Dataform CLI.
Last updated 2026-02-18 UTC.