Use the Colab Enterprise Data Science Agent with BigQuery
Preview
This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see the launch stage descriptions.
Note: To provide feedback, to ask questions, or to request to opt out of this Preview feature, contact vertex-notebooks-previews-external@google.com or fill out the Data Science Agent Public Preview Opt-out form.
The Data Science Agent (DSA) for Colab Enterprise and BigQuery lets you automate exploratory data analysis, perform machine learning tasks, and deliver insights all within a Colab Enterprise notebook.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.
Roles required to select or create a project
- Select a project: Selecting a project doesn't require a specific IAM role. You can select any project that you've been granted a role on.
- Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
Verify that billing is enabled for your Google Cloud project.
Enable the BigQuery, Vertex AI, Dataform, and Compute Engine APIs.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
For new projects, the BigQuery API is automatically enabled.
If you're new to Colab Enterprise in BigQuery, see the setup steps on the Create notebooks page.
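As a rough sketch, the API enablement step above could also be scripted with the gcloud CLI. The Python below only composes the command and does not run it. The service names are the standard identifiers for the four APIs; the function name and the project ID `my-project` are placeholders introduced here for illustration.

```python
import shlex

# Service identifiers for the four APIs named in the setup steps.
REQUIRED_SERVICES = [
    "bigquery.googleapis.com",    # BigQuery
    "aiplatform.googleapis.com",  # Vertex AI
    "dataform.googleapis.com",    # Dataform
    "compute.googleapis.com",     # Compute Engine
]


def build_enable_command(project_id):
    """Return the argv for `gcloud services enable` without running it."""
    if not project_id:
        raise ValueError("project_id must not be empty")
    return [
        "gcloud", "services", "enable",
        *REQUIRED_SERVICES,
        f"--project={project_id}",
    ]


# "my-project" is a placeholder, not a real project ID.
command = build_enable_command("my-project")
print(shlex.join(command))
```

Running the printed command requires the Service Usage Admin role described above.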
Limitations
- The Data Science Agent supports the following data sources:
- CSV files
- BigQuery tables
- The code produced by the Data Science Agent only runs in your notebook's runtime.
- The Data Science Agent isn't supported in projects that have enabled VPC Service Controls.
- Searching for BigQuery tables using the @mention function is limited to your current project. Use the table selector to search across projects.
- The @mention function only searches for BigQuery tables. To search for data files that you can upload, use the + symbol.
- PySpark in the Data Science Agent only generates Serverless for Apache Spark 4.0 code. The DSA can help you upgrade to Serverless for Apache Spark 4.0, but users who require earlier versions shouldn't use the Data Science Agent.
When to use the Data Science Agent
The Data Science Agent helps you with tasks ranging from exploratory data analysis to generating machine learning predictions and forecasts. You can use the DSA for:
- Large-scale data processing: Use BigQuery ML, BigQuery DataFrames, or Serverless for Apache Spark to perform distributed data processing on large datasets. This lets you efficiently clean, transform, and analyze data that's too large to fit into memory on a single machine.
- Generating a plan: Generate and modify a plan to complete a particular task using common tools such as Python, SQL, Serverless for Apache Spark, and BigQuery DataFrames.
- Data exploration: Explore a dataset to understand its structure, identify potential issues like missing values and outliers, and examine the distribution of key variables using Python or SQL.
- Data cleaning: Clean your data. For example, remove data points that are outliers.
- Data wrangling: Convert categorical features into numerical representations using techniques like one-hot encoding or label encoding or by using BigQuery ML feature transformation tools. Create new features for analysis.
- Data analysis: Analyze the relationships between different variables. Calculate correlations between numerical features and explore distributions of categorical features. Look for patterns and trends in the data.
- Data visualization: Create visualizations such as histograms, box plots, scatter plots, and bar charts that represent the distributions of individual variables and the relationships between them. You can also create visualizations in Python for tables stored in BigQuery.
- Feature engineering: Engineer new features from a cleaned dataset.
- Data splitting: Split an engineered dataset into training, validation, and testing datasets.
- Model training: Train a model by using the training data in a pandas DataFrame (X_train, y_train), BigQuery DataFrames, a PySpark DataFrame, or by using the BigQuery ML CREATE MODEL statement with BigQuery tables.
- Model optimization: Optimize a model by using the validation set. Explore alternative models like DecisionTreeRegressor and RandomForestRegressor and compare their performance.
- Model evaluation: Evaluate model performance on a test dataset using a pandas DataFrame, BigQuery DataFrames, or a PySpark DataFrame. You can also assess model quality and compare models by using BigQuery ML model evaluation functions for models trained using BigQuery ML.
- Model inference: Perform inference with BigQuery ML trained models, imported models, and remote models using BigQuery ML inference functions. You can also use the BigFrames model.predict() method or PySpark transformers to make predictions.
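To make the data splitting task in the list above concrete, here is a minimal standard-library sketch of a training, validation, and testing split. It is not code produced by the Data Science Agent; the function name, the 70/15/15 fractions, and the seed are assumptions chosen for illustration.

```python
import random


def split_rows(rows, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle rows and split them into train, validation, and test lists."""
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a test set")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


rows = list(range(100))  # stand-in for the rows of an engineered dataset
train, val, test = split_rows(rows)
print(len(train), len(val), len(test))
```

The validation list is what the model optimization step would tune against, and the test list is held back for the final evaluation.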
Use the Data Science Agent in BigQuery
The following steps show you how to use the Data Science Agent inBigQuery.
Create or open a Colab Enterprise notebook.
Reference your data in one of the following ways:
- Upload a CSV file or use the + symbol in your prompt to search for available files.
- Choose one or more BigQuery tables in the table selector from your current project or from other projects you have access to.
- Reference a BigQuery table name in your prompt in this format: project_id:dataset.table
- Type the @ symbol to search for a BigQuery table name using the @mention function.
Enter a prompt that describes the data analysis you want to perform or the prototype you want to build. The Data Science Agent's default behavior is to generate Python code using open source libraries such as sklearn to accomplish complex machine learning tasks. To use a specific tool, include the following keywords in your prompt:
- If you want to use BigQuery ML, include the "SQL" keyword.
- If you want to use BigQuery DataFrames, specify the "BigFrames" or "BigQuery DataFrames" keywords.
- If you want to use PySpark, include the "Apache Spark" or "PySpark" keywords.
For help, see the sample prompts.
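The keyword rules above can be sketched as a small helper for drafting prompts. This is only an illustration of the documented keywords: how the agent itself detects them is not described here, and the dictionary keys and function names are hypothetical.

```python
# Keywords come from the list above; the dictionary keys are illustrative labels.
TOOL_KEYWORDS = {
    "bigquery_ml": ("SQL",),
    "bigquery_dataframes": ("BigFrames", "BigQuery DataFrames"),
    "pyspark": ("Apache Spark", "PySpark"),
}


def detect_tools(prompt):
    """Return the tool labels whose keywords already appear in the prompt."""
    lowered = prompt.lower()
    return [
        tool
        for tool, keywords in TOOL_KEYWORDS.items()
        if any(keyword.lower() in lowered for keyword in keywords)
    ]


def with_tool(prompt, tool):
    """Append the first keyword for `tool` unless the prompt already has one."""
    if tool in detect_tools(prompt):
        return prompt
    return f"{prompt.rstrip('. ')} using {TOOL_KEYWORDS[tool][0]}."


print(with_tool("Forecast sales for the next month", "bigquery_ml"))
```

A prompt with none of these keywords falls through to the default behavior, which is Python code.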
Examine the results.
Analyze a CSV file
To analyze a CSV using the Data Science Agent in BigQuery, follow these steps.
Go to the BigQuery page.
On the BigQuery Studio welcome page, under Create new, click Notebook.
Alternatively, in the tab bar, click the drop-down arrow next to the + icon, and then click Notebook > Empty notebook.
Click the Toggle Gemini in Colab button to open the chat dialog.
Note: You can move the chat dialog into a separate panel outside the notebook by clicking the Move to panel icon.
Upload your CSV file.
In the chat dialog, click Add to Gemini > Upload.
If necessary, authorize your Google Account.
Browse to the location of the CSV file, and then click Open.
Alternatively, type the + symbol in your prompt to search for available files to upload.
Enter your prompt in the chat window. For example: Identify trends and anomalies in this file.
Click Send. The results appear in the chat window.
You can ask the agent to change the plan, or you can run it by clicking Accept & run. As the plan runs, generated code and text appear in the notebook. Click Cancel to stop.
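For a sense of what an anomaly check on a CSV file can look like, the following hand-written sketch counts missing values per column and flags numeric outliers with the 1.5 x IQR rule. It is not output from the Data Science Agent. The function name, the sample data, and the choice of the IQR rule are assumptions made for illustration.

```python
import csv
import io
import statistics


def profile_csv(text):
    """Count missing values per column and flag numeric outliers (IQR rule)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    report = {}
    columns = list(rows[0].keys()) if rows else []
    for column in columns:
        raw = [row[column] for row in rows]
        missing = sum(1 for value in raw if value is None or value.strip() == "")
        numbers = []
        for value in raw:
            try:
                numbers.append(float(value))
            except (TypeError, ValueError):
                pass  # non-numeric or empty cell
        outliers = []
        if len(numbers) >= 4:
            q1, _, q3 = statistics.quantiles(numbers, n=4, method="inclusive")
            spread = q3 - q1
            low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
            outliers = [x for x in numbers if x < low or x > high]
        report[column] = {"missing": missing, "outliers": outliers}
    return report


# Made-up sample data with one empty cell and one extreme value.
sample = """city,amount
Austin,10
Boston,12
Chicago,
Denver,11
El Paso,13
Fresno,500
"""

report = profile_csv(sample)
print(report["amount"])
```

In this sample, the `amount` column has one missing value and one flagged outlier (500.0).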
Analyze BigQuery tables
To analyze a BigQuery table, choose one or more tables in the table selector, provide a reference to the table in your prompt, or search for a table by using the @ symbol.
Go to the BigQuery page.
On the BigQuery Studio welcome page, under Create new, click Notebook.
Alternatively, in the tab bar, click the drop-down arrow next to the + icon, and then click Notebook > Empty notebook.
Click the Toggle Gemini in Colab button to open the chat dialog.
Note: You can move the chat dialog into a separate panel outside the notebook by clicking the Move to panel icon.
Enter your prompt in the chat window.
Reference your data in one of the following ways:
Choose one or more tables using the table selector:
Click Add to Gemini > BigQuery tables.
In the BigQuery tables window, select one or more tables in your project. You can search for tables across projects and filter tables by using the search bar.
Include a BigQuery table name directly in your prompt. For example: "Help me perform exploratory data analysis and get insights about the data in this table: project_id:dataset.table."
Replace the following:
- project_id: your project ID
- dataset: the name of the dataset that contains the table you're analyzing
- table: the name of the table you're analyzing
Type @ to search for a BigQuery table in your current project.
Click Send.
The results appear in the chat window.
You can ask the agent to change the plan, or you can run it by clicking Accept & run. As the plan runs, generated code and text appear in the notebook. For additional steps in the plan, you may be required to click Accept & run again. Click Cancel to stop.
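When you type a table reference by hand, a small check can catch a malformed project_id:dataset.table string before you send the prompt. The sketch below is an assumption-laden illustration: the regular expression is a simplification rather than BigQuery's full naming rules, and both function names are hypothetical.

```python
import re

# Simplified pattern for project_id:dataset.table.
# It does not cover every name that BigQuery accepts.
_TABLE_REF = re.compile(
    r"^(?P<project>[a-z0-9\-]+):(?P<dataset>\w+)\.(?P<table>\w+)$"
)


def parse_table_ref(ref):
    """Split a project_id:dataset.table reference into its three parts."""
    match = _TABLE_REF.match(ref.strip())
    if match is None:
        raise ValueError(f"expected project_id:dataset.table, got {ref!r}")
    return match.group("project"), match.group("dataset"), match.group("table")


def build_eda_prompt(ref):
    """Build the example prompt from the steps above for a validated reference."""
    parse_table_ref(ref)  # raises ValueError if the reference is malformed
    return (
        "Help me perform exploratory data analysis and get insights "
        f"about the data in this table: {ref}."
    )


# "my-project:sales.orders" is a placeholder reference.
print(build_eda_prompt("my-project:sales.orders"))
```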
Sample prompts
Regardless of the complexity of the prompt that you use, the Data Science Agent generates a plan that you can refine to meet your needs.
The following examples show the types of prompts that you can use with the DSA.
Python prompts
Python code is generated by default unless you use a specific keyword in the prompt such as "BigQuery ML" or "SQL".
- Investigate and fill missing values by using the k-Nearest Neighbors (KNN) machine learning algorithm.
- Create a plot of salary by experience level. Use the experience_level column to group the salaries, and create a box plot for each group showing the values from the salary_in_usd column.
- Use the XGBoost algorithm to make a model for determining the class variable of a particular fruit. Split the data into training and testing datasets to generate a model and to determine the model's accuracy. Create a confusion matrix to show the predictions amongst each class, including all predictions that are correct and incorrect.
- Forecast target_variable from filename.csv for the next six months.
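The confusion matrix requested in the XGBoost prompt above can be illustrated without training a model. The sketch below uses made-up fruit labels in place of real predictions and only the standard library; it is not the code the agent would generate, and the function names are introduced here for illustration.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Return (labels, matrix), where matrix[i][j] counts rows whose actual
    class is labels[i] and whose predicted class is labels[j]."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in labels] for a in labels]
    return labels, matrix


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual class."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Made-up labels standing in for a test set and a model's predictions.
actual = ["apple", "apple", "pear", "pear", "plum", "plum"]
predicted = ["apple", "pear", "pear", "pear", "plum", "apple"]

labels, matrix = confusion_matrix(actual, predicted)
for label, row in zip(labels, matrix):
    print(label, row)
print(accuracy(actual, predicted))
```

Diagonal entries are correct predictions; every other entry is a misclassification from the row's class to the column's class.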
SQL and BigQuery ML prompts
- Create and evaluate a classification model on bigquery-public-data.ml_datasets.census_adult_income using BigQuery SQL.
- Using SQL, forecast the future traffic of my website for the next month based on bigquery-public-data.google_analytics_sample.ga_sessions_*. Then, plot the historical and forecasted values.
- Group similar customers together to create targeted marketing campaigns using a KMeans model and BigQuery ML SQL functions. Use three features for clustering. Then visualize the results by creating a series of 2D scatter plots. Use the table bigquery-public-data.ml_datasets.census_adult_income.
- Generate text embeddings in BigQuery ML using the review content in bigquery-public-data.imdb.reviews.
For a list of supported models and machine learning tasks, see the BigQuery ML documentation.
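As a sketch of the kind of SQL the first prompt above is asking for, the Python below builds a BigQuery ML CREATE MODEL statement and a matching ML.EVALUATE query as strings. Nothing is sent to BigQuery. The model name `my_project.my_dataset.census_model` is a placeholder, and the label column `income_bracket` is an assumption about the public table's schema; the SQL the agent generates may differ.

```python
def create_model_sql(model, source_table, label_column, model_type="LOGISTIC_REG"):
    """Build a BigQuery ML CREATE MODEL statement as a string.

    Identifiers are interpolated directly, so pass trusted values only.
    """
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (model_type = '{model_type}', "
        f"input_label_cols = ['{label_column}']) AS\n"
        f"SELECT * FROM `{source_table}`"
    )


def evaluate_sql(model):
    """Build an ML.EVALUATE query for a trained model."""
    return f"SELECT * FROM ML.EVALUATE(MODEL `{model}`)"


# Placeholder model name; label column assumed from the public dataset.
model_name = "my_project.my_dataset.census_model"
sql = create_model_sql(
    model_name,
    "bigquery-public-data.ml_datasets.census_adult_income",
    "income_bracket",
)
print(sql)
print(evaluate_sql(model_name))
```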
DataFrame prompts
- Create a pandas DataFrame for the data in project_id:dataset.table. Analyze the data for null values, and then graph the distribution of each column using a graph type suited to the column: use violin plots for measured values and bar plots for categories.
- Read filename.csv and construct a DataFrame. Run analysis on the DataFrame to determine what needs to be done with values. For example, are there missing values that need to be replaced or removed, or are there duplicate rows that need to be addressed? Use the data file to determine the distribution of the money invested in USD per city location. Graph the top 20 results using a bar graph that shows the results in descending order as Location versus Avg Amount Invested (USD).
- Create and evaluate a classification model on project_id:dataset.table using BigQuery DataFrames.
- Create a time series forecasting model on project_id:dataset.table using BigQuery DataFrames, and visualize the model evaluations.
- Visualize the sales figures in the past year in BigQuery table project_id:dataset.table using BigQuery DataFrames.
- Find the features that can best predict the penguin species from the table bigquery-public-data.ml_datasets.penguins using BigQuery DataFrames.
PySpark prompts
- Create and evaluate a classification model on project_id:dataset.table using Serverless for Apache Spark.
- Group similar customers together to create targeted marketing campaigns, but first do dimensionality reduction using a PCA model. Use PySpark to do this on table project_id:dataset.table.
Turn off Gemini in BigQuery
To turn off Gemini in BigQuery for a Google Cloud project, an administrator must turn off the Gemini for Google Cloud API. See Disabling services.
To turn off Gemini in BigQuery for a specific user, an administrator needs to revoke the Gemini for Google Cloud User (roles/cloudaicompanion.user) role for that user. See Revoke a single IAM role.
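An administrator who prefers the gcloud CLI could script both actions above roughly as follows. The Python only composes the commands and does not run them. The service name `cloudaicompanion.googleapis.com` is the identifier for the Gemini for Google Cloud API; the function names, project ID, and email address are placeholders for illustration.

```python
import shlex

GEMINI_API = "cloudaicompanion.googleapis.com"
GEMINI_USER_ROLE = "roles/cloudaicompanion.user"


def disable_api_command(project_id):
    """Compose the command that turns off the API for a whole project."""
    return ["gcloud", "services", "disable", GEMINI_API, f"--project={project_id}"]


def revoke_role_command(project_id, user_email):
    """Compose the command that revokes the role from a single user."""
    return [
        "gcloud", "projects", "remove-iam-policy-binding", project_id,
        f"--member=user:{user_email}",
        f"--role={GEMINI_USER_ROLE}",
    ]


# Placeholder project ID and email address.
print(shlex.join(disable_api_command("my-project")))
print(shlex.join(revoke_role_command("my-project", "analyst@example.com")))
```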
Pricing
During Preview, you are charged for running code in the notebook's runtime and for any BigQuery slots you used. For more information, see Colab Enterprise pricing.
Supported regions
To view the supported regions for Colab Enterprise's Data Science Agent, see Locations.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-12-15 UTC.