
Comments

feat: support biglake tables in pandas_gbq.sample #1014

Draft

tswast wants to merge 1 commit into main from tswast-biglake-sample
Conversation

@tswast
Collaborator

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@product-auto-label product-auto-label bot added size: m (Pull request size is medium.), api: bigquery (Issues related to the googleapis/python-bigquery-pandas API.), and samples (Issues that are directly related to samples.) labels on Jan 21, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @tswast, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the pandas_gbq.sample functionality by integrating support for BigLake tables. It introduces a dedicated module for BigLake-specific operations, refactors the existing sampling mechanisms to accommodate different table types, and updates dependencies to support the new BigLake interactions. The overall aim is to broaden the utility of the sampling function to a wider range of BigQuery-related data sources.

Highlights

  • BigLake Table Support: Introduced initial support for BigLake tables within the pandas_gbq.sample function, allowing it to differentiate and handle BigLake table identifiers.
  • New BigLake Utility Module: Added a new module pandas_gbq/core/biglake.py which includes utilities for interacting with BigLake tables, such as a BigLakeTableId dataclass and a get_table_metadata function to fetch table metadata using the Iceberg REST Catalog API.
  • Refactored Sampling Logic: The core sampling logic in pandas_gbq/core/sample.py has been refactored into two new helper functions: _sample_bq_table for standard BigQuery tables and _sample_biglake_table for BigLake tables. The main sample function now dispatches to these helpers based on the table_id format.
  • Dependency Update: Added requests as a new dependency in setup.py and pinned its version in testing/constraints-3.9.txt; it is required for making HTTP requests to the BigLake Iceberg REST Catalog.
  • Internal API Change: The _sample_with_tablesample and _sample_with_limit internal functions now accept a table_id string directly instead of a google.cloud.bigquery.Table object, making them more flexible.



@gemini-code-assist gemini-code-assist[bot] (Contributor) left a comment

Code Review

This pull request adds the initial structure to support sampling from BigLake tables in pandas_gbq.sample. It refactors the existing sampling logic into separate functions for BigQuery and BigLake tables and introduces a new biglake.py module to interact with the BigLake REST API.

My review has identified a few critical issues. The core logic for sampling BigLake tables is not yet implemented. The new biglake.py module contains a critical bug in URL construction and lacks proper error handling for API requests. I've also noted some areas for improvement regarding the robustness of table ID parsing and documentation. Please address these points to complete the feature.

Comment on lines +52 to +61

```python
    return session.get(
        f"{_ICEBERG_REST_CATALOG_URI}.{path}",
        headers={
            "x-goog-user-project": billing_project_id,
            "Content-Type": "application/json; charset=utf-8",
            # TODO(tswast): parameter for this option (or get from catalog metadata?)
            # /iceberg/{$api_version}/restcatalog/extensions/{name=projects/*/catalogs/*}
            "X-Iceberg-Access-Delegation": "vended-credentials",
        },
    ).json()
```

critical

The URL for the BigLake REST API is constructed incorrectly. There's an extra . between the base URI and the path, which will lead to a 404 Not Found error. The path already starts with a /. Additionally, the response from session.get is not checked for errors before attempting to parse it as JSON. This can lead to unhelpful JSONDecodeError exceptions on HTTP failures. You should call response.raise_for_status() to handle non-2xx responses gracefully.

Suggested change:

```python
    response = session.get(
        f"{_ICEBERG_REST_CATALOG_URI}{path}",
        headers={
            "x-goog-user-project": billing_project_id,
            "Content-Type": "application/json; charset=utf-8",
            # TODO(tswast): parameter for this option (or get from catalog metadata?)
            # /iceberg/{$api_version}/restcatalog/extensions/{name=projects/*/catalogs/*}
            "X-Iceberg-Access-Delegation": "vended-credentials",
        },
    )
    response.raise_for_status()
    return response.json()
```

Comment on lines +207 to +216

```python
def _sample_biglake_table(
    *,
    table_id: str,
    credentials: google.oauth2.credentials.Credentials,
    bqclient: google.cloud.bigquery.Client,
    target_bytes: int,
    progress_bar_type: str | None,
    use_bqstorage_api: bool,
) -> Optional[pandas.DataFrame]:
    pass
```

critical

The function _sample_biglake_table is a stub with a pass statement. The PR title suggests support for BigLake tables is being added, but the core implementation is missing. This function needs to be implemented to fulfill the feature's goal. Additionally, corresponding unit tests should be added to verify its functionality.

"""
# https://iceberg.apache.org/spec/#metrics
# total-files-size
project, catalog, namespace, table = table_id.split(".")
high

Unpacking table_id.split('.') directly into four variables is unsafe. If table_id doesn't contain exactly three dots, this will raise a ValueError. It's better to validate the number of parts before unpacking to provide a more informative error message and prevent unexpected crashes.

Suggested change:

```python
    parts = table_id.split(".")
    if len(parts) != 4:
        raise ValueError(
            "Invalid BigLake table ID. Expected format: "
            "project.catalog.namespace.table"
        )
    project, catalog, namespace, table = parts
```

Comment on lines +32 to +33

```python
    """
    Docstring for get_table_metadata
```
medium

The docstring for get_table_metadata is a placeholder. Please add a proper docstring explaining the function's purpose, its parameters (table_id, credentials, billing_project_id), and what it returns. This is important for maintainability and usability.
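A docstring along the lines the reviewer asks for might look like the following sketch. The parameter semantics are inferred from the surrounding review comments (the four-part ID format, the x-goog-user-project billing header, the JSON response), not taken from the actual source.

```python
# Hypothetical docstring sketch for get_table_metadata; details are
# inferred from this review thread, not from the PR's source code.
def get_table_metadata(table_id, credentials, billing_project_id):
    """Fetch table metadata from the BigLake Iceberg REST Catalog.

    Args:
        table_id: Fully qualified BigLake table ID in the form
            ``project.catalog.namespace.table``.
        credentials: OAuth2 credentials used to authorize the request.
        billing_project_id: Project billed for the API call, sent as the
            ``x-goog-user-project`` header.

    Returns:
        The table metadata parsed from the JSON response.
    """
```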

```python
# BigLake tables can't be read directly by the BQ Storage Read API, so make
# sure we run a query first.
parts = table_id.split(".")
if len(parts) == 4:
```
medium

Relying onlen(table_id.split('.')) == 4 to identify a BigLake table is fragile. BigQuery identifiers can contain dots if they are quoted (e.g.,my-project.dataset.with.dot.table). This could lead to misidentifying a standard BigQuery table as a BigLake table. Consider a more robust detection mechanism. For example, you could attempt to parse the ID as a BigLake ID and handle failure, or introduce an explicit parameter to specify the table type.
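One way to realize the "explicit parameter" suggestion is an opt-in flag with a strict fallback parse, as in this sketch. The function name, regex, and parameter are illustrative only; nothing here comes from the PR itself.

```python
import re
from typing import Optional

# Strict, unquoted four-part pattern: project.catalog.namespace.table.
# Quoted identifiers containing extra dots will not match and therefore
# fall through to the regular BigQuery path.
_BIGLAKE_ID = re.compile(r"^[\w-]+\.[\w$]+\.[\w$]+\.[\w$]+$")


def is_biglake_table_id(table_id: str, *, biglake: Optional[bool] = None) -> bool:
    if biglake is not None:
        # Caller stated the table type explicitly; trust it.
        return biglake
    return _BIGLAKE_ID.match(table_id) is not None
```

With the flag, callers who know their table type never depend on the heuristic; without it, only an exact unquoted four-part match is treated as BigLake.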

