
Comments

feat: support biglake tables in pandas_gbq.sample #1014

Draft

tswast wants to merge 1 commit into main from tswast-biglake-sample
Conversation

@tswast
Collaborator

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@product-auto-label product-auto-label bot added size: m (Pull request size is medium.), api: bigquery (Issues related to the googleapis/python-bigquery-pandas API.), and samples (Issues that are directly related to samples.) labels on Jan 21, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @tswast, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the pandas_gbq.sample functionality by integrating support for BigLake tables. It introduces a dedicated module for BigLake-specific operations, refactors the existing sampling mechanisms to accommodate different table types, and updates dependencies to support the new BigLake interactions. The overall aim is to broaden the utility of the sampling function to a wider range of BigQuery-related data sources.

Highlights

  • BigLake Table Support: Introduced initial support for BigLake tables within the pandas_gbq.sample function, allowing it to differentiate and handle BigLake table identifiers.
  • New BigLake Utility Module: Added a new module pandas_gbq/core/biglake.py which includes utilities for interacting with BigLake tables, such as a BigLakeTableId dataclass and a get_table_metadata function to fetch table metadata using the Iceberg REST Catalog API.
  • Refactored Sampling Logic: The core sampling logic in pandas_gbq/core/sample.py has been refactored into two new helper functions: _sample_bq_table for standard BigQuery tables and _sample_biglake_table for BigLake tables. The main sample function now dispatches to these helpers based on the table_id format.
  • Dependency Update: Added requests as a new dependency in setup.py and pinned its version in testing/constraints-3.9.txt; it is required for making HTTP requests to the BigLake Iceberg REST Catalog.
  • Internal API Change: The _sample_with_tablesample and _sample_with_limit internal functions now accept a table_id string directly instead of a google.cloud.bigquery.Table object, making them more flexible.



@gemini-code-assist gemini-code-assist[bot] (Contributor) left a comment

Code Review

This pull request adds the initial structure to support sampling from BigLake tables in pandas_gbq.sample. It refactors the existing sampling logic into separate functions for BigQuery and BigLake tables and introduces a new biglake.py module to interact with the BigLake REST API.

My review has identified a few critical issues. The core logic for sampling BigLake tables is not yet implemented. The new biglake.py module contains a critical bug in URL construction and lacks proper error handling for API requests. I've also noted some areas for improvement regarding the robustness of table ID parsing and documentation. Please address these points to complete the feature.

Comment on lines +52 to +61

```python
    return session.get(
        f"{_ICEBERG_REST_CATALOG_URI}.{path}",
        headers={
            "x-goog-user-project": billing_project_id,
            "Content-Type": "application/json; charset=utf-8",
            # TODO(tswast): parameter for this option (or get from catalog metadata?)
            # /iceberg/{$api_version}/restcatalog/extensions/{name=projects/*/catalogs/*}
            "X-Iceberg-Access-Delegation": "vended-credentials",
        },
    ).json()
```

critical

The URL for the BigLake REST API is constructed incorrectly. There's an extra . between the base URI and the path, which will lead to a 404 Not Found error. The path already starts with a /. Additionally, the response from session.get is not checked for errors before attempting to parse it as JSON. This can lead to unhelpful JSONDecodeError exceptions on HTTP failures. You should call response.raise_for_status() to handle non-2xx responses gracefully.

Suggested change:

```python
    response = session.get(
        f"{_ICEBERG_REST_CATALOG_URI}{path}",
        headers={
            "x-goog-user-project": billing_project_id,
            "Content-Type": "application/json; charset=utf-8",
            # TODO(tswast): parameter for this option (or get from catalog metadata?)
            # /iceberg/{$api_version}/restcatalog/extensions/{name=projects/*/catalogs/*}
            "X-Iceberg-Access-Delegation": "vended-credentials",
        },
    )
    response.raise_for_status()
    return response.json()
```

Comment on lines +207 to +216

```python
def _sample_biglake_table(
    *,
    table_id: str,
    credentials: google.oauth2.credentials.Credentials,
    bqclient: google.cloud.bigquery.Client,
    target_bytes: int,
    progress_bar_type: str | None,
    use_bqstorage_api: bool,
) -> Optional[pandas.DataFrame]:
    pass
```

critical

The function _sample_biglake_table is a stub with a pass statement. The PR title suggests support for BigLake tables is being added, but the core implementation is missing. This function needs to be implemented to fulfill the feature's goal. Additionally, corresponding unit tests should be added to verify its functionality.

"""
# https://iceberg.apache.org/spec/#metrics
# total-files-size
project, catalog, namespace, table = table_id.split(".")
high

Unpacking table_id.split('.') directly into four variables is unsafe. If table_id doesn't contain exactly three dots, this will raise a ValueError. It's better to validate the number of parts before unpacking to provide a more informative error message and prevent unexpected crashes.

Suggested change:

```python
    parts = table_id.split(".")
    if len(parts) != 4:
        raise ValueError(
            "Invalid BigLake table ID. Expected format: "
            "project.catalog.namespace.table"
        )
    project, catalog, namespace, table = parts
```

Comment on lines +32 to +33

```python
    """
    Docstring for get_table_metadata
```
medium

The docstring for get_table_metadata is a placeholder. Please add a proper docstring explaining the function's purpose, its parameters (table_id, credentials, billing_project_id), and what it returns. This is important for maintainability and usability.
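A docstring along the lines the reviewer asks for might look like the following sketch. The parameter semantics are inferred from the surrounding review comments (the four-part ID format, the x-goog-user-project billing header, the JSON response), not taken from the actual source.

```python
# Hypothetical docstring sketch for get_table_metadata; details are
# inferred from this review thread, not from the PR's source code.
def get_table_metadata(table_id, credentials, billing_project_id):
    """Fetch table metadata from the BigLake Iceberg REST Catalog.

    Args:
        table_id: Fully qualified BigLake table ID in the form
            ``project.catalog.namespace.table``.
        credentials: OAuth2 credentials used to authorize the request.
        billing_project_id: Project billed for the API call, sent as the
            ``x-goog-user-project`` header.

    Returns:
        The table metadata parsed from the JSON response.
    """
```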

```python
# BigLake tables can't be read directly by the BQ Storage Read API, so make
# sure we run a query first.
parts = table_id.split(".")
if len(parts) == 4:
```
medium

Relying onlen(table_id.split('.')) == 4 to identify a BigLake table is fragile. BigQuery identifiers can contain dots if they are quoted (e.g.,my-project.dataset.with.dot.table). This could lead to misidentifying a standard BigQuery table as a BigLake table. Consider a more robust detection mechanism. For example, you could attempt to parse the ID as a BigLake ID and handle failure, or introduce an explicit parameter to specify the table type.
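One way to realize the "explicit parameter" suggestion is an opt-in flag with a strict fallback parse, as in this sketch. The function name, regex, and parameter are illustrative only; nothing here comes from the PR itself.

```python
import re
from typing import Optional

# Strict, unquoted four-part pattern: project.catalog.namespace.table.
# Quoted identifiers containing extra dots will not match and therefore
# fall through to the regular BigQuery path.
_BIGLAKE_ID = re.compile(r"^[\w-]+\.[\w$]+\.[\w$]+\.[\w$]+$")


def is_biglake_table_id(table_id: str, *, biglake: Optional[bool] = None) -> bool:
    if biglake is not None:
        # Caller stated the table type explicitly; trust it.
        return biglake
    return _BIGLAKE_ID.match(table_id) is not None
```

With the flag, callers who know their table type never depend on the heuristic; without it, only an exact unquoted four-part match is treated as BigLake.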

