Use `job_config.schema` for data type conversion if specified in `load_table_from_dataframe`. #8105


Conversation

@tswast (Contributor) commented on May 22, 2019 (edited):

Use the BigQuery schema to inform encoding of the file used in the load job. This fixes an issue where a dataframe with ambiguous types (such as an `object` column containing all `None` values) could not be appended to an existing table, since the schemas wouldn't match in most cases.

Closes #7370.
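For context, a minimal sketch of the usage this change enables; the project, dataset, and table names are illustrative, and it assumes a client version that accepts string table IDs for the destination:

import pandas
from google.cloud import bigquery

client = bigquery.Client()

# An object column containing only None values gives pandas/pyarrow
# nothing to infer a type from.
dataframe = pandas.DataFrame({"x": [1, 2, 3], "y": [None, None, None]})

# With this change, the explicit schema drives the Parquet encoding, so
# the load matches the destination table instead of failing on a schema
# mismatch.
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("x", "INTEGER"),
        bigquery.SchemaField("y", "STRING"),
    ],
    write_disposition="WRITE_APPEND",
)
client.load_table_from_dataframe(
    dataframe, "my-project.my_dataset.my_table", job_config=job_config
).result()  # Wait for the load to complete.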

@googlebot added the cla: yes label (This human has signed the Contributor License Agreement.) on May 22, 2019
@tswast marked this pull request as ready for review on May 22, 2019 23:36
@tswast requested a review from a team on May 22, 2019 23:36
@tswast added the api: bigquery label (Issues related to the BigQuery API.) on May 22, 2019
@tswast changed the title from "Use `job_config.schema` if specified in `load_table_from_dataframe`." to "Use `job_config.schema` for data type conversion if specified in `load_table_from_dataframe`." on May 22, 2019
@tswast force-pushed the issue7370-b132658518-load-dataframe-nulls branch from 4c90553 to 2e8290e on May 22, 2019 23:45
@tswast requested a review from shollyman on May 23, 2019 14:17
raiseValueError("pyarrow is required for BigQuery schema conversion")

iflen(bq_schema)!=len(dataframe.columns):
raiseValueError(
@tswast (Contributor, Author):

Note from chat: Maybe we want to allow the bq_schema to be used as an override? Any unmentioned columns get the default pandas type inference.

This is how pandas-gbq works. The schema argument is used more as an override for when a particular column is ambiguous.
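A hypothetical sketch of that override behavior; `augment_bq_schema` and the dtype map are illustrative, not code from this PR:

from google.cloud import bigquery

# Rough dtype-to-BigQuery mapping, for illustration only.
_DTYPE_TO_BQ = {"int64": "INTEGER", "float64": "FLOAT", "bool": "BOOLEAN"}

def augment_bq_schema(dataframe, bq_schema):
    """Treat bq_schema as an override: columns it names keep the
    user-supplied type; all other columns fall back to dtype
    inference, the way pandas-gbq treats its table_schema argument.
    """
    overrides = {field.name: field for field in bq_schema}
    augmented = []
    for column in dataframe.columns:
        if column in overrides:
            augmented.append(overrides[column])
        else:
            bq_type = _DTYPE_TO_BQ.get(dataframe[column].dtype.name, "STRING")
            augmented.append(bigquery.SchemaField(column, bq_type))
    return augmented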

@tswast (Contributor, Author):

On second thought, let's leave this as-is and fix it up later. Filed #8140 as a feature request.


try:
    dataframe.to_parquet(tmppath)
if job_config.schema:
@tswast (Contributor, Author):

Note from chat: if schema isn't populated, we might want to call get_table and use the table's schema if the table already exists and we're appending to it. (This is what pandas-gbq does.)
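A rough sketch of that fallback, with error handling reduced to the NotFound case; the `schema_for_load` helper is illustrative:

from google.api_core import exceptions

def schema_for_load(client, destination, job_config):
    """Prefer an explicit schema; else reuse an existing table's schema."""
    if job_config.schema:
        return job_config.schema
    try:
        # When appending to an existing table, its schema resolves any
        # ambiguous dataframe columns, just as pandas-gbq does.
        return client.get_table(destination).schema
    except exceptions.NotFound:
        # New table: fall back to pandas/pyarrow type inference.
        return None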

@tswast (Contributor, Author):

Ditto. Filed #8142. I think this would make a good feature, but it shouldn't block this PR.

@shollyman (Contributor) left a comment:

Everything seems reasonable, but the multiple conversions make me a bit twitchy. The part that I didn't verify is the parquet-to-bq type mappings: is there anything special we need to do, similar to the Avro logical-type annotations, to get the correct type mappings from there?
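For what it's worth, pyarrow writes Parquet logical-type annotations automatically when the Arrow fields are explicitly typed, so no extra Avro-style annotation step should be needed; a small sketch (file path is illustrative):

import pyarrow
import pyarrow.parquet

# timestamp("us", tz="UTC") is written as a Parquet TIMESTAMP(MICROS)
# column and date32() as a Parquet DATE column; BigQuery maps those
# back to TIMESTAMP and DATE when loading the file.
arrow_table = pyarrow.Table.from_arrays(
    [
        pyarrow.array([0], type=pyarrow.timestamp("us", tz="UTC")),
        pyarrow.array([0], type=pyarrow.date32()),
    ],
    names=["ts", "d"],
)
pyarrow.parquet.write_table(arrow_table, "/tmp/logical_types.parquet")
print(pyarrow.parquet.read_schema("/tmp/logical_types.parquet"))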



BQ_TO_ARROW_SCALARS = {}
if pyarrow is not None:  # pragma: NO COVER

Reviewer:

Consider this format, which is more idiomatic:

if pyarrow:
    MY_CONST = {
        ...
    }
else:
    MY_CONST = {}

Note two things: (1) all initialization is inside a branch and (2) we no longer use "is not None" or "is None".


if len(bq_schema) != len(dataframe.columns):
    raise ValueError(
        "Number of columns in schema must match number of columns in dataframe"

Reviewer:

Period?

@tswast (Contributor, Author):

Done.

"STRING":pyarrow.string,
"TIME":pyarrow_time,
"TIMESTAMP":pyarrow_timestamp,
}

Reviewer:

Is there a list somewhere that defines BQ types? I wonder if we can add an assertion here that BQ_TO_ARROW_SCALARS.keys() == BQ_TYPES.keys(), so we have a better guarantee that all types are accounted for.
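If such a list existed, the guard could be a one-line test; `BQ_TYPES` below is hypothetical, since (per the reply below) no canonical list ships with the library yet:

# Hypothetical: a canonical set of BigQuery standard SQL type names.
BQ_TYPES = {
    "BOOLEAN", "BYTES", "DATE", "DATETIME", "FLOAT", "GEOGRAPHY",
    "INTEGER", "NUMERIC", "STRING", "TIME", "TIMESTAMP",
}

def test_bq_to_arrow_scalars_covers_all_types(module_under_test):
    # Fails loudly when a BigQuery type is missing a mapping (or gains
    # a stale one).
    assert set(module_under_test.BQ_TO_ARROW_SCALARS) == BQ_TYPES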

@tswast (Contributor, Author):

Not yet. There's an open FR at #7632. I've been hesitant to add such a list, since it's yet another thing to keep in sync manually, but I agree it'd be useful for cases such as this.

Returns None if default Arrow type inspection should be used.
"""
# TODO: Use pyarrow.list_(item_type) for repeated (array) fields.

Reviewer:

It would be good to include a little more context in the TODO comment as to why we are not adding support for these in this change.

@tswast (Contributor, Author):

There wasn't a good reason before, so I implemented this.

I tried adding it to the system tests, but now I see there are some open issues in pyarrow that are preventing this support. I think REPEATED support may get fixed when #8093 is fixed, since there's a mode mismatch right now (fields are always set to nullable in the parquet file).

Struct support depends on https://jira.apache.org/jira/browse/ARROW-2587. I've filed https://github.com/googleapis/google-cloud-python/issues/8191 to track this as an open issue.
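The repeated-field mapping itself is a thin wrapper over the scalar type; a sketch of the idea (`arrow_type_for_mode` is an illustrative name):

import pyarrow

def arrow_type_for_mode(scalar_type, bq_mode):
    """Wrap a scalar Arrow type for REPEATED (ARRAY) BigQuery fields."""
    if bq_mode.upper() == "REPEATED":
        # Note the mode mismatch described above: pyarrow writes list
        # items as nullable in the Parquet file, which is what currently
        # blocks round-tripping REPEATED columns.
        return pyarrow.list_(scalar_type)
    return scalar_type

print(arrow_type_for_mode(pyarrow.int64(), "REPEATED"))  # list<item: int64>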

)
num_rows = 100
nulls = [None] * num_rows
dataframe = pandas.DataFrame(

Reviewer:

Nit: I would suggest putting in non-null values for the sample data to make the test more complete.

@tswast (Contributor, Author):

The bug actually only shows up when the whole column contains nulls, because when at least one non-null value is present, the pandas auto-detection code works correctly. I do include non-nulls in the unit tests.
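The inference gap is easy to demonstrate with pyarrow directly (a sketch; the exact inferred type names can vary by version):

import pandas
import pyarrow

df = pandas.DataFrame({"all_nulls": [None, None], "mixed": ["a", None]})

# "mixed" has a non-null value, so it is inferred as string; "all_nulls"
# has nothing to infer from and typically comes out as the Arrow "null"
# type, which won't match a STRING column in an existing table.
print(pyarrow.Table.from_pandas(df, preserve_index=False).schema)

# An explicit schema removes the ambiguity.
explicit = pyarrow.schema(
    [
        pyarrow.field("all_nulls", pyarrow.string()),
        pyarrow.field("mixed", pyarrow.string()),
    ]
)
print(pyarrow.Table.from_pandas(df, schema=explicit, preserve_index=False).schema)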

table = retry_403(Config.CLIENT.create_table)(
    Table(table_id, schema=table_schema)
)
self.to_delete.insert(0, table)

Reviewer:

Is there a reason why we prepend the table ref to to_delete instead of appending it?

@tswast (Contributor, Author):

So that the table gets deleted before the dataset does.
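In other words, teardown walks to_delete from the front, so prepending keeps deletions child-first; a minimal sketch of the pattern:

to_delete = []

to_delete.insert(0, "my_dataset")           # created first, deleted last
to_delete.insert(0, "my_dataset.my_table")  # created last, deleted first

# Teardown iterates front to back: the table goes before its dataset.
for resource in to_delete:
    print("deleting", resource)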

try:
    import pandas
except ImportError:  # pragma: NO COVER
    pandas = None

Reviewer:

If you ever get tired of the try/except pattern, you can write a maybe_import(*args) function that returns a tuple of modules:

def maybe_import(*args):
    modules = []
    for arg in args:
        try:
            modules.append(__import__(arg))
        except ImportError:
            return tuple([None] * len(args))
    return tuple(modules)
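Usage of the suggested helper would look like this (note that __import__("a.b") returns the top-level package "a", so dotted paths need extra care):

pandas, pyarrow = maybe_import("pandas", "pyarrow")

if pandas is None:
    # Either import failed, so the helper returned (None, None).
    raise ImportError("pandas and pyarrow are required for this feature")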

@pytest.mark.skipIf(pyarrow is None, "Requires `pyarrow`")
def test_bq_to_arrow_data_type(module_under_test, bq_type, bq_mode, is_correct_type):
    field = schema.SchemaField("ignored_name", bq_type, mode=bq_mode)
    got = module_under_test.bq_to_arrow_data_type(field)

Reviewer:

s/got/actual/?

@tswast (Contributor, Author):

Done.

…d_table_from_dataframe`. Use the BigQuery schema to inform encoding of the file used in the load job. This fixes an issue where a dataframe with ambiguous types (such as an `object` column containing all `None` values) could not be appended to an existing table, since the schemas wouldn't match in most cases.
@tswast force-pushed the issue7370-b132658518-load-dataframe-nulls branch from 2e8290e to aa38e42 on May 30, 2019 00:38
@tswast (Contributor, Author) left a comment:

Thanks for the review, @aryann!

I tried to add support for REPEATED and RECORD columns, but hit some roadblocks. I'll follow up with those types.

Note: Since I did add partial support, I know test coverage will fail. I'll add a commit with additional tests before submitting.


@tswast requested a review from shollyman on May 30, 2019 21:06
@tswast (Contributor, Author) commented:
@aryann Just pushed some commits that get unit test coverage back to 100%. Added a system test for non-null scalar values + explicit schema.

@pytest.fixture
def module_under_test():
    from google.cloud.bigquery import _pandas_helpers

Reviewer:

Nit: rm empty line?

@tswast (Contributor, Author):

Thanks. We actually don't have much control over this, since black (the Python code formatter) will just add it back.


Reviewers

@shollyman: awaiting requested review
@aryann: approved these changes


Labels

api: bigquery (Issues related to the BigQuery API.)
cla: yes (This human has signed the Contributor License Agreement.)


Development

Successfully merging this pull request may close these issues:

BigQuery: Use BigQuery schema (from LoadJobConfig) if available when converting to Parquet in load_table_from_dataframe

4 participants: @tswast, @shollyman, @aryann, @googlebot
