Use `job_config.schema` for data type conversion if specified in `load_table_from_dataframe`. #8105


Conversation

@tswast (Contributor) commented on May 22, 2019 (edited):

Use the BigQuery schema to inform encoding of the file used in the load job. This fixes an issue where a dataframe with ambiguous types (such as an `object` column containing all `None` values) could not be appended to an existing table, since the schemas wouldn't match in most cases.

Closes #7370.
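For context, a minimal sketch of the usage this change enables; the project, dataset, and table names are illustrative, and it assumes a client version that accepts string table IDs for the destination:

import pandas
from google.cloud import bigquery

client = bigquery.Client()

# An object column containing only None values gives pandas/pyarrow
# nothing to infer a type from.
dataframe = pandas.DataFrame({"x": [1, 2, 3], "y": [None, None, None]})

# With this change, the explicit schema drives the Parquet encoding, so
# the load matches the destination table instead of failing on a schema
# mismatch.
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("x", "INTEGER"),
        bigquery.SchemaField("y", "STRING"),
    ],
    write_disposition="WRITE_APPEND",
)
client.load_table_from_dataframe(
    dataframe, "my-project.my_dataset.my_table", job_config=job_config
).result()  # Wait for the load to complete.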

@googlebot added the cla: yes label (This human has signed the Contributor License Agreement.) on May 22, 2019
@tswast marked this pull request as ready for review on May 22, 2019 23:36
@tswast requested a review from a team on May 22, 2019 23:36
@tswast added the api: bigquery label (Issues related to the BigQuery API.) on May 22, 2019
@tswast changed the title from "Use `job_config.schema` if specified in `load_table_from_dataframe`." to "Use `job_config.schema` for data type conversion if specified in `load_table_from_dataframe`." on May 22, 2019
@tswast force-pushed the issue7370-b132658518-load-dataframe-nulls branch from 4c90553 to 2e8290e on May 22, 2019 23:45
@tswast requested a review from shollyman on May 23, 2019 14:17
raiseValueError("pyarrow is required for BigQuery schema conversion")

iflen(bq_schema)!=len(dataframe.columns):
raiseValueError(
@tswast (Contributor, Author):

Note from chat: Maybe we want to allow the bq_schema to be used as an override? Any unmentioned columns get the default pandas type inference.

This is how pandas-gbq works. The schema argument is used more as an override for when a particular column is ambiguous.
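A hypothetical sketch of that override behavior; `augment_bq_schema` and the dtype map are illustrative, not code from this PR:

from google.cloud import bigquery

# Rough dtype-to-BigQuery mapping, for illustration only.
_DTYPE_TO_BQ = {"int64": "INTEGER", "float64": "FLOAT", "bool": "BOOLEAN"}

def augment_bq_schema(dataframe, bq_schema):
    """Treat bq_schema as an override: columns it names keep the
    user-supplied type; all other columns fall back to dtype
    inference, the way pandas-gbq treats its table_schema argument.
    """
    overrides = {field.name: field for field in bq_schema}
    augmented = []
    for column in dataframe.columns:
        if column in overrides:
            augmented.append(overrides[column])
        else:
            bq_type = _DTYPE_TO_BQ.get(dataframe[column].dtype.name, "STRING")
            augmented.append(bigquery.SchemaField(column, bq_type))
    return augmented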

@tswast (Contributor, Author):

On second thought, let's leave this as-is and fix it up later. Filed #8140 as a feature request.


try:
    dataframe.to_parquet(tmppath)
if job_config.schema:
@tswast (Contributor, Author):

Note from chat: if schema isn't populated, we might want to call get_table and use the table's schema if the table already exists and we're appending to it. (This is what pandas-gbq does.)
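A rough sketch of that fallback, with error handling reduced to the NotFound case; the `schema_for_load` helper is illustrative:

from google.api_core import exceptions

def schema_for_load(client, destination, job_config):
    """Prefer an explicit schema; else reuse an existing table's schema."""
    if job_config.schema:
        return job_config.schema
    try:
        # When appending to an existing table, its schema resolves any
        # ambiguous dataframe columns, just as pandas-gbq does.
        return client.get_table(destination).schema
    except exceptions.NotFound:
        # New table: fall back to pandas/pyarrow type inference.
        return None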

@tswast (Contributor, Author):

Ditto. Filed #8142. I think this would make a good feature, but it shouldn't block this PR.

@shollyman (Contributor) left a comment:

Everything seems reasonable, but the multiple conversions make me a bit twitchy. The part that I didn't verify is the parquet-to-bq type mappings: is there anything special we need to do, similar to the Avro logical-type annotations, to get the correct type mappings from there?
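For what it's worth, pyarrow writes Parquet logical-type annotations automatically when the Arrow fields are explicitly typed, so no extra Avro-style annotation step should be needed; a small sketch (file path is illustrative):

import pyarrow
import pyarrow.parquet

# timestamp("us", tz="UTC") is written as a Parquet TIMESTAMP(MICROS)
# column and date32() as a Parquet DATE column; BigQuery maps those
# back to TIMESTAMP and DATE when loading the file.
arrow_table = pyarrow.Table.from_arrays(
    [
        pyarrow.array([0], type=pyarrow.timestamp("us", tz="UTC")),
        pyarrow.array([0], type=pyarrow.date32()),
    ],
    names=["ts", "d"],
)
pyarrow.parquet.write_table(arrow_table, "/tmp/logical_types.parquet")
print(pyarrow.parquet.read_schema("/tmp/logical_types.parquet"))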



BQ_TO_ARROW_SCALARS = {}
if pyarrow is not None:  # pragma: NO COVER

Reviewer:

Consider this format, which is more idiomatic:

if pyarrow:
    MY_CONST = {
        ...
    }
else:
    MY_CONST = {}

Note two things: (1) all initialization is inside a branch and (2) we no longer use "is not None" or "is None".


if len(bq_schema) != len(dataframe.columns):
    raise ValueError(
        "Number of columns in schema must match number of columns in dataframe"

Reviewer:

Period?

@tswast (Contributor, Author):

Done.

"STRING":pyarrow.string,
"TIME":pyarrow_time,
"TIMESTAMP":pyarrow_timestamp,
}

Reviewer:

Is there a list somewhere that defines BQ types? I wonder if we can add an assertion here that BQ_TO_ARROW_SCALARS.keys() == BQ_TYPES.keys(), so we have a better guarantee that all types are accounted for.
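If such a list existed, the guard could be a one-line test; `BQ_TYPES` below is hypothetical, since (per the reply below) no canonical list ships with the library yet:

# Hypothetical: a canonical set of BigQuery standard SQL type names.
BQ_TYPES = {
    "BOOLEAN", "BYTES", "DATE", "DATETIME", "FLOAT", "GEOGRAPHY",
    "INTEGER", "NUMERIC", "STRING", "TIME", "TIMESTAMP",
}

def test_bq_to_arrow_scalars_covers_all_types(module_under_test):
    # Fails loudly when a BigQuery type is missing a mapping (or gains
    # a stale one).
    assert set(module_under_test.BQ_TO_ARROW_SCALARS) == BQ_TYPES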

@tswast (Contributor, Author):

Not yet. There's an open FR at #7632. I've been hesitant to add such a list, since it's yet another thing to keep in sync manually, but I agree it'd be useful for cases such as this.

Returns None if default Arrow type inspection should be used.
"""
# TODO: Use pyarrow.list_(item_type) for repeated (array) fields.

Reviewer:

It would be good to include a little more context in the TODO comment as to why we are not adding support for these in this change.

@tswast (Contributor, Author):

There wasn't a good reason before, so I implemented this.

I tried adding it to the system tests, but now I see there are some open issues in pyarrow that are preventing this support. I think REPEATED support may get fixed when #8093 is fixed, since there's a mode mismatch right now (fields are always set to nullable in the parquet file).

Struct support depends on https://jira.apache.org/jira/browse/ARROW-2587. I've filed https://github.com/googleapis/google-cloud-python/issues/8191 to track this as an open issue.
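The repeated-field mapping itself is a thin wrapper over the scalar type; a sketch of the idea (`arrow_type_for_mode` is an illustrative name):

import pyarrow

def arrow_type_for_mode(scalar_type, bq_mode):
    """Wrap a scalar Arrow type for REPEATED (ARRAY) BigQuery fields."""
    if bq_mode.upper() == "REPEATED":
        # Note the mode mismatch described above: pyarrow writes list
        # items as nullable in the Parquet file, which is what currently
        # blocks round-tripping REPEATED columns.
        return pyarrow.list_(scalar_type)
    return scalar_type

print(arrow_type_for_mode(pyarrow.int64(), "REPEATED"))  # list<item: int64>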

)
num_rows = 100
nulls = [None] * num_rows
dataframe = pandas.DataFrame(

Reviewer:

Nit: I would suggest putting in non-null values for the sample data to make the test more complete.

@tswast (Contributor, Author):

The bug actually only shows up when the whole column contains nulls, because when at least one non-null value is present, the pandas auto-detection code works correctly. I do include non-nulls in the unit tests.
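The inference gap is easy to demonstrate with pyarrow directly (a sketch; the exact inferred type names can vary by version):

import pandas
import pyarrow

df = pandas.DataFrame({"all_nulls": [None, None], "mixed": ["a", None]})

# "mixed" has a non-null value, so it is inferred as string; "all_nulls"
# has nothing to infer from and typically comes out as the Arrow "null"
# type, which won't match a STRING column in an existing table.
print(pyarrow.Table.from_pandas(df, preserve_index=False).schema)

# An explicit schema removes the ambiguity.
explicit = pyarrow.schema(
    [
        pyarrow.field("all_nulls", pyarrow.string()),
        pyarrow.field("mixed", pyarrow.string()),
    ]
)
print(pyarrow.Table.from_pandas(df, schema=explicit, preserve_index=False).schema)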

table = retry_403(Config.CLIENT.create_table)(
    Table(table_id, schema=table_schema)
)
self.to_delete.insert(0, table)

Reviewer:

Is there a reason why we prepend the table ref to to_delete instead of appending it?

@tswast (Contributor, Author):

So that the table gets deleted before the dataset does.
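In other words, teardown walks to_delete from the front, so prepending keeps deletions child-first; a minimal sketch of the pattern:

to_delete = []

to_delete.insert(0, "my_dataset")           # created first, deleted last
to_delete.insert(0, "my_dataset.my_table")  # created last, deleted first

# Teardown iterates front to back: the table goes before its dataset.
for resource in to_delete:
    print("deleting", resource)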

try:
    import pandas
except ImportError:  # pragma: NO COVER
    pandas = None

Reviewer:

If you ever get tired of the try/except pattern, you can write a maybe_import(*args) function that returns a tuple of modules:

def maybe_import(*args):
    modules = []
    for arg in args:
        try:
            modules.append(__import__(arg))
        except ImportError:
            return tuple([None] * len(args))
    return tuple(modules)
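Usage of the suggested helper would look like this (note that __import__("a.b") returns the top-level package "a", so dotted paths need extra care):

pandas, pyarrow = maybe_import("pandas", "pyarrow")

if pandas is None:
    # Either import failed, so the helper returned (None, None).
    raise ImportError("pandas and pyarrow are required for this feature")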

@pytest.mark.skipIf(pyarrow is None, "Requires `pyarrow`")
def test_bq_to_arrow_data_type(module_under_test, bq_type, bq_mode, is_correct_type):
    field = schema.SchemaField("ignored_name", bq_type, mode=bq_mode)
    got = module_under_test.bq_to_arrow_data_type(field)

Reviewer:

s/got/actual/?

@tswast (Contributor, Author):

Done.

…d_table_from_dataframe`. Use the BigQuery schema to inform encoding of the file used in the load job. This fixes an issue where a dataframe with ambiguous types (such as an `object` column containing all `None` values) could not be appended to an existing table, since the schemas wouldn't match in most cases.
@tswast force-pushed the issue7370-b132658518-load-dataframe-nulls branch from 2e8290e to aa38e42 on May 30, 2019 00:38
@tswast (Contributor, Author) left a comment:

Thanks for the review, @aryann!

I tried to add support for REPEATED and RECORD columns, but hit some roadblocks. I'll follow up with those types.

Note: Since I did add partial support, I know test coverage will fail. I'll add a commit with additional tests before submitting.


@tswast requested a review from shollyman on May 30, 2019 21:06
@tswast (Contributor, Author) commented:
@aryann Just pushed some commits that get unit test coverage back to 100%. Added a system test for non-null scalar values + explicit schema.

@pytest.fixture
def module_under_test():
    from google.cloud.bigquery import _pandas_helpers

Reviewer:

Nit: rm empty line?

@tswast (Contributor, Author):

Thanks. We actually don't have much control over this, since black (the Python code formatter) will just add it back.


Reviewers

@shollyman: awaiting requested review
@aryann: approved these changes


Labels

api: bigquery (Issues related to the BigQuery API.)
cla: yes (This human has signed the Contributor License Agreement.)


Development

Successfully merging this pull request may close these issues:

BigQuery: Use BigQuery schema (from LoadJobConfig) if available when converting to Parquet in load_table_from_dataframe

4 participants: @tswast, @shollyman, @aryann, @googlebot
