Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Cloud fetch does not guarantee correct column names on queries when the plan is cached #522

Closed
Assignees
madhav-db
@ewengillies

Description

@ewengillies

Motivation

We use Databricks SQL to grab data regularly. We do this interactively, so sometimes we slightly tweak queries and rerun them. Something about the caching involved inuse_cloud_fetch=True can mean our returned query does not have the correct column names when aliases are used. Basic way to trigger this is:

  1. Ensureuse_cloud_fetch=True in the connection.
  2. Write a query with a non-trivial plan against a managed delta table.
  3. Ensure the returned data is large enough to actually utilize cloud fetch (not 100% sure if this is important).
  4. Have an alias in the finalselect statement, something likesome_column as a .
  5. Run the query.
  6. Rename the alias, i.e.some_column as b.
  7. Rerun the query.
  8. Note that the query run in 7 has the schema (column names) of the query run in 5.
  9. This means thatsome_column is still calleda and not calledb in query 7.

Minimum Reproducible Example

I made an MRE for convenience using the taxi dataset. This is with version 4. I've seen it consistently in version 3 as well. Haven't tested 2. Here is mypip freeze for v4 setup from minimal install ofpip install 'databricks-sql-connector[pyarrow]' :

certifi==2025.1.31charset-normalizer==3.4.1databricks-sql-connector==4.0.0et_xmlfile==2.0.0idna==3.10lz4==4.4.3numpy==1.26.4oauthlib==3.2.2openpyxl==3.1.5pandas==2.2.3pyarrow==19.0.1python-dateutil==2.9.0.post0pytz==2025.1requests==2.32.3six==1.17.0thrift==0.20.0tzdata==2025.1urllib3==2.3.0

Here is the code. I generate an example managed table from thesamples.nyctaxi.trips data so its truly reproducible:https://docs.databricks.com/aws/en/discover/databricks-datasets#unity-catalog-datasets

fromdatabricks.sqlimportconnectSERVER_HOSTNAME='<some_server>'HTTP_PATH='/sql/1.0/warehouses/<some_warehouse_id>'ACCESS_TOKEN='<some_access_token>'EXAMPLE_TABLE_NAME='<some_catalog>.<some_schema>.minimum_example_table'defrun_query_via_databricks_sql(query:str,use_cloud_fetch:bool=False,):withconnect(server_hostname=SERVER_HOSTNAME,http_path=HTTP_PATH,access_token=ACCESS_TOKEN,use_cloud_fetch=use_cloud_fetch,    )asconnection:withconnection.cursor()ascursor:cursor.execute(query)data=cursor.fetchall_arrow()data=data.to_pandas()returndata# Create a managed table to play with from the Delta Shared table# We do a cross join here just to generate enough rows to trigger# cloud fetch and hence the bug.print("Creating the example table")create_the_example_table=f"""create or replace table{EXAMPLE_TABLE_NAME} as SELECT   some_trips.tpep_pickup_datetime,   some_trips.tpep_dropoff_datetime,   some_trips.trip_distance,   some_trips.fare_amount,   some_trips.pickup_zip,   some_trips.dropoff_zipFROM  samples.nyctaxi.trips as some_tripsfull outer join  samples.nyctaxi.tripslimit 20000000"""run_query_via_databricks_sql(create_the_example_table)# Query A ends with an alias of "as a"query_a=f"""with some_trips as (    SELECT       tpep_pickup_datetime,       tpep_dropoff_datetime,       trip_distance,       fare_amount,       pickup_zip,       dropoff_zip    FROM{EXAMPLE_TABLE_NAME}),grouped_trips_pickup as (    SELECT       some_trips.pickup_zip,       count(*) as trip_count    FROM       some_trips    GROUP BY       some_trips.pickup_zip),grouped_trips_dropoff as (    SELECT       some_trips.dropoff_zip,       count(*) as trip_count    FROM       some_trips    GROUP BY       some_trips.dropoff_zip),trips_with_grouped_pickup_and_dropoff as (    SELECT       some_trips.*,       grouped_trips_pickup.trip_count as pickup_trip_count,       grouped_trips_dropoff.trip_count as dropoff_trip_count    FROM       some_trips    LEFT JOIN       grouped_trips_pickup    ON       some_trips.pickup_zip = grouped_trips_pickup.pickup_zip    LEFT JOIN       grouped_trips_dropoff    ON       some_trips.dropoff_zip = grouped_trips_dropoff.dropoff_zip)select    tpep_pickup_datetime,    dropoff_trip_count as afrom    trips_with_grouped_pickup_and_dropoff"""# Query B ends with an alias of "as b"query_b=query_a.replace(" as a"," as b")# Without cloud fetch, correct schema in bothprint("Running query a without cloud fetch")df_a__no_cloud_fetch=run_query_via_databricks_sql(query_a,use_cloud_fetch=False)print(f"Get back a column named 'a':{df_a__no_cloud_fetch.columns.to_list()=}")print("Running query b without cloud fetch")df_b__no_cloud_fetch=run_query_via_databricks_sql(query_b,use_cloud_fetch=False)print(f"Get back a column named 'b':{df_b__no_cloud_fetch.columns.to_list()=}")# With cloud fetch, schema is not updated in query_bprint("Running query a WITH cloud fetch")df_a__with_cloud_fetch=run_query_via_databricks_sql(query_a,use_cloud_fetch=True)print(f"Get back a column named 'a':{df_a__with_cloud_fetch.columns.to_list()=}")print("Running query b WITH cloud fetch")df_b__with_cloud_fetch=run_query_via_databricks_sql(query_b,use_cloud_fetch=True)print(f"DO NOT back a column named 'b', still have column named 'a':{df_b__with_cloud_fetch.columns.to_list()=}")

What I get back is this:

Creating the example tableRunning query a without cloud fetchGet back a column named 'a': df_a__no_cloud_fetch.columns.to_list()=['tpep_pickup_datetime', 'a']Running query b without cloud fetchGet back a column named 'b': df_b__no_cloud_fetch.columns.to_list()=['tpep_pickup_datetime', 'b']Running query a WITH cloud fetchGet back a column named 'a': df_a__with_cloud_fetch.columns.to_list()=['tpep_pickup_datetime', 'a']Running query b WITH cloud fetchDO NOT back a column named 'b', still have column named 'a': df_b__with_cloud_fetch.columns.to_list()=['tpep_pickup_datetime', 'a']

Issues

  1. We don't get the column names back we expect.
  2. Potentially, if two aliases are used and swapped, columns can be mislabelled (haven't checked this case).

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions


    [8]ページ先頭

    ©2009-2025 Movatter.jp