NotificationsYou must be signed in to change notification settings
Fork18.7k
Star46k

Fix BUG: read_sql tries to convert blob/varbinary to string with pyarrow backend#60105

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

Closed

kastkeepitjumpinlikekangaroos wants to merge39 commits intopandas-dev:mainfromkastkeepitjumpinlikekangaroos:fix-59242

Closed

Fix BUG: read_sql tries to convert blob/varbinary to string with pyarrow backend#60105

kastkeepitjumpinlikekangaroos wants to merge39 commits intopandas-dev:mainfromkastkeepitjumpinlikekangaroos:fix-59242

Conversation

Copy link

kastkeepitjumpinlikekangaroos commentedOct 25, 2024

closesBUG: read_sql tries to convert blob/varbinary to string with pyarrow backend #59242
Tests added and passed if fixing a bug or adding a new feature
Allcode checks passed.
Addedtype annotations to new arguments/methods/functions.
Added an entry in the latestdoc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

This PR allows for bytes based data to be returned instead of throwing an exception like before

kastkeepitjumpinlikekangaroos added2 commits

October 25, 2024 00:35

Add fix forpandas-dev#59242

2244869

add skip import

bd00fc5

Copy link

Author

kastkeepitjumpinlikekangaroos commentedOct 25, 2024

The most recent change on main seems to also have 17 failing tasks so Ithink these failures are unrelated to the change in my PR but please let me know if that's wrong!

mroeschke requested changes

Oct 30, 2024

View reviewed changes

pandas/core/internals/construction.py Outdated


		# Try and greedily convert to string
		# Will fail if the object is bytes
		arr = arr_cls._from_sequence(arr, dtype=new_dtype)

Copy link

Member

mroeschkeOct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Shouldn't this ideally returnpandas.ArrowDtype(pyarrow.binary()) type?

Copy link

Author

kastkeepitjumpinlikekangaroosNov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

that makes sense thanks! so looks like the previous logic was not taking into account pyarrow types when doing this conversation so I've added logic similar to my initial change where we try to convert to a pyarrow string but then fall back to binary if we run into an invalid error (i.e. we tried to parse but it failed due to an encoding error). Please let me know what you think! Was also considering trying to type check the contents ofarr to see if it has string or bytes data but seems like greedily trying to convert ends up being better performance in most cases (since we might have to search the whole arr to see if one of the elements is a bytes sequence that can't be converted to a string)

mroeschke added IO SQL

to_sql, read_sql, read_sql_query

Arrowpyarrow functionality labels

Oct 30, 2024

kastkeepitjumpinlikekangaroos added2 commits

November 12, 2024 17:40

address comment

6a23f05

merged main

2f261c8

kastkeepitjumpinlikekangaroos requested a review frommroeschke

November 13, 2024 04:15

also fix for dtype_backend=numpy_nullable

8f900b8

kastkeepitjumpinlikekangaroos commented

Nov 13, 2024

View reviewed changes

pandas/tests/io/test_sql.py OutdatedShow resolvedHide resolved

Merge branch 'main' into fix-59242

536b1ed

WillAyd requested changes

Nov 14, 2024

View reviewed changes

Copy link

Member

WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I definitely appreciate what we are trying to do here although I'm wary about trying to infer "non-first class" types at runtime.

Have you tried using an ADBC driver instead? That should be Arrow-native and yield the proper dtypes

pandas/core/internals/construction.py Outdated

		@@ -968,9 +972,31 @@ def convert(arr):
		# i.e. maybe_convert_objects didn't convert
		arr = maybe_infer_to_datetimelike(arr)
		if dtype_backend != "numpy" and arr.dtype == np.dtype("O"):

Copy link

Member

WillAydNov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I think this is the wrong place to be doing this; in the sql.py module can't we read in the type of the database and only try to convert BINARY types to Arrow binary types?

Copy link

Author

kastkeepitjumpinlikekangaroosNov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Based on some local testing using the ADBC driver I can confirm that it yields apandas.core.arrays.arrow.array.ArrowExtensionArray with ._dtype ofpandas.ArrowDtype. When the query returns a bytes type column we get a.type ofbytes, and likewise a.type ofstring is returned for a string type column. Seems like we don't need to do any conversions when using the ADBC driver as you've stated if I'm understanding correctly here!

Wondering if it makes sense to remove the code here trying to convert based on adtype_backend != 'numpy' since this will fix the cause of the exception in the issue? and maybe raise an exception when trying to use apyarrowdtype_backend with theSQLiteDatabase connection type here:https://github.com/pandas-dev/pandas/blob/main/pandas/io/sql.py#L695 ?

Copy link

Member

WillAydNov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I think the general problem is that pandas does not have a first class "binary" data type, so I'm not sure how to solve this for anything but the pyarrow backend.

With the pyarrow backend, I think you can still move this logic tosql.py and check the type of the column coming back from the database. If it is a binary type in the database, using the PyArrow binary type with that backend makes sense.

Not sure if@mroeschke has other thoughts to the general issue. This is likely another good use case to track in PDEP-13#58455

Copy link

Member

mroeschkeNov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I agree that this is the incorrect place to handle this conversion logic and this should only be a valid conversion for the pyarrow backend (ArrowExtensionArray._from_sequence should be able to return a binary type with binary data.)

Copy link

Author

kastkeepitjumpinlikekangaroosNov 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

insql.py it looks like the result of this conversion is being overwritten when thedtype_backend ispyarrow:https://github.com/pandas-dev/pandas/blob/main/pandas/io/sql.py#L181 and the dtype returned by the current logic isArrowDtype(pa.binary()) for the example in the issue, so maybe just removing the conversion logic is all that's needed to resolve this issue? I've removed the block doing the conversion and added a test case showing that the resulting df has a dtype ofArrowDtype(pa.binary()) when thedtype_backend='pyarrow'

kastkeepitjumpinlikekangaroos added3 commits

November 17, 2024 09:14

merge main

8965a1d

fix

a32b4a6

Merge branch 'main' into fix-59242

f412c72

WillAyd requested changes

Nov 20, 2024

View reviewed changes

Copy link

Member

WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Very nice - simplification is always good and I think this implementation looks reasonable :-)

Just need a few tweaks to the test case

pandas/tests/io/test_sql.py Outdated

		select cast(x'0123456789abcdef0123456789abcdef' as blob) a
		"""
		df = pd.read_sql(query, sqlite_buildin, dtype_backend=dtype_backend)
		assert df.a.values[0] == b"\x01#Eg\x89\xab\xcd\xef\x01#Eg\x89\xab\xcd\xef"

Copy link

Member

WillAydNov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Can you use our built-in test helpers instead? I think you can just do:

result=pd.read_sql(...)expected=pd.DataFrame({"a": ...},dtype=pd.ArrowDtype(pa.binary()))tm.assert_frame_equal(result,expected)

What data type does this produce currently with thenumpy_nullable backend - object?

Copy link

Author

kastkeepitjumpinlikekangaroosNov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

for sure, changed the testing logic over to using this! fornumpy_nullable andlib.no_default the dtype returned is an object

kastkeepitjumpinlikekangaroos added2 commits

November 20, 2024 18:48

address comment

a0200d0

Merge branch 'main' into fix-59242

10ca030

WillAyd requested changes

Nov 21, 2024

View reviewed changes

pandas/tests/io/test_sql.py Outdated



		@pytest.mark.parametrize("dtype_backend", ["pyarrow", "numpy_nullable", lib.no_default])
		def test_bytes_column(sqlite_buildin, dtype_backend):

Copy link

Member

WillAydNov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Suggested change

	deftest_bytes_column(sqlite_buildin,dtype_backend):
	deftest_bytes_column(all_connectable,dtype_backend):

Should test this against all databases

Copy link

Author

kastkeepitjumpinlikekangaroosDec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

sure thing! I was trying to pass in the cartesian product of theall_connectable anddtype_backend arrays usingitertools.product to@pytest.mark.parametrize but was running into issues with the connections getting passed. I instead made it so the connectables are being passed in theparametrize and then we loop through the dtypes in the test. Would love to know if there's a better way to do this so we're testing eachdtype_backend/connection combination independently

pandas/tests/io/test_sql.py Outdated

		@pytest.mark.parametrize("dtype_backend", ["pyarrow", "numpy_nullable", lib.no_default])
		def test_bytes_column(sqlite_buildin, dtype_backend):
		pa = pytest.importorskip("pyarrow")
		"""

Copy link

Member

WillAydNov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This is well intentioned but can you remove the docstring? We don't use them in tests.

Instead, you can just add a comment pointing to the github issue number in the function body

kastkeepitjumpinlikekangaroos added4 commits

December 4, 2024 00:13

address comment

ba2e82d

Merge branch 'fix-59242' of github.com:kastkeepitjumpinlikekangaroos/…

8dd45ca

…pandas into fix-59242

remove cast

602194a

fix test

f8ae285

kastkeepitjumpinlikekangaroos commented

Dec 4, 2024

View reviewed changes

pandas/tests/io/test_sql.py OutdatedShow resolvedHide resolved

kastkeepitjumpinlikekangaroos added5 commits

December 4, 2024 17:39

fix logic

02514e0

fix

ef5c2ed

fix

da7817a

also include pa.OpaqueType

de5c604

fix

e025d84

kastkeepitjumpinlikekangaroos added3 commits

December 8, 2024 22:57

Merge branch 'main' into fix-59242

3d83bab

Merge branch 'main' into fix-59242

6ea5785

Merge branch 'main' into fix-59242

88accb3

Copy link

Author

kastkeepitjumpinlikekangaroos commentedDec 24, 2024

trying to figure out why this test is failinghttps://github.com/pandas-dev/pandas/actions/runs/12484030959/job/34840808563?pr=60105 despite passing locally and in the other unit test checks - seems to be due to a library version mismatch between the different unit test CI checks and theFuture infer strings check

Copy link

Member

WillAyd commentedDec 27, 2024

@kastkeepitjumpinlikekangaroos when running that locally you would need to set an environment variable:

PANDAS_FUTURE_INFER_STRING=1 python -m pytest pandas/tests/io/test_sql.py

Can you reproduce the error if you do that?

kastkeepitjumpinlikekangaroos added2 commits

December 27, 2024 15:20

fix failing test

c066ebf

Merge branch 'fix-59242' of github.com:kastkeepitjumpinlikekangaroos/…

2afeed5

…pandas into fix-59242

Copy link

Author

kastkeepitjumpinlikekangaroos commentedDec 27, 2024

@kastkeepitjumpinlikekangaroos when running that locally you would need to set an environment variable:
PANDAS_FUTURE_INFER_STRING=1 python -m pytest pandas/tests/io/test_sql.py
Can you reproduce the error if you do that?

@WillAyd ah yes thank you!! was able to reproduce and fix the failing test locally. Waiting on CI now but should be passing there too. Please let me know if there's another way to be checking for theinfer_string config option other than the way I'm currently doing it (pd.options.future.infer_string).

kastkeepitjumpinlikekangaroos added2 commits

December 27, 2024 16:10

Merge branch 'main' into fix-59242

de286a4

Merge branch 'main' into fix-59242

a99365a

WillAyd requested changes

Dec 30, 2024

View reviewed changes

pandas/tests/io/test_sql.py OutdatedShow resolvedHide resolved

pandas/tests/io/test_sql.py Outdated

		conn = request.getfixturevalue(con)
		pa = pytest.importorskip("pyarrow")

		dtype = "O"

Copy link

Member

WillAydDec 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Hmm why is this defaulting to object out here? That seems like something to avoid for the pyarrow backend.

I think you are doing that in this test anyway, but the structure is confusing. I would just set up the expected data type from the backend up front, so something like:

ifdtype_backend=="pyarrow":exp_dtype= ...else:     ...

Copy link

Author

kastkeepitjumpinlikekangaroosJan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

makes sense, let me know if the modified structure is less confusing!

pandas/tests/io/test_sql.py Outdated

		dtype = (
		pd.ArrowDtype(pa.string())
		if "adbc" not in con
		else pd.ArrowDtype(pa.opaque(pa.binary(), "bit", "PostgreSQL"))

Copy link

Member

WillAydDec 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Why is this an opaque type instead of just binary? That seems like a bug somewhere in this implementation or the driver

Copy link

Author

kastkeepitjumpinlikekangaroosJan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

this is coming fromadbc-driver-postgresql, I guess they've chosenopaque because this is a bit of a weird type in postgres.https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-BIT-STRINGS vshttps://www.postgresql.org/docs/current/datatype-binary.html

postgres=# select x'0011';     ?column?     ------------------ 0000000000010001(1 row)postgres=# select pg_typeof(x'0011'); pg_typeof ----------- bit(1 row)postgres=# select cast('0011' as bytea);   bytea    ------------ \x30303131(1 row)postgres=# select pg_typeof(cast('0011' as bytea)); pg_typeof ----------- bytea(1 row)

kastkeepitjumpinlikekangaroos added4 commits

January 20, 2025 21:54

addressing comments

a8f69e5

Merge branch 'fix-59242' of github.com:kastkeepitjumpinlikekangaroos/…

2f8ac63

…pandas into fix-59242

fix

1a0b2cf

Merge branch 'main' into fix-59242

1da6ee4

WillAyd requested changes

Jan 21, 2025

View reviewed changes

pandas/core/dtypes/dtypes.py

		@@ -2261,7 +2261,9 @@ def type(self):
		elif pa.types.is_null(pa_type):
		# TODO: None? pd.NA? pa.null?
		return type(pa_type)
		elif isinstance(pa_type, pa.ExtensionType):
		elif isinstance(pa_type, pa.ExtensionType) or isinstance(
		pa_type, pa.OpaqueType

Copy link

Member

WillAydJan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Hmm AFAIK the OpaqueType is returned when there is no suitable database type, but isn't binary already returned directly? It is a bit unclear what this additionalisinstance check is for

Copy link

Author

kastkeepitjumpinlikekangaroosJan 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I had to add this becauseadbc-driver-postgresql is returning anOpaqueType - without this included an exception is thrown

pandas/tests/io/test_sql.py

		val = b"\x01#Eg\x89\xab\xcd\xef\x01#Eg\x89\xab\xcd\xef"
		if "postgres" in con:
		if "adbc" in con:
		val = b"\x00\x00\x00\x80\x01#Eg\x89\xab\xcd\xef\x01#Eg\x89\xab\xcd\xef"

Copy link

Member

WillAydJan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Why does postgres have different content than the other databases?

Copy link

Author

kastkeepitjumpinlikekangaroosJan 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

thex'...' syntax in sqlite and mysql represents the same data (called a blob literal in sqlite ]) but in postgres this is slighly different in that it's not abytea type , but instead abit. Most of the drivers return this as a string representation of the bits (which is whyval = "0000000100100011010001..." in one of the cases)

Copy link

Member

WillAydFeb 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Instead of selecting a literal value can you write and read back the dataframe? I think this PR is missing something about how the different drivers are going to return binary data.

The ADBC driver has maps binary data to the BYTEA type (see alsohttps://github.com/apache/arrow-adbc/blob/c6e529980f1369202adb101f2e1ea03c4e12dcdc/c/driver/postgresql/postgres_type.h#L599) so we should be round tripping that, not bits

pandas/tests/io/test_sql.py Outdated

		conn = request.getfixturevalue(con)
		pa = pytest.importorskip("pyarrow")

		val = b"\x01#Eg\x89\xab\xcd\xef\x01#Eg\x89\xab\xcd\xef"

Copy link

Member

WillAydJan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

What does#Eg represent?

Copy link

Author

kastkeepitjumpinlikekangaroosJan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

this is the raw bytes returned from the hex string"0123456789abcdef0123456789abcdef" - I've changed the test to usebytes.fromhex("0123456789abcdef0123456789abcdef") to make things more clear here

pandas/tests/io/test_sql.py

		if "adbc" in con:
		val = b"\x00\x00\x00\x80\x01#Eg\x89\xab\xcd\xef\x01#Eg\x89\xab\xcd\xef"
		else:
		val = (

Copy link

Member

WillAydJan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Is this supposed to be the binary representation of the bytes? I'm a bit confused why this test has so many different expected values

Copy link

Author

kastkeepitjumpinlikekangaroosJan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

yeah exactly and this is due to postgres returning thebit type which most of the drivers turn into this string

kastkeepitjumpinlikekangaroos added8 commits

January 28, 2025 19:07

clarify

4c1c282

Merge branch 'main' into fix-59242

fe4bdba

Merge branch 'main' into fix-59242

71d94e5

Merge branch 'main' into fix-59242

8813f77

Merge branch 'main' into fix-59242

30ce3b3

Merge branch 'main' into fix-59242

9721fc3

Merge branch 'main' into fix-59242

bcb7e3d

Merge branch 'main' into fix-59242

fe52a7a

WillAyd requested changes

Feb 10, 2025

View reviewed changes

pandas/tests/io/test_sql.py

		val = b"\x01#Eg\x89\xab\xcd\xef\x01#Eg\x89\xab\xcd\xef"
		if "postgres" in con:
		if "adbc" in con:
		val = b"\x00\x00\x00\x80\x01#Eg\x89\xab\xcd\xef\x01#Eg\x89\xab\xcd\xef"

Copy link

Member

WillAydFeb 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Instead of selecting a literal value can you write and read back the dataframe? I think this PR is missing something about how the different drivers are going to return binary data.

Copy link

Contributor

github-actionsbot commentedMay 22, 2025

This pull request is stale because it has been open for thirty days with no activity. Pleaseupdate and respond to this comment if you're still interested in working on this.

github-actionsbot added the Stale label

May 22, 2025

Copy link

Member

mroeschke commentedMay 27, 2025

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

mroeschke closed this

May 27, 2025

Labels

Arrow

pyarrow functionality

IO SQL

to_sql, read_sql, read_sql_query

Stale

3 participants

Movatterモバイル変換

Uh oh!

Fix BUG: read_sql tries to convert blob/varbinary to string with pyarrow backend#60105

Fix BUG: read_sql tries to convert blob/varbinary to string with pyarrow backend#60105

Uh oh!

Conversation

kastkeepitjumpinlikekangaroos commentedOct 25, 2024

Uh oh!

kastkeepitjumpinlikekangaroos commentedOct 25, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

WillAyd left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

WillAyd left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kastkeepitjumpinlikekangaroos commentedDec 24, 2024

Uh oh!

WillAyd commentedDec 27, 2024

Uh oh!

kastkeepitjumpinlikekangaroos commentedDec 27, 2024

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

github-actionsbot commentedMay 22, 2025

Uh oh!

mroeschke commentedMay 27, 2025