Fix BUG: read_sql tries to convert blob/varbinary to string with pyarrow backend #60105
The change to `convert` (hunk `@@ -970,7 +970,17 @@ def convert(arr):`):

```python
if dtype_backend != "numpy" and arr.dtype == np.dtype("O"):
    new_dtype = StringDtype()
    arr_cls = new_dtype.construct_array_type()
    try:
        # Addressing (#59242)
        # Byte data that could not be decoded into
        # a string would throw a UnicodeDecodeError exception.
        # Try and greedily convert to string;
        # will fail if the object is bytes.
        arr = arr_cls._from_sequence(arr, dtype=new_dtype)
    except UnicodeDecodeError:
        pass
elif dtype_backend != "numpy" and isinstance(arr, np.ndarray):
    if arr.dtype.kind in "iufb":
        arr = pd_array(arr, copy=False)
```

Review discussion on this hunk:

- Reviewer: I think this is the wrong place to be doing this; in the sql.py module, can't we read in the type of the database and only try to convert BINARY types to Arrow binary types?
- Reviewer: Based on some local testing using the ADBC driver I can confirm that it yields a […]. Wondering if it makes sense to remove the code here trying to convert based on a […].
- Reviewer: I think the general problem is that pandas does not have a first-class "binary" data type, so I'm not sure how to solve this for anything but the pyarrow backend. With the pyarrow backend, I think you can still move this logic to […]. Not sure if @mroeschke has other thoughts on the general issue. This is likely another good use case to track in PDEP-13 (#58455).
- Reviewer: I agree that this is the incorrect place to handle this conversion logic, and this should only be a valid conversion for the pyarrow backend (…).
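The first suggestion above, keying the conversion off the declared database column type rather than the Python objects, can be pictured with a small standalone sketch. This is not pandas' actual sql.py code; the helper name `arrow_dtype_for_sql_type` and the exact set of SQLAlchemy binary types checked are assumptions made for illustration.

```python
import pandas as pd
import pyarrow as pa
import sqlalchemy as sa


def arrow_dtype_for_sql_type(sql_type):
    # Hypothetical helper: only binary-flavoured SQL column types map to an
    # Arrow binary dtype; everything else is left to the usual
    # dtype_backend handling.
    if isinstance(sql_type, (sa.LargeBinary, sa.BINARY, sa.VARBINARY)):
        return pd.ArrowDtype(pa.binary())
    return None


print(arrow_dtype_for_sql_type(sa.LargeBinary()))  # binary[pyarrow]
print(arrow_dtype_for_sql_type(sa.String()))       # None
```

Mapping from the declared column type avoids guessing from object contents at read time, which is what the review thread argues for.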
On the `arr = arr_cls._from_sequence(arr, dtype=new_dtype)` line:

- Reviewer: Shouldn't this ideally return […]?
- Author: That makes sense, thanks! It looks like the previous logic was not taking pyarrow types into account when doing this conversion, so I've added logic similar to my initial change, where we try to convert to a pyarrow string but fall back to binary if we run into an invalid error (i.e. we tried to parse but it failed due to an encoding error). Please let me know what you think! I was also considering trying to type-check the contents of […].
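The fallback behaviour the author describes can be sketched in isolation. This is an illustration of the idea for the pyarrow backend, not the PR's actual implementation; the helper name and the exact exceptions caught are assumptions.

```python
import pandas as pd
import pyarrow as pa


def to_arrow_string_or_binary(values):
    # Hypothetical helper: greedily try an Arrow string array, then fall back
    # to binary when the payload cannot be treated as valid UTF-8 text
    # (e.g. raw BLOB/VARBINARY bytes).
    try:
        return pd.arrays.ArrowExtensionArray(pa.array(values, type=pa.string()))
    except (pa.ArrowInvalid, pa.ArrowTypeError, UnicodeDecodeError):
        return pd.arrays.ArrowExtensionArray(pa.array(values, type=pa.binary()))


print(to_arrow_string_or_binary(["hello", "world"]).dtype)            # string[pyarrow]
print(to_arrow_string_or_binary([b"\x01#Eg\x89\xab\xcd\xef"]).dtype)  # binary[pyarrow]
```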
The new test in the SQL test suite (hunk `@@ -4352,3 +4352,17 @@ def test_xsqlite_if_exists(sqlite_buildin):`):

```python
        (5, "E"),
    ]
    drop_table(table_name, sqlite_buildin)


def test_bytes_column(sqlite_buildin):
    """
    Regression test for (#59242)
    Bytes being returned in a column that could not be converted
    to a string would raise a UnicodeDecodeError
    when using dtype_backend='pyarrow'
    """
    query = """
    select cast(x'0123456789abcdef0123456789abcdef' as blob) a
    """
    df = pd.read_sql(query, sqlite_buildin, dtype_backend="pyarrow")
    assert df.a.values[0] == b"\x01#Eg\x89\xab\xcd\xef\x01#Eg\x89\xab\xcd\xef"
```

Review discussion on this test:

- Reviewer, on the docstring: This is well intentioned, but can you remove the docstring? We don't use them in tests. Instead, you can just add a comment pointing to the GitHub issue number in the function body.
- Reviewer, on the assertion: Can you use our built-in test helpers instead? I think you can just do:

  ```python
  result = pd.read_sql(...)
  expected = pd.DataFrame({"a": ...}, dtype=pd.ArrowDtype(pa.binary()))
  tm.assert_frame_equal(result, expected)
  ```

  What data type does this produce currently with the […]?
- Author: For sure, changed the testing logic over to using this! For […].
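Rewritten along the lines suggested above, the test might look roughly like the following sketch. This is not the PR's final version; the expected `pd.ArrowDtype(pa.binary())` dtype is an assumption taken from the reviewer's snippet, and the actual dtype returned by `read_sql` could differ.

```python
import pyarrow as pa

import pandas as pd
import pandas._testing as tm


def test_bytes_column(sqlite_buildin):
    # GH#59242: reading a BLOB column with dtype_backend="pyarrow" used to
    # raise UnicodeDecodeError while coercing the bytes to string.
    query = """
    select cast(x'0123456789abcdef0123456789abcdef' as blob) a
    """
    result = pd.read_sql(query, sqlite_buildin, dtype_backend="pyarrow")
    expected = pd.DataFrame(
        {"a": [b"\x01#Eg\x89\xab\xcd\xef\x01#Eg\x89\xab\xcd\xef"]},
        dtype=pd.ArrowDtype(pa.binary()),
    )
    tm.assert_frame_equal(result, expected)
```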