Python DB API 2.0 client for Impala and Hive (HiveServer2 protocol)
cloudera/impyla
Python client for HiveServer2 implementations (e.g., Impala, Hive) for distributed query engines.
For higher-level Impala functionality, including a Pandas-like interface over distributed data sets, see the Ibis project.
HiveServer2 compliant; works with Impala and Hive, including nested data
Fully DB API 2.0 (PEP 249)-compliant Python client (similar to sqlite or MySQL clients) supporting Python 2.6+ and Python 3.3+.
Works with Kerberos, LDAP, SSL
SQLAlchemy connector
Converter to pandas DataFrame, allowing easy integration into the Python data stack (including scikit-learn and matplotlib); but see the Ibis project for a richer experience
Required:
Python 2.7+ or 3.5+
six, bitarray
thrift==0.16.0
thrift_sasl==0.4.3
Optional:
kerberos>=1.3.0 for Kerberos over HTTP support. This also requires Kerberos libraries to be installed on your system - see System Kerberos
pandas for conversion to DataFrame objects; but see the Ibis project instead
sqlalchemy for the SQLAlchemy engine
pytest and requests for running tests; unittest2 for testing on Python 2.6
Different systems require different packages to be installed to enable Kerberos support in Impyla. Some examples of how to install the packages on different distributions follow.
Ubuntu:
apt-get install libkrb5-dev krb5-user
RHEL/CentOS:
yum install krb5-libs krb5-devel krb5-server krb5-workstation
Install the latest release with pip:
pip install impyla
For the latest (dev) version, install directly from the repo:
pip install git+https://github.com/cloudera/impyla.git
or clone the repo:
git clone https://github.com/cloudera/impyla.git
cd impyla
python setup.py install
impyla uses the pytest toolchain, and depends on the following environment variables:
export IMPYLA_TEST_HOST=your.impalad.com
export IMPYLA_TEST_PORT=21050
export IMPYLA_TEST_AUTH_MECH=NOSASL
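The variables above could be collected in Python roughly as follows. This is only a sketch: the helper name `read_test_settings` and the fallback values are assumptions made for illustration, and impyla's own test configuration may read these variables differently.

```python
import os


def read_test_settings(env=None):
    """Collect the IMPYLA_TEST_* settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    return {
        # The fallback values below are assumptions, not documented defaults.
        "host": env.get("IMPYLA_TEST_HOST", "localhost"),
        "port": int(env.get("IMPYLA_TEST_PORT", "21050")),
        "auth_mechanism": env.get("IMPYLA_TEST_AUTH_MECH", "NOSASL"),
    }


# Pass an explicit mapping so the sketch does not depend on the real environment.
settings = read_test_settings({
    "IMPYLA_TEST_HOST": "your.impalad.com",
    "IMPYLA_TEST_PORT": "21050",
    "IMPYLA_TEST_AUTH_MECH": "NOSASL",
})
print(settings)
```

Converting the port to `int` early means a malformed value fails at configuration time instead of during the first connection attempt.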
To run the maximal set of tests, run
cd path/to/impyla
py.test --connect impala
Leave out the --connect option to skip tests for DB API compliance.
To test impyla with different Python versions, tox can be used. The commands below will run all impyla tests with all supported and installed Python versions:
cd path/to/impyla
tox
To filter environments / tests, use -e and pytest arguments after --:
tox -e py310 -- -ktest_utf8_strings
Impyla implements the Python DB API v2.0 (PEP 249) database interface (refer to it for API details):
from impala.dbapi import connect
conn = connect(host='my.host.com', port=21050)  # auth_mechanism='PLAIN' for unsecured Hive connection, see function doc
cursor = conn.cursor()
cursor.execute('SELECT * FROM mytable LIMIT 100')
print(cursor.description)  # prints the result set's schema
results = cursor.fetchall()
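Because the interface follows PEP 249, the same call sequence can be tried offline against another DB API driver. The sketch below uses the standard-library `sqlite3` module purely as a stand-in; the helper `run_query` and the table contents are made up for illustration, and with impyla the connection would come from `impala.dbapi.connect` instead.

```python
import sqlite3  # stand-in PEP 249 driver so the sketch runs without a cluster


def run_query(conn, sql):
    """Execute sql and return (column_names, rows) using only PEP 249 calls."""
    cursor = conn.cursor()
    try:
        cursor.execute(sql)
        # description is a sequence of 7-item tuples; item 0 is the column name.
        names = [col[0] for col in cursor.description]
        rows = cursor.fetchall()
    finally:
        cursor.close()
    return names, rows


conn = sqlite3.connect(":memory:")
setup = conn.cursor()
setup.execute("CREATE TABLE mytable (id INTEGER, name TEXT)")
setup.executemany("INSERT INTO mytable VALUES (?, ?)", [(1, "a"), (2, "b")])

names, rows = run_query(conn, "SELECT * FROM mytable ORDER BY id LIMIT 100")
print(names, rows)
conn.close()
```

Closing the cursor in a `finally` block releases its resources even when the query fails.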
The Cursor object also exposes the iterator interface, which is buffered (controlled by cursor.arraysize):
cursor.execute('SELECT * FROM mytable LIMIT 100')
for row in cursor:
    print(row)
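When rows are better handled in explicit chunks, the PEP 249 `fetchmany` call can be wrapped in a small generator. This is a sketch of batching done by the caller, not a description of how impyla buffers internally; `iter_batches` is a hypothetical helper and `sqlite3` again stands in for a live connection.

```python
import sqlite3  # stand-in PEP 249 driver so the sketch runs without a cluster


def iter_batches(cursor, batch_size=None):
    """Yield lists of rows until the result set is exhausted."""
    size = batch_size or cursor.arraysize  # fall back to the cursor's own setting
    while True:
        batch = cursor.fetchmany(size)
        if not batch:  # an empty sequence means there are no more rows
            break
        yield batch


conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (n INTEGER)")
cur.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(5)])

cur.execute("SELECT n FROM t ORDER BY n")
batch_sizes = [len(batch) for batch in iter_batches(cur, batch_size=2)]
print(batch_sizes)
conn.close()
```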
Furthermore, the Cursor object provides information about the columns returned by the query. This is useful for exporting your data as a CSV file:
import csv

cursor.execute('SELECT * FROM mytable LIMIT 100')
columns = [datum[0] for datum in cursor.description]
targetfile = '/tmp/foo.csv'

with open(targetfile, 'w', newline='') as outcsv:
    writer = csv.writer(outcsv, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL, lineterminator='\n')
    writer.writerow(columns)
    for row in cursor:
        writer.writerow(row)
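The export can also be packaged as a reusable function that writes to any file-like object, which makes it easy to check without touching the filesystem. The name `cursor_to_csv` is hypothetical, and `sqlite3` is used only as a stand-in for an impyla cursor that has already executed a query.

```python
import csv
import io
import sqlite3  # stand-in PEP 249 driver so the sketch runs without a cluster


def cursor_to_csv(cursor, outfile):
    """Write a header row plus all remaining rows; return the number of data rows."""
    writer = csv.writer(outfile, delimiter=",", quotechar='"',
                        quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow([col[0] for col in cursor.description])
    count = 0
    for row in cursor:  # relies on the cursor's iterator interface
        writer.writerow(row)
        count += 1
    return count


conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE mytable (id INTEGER, name TEXT)")
cur.executemany("INSERT INTO mytable VALUES (?, ?)", [(1, "a"), (2, "b,c")])
cur.execute("SELECT * FROM mytable ORDER BY id LIMIT 100")

buffer = io.StringIO()
written = cursor_to_csv(cur, buffer)
print(buffer.getvalue())
conn.close()
```

With `csv.QUOTE_ALL`, every field is quoted, so values that contain the delimiter (such as `b,c` here) stay in a single column.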
You can also get back a pandas DataFrame object:
from impala.util import as_pandas
df = as_pandas(cursor)
# carry df through scikit-learn, for example
You need to first sign and return an ICLA and CCLA before we can accept and redistribute your contribution. Once these are submitted you are free to start contributing to impyla. Submit these to CLA@cloudera.com.
We use GitHub issues to track bugs for this project. Find an issue that you would like to work on (or file one if you have discovered a new issue!). If no one is working on it, assign it to yourself only if you intend to work on it shortly.
It's a good idea to discuss your intended approach on the issue. You are much more likely to have your patch reviewed and committed if you've already got buy-in from the impyla community before you start.
Now start coding! As you are writing your patch, please keep the following things in mind:
First, please include tests with your patch. If your patch adds a feature or fixes a bug and does not include tests, it will generally not be accepted. If you are unsure how to write tests for a particular component, please ask on the issue for guidance.
Second, please keep your patch narrowly targeted to the problem described by the issue. It's better for everyone if we maintain discipline about the scope of each patch. In general, if you find a bug while working on a specific feature, file an issue for the bug, check if you can assign it to yourself, and fix it independently of the feature. This helps us to differentiate between bug fixes and features and allows us to build stable maintenance releases.
Finally, please write a good, clear commit message, with a short, descriptive title and a message that is exactly long enough to explain what the problem was and how it was fixed.
Please create a pull request on GitHub with your patch.