smart_open
smart_open is a Python 3 library for efficient streaming of very large files from/to storages such as S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or local filesystem. It supports transparent, on-the-fly (de-)compression for a variety of different formats.

smart_open is a drop-in replacement for Python's built-in open(): it can do anything open can (100% compatible, falls back to native open wherever possible), plus lots of nifty extra stuff on top.
Python 2.7 is no longer supported. If you need Python 2.7, please use smart_open 1.10.1, the last version to support Python 2.
Working with large remote files, for example using Amazon's boto3 Python library, is a pain. boto3's Object.upload_fileobj() and Object.download_fileobj() methods require gotcha-prone boilerplate to use successfully, such as constructing file-like object wrappers. smart_open shields you from that. It builds on boto3 and other remote storage libraries, but offers a clean, unified Pythonic API. The result is less code for you to write and fewer bugs to make.
smart_open is well-tested, well-documented, and has a simple Pythonic API:
>>> from smart_open import open
>>>
>>> # stream lines from an S3 object
>>> for line in open('s3://commoncrawl/robots.txt'):
...     print(repr(line))
...     break
'User-Agent: *\n'

>>> # stream from/to compressed files, with transparent (de)compression:
>>> for line in open('smart_open/tests/test_data/1984.txt.gz', encoding='utf-8'):
...     print(repr(line))
'It was a bright cold day in April, and the clocks were striking thirteen.\n'
'Winston Smith, his chin nuzzled into his breast in an effort to escape the vile\n'
'wind, slipped quickly through the glass doors of Victory Mansions, though not\n'
'quickly enough to prevent a swirl of gritty dust from entering along with him.\n'

>>> # can use context managers too:
>>> with open('smart_open/tests/test_data/1984.txt.gz') as fin:
...     with open('smart_open/tests/test_data/1984.txt.bz2', 'w') as fout:
...         for line in fin:
...             fout.write(line)
74
80
78
79

>>> # can use any IOBase operations, like seek
>>> with open('s3://commoncrawl/robots.txt', 'rb') as fin:
...     for line in fin:
...         print(repr(line.decode('utf-8')))
...         break
...     offset = fin.seek(0)  # seek to the beginning
...     print(fin.read(4))
'User-Agent: *\n'
b'User'

>>> # stream from HTTP
>>> for line in open('http://example.com/index.html'):
...     print(repr(line))
...     break
'<!doctype html>\n'
Other examples of URLs that smart_open accepts:
s3://my_bucket/my_key
s3://my_key:my_secret@my_bucket/my_key
s3://my_key:my_secret@my_server:my_port@my_bucket/my_key
gs://my_bucket/my_blob
azure://my_bucket/my_blob
hdfs:///path/file
hdfs://path/file
webhdfs://host:port/path/file
./local/path/file
~/local/path/file
local/path/file
./local/path/file.gz
file:///home/user/file
file:///home/user/file.bz2
[ssh|scp|sftp]://username@host//path/file
[ssh|scp|sftp]://username@host/path/file
[ssh|scp|sftp]://username:password@host/path/file
smart_open supports a wide range of storage solutions, including AWS S3, Google Cloud and Azure. Each individual solution has its own dependencies. By default, smart_open does not install any dependencies, in order to keep the installation size small. You can install these dependencies explicitly using:
pip install smart_open[azure]  # Install Azure deps
pip install smart_open[gcs]    # Install GCS deps
pip install smart_open[s3]     # Install S3 deps
Or, if you don't mind installing a large number of third party libraries, you can install all dependencies using:
pip install smart_open[all]
Be warned that this option increases the installation size significantly, e.g. over 100MB.
If you're upgrading from smart_open versions 2.x and below, please check out the Migration Guide.
For detailed API info, see the online help:
help('smart_open')
or click here to view the help in your browser.
For the sake of simplicity, the examples below assume you have all the dependencies installed, i.e. you have done:
pip install smart_open[all]
>>> import os, boto3
>>> from smart_open import open
>>>
>>> # stream content *into* S3 (write mode) using a custom session
>>> session = boto3.Session(
...     aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
...     aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
... )
>>> url = 's3://smart-open-py37-benchmark-results/test.txt'
>>> with open(url, 'wb', transport_params={'client': session.client('s3')}) as fout:
...     bytes_written = fout.write(b'hello world!')
...     print(bytes_written)
12
# stream from HDFS
for line in open('hdfs://user/hadoop/my_file.txt', encoding='utf8'):
    print(line)

# stream from WebHDFS
for line in open('webhdfs://host:port/user/hadoop/my_file.txt'):
    print(line)

# stream content *into* HDFS (write mode):
with open('hdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')

# stream content *into* WebHDFS (write mode):
with open('webhdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')

# stream from a completely custom s3 server, like s3proxy:
for line in open('s3u://user:secret@host:port@mybucket/mykey.txt'):
    print(line)

# Stream to Digital Ocean Spaces bucket providing credentials from boto3 profile
session = boto3.Session(profile_name='digitalocean')
client = session.client('s3', endpoint_url='https://ams3.digitaloceanspaces.com')
transport_params = {'client': client}
with open('s3://bucket/key.txt', 'wb', transport_params=transport_params) as fout:
    fout.write(b'here we stand')

# stream from GCS
for line in open('gs://my_bucket/my_file.txt'):
    print(line)

# stream content *into* GCS (write mode):
with open('gs://my_bucket/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')

# stream from Azure Blob Storage
connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING']
transport_params = {
    'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str),
}
for line in open('azure://mycontainer/myfile.txt', transport_params=transport_params):
    print(line)

# stream content *into* Azure Blob Storage (write mode):
connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING']
transport_params = {
    'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str),
}
with open('azure://mycontainer/my_file.txt', 'wb', transport_params=transport_params) as fout:
    fout.write(b'hello world')
The top-level compression parameter controls compression/decompression behavior when reading and writing. The supported values for this parameter are:
- infer_from_extension (default behavior)
- disable
- .gz
- .bz2
- .zst
By default, smart_open determines the compression algorithm to use based on the file extension.
>>> from smart_open import open, register_compressor
>>> with open('smart_open/tests/test_data/1984.txt.gz') as fin:
...     print(fin.read(32))
It was a bright cold day in Apri
You can override this behavior to either disable compression, or explicitly specify the algorithm to use. To disable compression:
>>> from smart_open import open, register_compressor
>>> with open('smart_open/tests/test_data/1984.txt.gz', 'rb', compression='disable') as fin:
...     print(fin.read(32))
b'\x1f\x8b\x08\x08\x85F\x94\\\x00\x031984.txt\x005\x8f=r\xc3@\x08\x85{\x9d\xe2\x1d@'
To specify the algorithm explicitly (e.g. for non-standard file extensions):
>>> from smart_open import open, register_compressor
>>> with open('smart_open/tests/test_data/1984.txt.gzip', compression='.gz') as fin:
...     print(fin.read(32))
It was a bright cold day in Apri
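The compression parameter applies when writing as well. Here is a minimal sketch (the local output path is hypothetical) of gzip-compressing data on the fly even though the destination has no .gz extension:

from smart_open import open

# write gzip-compressed text to a path without a .gz suffix (hypothetical path)
with open('/tmp/compressed_output', 'w', compression='.gz', encoding='utf-8') as fout:
    fout.write('hello world\n')

# reading it back needs the same explicit hint, since the extension reveals nothing
with open('/tmp/compressed_output', 'r', compression='.gz', encoding='utf-8') as fin:
    print(fin.read())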
You can also easily add support for other file extensions and compression formats. For example, to open xz-compressed files:
>>> import lzma, os
>>> from smart_open import open, register_compressor

>>> def _handle_xz(file_obj, mode):
...     return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)

>>> register_compressor('.xz', _handle_xz)

>>> with open('smart_open/tests/test_data/1984.txt.xz') as fin:
...     print(fin.read(32))
It was a bright cold day in Apri
lzma is in the standard library in Python 3.3 and greater. For 2.7, use backports.lzma.
smart_open supports a wide range of transport options out of the box, including:
- S3
- HTTP, HTTPS (read-only)
- SSH, SCP and SFTP
- WebHDFS
- GCS
- Azure Blob Storage
Each option involves setting up its own set of parameters. For example, for accessing S3, you often need to set up authentication, like API keys or a profile name. smart_open's open function accepts a keyword argument transport_params which accepts additional parameters for the transport layer. Here are some examples of using this parameter:
>>> import boto3
>>> fin = open('s3://commoncrawl/robots.txt', transport_params=dict(client=boto3.client('s3')))
>>> fin = open('s3://commoncrawl/robots.txt', transport_params=dict(buffer_size=1024))
For the full list of keyword arguments supported by each transport option, see the documentation:
help('smart_open.open')
smart_open uses the boto3 library to talk to S3. boto3 has several mechanisms for determining the credentials to use. By default, smart_open will defer to boto3 and let the latter take care of the credentials. There are several ways to override this behavior.
The first is to pass a boto3.Client object as a transport parameter to the open function. You can customize the credentials when constructing the session for the client. smart_open will then use the session when talking to S3.
session = boto3.Session(
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY,
    aws_session_token=SESSION_TOKEN,
)
client = session.client('s3', endpoint_url=..., config=...)
fin = open('s3://bucket/key', transport_params={'client': client})
Your second option is to specify the credentials within the S3 URL itself:
fin = open('s3://aws_access_key_id:aws_secret_access_key@bucket/key', ...)
Important: The two methods above are mutually exclusive. If you pass an AWS client and the URL contains credentials, smart_open will ignore the latter.
Important: smart_open ignores configuration files from the older boto library. Port your old boto settings to boto3 in order to use them with smart_open.
Additional keyword arguments can be propagated to the boto3 methods that are used by smart_open under the hood, using the client_kwargs transport parameter.
For instance, to upload a blob with Metadata, ACL, StorageClass, these keyword arguments can be passed to create_multipart_upload (docs).
kwargs = {'Metadata': {'version': 2}, 'ACL': 'authenticated-read', 'StorageClass': 'STANDARD_IA'}
fout = open('s3://bucket/key', 'wb', transport_params={'client_kwargs': {'S3.Client.create_multipart_upload': kwargs}})
Since going over all (or select) keys in an S3 bucket is a very common operation, there's also an extra function smart_open.s3.iter_bucket() that does this efficiently, processing the bucket keys in parallel (using multiprocessing):
>>> from smart_open import s3
>>> # we use workers=1 for reproducibility; you should use as many workers as you have cores
>>> bucket = 'silo-open-data'
>>> prefix = 'Official/annual/monthly_rain/'
>>> for key, content in s3.iter_bucket(bucket, prefix=prefix, accept_key=lambda key: '/201' in key, workers=1, key_limit=3):
...     print(key, round(len(content) / 2**20))
Official/annual/monthly_rain/2010.monthly_rain.nc 13
Official/annual/monthly_rain/2011.monthly_rain.nc 13
Official/annual/monthly_rain/2012.monthly_rain.nc 13
smart_open uses the google-cloud-storage library to talk to GCS. google-cloud-storage uses the google-cloud package under the hood to handle authentication. There are several options to provide credentials. By default, smart_open will defer to google-cloud-storage and let it take care of the credentials.
To override this behavior, pass a google.cloud.storage.Client object as a transport parameter to the open function. You can customize the credentials when constructing the client. smart_open will then use the client when talking to GCS. To follow along with the example below, refer to Google's guide to setting up GCS authentication with a service account.
import os
from google.cloud.storage import Client

service_account_path = os.environ['GOOGLE_APPLICATION_CREDENTIALS']
client = Client.from_service_account_json(service_account_path)
fin = open('gs://gcp-public-data-landsat/index.csv.gz', transport_params=dict(client=client))
If you need more credential options, you can create an explicit google.auth.credentials.Credentials object and pass it to the Client. To create an API token for use in the example below, refer to the GCS authentication guide.
import os
from google.auth.credentials import Credentials
from google.cloud.storage import Client

token = os.environ['GOOGLE_API_TOKEN']
credentials = Credentials(token=token)
client = Client(credentials=credentials)
fin = open('gs://gcp-public-data-landsat/index.csv.gz', transport_params={'client': client})
Additional keyword arguments can be propagated to the GCS open method (docs), which is used by smart_open under the hood, using the blob_open_kwargs transport parameter.
Additionally, keyword arguments can be propagated to the GCS get_blob method (docs) when in read mode, using the get_blob_kwargs transport parameter.
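For example, a minimal sketch (the bucket, key and generation number are hypothetical) that pins a read to a specific object generation:

# hypothetical example: ask get_blob for a specific object generation
transport_params = {'get_blob_kwargs': {'generation': 1234567890}}
fin = open('gs://my_bucket/my_file.txt', 'rb', transport_params=transport_params)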
Additional blob properties (docs) can be set before an upload, as long as they are not read-only, using the blob_properties transport parameter.
open_kwargs = {'predefined_acl': 'authenticated-read'}
properties = {'metadata': {'version': 2}, 'storage_class': 'COLDLINE'}
fout = open('gs://bucket/key', 'wb', transport_params={'blob_open_kwargs': open_kwargs, 'blob_properties': properties})
smart_open uses the azure-storage-blob library to talk to Azure Blob Storage. By default, smart_open will defer to azure-storage-blob and let it take care of the credentials.
Azure Blob Storage does not have any way of inferring credentials; therefore, passing an azure.storage.blob.BlobServiceClient object as a transport parameter to the open function is required. You can customize the credentials when constructing the client. smart_open will then use the client when talking to Azure Blob Storage. To follow along with the example below, refer to Azure's guide to setting up authentication.
import os
from azure.storage.blob import BlobServiceClient

azure_storage_connection_string = os.environ['AZURE_STORAGE_CONNECTION_STRING']
client = BlobServiceClient.from_connection_string(azure_storage_connection_string)
fin = open('azure://my_container/my_blob.txt', transport_params={'client': client})
If you need more credential options, refer to the Azure Storage authentication guide.
Additional keyword arguments can be propagated to the commit_block_list method (docs), which is used by smart_open under the hood for uploads, using the blob_kwargs transport parameter.
kwargs = {'metadata': {'version': 2}}
fout = open('azure://container/key', 'wb', transport_params={'blob_kwargs': kwargs})
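Anything that commit_block_list accepts can be passed the same way. For instance, a minimal sketch (the container, key and content type are assumptions here) that sets the uploaded blob's content type via azure-storage-blob's ContentSettings:

from azure.storage.blob import ContentSettings

# hypothetical example: set the blob's Content-Type on upload via commit_block_list kwargs
kwargs = {'content_settings': ContentSettings(content_type='text/plain')}
fout = open('azure://container/key', 'wb', transport_params={'blob_kwargs': kwargs})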
smart_open.open can also be used with Path objects. The built-in Path.open() is not able to read text from compressed files, so use patch_pathlib to replace it with smart_open.open() instead. This can be helpful when e.g. working with compressed files.
>>> from pathlib import Path
>>> from smart_open.smart_open_lib import patch_pathlib
>>>
>>> _ = patch_pathlib()  # replace `Path.open` with `smart_open.open`
>>>
>>> path = Path("smart_open/tests/test_data/crime-and-punishment.txt.gz")
>>>
>>> with path.open("r") as infile:
...     print(infile.readline()[:41])
В начале июля, в чрезвычайно жаркое время
See this document.

See this document.
smart_open comes with a comprehensive suite of unit tests. Before you can run the test suite, install the test dependencies:
pip install -e .[test]
Now, you can run the unit tests:
pytest smart_open
The tests are also run automatically with Travis CI on every commit push & pull request.
smart_open lives on Github. You can file issues or pull requests there. Suggestions, pull requests and improvements welcome!
smart_open is open source software released under the MIT license. Copyright (c) 2015-now Radim Řehůřek.