smart_open
smart_open is a Python 3 library for efficient streaming of very large files from/to storages such as S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or local filesystem. It supports transparent, on-the-fly (de-)compression for a variety of different formats.

smart_open is a drop-in replacement for Python's built-in open(): it can do anything open can (100% compatible, falls back to native open wherever possible), plus lots of nifty extra stuff on top.
Python 2.7 is no longer supported. If you need Python 2.7, please use smart_open 1.10.1, the last version to support Python 2.
Working with large remote files, for example using Amazon's boto3 Python library, is a pain. boto3's Object.upload_fileobj() and Object.download_fileobj() methods require gotcha-prone boilerplate to use successfully, such as constructing file-like object wrappers. smart_open shields you from that. It builds on boto3 and other remote storage libraries, but offers a clean, unified Pythonic API. The result is less code for you to write and fewer bugs to make.
smart_open is well-tested, well-documented, and has a simple Pythonic API:
>>> from smart_open import open
>>>
>>> # stream lines from an S3 object
>>> for line in open('s3://commoncrawl/robots.txt'):
...     print(repr(line))
...     break
'User-Agent: *\n'

>>> # stream from/to compressed files, with transparent (de)compression:
>>> for line in open('smart_open/tests/test_data/1984.txt.gz', encoding='utf-8'):
...     print(repr(line))
'It was a bright cold day in April, and the clocks were striking thirteen.\n'
'Winston Smith, his chin nuzzled into his breast in an effort to escape the vile\n'
'wind, slipped quickly through the glass doors of Victory Mansions, though not\n'
'quickly enough to prevent a swirl of gritty dust from entering along with him.\n'

>>> # can use context managers too:
>>> with open('smart_open/tests/test_data/1984.txt.gz') as fin:
...     with open('smart_open/tests/test_data/1984.txt.bz2', 'w') as fout:
...         for line in fin:
...             fout.write(line)
74
80
78
79

>>> # can use any IOBase operations, like seek
>>> with open('s3://commoncrawl/robots.txt', 'rb') as fin:
...     for line in fin:
...         print(repr(line.decode('utf-8')))
...         break
...     offset = fin.seek(0)  # seek to the beginning
...     print(fin.read(4))
'User-Agent: *\n'
b'User'

>>> # stream from HTTP
>>> for line in open('http://example.com/index.html'):
...     print(repr(line))
...     break
'<!doctype html>\n'
Other examples of URLs that smart_open accepts:
s3://my_bucket/my_key
s3://my_key:my_secret@my_bucket/my_key
s3://my_key:my_secret@my_server:my_port@my_bucket/my_key
gs://my_bucket/my_blob
azure://my_bucket/my_blob
hdfs:///path/file
hdfs://path/file
webhdfs://host:port/path/file
./local/path/file
~/local/path/file
local/path/file
./local/path/file.gz
file:///home/user/file
file:///home/user/file.bz2
[ssh|scp|sftp]://username@host//path/file
[ssh|scp|sftp]://username@host/path/file
[ssh|scp|sftp]://username:password@host/path/file
smart_open supports a wide range of storage solutions, including AWS S3, Google Cloud and Azure. Each individual solution has its own dependencies. By default, smart_open does not install any dependencies, in order to keep the installation size small. You can install these dependencies explicitly using:
pip install smart_open[azure]  # Install Azure deps
pip install smart_open[gcs]    # Install GCS deps
pip install smart_open[s3]     # Install S3 deps
Or, if you don't mind installing a large number of third party libraries, you can install all dependencies using:
pip install smart_open[all]
Be warned that this option increases the installation size significantly, e.g. over 100MB.
If you're upgrading from smart_open versions 2.x and below, please check out the Migration Guide.
For detailed API info, see the online help:
help('smart_open')
or click here to view the help in your browser.
For the sake of simplicity, the examples below assume you have all the dependencies installed, i.e. you have done:
pip install smart_open[all]
>>> import os, boto3
>>> from smart_open import open
>>>
>>> # stream content *into* S3 (write mode) using a custom session
>>> session = boto3.Session(
...     aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
...     aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
... )
>>> url = 's3://smart-open-py37-benchmark-results/test.txt'
>>> with open(url, 'wb', transport_params={'client': session.client('s3')}) as fout:
...     bytes_written = fout.write(b'hello world!')
...     print(bytes_written)
12
# stream from HDFS
for line in open('hdfs://user/hadoop/my_file.txt', encoding='utf8'):
    print(line)

# stream from WebHDFS
for line in open('webhdfs://host:port/user/hadoop/my_file.txt'):
    print(line)

# stream content *into* HDFS (write mode):
with open('hdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')

# stream content *into* WebHDFS (write mode):
with open('webhdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')

# stream from a completely custom s3 server, like s3proxy:
for line in open('s3u://user:secret@host:port@mybucket/mykey.txt'):
    print(line)

# Stream to Digital Ocean Spaces bucket providing credentials from boto3 profile
session = boto3.Session(profile_name='digitalocean')
client = session.client('s3', endpoint_url='https://ams3.digitaloceanspaces.com')
transport_params = {'client': client}
with open('s3://bucket/key.txt', 'wb', transport_params=transport_params) as fout:
    fout.write(b'here we stand')

# stream from GCS
for line in open('gs://my_bucket/my_file.txt'):
    print(line)

# stream content *into* GCS (write mode):
with open('gs://my_bucket/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')

# stream from Azure Blob Storage
connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING']
transport_params = {
    'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str),
}
for line in open('azure://mycontainer/myfile.txt', transport_params=transport_params):
    print(line)

# stream content *into* Azure Blob Storage (write mode):
connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING']
transport_params = {
    'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str),
}
with open('azure://mycontainer/my_file.txt', 'wb', transport_params=transport_params) as fout:
    fout.write(b'hello world')
The top-level compression parameter controls compression/decompression behavior when reading and writing. The supported values for this parameter are:
- infer_from_extension (default behavior)
- disable
- .gz
- .bz2
- .zst
By default, smart_open determines the compression algorithm to use based on the file extension.
>>> from smart_open import open, register_compressor
>>> with open('smart_open/tests/test_data/1984.txt.gz') as fin:
...     print(fin.read(32))
It was a bright cold day in Apri
You can override this behavior to either disable compression, or explicitly specify the algorithm to use. To disable compression:
>>> from smart_open import open, register_compressor
>>> with open('smart_open/tests/test_data/1984.txt.gz', 'rb', compression='disable') as fin:
...     print(fin.read(32))
b'\x1f\x8b\x08\x08\x85F\x94\\\x00\x031984.txt\x005\x8f=r\xc3@\x08\x85{\x9d\xe2\x1d@'
To specify the algorithm explicitly (e.g. for non-standard file extensions):
>>> from smart_open import open, register_compressor
>>> with open('smart_open/tests/test_data/1984.txt.gzip', compression='.gz') as fin:
...     print(fin.read(32))
It was a bright cold day in Apri
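The compression parameter applies when writing as well. Here is a minimal sketch (the local output path is hypothetical) of gzip-compressing data on the fly even though the destination has no .gz extension:

from smart_open import open

# write gzip-compressed text to a path without a .gz suffix (hypothetical path)
with open('/tmp/compressed_output', 'w', compression='.gz', encoding='utf-8') as fout:
    fout.write('hello world\n')

# reading it back needs the same explicit hint, since the extension reveals nothing
with open('/tmp/compressed_output', 'r', compression='.gz', encoding='utf-8') as fin:
    print(fin.read())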
You can also easily add support for other file extensions and compression formats. For example, to open xz-compressed files:
>>> import lzma, os
>>> from smart_open import open, register_compressor

>>> def _handle_xz(file_obj, mode):
...     return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)

>>> register_compressor('.xz', _handle_xz)

>>> with open('smart_open/tests/test_data/1984.txt.xz') as fin:
...     print(fin.read(32))
It was a bright cold day in Apri
lzma is in the standard library in Python 3.3 and greater. For 2.7, use backports.lzma.
smart_open supports a wide range of transport options out of the box, including:
- S3
- HTTP, HTTPS (read-only)
- SSH, SCP and SFTP
- WebHDFS
- GCS
- Azure Blob Storage
Each option involves setting up its own set of parameters. For example, for accessing S3, you often need to set up authentication, like API keys or a profile name. smart_open's open function accepts a keyword argument transport_params which accepts additional parameters for the transport layer. Here are some examples of using this parameter:
>>> import boto3
>>> fin = open('s3://commoncrawl/robots.txt', transport_params=dict(client=boto3.client('s3')))
>>> fin = open('s3://commoncrawl/robots.txt', transport_params=dict(buffer_size=1024))
For the full list of keyword arguments supported by each transport option, see the documentation:
help('smart_open.open')
smart_open uses the boto3 library to talk to S3. boto3 has several mechanisms for determining the credentials to use. By default, smart_open will defer to boto3 and let the latter take care of the credentials. There are several ways to override this behavior.
The first is to pass a boto3.Client object as a transport parameter to the open function. You can customize the credentials when constructing the session for the client. smart_open will then use the session when talking to S3.
session = boto3.Session(
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY,
    aws_session_token=SESSION_TOKEN,
)
client = session.client('s3', endpoint_url=..., config=...)
fin = open('s3://bucket/key', transport_params={'client': client})
Your second option is to specify the credentials within the S3 URL itself:
fin = open('s3://aws_access_key_id:aws_secret_access_key@bucket/key', ...)
Important: The two methods above are mutually exclusive. If you pass an AWS client and the URL contains credentials, smart_open will ignore the latter.
Important: smart_open ignores configuration files from the older boto library. Port your old boto settings to boto3 in order to use them with smart_open.
Additional keyword arguments can be propagated to the boto3 methods that are used by smart_open under the hood, using the client_kwargs transport parameter.
For instance, to upload a blob with Metadata, ACL, StorageClass, these keyword arguments can be passed to create_multipart_upload (docs).
kwargs = {'Metadata': {'version': 2}, 'ACL': 'authenticated-read', 'StorageClass': 'STANDARD_IA'}
fout = open('s3://bucket/key', 'wb', transport_params={'client_kwargs': {'S3.Client.create_multipart_upload': kwargs}})
Since going over all (or select) keys in an S3 bucket is a very common operation, there's also an extra function smart_open.s3.iter_bucket() that does this efficiently, processing the bucket keys in parallel (using multiprocessing):
>>> from smart_open import s3
>>> # we use workers=1 for reproducibility; you should use as many workers as you have cores
>>> bucket = 'silo-open-data'
>>> prefix = 'Official/annual/monthly_rain/'
>>> for key, content in s3.iter_bucket(bucket, prefix=prefix, accept_key=lambda key: '/201' in key, workers=1, key_limit=3):
...     print(key, round(len(content) / 2**20))
Official/annual/monthly_rain/2010.monthly_rain.nc 13
Official/annual/monthly_rain/2011.monthly_rain.nc 13
Official/annual/monthly_rain/2012.monthly_rain.nc 13
smart_open uses the google-cloud-storage library to talk to GCS. google-cloud-storage uses the google-cloud package under the hood to handle authentication. There are several options to provide credentials. By default, smart_open will defer to google-cloud-storage and let it take care of the credentials.
To override this behavior, pass a google.cloud.storage.Client object as a transport parameter to the open function. You can customize the credentials when constructing the client. smart_open will then use the client when talking to GCS. To follow along with the example below, refer to Google's guide to setting up GCS authentication with a service account.
import os
from google.cloud.storage import Client

service_account_path = os.environ['GOOGLE_APPLICATION_CREDENTIALS']
client = Client.from_service_account_json(service_account_path)
fin = open('gs://gcp-public-data-landsat/index.csv.gz', transport_params=dict(client=client))
If you need more credential options, you can create an explicit google.auth.credentials.Credentials object and pass it to the Client. To create an API token for use in the example below, refer to the GCS authentication guide.
import os
from google.auth.credentials import Credentials
from google.cloud.storage import Client

token = os.environ['GOOGLE_API_TOKEN']
credentials = Credentials(token=token)
client = Client(credentials=credentials)
fin = open('gs://gcp-public-data-landsat/index.csv.gz', transport_params={'client': client})
Additional keyword arguments can be propagated to the GCS open method (docs), which is used by smart_open under the hood, using the blob_open_kwargs transport parameter.
Additionally, keyword arguments can be propagated to the GCS get_blob method (docs) when in read mode, using the get_blob_kwargs transport parameter.
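For example, a minimal sketch (the bucket, key and generation number are hypothetical) that pins a read to a specific object generation:

# hypothetical example: ask get_blob for a specific object generation
transport_params = {'get_blob_kwargs': {'generation': 1234567890}}
fin = open('gs://my_bucket/my_file.txt', 'rb', transport_params=transport_params)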
Additional blob properties (docs) can be set before an upload, as long as they are not read-only, using the blob_properties transport parameter.
open_kwargs = {'predefined_acl': 'authenticated-read'}
properties = {'metadata': {'version': 2}, 'storage_class': 'COLDLINE'}
fout = open('gs://bucket/key', 'wb', transport_params={'blob_open_kwargs': open_kwargs, 'blob_properties': properties})
smart_open uses the azure-storage-blob library to talk to Azure Blob Storage. By default, smart_open will defer to azure-storage-blob and let it take care of the credentials.
Azure Blob Storage does not have any way of inferring credentials; therefore, passing an azure.storage.blob.BlobServiceClient object as a transport parameter to the open function is required. You can customize the credentials when constructing the client. smart_open will then use the client when talking to Azure Blob Storage. To follow along with the example below, refer to Azure's guide to setting up authentication.
import os
from azure.storage.blob import BlobServiceClient

azure_storage_connection_string = os.environ['AZURE_STORAGE_CONNECTION_STRING']
client = BlobServiceClient.from_connection_string(azure_storage_connection_string)
fin = open('azure://my_container/my_blob.txt', transport_params={'client': client})
If you need more credential options, refer to the Azure Storage authentication guide.
Additional keyword arguments can be propagated to the commit_block_list method (docs), which is used by smart_open under the hood for uploads, using the blob_kwargs transport parameter.
kwargs = {'metadata': {'version': 2}}
fout = open('azure://container/key', 'wb', transport_params={'blob_kwargs': kwargs})
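Anything that commit_block_list accepts can be passed the same way. For instance, a minimal sketch (the container, key and content type are assumptions here) that sets the uploaded blob's content type via azure-storage-blob's ContentSettings:

from azure.storage.blob import ContentSettings

# hypothetical example: set the blob's Content-Type on upload via commit_block_list kwargs
kwargs = {'content_settings': ContentSettings(content_type='text/plain')}
fout = open('azure://container/key', 'wb', transport_params={'blob_kwargs': kwargs})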
smart_open.open can also be used with Path objects. The built-in Path.open() is not able to read text from compressed files, so use patch_pathlib to replace it with smart_open.open() instead. This can be helpful when e.g. working with compressed files.
>>> from pathlib import Path
>>> from smart_open.smart_open_lib import patch_pathlib
>>>
>>> _ = patch_pathlib()  # replace `Path.open` with `smart_open.open`
>>>
>>> path = Path("smart_open/tests/test_data/crime-and-punishment.txt.gz")
>>>
>>> with path.open("r") as infile:
...     print(infile.readline()[:41])
В начале июля, в чрезвычайно жаркое время
See this document.

See this document.
smart_open comes with a comprehensive suite of unit tests. Before you can run the test suite, install the test dependencies:
pip install -e .[test]
Now, you can run the unit tests:
pytest smart_open
The tests are also run automatically with Travis CI on every commit push & pull request.
smart_open lives on Github. You can file issues or pull requests there. Suggestions, pull requests and improvements welcome!
smart_open is open source software released under the MIT license. Copyright (c) 2015-now Radim Řehůřek.