piskvorky/smart_open

Utils for streaming large files (S3, HDFS, gzip, bz2...)

What?

smart_open is a Python 3 library for efficient streaming of very large files from/to storages such as S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or local filesystem. It supports transparent, on-the-fly (de-)compression for a variety of different formats.

smart_open is a drop-in replacement for Python's built-in open(): it can do anything open can (100% compatible, falls back to native open wherever possible), plus lots of nifty extra stuff on top.
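
For plain local files, smart_open.open behaves just like the built-in open. A minimal sketch of the drop-in behavior (the file name below is only a placeholder, not from the original README):

from smart_open import open

# Local, uncompressed files are handled just like the built-in open()
# ('example.txt' is a placeholder path).
with open('example.txt', 'w', encoding='utf-8') as fout:
    fout.write('hello world\n')

with open('example.txt', encoding='utf-8') as fin:
    print(fin.read())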

Python 2.7 is no longer supported. If you need Python 2.7, please use smart_open 1.10.1, the last version to support Python 2.
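
If you do need that legacy release, pinning it with pip looks like this (standard pip syntax, not taken from the original text):

pip install "smart_open==1.10.1"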

Why?

Working with large remote files, for example using Amazon's boto3 Python library, is a pain. boto3's Object.upload_fileobj() and Object.download_fileobj() methods require gotcha-prone boilerplate to use successfully, such as constructing file-like object wrappers. smart_open shields you from that. It builds on boto3 and other remote storage libraries, but offers a clean, unified, Pythonic API. The result is less code for you to write and fewer bugs to make.

How?

smart_open is well-tested, well-documented, and has a simple Pythonic API:

>>> from smart_open import open
>>>
>>> # stream lines from an S3 object
>>> for line in open('s3://commoncrawl/robots.txt'):
...     print(repr(line))
...     break
'User-Agent: *\n'

>>> # stream from/to compressed files, with transparent (de)compression:
>>> for line in open('smart_open/tests/test_data/1984.txt.gz', encoding='utf-8'):
...     print(repr(line))
'It was a bright cold day in April, and the clocks were striking thirteen.\n'
'Winston Smith, his chin nuzzled into his breast in an effort to escape the vile\n'
'wind, slipped quickly through the glass doors of Victory Mansions, though not\n'
'quickly enough to prevent a swirl of gritty dust from entering along with him.\n'

>>> # can use context managers too:
>>> with open('smart_open/tests/test_data/1984.txt.gz') as fin:
...     with open('smart_open/tests/test_data/1984.txt.bz2', 'w') as fout:
...         for line in fin:
...             fout.write(line)
74
80
78
79

>>> # can use any IOBase operations, like seek
>>> with open('s3://commoncrawl/robots.txt', 'rb') as fin:
...     for line in fin:
...         print(repr(line.decode('utf-8')))
...         break
...     offset = fin.seek(0)  # seek to the beginning
...     print(fin.read(4))
'User-Agent: *\n'
b'User'

>>> # stream from HTTP
>>> for line in open('http://example.com/index.html'):
...     print(repr(line))
...     break
'<!doctype html>\n'

Other examples of URLs that smart_open accepts:

  • s3://my_bucket/my_key
  • s3://my_key:my_secret@my_bucket/my_key
  • s3://my_key:my_secret@my_server:my_port@my_bucket/my_key
  • gs://my_bucket/my_blob
  • azure://my_bucket/my_blob
  • hdfs:///path/file
  • hdfs://path/file
  • webhdfs://host:port/path/file
  • ./local/path/file
  • ~/local/path/file
  • local/path/file
  • ./local/path/file.gz
  • file:///home/user/file
  • file:///home/user/file.bz2
  • [ssh|scp|sftp]://username@host//path/file
  • [ssh|scp|sftp]://username@host/path/file
  • [ssh|scp|sftp]://username:password@host/path/file

Documentation

Installation

smart_open supports a wide range of storage solutions, including AWS S3, Google Cloud and Azure. Each individual solution has its own dependencies. By default, smart_open does not install any dependencies, in order to keep the installation size small. You can install these dependencies explicitly using:

pip install smart_open[azure]  # Install Azure deps
pip install smart_open[gcs]    # Install GCS deps
pip install smart_open[s3]     # Install S3 deps

Or, if you don't mind installing a large number of third party libraries, you can install all dependencies using:

pip install smart_open[all]

Be warned that this option increases the installation size significantly, e.g. over 100MB.

If you're upgrading from smart_open versions 2.x and below, please check out the Migration Guide.

Built-in help

For detailed API info, see the online help:

help('smart_open')

or click here to view the help in your browser.

More examples

For the sake of simplicity, the examples below assume you have all the dependencies installed, i.e. you have done:

pip install smart_open[all]
>>> import os, boto3
>>> from smart_open import open
>>>
>>> # stream content *into* S3 (write mode) using a custom session
>>> session = boto3.Session(
...     aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
...     aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
... )
>>> url = 's3://smart-open-py37-benchmark-results/test.txt'
>>> with open(url, 'wb', transport_params={'client': session.client('s3')}) as fout:
...     bytes_written = fout.write(b'hello world!')
...     print(bytes_written)
12
# stream from HDFS
for line in open('hdfs://user/hadoop/my_file.txt', encoding='utf8'):
    print(line)

# stream from WebHDFS
for line in open('webhdfs://host:port/user/hadoop/my_file.txt'):
    print(line)

# stream content *into* HDFS (write mode):
with open('hdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')

# stream content *into* WebHDFS (write mode):
with open('webhdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')

# stream from a completely custom s3 server, like s3proxy:
for line in open('s3u://user:secret@host:port@mybucket/mykey.txt'):
    print(line)

# Stream to Digital Ocean Spaces bucket providing credentials from boto3 profile
session = boto3.Session(profile_name='digitalocean')
client = session.client('s3', endpoint_url='https://ams3.digitaloceanspaces.com')
transport_params = {'client': client}
with open('s3://bucket/key.txt', 'wb', transport_params=transport_params) as fout:
    fout.write(b'here we stand')

# stream from GCS
for line in open('gs://my_bucket/my_file.txt'):
    print(line)

# stream content *into* GCS (write mode):
with open('gs://my_bucket/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')

# stream from Azure Blob Storage
connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING']
transport_params = {
    'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str),
}
for line in open('azure://mycontainer/myfile.txt', transport_params=transport_params):
    print(line)

# stream content *into* Azure Blob Storage (write mode):
connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING']
transport_params = {
    'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str),
}
with open('azure://mycontainer/my_file.txt', 'wb', transport_params=transport_params) as fout:
    fout.write(b'hello world')

Compression Handling

The top-level compression parameter controls compression/decompression behavior when reading and writing. The supported values for this parameter are:

  • infer_from_extension (default behavior)
  • disable
  • .gz
  • .bz2
  • .zst

By default, smart_open determines the compression algorithm to use based on the file extension.

>>> from smart_open import open, register_compressor
>>> with open('smart_open/tests/test_data/1984.txt.gz') as fin:
...     print(fin.read(32))
It was a bright cold day in Apri

You can override this behavior to either disable compression or explicitly specify the algorithm to use. To disable compression:

>>> from smart_open import open, register_compressor
>>> with open('smart_open/tests/test_data/1984.txt.gz', 'rb', compression='disable') as fin:
...     print(fin.read(32))
b'\x1f\x8b\x08\x08\x85F\x94\\\x00\x031984.txt\x005\x8f=r\xc3@\x08\x85{\x9d\xe2\x1d@'

To specify the algorithm explicitly (e.g. for non-standard file extensions):

>>> from smart_open import open, register_compressor
>>> with open('smart_open/tests/test_data/1984.txt.gzip', compression='.gz') as fin:
...     print(fin.read(32))
It was a bright cold day in Apri
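
The same parameter applies when writing. A minimal sketch that forces gzip compression for an output path whose extension would not normally trigger it (the output path is only a placeholder):

from smart_open import open

# Write gzip-compressed text to a file with a non-standard extension
# ('output.dat' is a placeholder path, not from the original README).
with open('output.dat', 'w', compression='.gz', encoding='utf-8') as fout:
    fout.write('hello world\n')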

You can also easily add support for other file extensions and compression formats. For example, to open xz-compressed files:

>>> import lzma, os
>>> from smart_open import open, register_compressor
>>>
>>> def _handle_xz(file_obj, mode):
...     return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)
>>>
>>> register_compressor('.xz', _handle_xz)
>>>
>>> with open('smart_open/tests/test_data/1984.txt.xz') as fin:
...     print(fin.read(32))
It was a bright cold day in Apri

lzma is in the standard library in Python 3.3 and greater. For Python 2.7, use backports.lzma.

Transport-specific Options

smart_open supports a wide range of transport options out of the box, including:

  • S3
  • HTTP, HTTPS (read-only)
  • SSH, SCP and SFTP
  • WebHDFS
  • GCS
  • Azure Blob Storage

Each option involves setting up its own set of parameters. For example, for accessing S3, you often need to set up authentication, like API keys or a profile name. smart_open's open function accepts a keyword argument transport_params which accepts additional parameters for the transport layer. Here are some examples of using this parameter:

>>> import boto3
>>> fin = open('s3://commoncrawl/robots.txt', transport_params=dict(client=boto3.client('s3')))
>>> fin = open('s3://commoncrawl/robots.txt', transport_params=dict(buffer_size=1024))

For the full list of keyword arguments supported by each transport option, see the documentation:

help('smart_open.open')

S3 Credentials

smart_open uses the boto3 library to talk to S3. boto3 has several mechanisms for determining the credentials to use. By default, smart_open will defer to boto3 and let the latter take care of the credentials. There are several ways to override this behavior.

The first is to pass a boto3.Client object as a transport parameter to the open function. You can customize the credentials when constructing the session for the client. smart_open will then use the session when talking to S3.

session = boto3.Session(
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY,
    aws_session_token=SESSION_TOKEN,
)
client = session.client('s3', endpoint_url=..., config=...)
fin = open('s3://bucket/key', transport_params={'client': client})

Your second option is to specify the credentials within the S3 URL itself:

fin = open('s3://aws_access_key_id:aws_secret_access_key@bucket/key', ...)

Important: The two methods above are mutually exclusive. If you pass an AWS client and the URL contains credentials, smart_open will ignore the latter.

Important: smart_open ignores configuration files from the older boto library. Port your old boto settings to boto3 in order to use them with smart_open.

S3 Advanced Usage

Additional keyword arguments can be propagated to the boto3 methods that are used by smart_open under the hood, using the client_kwargs transport parameter.

For instance, to upload a blob with Metadata, an ACL, or a StorageClass, pass these keyword arguments to create_multipart_upload (docs).

kwargs = {'Metadata': {'version': 2}, 'ACL': 'authenticated-read', 'StorageClass': 'STANDARD_IA'}
fout = open(
    's3://bucket/key',
    'wb',
    transport_params={'client_kwargs': {'S3.Client.create_multipart_upload': kwargs}},
)

Iterating Over an S3 Bucket's Contents

Since going over all (or select) keys in an S3 bucket is a very common operation, there's also an extra function smart_open.s3.iter_bucket() that does this efficiently, processing the bucket keys in parallel (using multiprocessing):

>>> from smart_open import s3
>>> # we use workers=1 for reproducibility; you should use as many workers as you have cores
>>> bucket = 'silo-open-data'
>>> prefix = 'Official/annual/monthly_rain/'
>>> for key, content in s3.iter_bucket(bucket, prefix=prefix, accept_key=lambda key: '/201' in key, workers=1, key_limit=3):
...     print(key, round(len(content) / 2**20))
Official/annual/monthly_rain/2010.monthly_rain.nc 13
Official/annual/monthly_rain/2011.monthly_rain.nc 13
Official/annual/monthly_rain/2012.monthly_rain.nc 13

GCS Credentials

smart_open uses the google-cloud-storage library to talk to GCS. google-cloud-storage uses the google-cloud package under the hood to handle authentication. There are several options to provide credentials. By default, smart_open will defer to google-cloud-storage and let it take care of the credentials.

To override this behavior, pass a google.cloud.storage.Client object as a transport parameter to the open function. You can customize the credentials when constructing the client. smart_open will then use the client when talking to GCS. To follow along with the example below, refer to Google's guide to setting up GCS authentication with a service account.

import os
from google.cloud.storage import Client

service_account_path = os.environ['GOOGLE_APPLICATION_CREDENTIALS']
client = Client.from_service_account_json(service_account_path)
fin = open('gs://gcp-public-data-landsat/index.csv.gz', transport_params=dict(client=client))

If you need more credential options, you can create an explicit google.auth.credentials.Credentials object and pass it to the Client. To create an API token for use in the example below, refer to the GCS authentication guide.

import os
from google.auth.credentials import Credentials
from google.cloud.storage import Client

token = os.environ['GOOGLE_API_TOKEN']
credentials = Credentials(token=token)
client = Client(credentials=credentials)
fin = open('gs://gcp-public-data-landsat/index.csv.gz', transport_params={'client': client})

GCS Advanced Usage

Additional keyword arguments can be propagated to the GCS open method (docs), which is used by smart_open under the hood, using the blob_open_kwargs transport parameter.

Additional keyword arguments can be propagated to the GCS get_blob method (docs) when reading, using the get_blob_kwargs transport parameter.
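
For example, a minimal sketch that pins a read to a particular blob generation via get_blob_kwargs; the generation number, bucket and key below are made up for illustration and are not from the original README:

from smart_open import open

# Hypothetical example: forward if_generation_match to the underlying get_blob call
# (the generation number, bucket and key are placeholders).
transport_params = {'get_blob_kwargs': {'if_generation_match': 1234567890}}
with open('gs://my_bucket/my_file.txt', 'rb', transport_params=transport_params) as fin:
    data = fin.read()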

Additional blob properties (docs) can be set before an upload, as long as they are not read-only, using the blob_properties transport parameter.

open_kwargs = {'predefined_acl': 'authenticated-read'}
properties = {'metadata': {'version': 2}, 'storage_class': 'COLDLINE'}
fout = open(
    'gs://bucket/key',
    'wb',
    transport_params={'blob_open_kwargs': open_kwargs, 'blob_properties': properties},
)

Azure Credentials

smart_open uses the azure-storage-blob library to talk to Azure Blob Storage. By default, smart_open will defer to azure-storage-blob and let it take care of the credentials.

Azure Blob Storage has no way of inferring credentials; therefore, passing an azure.storage.blob.BlobServiceClient object as a transport parameter to the open function is required. You can customize the credentials when constructing the client. smart_open will then use the client when talking to Azure Blob Storage. To follow along with the example below, refer to Azure's guide to setting up authentication.

import os
from azure.storage.blob import BlobServiceClient

azure_storage_connection_string = os.environ['AZURE_STORAGE_CONNECTION_STRING']
client = BlobServiceClient.from_connection_string(azure_storage_connection_string)
fin = open('azure://my_container/my_blob.txt', transport_params={'client': client})

If you need more credential options, refer to theAzure Storage authentication guide.
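
If you prefer not to use a connection string, one common alternative (an assumption here, not from the original README; it requires the azure-identity package) is to build the client from an account URL and DefaultAzureCredential:

import os

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

from smart_open import open

# Assumes azure-identity is installed and the account URL is provided via an
# environment variable; both are illustrative placeholders.
account_url = os.environ['AZURE_STORAGE_ACCOUNT_URL']  # e.g. https://<account>.blob.core.windows.net
client = BlobServiceClient(account_url=account_url, credential=DefaultAzureCredential())
fin = open('azure://my_container/my_blob.txt', transport_params={'client': client})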

Azure Advanced Usage

Additional keyword arguments can be propagated to the commit_block_list method (docs), which is used by smart_open under the hood for uploads, using the blob_kwargs transport parameter.

kwargs = {'metadata': {'version': 2}}
fout = open('azure://container/key', 'wb', transport_params={'blob_kwargs': kwargs})

Drop-in replacement of pathlib.Path.open

smart_open.open can also be used with Path objects. The built-in Path.open() cannot read text from compressed files, so use patch_pathlib to replace it with smart_open.open() instead. This is helpful when, for example, working with compressed files.

>>> from pathlib import Path
>>> from smart_open.smart_open_lib import patch_pathlib
>>>
>>> _ = patch_pathlib()  # replace `Path.open` with `smart_open.open`
>>>
>>> path = Path("smart_open/tests/test_data/crime-and-punishment.txt.gz")
>>>
>>> with path.open("r") as infile:
...     print(infile.readline()[:41])
В начале июля, в чрезвычайно жаркое время

How do I ...?

See this document.

Extending smart_open

See this document.

Testing smart_open

smart_open comes with a comprehensive suite of unit tests. Before you can run the test suite, install the test dependencies:

pip install -e .[test]

Now, you can run the unit tests:

pytest smart_open

The tests are also run automatically with Travis CI on every commit push & pull request.

Comments, bug reports

smart_open lives on GitHub. You can file issues or pull requests there. Suggestions, pull requests and improvements welcome!


smart_open is open source software released under the MIT license. Copyright (c) 2015-now Radim Řehůřek.

