S3Fs

S3Fs is a Pythonic file interface to S3. It builds on top of botocore. The project is hosted on GitHub.

The top-level class S3FileSystem holds connection information and allows typical file-system style operations like cp, mv, ls, du, glob, etc., as well as put/get of local files to/from S3.

The connection can be anonymous - in which case only publicly-available, read-only buckets are accessible - or via credentials explicitly supplied or in configuration files.

Calling open() on an S3FileSystem (typically using a context manager) provides an S3File for read or write access to a particular key. The object emulates the standard File protocol (read, write, tell, seek), such that functions expecting a file can access S3. Only binary read and write modes are implemented, with blocked caching.

S3Fs uses and is based upon fsspec.

Examples

A simple example of locating and reading a file:

>>> import s3fs
>>> s3 = s3fs.S3FileSystem(anon=True)
>>> s3.ls('my-bucket')
['my-file.txt']
>>> with s3.open('my-bucket/my-file.txt', 'rb') as f:
...     print(f.read())
b'Hello, world'

(see also walk and glob)
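For example, a brief sketch of both; the bucket contents and output shown here are illustrative:

>>> s3.glob('my-bucket/*.txt')                        # pattern matching, like the glob module
['my-bucket/my-file.txt']
>>> for root, dirs, files in s3.walk('my-bucket'):    # recursive listing, like os.walk
...     print(root, dirs, files)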

Reading with delimited blocks:

>>> s3.read_block(path, offset=1000, length=10, delimiter=b'\n')
b'A whole line of text\n'

Writing with blocked caching:

>>> s3 = s3fs.S3FileSystem(anon=False)  # uses default credentials
>>> with s3.open('mybucket/new-file', 'wb') as f:
...     f.write(2*2**20 * b'a')
...     f.write(2*2**20 * b'a')  # data is flushed and file closed
>>> s3.du('mybucket/new-file')
{'mybucket/new-file': 4194304}

Because S3Fs faithfully copies the Python file interface, it can be used smoothly with other projects that consume the file interface, like gzip or pandas.

>>> import gzip
>>> import pandas as pd
>>> with s3.open('mybucket/my-file.csv.gz', 'rb') as f:
...     g = gzip.GzipFile(fileobj=f)  # Decompress data with gzip
...     df = pd.read_csv(g)           # Read CSV file with Pandas

Integration

The libraries intake, pandas and dask accept URLs with the prefix "s3://", and will use s3fs to complete the IO operation in question. The IO functions take an argument storage_options, which will be passed to S3FileSystem, for example:

df = pd.read_excel("s3://bucket/path/file.xls", storage_options={"anon": True})

This provides a way to pass any credentials or other arguments needed by s3fs.

Async

s3fs is implemented using aiobotocore, and offers async functionality. A number of methods of S3FileSystem are async; for each of these, there is also a synchronous version with the same name but lacking the `_` prefix.

If you wish to call s3fs from async code, you should pass asynchronous=True and, optionally, loop= to the constructor (the latter only if you wish to use both async and sync methods). You must also explicitly await the client creation before making any S3 call.

async def run_program():
    s3 = S3FileSystem(..., asynchronous=True)
    session = await s3.set_session()
    ...  # perform work
    await session.close()

asyncio.run(run_program())  # or call from your async code
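The async variants carry the leading underscore (e.g. _ls, _cat); as a sketch, the "perform work" step above might contain calls like the following (bucket and key names are illustrative):

files = await s3._ls('my-bucket')                # async counterpart of s3.ls
data = await s3._cat('my-bucket/my-file.txt')    # async counterpart of s3.cat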

Concurrent async operations are also used internally for bulk operations such as pipe/cat, get/put, cp/mv/rm. The async calls are hidden behind a synchronisation layer, so are designed to be called from normal code. If you are not using async-style programming, you do not need to know about how this works, but you might find the implementation interesting.
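As a sketch, such bulk calls can be issued from ordinary synchronous code and fan out concurrently under the hood; paths and output here are illustrative:

>>> s3.cat(['my-bucket/a.txt', 'my-bucket/b.txt'])   # fetch several keys concurrently
{'my-bucket/a.txt': b'...', 'my-bucket/b.txt': b'...'}
>>> s3.put('local-dir/', 'my-bucket/remote-dir/', recursive=True)   # upload a directory tree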

Multiprocessing

When using Python's multiprocessing, the start method must be set to either spawn or forkserver. fork is not safe to use because of the open sockets and async thread used by s3fs, and may lead to hard-to-find bugs and occasional deadlocks. Read more about the available start methods.
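A minimal sketch of selecting a safe start method via a multiprocessing context; the worker function and bucket names are illustrative:

import multiprocessing as mp
import s3fs

def list_bucket(bucket):
    # Each worker process builds its own filesystem instance
    s3 = s3fs.S3FileSystem(anon=True)
    return s3.ls(bucket)

if __name__ == '__main__':
    ctx = mp.get_context('spawn')            # or 'forkserver'; avoid 'fork' with s3fs
    with ctx.Pool(2) as pool:
        print(pool.map(list_bucket, ['my-bucket', 'other-bucket']))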

Limitations

This project is meant for convenience, rather than feature completeness. The following are known current omissions:

  • file access is always binary (although readline and iterating by line are possible)

  • no permissions/access-control (i.e., no chmod/chown methods)

Logging

The logger named s3fs provides information about the operations of the file system. To quickly see all messages, you can set the environment variable S3FS_LOGGING_LEVEL=DEBUG. The presence of this environment variable will install a handler for the logger that prints messages to stderr and set the log level to the given value. More advanced logging configuration is possible using Python's standard logging framework.
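For example, the same effect can be achieved in code with the standard library, using the s3fs logger name mentioned above:

import logging

logging.basicConfig()                                  # simple handler writing to stderr
logging.getLogger("s3fs").setLevel(logging.DEBUG)      # show all s3fs messages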

Credentials

The AWS key and secret may be provided explicitly when creating an S3FileSystem. A more secure way, not including the credentials directly in code, is to allow boto to establish the credentials automatically. Boto will try the following methods, in order:

  • AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN environment variables

  • configuration files such as ~/.aws/credentials

  • for nodes on EC2, the IAM metadata provider

You can specify a profile using s3fs.S3FileSystem(profile='PROFILE'). Otherwise s3fs will use authentication via boto environment variables.
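For reference, a sketch of both explicit options; the key, secret, token and profile values are placeholders:

# Credentials passed explicitly
>>> s3 = s3fs.S3FileSystem(key='ACCESS_KEY_ID', secret='SECRET_ACCESS_KEY', token='SESSION_TOKEN')
# Or a named profile from the AWS config/credentials files
>>> s3 = s3fs.S3FileSystem(profile='PROFILE')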

In a distributed environment, it is not expected that raw credentials should be passed between machines. In the explicitly provided credentials case, the method S3FileSystem.get_delegated_s3pars() can be used to obtain temporary credentials. When not using explicit credentials, it should be expected that every machine also has the appropriate environment variables, config files or IAM roles available.
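A sketch of that pattern, assuming the dictionary returned by get_delegated_s3pars() maps onto the constructor arguments so it can be forwarded to worker machines:

>>> s3 = s3fs.S3FileSystem(key='ACCESS_KEY_ID', secret='SECRET_ACCESS_KEY')
>>> pars = s3.get_delegated_s3pars()     # short-lived temporary credentials as a dict
# ... ship `pars` to the remote worker, then on that machine:
>>> worker_s3 = s3fs.S3FileSystem(**pars)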

If none of the credential methods are available, only anonymous access will work, and anon=True must be passed to the constructor.

Furthermore, S3FileSystem.current() will return the most-recently created instance, so this method could be used in preference to the constructor in cases where the code must be agnostic of the credentials/config used.
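For example (a sketch; the first call could live in any shared setup code):

>>> s3fs.S3FileSystem(anon=False, profile='PROFILE')   # configured once, e.g. at startup
>>> fs = s3fs.S3FileSystem.current()                   # elsewhere: reuse that instance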

S3 Compatible Storage

To use s3fs against an S3 compatible storage, like MinIO or Ceph Object Gateway, you'll probably need to pass extra parameters when creating the s3fs filesystem. Here are some sample configurations:

For a self-hosted MinIO instance:

# When relying on auto discovery for credentials
>>> s3 = s3fs.S3FileSystem(anon=False, endpoint_url='https://...')
# Or passing the credentials directly
>>> s3 = s3fs.S3FileSystem(key='miniokey...', secret='asecretkey...', endpoint_url='https://...')

It is also possible to set credentials through environment variables:

# export FSSPEC_S3_ENDPOINT_URL=https://...
# export FSSPEC_S3_KEY='miniokey...'
# export FSSPEC_S3_SECRET='asecretkey...'
>>> s3 = s3fs.S3FileSystem()
# or ...
>>> f = fsspec.open("s3://minio-bucket/...")

For Storj DCS via the S3-compatible Gateway:

# When relying on auto discovery for credentials
>>> s3 = s3fs.S3FileSystem(anon=False, endpoint_url='https://gateway.storjshare.io')
# Or passing the credentials directly
>>> s3 = s3fs.S3FileSystem(key='accesskey...', secret='asecretkey...', endpoint_url='https://gateway.storjshare.io')

For a Scaleway s3-compatible storage in the fr-par zone:

>>> s3 = s3fs.S3FileSystem(
      key='scaleway-api-key...',
      secret='scaleway-secretkey...',
      endpoint_url='https://s3.fr-par.scw.cloud',
      client_kwargs={
          'region_name': 'fr-par'
      })

For an OVH s3-compatible storage in the GRA zone:

>>> s3 = s3fs.S3FileSystem(
      key='ovh-s3-key...',
      secret='ovh-s3-secretkey...',
      endpoint_url='https://s3.GRA.cloud.ovh.net',
      client_kwargs={
          'region_name': 'GRA'
      },
      config_kwargs={
          'signature_version': 's3v4'
      })

Requester Pays Buckets

Some buckets, such as the arXiv raw data, are configured so that the requester of the data pays any transfer fees. You must be authenticated to access these buckets and (because these charges may be unexpected) Amazon requires an additional key on many of the API calls. To enable RequesterPays, create your file system as:

>>> s3 = s3fs.S3FileSystem(anon=False, requester_pays=True)

Serverside Encryption

For some buckets/files you may want to use some of S3's server-side encryption features. s3fs supports these in a few ways:

>>> s3 = s3fs.S3FileSystem(
...    s3_additional_kwargs={'ServerSideEncryption': 'AES256'})

This will create an s3 filesystem instance that will append the ServerSideEncryption argument to all s3 calls (where applicable).

The same applies for s3.open. Most of the methods on the filesystem object will also accept and forward keyword arguments to the underlying calls. The most recently specified argument is applied last in the case where both s3_additional_kwargs and a method's **kwargs are used.

The s3fs.utils.SSEParams class provides some convenient helpers for the server-side encryption parameters in particular. An instance can be passed instead of a regular python dictionary as the s3_additional_kwargs parameter.
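A hedged sketch of that usage; the keyword names (server_side_encryption, sse_kms_key_id) and the KMS key alias are assumptions to be checked against the s3fs.utils helpers in your installed version:

>>> from s3fs.utils import SSEParams
>>> sse_params = SSEParams(server_side_encryption='aws:kms', sse_kms_key_id='alias/my-key')
>>> s3 = s3fs.S3FileSystem(s3_additional_kwargs=sse_params)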

Bucket Version Awareness

If your bucket has object versioning enabled then you can add version-aware support to s3fs. This ensures that if a file is opened at a particular point in time, that version will be used for reading.

This mitigates the issue where more than one user is concurrently reading and writingto the same object.

>>> s3 = s3fs.S3FileSystem(version_aware=True)

# Open the file at the latest version
>>> fo = s3.open('versioned_bucket/object')
>>> versions = s3.object_version_info('versioned_bucket/object')

# Open the file at a particular version
>>> fo_old_version = s3.open('versioned_bucket/object', version_id='SOMEVERSIONID')

In order for this to function, the user must have the necessary IAM permissions to perform a GetObjectVersion operation.


These docs pages collect anonymous tracking data using goatcounter, and the dashboard is available to the public: https://s3fs.goatcounter.com/ .