Parallel S3 and local filesystem execution tool.
monobaila/s5cmd
s5cmd is a very fast S3 and local filesystem execution tool. It comes with support for a multitude of operations including tab completion and wildcard support for files, which can be very handy for your object storage workflow while working with a large number of files.
There are already other utilities to work with S3 and similar object storage services, thus it is natural to wonder what s5cmd has to offer that others don't.
In short, s5cmd is very fast. Thanks to Joshua Robinson for his study and experimentation on s5cmd; to quote his Medium post:
For uploads, s5cmd is 32x faster than s3cmd and 12x faster than aws-cli. For downloads, s5cmd can saturate a 40Gbps link (~4.3 GB/s), whereas s3cmd and aws-cli can only reach 85 MB/s and 375 MB/s respectively.
If you would like to know more about the performance of s5cmd and the reasons for its speed, refer to the benchmarks section.
s5cmd supports a wide range of object management tasks both for cloud storage services and local filesystems.
- List buckets and objects
- Upload, download or delete objects
- Move, copy or rename objects
- Set Server Side Encryption using AWS Key Management Service (KMS)
- Set Access Control List (ACL) for objects/files on upload, copy and move
- Print object contents to stdout
- Select JSON records from objects using SQL expressions
- Create or remove buckets
- Summarize object sizes, grouping by storage class
- Wildcard support for all operations
- Multiple arguments support for delete operation
- Command file support to run commands in batches at very high execution speeds
- Dry run support
- S3 Transfer Acceleration support
- Google Cloud Storage (and any other S3 API compatible service) support
- Structured logging for querying command outputs
- Shell auto-completion
- S3 ListObjects API backward compatibility
The Releases page provides pre-built binaries for Linux, macOS and Windows.
For macOS, a homebrew tap is provided:
brew install peak/tap/s5cmd
Warning: The MacPorts and Conda releases below are maintained by the community. They might be out of date compared to the official releases.
You can also install s5cmd from MacPorts on macOS:
sudo port selfupdate
sudo port install s5cmd
s5cmd is included in the conda-forge channel, and it can be downloaded through Conda.
Installing s5cmd from the conda-forge channel can be achieved by adding conda-forge to your channels with:
conda config --add channels conda-forge
conda config --set channel_priority strict
Once the conda-forge channel has been enabled, s5cmd can be installed with conda:
conda install s5cmd
P.S. Quoted from the s5cmd feedstock. You can also find further instructions in its README.
You can build s5cmd from source if you have Go 1.17+ installed.
go get github.com/peak/s5cmd
Building from master is not guaranteed to be stable, since development happens on the master branch.
$ docker pull peakcom/s5cmd
$ docker run --rm -v ~/.aws:/root/.aws peakcom/s5cmd <S3 operation>
ℹ️ The /aws directory is the working directory of the image. Mounting your current working directory to it allows you to run s5cmd as if it was installed in your system:
docker run --rm -v $(pwd):/aws -v ~/.aws:/root/.aws peakcom/s5cmd <S3 operation>
$ git clone https://github.com/peak/s5cmd && cd s5cmd
$ docker build -t s5cmd .
$ docker run --rm -v ~/.aws:/root/.aws s5cmd <S3 operation>
s5cmd supports multiple-level wildcards for all S3 operations. This is achieved by listing all S3 objects with the prefix up to the first wildcard, then filtering the results in-memory. For example, for the following command:
s5cmd cp 's3://bucket/logs/2020/03/*' .
first a ListObjects request is sent, then the copy operation is executed against each matching object, in parallel.
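The two steps above can be approximated in plain shell. This is only a sketch of the idea, not s5cmd's implementation: the function names list_prefix and matches are hypothetical, and a shell glob stands in for s5cmd's own wildcard rules, which may differ in detail.

```shell
# Hypothetical helper: print the part of a pattern that precedes the first
# wildcard character. This is the prefix a listing request would use.
list_prefix() {
    printf '%s\n' "${1%%[*?]*}"
}

# Hypothetical helper: succeed when key $1 matches pattern $2. A shell glob
# stands in for the in-memory filter applied to the listing results.
matches() {
    case "$1" in
        $2) return 0 ;;
        *)  return 1 ;;
    esac
}

pattern='s3://bucket/logs/2020/*/file*.gz'
echo "listing prefix: $(list_prefix "$pattern")"

# Pretend these keys came back from the listing request.
for key in \
    's3://bucket/logs/2020/03/file1.gz' \
    's3://bucket/logs/2020/03/notes.txt'
do
    if matches "$key" "$pattern"; then
        echo "match: $key"
    fi
done
```

The further left the first wildcard sits in the pattern, the shorter the listing prefix and the more keys have to be listed and filtered.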
s5cmd cp s3://bucket/object.gz .
Suppose we have the following objects:
s3://bucket/logs/2020/03/18/file1.gz
s3://bucket/logs/2020/03/19/file2.gz
s3://bucket/logs/2020/03/19/originals/file3.gz
s5cmd cp 's3://bucket/logs/2020/03/*' logs/
s5cmd will match the given wildcards and arguments by doing an efficient search against the given prefixes. All matching objects will be downloaded in parallel. s5cmd will create the destination directory if it is missing.
The logs/ directory content will look like:
$ tree
.
└── logs
    ├── 18
    │   └── file1.gz
    └── 19
        ├── file2.gz
        └── originals
            └── file3.gz

4 directories, 3 files
ℹ️ s5cmd preserves the source directory structure by default. If you want to flatten the source directory structure, use the --flatten flag.
s5cmd cp --flatten 's3://bucket/logs/2020/03/*' logs/
The logs/ directory content will look like:
$ tree
.
└── logs
    ├── file1.gz
    ├── file2.gz
    └── file3.gz

1 directory, 3 files
s5cmd cp object.gz s3://bucket/
To set server-side encryption (AWS KMS) for the uploaded file:
s5cmd cp -sse aws:kms -sse-kms-key-id <your-kms-key-id> object.gz s3://bucket/
To set the Access Control List (ACL) policy of the uploaded object:
s5cmd cp -acl bucket-owner-full-control object.gz s3://bucket/
s5cmd cp directory/ s3://bucket/
This will upload all files in the given directory to S3 while keeping the folder hierarchy of the source.
s5cmd rm s3://bucket/logs/2020/03/18/file1.gz
s5cmd rm 's3://bucket/logs/2020/03/19/*'
This will remove all matching objects:
s3://bucket/logs/2020/03/19/file2.gz
s3://bucket/logs/2020/03/19/originals/file3.gz
s5cmd utilizes the S3 delete batch API. If up to 1000 objects match, they'll be deleted in a single request. However, it should be noted that commands such as
s5cmd rm s3://bucket-foo/object s3://bucket-bar/object
are not supported by s5cmd and result in an error (since 2 different buckets are involved), as this is at odds with the benefit of performing batch delete requests. If needed, one can use s5cmd run mode for this case, i.e.,
$ s5cmd run
rm s3://bucket-foo/object
rm s3://bucket-bar/object
More details and examples on s5cmd run are presented in a later section.
s5cmd supports copying objects on the server side as well.
s5cmd cp 's3://bucket/logs/2020/*' s3://bucket/logs/backup/
This will copy all the matching objects to the given S3 prefix, respecting the source folder hierarchy.
s5cmd supports the SelectObjectContent S3 operation, and will run your SQL query against objects matching normal wildcard syntax and emit matching JSON records via stdout. Records from multiple objects will be interleaved, and the order of the records is not guaranteed (though it's likely that the records from a single object will arrive in-order, even if interleaved with other records).
$ s5cmd select --compression GZIP \
  --query "SELECT s.timestamp, s.hostname FROM S3Object s WHERE s.ip_address LIKE '10.%' OR s.application='unprivileged'" \
  s3://bucket-foo/object/2021/*
{"timestamp":"2021-07-08T18:24:06.665Z","hostname":"application.internal"}
{"timestamp":"2021-07-08T18:24:16.095Z","hostname":"api.github.com"}
At the moment this operation only supports JSON records selected with SQL. S3 calls this lines-type JSON, but it seems that it works even if the records aren't line-delineated. YMMV.
$ s5cmd du --humanize 's3://bucket/2020/*'
30.8M bytes in 3 objects: s3://bucket/2020/*
The most powerful feature of s5cmd is the commands file. Thousands of S3 and filesystem commands are declared in a file (or simply piped in from another process) and they are executed using multiple parallel workers. Since only one program is launched, thousands of unnecessary fork-exec calls are avoided. This way S3 execution times can reach a few thousand operations per second.
s5cmd run commands.txt
or
cat commands.txt | s5cmd run
commands.txt content could look like:
cp s3://bucket/2020/03/* logs/2020/03/

# line comments are supported
rm s3://bucket/2020/03/19/file2.gz

# empty lines are OK too like above

# rename an S3 object
mv s3://bucket/2020/03/18/file1.gz s3://bucket/2020/03/18/original/file.gz
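Such a file does not have to be written by hand. A minimal sketch of generating the commands with a loop follows; the function name gen_commands, the bucket name and the day numbers are placeholders, and the s5cmd invocation itself is left as a comment so the snippet has no side effects.

```shell
# Hypothetical generator: print one cp command per day passed as an argument.
# The bucket name and paths are placeholders.
gen_commands() {
    for day in "$@"; do
        printf 'cp s3://bucket/2020/03/%s/* logs/2020/03/%s/\n' "$day" "$day"
    done
}

# Show what would be handed over to s5cmd.
gen_commands 17 18 19

# To execute the commands in a single s5cmd process, pipe them in:
#   gen_commands 17 18 19 | s5cmd run
```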
The sync command synchronizes S3 buckets, prefixes, directories and files; it also works between S3 buckets and prefixes. It compares files between source and destination, taking source files as the source of truth:
- copies files that do not exist in the destination
- copies files that exist in both locations if the comparison made with the sync strategy allows it
It makes a one-way synchronization from source to destination without modifying any of the source files or deleting any of the destination files (unless the --delete flag is passed).
Suppose we have the following files:
   - 29 Sep 10:00 .
5000 29 Sep 11:00 ├── favicon.ico
 300 29 Sep 10:00 ├── index.html
  50 29 Sep 10:00 ├── readme.md
  80 29 Sep 11:30 └── styles.css
s5cmd ls s3://bucket/static/
2021/09/29 10:00:01  300 index.html
2021/09/29 11:10:01   10 readme.md
2021/09/29 10:00:01   90 styles.css
2021/09/29 11:10:01   10 test.html
Running sync from this directory to s3://bucket/static/ would:
- copy favicon.ico: the file does not exist in the destination.
- copy styles.css: the source file is newer than its remote counterpart.
- copy readme.md: even though the source file is older, its size differs from the destination's; the source file is assumed to be the source of truth.
s5cmd sync . s3://bucket/static/
cp favicon.ico s3://bucket/static/favicon.ico
cp styles.css s3://bucket/static/styles.css
cp readme.md s3://bucket/static/readme.md
Running with the --delete flag would also delete destination files that do not exist in the source:
s5cmd sync --delete . s3://bucket/static/
rm s3://bucket/static/test.html
cp favicon.ico s3://bucket/static/favicon.ico
cp styles.css s3://bucket/static/styles.css
cp readme.md s3://bucket/static/readme.md
It's also possible to use wildcards to sync only a subset of files.
To sync only the .html files in the S3 bucket above to the same local filesystem:
s5cmd sync 's3://bucket/static/*.html' .
cp s3://bucket/static/index.html index.html
cp s3://bucket/static/test.html test.html
By default s5cmd compares both the sizes and the modification times of files, treating source files as the source of truth. A difference in size, or a source file that is newer than the destination, causes s5cmd to copy the source object to the destination.
mod time | size | should sync |
---|---|---|
src > dst | src != dst | ✅ |
src > dst | src == dst | ✅ |
src <= dst | src != dst | ✅ |
src <= dst | src == dst | ❌ |
With the --size-only flag, it's possible to use a strategy that only compares file sizes. The source is treated as the source of truth and any difference in size causes s5cmd to copy the source object to the destination.
mod time | size | should sync |
---|---|---|
src > dst | src != dst | ✅ |
src > dst | src == dst | ❌ |
src <= dst | src != dst | ✅ |
src <= dst | src == dst | ❌ |
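The two tables can be condensed into a small decision function. The sketch below is a hypothetical helper named should_sync that mirrors the tables; it is not taken from s5cmd's source. Times and sizes are plain integers, and the strategy names "default" and "size-only" are labels chosen for this example.

```shell
# Hypothetical helper mirroring the two tables above.
# Arguments: strategy (default|size-only), src mtime, dst mtime, src size, dst size.
# Exit status 0 means "copy the source", 1 means "skip".
should_sync() {
    strategy=$1 src_mtime=$2 dst_mtime=$3 src_size=$4 dst_size=$5

    # A size difference triggers a copy under both strategies.
    if [ "$src_size" -ne "$dst_size" ]; then
        return 0
    fi

    # With equal sizes, only the default strategy also looks at modification time.
    if [ "$strategy" = "default" ] && [ "$src_mtime" -gt "$dst_mtime" ]; then
        return 0
    fi

    return 1
}

# Files from the example above, with times as minutes since midnight (rounded).
# readme.md: older source (10:00 vs 11:10), different size (50 vs 10).
if should_sync default 600 670 50 10; then echo "readme.md: copy"; else echo "readme.md: skip"; fi
# index.html: same time, same size.
if should_sync default 600 600 300 300; then echo "index.html: copy"; else echo "index.html: skip"; fi
```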
The --dry-run flag will output what operations will be performed without actually carrying out those operations.
Suppose we have the following objects:
s3://bucket/pre/file1.gz
...
s3://bucket/pre/last.txt
running
s5cmd --dry-run cp 's3://bucket/pre/*' s3://another-bucket/
will output
cp s3://bucket/pre/file1.gz s3://another-bucket/file1.gz
...
cp s3://bucket/pre/last.txt s3://another-bucket/last.txt
however, those copy operations will not be performed. The output only displays what s5cmd would do when run without --dry-run.
Note that --dry-run can be used with any operation that has a side effect, i.e., cp, mv, rm, mb ...
The --use-list-objects-v1 flag will force using the S3 ListObjectsV1 API. This flag is useful for services that do not support the ListObjectsV2 API.
s5cmd --use-list-objects-v1 ls s3://bucket/
s5cmd uses the official AWS SDK to access S3. The SDK requires credentials to sign requests to AWS. Credentials can be provided in a variety of ways:
- Command line options: --profile to use a named profile, --credentials-file to use the specified credentials file, and --no-sign-request to send requests anonymously
- Environment variables
- AWS credentials file, including profile selection via the AWS_PROFILE environment variable
- If s5cmd runs on an Amazon EC2 instance, EC2 IAM role
- If s5cmd runs on EKS, Kube IAM role
The SDK detects and uses the built-in providers automatically, without requiring manual configuration.
While executing the commands, s5cmd detects the region according to the following order of priority:
- --source-region or --destination-region flags of the cp command.
- AWS_REGION environment variable.
- Region section of AWS profile.
- Auto detection from bucket region (via HeadBucket).
- us-east-1 as default region.
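The priority order amounts to "first non-empty value wins". A minimal sketch, assuming each source has already been read into a string (empty when unset); the function name resolve_region and the region values are illustrative only, and the real lookup, including the HeadBucket call, happens inside s5cmd.

```shell
# Hypothetical helper: print the first non-empty candidate. Callers pass the
# candidates in priority order: flag value, AWS_REGION, profile region,
# detected bucket region. Falls back to us-east-1 when all are empty.
resolve_region() {
    for candidate in "$@"; do
        if [ -n "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    printf '%s\n' "us-east-1"
}

# No flag given, environment variable and profile both set:
# the environment variable takes precedence over the profile.
resolve_region "" "eu-west-1" "us-west-2" ""
```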
Shell completion is supported for bash, pwsh (PowerShell) and zsh.
Run s5cmd --install-completion to obtain the appropriate auto-completion script for your shell. Note that install-completion does not install the auto-completion but merely gives the instructions to install it. The name is kept as it is for backward compatibility.
To actually enable auto-completion:
- In bash and zsh, add the auto-completion script to your .bashrc or .zshrc file.
- In pwsh, save the auto-completion script to a file named s5cmd.ps1 and add the full path of the s5cmd.ps1 file to your profile file (which you can locate with $profile).
Finally, restart your shell to activate the changes.
Note: The environment variable SHELL must be accurate for the auto-completion to function properly. That is, it should point to the bash binary in bash, to the zsh binary in zsh and to the pwsh binary in PowerShell.
Note: The auto-completion is tested with the following versions of the shells:
zsh 5.8.1 (x86_64-apple-darwin21.0)
GNU bash, version 5.1.16(1)-release (x86_64-apple-darwin21.1.0)
PowerShell 7.2.6
s5cmd supports S3 API compatible services, such as GCS, Minio or your favorite object storage.
s5cmd --endpoint-url https://storage.googleapis.com ls
or, alternatively, with an environment variable:
S3_ENDPOINT_URL="https://storage.googleapis.com" s5cmd ls
# or
export S3_ENDPOINT_URL="https://storage.googleapis.com"
s5cmd ls
All variants will return your GCS buckets.
s5cmd reads .aws/credentials to access Google Cloud Storage. Populate the aws_access_key_id and aws_secret_access_key fields in .aws/credentials with an HMAC key created using this procedure.
s5cmd will use virtual-host style bucket resolving for S3, S3 transfer acceleration and GCS. If a custom endpoint is provided, it'll fall back to path-style.
s5cmd uses an exponential backoff retry mechanism for transient or potential server-side throttling errors. Non-retriable errors, such as invalid credentials, authorization errors etc., will not be retried. By default, s5cmd will retry 10 times for up to a minute. The number of retries is adjustable via the --retry-count flag.
ℹ️ Enable debug level logging for displaying retryable errors.
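For readers unfamiliar with the term, exponential backoff means that the wait before each retry doubles until it hits an upper bound. The sketch below illustrates the general technique only; base_ms and cap_ms are hypothetical values chosen for the example and are not s5cmd's real parameters, and the jitter that real implementations usually add is left out.

```shell
# Hypothetical parameters, not s5cmd's real values.
base_ms=100     # delay before the first retry
cap_ms=20000    # upper bound for a single delay

# Print the delay in milliseconds before retry number $1 (counting from 0):
# base * 2^attempt, limited to the cap.
backoff_delay_ms() {
    delay=$(( base_ms * (1 << $1) ))
    if [ "$delay" -gt "$cap_ms" ]; then
        delay=$cap_ms
    fi
    printf '%s\n' "$delay"
}

# Print the schedule for 10 retries, the default retry count mentioned above.
attempt=0
while [ "$attempt" -lt 10 ]; do
    echo "retry $attempt: wait $(backoff_delay_ms "$attempt") ms"
    attempt=$(( attempt + 1 ))
done
```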
On some shells, like zsh, the * character gets treated as a file globbing wildcard, which causes unexpected results for s5cmd. You might see an output like:
zsh: no matches found
If that happens, you need to wrap your wildcard expression in single quotes, like:
s5cmd cp '*.gz' s3://bucket/
s5cmd supports both structured and unstructured outputs.
- unstructured output
$ s5cmd cp s3://bucket/testfile .
cp s3://bucket/testfile testfile
$ s5cmd cp --no-clobber s3://somebucket/file.txt file.txt
ERROR "cp s3://somebucket/file.txt file.txt": object already exists
- If the --json flag is provided:
{"operation":"cp","success":true,"source":"s3://bucket/testfile","destination":"testfile","object":"[object]"}
{"operation":"cp","job":"cp s3://somebucket/file.txt file.txt","error":"'cp s3://somebucket/file.txt file.txt': object already exists"}
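Because every JSON record sits on its own line, the output can be filtered with ordinary line-oriented tools. A minimal sketch using the two sample records above; the function names sample_log and failed_jobs are hypothetical, and a JSON-aware tool such as jq would be more robust than grep for anything beyond a quick check.

```shell
# Sample log lines copied from the example above.
sample_log() {
    cat <<'EOF'
{"operation":"cp","success":true,"source":"s3://bucket/testfile","destination":"testfile","object":"[object]"}
{"operation":"cp","job":"cp s3://somebucket/file.txt file.txt","error":"'cp s3://somebucket/file.txt file.txt': object already exists"}
EOF
}

# Keep only the records that carry an "error" field.
failed_jobs() {
    grep '"error":'
}

sample_log | failed_jobs
echo "failures: $(sample_log | failed_jobs | wc -l | tr -d ' ')"
```

In practice the input would come from an s5cmd invocation run with the --json flag instead of sample_log.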
numworkers is a global option that sets the size of the global worker pool. The default value of numworkers is 256. Commands such as cp, select and run, which can benefit from parallelism, use this worker pool to execute tasks. A task can be an upload, a download or anything in a run file.
For example, if you are uploading 100 files to an S3 bucket and --numworkers is set to 10, then s5cmd will limit the number of files concurrently uploaded to 10.
s5cmd --numworkers 10 cp '/Users/foo/bar/*' s3://mybucket/foo/bar/
concurrency is a cp command option. It sets the number of parts that will be uploaded or downloaded in parallel for a single file. This parameter is used by the AWS Go SDK. The default value of concurrency is 5.
The numworkers and concurrency options can be used together:
s5cmd --numworkers 10 cp --concurrency 10 '/Users/foo/bar/*' s3://mybucket/foo/bar/
If you have a few large files to download, setting --numworkers to a very high value will not affect download speed. In this scenario, setting --concurrency to a higher value may have a better impact on the download speed.
Some benchmarks regarding the performance of s5cmd are introduced below. For more details refer to this post, which is the source of the benchmarks presented.
Upload/download of single large file
Uploading large number of small-sized files
Performance comparison on different hardware
So, where does all this speed come from?
There are mainly two reasons for this:
- It is written in Go, a statically compiled language designed to make development of concurrent systems easy and make full utilization of multi-core processors.
- Parallelization. s5cmd starts out with concurrent worker pools and parallelizes workloads as much as possible while trying to achieve maximum throughput.
The bench.py script can be used to compare the performance of two different s5cmd builds. Refer to this readme file for further details.
Some of the advanced usage patterns provided below are inspired by the following article (thank you! @joshuarobinson).
Assume we have a set of objects on S3, and we would like to list them in sorted fashion according to object names.
$ s5cmd ls s3://bucket/reports/ | sort -k 4
2020/08/17 09:34:33   1364 antalya.csv
2020/08/17 09:34:33      0 batman.csv
2020/08/17 09:34:33  23114 istanbul.csv
2020/08/17 09:34:33  26154 izmir.csv
2020/08/17 09:34:33    112 samsun.csv
2020/08/17 09:34:33  12552 van.csv
For a more practical scenario, let's say we have an avocado prices dataset, and we would like to take a peek at the first few lines of the data by fetching only the necessary bytes.
$ s5cmd cat s3://bucket/avocado.csv.gz | gunzip | xsv slice --len 5 | xsv table
   Date        AveragePrice  Total Volume  4046     4225       4770   Total Bags  Small Bags  Large Bags  XLarge Bags  type          year  region
0  2015-12-27  1.33          64236.62      1036.74  54454.85   48.16  8696.87     8603.62     93.25       0.0          conventional  2015  Albany
1  2015-12-20  1.35          54876.98      674.28   44638.81   58.33  9505.56     9408.07     97.49       0.0          conventional  2015  Albany
2  2015-12-13  0.93          118220.22     794.7    109149.67  130.5  8145.35     8042.21     103.14      0.0          conventional  2015  Albany
3  2015-12-06  1.08          78992.15      1132.0   71976.41   72.58  5811.16     5677.4      133.76      0.0          conventional  2015  Albany
4  2015-11-29  1.28          51039.6       941.48   43838.39   75.78  6183.95     5986.26     197.69      0.0          conventional  2015  Albany
s5cmd allows passing a file containing a list of operations to be performed as an argument to the run command, as illustrated in the above example. Alternatively, one can pipe commands into run:
BUCKET=s5cmd-test; s5cmd ls s3://$BUCKET/*test | grep -v DIR | awk '{print $NF}' | xargs -I {} echo "cp s3://$BUCKET/{} /local/directory/" | s5cmd run
The above command performs two s5cmd invocations: the first searches for files with the test suffix, then a copy-to-local-directory command is created for each matching file, and finally those commands are piped into run.
Let's examine another usage instance, where we migrate files older than 30 days to a cloud object storage:
find /mnt/joshua/nachos/ -type f -mtime +30 | awk '{print "mv "$1" s3://joshuarobinson/backup/"$1}' | s5cmd run
It is worth mentioning that the run command should not be considered a silver bullet for all operations. For example, assume we want to remove the following objects:
s3://bucket/prefix/2020/03/object1.gz
s3://bucket/prefix/2020/04/object1.gz
...
s3://bucket/prefix/2020/09/object77.gz
Rather than executing
rm s3://bucket/prefix/2020/03/object1.gz
rm s3://bucket/prefix/2020/04/object1.gz
...
rm s3://bucket/prefix/2020/09/object77.gz
with the run command, it is better to just use
rm s3://bucket/prefix/2020/0*/object*.gz
The latter sends a single delete request per thousand objects, whereas the former approach sends a separate delete request for each subcommand provided to run. Thus, there can be a significant runtime difference between those two approaches.
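The difference is easy to quantify. A small sketch of the arithmetic, assuming one request per batch of up to 1000 objects for the wildcard form and one request per line for the run form; the function names and the object count of 2500 are made up for illustration.

```shell
# Delete requests for n objects removed with a single wildcard rm:
# one request per batch of up to 1000 objects (ceiling division).
batch_delete_requests() {
    echo $(( ($1 + 999) / 1000 ))
}

# Delete requests when each object gets its own rm line in a run file.
per_line_delete_requests() {
    echo "$1"
}

n=2500   # illustrative object count
echo "wildcard rm: $(batch_delete_requests "$n") requests"
echo "run file:    $(per_line_delete_requests "$n") requests"
```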
MIT. See LICENSE.