Movatterモバイル変換


[0]ホーム

URL:


Skip to content
DEV Community
Log in Create account

DEV Community

Cover image for Using AWS S3 as a database
AWS Community Builders  profile imageDaniel Muller
Daniel Muller forAWS Community Builders

Posted on

     

Using AWS S3 as a database

During re:Invent 2017, AWS announced a new feature for S3:s3-select, which went GA in April 2018.

In a Nutshell

S3 select allows to retrieve partial content from a single key in S3 using SQL. You can think of it as a single table-database.

There are some constraints:

  • Data needs to be in a single file and can't be grouped under a prefix
  • The content of the file must be in JSON, CSV or Apache Parquet
  • Supported compressions are GZIP or BZIP2
  • You can't use joins between tables
  • You can't update a single entry, you need to replace the whole file
    • But this comes with a benefit: updating the data becomes very easy. All you need is an S3 PutObject access (Console, CLI, SDK, SFTP, ...)

This is specially useful in bigdata. Instead of loading a large file, you retrieve only the useful content. Therefore reducing network transfer, storage and memory on the processing side, which translates into cost reduction.

S3-Select stills needs to scan the whole file (you pay for that), but you gain on the processing side.

Which format to choose?

Each format has it's pros and cons. The format to use mainly depends on your production capabilities and query needs.

Being text formats, CSV and JSON a very easy to produce (code, spreadsheets, ...). Parquet needs some additional skills and tools.

Schema alterations are straight forward in JSON, since every entry carries it's own schema. For CSV and Parquet, you need to alter each entry with the new field. Using a spreadsheet software helps for CSV.

With all the attributes repeated, JSON is the heaviest format. Half of the file is schema definition. With CSV and Parquet you define your attributes only once. Obviously, this becomes less of a factor once compressed.

On the query side, Parquet is the clear winner. Being a columnar format, the amount of data scanned is greatly reduced, since only the needed columns are used. With CSV and JSON the whole file needs to be scanned.

JSONCSVParquet
Data CreationEasyEasyDifficult
Data ModificationEasyAverageDifficult
Storage SizeHighLowLow
Query speedSlowSlowFast

Which compression to choose?

You should compress your files, there is no reason to keep them vanilla. Smaller files results in less storage used and faster transfer times from your data source to S3.

BZIP2 is a bit slower, but will generate smaller files than GZIP. You reduce even more your upload times and storage costs, but with a small impact on query times.

A simple example

player_idplayer_slugfirst_namelast_nameplayer_urlflag_coderesidencebirthplacebirthdatebirth_yearbirth_monthbirth_dayturned_proweight_lbsweight_kgheight_ftheight_inchesheight_cmhandednessbackhand
a002ricardo-acunaRicardoAcunahttp://www.atpworldtour.com/en/players/ricardo-acuna/a002/overviewCHIJupiter, FL, USASantiago, Chile1958.01.1319580113015068"5'9"""69175
a001sadiq-abdullahiSadiqAbdullahihttp://www.atpworldtour.com/en/players/sadiq-abdullahi/a001/overviewNGR1960.02.0219600202000"0'0"""00
a005nelson-aertsNelsonAertshttp://www.atpworldtour.com/en/players/nelson-aerts/a005/overviewBRACachoeira Do Sul, Brazil1963.04.2519630425016575"6'2"""74188
a004egan-adamsEganAdamshttp://www.atpworldtour.com/en/players/egan-adams/a004/overviewUSAPalmetto, FL, USAMiami Beach, FL, USA1959.06.1519590615016073"5'10"""70178

...

Query the dataset

The API allowing to query a file's content isSelectObjectContent. Also available in most of the SDK's.

For theCLI, you find it underaws s3api select-object-content.

Let's do it with #"http://www.w3.org/2000/svg" width="20px" height="20px" viewbox="0 0 24 24">Enter fullscreen modeExit fullscreen mode

records:

[{"last_name":"Allegro","birth_year":"1978"},{"last_name":"Bastl","birth_year":"1975"},{"last_name":"Bohli","birth_year":"1983"},{"last_name":"Blatter","birth_year":"1949"},...]
Enter fullscreen modeExit fullscreen mode

stats:

{"Details":{"BytesScanned":379951,"BytesProcessed":1557430,"BytesReturned":1034}}
Enter fullscreen modeExit fullscreen mode

S3-Select had to scan the whole compressed file and needed to process the uncompressed content. But needed to return only 1KB (~0.3% of the total file).

In conclusion

Considering S3 as a Database is surely far fetched. But it works very well for single large datasets on which you need to retrieve a small chunk. Consider it as your read-only slave database.

Keep in mind, that the read speed is far below what you can achieve with a real databases on SSD, so don't use this for anything time sensitive.

The ability to update content with more traditional non-database tools, allows a wider range of peoples to curate and maintain the data.

A real life example

When S3-Select came out, I decided to use the above concept to create aServerless GeoIP API using Maxmind's GeoIP2 databases. I wanted to replace the locally installed databases on each server by an HTTP service. This helps on the administrative side by managing updates in only one place, remove boot up scripts while autoscaling, and also it allows other processes and services to use this data (ETL, logging, lambda, ...).

You can read more about it in myblog or install it from myGithub repo (works with GeoIP2 and GeoLite2 databases).

I made some tweaks from the above example:

  • The data files are sharded by primary key to reduce the scan size
  • Cloudfront is used as a database-cache layer to avoid having to query twice for the same IP.

Why not DynamoDB?

When S3-Select came out, DynamoDB didn't have an "On-Demand" pricing. Having to provision (and pay for) a fixed throughput was too expensive.

Today, with "On-Demand" pricing, I would definitively use DynamoDB for this tool. Storage pricing is similar to S3, query pricing also. Maintaining and querying the data in DynamoDB is also easier.

Top comments(1)

Subscribe
pic
Create template

Templates let you quickly answer FAQs or store snippets for re-use.

Dismiss
CollapseExpand
 
m_azarboon profile image
Mahdi Azarboon
Cloud Architect - Serverless
  • Joined

Great post. Please publish more. I've just followed you.

Are you sure you want to hide this comment? It will become hidden in your post, but will still be visible via the comment'spermalink.

For further actions, you may consider blocking this person and/orreporting abuse

AWS Community Builders
Build On!

Would you like to become an AWS Community Builder? Learn more about the program and apply to join when applications are open next.

Learn more

More fromAWS Community Builders

Serverless Meets Secure Networking: Payment Gateway building block with VPC Lattice
#webdev#aws#serverless#vpclattice
Deploying a Voting App on AWS EKS
#aws#kubernetes#eks#webdev
The art of guesstimating
#aws#kubernetes#cloud#architecture
DEV Community

We're a place where coders share, stay up-to-date and grow their careers.

Log in Create account

[8]ページ先頭

©2009-2025 Movatter.jp