Posted onNov 5, 2020

Using AWS S3 as a database

#aws #s3 #serverless #database

During re:Invent 2017, AWS announced a new feature for S3:s3-select, which went GA in April 2018.

In a Nutshell

S3 select allows to retrieve partial content from a single key in S3 using SQL. You can think of it as a single table-database.

There are some constraints:

Data needs to be in a single file and can't be grouped under a prefix
The content of the file must be in JSON, CSV or Apache Parquet
Supported compressions are GZIP or BZIP2
You can't use joins between tables
You can't update a single entry, you need to replace the whole file
- But this comes with a benefit: updating the data becomes very easy. All you need is an S3 PutObject access (Console, CLI, SDK, SFTP, ...)

This is specially useful in bigdata. Instead of loading a large file, you retrieve only the useful content. Therefore reducing network transfer, storage and memory on the processing side, which translates into cost reduction.

S3-Select stills needs to scan the whole file (you pay for that), but you gain on the processing side.

Which format to choose?

Each format has it's pros and cons. The format to use mainly depends on your production capabilities and query needs.

Being text formats, CSV and JSON a very easy to produce (code, spreadsheets, ...). Parquet needs some additional skills and tools.

Schema alterations are straight forward in JSON, since every entry carries it's own schema. For CSV and Parquet, you need to alter each entry with the new field. Using a spreadsheet software helps for CSV.

With all the attributes repeated, JSON is the heaviest format. Half of the file is schema definition. With CSV and Parquet you define your attributes only once. Obviously, this becomes less of a factor once compressed.

On the query side, Parquet is the clear winner. Being a columnar format, the amount of data scanned is greatly reduced, since only the needed columns are used. With CSV and JSON the whole file needs to be scanned.

	JSON	CSV	Parquet
Data Creation	Easy	Easy	Difficult
Data Modification	Easy	Average	Difficult
Storage Size	High	Low	Low
Query speed	Slow	Slow	Fast

Which compression to choose?

You should compress your files, there is no reason to keep them vanilla. Smaller files results in less storage used and faster transfer times from your data source to S3.

BZIP2 is a bit slower, but will generate smaller files than GZIP. You reduce even more your upload times and storage costs, but with a small impact on query times.

A simple example

Dataset used:atp-world-tour-tennis-data/players
Size: 1.5MB (372KB gzipped)
Entries: 10912
Fields: 20

player_id	player_slug	first_name	last_name	player_url	flag_code	residence	birthplace	birthdate	birth_year	birth_month	birth_day	weight_lbs	weight_kg	height_ft	height_inches	height_cm
a002	ricardo-acuna	Ricardo	Acuna	http://www.atpworldtour.com/en/players/ricardo-acuna/a002/overview	CHI	Jupiter, FL, USA	Santiago, Chile	1958.01.13	1958	01	13	150	68	"5'9"""	69	175
a001	sadiq-abdullahi	Sadiq	Abdullahi	http://www.atpworldtour.com/en/players/sadiq-abdullahi/a001/overview	NGR			1960.02.02	1960	02	02	0	0	"0'0"""	0	0
a005	nelson-aerts	Nelson	Aerts	http://www.atpworldtour.com/en/players/nelson-aerts/a005/overview	BRA		Cachoeira Do Sul, Brazil	1963.04.25	1963	04	25	165	75	"6'2"""	74	188
a004	egan-adams	Egan	Adams	http://www.atpworldtour.com/en/players/egan-adams/a004/overview	USA	Palmetto, FL, USA	Miami Beach, FL, USA	1959.06.15	1959	06	15	160	73	"5'10"""	70	178

...

Query the dataset

The API allowing to query a file's content isSelectObjectContent. Also available in most of the SDK's.

For theCLI, you find it underaws s3api select-object-content.

Let's do it with #"http://www.w3.org/2000/svg" width="20px" height="20px" viewbox="0 0 24 24">Enter fullscreen mode

records:

[{"last_name":"Allegro","birth_year":"1978"},{"last_name":"Bastl","birth_year":"1975"},{"last_name":"Bohli","birth_year":"1983"},{"last_name":"Blatter","birth_year":"1949"},...]

stats:

{"Details":{"BytesScanned":379951,"BytesProcessed":1557430,"BytesReturned":1034}}

S3-Select had to scan the whole compressed file and needed to process the uncompressed content. But needed to return only 1KB (~0.3% of the total file).

In conclusion

Considering S3 as a Database is surely far fetched. But it works very well for single large datasets on which you need to retrieve a small chunk. Consider it as your read-only slave database.

Keep in mind, that the read speed is far below what you can achieve with a real databases on SSD, so don't use this for anything time sensitive.

The ability to update content with more traditional non-database tools, allows a wider range of peoples to curate and maintain the data.

A real life example

When S3-Select came out, I decided to use the above concept to create aServerless GeoIP API using Maxmind's GeoIP2 databases. I wanted to replace the locally installed databases on each server by an HTTP service. This helps on the administrative side by managing updates in only one place, remove boot up scripts while autoscaling, and also it allows other processes and services to use this data (ETL, logging, lambda, ...).

You can read more about it in myblog or install it from myGithub repo (works with GeoIP2 and GeoLite2 databases).

I made some tweaks from the above example:

The data files are sharded by primary key to reduce the scan size
Cloudfront is used as a database-cache layer to avoid having to query twice for the same IP.

Why not DynamoDB?

When S3-Select came out, DynamoDB didn't have an "On-Demand" pricing. Having to provision (and pay for) a fixed throughput was too expensive.

Today, with "On-Demand" pricing, I would definitively use DynamoDB for this tool. Storage pricing is similar to S3, query pricing also. Maintaining and querying the data in DynamoDB is also easier.