Zendesk Engineering

Engineering @ Zendesk


Originally we implemented a feature that persisted an event stream into DynamoDB so customers could retrieve the events later. This proved effective and was a strong use case for key/value storage, but the drawback was its high cost. Moving to provisioned billing mode reduced cost by ~50%, yet that was not going to be sustainable as we scaled to more customers. We also multiplied the cost each time we wanted to allow querying on another field in the logs via Global Secondary Indexes.

The architecture is in the diagram below… fairly straightforward :)

Data flow with DynamoDB

Alternatives

We went looking for alternatives, and since here at Zendesk we’re mainly on the AWS stack (with some exceptions), we considered:

  • S3 — Very cheap but very limited in query-ability unless adding S3-Select, and also not easy to do time-windowed streaming from Kafka.
  • Hudi + S3 — We use Hudi in our existing Datalake, but it would have delayed data (24 hours) and would add complexity to manage & debug.
  • ElasticSearch — We self-host ES on EC2 instances, and while it has great query-ability, it would be just as expensive as DynamoDB at our data size.
  • MySQL — Great query-ability, but it wouldn’t scale to billions of rows without data-sharding.

Approach

Eventually we decided on a hybrid approach to get the best of both worlds: write the logs from Kafka into MySQL, then periodically upload that data to S3 so it can be removed from MySQL before it overwhelms the DB. The advantages of this solution were:

  • Cheaper than DynamoDB (S3 is dirt cheap and Aurora MySQL costs don’t scale as badly as DynamoDB with data growth)
  • Easy to query the data (really easy in MySQL, but S3-Select is good too)
  • Reasonably performant to serve up API requests for data

For a bit more of an in-depth description, here’s the workflow:

  • We pick up the logs from Kafka in JSON format.
  • These are written to the buffer table in MySQL.
  • Every hour a background job kicks off that reads from the metadata table to see what the last row we uploaded to S3 was.
  • This job reads batches of 10,000 logs from the buffer table and uploads them to an S3 file.
  • The job also writes an entry back to the metadata table for each S3 file uploaded, storing the S3 path, the number of logs in the file, its last log ID and the last timestamp stored in that file.
CREATE TABLE `metadata` (
  `id` binary(16) NOT NULL,
  `end_time_ms` bigint(20) DEFAULT NULL, # last timestamp this file contains
  `file_path` varchar(255) DEFAULT NULL, # path of S3 file
  `log_count` int(11) DEFAULT NULL, # count of the logs this S3 file contains
  `last_log_id` binary(16) NOT NULL, # ULID of last log we processed
  PRIMARY KEY (`id`)
);
  • This continues until all rows in the buffer table have been uploaded (except the last 5 minutes of data, to prevent late Kafka deliveries causing ordering issues); see the sketch of this job after the list.
  • Another background job also kicks off every hour and deletes data in the DB that’s older than 4 hours, to control the size of the buffer table.
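
To make the upload job more concrete, here is a minimal Python sketch of that hourly run. The metadata table and its columns come from the schema above; the buffer table’s columns (timestamp_ms, payload), the bucket name, and the connection details are assumptions, and ULID generation is glossed over with plain UUIDs for brevity.

import uuid

import boto3
import pymysql

BUCKET = "audit-log-archive"       # hypothetical bucket name
BATCH_SIZE = 10_000
SAFETY_WINDOW_MS = 5 * 60 * 1000   # leave the last 5 minutes in the buffer


def run_hourly_upload(conn, s3):
    with conn.cursor() as cur:
        # Where did the previous run stop?
        cur.execute("SELECT last_log_id FROM metadata ORDER BY end_time_ms DESC LIMIT 1")
        row = cur.fetchone()
        last_log_id = row[0] if row else b"\x00" * 16

        while True:
            # Next batch of logs after the last uploaded one, skipping the most
            # recent 5 minutes to tolerate late deliveries from Kafka.
            cur.execute(
                """SELECT id, timestamp_ms, payload FROM buffer
                   WHERE id > %s
                     AND timestamp_ms < UNIX_TIMESTAMP(NOW(3)) * 1000 - %s
                   ORDER BY id
                   LIMIT %s""",
                (last_log_id, SAFETY_WINDOW_MS, BATCH_SIZE),
            )
            batch = cur.fetchall()
            if not batch:
                break

            # Write the batch to S3 as JSON lines, one log per line.
            file_path = f"logs/{uuid.uuid4()}.json"
            body = "\n".join(payload for (_id, _ts, payload) in batch)
            s3.put_object(Bucket=BUCKET, Key=file_path, Body=body.encode("utf-8"))

            # Record the upload so the next run (and the query path) can find it.
            last_log_id, end_time_ms = batch[-1][0], batch[-1][1]
            cur.execute(
                """INSERT INTO metadata (id, end_time_ms, file_path, log_count, last_log_id)
                   VALUES (%s, %s, %s, %s, %s)""",
                (uuid.uuid4().bytes, end_time_ms, file_path, len(batch), last_log_id),
            )
            conn.commit()


if __name__ == "__main__":
    conn = pymysql.connect(host="localhost", user="app", password="secret", database="logs")
    run_hourly_upload(conn, boto3.client("s3"))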

Querying

So the data pipeline is all set up, but does it actually give us the data back in a timely fashion when the client makes an API request for it?

The above graph shows the relative performance of Aurora (yellow) vs S3-Select (purple) when querying data. S3-Select can be surprisingly fast (keep in mind we batched by only 10,000 items per file), and extremely cheap!

To give a clearer picture of the query path:

  • Based on the time-window specified, we find the relevant S3 files from our metadata entries in the MySQL DB.
  • If no other filters are specified (e.g. user-id, url, …) then we’ll just select enough S3 files to satisfy the number of results we want, using the log_count column in the metadata table rows.
  • After determining which S3 files we need to query, we create an S3-Select query like the one below, where timestamp is just a field in the JSON stored in the file:
SELECT * FROM S3Object s WHERE s."timestamp" >= '%s' AND s."timestamp" < '%s' LIMIT 100
  • Those S3-Select queries are done in parallel (see the sketch after this list).
  • If we couldn’t get enough results from S3, we then attempt to read more logs from our MySQL buffer table.
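
As a rough illustration of a single one of those S3-Select calls, here is a Python sketch using boto3’s select_object_content. The bucket name is a placeholder, and the files are assumed to be stored as JSON lines (one log per line), matching the upload sketch above.

import json

import boto3

s3 = boto3.client("s3")


def query_s3_file(bucket, key, start_ts, end_ts, limit=100):
    # The same timestamp-windowed query as above, run against one uploaded file.
    expression = (
        "SELECT * FROM S3Object s "
        f"WHERE s.\"timestamp\" >= '{start_ts}' AND s.\"timestamp\" < '{end_ts}' "
        f"LIMIT {limit}"
    )
    response = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=expression,
        InputSerialization={"JSON": {"Type": "LINES"}},
        OutputSerialization={"JSON": {}},
    )

    # S3-Select streams results back as an event stream; stitch the record
    # payloads together before splitting them into individual JSON logs.
    payload = ""
    for event in response["Payload"]:
        if "Records" in event:
            payload += event["Records"]["Payload"].decode("utf-8")
    return [json.loads(line) for line in payload.splitlines() if line]

In the real service these calls are fanned out concurrently over the files chosen from the metadata table, and the results are merged before building the response.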

Optimizing

Once we had all of the above working, we hit a performance problem when the client wanted to filter results by any field besides the timestamp. For example, if the client wanted logs for a specific user-id, then in the worst case we’d have to scan ALL the S3 data within the time-range to find any relevant logs, which made it difficult to figure out which queries to do in parallel.

Having to potentially search all S3 files is painful

So given that we can’t hold up an API response for hours, we attempted to reduce the number of S3 files that we need to scan.

Our first iteration was to just duplicate the data in S3 given that it was so cheap. So we constrained which fields the client could filter by and then duplicated data and metadata, e.g. for the user-id field:

  • Create a normal S3 file that has all 10,000 logs from the buffer table batch.
  • Then split those same logs into their own S3 files based on user-id, i.e. one file for user-id 1, another for user-id 2, …
  • Create a metadata record for each of these files, with a file_type column indicating whether it was for a specific field or for all logs.

This quickly became cumbersome, as we would have to create an S3 file not only for each filter-able field, but also for every combination of filters, e.g.:

  • 1 file for user-id specific logs
  • 1 file for url specific logs
  • 1 file for combination of user-id and url logs

Enter Bloom Filters. Bloom filters have the nice property that we can ask the bloom filter attached to a metadata record the question “is this <user-id> definitely not in this S3 file?”. If the answer is “yes”, we know for sure that we can skip scanning this file. If the answer is “no”, then due to the nature of bloom filters it may or may not be in the file, so we’ll have to scan it.
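
To make that property concrete, here is a tiny hand-rolled bloom filter (the production code would use a tuned library; the size and hash count here are arbitrary):

import hashlib

class BloomFilter:
    """Tiny illustrative bloom filter: false positives possible, false negatives not."""

    def __init__(self, size_bits=8192, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# One filter per S3 file: add every user-id seen in the file while uploading it.
bf = BloomFilter()
bf.add("user-42")
bf.might_contain("user-42")   # True (it really is in the file)
bf.might_contain("user-999")  # Almost certainly False -> safe to skip scanning this file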

So this was great: we had a way to filter down the S3 files to scan, saving money and time. But we still had the problem of how many files to scan. If the client doesn’t specify a filter, we have the simple task of just selecting enough S3 files by totaling the log_count column. But if 100 S3 files had some logs relating to the specified user-id, we had to scan them all, just in case each file only had one relevant log, to fulfil a request for 100 results.

Enter Count-Min Sketch. This data structure behaves much like the above bloom filters, but instead of answering a binary question of whether an item exists in it or not, it gives an estimated count of how many times that item occurred.
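
And a similarly tiny count-min sketch, which can over-estimate a count but never under-estimate it (again, the width and depth here are arbitrary, and the production code would use a library):

import hashlib

class CountMinSketch:
    """Tiny illustrative count-min sketch: estimates are >= the true count."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row: int, item: str) -> int:
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item: str) -> int:
        return min(self.table[row][self._index(row, item)] for row in range(self.depth))

# One sketch per S3 file: roughly how many logs for this user does the file hold?
cms = CountMinSketch()
for _ in range(37):
    cms.add("user-42")
cms.estimate("user-42")  # ~37 (possibly slightly higher, never lower)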

So with these 2 data structures combined, we can now support queries for user-id based on the following workflow:

  • Read the relevant metadata rows that match the time range specified.
  • For each metadata row, read the corresponding bloom filters and count-min sketches that match the filters, e.g. user-id, based on the below schema.
CREATE TABLE `metadata_bloom_filters` (
  `id` binary(16) NOT NULL,
  `metadata_id` binary(16) NOT NULL,
  `name` varchar(255) DEFAULT NULL, # name of field in log, e.g. "user_id"
  `bloom_filter` text, # serialized bloom filter
  `count_min_sketch` text, # serialized count-min sketch
  PRIMARY KEY (`id`)
);
  • If the bloom filter says that the user-id might be in the S3 file, we add that file to the list of files to scan.
  • We also maintain an estimated count of matching logs from the count-min sketch, to limit the number of files we have to scan to fill up the response page size.

We can also now keep one bloom filter/count-min sketch pair for each field, and when we want to filter by multiple fields we just have to check existence in each field’s corresponding bloom filter, as in the sketch below.
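
Putting the pieces together, a rough sketch of the file-selection step might look like the following. The FileSketches shape, select_files_to_scan name, and the assumption that the serialized structures have already been deserialized into the toy classes above are all hypothetical; the real service loads them from the metadata_bloom_filters rows.

from dataclasses import dataclass


@dataclass
class FileSketches:
    file_path: str
    log_count: int
    blooms: dict    # field name -> BloomFilter (toy class from above)
    sketches: dict  # field name -> CountMinSketch (toy class from above)


def select_files_to_scan(metadata_rows, filters, page_size=100):
    """Pick just enough S3 files that *might* contain logs matching every filter.

    metadata_rows: FileSketches for each metadata row in the requested time range.
    filters: e.g. {"user_id": "user-42", "url": "/api/v2/tickets"}; may be empty.
    """
    files, estimated_hits = [], 0
    for row in metadata_rows:
        if not filters:
            # No filters: just take files until their log_count covers the page.
            files.append(row.file_path)
            estimated_hits += row.log_count
        elif all(row.blooms[f].might_contain(v) for f, v in filters.items()):
            # Every filtered field *might* be in this file, so it has to be scanned;
            # a single "definitely not" would let us skip the file entirely.
            files.append(row.file_path)
            # Use the most constraining field's estimate when counting expected hits.
            estimated_hits += min(row.sketches[f].estimate(v) for f, v in filters.items())
        else:
            continue
        if estimated_hits >= page_size:
            break  # probably enough matches to fill the response page
    return files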

Now we have a best-effort way to minimize the number of S3 files to scan. However, in the worst case we may get unlucky, hit false positives on all the bloom filters, and return an empty result back to the client.

But since we use Cursor-Based-Pagination for APIs at Zendesk, we provide the caller a has_more field in the response to indicate they should request the next page (based on the last ULID in the last S3 file scanned) in case there are potentially more hits in subsequent S3 files. We’ll eventually set has_more=false when there are no more S3 files to scan in the time-range.
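
For illustration only, the pagination contract might look roughly like this (field names other than has_more are assumptions):

def build_page(results, last_scanned_log_id, files_remaining_in_range):
    # Cursor-based page: the caller keeps requesting the next page while
    # has_more is true, even if a page comes back empty because every
    # bloom-filter hit turned out to be a false positive.
    return {
        "logs": results,
        "cursor": last_scanned_log_id,          # ULID of the last log in the last S3 file scanned
        "has_more": files_remaining_in_range,   # false once no S3 files remain in the time-range
    }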

And there you have it. I’m personally impressed by the response time of S3-Select, which I originally expected to be in the seconds. Most of the time it’s around 200–500ms, with some spikes up into the seconds that we have yet to optimize (possibly by playing with the number of logs per S3 file).

When we first migrated from Dynamo, our costs were <20% of the provisioned Dynamo costs, with less than 10% of that going to the combined usage of S3 storage and S3-Select and the remaining 90+% to Aurora. A huge win against the initial per-customer costs with DynamoDB.

Thanks to … for working on this project with me! This was definitely a journey 😅.

Also thanks to … and … for helping edit this post! 🙇
