Originally we implemented a feature that persisted an event stream into DynamoDB so customers could retrieve those events later. This proved effective, serving as a strong use case for key/value storage, yet the drawback was its high cost. Moving to provisioned billing mode reduced cost by ~50%, but that was not going to be sustainable as we scaled to more customers. We also multiplied the cost every time we wanted to allow querying on another field in the logs, since each one needed its own Global Secondary Index.
The architecture is in the diagram below… fairly straightforward :)
We went looking for alternatives, and since here at Zendesk we’re mainly on the AWS stack (with some exceptions) we looked at:
Eventually we decided on a hybrid approach to get the best of both worlds: write the logs from Kafka into MySQL, then periodically upload the data to S3 so it can be removed from MySQL before it overwhelms the DB. The advantages of this solution were:
For a bit more of an in-depth description, here’s the workflow:
1. A periodic uploader job checks the `metadata` table to see what the last row we uploaded to S3 was.
2. It reads the next batch of logs from the `buffer` table and uploads them to an S3 file.
3. It writes a row to the `metadata` table for each S3 file uploaded, storing the S3 path, the number of logs in the file, its last log ID and also the last timestamp stored in that file:

```sql
CREATE TABLE `metadata` (
`id` binary(16) NOT NULL,
`end_time_ms` bigint(20) DEFAULT NULL, # last timestamp this file contains
`file_path` varchar(255) DEFAULT NULL, # path of S3 file
`log_count` int(11) DEFAULT NULL, # count of the logs this S3 file contains
`last_log_id` binary(16) NOT NULL # ULID of last log we processed
);
```
4. Once rows in the `buffer` table have been uploaded (except the last 5 mins of data, to prevent Kafka late deliveries causing ordering issues), the job deletes them from the `buffer` table (a sketch of this loop follows the list).
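Here's that sketch in Python. This is not our production code: the `buffer` column names, bucket name, DB credentials and batch size are assumptions for illustration, using `pymysql` for MySQL and `boto3` for S3.

```python
import time
import uuid

import boto3
import pymysql

BUCKET = "example-log-archive"        # hypothetical bucket name
BATCH_SIZE = 10_000                   # roughly 10,000 logs per file, as mentioned later
SAFETY_WINDOW_MS = 5 * 60 * 1000      # leave the last 5 minutes in MySQL

s3 = boto3.client("s3")
db = pymysql.connect(host="localhost", user="app", password="secret", database="logs")


def upload_next_batch():
    with db.cursor() as cur:
        # 1. Find the last log we already uploaded to S3.
        cur.execute("SELECT last_log_id FROM metadata ORDER BY id DESC LIMIT 1")
        row = cur.fetchone()
        last_id = row[0] if row else b"\x00" * 16

        # 2. Read the next batch from the buffer table, skipping the most recent
        #    5 minutes so late Kafka deliveries can still land in order.
        cutoff_ms = int(time.time() * 1000) - SAFETY_WINDOW_MS
        cur.execute(
            "SELECT id, timestamp_ms, payload FROM buffer "
            "WHERE id > %s AND timestamp_ms < %s ORDER BY id LIMIT %s",
            (last_id, cutoff_ms, BATCH_SIZE),
        )
        logs = cur.fetchall()
        if not logs:
            return

        # 3. Upload the batch as newline-delimited JSON, which S3 Select can query.
        path = f"logs/{uuid.uuid4()}.json"
        s3.put_object(Bucket=BUCKET, Key=path, Body="\n".join(p for _, _, p in logs).encode())

        # 4. Record the new file in the metadata table.
        last_log_id, end_time_ms = logs[-1][0], logs[-1][1]
        cur.execute(
            "INSERT INTO metadata (id, end_time_ms, file_path, log_count, last_log_id) "
            "VALUES (%s, %s, %s, %s, %s)",
            (uuid.uuid4().bytes, end_time_ms, path, len(logs), last_log_id),
        )

        # 5. Only now is it safe to delete the uploaded rows from the buffer table.
        cur.execute("DELETE FROM buffer WHERE id <= %s", (last_log_id,))
    db.commit()
```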
So the data pipeline is all set up, but does it actually give us the data back in a timely fashion when the client makes an API request for it?
The above graph shows the relative performance of Aurora (yellow) vs S3-Select (purple) when querying data. S3-Select can be surprisingly fast (keep in mind we batch only 10,000 items per file), and extremely cheap!
To give a clearer picture of the query path:
1. Find the `metadata` entries in the MySQL DB that cover the requested time range.
2. Work out how many S3 files to scan from the `log_count` column in each `metadata` table row.
3. Query each S3 file with S3 Select, where `timestamp` is just a field in the stored JSON in the file (a sketch of this call follows the list):

   `SELECT * FROM S3Object s WHERE s."timestamp" >= '%s' AND s."timestamp" < '%s' LIMIT 100`

4. Finally, query the `buffer` table for the most recent logs that haven't been uploaded to S3 yet.
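Issuing that query from Python with `boto3` looks roughly like this. It's a sketch rather than our production code; the newline-delimited JSON input format matches how the uploader sketch above writes the files.

```python
import boto3

s3 = boto3.client("s3")


def select_logs(bucket: str, key: str, start_ts: str, end_ts: str, limit: int = 100):
    """Run an S3 Select query over one newline-delimited JSON log file."""
    expression = (
        'SELECT * FROM S3Object s '
        f'WHERE s."timestamp" >= \'{start_ts}\' AND s."timestamp" < \'{end_ts}\' '
        f"LIMIT {limit}"
    )
    resp = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=expression,
        InputSerialization={"JSON": {"Type": "LINES"}},
        OutputSerialization={"JSON": {}},
    )
    # The response is an event stream; "Records" events carry the matching rows.
    for event in resp["Payload"]:
        if "Records" in event:
            yield event["Records"]["Payload"].decode()
```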
Once we had the above all working, we encountered a performance problem when the client wanted to filter results by any field besides the timestamp. For example, if the client wanted logs for a specific user-id, then at worst we'd have to scan ALL the S3 data within the time range to find any relevant logs, which also made it difficult to figure out which queries to do in parallel. So, given that we can't hold up an API response for hours, we attempted to reduce the number of S3 files that we need to scan.
Our first iteration was to just duplicate the data in S3, given that it was so cheap. We constrained which fields the client could filter by and then duplicated both the data and the metadata, e.g. for the user-id field:
- For each `buffer` table batch, write an additional S3 file per user-id containing only that user's logs.
- Write a `metadata` record for each of these files, with a `file_type` column indicating whether it was for a specific field or for all logs.

This quickly became cumbersome, as we would have to create an S3 file not only for each filterable field, but also for every combination of filters, and the number of combinations grows quickly with each field added (see the sketch below).
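A quick back-of-the-envelope illustration of why this blows up, using made-up field names:

```python
from itertools import combinations

# Hypothetical filterable fields; the real set at Zendesk may differ.
filterable_fields = ["user_id", "actor_type", "action"]

# One duplicated set of S3 files per non-empty combination of filter fields,
# on top of the "all logs" files themselves.
combos = [
    c
    for r in range(1, len(filterable_fields) + 1)
    for c in combinations(filterable_fields, r)
]
print(len(combos))  # 7 extra file variants already; roughly doubles with every new field
```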
Enter Bloom Filters. Bloom filters have the nice property that we can ask the bloom filter attached to a `metadata` record the question "is this <user-id> definitely not in this S3 file?". And if we get "yes", we know for sure that we can skip scanning this file. If we get a "no", then due to the nature of bloom filters it may or may not be in the file, so we'll have to scan it.
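As a minimal illustration of that check, here's a toy bloom filter in Python. Our production version uses a proper library and serializes the filter alongside the `metadata` row; the sizes and hash choices here are arbitrary.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: k hash functions setting bits in a fixed-size bit array."""

    def __init__(self, size_bits: int = 8192, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, value: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: str) -> bool:
        """False means "definitely not in the set"; True means "maybe"."""
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))


# While building an S3 file we add every user-id we see...
bf = BloomFilter()
bf.add("user-123")

# ...and at query time we skip the file if the filter says "definitely not here".
if not bf.might_contain("user-999"):
    print("skip this S3 file")
```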
So this was great: we had a way to filter down the S3 files to scan, saving money and time. But we still had the problem of how many files to scan. If the client doesn't specify a filter, we have the simple task of scanning just enough S3 files by totaling the `log_count` column. But if 100 S3 files had some logs relating to the specified user-id, we had to scan them all, just in case each file held only one relevant log, to fulfill a request for 100 results.
Enter Count-Min Sketch. This data structure behaves much like the above bloom filters, but instead of answering the binary question of whether an item exists in it or not, it gives an estimated count of how many times that item occurred (the estimate can overcount, but never undercounts).
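To show the contrast with the bloom filter, here's an equally minimal toy count-min sketch; again the dimensions are arbitrary and this isn't our production implementation.

```python
import hashlib


class CountMinSketch:
    """Toy count-min sketch: a grid of counters, one row per hash function."""

    def __init__(self, width: int = 1024, depth: int = 4):
        self.width = width
        self.depth = depth
        self.counters = [[0] * width for _ in range(depth)]

    def _column(self, row: int, value: str) -> int:
        digest = hashlib.sha256(f"{row}:{value}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, value: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.counters[row][self._column(row, value)] += count

    def estimate(self, value: str) -> int:
        """Estimated occurrences; may overcount due to collisions, never undercounts."""
        return min(self.counters[row][self._column(row, value)] for row in range(self.depth))


cms = CountMinSketch()
for _ in range(42):
    cms.add("user-123")

print(cms.estimate("user-123"))  # ~42: how many logs for this user the file holds
```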
So with these 2 data structures combined, we can now support queries for `user-id` based on the following workflow:

1. Find the `metadata` rows that match the time range specified.
2. For each `metadata` row, read the corresponding bloom filters and count-min sketches that match the filters, e.g. user-id, based on the below schema.

```sql
CREATE TABLE `metadata_bloom_filters` (
`id` binary(16) NOT NULL,
`metadata_id` binary(16) NOT NULL,
`name` varchar(255) DEFAULT NULL, # name of field in log, e.g. "user_id"
`bloom_filter` text, # serialized bloom filter
`count_min_sketch` text # serialized count-min sketch
);
```
We can also now have 1 bloom/count-min sketch structure for each field, and when we want to filter by multiple fields we just have to check existence in their corresponding bloom filters.
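Putting the two structures together, the file-pruning step looks conceptually like this; a sketch reusing the toy classes above, where the row shape is a stand-in for whatever the `metadata` join actually returns.

```python
def files_to_scan(metadata_rows, user_id: str, wanted: int = 100):
    """Pick which S3 files are worth scanning for a given user-id.

    metadata_rows: dicts with "file_path", "bloom" (BloomFilter) and
    "sketch" (CountMinSketch), already filtered to the requested time range.
    """
    selected, expected_hits = [], 0
    for row in metadata_rows:
        # Bloom filter says "definitely not in this file" -> skip it entirely.
        if not row["bloom"].might_contain(user_id):
            continue
        selected.append(row["file_path"])
        # The count-min sketch gives an estimated hit count per file, so we can
        # stop selecting files once we expect enough results for one page.
        expected_hits += row["sketch"].estimate(user_id)
        if expected_hits >= wanted:
            break
    return selected
```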
Now we have a best-effort way to minimize the number of S3 files to scan. However, in the worst case we may get unlucky, hit false positives on every bloom filter, and actually return an empty result back to the client.
But since we use cursor-based pagination for APIs at Zendesk, we provide the caller a `has_more` field in the response to indicate they should request the next page (based on the last ULID in the last S3 file scanned), in case there are potentially more hits in subsequent S3 files. We'll eventually set `has_more=false` when there are no more S3 files to scan in the time range.
And there you have it. I'm personally impressed by the response time of S3-Select, which I originally expected to be in the seconds. But most of the time it's around 200–500ms, with some spikes up into the seconds that we have yet to optimize (possibly by playing with the number of logs per S3 file).
When we first migrated from Dynamo, our costs were <20% of the provisioned Dynamo costs, with less than 10% of that going to the combined usage of S3 storage and S3-Select and the remaining 90+% to Aurora. A huge win against the initial per-customer costs with DynamoDB.
Thanks to for working on this project with me! This was definitely a journey 😅.
Also thanks to and for helping edit this post! 🙇