In this article, we explain how we improved our reporting performance by moving most of our business logic into our data warehouse and removing all post-processing of our data.
At Teads, we provide marketers with an advertising platform to deliver their campaigns. To provide customers with insights about how their campaigns are performing, we need to join several types of data:
Our previous reporting engine was split into two parts:
If you want more information on our ingestion pipeline, you can read our initial article about it.
To generate a report, the reporting engine first loads compressed event data from BigQuery through the BigQuery Storage API. Once the fact events are in memory, it lists the dimension ids present in the events.
Because the dimension data lives in another database, we cannot join on it directly. To work around this limitation, the reporting engine builds a cache with the needed data, selected from the list of ids, and then joins on it.
After the join, it drops the raw data to keep only the requested columns and aggregates the rows to produce as few rows as possible.
It then exports the result in the requested format to S3.
Steps of the data for a report on advertiser_name, creative_type, impression
Compressed events data

| ad_id | creative_id | impression |
|-------|-------------|------------|
| 1     | 2           | 4          |
| 2     | 3           | 10         |
| 3     | 3           | 16         |
| 4     | 2           | 30         |
Compressed events data with joined dimensions

| ad_id | advertiser_name | creative_id | creative_type | impression |
|-------|-----------------|-------------|---------------|------------|
| 1     | advertiser_b    | 2           | video         | 4          |
| 2     | advertiser_a    | 3           | image         | 10         |
| 3     | advertiser_a    | 3           | image         | 16         |
| 4     | advertiser_b    | 2           | video         | 30         |
Union data ready for the report output

| advertiser_name | creative_type | impression |
|-----------------|---------------|------------|
| advertiser_b    | video         | 34         |
| advertiser_a    | image         | 26         |

This solution helped us bootstrap reports for everyone inside and outside the company, and provide reports that link our events and dimensions data.
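The join and aggregation steps above can be sketched in a few lines of Python, with in-memory dictionaries standing in for the dimension cache (the lookup tables and values below are illustrative, reusing the example numbers):

```python
from collections import defaultdict

# Compressed fact events loaded from BigQuery (illustrative values).
events = [
    {"ad_id": 1, "creative_id": 2, "impression": 4},
    {"ad_id": 2, "creative_id": 3, "impression": 10},
    {"ad_id": 3, "creative_id": 3, "impression": 16},
    {"ad_id": 4, "creative_id": 2, "impression": 30},
]

# Dimension cache built from the ids found in the events
# (selected from the dimensions database, then kept in memory).
advertiser_by_ad = {1: "advertiser_b", 2: "advertiser_a",
                    3: "advertiser_a", 4: "advertiser_b"}
type_by_creative = {2: "video", 3: "image"}

# Join on the cache, keep only the requested columns, then aggregate.
report = defaultdict(int)
for e in events:
    key = (advertiser_by_ad[e["ad_id"]], type_by_creative[e["creative_id"]])
    report[key] += e["impression"]

for (advertiser, creative_type), impressions in sorted(report.items()):
    print(advertiser, creative_type, impressions)
# advertiser_a image 26
# advertiser_b video 34
```

This is exactly the work that, as described later, we pushed down into BigQuery so it no longer runs in the engine's memory.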
However, we ended up with three major limitations:
For this new service, we wanted to simplify reporting as much as possible by reducing the number of steps between the request from the users and the result.
As explained in the previous section, most of our report columns were previously computed directly in the reporting engine.
To reduce part of our network impact, we scheduled a clone of our dimension data into BigQuery, so we no longer need to move dimension data at run time.
This projection introduces a new storage cost in our data warehouse, but it is offset by the removal of the infrastructure needed to join this data.
A second part of the simplification takes place in the ingestion pipeline, where we also provide some of the dimension data to avoid complex joins in the queries run by the reporting engine. To avoid shifting the network impact onto the ingestion pipeline, we chose to hydrate events only with low-cardinality dimensions such as enums.
For example, for the creative_type dimension we can now group directly in BigQuery without a join, instead of pulling all creatives and looking up their type in the dimension data.
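A minimal sketch of this hydration step at ingestion time, assuming a small enum-like lookup shipped with the pipeline (the mapping and field names are illustrative, matching the example tables below):

```python
# Low-cardinality lookup bundled with the ingestion pipeline.
# Hydrating only enum-like dimensions keeps the added payload tiny.
CREATIVE_TYPE_BY_ID = {1: "image", 2: "image", 3: "video", 4: "video"}

def hydrate(event: dict) -> dict:
    """Enrich a raw event with the creative_type dimension at ingestion time,
    so reports can group on it in BigQuery without a join."""
    return {**event, "creative_type": CREATIVE_TYPE_BY_ID[event["creative_id"]]}

events = [
    {"action": "impression", "creative_id": 1, "ad_id": 1},
    {"action": "progress-1", "creative_id": 3, "ad_id": 3},
]
hydrated = [hydrate(e) for e in events]
```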
Events before hydration

| action     | creative_id | ad_id | .... |
|------------|-------------|-------|------|
| impression | 1           | 1     | .... |
| click      | 1           | 1     | .... |
| impression | 2           | 2     | .... |
| click      | 2           | 2     | .... |
| impression | 3           | 3     | .... |
| progress-1 | 3           | 3     | .... |
| progress-2 | 3           | 3     | .... |
| impression | 4           | 4     | .... |
| progress-1 | 4           | 4     | .... |
| progress-2 | 4           | 4     | .... |
| progress-3 | 4           | 4     | .... |
Events after hydration

| action     | creative_id | creative_type | ad_id | .... |
|------------|-------------|---------------|-------|------|
| impression | 1           | image         | 1     | .... |
| click      | 1           | image         | 1     | .... |
| impression | 2           | image         | 2     | .... |
| click      | 2           | image         | 2     | .... |
| impression | 3           | video         | 3     | .... |
| progress-1 | 3           | video         | 3     | .... |
| progress-2 | 3           | video         | 3     | .... |
| impression | 4           | video         | 4     | .... |
| progress-1 | 4           | video         | 4     | .... |
| progress-2 | 4           | video         | 4     | .... |
| progress-3 | 4           | video         | 4     | .... |

To tackle our network impact limitation, we had to limit the data we move from the data warehouse to our reporting engine instance.
To do that, we moved the computation from our reporting engine to our data warehouse by converting our report requests into SQL queries.
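The shape of such a conversion can be sketched as follows; this is a deliberately simplified illustration, not our actual generator, and the request shape, table, and column names are hypothetical:

```python
def build_report_query(table: str, dimensions: list[str], metrics: list[str]) -> str:
    """Convert a report request into a BigQuery SQL query, so the join and
    reduce steps run inside the warehouse instead of in the engine."""
    select_cols = dimensions + [f"SUM({m}) AS {m}" for m in metrics]
    return (
        f"SELECT {', '.join(select_cols)} "
        f"FROM {table} "
        f"GROUP BY {', '.join(dimensions)}"
    )

sql = build_report_query("events", ["advertiser_name", "creative_type"], ["impression"])
# SELECT advertiser_name, creative_type, SUM(impression) AS impression
#   FROM events GROUP BY advertiser_name, creative_type
```

Because the GROUP BY happens in BigQuery, only the already-reduced rows ever leave the warehouse.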
The query generator is composed of two parts:
One of the biggest limitations we had with Spark was the need to load all the data in memory. This led to scaling issues: more data meant we needed more memory on our Spark instances.
Now that the join and reduce steps are centralized in the query sent to BigQuery, the data loaded by the reporting engine is much smaller than before.
To make sure the size of a report is never a limitation, we decided to stream all data coming from BigQuery directly to S3.
Thanks to the BigQuery Storage Read API, we can output the result of our query to a temporary BigQuery table and read it back as a stream.
Between the read and write operations, records are serialized from the Avro format provided by BigQuery into the output format selected by the user.
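This read-serialize-write flow can be sketched as a generator that converts records one at a time, never holding the full report in memory. In this pure-Python illustration, plain dicts stand in for decoded Avro records, CSV stands in for the user-selected output format, and the S3 upload is left out:

```python
import csv
import io

def serialize_rows(records, fieldnames):
    """Lazily convert a stream of records (e.g. decoded from Avro) into CSV
    lines that can be written to S3 as they are produced."""
    header = io.StringIO()
    csv.writer(header).writerow(fieldnames)
    yield header.getvalue()
    for record in records:
        buf = io.StringIO()
        csv.writer(buf).writerow(record[f] for f in fieldnames)
        yield buf.getvalue()

rows = [{"advertiser_name": "advertiser_b", "creative_type": "video", "impression": 34}]
lines = list(serialize_rows(rows, ["advertiser_name", "creative_type", "impression"]))
```

Because the generator yields one line per record, memory usage stays constant regardless of report size.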
By focusing on simplicity and removing complexity, we improved both our ability to add new dimensions and metrics and the engine's performance.
On average, report generation time is reduced by a factor of 17 with the new reporting engine. This improvement is mostly due to reduced data transfer, achieved through better filtering and by computing data in only one place.
By reducing the amount of data queried per report, we absorb the increase in query complexity without increasing BigQuery costs. We also reduced the cost of the reporting engine itself by a factor of 10, by removing the EMR cluster in favor of a small service that forwards queries and streams results.
Compared to our legacy reporting engine, gains in cost and speed are significant but there is still room for improvement.
To continue on the path of simplicity, we want to handle the compartmentalization of the data directly in BigQuery, to make sure the reporting engine only has access to the data relevant to its execution context.
See this follow-up article:
Our events are currently stored in generic aggregated tables. To further reduce the amount of queried data we could create smaller data sources per domain (data marts).