Problem
Currently we store raw agent stats for 6 months and app stats indefinitely. On dev.coder.com, the agent stats table takes up 6.3 GB of storage, and it's a fairly small deployment. We've managed to optimize our queries fairly well, but the raw data is still large and slow to query. App stats are not as bad since they're only written on an as-needed basis, compressing time ranges into a single row where possible.
To make insights more performant, and to allow us to store them for extended periods of time (years), we need to store them in a format that can be queried efficiently and doesn't use excessive amounts of storage.
Limitations
Template parameter usage is also a part of insights, but is not covered in this proposal.
Requirements
- Generate reports on usage at a certain granularity (e.g. daily, weekly, monthly)
- Store usage data in a database at a granularity that allows for timezone adjustments (e.g. 15m, 30m, 1h)
  - 15m covers most timezones, but has 4x the storage requirements of 1h
- Store app usage by template and user (user is also needed for unique counts across templates)
- Compute total usage by user
- Store latency for user by template
Use cases
- Deployment DAUs
- Template insights
Design
We will roll up the data into a format that is queryable and more efficient to store. The rollup will happen on an hourly or daily schedule. If we require live data, we can utilize the same rollup queries to compute it from the raw data. Once the data is rolled up, we are free to delete data from either `workspace_agent_stats` or `workspace_app_stats`; however, there is no evidence that we need to delete from `workspace_app_stats` at this time.
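As a sketch, the hourly rollup into the `template_insights` table defined below could look something like this; the source column names on `workspace_agent_stats` (`created_at`, `connection_median_latency_ms`) are assumptions for illustration, and the per-connection-type and app minutes are omitted for brevity:

```sql
-- Hourly rollup sketch. The source columns here are illustrative
-- assumptions; ssh/vscode/app minutes are omitted for brevity.
INSERT INTO template_insights
    (start_time, end_time, template_id, user_id, latency_ms, usage_mins)
SELECT
    date_trunc('hour', created_at)                     AS start_time,
    date_trunc('hour', created_at) + interval '1 hour' AS end_time,
    template_id,
    user_id,
    avg(connection_median_latency_ms)::smallint        AS latency_ms,
    -- Count distinct active minutes so concurrent connections aren't
    -- double counted (assumes roughly minute-level raw rows).
    count(DISTINCT date_trunc('minute', created_at))::smallint AS usage_mins
FROM workspace_agent_stats
WHERE created_at >= date_trunc('hour', now() - interval '1 hour')
  AND created_at <  date_trunc('hour', now())
GROUP BY 1, 2, 3, 4;
```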
We will store the data in a table called `template_insights` (a DDL sketch follows the column list). This table will have the following columns:
- id (bigserial)
- start_time (timestamptz)
- end_time (timestamptz)
- template_id (uuid)
- user_id (uuid)
- latency_ms (smallint)
- usage_mins (smallint)
- ssh_mins (smallint)
- vscode_mins (smallint)
- jetbrains_mins (smallint)
- webterminal_mins (smallint)
- app_usage_mins (jsonb)
  - object {name1: int, name2: int, ...}
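Put together as DDL, a minimal sketch (the NOT NULL constraints and defaults are illustrative, not part of the proposal):

```sql
-- DDL sketch for the proposed rollup table; types follow the column
-- list above, constraints are illustrative.
CREATE TABLE template_insights (
    id               bigserial PRIMARY KEY,
    start_time       timestamptz NOT NULL,
    end_time         timestamptz NOT NULL,
    template_id      uuid NOT NULL,
    user_id          uuid NOT NULL,
    latency_ms       smallint,
    usage_mins       smallint NOT NULL,
    ssh_mins         smallint NOT NULL DEFAULT 0,
    vscode_mins      smallint NOT NULL DEFAULT 0,
    jetbrains_mins   smallint NOT NULL DEFAULT 0,
    webterminal_mins smallint NOT NULL DEFAULT 0,
    app_usage_mins   jsonb
);
```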
Note that we're storing the minutes so that we can properly account for usage that is smaller than the configured rollup range (be it 15m or 1h). The `usage_mins` field stores the accurate total usage within the timeframe (say you use ssh for 10 mins, then disconnect and start using code-server for 5 mins; `usage_mins` will reflect this and show 15 mins).
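To make the semantics concrete, that example could roll up into a row like this (all IDs, timestamps, and the latency value are made up):

```sql
-- Illustrative row for the example above (1h bucket, 10 min of ssh
-- followed by 5 min of code-server): usage_mins reflects total
-- wall-clock usage, the per-type columns break it down.
INSERT INTO template_insights
    (start_time, end_time, template_id, user_id,
     latency_ms, usage_mins, ssh_mins, app_usage_mins)
VALUES
    ('2023-07-01 10:00+00', '2023-07-01 11:00+00',
     '00000000-0000-0000-0000-000000000001',  -- example template_id
     '00000000-0000-0000-0000-000000000002',  -- example user_id
     45, 15, 10, '{"code-server": 5}');
```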
With this format, if we assume that a user uses 1 template at a time, and that we have 1000 users who each use coder 8 hours/day, every day of the year, we would have 1000 * 8 * 365 = 2,920,000 rows at 1h granularity. If we store this at 30m granularity, we would have 2,920,000 * 2 = 5,840,000 rows. This is a manageable amount of data to store and query.
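This granularity is also what enables timezone-adjusted reporting and deployment DAUs from the use cases above. A sketch against the proposed table, with the timezone as an example parameter:

```sql
-- Daily active users and total usage per day, adjusted to a viewer's
-- timezone; 'America/Los_Angeles' is an example parameter.
SELECT
    date_trunc('day', start_time AT TIME ZONE 'America/Los_Angeles') AS day,
    count(DISTINCT user_id) AS dau,
    sum(usage_mins)         AS total_usage_mins
FROM template_insights
GROUP BY 1
ORDER BY 1;
```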
Data type storage sizes are:
- bigserial: 8 bytes
- timestamptz: 8 bytes
- uuid: 16 bytes
- smallint: 2 bytes
- jsonb: ~35 bytes/app (depending on length of name)
```sql
select pg_column_size('{"code-server": 60}'::jsonb); -- => 36
```
There are ways to optimize `app_usage_mins` storage requirements further, but this is a good starting point. Furthermore, if we don't need to query the JSON, or don't mind the compute overhead of live conversion, we can simply store it as `json`, greatly reducing the storage requirement.

```sql
select pg_column_size('{"code-server":60}'::json); -- => 22
```
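For reference, per-app totals can still be aggregated out of the `jsonb` column; a sketch assuming the `{name: minutes}` shape described above (plain `json` would need `json_each_text` instead):

```sql
-- Fold the per-row app usage objects into per-app totals.
SELECT app.key             AS app_name,
       sum(app.value::int) AS total_mins
FROM template_insights,
     jsonb_each_text(app_usage_mins) AS app
GROUP BY app.key
ORDER BY total_mins DESC;
```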
With this design, storage size will approximate: 8 + 8 + 8 + 16 + 16 + 6 * 2 + 36 = 104 bytes/row (one bigserial, two timestamptz, two uuid, six smallint, and the jsonb example above).
If we store 5,840,000 rows, we would need 5,840,000 * 104 = ~0.6 GB of storage. This seems acceptable for one year of data for 1000 users. If a customer feels this is too much, we can make the retention configurable.
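If retention does become configurable, the purge itself is simple; a sketch where the 6-month interval mirrors today's behavior and would be the configurable knob:

```sql
-- Purge raw agent stats that have already been rolled up and have
-- aged out of the retention window (6 months is an example value).
DELETE FROM workspace_agent_stats
WHERE created_at < now() - interval '6 months';
```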
Note: These calculations don't take into account the storage space used by whatever indexes we end up creating.
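For a rough sense of what those indexes might be (purely illustrative; the real set depends on the queries we ship):

```sql
-- Example indexes to support time-bucketed and per-template queries.
CREATE INDEX template_insights_start_time_idx
    ON template_insights (start_time);
CREATE INDEX template_insights_template_id_start_time_idx
    ON template_insights (template_id, start_time);
```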