Rework health event loop #19612
base: master
const char *database_aclk_config[] = {
    "CREATE TABLE IF NOT EXISTS alert_queue "
    " (host_id BLOB, health_log_id INT, unique_id INT, alarm_id INT, status INT, date_scheduled INT, "
If `status` and `date_scheduled` cannot be null, please declare them explicitly as `NOT NULL`.
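As a rough illustration of this suggestion, the column list could carry the constraint directly. This is only a sketch: the names `alert_queue_sketch` and `column_is_not_null` are hypothetical, the statement is cut off after the columns visible in the diff (the real one continues), and whether these columns can in fact be null is exactly the open question of the comment.

```c
#include <string.h>

/* Hypothetical sketch: only the columns visible in the diff are listed;
 * the real statement continues with further columns. */
static const char *alert_queue_sketch =
    "CREATE TABLE IF NOT EXISTS alert_queue "
    "(host_id BLOB, health_log_id INT, unique_id INT, alarm_id INT, "
    "status INT NOT NULL, date_scheduled INT NOT NULL)";

/* Returns 1 if `column` is declared as "<column> INT NOT NULL" in `sql`,
 * 0 otherwise. A plain substring check, not an SQL parser. */
static int column_is_not_null(const char *sql, const char *column)
{
    const char *p = strstr(sql, column);
    if (!p)
        return 0;
    p += strlen(column);
    return strncmp(p, " INT NOT NULL", 13) == 0;
}
```

With the constraint in place, SQLite rejects an `INSERT` that leaves either column null instead of silently storing an incomplete row.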
    "red text, warn text, crit text, exec text, to_key text, info text, delay text, options text, "
    "repeat text, host_labels text, p_db_lookup_dimensions text, p_db_lookup_method text, p_db_lookup_options int, "
    "p_db_lookup_after int, p_db_lookup_before int, p_update_every int, source text, chart_labels text, "
    "summary text, time_group_condition INT, time_group_value DOUBLE, dims_group INT, data_source INT)",
What I wrote above about the null fields also applies to each table here. Let us force the engine to do a double check, so we can be sure we will not have data loss.
thiagoftsm, Feb 27, 2025 (edited)
If `every` is our `update_every` or another frequency, it should ideally be an integer.
static void timer_cb(uv_timer_t *handle)
{
    uv_stop(handle->loop);
    uv_update_time(handle->loop);
I am leaving this comment here so we do not forget: if the next commits do not use the commented-out code, please remove it.
I updated to the current master, to have better crash detection.
- Cache info/summary ids
- Fix table
- Deduplicate info/summary from alert transitions
- Add job_running flag and cleanup handling to health event loop
- Queue store sql statements in metadata thread
- Remove unnecessary health check condition from event loop
- Add new job names for alert host snapshot and process events
- Prepare ACLK OP to handle alert processing per host
- Remove transactions for now
- Fix compile error
- Revert serialization test
- Add opcodes for Health Pause/Resume
- Serialize alert transition processing / wrap in transaction (test)
- Set fixed timer intervals for health event loop
- Accurate scheduling health run after maintenance cannot be done yet
- Remove unused thread-local variable for health thread and simplify SQL statement preparation
- Refactor job submission logic to improve worker handling
- Simplify job execution handling
- Maintenance task should try to keep the health run schedule
- Implement worker data pool to avoid memory allocations
- Fix delay calculation in health_host_register function
- Fix wrong memory allocations
- Refactor job management to use dynamic worker data allocation
- Cleanup workers
- Refactor health_host_initialize to remove delay parameter
- Remove VACUUM command from database health cleanup
- If cleanup is running return
- Refactor health job scheduling
- Prevent scheduling job if we reached the limit
- Configure init, maint and cleanup health thread count based on the configured health threads
- Schedule a host health init when we one is completed
- Fix maintenance rescheduling
- Add worker jobs
- Code cleanup
- Add health thread config
- Run maintenance
- Rebase and code cleanup
- Transient alert entries are queued for deletion at the end of the evaluation loop (making sure they are processed and saved)
- Host unregister with cleanup
- New job for host rrdcalc cleanup during unregister
- stream receiver will schedule a host unregister and rrdcalc cleanup now
- Configure max threads per job (needs improved config)
- Reschedule alert evaluation if rrdcalc cleanup is pending
- Collect all alert transitions and save in a batch
- Remove alert transition store from metadata event loop
- Add host heath maintenance job (not used yet)
- cleanup_health_log removed from metadata thread
- Migrate health tables and aclk tables to new databases. Drop old tables.
- Use attach_database
- Store alert transitions to the new databases (includes queues and version book keeping for the cloud)
- Add sql_init_databases to configure all databases during startup
- Do not create aclk and health tables in netdata-meta.db
- New netdata-health database for alert transitions
- New netdata-aclk database to store transitions that need to go to the cloud
- Fix compilation with internal checks
- Remove static thread
- Init and shutdown health
- Register/Unregister hosts for health
- Handle resume from suspension / health delay of child reconnect
- Remove static health thread
- New health event loop
Summary
- Use worker threads to evaluate health for hosts in parallel
- `[health].health threads`: default number of CPUs * 2
- Migrate the `health_log`, `health_log_detail` and `alert_hash` tables to `netdata-health.db` and delete them from `netdata-meta.db`
- Migrate the `aclk_queue`, `alert_queue` and `alert_version` tables to `netdata-aclk.db` and delete them from `netdata-meta.db`
- `metasync` to `health` event loop
- `health` event loop

[TBA]
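The default for `[health].health threads` stated in the summary (number of CPUs * 2) could be derived roughly as follows. This is a minimal sketch, not the code from this PR: the function names are hypothetical, the fallback to one CPU when detection fails is an assumption, and `_SC_NPROCESSORS_ONLN` is a widely available `sysconf` extension (Linux, macOS, the BSDs), not a guaranteed POSIX name.

```c
#include <unistd.h>

/* Hypothetical helper: default worker count as described in the summary
 * (number of CPUs * 2). */
static long default_health_threads(long cpus)
{
    if (cpus < 1)
        cpus = 1; /* assumption: treat a failed or zero CPU count as 1 */
    return cpus * 2;
}

/* Hypothetical helper: apply the rule to the CPUs currently online.
 * sysconf() returns -1 on failure, which the helper above maps to 1 CPU. */
static long detect_health_threads(void)
{
    return default_health_threads(sysconf(_SC_NPROCESSORS_ONLN));
}
```

Keeping the arithmetic separate from the detection call makes the rule easy to check without depending on the machine it runs on.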
Test Plan