Rework health event loop #19612

Draft
stelfrag wants to merge 4 commits into netdata:master from stelfrag:health_evloop

Conversation

@stelfrag (Collaborator) commented Feb 10, 2025 (edited)

Summary

Use worker threads to evaluate health for hosts in parallel

  • Number of threads to use for host health evaluation: [health].health threads (default: number of CPUs * 2)
  • Number of threads to use for host health initialization (default 25% of health threads)
  • Number of threads to use for host health maintenance (cleanup host alert transitions) (default 25% of health threads)
  • Number of threads to use for host health cleanup on child disconnection (default 25% of health threads)
  • Migrate health_log, health_log_detail and alert_hash tables to netdata-health.db and delete
    from netdata-meta.db
  • Migrate aclk_queue, alert_queue and alert_version tables to netdata-aclk.db and delete
    from netdata-meta.db
  • Switch alert transition store from metasync to health event loop
  • Alert config cleanup during child disconnection offloaded to health event loop
  • Schedule periodic host alert transition cleanup job
  • Add workers for extended statistics
  • Deduplicate info/summary fields in transitions
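
The thread-count defaults listed above can be sketched roughly as follows. This is only an illustration of the arithmetic, not the code in this PR: the struct, the function names, the "0 means not configured" convention, and the choice to round down with a minimum of one thread are all assumptions.

```c
#include <assert.h>

/* Hypothetical container for the per-job thread limits described in the summary. */
struct health_thread_config {
    unsigned eval;    /* host health evaluation */
    unsigned init;    /* host health initialization */
    unsigned maint;   /* host health maintenance (alert transition cleanup) */
    unsigned cleanup; /* host health cleanup on child disconnection */
};

/* Assumption: 25% is taken with integer division and never drops below 1. */
static unsigned quarter_of(unsigned threads)
{
    unsigned q = threads / 4;
    return q ? q : 1;
}

/*
 * cpus:       number of CPUs detected on the system (must be > 0)
 * configured: value of "[health].health threads", or 0 when it is not set
 *             (assumed convention), in which case the default CPUs * 2 applies
 */
static struct health_thread_config health_threads_derive(unsigned cpus, unsigned configured)
{
    assert(cpus > 0);

    struct health_thread_config cfg;
    cfg.eval    = configured ? configured : cpus * 2;
    cfg.init    = quarter_of(cfg.eval);
    cfg.maint   = quarter_of(cfg.eval);
    cfg.cleanup = quarter_of(cfg.eval);
    return cfg;
}
```

Under these assumptions, a machine with 8 CPUs and no override would get 16 evaluation threads and 4 threads each for initialization, maintenance and cleanup.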

[TBA]

Test Plan
  • [TBA]

@stelfrag force-pushed the health_evloop branch 2 times, most recently from 6493d55 to fd57440, February 18, 2025 14:08
@stelfrag force-pushed the health_evloop branch 6 times, most recently from 8eedfbc to 163eb41, February 26, 2025 08:45
@github-actions bot added the area/build label (Build system: autotools and cmake), Feb 26, 2025
@stelfrag force-pushed the health_evloop branch 6 times, most recently from 4edcc4d to 352e44d, February 27, 2025 12:09
const char *database_aclk_config[] = {

"CREATE TABLE IF NOT EXISTS alert_queue "
" (host_id BLOB, health_log_id INT, unique_id INT, alarm_id INT, status INT, date_scheduled INT, "
Contributor

If status and date_scheduled cannot be null, please declare them explicitly as NOT NULL.
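
As a rough illustration of this suggestion, the sketch below shows a statement in the same string-array style with NOT NULL on the two columns, plus a small check that could sit in a unit test. The table name alert_queue_sketch, the array name and the helper column_is_not_null are hypothetical. Only the columns visible in the excerpt above are listed, since the real statement continues beyond what is shown here.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, shortened variant of the reviewed statement: only the columns
 * visible in the excerpt are listed, with NOT NULL added where requested. */
static const char *alert_queue_sketch_ddl =
    "CREATE TABLE IF NOT EXISTS alert_queue_sketch "
    "(host_id BLOB, health_log_id INT, unique_id INT, alarm_id INT, "
    "status INT NOT NULL, date_scheduled INT NOT NULL)";

/* Returns 1 if `column` is declared with NOT NULL in `ddl`, 0 otherwise.
 * A naive text scan for illustration only, not a SQL parser. */
static int column_is_not_null(const char *ddl, const char *column)
{
    size_t len = strlen(column);
    assert(len > 0);

    for (const char *p = strstr(ddl, column); p; p = strstr(p + 1, column)) {
        /* Accept only whole-word matches that start a column definition. */
        int starts_word = (p == ddl) || p[-1] == ' ' || p[-1] == '(' || p[-1] == ',';
        if (!starts_word || p[len] != ' ')
            continue;

        /* The column definition ends at the next comma or at the end of the string. */
        const char *end = strchr(p, ',');
        if (!end)
            end = p + strlen(p);

        const char *nn = strstr(p, "NOT NULL");
        return nn != NULL && nn < end;
    }
    return 0;
}
```

With the constraint declared, the database engine itself rejects rows that omit these values, so the check does not depend on every caller validating its input.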

"red text, warn text, crit text, exec text, to_key text, info text, delay text, options text, "
"repeat text, host_labels text, p_db_lookup_dimensions text, p_db_lookup_method text, p_db_lookup_options int, "
"p_db_lookup_after int, p_db_lookup_before int, p_update_every int, source text, chart_labels text, "
"summary text, time_group_condition INT, time_group_value DOUBLE, dims_group INT, data_source INT)",
Contributor

What I wrote above about the nullable fields also applies to every table here. Let us make the engine enforce this as a second check, so we can be sure we will not lose data.

@thiagoftsm (Contributor), Feb 27, 2025 (edited)

If every is our update_every or another frequency, it should ideally be an integer.

static void timer_cb(uv_timer_t *handle)
{
    uv_stop(handle->loop);
    uv_update_time(handle->loop);
Contributor

I am leaving this comment here so that we do not forget: if the next commits do not use the commented-out code, please remove it.

@ktsaou (Member)

I updated to the current master, to have better crash detection.

@stelfrag force-pushed the health_evloop branch 4 times, most recently from c8e306b to f715a21, March 4, 2025 14:17
@thiagoftsm self-requested a review, March 6, 2025 02:31
@stelfrag force-pushed the health_evloop branch 4 times, most recently from c1ea9d7 to b99ba95, March 12, 2025 15:30
  • Cache info/summary ids
  • Fix table
  • Deduplicate info/summary from alert transitions
  • Add job_running flag and cleanup handling to health event loop
  • Queue store sql statements in metadata thread
  • Remove unnecessary health check condition from event loop
  • Add new job names for alert host snapshot and process events
  • Prepare ACLK OP to handle alert processing per host
  • Remove transactions for now
  • Fix compile error
  • Revert serialization test
  • Add opcodes for Health Pause/Resume
  • Serialize alert transition processing / wrap in transaction (test)
  • Set fixed timer intervals for health event loop
  • Accurate scheduling health run after maintenance cannot be done yet
  • Remove unused thread-local variable for health thread and simplify SQL statement preparation
  • Refactor job submission logic to improve worker handling
  • Simplify job execution handling
  • Maintenance task should try to keep the health run schedule
  • Implement worker data pool to avoid memory allocations
  • Fix delay calculation in health_host_register function
  • Fix wrong memory allocations
  • Refactor job management to use dynamic worker data allocation
  • Cleanup workers
  • Refactor health_host_initialize to remove delay parameter
  • Remove VACUUM command from database health cleanup
  • If cleanup is running return
  • Refactor health job scheduling
  • Prevent scheduling job if we reached the limit
  • Configure init, maint and cleanup health thread count based on the configured health threads
  • Schedule a host health init when we one is completed
  • Fix maintenance rescheduling
  • Add worker jobs
  • Code cleanup
  • Add health thread config
  • Run maintenance
  • Rebase and code cleanup
  • Transient alert entries are queued for deletion at the end of the evaluation loop (making sure they are processed and saved)
  • Host unregister with cleanup
  • New job for host rrdcalc cleanup during unregister
  • stream receiver will schedule a host unregister and rrdcalc cleanup now
  • Configure max threads per job (needs improved config)
  • Reschedule alert evaluation if rrdcalc cleanup is pending
  • Collect all alert transitions and save in a batch
  • Remove alert transition store from metadata event loop
  • Add host heath maintenance job (not used yet)
  • cleanup_health_log removed from metadata thread
  • Migrate health tables and aclk tables to new databases. Drop old tables.
  • Use attach_database
  • Store alert transitions to the new databases (includes queues and version book keeping for the cloud)
  • Add sql_init_databases to configure all databases during startup
  • Do not create aclk and health tables in netdata-meta.db
  • New netdata-health database for alert transitions
  • New netdata-aclk database to store transitions that need to go to the cloud
  • Fix compilation with internal checks
  • Remove static thread
  • Init and shutdown health
  • Register/Unregister hosts for health
  • Handle resume from suspension / health delay of child reconnect
  • Remove static health thread
  • New health event loop
Reviewers

@thiagoftsm: awaiting requested review (code owner)

@vkalintiris: review will be requested when the pull request is marked ready for review (code owner)

3 participants
@stelfrag, @ktsaou, @thiagoftsm
