Portal:Cloud VPS/Admin/Monitoring

This page describes how monitoring works as deployed and managed by the WMCS team, for both Cloud VPS and Toolforge.

Monitoring for Cloud VPS metal infrastructure

We have our own instance in the wikiprod Prometheus setup. At the time of writing (Oct 2023), it's only in eqiad, but that might change. It's configured via the profile::prometheus::cloud Puppet profile.

To query it, use https://thanos.wikimedia.org or https://prometheus-eqiad.wikimedia.org/cloud/. To craft dashboards, use the Grafana instance at https://grafana.wikimedia.org.
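For scripted access, a query URL can be assembled by hand. The following is a minimal sketch, assuming the instance exposes the standard Prometheus HTTP API (`api/v1/query`) under the `/cloud/` path prefix; the helper name `instant_query_url` is made up for illustration and this has not been verified against the live service.

```python
from urllib.parse import urlencode, urljoin

# Base URL taken from the text above. The "api/v1/query" suffix is the
# standard Prometheus HTTP API path and is ASSUMED to be served under it.
CLOUD_PROMETHEUS = "https://prometheus-eqiad.wikimedia.org/cloud/"


def instant_query_url(base_url: str, promql: str) -> str:
    """Return the URL for an instant query against a Prometheus server."""
    # Ensure a trailing slash so urljoin keeps the "/cloud/" path prefix.
    if not base_url.endswith("/"):
        base_url += "/"
    return urljoin(base_url, "api/v1/query") + "?" + urlencode({"query": promql})


if __name__ == "__main__":
    # Prints a URL that can be fetched with curl or any HTTP client.
    print(instant_query_url(CLOUD_PROMETHEUS, "up == 0"))
```

The helper only builds the URL; fetching it and handling authentication or network restrictions is left to the caller.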

metricsinfra: Monitoring services for Cloud VPS

For user documentation, see Help:Cloud VPS managed monitoring.

The Cloud VPS project "metricsinfra" provides the base infrastructure and services for multi-tenant instance monitoring on Cloud VPS. Technical documentation for the setup is at Nova Resource:Metricsinfra/Documentation.

Metricsinfra Prometheus

The metricsinfra Prometheus server scrapes base instance-level metrics from ALL Puppetized Cloud VPS instances.

Metricsinfra Prometheus CAN be used for:

  • Base instance-level metrics
    • example: node-exporter
  • Small project-specific services that have a low metric count and cardinality. Ask in the #wikimedia-cloud IRC channel if unsure.

Metricsinfra Prometheus MUST NOT be used for:

  • Project-specific services that require complex configuration, or whose metric count or cardinality is large enough to need substantial storage or compute resources
    • Deploy a project-specific Prometheus instance instead, and hook it up to the Metricsinfra Alertmanager and Grafana services.
  • Metrics that contain private information

Managing scrape targets

The monitoring configuration is mostly kept in a Trove database. There is no user-friendly management interface yet, so for now you can ssh to metricsinfra-controller-2.metricsinfra.eqiad1.wikimedia.cloud and use sudo -i mariadb to edit the database by hand.

Scrape targets are defined in the scrapes table:

MariaDB [prometheusconfig]> select * from scrapes s join projects p on s.project_id = p.id;

Managing alert rules

The monitoring configuration is mostly kept in a Trove database. There is no user-friendly management interface yet, so for now you can ssh to metricsinfra-controller-2.metricsinfra.eqiad1.wikimedia.cloud and use sudo -i mariadb to edit the database by hand.

Project-specific rules are defined in the alerts table. Global rules that apply to all Cloud VPS projects are defined in the global_alerts table. You can add a new alert with a query like the following one:

MariaDB [prometheusconfig]> INSERT INTO alerts VALUES (NULL, 12, 'ToolsDBReplicationLagIsTooHigh', 'mysql_slave_status_seconds_behind_master{project="tools"} > 3600', '1m', 'warning', '{"summary": "ToolsDB replication on {{ $labels.instance }} is lagging behind the primary, the current lag is {{ $value }}"}');

The new alert should appear at https://prometheus.wmcloud.org/alerts after a few minutes.
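Hand-written INSERT statements are easy to get wrong, especially the JSON annotations column. The sketch below shows one way to prepare the same row as a parameterized statement. The column order (id, project id, name, expression, duration, severity, annotations) is inferred from the example values rather than from the real schema, and the names `INSERT_ALERT_SQL` and `alert_params` are hypothetical.

```python
import json

# Parameterized form of the INSERT shown above. The column order is INFERRED
# from the example values, not taken from the actual table definition.
INSERT_ALERT_SQL = "INSERT INTO alerts VALUES (NULL, %s, %s, %s, %s, %s, %s)"


def alert_params(project_id, name, expr, duration, severity, annotations):
    """Return the parameter tuple matching INSERT_ALERT_SQL."""
    if not name or not expr:
        raise ValueError("alert name and expression must be non-empty")
    # Serializing here guarantees the annotations column holds valid JSON.
    return (project_id, name, expr, duration, severity, json.dumps(annotations))


if __name__ == "__main__":
    params = alert_params(
        12,
        "ToolsDBReplicationLagIsTooHigh",
        'mysql_slave_status_seconds_behind_master{project="tools"} > 3600',
        "1m",
        "warning",
        {"summary": "ToolsDB replication on {{ $labels.instance }} is lagging"},
    )
    print(INSERT_ALERT_SQL, params)
```

With a DB-API driver that accepts `%s` placeholders (an assumption about the client library in use), the pair would be passed as `cursor.execute(INSERT_ALERT_SQL, params)`.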

Note that these alerts cannot query metrics that are not stored in the metricsinfra Prometheus instance; most notably, this excludes metrics from various Toolforge components. Other Prometheus instances may, however, have separate mechanisms for configuring alert rules.

Metricsinfra Alertmanager

The metricsinfra project has an Alertmanager instance that will send out alerts via IRC, email or VictorOps. In addition to the metricsinfra Prometheus instance, other Prometheus instances in WMCS-managed projects can use this instance to send out alerts.

Silencing alerts

By default, project viewers and members can use prometheus-alerts.wmcloud.org to create and edit silences for the projects they are in. (Toolforge is an exception to this general rule: creating and editing silences for the tools project is restricted to maintainers of the "admin" tool.) In addition, members of the "admin" and "metricsinfra" projects can manage silences for any project.

Alternatively, to silence existing or expected (downtime) notifications, you can use the `amtool` command on any metricsinfra Alertmanager server (currently, for example, metricsinfra-alertmanager-1.metricsinfra.eqiad1.wikimedia.cloud). For example, to silence all Toolsbeta alerts you could use:

metricsinfra-alertmanager-1:~$ amtool silence add project=toolsbeta -c "per T123456" -d 30d
3e68bf51-63f6-4406-a009-e6765acf5d8e
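When silences are created from a script, building the argument list programmatically avoids shell-quoting mistakes in the comment. This is a minimal sketch that only reproduces the `amtool silence add` invocation shown above; the function name `amtool_silence_command` is hypothetical, and actually running the command still has to happen on an Alertmanager host.

```python
import shlex


def amtool_silence_command(matchers, comment, duration):
    """Build the argv list for `amtool silence add` with label matchers."""
    if not matchers:
        raise ValueError("at least one matcher is required")
    argv = ["amtool", "silence", "add"]
    # Each matcher becomes a label=value argument, as in the example above.
    argv += [f"{label}={value}" for label, value in matchers.items()]
    # -c sets the silence comment, -d its duration.
    argv += ["-c", comment, "-d", duration]
    return argv


if __name__ == "__main__":
    argv = amtool_silence_command({"project": "toolsbeta"}, "per T123456", "30d")
    # shlex.join renders a copy-pasteable, safely quoted command line.
    print(shlex.join(argv))
```

The list form can be handed directly to `subprocess.run` without going through a shell.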

To change the default access behavior described above, set the acl_group column in the project table of the prometheusconfig database.

Managing notification groups

Managing ACLs

Managing access for project-specific Prometheus instances

To repeat: only WMCS-managed projects can use this method. WMCS-managed means projects that are considered part of Cloud VPS infrastructure, Toolforge or Data Services (the three Cloud Services product categories), and where everyone with access is required to comply with Help:Access policies. The reason is that any project with this level of metricsinfra access can send pages to the WMCS team.

Change the profile::wmcs::metricsinfra::alertmanager::project_proxy::trusted_hosts Hiera key (managed via Horizon on the metricsinfra project) to include the per-project Prometheus servers you want to allow. Right now this is host-level authentication only; unfortunately, no secrets are involved.

Then, in the Prometheus server config, use something like this:

    alerting:
      alertmanagers:
        - openstack_sd_configs:
            - role: instance
              region: eqiad1-r
              identity_endpoint: https://openstack.eqiad1.wikimediacloud.org:25000/v3
              username: novaobserver
              password: $NOVAOBSERVER_PASSWORD
              domain_name: default
              project_name: metricsinfra
              all_tenants: false
              refresh_interval: 5m
              port: 8643
          relabel_configs:
            - source_labels:
                - __meta_openstack_instance_name
              action: keep
              regex: metricsinfra-alertmanager-\d+
            - source_labels:
                - __meta_openstack_instance_name
              target_label: instance
            - source_labels:
                - __meta_openstack_instance_status
              action: keep
              regex: ACTIVE
      alert_relabel_configs:
        - target_label: source
          replacement: prometheus
          action: replace
        - target_label: project
          replacement: $YOUR_OPENSTACK_PROJECT_NAME
          action: replace
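The two `keep` rules in the configuration above decide which discovered instances are used as Alertmanager targets. The sketch below imitates that filtering in Python to show the effect. It relies on the fact that Prometheus anchors relabel regexes at both ends (hence `fullmatch`); the sample instance list and the function name `kept_alertmanagers` are made up for illustration.

```python
import re

# Prometheus anchors relabel regexes on both ends, so fullmatch is used here.
NAME_REGEX = re.compile(r"metricsinfra-alertmanager-\d+")
STATUS_REGEX = re.compile(r"ACTIVE")


def kept_alertmanagers(instances):
    """Return the instance names that survive both `keep` relabel rules.

    `instances` is an iterable of (name, status) pairs standing in for the
    __meta_openstack_instance_name and __meta_openstack_instance_status labels.
    """
    return [
        name
        for name, status in instances
        if NAME_REGEX.fullmatch(name) and STATUS_REGEX.fullmatch(status)
    ]


if __name__ == "__main__":
    # Hypothetical discovery result, not real inventory data.
    discovered = [
        ("metricsinfra-alertmanager-1", "ACTIVE"),
        ("metricsinfra-controller-2", "ACTIVE"),
        ("metricsinfra-alertmanager-2", "SHUTOFF"),
    ]
    print(kept_alertmanagers(discovered))
```

Only instances whose name matches the Alertmanager pattern and whose status is ACTIVE remain; everything else in the metricsinfra project is dropped before alerts are sent.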

Metricsinfra Grafana

The Metricsinfra Grafana instance is used to draw dashboards from Prometheus data. Like the metricsinfra Alertmanager instance, it can be used with per-project Prometheus servers in addition to the metricsinfra Prometheus server.

Managing data sources

Data sources are managed via modules/profile/files/wmcs/metricsinfra/grafana/datasources.yaml in the Puppet repository.

Monitoring for Toolforge

In addition to the Metricsinfra setup, Toolforge has its own Prometheus server for Kubernetes metrics. It's queryable via https://prometheus.svc.toolforge.org/tools/, and uses the metricsinfra Grafana and Alertmanager instances. Alerts are configured via https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts. The toolsbeta equivalent is queryable via https://prometheus.svc.beta.toolforge.org/tools/.

Dashboards and handy links

If you want an overview of what's going on in the Cloud VPS infrastructure, open these links:

| Datacenter | What | Mechanism | Comments | Link |
|---|---|---|---|---|
| eqiad | NFS servers | icinga | labstore1xxx servers | [1] |
| eqiad | NFS Server Statistics | grafana | labstore and cloudstore NFS operations, connections and various details | [2] |
| eqiad | Cloud VPS main services | icinga | service servers, non virts | [3] |
| codfw | Cloud VPS labtest servers | icinga | all physical servers | [4] |
| eqiad | Toolforge basic alerts | grafana | some interesting metrics from Toolforge | [5] |
| eqiad | ToolsDB (Toolforge R/W MariaDB) | grafana | Database metrics for ToolsDB servers | [6] |
| eqiad | Toolforge grid status | custom tool | jobs running on Toolforge's grid | [7] |
| any | cloud servers | icinga | all physical servers with the cloudXXXX naming scheme | [8] |
| eqiad | Cloud VPS eqiad1 capacity | grafana | capacity planning | [9] |
| eqiad | labstore1004/labstore1005 | grafana | load & general metrics | [10] |
| eqiad | Cloud VPS eqiad1 | grafana | load & general metrics | [11] |
| eqiad | Cloud VPS eqiad1 | grafana | internal openstack metrics | [12] |
| eqiad | Cloud VPS eqiad1 | grafana | hypervisor metrics from openstack | [13] |
| eqiad | Cloud VPS memcache | grafana | cloudservices servers | [14] |
| eqiad | openstack database backend (per host) | grafana | mariadb/galera on cloudcontrols | [15] |
| eqiad | openstack database backend (aggregated) | grafana | mariadb/galera on cloudcontrols | [16] |
| eqiad | Toolforge | grafana | Arturo's metrics | [17] |
| eqiad | Cloud HW eqiad | icinga | Icinga group for WMCS in eqiad | [18] |
| eqiad | Toolforge, new kubernetes cluster | prometheus/grafana | Generic dashboard for the new Kubernetes cluster | [19] |
| eqiad | Toolforge, new kubernetes cluster, namespaces | prometheus/grafana | Per-namespace dashboard for the new Kubernetes cluster | [20] |
| eqiad | Toolforge, new kubernetes cluster, ingress | prometheus/grafana | dashboard about the ingress for the new kubernetes cluster | [21] |
| eqiad | Toolforge | prometheus/grafana | dashboard showing a table with basic information about all VMs in the tools project | [22] |
| eqiad | Toolforge email server | prometheus/grafana | dashboard showing data about Toolforge exim email server | [23] |

See also

Retrieved from "https://wikitech.wikimedia.org/w/index.php?title=Portal:Cloud_VPS/Admin/Monitoring&oldid=2361737"