# OpenTelemetry (OTEL) collector monitoring


## Metric name/label name conventions

OpenTelemetry Collector metric names and label names can vary significantly. These variations depend on several factors:

  • OpenTelemetry Collector version
  • Specific receivers/exporters in use
  • Receiver/exporter configuration settings
  • Metric storage backend configuration

For example, the uptime metric can be stored under any of these names:

```
otelcol_process_uptime{service_instance_id="123456", ...}
otelcol_process_uptime_total{service_instance_id="123456", ...}
otelcol_process_uptime_seconds_total{service_instance_id="123456", ...}
otelcol_process_uptime_seconds_total{service.instance.id="123456", ...}
otelcol_process_uptime_seconds_total_total{service.instance.id="123456", ...}
```

This dashboard attempts to detect all of these variations through hidden dashboard variables to ensure compatibility across different configurations.
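The same idea can also be applied outside the dashboard. The sketch below (not the dashboard's actual variable queries; the recording-rule name is hypothetical, and the dot-style label variants are not covered) unifies the suffix variants with PromQL's `or` operator:

```yaml
# minimal sketch: normalize the uptime metric name, whichever suffix
# variant the backend stores (the record name is a hypothetical example)
groups:
  - name: otelcol-metric-name-normalization
    rules:
      - record: otelcol:process_uptime:normalized
        expr: >-
          otelcol_process_uptime{}
            or otelcol_process_uptime_total{}
            or otelcol_process_uptime_seconds_total{}
            or otelcol_process_uptime_seconds_total_total{}
```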

## Metrics

The collector can expose Prometheus metrics locally on port `8888` at path `/metrics`. For containerized environments it may be desirable to expose this port on a public interface instead of only locally (see the sketch after the configuration below).

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: otel-collector-metrics
        scrape_interval: 10s
        static_configs:
        - targets: ['127.0.0.1:8888']

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: []
      exporters: [...]
  telemetry:
    resource:
      service.name: grafana-opentelemetry/metrics
    metrics:
      level: detailed
      readers:
        - pull:
            exporter:
              prometheus:
                host: 127.0.0.1
                port: 8888
                with_resource_constant_labels:
                  included:
                    - service.name
```
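For the containerized case mentioned above, a minimal sketch (assuming the same pull-based Prometheus reader as in the configuration above) is to bind the internal telemetry endpoint to `0.0.0.0`, so that port `8888` is reachable from outside the container by an external Prometheus or a sidecar scraper:

```yaml
# minimal sketch: expose the collector's own metrics on all interfaces
# instead of 127.0.0.1, so external scrapers can reach port 8888
service:
  telemetry:
    metrics:
      level: detailed
      readers:
        - pull:
            exporter:
              prometheus:
                host: 0.0.0.0
                port: 8888
```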

The collector can scrape its own metrics via its own metrics pipeline, so a real configuration can look like:

```yaml
extensions:
  sigv4auth/aws:

receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: otel-collector-metrics
        scrape_interval: 10s
        static_configs:
          - targets: ['127.0.0.1:8888']

exporters:
  prometheusremotewrite/aws:
    endpoint: ${PROMETHEUS_ENDPOINT}
    auth:
      authenticator: sigv4auth/aws
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval: 10s
      max_elapsed_time: 30s

service:
  # the sigv4auth/aws extension must be enabled here, otherwise the
  # authenticator referenced by the exporter is not available
  extensions: [sigv4auth/aws]
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: []
      exporters: [prometheusremotewrite/aws]
  telemetry:
    resource:
      service.name: grafana-opentelemetry/metrics
    metrics:
      level: detailed
      readers:
        - pull:
            exporter:
              prometheus:
                host: 127.0.0.1
                port: 8888
                with_resource_constant_labels:
                  included:
                    - service.name
```
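To verify the self-scraping pipeline locally before pointing it at a remote-write backend, the exporter can be temporarily swapped for the core `debug` exporter. This is a minimal sketch, assuming a recent collector distribution that includes that exporter:

```yaml
# minimal sketch: print the scraped self-metrics to stdout for verification
exporters:
  debug:
    verbosity: basic

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: []
      exporters: [debug]
```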

## Grafana dashboard for OpenTelemetry collector metrics

(Screenshot: OpenTelemetry collector dashboard)

This dashboard can also be used for Grafana Alloy monitoring.

## Prometheus alerts

Recommended Prometheus alerts for OpenTelemetry collector metrics:

```yaml
# keep in mind that metrics may have "_total|seconds_total|seconds_total_total" suffixes - check your metrics/configuration first
groups:
  - name: opentelemetry-collector
    rules:
      - alert: processor-dropped-spans
        expr: sum(rate(otelcol_processor_dropped_spans{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some spans have been dropped by a processor
          description: The collector may have received non-standard spans or reached some limits
      - alert: processor-dropped-metrics
        expr: sum(rate(otelcol_processor_dropped_metric_points{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some metric points have been dropped by a processor
          description: The collector may have received non-standard metric points or reached some limits
      - alert: processor-dropped-logs
        expr: sum(rate(otelcol_processor_dropped_log_records{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some log records have been dropped by a processor
          description: The collector may have received non-standard log records or reached some limits
      - alert: receiver-refused-spans
        expr: sum(rate(otelcol_receiver_refused_spans{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some spans have been refused by a receiver
          description: The collector may have received non-standard spans or reached some limits
      - alert: receiver-refused-metrics
        expr: sum(rate(otelcol_receiver_refused_metric_points{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some metric points have been refused by a receiver
          description: The collector may have received non-standard metric points or reached some limits
      - alert: receiver-refused-logs
        expr: sum(rate(otelcol_receiver_refused_log_records{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some log records have been refused by a receiver
          description: The collector may have received non-standard log records or reached some limits
      - alert: exporter-enqueued-spans
        expr: sum(rate(otelcol_exporter_enqueue_failed_spans{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some spans failed to be enqueued by an exporter
          description: The destination may have a problem or the payload may not be correct
      - alert: exporter-enqueued-metrics
        expr: sum(rate(otelcol_exporter_enqueue_failed_metric_points{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some metric points failed to be enqueued by an exporter
          description: The destination may have a problem or the payload may not be correct
      - alert: exporter-enqueued-logs
        expr: sum(rate(otelcol_exporter_enqueue_failed_log_records{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some log records failed to be enqueued by an exporter
          description: The destination may have a problem or the payload may not be correct
      - alert: exporter-failed-requests
        expr: sum(rate(otelcol_exporter_send_failed_requests{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some exporter requests failed
          description: The destination may have a problem or the payload may not be correct
      - alert: high-cpu-usage
        expr: max(rate(otelcol_process_cpu_seconds{}[1m])*100) > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High max CPU usage
          description: The collector needs to scale up
```
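To load these rules, save them to a file and reference it from the Prometheus configuration. A minimal sketch, where the file name `otel-collector-alerts.yml` is just a placeholder:

```yaml
# prometheus.yml (excerpt); the rules file name is an example
rule_files:
  - otel-collector-alerts.yml
```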

An example of alert rules is provided as part of the OpenTelemetry Collector Helm chart: https://github.com/open-telemetry/opentelemetry-helm-charts/blob/main/charts/opentelemetry-collector/templates/prometheusrule.yaml

## Documentation

