Troubleshooting Managed Service for Prometheus
This document describes some problems you might encounter when using Google Cloud Managed Service for Prometheus and provides information on diagnosing and resolving the problems.
You configured Managed Service for Prometheus but are not seeing any metric data in Grafana or the Prometheus UI. At a high level, the cause might be either of the following:
A problem on the query side, so that data can't be read. Query-side problems are often caused by incorrect permissions on the service account reading the data or by misconfiguration of Grafana.
A problem on the ingestion side, so that no data is sent. Ingestion-side problems can be caused by configuration problems with service accounts, collectors, or rule evaluation.
To determine whether the problem is on the ingestion side or the query side, try querying data by using the Metrics Explorer PromQL tab in the Google Cloud console. This page is guaranteed not to have any issues with read permissions or Grafana settings.
To view this page, do the following:
Use the Google Cloud console project picker to select the project for which you are not seeing data.
In the Google Cloud console, go to the Metrics explorer page:
If you use the search bar to find this page, then select the result whose subheading is Monitoring.
In the toolbar of the query-builder pane, select the button whose name is either MQL or PromQL.
Verify that PromQL is selected in the Language toggle. The language toggle is in the same toolbar that lets you format your query.
Enter the following query into the editor, and then click Run query:
up
If you query the up metric and see results, then the problem is on the query side. For information on resolving these problems, see Query-side problems.
If you query the up metric and do not see any results, then the problem is on the ingestion side. For information on resolving these problems, see Ingestion-side problems.
A firewall can also cause ingestion and query problems; for more information, see Firewalls.
The Cloud Monitoring Metrics Management page provides information that can help you control the amount you spend on billable metrics without affecting observability. The Metrics Management page reports the following information:
- Ingestion volumes for both byte- and sample-based billing, across metric domains and for individual metrics.
- Data about labels and cardinality of metrics.
- Number of reads for each metric.
- Use of metrics in alerting policies and custom dashboards.
- Rate of metric-write errors.
You can also use the Metrics Management page to exclude unneeded metrics, eliminating the cost of ingesting them.
To view the Metrics Management page, do the following:
In the Google Cloud console, go to the Metrics management page:
If you use the search bar to find this page, then select the result whose subheading is Monitoring.
- In the toolbar, select your time window. By default, the Metrics Management page displays information about the metrics collected in the previous one day.
For more information about the Metrics Management page, see View and manage metric usage.
Query-side problems
The cause of most query-side problems is one of the following:
- Incorrect permissions or credentials for service accounts.
- Misconfiguration of Workload Identity Federation for GKE, if your cluster has this feature enabled. For more information, see Configure a service account for Workload Identity Federation for GKE.
Start by doing the following:
Check your configuration carefully against the setup instructions for querying.
If you are using Workload Identity Federation for GKE, verify that your service account has the correct permissions by doing the following:
In the Google Cloud console, go to the IAM page:
If you use the search bar to find this page, then select the result whose subheading is IAM & Admin.
Identify the service account name in the list of principals. Verify that the name of the service account is correctly spelled. Then click Edit.
Select the Role field, then click Currently used and search for the Monitoring Viewer role. If the service account doesn't have this role, add it now.
If the problem persists, then consider the following possibilities:
Misconfigured or mistyped secrets
If you see any of the following, then you might have a missing or mistyped secret:
One of these "forbidden" errors in Grafana or the Prometheus UI:
- "Warning: Unexpected response status when fetching server time: Forbidden"
- "Warning: Error fetching metrics list: Unexpected response status when fetching metric names: Forbidden"
A message like this in your logs:
"cannot read credentials file: open /gmp/key.json: no such file or directory"
If you are using the data source syncer to authenticate and configure Grafana, try the following to resolve these errors:
Verify that you have chosen the correct Grafana API endpoint, Grafana data source UID, and Grafana API token. You can inspect the variables in the CronJob by running the following command:
kubectl describe cronjob datasource-syncer
Verify that you have set the data source syncer's project ID to the same metrics scope or project that your service account has credentials for.
Verify that your Grafana service account has the "Admin" role and that your API token has not expired.
Verify that your service account has the Monitoring Viewer role for the chosen project ID.
Verify that there are no errors in the logs for the data source syncer Job by running the following command immediately after applying the datasource-syncer.yaml file:
kubectl logs job.batch/datasource-syncer-init
If using Workload Identity Federation for GKE, verify that you have not mistyped the account key or credentials, and verify that you have bound it to the correct namespace.
If you are using the legacy frontend UI proxy, try the following to resolve these errors:
Verify that you have set the frontend UI's project ID to the same metrics scope or project that your service account has credentials for.
Verify the project ID you've specified for any --query.project-id flags.
Verify that your service account has the Monitoring Viewer role for the chosen project ID.
Verify that you have set the correct project ID when deploying the frontend UI and did not leave it set to the literal string PROJECT_ID.
If using Workload Identity, verify that you have not mistyped the account key or credentials, and verify that you have bound it to the correct namespace.
If mounting your own secret, make sure the secret is present:
kubectl get secret gmp-test-sa -o json | jq '.data | keys'
Verify that the secret is correctly mounted:
kubectl get deploy frontend -o json | jq .spec.template.spec.volumes
kubectl get deploy frontend -o json | jq .spec.template.spec.containers[].volumeMounts
Make sure the secret is passed correctly to the container:
kubectl get deploy frontend -o json | jq .spec.template.spec.containers[].args
Incorrect HTTP method for Grafana
If you see the following API error from Grafana, then Grafana is configured to send a POST request instead of a GET request:
- "{"status":"error","errorType":"bad_data","error":"no match[] parameter provided"}%"
To resolve this issue, configure Grafana to use a GET request by following the instructions in Configure a data source.
Timeouts on large or long-running queries
If you see the following error in Grafana, then your default query timeout is too low:
- "Post "http://frontend.NAMESPACE_NAME.svc:9090/api/v1/query_range": net/http: timeout awaiting response headers"
Managed Service for Prometheus does not time out until a query exceeds 120 seconds, while Grafana times out after 30 seconds by default. To fix this, raise the timeouts in Grafana to 120 seconds by following the instructions in Configure a data source.
Label-validation errors
If you see one of the following errors in Grafana, then you might be using an unsupported endpoint:
- "Validation: labels other than name are not supported yet"
- "Templating [job]: Error updating options: labels other than name are not supported yet."
Managed Service for Prometheus supports the /api/v1/$label/values endpoint only for the __name__ label. This limitation causes queries using the label_values($label) variable in Grafana to fail.
Instead, use the label_values($metric, $label) form. This query is recommended because it constrains the returned label values by metric, which prevents retrieval of values not related to the dashboard's contents. This query calls a supported endpoint for Prometheus.
For more information about supported endpoints, see API compatibility.
Quota exceeded
If you see the following error, then you have exceeded your read quota for the Cloud Monitoring API:
- "429: RESOURCE_EXHAUSTED: Quota exceeded for quota metric 'Time series queries' and limit 'Time series queries per minute' of service 'monitoring.googleapis.com' for consumer 'project_number:...'."
To resolve this issue, submit a request to increase your read quota for the Monitoring API. For assistance, contact Google Cloud Support. For more information about quotas, see the Cloud Quotas documentation.
Metrics from multiple projects
If you want to view metrics from multiple Google Cloud projects, you don't have to configure multiple data source syncers or create multiple data sources in Grafana.
Instead, create a Cloud Monitoring metrics scope in one Google Cloud project, the scoping project, that contains the projects you want to monitor. When you configure the Grafana data source with a scoping project, you get access to the data from all projects in the metrics scope. For more information, see Queries and metrics scopes.
No monitored resource type specified
If you see the following error, then you need to specify a monitored resource type when using PromQL to query a Google Cloud system metric:
- "metric is configured to be used with more than one monitored resource type; series selector must specify a label matcher on monitored resource name"
You can specify a monitored resource type by filtering using the monitored_resource label. For more information about identifying and choosing a valid monitored resource type, see Specifying a monitored resource type.
Counter, histogram, and summary raw values not matching between the collector UI and the Google Cloud console
You might notice a difference between the values in the local collector Prometheus UI and the Google Cloud console when querying the raw value of cumulative Prometheus metrics, including counters, histograms, and summaries. This behavior is expected.
Monarch requires start timestamps, but Prometheus doesn't have start timestamps. Managed Service for Prometheus generates start timestamps by skipping the first ingested point in any time series and converting it into a start timestamp. Subsequent points have the value of the initial skipped point subtracted from their value to ensure rates are correct. This causes a persistent deficit in the raw value of those points.
The difference between the number in the collector UI and the number in the Google Cloud console is equal to the first value recorded in the collector UI, which is expected because the system skips that initial value and subtracts it from subsequent points.
This is acceptable because there is no production need to query raw values of cumulative metrics; all useful queries require a rate() function or the like, in which case the difference over any time horizon is identical between the two UIs. Cumulative metrics only ever increase, so you can't usefully alert on a raw query, because a time series only ever crosses a threshold once. All useful alerts and charts look at the change or the rate of change in the value.
The collector only holds about 10 minutes of data locally. Discrepancies in raw cumulative values might also arise due to a reset happening before the 10-minute horizon. To rule out this possibility, try setting a 10-minute query lookback period when comparing the collector UI to the Google Cloud console.
Discrepancies can also be caused by having multiple worker threads in your application, each with a /metrics endpoint. If your application spins up multiple threads, you have to put the Prometheus client library in multiprocess mode. For more information, see the documentation for using multiprocess mode in Prometheus' Python client library.
Missing counter data or broken histograms
The most common signal of this problem is seeing no data or seeing data gaps when querying a plain counter metric (for example, a PromQL query of metric_name_foo). You can confirm this if data appears after you add a rate function to your query (for example, rate(metric_name_foo[5m])).
You might also notice that the number of samples ingested has risen sharply without any major change in scrape volume, or that new metrics are being created with "unknown" or "unknown:counter" suffixes in Cloud Monitoring.
You might also notice that histogram operations, such as the quantile() function, don't work as expected.
These issues occur when a metric is collected without a Prometheus metric TYPE. Because Monarch is strongly typed, Managed Service for Prometheus accounts for untyped metrics by suffixing them with "unknown" and ingesting them twice, once as a gauge and once as a counter. The query engine then chooses whether to query the underlying gauge or counter metric based on what query functions you use.
While this heuristic usually works quite well, it can lead to issues such as strange results when querying a raw "unknown:counter" metric. Also, because histograms are specifically typed objects in Monarch, ingesting the three required histogram metrics as individual counter metrics causes histogram functions to not work. Because "unknown"-typed metrics are ingested twice, not setting a TYPE doubles your samples ingested.
Common reasons why TYPE might not be set include:
- Accidentally configuring a Managed Service for Prometheus collector as a federation server. Federation is not supported when using Managed Service for Prometheus. Because federation intentionally drops TYPE information, implementing federation causes "unknown"-typed metrics.
- Using Prometheus Remote Write at any point in the ingestion pipeline. This protocol also intentionally drops TYPE information.
- Using a relabeling rule that modifies the metric name. This causes the renamed metric to disassociate from the TYPE information associated with the original metric name.
- The exporter not emitting a TYPE for each metric.
- A transient issue where TYPE is dropped when the collector first starts up.
To resolve this issue, do the following:
- Stop using federation with Managed Service for Prometheus. If you want to reduce cardinality and cost by "rolling up" data before sending it to Monarch, see Configure local aggregation.
- Stop using Prometheus Remote Write in your collection path.
- Confirm that the # TYPE field exists for each metric by visiting the /metrics endpoint.
- Delete any relabeling rules that modify the name of a metric.
- Delete any conflicting metrics with the "unknown" or "unknown:counter" suffix by calling DeleteMetricDescriptor.
- Or always query counters by using a rate or other counter-processing function.
You can also create a metric-exclusion rule within Metrics Management to prevent any "unknown"-suffixed metrics from being ingested by using the regular expression prometheus.googleapis.com/.+/unknown.*. If you don't fix the underlying issue before installing this rule, you might prevent wanted metric data from being ingested.
Grafana data not persisted after pod restart
If your data appears to vanish from Grafana after a pod restart but is visible in Cloud Monitoring, then you are using Grafana to query the local Prometheus instance instead of Managed Service for Prometheus.
For information about configuring Grafana to use the managed service as a data source, see Grafana.
Inconsistent query or alert rule results that automatically fix themselves
You might notice a pattern where queries over recent windows, such as queries run by recording or alerting rules, return unexplainable spikes in data. When you investigate the spike by running the query in Grafana or Metrics Explorer, you might see that the spike has disappeared and the data looks normal again.
This behavior might happen more often if any of the following are true:
- You are consistently running many very similar queries in parallel, perhaps by using rules. These queries might differ from each other only by a single attribute. For example, you might be running 50 recording rules that differ only by the VALUE for the filter {foo="VALUE"}, or that differ only by having different [duration] values for the rate function.
- You are running queries at time=now with no buffer.
- You are running instant queries such as alerts or recording rules. If you are using a recording rule, you might notice that the saved output has the spike, but the spike can't be found when running a query over the raw data.
- You are querying two metrics to create a ratio. The spikes are more pronounced when the count of time series is low in either the numerator or the denominator query.
- Your metric data lives in larger Google Cloud regions such as us-central1 or us-east4.
There are a few possible causes for temporary spikes in these kinds of queries:
- (Most common cause) Your similar, parallel queries are all requesting data from the same set of Monarch nodes, consuming a large amount of memory on each node in aggregate. When Monarch has sufficient available resources in a cloud region, your queries work. When Monarch is under resource pressure in a cloud region, each node throttles queries, preferentially throttling users that are consuming the most memory on each node. When Monarch once again has sufficient resources, your queries work again. These queries might be SLIs that are automatically generated from tools such as Sloth.
- You have late-arriving data, and your queries are not tolerant to this. It takes approximately 3-7 seconds for newly written data to be queryable, excluding networking latency and any delay caused by resource pressure within your environment. If your query does not build in a delay or offset to account for late data, then you might unknowingly query over a period where you only have partial data. Once the data arrives, your query results look normal.
- Monarch might have a slight inconsistency when saving your data in different replicas. The query engine attempts to pick the "best quality" replica, but if different queries pick different replicas with slightly different sets of data, it's possible that your results vary slightly between queries. This is an expected behavior of the system, and your alerts should be tolerant to these slight discrepancies.
- An entire Monarch region might be temporarily unavailable. If a region is not reachable, the query engine treats the region like it never existed. After the region becomes available again, query results include that region's data again.
To account for these possible root causes, you should ensure that your queries, rules, and alerts follow these best practices:
Consolidate similar rules and alerts into a single rule that aggregates by labels instead of having separate rules for each permutation of label values. If these are alerting rules, you can use label-based notifications to route alerts from the aggregate rule instead of configuring individual routing rules for each alert.
For example, if you have a label foo with values bar, baz, and qux, instead of having a separate rule for each label value (one with the query sum(metric{foo="bar"}), one with the query sum(metric{foo="baz"}), one with the query sum(metric{foo="qux"})), have a single rule that aggregates across that label and optionally filters to the label values you care about (such as sum by (foo) (metric{foo=~"bar|baz|qux"})).
If your metric has 2 labels, and each label has 50 values, and you have a separate rule for each combination of label values, and your rule queries are a ratio, then each period you are launching 50 x 50 x 2 = 5,000 parallel Monarch queries that each hit the same set of Monarch nodes. In aggregate, these 5,000 parallel queries consume a large amount of memory on each Monarch node, which increases your risk of being throttled when a Monarch region is under resource pressure.
If you instead use aggregations to consolidate these rules into a single rule that's a ratio, then each period you only launch 2 parallel Monarch queries. These 2 parallel queries consume much less memory in aggregate than the 5,000 parallel queries, and your risk of being throttled is much lower.
If your rule looks back more than 1 day, then run it less frequently than every minute. Queries that access data older than 25 hours go to the Monarch on-disk data repository. These repository queries are slower and consume more memory than queries over more recent data, which exacerbates any problems with memory consumption from parallel recording rules.
Consider running these kinds of queries once an hour instead of once a minute. Running a day-long query every minute only gives you a 1/1440 = 0.07% change in the result each period, which is a negligible change. Running a day-long query every hour gives you a 60/1440 = 4% change in the result each period, which is a more relevant signal size. If you need to get alerted when recent data changes, then you can run a different rule with a shorter lookback (such as 5 minutes) once a minute.
Use the for: field in your rules to tolerate transient aberrant results. The for: field stops your alert from firing unless the alert condition has been met for at least the configured duration. Set this field to be twice the length of your rule evaluation interval or longer.
Using the for: field helps because transient issues often resolve themselves, meaning they don't occur on consecutive alert cycles. If you see a spike, and that spike persists across multiple timestamps and multiple alert cycles, you can be more confident that it's a real spike and not a transient issue.
Use the offset modifier in PromQL to delay your query evaluation so it doesn't operate over the most recent period of data. Look at your sampling interval and your rule-evaluation interval and identify the longer of the two. Ideally, your query offset is at least twice the length of the longer interval. For example, if you send data every 15s and run rules every 30s, then offset your queries by at least 1m. A 1m offset causes your rules to use an end timestamp that's at least 60 seconds old, which builds in a buffer for late data to arrive before running your rule. This is both a Cloud Monitoring best practice (all managed PromQL alerts have at least a 1m offset) and a Prometheus best practice. A rule sketch that combines aggregation, an offset, and the for: field follows this list.
Group your results by the location label to isolate potential unavailable-region issues. The label that has the Google Cloud region might be called zone or region in some system metrics. If you don't group by region and a region becomes unavailable, then it looks like your results drop suddenly and you might see historical results drop as well. If you group by region and a region becomes unavailable, then you don't receive any results from that region, but results from other regions are unaffected.
If your ratio is a success ratio (such as 2xx responses over total responses), consider making it an error ratio (such as 4xx+5xx responses over total responses) instead. Error ratios are more tolerant to inconsistent data, as a temporary dip in the data makes the query result lower than your threshold and therefore doesn't cause your alert to fire.
Break apart a ratio query or recording rule into separate numerator and denominator queries, if possible. This is a Prometheus best practice. Using ratios is valid, but because the query in the numerator executes independently from the query in the denominator, using ratios can magnify the impact of transient issues:
- If Monarch throttles the numerator query but not the denominator query, then you might see unexpectedly low results. If Monarch throttles the denominator query but not the numerator query, then you might see unexpectedly high results.
- If you are querying recent time periods and you have late-arriving data, it's possible that one query in the ratio executes before the data arrives and the other query in the ratio executes after the data arrives.
- If either side of your ratio consists of relatively few time series, then any errors get magnified. If your numerator and denominator each have 100 time series, and Monarch doesn't return 1 time series in the numerator query, then you are likely to notice the 1% difference. If your numerator and denominator each have 1,000,000 time series, and Monarch doesn't return 1 time series in the numerator query, you are unlikely to notice the 0.0001% difference.
If your data is sparse, then use a longer rate duration in your query. If your data arrives every 10 minutes and your query uses rate(metric[1m]), then your query only looks back 1 minute for data and you sometimes get empty results. As a rule of thumb, set your [duration] to be at least 4 times your scrape interval.
Gauge queries by default look back 5 minutes for data. To make them look back further, use any valid x_over_time function such as last_over_time.
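The following Rules resource is a minimal sketch of how several of these practices can fit together under managed rule evaluation. The metric name http_requests_total, the code label, the namespace, the rule names, and the threshold are hypothetical placeholders; adapt them to your own data.
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  name: example-aggregated-rules   # hypothetical
  namespace: NAMESPACE_NAME        # hypothetical application namespace
spec:
  groups:
  - name: example
    interval: 60s
    rules:
    # One aggregated recording rule instead of one rule per label value.
    # The 1m offset leaves a buffer for late-arriving data.
    - record: job:http_requests_errors:rate5m
      expr: sum by (job) (rate(http_requests_total{code=~"5.."}[5m] offset 1m))
    # The for: field requires the condition to hold for two evaluation cycles.
    - alert: HighErrorRate
      expr: job:http_requests_errors:rate5m > 5
      for: 2m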
These recommendations are mostly relevant if you are seeing inconsistent query results when querying recent data. If you see this issue happening when querying data that's over 25 hours old, then there might be a technical issue with Monarch. If this happens, contact Cloud Customer Care so we can investigate.
Importing Grafana dashboards
For information about using and troubleshooting the dashboard importer, see Import Grafana dashboards into Cloud Monitoring.
For information about problems with the conversion of the dashboard contents, see the importer's README file.
Ingestion-side problems
Ingestion-side problems can be related to either collection or rule evaluation. Start by looking at the error logs for managed collection. You can run the following commands:
kubectl logs -f -n gmp-system -l app.kubernetes.io/part-of=gmp
kubectl logs -f -n gmp-system -l app.kubernetes.io/name=collector -c prometheus
On GKE Autopilot clusters, you can run the following commands:
kubectl logs -f -n gke-gmp-system -l app.kubernetes.io/part-of=gmp
kubectl logs -f -n gke-gmp-system -l app.kubernetes.io/name=collector -c prometheus
The target status feature can help you debug your scrape target. For more information, see target status information.
Endpoint status is missing or too old
If you have enabled the target status feature but one or more of your PodMonitoring or ClusterPodMonitoring resources are missing the Status.Endpoint Statuses field or value, then you might have one of the following problems:
- Managed Service for Prometheus was unable to reach a collector onthe same node as one of your endpoints.
- One or more of your PodMonitoring or ClusterPodMonitoring configs resultedin no valid targets.
Similar problems can also cause the Status.Endpoint Statuses.Last Update Time field to have a value older than a few minutes plus your scrape interval.
To resolve this issue, start by checking that the Kubernetes pods associated with your scrape endpoint are running. If your Kubernetes pods are running, the label selectors match, and you can manually access the scrape endpoints (typically by visiting the /metrics endpoint), then check whether the Managed Service for Prometheus collectors are running.
Collectors fraction is less than 1
If you have enabled the target status feature, then you get status information about your resources. The Status.Endpoint Statuses.Collectors Fraction value of your PodMonitoring or ClusterPodMonitoring resources represents the fraction of collectors, expressed from 0 to 1, that are reachable. For example, a value of 0.5 indicates that 50% of your collectors are reachable, while a value of 1 indicates that 100% of your collectors are reachable.
If the Collectors Fraction field has a value other than 1, then one or more collectors are unreachable, and metrics from those nodes might not be scraped. Ensure that all collectors are running and reachable over the cluster network. You can view the status of collector pods with the following command:
kubectl -n gmp-system get pods --selector="app.kubernetes.io/name=collector"
On GKE Autopilot clusters, this command looks slightly different:
kubectl -n gke-gmp-system get pods --selector="app.kubernetes.io/name=collector"
You can investigate individual collector pods (for example, a collector pod named collector-12345) with the following command:
kubectl -n gmp-system describe pods/collector-12345
On GKE Autopilot clusters, run the following command:
kubectl -n gke-gmp-system describe pods/collector-12345
If collectors are not healthy, see GKE workload troubleshooting.
If the collectors are healthy, then check the operator logs. To check the operator logs, first run the following command to find the operator pod name:
kubectl -n gmp-system get pods --selector="app.kubernetes.io/name=gmp-collector"
On GKE Autopilot clusters, run the following command:
kubectl -n gke-gmp-system get pods --selector="app.kubernetes.io/name=gmp-collector"
Then, check the operator logs (for example, an operator pod named gmp-operator-12345) with the following command:
kubectl -n gmp-system logs pods/gmp-operator-12345
On GKE Autopilot clusters, run the following command:
kubectl -n gke-gmp-system logs pods/gmp-operator-12345
Unhealthy targets
If you have enabled the target status feature, but one or more of your PodMonitoring or ClusterPodMonitoring resources has the Status.Endpoint Statuses.Unhealthy Targets field with a value other than 0, then the collector cannot scrape one or more of your targets.
View the Sample Groups field, which groups targets by error message, and find the Last Error field. The Last Error field comes from Prometheus and tells you why the target couldn't be scraped. To resolve this issue, using the sample targets as a reference, check whether your scrape endpoints are running.
Unauthorized scrape endpoint
If you see one of the following errors and your scrape target requires authorization, then your collector is either not set up to use the correct authorization type or is using the incorrect authorization payload:
- "server returned HTTP status 401 Unauthorized"
- "x509: certificate signed by unknown authority"
To resolve this issue, see Configuring an authorized scrape endpoint.
Quota exceeded
If you see the following error, then you have exceeded your ingestion quota for the Cloud Monitoring API:
- "429: Quota exceeded for quota metric 'Time series ingestion requests' and limit 'Time series ingestion requests per minute' of service 'monitoring.googleapis.com' for consumer 'project_number:PROJECT_NUMBER'.,rateLimitExceeded"
This error is most commonly seen when first bringing up the managed service. The default quota is exhausted at an ingestion rate of 100,000 samples per second.
To resolve this issue, submit a request to increase your ingestion quota for the Monitoring API. For assistance, contact Google Cloud Support. For more information about quotas, see the Cloud Quotas documentation.
Missing permission on the node's default service account
If you see one of the following errors, then the default service account on the node might be missing permissions:
- "execute query: Error querying Prometheus: client_error: client error: 403"
- "Readiness probe failed: HTTP probe failed with statuscode: 503"
- "Error querying Prometheus instance"
Managed collection and the managed rule evaluator in Managed Service for Prometheus both use the default service account on the node. This account is created with all the necessary permissions, but customers sometimes manually remove the Monitoring permissions. This removal causes collection and rule evaluation to fail.
To verify the permissions of the service account, do one of the following:
Identify the underlying Compute Engine node name, and then run the following command:
gcloud compute instances describe NODE_NAME --format="json" | jq .serviceAccounts
Look for the string https://www.googleapis.com/auth/monitoring. If necessary, add Monitoring as described in Misconfigured service account.
Navigate to the underlying VM in the cluster and check the configuration of the node's service account:
In the Google Cloud console, go to the Kubernetes clusters page:
If you use the search bar to find this page, then select the result whose subheading is Kubernetes Engine.
Select Nodes, then click the name of the node in the Nodes table.
Click Details.
Click the VM Instance link.
Locate the API and identity management pane, and click Show details.
Look for Stackdriver Monitoring API with full access.
It's also possible that the data source syncer or the Prometheus UI has been configured to look at the wrong project. For information about verifying that you are querying the intended metrics scope, see Change the queried project.
Misconfigured service account
If you see one of the following error messages, then the service account used by the collector does not have the correct permissions:
- "code = PermissionDenied desc = Permission monitoring.timeSeries.create denied (or the resource may not exist)"
- "google: could not find default credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information."
To verify that your service account has the correct permissions, do the following:
In the Google Cloud console, go to the IAM page:
If you use the search bar to find this page, then select the result whose subheading is IAM & Admin.
Identify the service account name in the list of principals. Verify that the name of the service account is correctly spelled. Then click Edit.
Select the Role field, then click Currently used and search for the Monitoring Metric Writer or the Monitoring Editor role. If the service account doesn't have one of these roles, then grant the service account the role Monitoring Metric Writer (roles/monitoring.metricWriter).
If you are running on non-GKE Kubernetes, then you must explicitly pass credentials to both the collector and the rule evaluator. You must repeat the credentials in both the rules and collection sections. For more information, see Provide credentials explicitly (for collection) or Provide credentials explicitly (for rules).
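On non-GKE clusters, repeating the credentials in both sections of the OperatorConfig resource might look roughly like the following sketch. The Secret name gmp-test-sa and the key key.json are assumptions taken from the typical setup examples; substitute whatever Secret you created.
apiVersion: monitoring.googleapis.com/v1
kind: OperatorConfig
metadata:
  namespace: gmp-public
  name: config
collection:
  credentials:
    name: gmp-test-sa     # assumed Secret holding the service account key
    key: key.json
rules:
  credentials:
    name: gmp-test-sa     # repeat the same credentials for rule evaluation
    key: key.json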
Service accounts are often scoped to a single Google Cloud project. Using one service account to write metric data for multiple projects (for example, when one managed rule evaluator is querying a multi-project metrics scope) can cause this permission error. If you are using the default service account, consider configuring a dedicated service account so that you can safely add the monitoring.timeSeries.create permission for several projects. If you can't grant this permission, then you can use metric relabeling to rewrite the project_id label to another name. The project ID then defaults to the Google Cloud project in which your Prometheus server or rule evaluator is running.
Invalid scrape configuration
If you see the following error, then your PodMonitoring or ClusterPodMonitoring is improperly formed:
- "Internal error occurred: failed calling webhook "validate.podmonitorings.gmp-operator.gmp-system.monitoring.googleapis.com": Post "https://gmp-operator.gmp-system.svc:443/validate/monitoring.googleapis.com/v1/podmonitorings?timeout=10s": EOF"
To solve this, make sure your custom resource is properly formed according to the specification.
Metric paths with HTTP query parameters aren't scraped
You are trying to send a metric by using a path field that includes query parameters to Managed Service for Prometheus, but the metric isn't scraped. For example, your scrape configuration might include the following:
path: /metrics/detailed?family=queue_metrics&family=queue_consumer_count
The metric isn't scraped because Prometheus URL-encodes the question-mark (?) character as %3F, so the data is sent to /metrics/detailed%3Ffamily=queue_metrics&family=queue_consumer_count instead.
To fix this problem, use the params field. For example, if the metric path is /metrics/detailed?family=queue_metrics&family=queue_consumer_count, then set up the scrape configuration as follows:
path: /metrics/detailed
params:
  family: ['queue_metrics', 'queue_consumer_count']
Admission webhook unable to parse or invalid HTTP client config
On versions of Managed Service for Prometheus earlier than 0.12, you might see an error similar to the following, which is related to secret injection in the non-default namespace:
- "admission webhook "validate.podmonitorings.gmp-operator.gmp-system.monitoring.googleapis.com" denied the request: invalid definition for endpoint with index 0: unable to parse or invalid Prometheus HTTP client config: must use namespace "my-custom-namespace", got: "default""
To solve this issue, upgrade to version 0.12 or later.
Problems with scrape intervals and timeouts
When using Managed Service for Prometheus, the scrape timeout can't be greater than the scrape interval. To check your logs for this problem, run the following command:
kubectl -n gmp-system logs ds/collector prometheus
On GKE Autopilot clusters, run the following command:
kubectl -n gke-gmp-system logs ds/collector prometheus
Look for this message:
- "scrape timeout greater than scrape interval for scrape config withjob name "PodMonitoring/gmp-system/example-app/go-metrics""
To resolve this issue, set the value of the scrape interval equal to orgreater than the value of the scrape timeout.
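For example, a PodMonitoring endpoint with a valid combination might look like the following sketch; the resource name, selector, and port are hypothetical placeholders.
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: example-app        # hypothetical
spec:
  selector:
    matchLabels:
      app: example-app     # hypothetical
  endpoints:
  - port: metrics
    interval: 30s
    timeout: 10s           # the timeout must not exceed the 30s interval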
Missing TYPE on metric
If you see the following error, then the metric is missing type information:
- "no metadata found for metric name "{metric_name}""
To verify that missing type information is the problem, check the /metrics output of the exporting application. If there is no line like the following, then the type information is missing:
# TYPE {metric_name} <type>
Certain libraries, such as those from VictoriaMetrics older than version 1.28.0, intentionally drop the type information. These libraries are not supported by Managed Service for Prometheus.
Time-series collisions
If you see one of the following errors, you might have more than one collector attempting to write to the same time series:
- "One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric."
- "One or more TimeSeries could not be written: Points must be written in order. One or more of the points specified had an older end time than the most recent point."
The most common causes and solutions follow:
Using high-availability pairs. Managed Service for Prometheus does not support traditional high-availability collection. Using this configuration can create multiple collectors that try to write data to the same time series, causing this error.
To resolve the problem, disable the duplicate collectors by reducing the replica count to 1, or use the supported high-availability method.
Using relabeling rules, particularly those that operate on jobs or instances. Managed Service for Prometheus partially identifies a unique time series by the combination of the {project_id, location, cluster, namespace, job, instance} labels. Using a relabeling rule to drop these labels, especially the job and instance labels, can frequently cause collisions. Rewriting these labels is not recommended.
To resolve the problem, delete the rule that is causing it; the offending rule is often a metricRelabeling rule that uses the labeldrop action (a sketch of such a rule follows). You can identify the problematic rule by commenting out all the relabeling rules and then reinstating them, one at a time, until the error recurs.
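For reference, a rule of that kind might look like the following PodMonitoring fragment; the port and interval are placeholders. Deleting such an entry, or narrowing its regex so it doesn't match reserved labels, typically resolves the collisions.
  endpoints:
  - port: metrics          # hypothetical
    interval: 30s
    metricRelabeling:
    - action: labeldrop
      regex: instance      # dropping the reserved instance label causes collisions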
A less common cause of time-series collisions is using a scrape interval shorter than 5 seconds. The minimum scrape interval supported by Managed Service for Prometheus is 5 seconds.
Exceeding the limit on the number of labels
If you see the following error, then you might have too many labels defined for one of your metrics:
- "One or more TimeSeries could not be written: The new labels would cause the metric prometheus.googleapis.com/METRIC_NAME to have over PER_PROJECT_LIMIT labels."
This error usually occurs when you rapidly change the definition of the metric so that one metric name effectively has multiple independent sets of label keys over the whole lifetime of your metric. Cloud Monitoring imposes a limit on the number of labels for each metric; for more information, see the limits for user-defined metrics.
Note: The number of labels (also called label names, label keys, or dimensions) is different from cardinality. Cardinality refers to the number of combinations of unique label values across all labels.
There are three steps to resolve this problem:
Identify why a given metric has too many or frequently changing labels.
- You can use the APIs Explorer widget on the metricDescriptors.list page to call the method. For more information, see APIs Explorer. For examples, see List metric and resource types.
Address the source of the problem, which might involve adjusting your PodMonitoring's relabeling rules, changing the exporter, or fixing your instrumentation.
Delete the metric descriptor for this metric (which incurs data loss), so it can be recreated with a smaller, more stable set of labels. You can use the metricDescriptors.delete method to do so.
The most common sources of the problem are:
Collecting metrics from exporters or applications that attach dynamic labels to metrics. For example, self-deployed cAdvisor with additional container labels and environment variables, or the DataDog agent, which injects dynamic annotations.
To resolve this, you can use a metricRelabeling section on the PodMonitoring to either keep or drop labels; a sketch of such a section follows below. Some applications and exporters also allow configuration that changes exported metrics. For example, cAdvisor has a number of advanced runtime settings that can dynamically add labels. When using managed collection, we recommend using the built-in automatic kubelet collection.
Using relabeling rules, particularly those that attach label names dynamically, which can cause an unexpected number of labels.
To resolve the problem, delete the rule entry that is causing it.
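As a sketch of the first approach, the following hypothetical PodMonitoring drops dynamically attached cAdvisor-style labels with a labeldrop rule; the resource name, selector, port, and regex are assumptions to adapt to your exporter.
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: cadvisor-example        # hypothetical
spec:
  selector:
    matchLabels:
      app: cadvisor             # hypothetical
  endpoints:
  - port: metrics
    interval: 30s
    metricRelabeling:
    # Drop dynamically attached labels; this regex is only an example.
    - action: labeldrop
      regex: container_label_.+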
Rate limits on creating and updating metrics and labels
If you see the following error, then you have hit the per-minute rate limit on creating new metrics and adding new metric labels to existing metrics:
- "Request throttled. You have hit the per-project limit on metric definition or label definition changes per minute."
This rate limit is usually only hit when first integrating with Managed Service for Prometheus, for example when you migrate an existing, mature Prometheus deployment to use self-deployed collection. This is not a rate limit on ingesting data points. This rate limit only applies when creating never-before-seen metrics or when adding new labels to existing metrics.
This quota is fixed, but any issues should automatically resolve as new metrics and metric labels get created up to the per-minute limit.
Limits on the number of metric descriptors
If you see the following error, then you have hit the quota limit for the number of metric descriptors within a single Google Cloud project:
- "Your metric descriptor quota has been exhausted."
By default, this limit is set to 25,000. Although this quota can be lifted by request if your metrics are well-formed, it is far more likely that you hit this limit because you are ingesting malformed metric names into the system.
Prometheus has a dimensional data model where information such as cluster or namespace name should get encoded as a label value. When dimensional information is instead embedded in the metric name itself, then the number of metric descriptors increases indefinitely. In addition, because in this scenario labels are not properly used, it becomes much more difficult to query and aggregate data across clusters, namespaces, or services.
Neither Cloud Monitoring nor Managed Service for Prometheus supports non-dimensional metrics, such as those formatted for StatsD or Graphite. While most Prometheus exporters are configured correctly out of the box, certain exporters, such as the StatsD exporter, the Vault exporter, or the Envoy proxy that comes with Istio, must be explicitly configured to use labels instead of embedding information in the metric name. Examples of malformed metric names include:
request_path_____path_to_a_resource____istio_request_duration_milliseconds
envoy_cluster_grpc_method_name_failure
envoy_cluster_clustername_upstream_cx_connect_ms_bucket
vault_rollback_attempt_path_name_1700683024
service__________________________________________latency_bucket
To confirm this issue, do the following:
- Within the Google Cloud console, select the Google Cloud project that is linked to the error.
In the Google Cloud console, go to the Metrics management page:
If you use the search bar to find this page, then select the result whose subheading is Monitoring.
- Confirm that the sum of Active plus Inactive metrics is over 25,000. In most situations, you should see a large number of Inactive metrics.
- Select "Inactive" in the Quick Filters panel, page through the list, and look for patterns.
- Select "Active" in the Quick Filters panel, sort by Samples billable volume descending, page through the list, and look for patterns.
- Sort by Samples billable volume ascending, page through the list, and look for patterns.
Alternatively, you can confirm this issue by using Metrics Explorer:
- Within the Google Cloud console, select the Google Cloud project that is linked to the error.
In the Google Cloud console, go to the Metrics explorer page:
If you use the search bar to find this page, then select the result whose subheading is Monitoring.
- In the query builder, click select a metric, then clear the "Active" checkbox.
- Type "prometheus" into the search bar.
- Look for any patterns in the names of metrics.
Once you have identified the patterns that indicate malformed metrics, you can mitigate the issue by fixing the exporter at the source and then deleting the offending metric descriptors.
To prevent this issue from happening again, you must first configure the relevant exporter to no longer emit malformed metrics. We recommend consulting the documentation for your exporter for help. You can confirm you have fixed the problem by manually visiting the /metrics endpoint and inspecting the exported metric names.
You can then free up your quota by deleting the malformed metrics using the projects.metricDescriptors.delete method. To more easily iterate through the list of malformed metrics, we provide a Golang script you can use. This script accepts a regular expression that can identify your malformed metrics and deletes any metric descriptors that match the pattern. As metric deletion is irreversible, we strongly recommend first running the script in dry run mode.
Some metrics are missing for short-running targets
Google Cloud Managed Service for Prometheus is deployed and there are no configuration errors; however, some metrics are missing.
Determine the deployment that generates the partially missing metrics. If the deployment is a Google Kubernetes Engine CronJob, then determine how long the job typically runs:
Find the CronJob deployment YAML file and find the status, which is listed at the end of the file. The status in this example shows that the job ran for one minute:
status:
  lastScheduleTime: "2024-04-03T16:20:00Z"
  lastSuccessfulTime: "2024-04-03T16:21:07Z"
If the run time is less than five minutes, then the job isn't running long enough for the metric data to be consistently scraped.
To resolve this situation, try the following:
Configure the job to ensure that it doesn't exit until at least five minutes have elapsed since the job started; a hypothetical sketch follows this list.
Configure the job to detect whether metrics have been scraped before exiting. This capability requires library support.
Consider creating a log-based distribution-valued metric instead of collecting metric data. This approach is suggested when data is published at a low rate. For more information, see Log-based metrics.
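One rough way to implement the first option, assuming your job image includes a shell, is to pad the job's runtime so the pod stays up long enough to be scraped, as in this hypothetical CronJob sketch.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: short-job               # hypothetical
spec:
  schedule: "*/20 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: job
            image: IMAGE        # placeholder for your job image
            # Keep the pod alive for about five minutes after the work
            # finishes so the collector can scrape its metrics.
            command: ["/bin/sh", "-c", "/run-job && sleep 300"]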
If the run time is longer than five minutes or if it is inconsistent, then see the Unhealthy targets section of this document.
Problems with collection from exporters
If your metrics from an exporter are not being ingested, check the following:
Verify that the exporter is working and exporting metrics by using the kubectl port-forward command.
For example, to check that pods with the selector app.kubernetes.io/name=redis in the namespace test are emitting metrics at the /metrics endpoint on port 9121, you can port-forward as follows:
kubectl port-forward "$(kubectl get pods -l app.kubernetes.io/name=redis -n test -o jsonpath='{.items[0].metadata.name}')" -n test 9121
Access the endpoint localhost:9121/metrics by using the browser or curl in another terminal session to verify that the metrics are being exposed by the exporter for scraping.
Check if you can query the metrics in the Google Cloud console but not Grafana. If so, then the problem is with Grafana, not the collection of your metrics.
Verify that the managed collector is able to scrape the exporter by inspecting the Prometheus web interface that the collector exposes.
Identify the managed collector running on the same node on which your exporter is running. For example, if your exporter is running on pods in the namespace test and the pods are labeled with app.kubernetes.io/name=redis, the following command identifies the managed collector running on the same node:
kubectl get pods -l app=managed-prometheus-collector --field-selector="spec.nodeName=$(kubectl get pods -l app.kubernetes.io/name=redis -n test -o jsonpath='{.items[0].spec.nodeName}')" -n gmp-system -o jsonpath='{.items[0].metadata.name}'
Set up port-forwarding from port 19090 of the managed collector:
kubectl port-forward POD_NAME -n gmp-system 19090
Navigate to the URL localhost:19090/targets to access the web interface. If the exporter is listed as one of the targets, then your managed collector is successfully scraping the exporter.
Collector Out Of Memory (OOM) errors
If you are using managed collection and encountering Out Of Memory (OOM) errors on your collectors, then consider enabling vertical pod autoscaling.
Operator Out Of Memory (OOM) errors
If you are using managed collection and encountering Out Of Memory (OOM) errors on your operator, then consider disabling the target status feature. The target status feature can cause operator performance issues in larger clusters.
Too many time series or increased 503 responses and context deadline exceeded errors, especially during peak load
You might also be encountering this issue if you see the following error message:
- "Monitored resource (abcdefg) has too many time series (prometheus metrics)"
"Context deadline exceeded" is a generic 503 error returned fromMonarch for any ingestion-side problem that doesn't have a specificcause. A very small number of "context deadline exceeded" errors is expectedwith normal use of the system.
However, you might notice a pattern where "context deadline exceeded" errorsincrease and materially impact your data ingestion. One potential root causeis that you might be incorrectly setting target labels. This is more likely ifthe following are true:
- Your "Context deadline exceeded" errors have a cyclical pattern, where theyincrease during either times of high load for you or times of high load forthe Google Cloud region specified by your
locationlabel. - You see more errors as you onboard more metric volume to the service.
- You are using the
statsd_exporterfor Prometheus,Envoy for Istio, the SNMP exporter, the Prometheus Pushgateway,kube-state-metrics, or you otherwise have a similar exporter thatintermediates and reports metrics on behalf of other resources running in yourenvironment. The problem only happens for metrics emitted by this type ofexporter. - You notice that your affected metrics tend to have the string
localhostin the value for theinstancelabel, or there are very few values for theinstancelabel. - If you have access to the in-cluster Prometheus collector query UI, you cansee that the metrics are being collected successfully.
If these points are true, it's likely that your exporter has misconfigured the resource labels in a way that conflicts with Monarch's requirements.
Monarch scales by storing related data together in a target. A target for Managed Service for Prometheus is defined by the prometheus_target resource type and the project_id, location, cluster, namespace, job, and instance labels. For more information about these labels and defaulting behavior, see Reserved labels in Managed Collection or Reserved labels in Self-deployed collection.
Of these labels, instance is the lowest-level target field and is therefore most important to get right. Efficiently storing and querying metrics in Monarch requires relatively small, diverse targets, ideally around the size of a typical VM or a container. When running Managed Service for Prometheus in typical scenarios, the open-source default behavior built into the collector usually picks good values for the job and instance labels, which is why this topic is not covered elsewhere in the documentation.
However, the default logic might fail when you are running an exporter that reports metrics on behalf of other resources in your cluster, such as the statsd_exporter. Instead of setting the value of instance to the IP:port of the resource that emits the metric, the value of instance gets set to the IP:port of the statsd_exporter itself. The issue can be compounded by the job label, as instead of relating to the metric package or service, it also lacks diversity by being set to statsd-exporter.
When this happens, all metrics that come from this exporter within a given cluster and namespace get written into the same Monarch target. As this target gets larger, writes begin failing, and you see increased "context deadline exceeded" 503 errors.
You can verify that this is happening to you by contacting Cloud Customer Care and asking them to check the "Monarch Quarantiner hospitalization logs". Include any known values for the six reserved labels in your ticket. Make sure to report the Google Cloud project that is sending the data, not the Google Cloud project of your metrics scope.
To fix this issue, you have to change your collection pipeline to use more diverse target labels. Some potential strategies, listed in order of effectiveness, include:
- Instead of running a central exporter that reports metrics on behalf of all VMs or nodes, run a separate exporter for each VM as a node agent or by deploying the exporter as a Kubernetes DaemonSet. To avoid setting the instance label to localhost, don't run the exporter on the same node as your collector.
- If, after sharding the exporter, you still need more target diversity, run multiple exporters on each VM and logically assign different sets of metrics to each exporter. Then, instead of discovering the job by using the static name statsd-exporter, use a different job name for each logical set of metrics. Instances with different values for job get assigned to different targets in Monarch.
- If you're using kube-state-metrics, use the built-in horizontal sharding to create more target diversity. Other exporters might have similar capabilities.
- If you're using OpenTelemetry or self-deployed collection, use a relabeling rule to change the value of instance from the IP:port or name of the exporter to the IP:port or unique name of the resource that is generating the metrics; a configuration sketch follows this list. It's very likely that you are already capturing the IP:port or name of the originating resource as a metrics label. You also have to set the honor_labels field to true in your Prometheus or OpenTelemetry configuration.
- If you're using OpenTelemetry or self-deployed collection, use a relabeling rule with a hashmod function to run multiple scrape jobs against the same exporter and ensure that a different instance label is chosen for each scrape configuration.
No errors and no metrics
If you are using managed collection and you don't see any errors, but data is not appearing in Cloud Monitoring, then the most likely cause is that your metric exporters or scrape configurations are not configured correctly. Managed Service for Prometheus does not send any time series data unless you first apply a valid scrape configuration.
To identify whether this is the cause, try deploying the example application and example PodMonitoring resource. If you now see the up metric (it might take a few minutes), then the problem is with your scrape configuration or exporter.
The root cause could be any number of things. We recommend checking the following; a minimal PodMonitoring sketch follows the list:
- Your PodMonitoring references a valid port.
- Your exporter's Deployment spec has properly named ports.
- Your selectors (most commonly app) match on your Deployment and PodMonitoring resources.
- You can see data at your expected endpoint and port by manually visiting it.
- You have installed your PodMonitoring resource in the same namespace as the application you wish to scrape. Do not install any custom resources or applications in the gmp-system or gke-gmp-system namespace.
- Your metric and label names match Prometheus' validating regular expression. Managed Service for Prometheus does not support label names that start with the _ character.
- You are not using a set of filters that causes all data to be filtered out. Take extra care that you don't have conflicting filters when using a collection filter in the OperatorConfig resource.
- If running outside of Google Cloud, project or project-id is set to a valid Google Cloud project and location is set to a valid Google Cloud region. You can't use global as a value for location.
- Your metric is one of the four Prometheus metric types. Some libraries, like Kube State Metrics, expose OpenMetrics metric types like Info, Stateset, and GaugeHistogram, but these metric types are not supported by Managed Service for Prometheus and are silently dropped.
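As a reference point for these checks, a minimal PodMonitoring resource might look like the following sketch; the name, namespace, selector, and port are placeholders for your own application's values.
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: example-app           # hypothetical
  namespace: default          # same namespace as the application being scraped
spec:
  selector:
    matchLabels:
      app: example-app        # must match the labels on the application's pods
  endpoints:
  - port: metrics             # must match a named port in the pod spec
    interval: 30s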
Firewalls
A firewall can cause both ingestion and query problems. Your firewall must be configured to permit both POST and GET requests to the Monitoring API service, monitoring.googleapis.com, to allow ingestion and queries.
Error about concurrent edits
The error message "Too many concurrent edits to the project configuration"is usually transient, resolving after a few minutes. It is usually causedby removing a relabeling rule that affects many different metrics. Theremoval causes the formation of a queue of updates to the metric descriptorsin your project. The error goes away when the queue is processed.
For more information, seeLimits on creating and updating metrics andlabels.
Queries blocked and cancelled by Monarch
If you see the following error, then you have hit the internal limit for the number of concurrent queries that can be run for any given project:
- "internal: expanding series: generic::aborted: invalid status monarch::220: Cancelled due to the number of queries whose evaluation is blocked waiting for memory is 501, which is equal to or greater than the limit of 500."
To protect against abuse, the system enforces a hard limit on the number of queries from one project that can run concurrently within Monarch. With typical Prometheus usage, queries should be quick and this limit should never be reached.
You might hit this limit if you are issuing a lot of concurrent queries that run for a longer-than-expected time. Queries requesting more than 25 hours of data are usually slower to execute than queries requesting less than 25 hours of data, and the longer the query lookback, the slower the query is expected to be.
Typically this issue is triggered by running lots of long-lookback rules in an inefficient way. For example, you might have many rules that run once every minute and request a 4-week rate. If each of these rules takes a long time to run, it might eventually cause a backup of queries waiting to run for your project, which then causes Monarch to throttle queries.
To resolve this issue, you need to increase the evaluation interval of your long-lookback rules so that they're not running every 1 minute. Running a query for a 4-week rate every 1 minute is unnecessary; there are 40,320 minutes in 4 weeks, so each minute gives you almost no additional signal (your data changes at most by 1/40,320th). Using a 1 hour evaluation interval should be sufficient for a query that requests a 4-week rate.
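For example, a long-lookback recording rule evaluated hourly might look like the following Rules sketch; the rule name, namespace, and metric are hypothetical.
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  name: long-lookback-rules     # hypothetical
  namespace: NAMESPACE_NAME     # hypothetical
spec:
  groups:
  - name: weekly-rates
    interval: 1h                # evaluate hourly instead of every minute
    rules:
    - record: job:requests:rate4w
      expr: sum by (job) (rate(requests_total[4w]))   # hypothetical metric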
Once you resolve the bottleneck caused by inefficient long-running queries executing too frequently, this issue should resolve itself.
Incompatible value types
If you see the following error upon ingestion or query, then you have a value-type incompatibility in your metrics:
- "Value type for metric prometheus.googleapis.com/metric_name/gauge must be INT64, but is DOUBLE"
- "Value type for metric prometheus.googleapis.com/metric_name/gauge must be DOUBLE, but is INT64"
- "One or more TimeSeries could not be written: Value type for metric prometheus.googleapis.com/target_info/gauge conflicts with the existing value type (INT64)"
You might see this error upon ingestion, as Monarch does not support writing DOUBLE-typed data to INT64-typed metrics, nor does it support writing INT64-typed data to DOUBLE-typed metrics. You also might see this error when querying using a multi-project metrics scope, as Monarch cannot union DOUBLE-typed metrics in one project with INT64-typed metrics in another project.
This error only happens when you have OpenTelemetry collectors reporting data, and it is more likely to happen if you have both OpenTelemetry (using the googlemanagedprometheus exporter) and Prometheus reporting data for the same metric, as commonly happens for the target_info metric.
The cause is likely one of the following:
- You are collecting OTLP metrics, and the OTLP metric library changed its value type from DOUBLE to INT64, as happened with OpenTelemetry's Java metrics. The new version of the metric library is now incompatible with the metric value type created by the old version of the metric library.
- You are collecting the target_info metric using both Prometheus and OpenTelemetry. Prometheus collects this metric as a DOUBLE, while OpenTelemetry collects this metric as an INT64. Your collectors are now writing two value types to the same metric in the same project, and only the collector that first created the metric descriptor is succeeding.
- You are collecting target_info using OpenTelemetry as an INT64 in one project, and you are collecting target_info using Prometheus as a DOUBLE in another project. Adding both metrics to the same metrics scope, then querying that metric through the metrics scope, causes an invalid union between incompatible metric value types.
To solve this problem, force all metric value types to DOUBLE by doing the following:
- Reconfigure your OpenTelemetry collectors to force all metrics to be a DOUBLE by enabling the exporter.googlemanagedprometheus.intToDouble feature-gate flag; a sketch follows this list.
- Delete all INT64 metric descriptors and let them get recreated as a DOUBLE. You can use the delete_metric_descriptors.go script to automate this.
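For the first option, the feature gate is typically enabled through the collector's command-line arguments. The following Deployment fragment is only a sketch; the container name, image, and config path are placeholders.
# Hypothetical fragment of an OpenTelemetry Collector Deployment spec.
      containers:
      - name: otel-collector
        image: OTEL_COLLECTOR_IMAGE          # placeholder
        args:
        - --config=/etc/otelcol/config.yaml  # placeholder config path
        # Make the googlemanagedprometheus exporter write integer values as DOUBLE.
        - --feature-gates=exporter.googlemanagedprometheus.intToDouble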
Following these steps deletes all data that is stored as an INT64 metric. There is no alternative to deleting the INT64 metrics that fully solves this problem.