Troubleshooting

A quick reference for the most common problems. Each entry lists the symptom, diagnosis steps, and fix.

Helm install fails with "context deadline exceeded" or "too many characters"

Likely cause: you are deploying via the ArgoCD UI or CLI. The chart's CRDs are very large, and ArgoCD rejects manifests above a size threshold.

Fix: Use helm install directly. See Installation. For GitOps, render with helm template and apply the generated YAML.
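
For the GitOps path, a minimal sketch, assuming the prometheus-community repo alias and the release and namespace names used elsewhere in this guide:

# Render locally, then commit/apply the generated manifests
helm template prometheus-stack prometheus-community/kube-prometheus-stack \
    --namespace monitoring --include-crds -f values.yaml > rendered.yaml

# Server-side apply sidesteps the last-applied-configuration annotation size limit
kubectl apply --server-side -f rendered.yaml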

CRDs not installed

Symptom: helm install succeeds but Prometheus/Alertmanager pods never get created. kubectl get prometheus returns error: the server doesn't have a resource type "prometheus".

Diagnosis:

kubectl get crd | grep monitoring.coreos.com

This should list around ten CRDs (the exact set depends on the chart version). If the output is empty, the crds subchart was disabled or skipped.

Fix:

crds:
  enabled: true

Then helm upgrade.
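
A sketch of that upgrade, assuming the same repo alias, release name, and namespace as the rest of this guide:

helm upgrade prometheus-stack prometheus-community/kube-prometheus-stack \
    -n monitoring -f values.yaml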

Prometheus pod stuck in Pending

Diagnosis:

kubectl -n monitoring describe pod prometheus-prometheus-stack-prometheus-0

Look at the Events section.

Common causes and fixes:

  • No suitable node: anti-affinity is too strict, or a nodeSelector / toleration matches no node. Set prometheus.prometheusSpec.podAntiAffinity: "soft" (see the snippet after this list) or relax the nodeSelector / tolerations.
  • Insufficient cpu/memory: lower prometheus.prometheusSpec.resources.requests or scale the cluster.
  • PVC binding stuck: the StorageClass has volumeBindingMode: Immediate but no PV is available. Switch to WaitForFirstConsumer or pre-provision PVs.
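
For the anti-affinity case, a minimal values sketch (podAntiAffinity here is a chart value, not a raw Kubernetes field):

prometheus:
  prometheusSpec:
    podAntiAffinity: "soft"   # prefer, but do not require, spreading replicas across nodes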

Prometheus pod CrashLoopBackOff

Diagnosis:

kubectl -n monitoring logs prometheus-prometheus-stack-prometheus-0 -c prometheus --tail=100

Common causes and fixes:

  • "out of memory killed": bump prometheus.prometheusSpec.resources.limits.memory and reduce the series count via metricRelabelings.
  • "mmap: cannot allocate memory": same as above.
  • "error parsing config file": a PrometheusRule you added is invalid. Find it with kubectl get prometheusrule -A and run promtool check rules against each.
  • "permission denied" on the data dir: the StorageClass mounts the volume as a UID Prometheus does not run as. Set prometheus.prometheusSpec.securityContext.fsGroup: 65534 (see the snippet after this list).
  • WAL replay very slow: normal after a hard restart with high churn; wait 1–5 min.
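
For the permission-denied case, the fsGroup fix from the list as a values sketch:

prometheus:
  prometheusSpec:
    securityContext:
      fsGroup: 65534      # group ownership applied to the mounted data volume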

Targets are DOWN

Open the Prometheus UI » Status » Targets and click the failing target to see the error.

  • connection refused: wrong port name in the ServiceMonitor, or the app is not listening on its metrics port. Check kubectl get svc <name> -o yaml for the port name and check netstat inside the pod.
  • 404 Not Found: wrong path. The default is /metrics; some apps use /actuator/prometheus, /-/metrics, etc.
  • tls: bad certificate / x509: TLS scrape against a self-signed cert. Add tlsConfig.insecureSkipVerify: true or mount the CA (see the sketch after this list).
  • context deadline exceeded: the endpoint is slow to respond. Increase scrapeTimeout and check the application.
  • 403 Forbidden: the app requires auth. Set bearerTokenFile or basicAuth.
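
Several of these fixes are ServiceMonitor endpoint fields. A sketch, assuming a hypothetical app in namespace my-app that serves metrics over HTTPS on a Service port named http-metrics:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: http-metrics            # must match the Service port NAME, not the number
      path: /actuator/prometheus    # override when the app does not serve /metrics
      scheme: https
      interval: 60s
      scrapeTimeout: 30s            # must stay below the scrape interval
      tlsConfig:
        insecureSkipVerify: true    # or mount and reference the CA instead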

A ServiceMonitor exists but is not picked up

Symptom: the ServiceMonitor exists and looks correct, but no new targets appear in Prometheus.

Diagnosis:

# What labels does the Prometheus CR want?
kubectl -n monitoring get prometheus -o yaml | grep -A5 serviceMonitorSelector

# What labels does my ServiceMonitor have?
kubectl -n my-app get servicemonitor my-app -o yaml | grep -A5 labels

The Operator only includes ServiceMonitors that match the selector. Either add the chart's release labels to your ServiceMonitor, or widen the selector in chart values:

prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}
    serviceMonitorNamespaceSelector: {}

Apply with helm upgrade.
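
The other option is to label your ServiceMonitor so the existing selector matches. A sketch, assuming the selector asks for release: prometheus-stack (confirm with the first diagnosis command above):

metadata:
  name: my-app
  namespace: my-app
  labels:
    release: prometheus-stack   # must match the labels required by serviceMonitorSelector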

Alerts fire but no notification arrives

Diagnosis:

# 1. Confirm Alertmanager actually received the alert
kubectl -n monitoring port-forward svc/prometheus-stack-kube-prom-alertmanager 9093
# Open http://localhost:9093 and check Alerts tab.

# 2. Check the route tree
# Status tab > Config. Confirm the alert's labels match a route.

# 3. Test the receiver
amtool alert add foo severity=critical \
    --alertmanager.url=http://localhost:9093
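
If amtool is not installed locally, it ships inside the Alertmanager container; a sketch using the pod name from this guide:

kubectl -n monitoring exec alertmanager-prometheus-stack-alertmanager-0 -c alertmanager -- \
    amtool alert add foo severity=critical --alertmanager.url=http://localhost:9093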

Common causes:

  • A silence matches the alert, or a higher-priority alert is inhibiting it.
  • The routing tree falls through to a no-op default. Add a catch-all receiver (see the sketch below).
  • Webhook URL invalid or blocked by egress NetworkPolicy. Check Alertmanager logs:
kubectl -n monitoring logs alertmanager-prometheus-stack-alertmanager-0 -c alertmanager
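
For the catch-all case, a sketch of the shape under alertmanager.config in the chart values; the receiver name and webhook URL are placeholders:

alertmanager:
  config:
    route:
      receiver: "catch-all"      # top-level receiver handles anything no child route matches
      routes: []                 # team-specific routes go here
    receivers:
      - name: "catch-all"
        webhook_configs:
          - url: "http://alert-sink.example.internal/hook"   # hypothetical endpoint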

Grafana cannot reach Prometheus

Symptom: Grafana datasource test fails with bad gateway or dial tcp ... no such host.

Diagnosis: the Grafana datasource URL must point at the in-cluster Service. The chart sets this correctly by default; if you overrode it, ensure the URL looks like:

http://prometheus-stack-kube-prom-prometheus.monitoring.svc.cluster.local:9090

If the Grafana pod cannot resolve that name, check NetworkPolicies and that cluster DNS resolves the Service's FQDN from the Grafana namespace.
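
A quick connectivity check from inside the Grafana pod, assuming the deployment name prometheus-stack-grafana and that the image ships busybox wget:

kubectl -n monitoring exec deploy/prometheus-stack-grafana -c grafana -- \
    wget -qO- http://prometheus-stack-kube-prom-prometheus.monitoring.svc.cluster.local:9090/-/healthy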

Operator validating webhook timing out

Symptom: kubectl apply -f my-rules.yaml hangs or fails with context deadline exceeded against prometheusrulemutate.monitoring.coreos.com.

Diagnosis:

kubectl -n monitoring get pods -l app=kube-prometheus-stack-operator
kubectl -n monitoring logs <operator-pod>
kubectl get validatingwebhookconfiguration | grep prometheus

Fix options:

  • The operator pod is crashlooping: fix that first.
  • The webhook cert is expired. Operator-managed certs rotate automatically; if rotation is stuck, restart the operator pod (see the command below).
  • As a last-resort emergency override, temporarily disable the webhook:
prometheusOperator:
  admissionWebhooks:
    enabled: false

Re-enable it as soon as possible: without the webhook, syntactically invalid rules are admitted silently and can cause Prometheus to fail to load its configuration.
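
To restart the operator for the crashloop or stuck-cert cases, a sketch assuming the deployment name follows the same prefix as the Services in this guide:

kubectl -n monitoring rollout restart deployment/prometheus-stack-kube-prom-operator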

Out-of-memory killer kills Prometheus repeatedly

This is almost always cardinality. See Day 2 » Cardinality watch. Find the offender and drop labels with metricRelabelings.
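
A sketch of dropping an offending label on a ServiceMonitor endpoint; the port and label names here are hypothetical:

endpoints:
  - port: http-metrics
    metricRelabelings:
      - action: labeldrop
        regex: "request_id"      # high-cardinality label to drop at scrape time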

If the cardinality is legitimate, scale up:

prometheus:
  prometheusSpec:
    resources:
      limits: { cpu: "4", memory: 16Gi }
    replicas: 2          # 2 replicas in HA, NOT shards

Only use shards: N if you have actual evidence of >1M active series and have planned for the operational complexity (sharded scrape config, queries spanning shards via Thanos).

Useful kubectl snippets

# All chart-owned resources
kubectl -n monitoring get all,servicemonitor,podmonitor,prometheusrule,alertmanagerconfig \
    -l app.kubernetes.io/instance=prometheus-stack

# Operator events
kubectl -n monitoring get events --sort-by=.lastTimestamp \
    | grep -i operator

# Prometheus config currently loaded
kubectl -n monitoring exec sts/prometheus-prometheus-stack-prometheus -c prometheus \
    -- wget -qO- http://localhost:9090/api/v1/status/config

# Reload Prometheus config (without pod restart)
kubectl -n monitoring exec sts/prometheus-prometheus-stack-prometheus -c prometheus \
    -- wget -qO- --post-data="" http://localhost:9090/-/reload