Troubleshooting

A quick reference for the most common problems. Each entry lists the symptom, diagnosis steps, and fix.

Helm install fails with "context deadline exceeded" or "too many characters"

Likely cause: you are deploying via the ArgoCD UI or CLI. The chart's CRDs are very large, and ArgoCD rejects manifests above a size threshold.

Fix: Use helm install directly. See Installation. For GitOps, render with helm template and apply the generated YAML.
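
For the GitOps path, a minimal sketch, assuming the prometheus-community repo alias and the release and namespace names used elsewhere in this guide:

# Render locally, then commit/apply the generated manifests
helm template prometheus-stack prometheus-community/kube-prometheus-stack \
    --namespace monitoring --include-crds -f values.yaml > rendered.yaml

# Server-side apply sidesteps the last-applied-configuration annotation size limit
kubectl apply --server-side -f rendered.yaml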

CRDs not installed

Symptom: helm install succeeds but Prometheus/Alertmanager pods never get created. kubectl get prometheus returns error: the server doesn't have a resource type "prometheus".

Diagnosis:

kubectl get crd | grep monitoring.coreos.com

This should list around ten CRDs (the exact set depends on the chart version). If the output is empty, the crds subchart was disabled or skipped.

Fix:

crds:
  enabled: true

Then helm upgrade.
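
A sketch of that upgrade, assuming the same repo alias, release name, and namespace as the rest of this guide:

helm upgrade prometheus-stack prometheus-community/kube-prometheus-stack \
    -n monitoring -f values.yaml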

Prometheus pod stuck in Pending

Diagnosis:

kubectl -n monitoring describe pod prometheus-prometheus-stack-prometheus-0

Look at the Events section.

Common causes and fixes:

  • No suitable node: anti-affinity is too strict, or a nodeSelector / toleration matches no node. Set prometheus.prometheusSpec.podAntiAffinity: "soft" (see the snippet after this list) or relax the nodeSelector / tolerations.
  • Insufficient cpu/memory: lower prometheus.prometheusSpec.resources.requests or scale the cluster.
  • PVC binding stuck: the StorageClass has volumeBindingMode: Immediate but no PV is available. Switch to WaitForFirstConsumer or pre-provision PVs.
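
For the anti-affinity case, a minimal values sketch (podAntiAffinity here is a chart value, not a raw Kubernetes field):

prometheus:
  prometheusSpec:
    podAntiAffinity: "soft"   # prefer, but do not require, spreading replicas across nodes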

Prometheus pod CrashLoopBackOff

Diagnosis:

kubectl -n monitoring logs prometheus-prometheus-stack-prometheus-0 -c prometheus --tail=100

Common causes and fixes:

  • "out of memory killed": bump prometheus.prometheusSpec.resources.limits.memory and reduce the series count via metricRelabelings.
  • "mmap: cannot allocate memory": same as above.
  • "error parsing config file": a PrometheusRule you added is invalid. Find it with kubectl get prometheusrule -A and run promtool check rules against each.
  • "permission denied" on the data dir: the StorageClass mounts the volume as a UID Prometheus does not run as. Set prometheus.prometheusSpec.securityContext.fsGroup: 65534 (see the snippet after this list).
  • WAL replay very slow: normal after a hard restart with high churn; wait 1–5 min.
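
For the permission-denied case, the fsGroup fix from the list as a values sketch:

prometheus:
  prometheusSpec:
    securityContext:
      fsGroup: 65534      # group ownership applied to the mounted data volume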

Targets are DOWN

Open the Prometheus UI » Status » Targets and click the failing target to see the error.

  • connection refused: wrong port name in the ServiceMonitor, or the app is not listening on its metrics port. Check kubectl get svc <name> -o yaml for the port name and check netstat inside the pod.
  • 404 Not Found: wrong path. The default is /metrics; some apps use /actuator/prometheus, /-/metrics, etc.
  • tls: bad certificate / x509: TLS scrape against a self-signed cert. Add tlsConfig.insecureSkipVerify: true or mount the CA (see the sketch after this list).
  • context deadline exceeded: the endpoint is slow to respond. Increase scrapeTimeout and check the application.
  • 403 Forbidden: the app requires auth. Set bearerTokenFile or basicAuth.
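
Several of these fixes are ServiceMonitor endpoint fields. A sketch, assuming a hypothetical app in namespace my-app that serves metrics over HTTPS on a Service port named http-metrics:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: http-metrics            # must match the Service port NAME, not the number
      path: /actuator/prometheus    # override when the app does not serve /metrics
      scheme: https
      interval: 60s
      scrapeTimeout: 30s            # must stay below the scrape interval
      tlsConfig:
        insecureSkipVerify: true    # or mount and reference the CA instead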

A ServiceMonitor exists but is not picked up

Symptom: the ServiceMonitor exists and looks correct, but no new targets appear in Prometheus.

Diagnosis:

# What labels does the Prometheus CR want?
kubectl -n monitoring get prometheus -o yaml | grep -A5 serviceMonitorSelector

# What labels does my ServiceMonitor have?
kubectl -n my-app get servicemonitor my-app -o yaml | grep -A5 labels

The Operator only includes ServiceMonitors that match the selector. Either add the chart's release labels to your ServiceMonitor, or widen the selector in chart values:

prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}
    serviceMonitorNamespaceSelector: {}

Apply with helm upgrade.
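
The other option is to label your ServiceMonitor so the existing selector matches. A sketch, assuming the selector asks for release: prometheus-stack (confirm with the first diagnosis command above):

metadata:
  name: my-app
  namespace: my-app
  labels:
    release: prometheus-stack   # must match the labels required by serviceMonitorSelector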

Alerts fire but no notification arrives

Diagnosis:

# 1. Confirm Alertmanager actually received the alert
kubectl -n monitoring port-forward svc/prometheus-stack-kube-prom-alertmanager 9093
# Open http://localhost:9093 and check Alerts tab.

# 2. Check the route tree
# Status tab > Config. Confirm the alert's labels match a route.

# 3. Test the receiver
amtool alert add foo severity=critical \
    --alertmanager.url=http://localhost:9093
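
If amtool is not installed locally, it ships inside the Alertmanager container; a sketch using the pod name from this guide:

kubectl -n monitoring exec alertmanager-prometheus-stack-alertmanager-0 -c alertmanager -- \
    amtool alert add foo severity=critical --alertmanager.url=http://localhost:9093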

Common causes:

  • A silence matches the alert, or a higher-priority alert is inhibiting it.
  • The routing tree falls through to a no-op default. Add a catch-all receiver (see the sketch below).
  • Webhook URL invalid or blocked by egress NetworkPolicy. Check Alertmanager logs:
kubectl -n monitoring logs alertmanager-prometheus-stack-alertmanager-0 -c alertmanager
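
For the catch-all case, a sketch of the shape under alertmanager.config in the chart values; the receiver name and webhook URL are placeholders:

alertmanager:
  config:
    route:
      receiver: "catch-all"      # top-level receiver handles anything no child route matches
      routes: []                 # team-specific routes go here
    receivers:
      - name: "catch-all"
        webhook_configs:
          - url: "http://alert-sink.example.internal/hook"   # hypothetical endpoint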

Grafana cannot reach Prometheus

Symptom: Grafana datasource test fails with bad gateway or dial tcp ... no such host.

Diagnosis: the Grafana datasource URL must point at the in-cluster Service. The chart sets this correctly by default; if you overrode it, ensure the URL looks like:

http://prometheus-stack-kube-prom-prometheus.monitoring.svc.cluster.local:9090

If the Grafana pod cannot resolve that name, check NetworkPolicies and that cluster DNS resolves the Service's FQDN from the Grafana namespace.
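
A quick connectivity check from inside the Grafana pod, assuming the deployment name prometheus-stack-grafana and that the image ships busybox wget:

kubectl -n monitoring exec deploy/prometheus-stack-grafana -c grafana -- \
    wget -qO- http://prometheus-stack-kube-prom-prometheus.monitoring.svc.cluster.local:9090/-/healthy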

Operator validating webhook timing out

Symptom: kubectl apply -f my-rules.yaml hangs or fails with context deadline exceeded against prometheusrulemutate.monitoring.coreos.com.

Diagnosis:

kubectl -n monitoring get pods -l app=kube-prometheus-stack-operator
kubectl -n monitoring logs <operator-pod>
kubectl get validatingwebhookconfiguration | grep prometheus

Fix options:

  • The operator pod is crashlooping: fix that first.
  • The webhook cert is expired. Operator-managed certs rotate automatically; if rotation is stuck, restart the operator pod (see the command below).
  • As a last-resort emergency override, temporarily disable the webhook:
prometheusOperator:
  admissionWebhooks:
    enabled: false

Re-enable it as soon as possible: without the webhook, syntactically invalid rules are admitted silently and can cause Prometheus to fail to load its configuration.
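
To restart the operator for the crashloop or stuck-cert cases, a sketch assuming the deployment name follows the same prefix as the Services in this guide:

kubectl -n monitoring rollout restart deployment/prometheus-stack-kube-prom-operator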

Out-of-memory killer kills Prometheus repeatedly

This is almost always cardinality. See Day 2 » Cardinality watch. Find the offender and drop labels with metricRelabelings.
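
A sketch of dropping an offending label on a ServiceMonitor endpoint; the port and label names here are hypothetical:

endpoints:
  - port: http-metrics
    metricRelabelings:
      - action: labeldrop
        regex: "request_id"      # high-cardinality label to drop at scrape time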

If the cardinality is legitimate, scale up:

prometheus:
  prometheusSpec:
    resources:
      limits: { cpu: "4", memory: 16Gi }
    replicas: 2          # 2 replicas in HA, NOT shards

Only use shards: N if you have actual evidence of >1M active series and have planned for the operational complexity (sharded scrape config, queries spanning shards via Thanos).

Useful kubectl snippets

# All chart-owned resources
kubectl -n monitoring get all,servicemonitor,podmonitor,prometheusrule,alertmanagerconfig \
    -l app.kubernetes.io/instance=prometheus-stack

# Operator events
kubectl -n monitoring get events --sort-by=.lastTimestamp \
    | grep -i operator

# Prometheus config currently loaded
kubectl -n monitoring exec sts/prometheus-prometheus-stack-prometheus -c prometheus \
    -- wget -qO- http://localhost:9090/api/v1/status/config

# Reload Prometheus config (without pod restart)
kubectl -n monitoring exec sts/prometheus-prometheus-stack-prometheus -c prometheus \
    -- wget -qO- --post-data="" http://localhost:9090/-/reload