Troubleshooting¶
Quick-reference for the most common problems. Each entry lists the symptom, diagnosis steps, and fix.
Helm install fails with "context deadline exceeded" or "too many characters"¶
Likely cause: You are deploying via ArgoCD UI/CLI. The chart's CRDs are very large and the ArgoCD UI / CLI rejects manifests above a size threshold.
Fix: Use helm install directly. See Installation.
For GitOps, render with helm template and apply the generated YAML.
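A minimal sketch of that workflow (the release name prometheus-stack, namespace monitoring, and the prometheus-community repo alias are assumptions):
# Render the chart locally, including CRDs, then apply server-side so the
# large CRDs do not hit the client-side annotation size limit.
helm template prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --include-crds -f values.yaml > rendered.yaml
kubectl apply --server-side -f rendered.yaml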
CRDs not installed¶
Symptom: helm install succeeds but Prometheus/Alertmanager pods never
get created. kubectl get prometheus returns
error: the server doesn't have a resource type "prometheus".
Diagnosis:
kubectl get crd | grep monitoring.coreos.com
Should list 10 CRDs. If empty, the crds subchart was disabled or skipped.
Fix:
crds:
  enabled: true
Then helm upgrade.
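For example, assuming the release name and namespace used elsewhere on this page:
helm upgrade prometheus-stack prometheus-community/kube-prometheus-stack \
  -n monitoring -f values.yaml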
Prometheus pod stuck in Pending¶
Diagnosis:
kubectl -n monitoring describe pod prometheus-prometheus-stack-prometheus-0
Look at the Events section.
Common causes and fixes:
- No suitable node: anti-affinity is too strict. Set prometheus.prometheusSpec.podAntiAffinity: "soft" or remove specific nodeSelector/tolerations (see the values sketch below).
- Insufficient cpu/memory: lower prometheus.prometheusSpec.resources.requests or scale the cluster.
- PVC binding stuck: the StorageClass has volumeBindingMode: Immediate but no PV is available. Switch to WaitForFirstConsumer or pre-provision PVs.
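A values sketch for the first two causes (the numbers are illustrative, not recommendations):
prometheus:
  prometheusSpec:
    podAntiAffinity: "soft"       # allow co-scheduling on small clusters
    resources:
      requests:
        cpu: 500m                 # illustrative; size for your cluster
        memory: 2Gi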
Prometheus pod CrashLoopBackOff¶
Diagnosis:
kubectl -n monitoring logs prometheus-prometheus-stack-prometheus-0 -c prometheus --tail=100
Common causes and fixes:
| Log line | Fix |
|---|---|
| out of memory killed | Bump prometheus.prometheusSpec.resources.limits.memory. Reduce series count via metricRelabelings. |
| mmap: cannot allocate memory | Same as above. |
| error parsing config file | A PrometheusRule you added is invalid. Find it: kubectl get prometheusrule -A and run promtool check rules against each. |
| permission denied on data dir | StorageClass mounts as a UID Prometheus does not run as. Set prometheus.prometheusSpec.securityContext.fsGroup: 65534. |
| WAL replay very slow | Normal after a hard restart with high churn. Wait 1–5 min. |
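For the error parsing config file case, you can lint a suspect rule outside the cluster. A sketch assuming jq, yq (v4) and promtool are installed locally, with <ns> and <name> as placeholders:
# A PrometheusRule's .spec is the same format promtool expects in a rule file.
kubectl -n <ns> get prometheusrule <name> -o json | jq '.spec' | yq -P '.' > /tmp/rules.yaml
promtool check rules /tmp/rules.yaml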
Targets are DOWN¶
Open Prometheus UI » Status » Targets » click the failing target to see the error.
| Error | Likely cause | Fix |
|---|---|---|
| connection refused | Wrong port name in ServiceMonitor or app not listening on metrics port | Check kubectl get svc <name> -o yaml for the port name; check netstat inside the pod. |
| 404 Not Found | Wrong path | Default is /metrics; some apps use /actuator/prometheus, /-/metrics, etc. |
| tls: bad certificate / x509 | TLS scrape against a self-signed cert | Add tlsConfig.insecureSkipVerify: true or mount the CA. |
| context deadline exceeded | Endpoint slow to respond | Increase scrapeTimeout; check the application. |
| 403 Forbidden | App requires auth | Set bearerTokenFile or basicAuth. |
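Several of these fixes live on the ServiceMonitor endpoint. A hypothetical example combining them (all names and values are illustrative):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics                  # must match the Service port *name*, not the number
      path: /actuator/prometheus     # non-default scrape path
      scrapeTimeout: 25s             # must stay <= the scrape interval
      scheme: https
      tlsConfig:
        insecureSkipVerify: true     # quick fix for self-signed certs; mounting the CA is better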
A ServiceMonitor exists but is not picked up¶
Symptom: No new targets appear in Prometheus even though the ServiceMonitor object itself looks correct.
Diagnosis:
# What labels does the Prometheus CR want?
kubectl -n monitoring get prometheus -o yaml | grep -A5 serviceMonitorSelector
# What labels does my ServiceMonitor have?
kubectl -n my-app get servicemonitor my-app -o yaml | grep -A5 labels
The Operator only includes ServiceMonitors that match the selector. Either
add the chart's release labels to your ServiceMonitor, or widen the
selector in chart values:
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}
    serviceMonitorNamespaceSelector: {}
Apply with helm upgrade.
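Alternatively, keep the default selector and label the ServiceMonitor so it matches. The default selector keys on the release label, so (assuming the release is named prometheus-stack):
metadata:
  labels:
    release: prometheus-stack    # must equal the Helm release name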
Alerts fire but no notification arrives¶
Diagnosis:
# 1. Confirm Alertmanager actually received the alert
kubectl -n monitoring port-forward svc/prometheus-stack-kube-prom-alertmanager 9093
# Open http://localhost:9093 and check Alerts tab.
# 2. Check the route tree
# Status tab > Config. Confirm the alert's labels match a route.
# 3. Test the receiver
amtool alert add foo severity=critical \
--alertmanager.url=http://localhost:9093
Common causes:
- A higher-priority silence is matching the alert.
- Routing tree falls through to a no-op default. Add a catch-all receiver (see the route sketch after this list).
- Webhook URL invalid or blocked by egress NetworkPolicy. Check Alertmanager logs:
kubectl -n monitoring logs alertmanager-prometheus-stack-alertmanager-0 -c alertmanager
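A minimal route sketch for the catch-all case, set through the chart's alertmanager.config value (receiver names and webhook URLs are placeholders):
alertmanager:
  config:
    route:
      receiver: "catch-all"              # nothing falls through silently
      routes:
        - receiver: "oncall"
          matchers:
            - severity = "critical"
    receivers:
      - name: "oncall"
        webhook_configs:
          - url: "https://pager.example.invalid/hook"
      - name: "catch-all"
        webhook_configs:
          - url: "https://chat.example.invalid/hook"   # low-urgency channel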
Grafana cannot reach Prometheus¶
Symptom: Grafana datasource test fails with bad gateway or dial tcp
... no such host.
Diagnosis: Grafana datasource URL must be the in-cluster Service. The chart sets it correctly by default. If you overrode it, ensure the URL is something like:
http://prometheus-stack-kube-prom-prometheus.monitoring.svc.cluster.local:9090
If the Grafana pod cannot resolve that name, check NetworkPolicies and cluster DNS.
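A quick in-cluster check from a throwaway pod (the Service name is the chart default shown above):
kubectl -n monitoring run net-test --rm -it --restart=Never --image=busybox:1.36 -- \
  wget -qO- http://prometheus-stack-kube-prom-prometheus.monitoring.svc.cluster.local:9090/-/ready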
Operator validating webhook timing out¶
Symptom: kubectl apply -f my-rules.yaml hangs or fails with
context deadline exceeded against
prometheusrulemutate.monitoring.coreos.com.
Diagnosis:
kubectl -n monitoring get pods -l app=kube-prometheus-stack-operator
kubectl -n monitoring logs <operator-pod>
kubectl get validatingwebhookconfiguration | grep prometheus
Fix options:
- Operator pod is crashlooping — fix that first.
- Webhook cert is expired (Operator-managed certs rotate automatically; if rotation is stuck, restart the operator pod using the command shown below).
- As a last-resort emergency override, temporarily disable the webhook:
prometheusOperator:
  admissionWebhooks:
    enabled: false
Re-enable as soon as possible — without it, syntactically invalid rules silently cause Prometheus to fail to load config.
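To restart the operator for the expired-cert case (the deployment name follows the naming used elsewhere on this page; confirm it with kubectl -n monitoring get deploy):
kubectl -n monitoring rollout restart deploy/prometheus-stack-kube-prom-operator
kubectl -n monitoring rollout status deploy/prometheus-stack-kube-prom-operator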
Out-of-memory killer kills Prometheus repeatedly¶
This is almost always cardinality. See
Day 2 » Cardinality watch. Find the offender
and drop labels with metricRelabelings.
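A hypothetical metricRelabelings snippet on a ServiceMonitor endpoint; the label and metric names are placeholders for whatever your cardinality analysis finds:
endpoints:
  - port: metrics
    metricRelabelings:
      - action: labeldrop
        regex: request_id                              # drop a high-cardinality label
      - action: drop
        sourceLabels: [__name__]
        regex: http_request_duration_seconds_bucket    # or drop a heavy metric family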
If the cardinality is legitimate, scale up:
prometheus:
  prometheusSpec:
    resources:
      limits: { cpu: "4", memory: 16Gi }
    replicas: 2   # 2 replicas in HA, NOT shards
Only use shards: N if you have actual evidence of >1M active series and
have planned for the operational complexity (sharded scrape config, queries
spanning shards via Thanos).
Useful kubectl snippets¶
# All chart-owned resources
kubectl -n monitoring get all,servicemonitor,podmonitor,prometheusrule,alertmanagerconfig \
-l app.kubernetes.io/instance=prometheus-stack
# Operator events
kubectl -n monitoring get events --sort-by=.lastTimestamp \
| grep -i operator
# Prometheus config currently loaded
kubectl -n monitoring exec sts/prometheus-prometheus-stack-prometheus -c prometheus \
-- wget -qO- http://localhost:9090/api/v1/status/config
# Reload Prometheus config (without pod restart)
kubectl -n monitoring exec sts/prometheus-prometheus-stack-prometheus -c prometheus \
-- wget -qO- --post-data="" http://localhost:9090/-/reload