Day 2 tasks¶
Routine operational tasks once the chart is installed and traffic is flowing through Prometheus and Alertmanager.
Checking stack health at a glance¶
# All chart-owned pods Running, none CrashLoopBackOff
kubectl -n monitoring get pods -l app.kubernetes.io/instance=prometheus-stack
# Operator is healthy
kubectl -n monitoring get deploy prometheus-stack-kube-prom-operator
# Prometheus and Alertmanager StatefulSets at desired replicas
kubectl -n monitoring get statefulset
# CR status conditions
kubectl -n monitoring get prometheus,alertmanager
The Operator updates .status.conditions on the Prometheus and
Alertmanager CRs — that is the canonical "is it converged" signal.
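An illustrative shape of that status block (condition types as set by recent Prometheus Operator releases; the CR name depends on your Helm release name):
# kubectl -n monitoring get prometheus prometheus-stack-kube-prom-prometheus -o yaml
status:
  availableReplicas: 1
  replicas: 1
  conditions:
    - type: Available      # expected Prometheus pods are up and ready
      status: "True"
    - type: Reconciled     # the Operator has applied the latest spec
      status: "True"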
Reviewing scrape health¶
Open Prometheus UI » Status » Targets. You want every target to be
UP. Anything DOWN for more than a few minutes deserves a look.
A quick PromQL check on overall scrape health:
# Targets currently down by job
sum by (job) (up == 0)
# Targets that have been down for >10m
sum by (job) (max_over_time(up[10m]) == 0)
Reviewing alert volume¶
Noisy alerting is the #1 reason teams disable monitoring. Track and trim:
# Top currently firing alerts (by number of firing series)
topk(10, sum by (alertname) (ALERTS{alertstate="firing"}))
# Alert change rate over 24h
sum by (alertname) (changes(ALERTS{alertstate="firing"}[24h]))
For alerts that fire more than ~5 times a day without anyone acting on them,
either fix the underlying issue or rewrite the rule (longer for: duration, better
threshold, narrower scope).
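As a sketch of that kind of rewrite, a PrometheusRule with a longer for: and a tighter, job-scoped expression; the alert name, metric, and threshold are hypothetical:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts              # hypothetical rule
  namespace: monitoring
  labels:
    release: prometheus-stack      # match the chart's default ruleSelector labels
spec:
  groups:
    - name: my-app
      rules:
        - alert: HighErrorRate
          # narrower scope (one job) and a higher threshold than the noisy original
          expr: |
            sum(rate(http_requests_total{job="my-app", code=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="my-app"}[5m])) > 0.05
          for: 15m                 # longer "for:" so short blips don't page anyone
          labels:
            severity: warning
          annotations:
            summary: "my-app 5xx ratio above 5% for 15 minutes"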
Cardinality watch¶
High cardinality is the #1 cause of Prometheus running out of memory.
# Series count by metric (run in Prometheus UI)
topk(10, count by (__name__) ({__name__=~".+"}))
# Series count by job
topk(10, count by (job) ({__name__=~".+"}))
If a job dominates, find the offender label:
topk(10, count by (handler, instance, ...) ({job="my-app"}))
Drop unwanted labels with metricRelabelings in the matching ServiceMonitor.
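For example, a hedged ServiceMonitor sketch that drops one high-cardinality metric and one per-request label before ingestion; the app, metric, and label names are illustrative:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                       # hypothetical app
  namespace: monitoring
  labels:
    release: prometheus-stack        # match the chart's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: http-metrics
      metricRelabelings:
        # drop an entire high-cardinality metric before it is stored
        - sourceLabels: [__name__]
          regex: http_request_duration_seconds_bucket
          action: drop
        # strip a per-request label that explodes series counts
        - regex: request_id
          action: labeldrop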
Storage usage¶
# How many bytes per series on disk
prometheus_tsdb_storage_blocks_bytes / on(instance) prometheus_tsdb_head_series
# Disk used vs available
sum(kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"prometheus-.*"})
/ sum(kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"prometheus-.*"})
If your Prometheus PVC is climbing past ~80% utilization, either:
- Reduce retention (prometheus.prometheusSpec.retention); see the values sketch below.
- Shed metrics with metricRelabelings drops.
- Resize the PVC and update prometheus.prometheusSpec.storageSpec. Note that Helm cannot grow PVCs in place: you may need to use the StatefulSet PVC resize workflow or recreate the Prometheus pod after the PVC has been resized.
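A values sketch for the retention and storage options above; retention, sizes, and the storageClassName are assumptions you should adapt:
prometheus:
  prometheusSpec:
    retention: 15d
    retentionSize: 45GB              # cap on-disk size before the PVC fills
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd # hypothetical storage class
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi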
Backups¶
Prometheus is generally treated as disposable — if you lose the volume, you lose history but not real-time monitoring. For long-term metric retention, integrate with Thanos / Cortex / Mimir rather than backing up the local TSDB.
What is worth backing up:
- values.yaml and any override files (under version control).
- AlertmanagerConfig resources (under version control).
- PrometheusRule resources (under version control).
- Grafana persistent volume (saves user-created dashboards, alerts, API keys). Either snapshot the PVC or use Grafana's built-in provisioning to keep dashboards as code; see the sketch below.
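For the dashboards-as-code route, the Grafana sidecar in this chart loads dashboards from ConfigMaps carrying the grafana_dashboard label (a minimal, hypothetical example):
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-team-dashboards            # hypothetical; keep this manifest in git
  namespace: monitoring
  labels:
    grafana_dashboard: "1"            # default sidecar label in the Grafana subchart
data:
  my-app-overview.json: |
    { "title": "my-app overview", "panels": [] }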
Rotating credentials¶
| Credential | How to rotate |
|---|---|
| Grafana admin password | Update grafana.adminPassword in values, helm upgrade, restart Grafana pod. Or use grafana.admin.existingSecret + rotate the Secret. |
| Alertmanager receiver tokens (Slack, PagerDuty) | Move them into a Kubernetes Secret and list it under alertmanager.alertmanagerSpec.secrets; the Operator mounts it at /etc/alertmanager/secrets/<name>. Then point alertmanager.config at the mounted files (for example the *_file fields such as slack_api_url_file). |
| Ingress basic-auth | Recreate the auth Secret (htpasswd) referenced by the ingress annotation. |
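A hedged values sketch for the two Secret-based rows above; both Secret names are hypothetical and must already exist in the monitoring namespace:
grafana:
  admin:
    existingSecret: grafana-admin-credentials   # hypothetical Secret
    userKey: admin-user
    passwordKey: admin-password
alertmanager:
  alertmanagerSpec:
    # mounted read-only under /etc/alertmanager/secrets/<name>
    secrets:
      - alertmanager-receiver-tokens             # hypothetical Secret holding Slack/PagerDuty keys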
Capacity sizing checks¶
Run these every quarter on a real production Prometheus:
# Active series
prometheus_tsdb_head_series
# Samples ingested per second
rate(prometheus_tsdb_head_samples_appended_total[5m])
# Bytes per sample (should be ~1.5 - 2 bytes)
rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[1h])
/ rate(prometheus_tsdb_compaction_chunk_samples_sum[1h])
# Memory pressure
process_resident_memory_bytes{job="prometheus"}
If process_resident_memory_bytes is regularly >70% of the pod's memory
limit, either bump the limit or shed series.
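Bumping the limit is a small values change; the numbers below are placeholders to size from the queries above:
prometheus:
  prometheusSpec:
    resources:
      requests:
        memory: 4Gi        # illustrative; base it on observed RSS
      limits:
        memory: 8Gi        # keep observed RSS below ~70% of this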
Documentation hygiene¶
Whenever you add or remove a ServiceMonitor, PrometheusRule, or
AlertmanagerConfig, update the runbook URL in the annotations.runbook_url
field of the affected alerts. Operators paged at 3 AM rely on those links.
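As a reminder of where that field lives, a minimal rule fragment with a placeholder alert name and URL:
# Fragment of a PrometheusRule rule; only the annotations block matters here.
- alert: MyAppDown                     # hypothetical alert
  expr: up{job="my-app"} == 0
  for: 5m
  annotations:
    summary: "my-app target has been down for 5 minutes"
    runbook_url: https://runbooks.example.com/my-app/down   # keep this link current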