Day 2 tasks

Routine operational tasks once the chart is installed and traffic is flowing through Prometheus and Alertmanager.

Checking stack health at a glance

# All chart-owned pods should be Running, with none in CrashLoopBackOff
kubectl -n monitoring get pods -l app.kubernetes.io/instance=prometheus-stack

# Operator is healthy
kubectl -n monitoring get deploy prometheus-stack-kube-prom-operator

# Prometheus and Alertmanager StatefulSets at desired replicas
kubectl -n monitoring get statefulset

# CR status conditions
kubectl -n monitoring get prometheus,alertmanager

The Operator updates .status.conditions on the Prometheus and Alertmanager CRs — that is the canonical "is it converged" signal.
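A quick way to read those conditions without opening the full YAML; the jsonpath below is just one way to slice it, and on a recent Operator both Available and Reconciled should report True:

# Condition type, status, and reason for every Prometheus CR in the namespace
kubectl -n monitoring get prometheus \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{range .status.conditions[*]}  {.type}={.status} ({.reason}){"\n"}{end}{end}'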

Reviewing scrape health

Open Prometheus UI » Status » Targets. You want every target to be UP. Anything DOWN for more than a few minutes deserves a look.

A quick PromQL check on overall scrape health:

# Targets currently down, counted by job
count by (job) (up == 0)

# Targets down now that were also down 10m ago (rough proxy for "down for >10m")
count by (job) ((up == 0) and (up offset 10m == 0))
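The same data is available from the Targets API if you prefer a terminal. A minimal sketch, assuming a port-forward to the Prometheus service (the service name below follows the default kube-prometheus-stack naming for this release and may differ in your cluster):

kubectl -n monitoring port-forward svc/prometheus-stack-kube-prom-prometheus 9090 &

# List unhealthy targets with their job, scrape URL, and last error
curl -s 'http://localhost:9090/api/v1/targets?state=active' \
  | jq -r '.data.activeTargets[] | select(.health != "up") | [.labels.job, .scrapeUrl, .lastError] | @tsv'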

Reviewing alert volume

Noisy alerting is the #1 reason teams disable monitoring. Track and trim:

# Currently firing alert instances, grouped by alert name
topk(10, sum by (alertname) (ALERTS{alertstate="firing"}))

# Time spent firing per alert over the last 24h (in rule-evaluation samples)
sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[24h]))

For any alert that fires more than ~5 times a day without anyone acting on it, either fix the underlying issue or rewrite the rule: a longer for:, a better threshold, or a narrower scope.
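As a sketch of the rewrite option, here is what a tightened rule might look like as a PrometheusRule. The rule name, expression, and threshold are placeholders rather than anything shipped with the chart, and the release label assumes the default ruleSelector behaviour:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-latency               # hypothetical rule
  namespace: monitoring
  labels:
    release: prometheus-stack        # so the Operator's ruleSelector picks it up
spec:
  groups:
    - name: my-app.rules
      rules:
        - alert: MyAppHighLatency
          # fire only on sustained breaches, not brief spikes
          expr: |
            histogram_quantile(0.99,
              sum by (le) (rate(http_request_duration_seconds_bucket{job="my-app"}[5m]))) > 1
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: p99 latency above 1s for 15 minutes
            runbook_url: https://example.com/runbooks/my-app-latency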

Cardinality watch

High cardinality is the #1 cause of Prometheus running out of memory.

# Series count by metric (run in Prometheus UI)
topk(10, count by (__name__) ({__name__=~".+"}))

# Series count by job
topk(10, count by (job) ({__name__=~".+"}))

If a single job dominates, find the offending label:

topk(10, count by (handler, instance, ...) ({job="my-app"}))

Drop unwanted labels with metricRelabelings in the matching ServiceMonitor.
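For example, a ServiceMonitor could drop a hypothetical high-cardinality label like this; the monitor, app, port, and label names are placeholders:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                       # hypothetical monitor
  namespace: monitoring
  labels:
    release: prometheus-stack
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: http
      metricRelabelings:
        # remove the per-request label that explodes series count
        - action: labeldrop
          regex: request_id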

Storage usage

# Rough on-disk bytes per active series
prometheus_tsdb_storage_blocks_bytes / on(instance) prometheus_tsdb_head_series

# Disk used vs available
sum(kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"prometheus-.*"})
  / sum(kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"prometheus-.*"})

If your Prometheus PVC is climbing past ~80% utilization, do one of the following (a values sketch for the first and last options follows the list):

  • Reduce retention (prometheus.prometheusSpec.retention).
  • Shed metrics with metricRelabelings drops.
  • Resize the PVC and update prometheus.prometheusSpec.storageSpec. Note that a Helm upgrade alone will not grow an existing PVC: StatefulSet volume claim templates are immutable, so you may need to expand the PVC directly (the StorageClass must allow volume expansion) and follow the StatefulSet PVC resize workflow, or recreate the Prometheus pod once the PVC has been resized.
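A minimal values sketch for the retention and storage knobs; the sizes, retention, and storage class are illustrative, not recommendations:

# values override for the kube-prometheus-stack chart
prometheus:
  prometheusSpec:
    retention: 10d                     # keep less history if disk is the constraint
    retentionSize: 45GB                # optional hard cap below the PVC size
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: standard   # assumption: replace with your StorageClass
          resources:
            requests:
              storage: 50Gi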

Backups

Prometheus is generally treated as disposable — if you lose the volume, you lose history but not real-time monitoring. For long-term metric retention, integrate with Thanos / Cortex / Mimir rather than backing up the local TSDB.

What is worth backing up:

  • values.yaml and any override files (under version control).
  • AlertmanagerConfig resources (under version control).
  • PrometheusRule resources (under version control).
  • Grafana persistent volume (saves user-created dashboards, alerts, API keys). Either snapshot the PVC or use Grafana's built-in provisioning to keep dashboards as code (a small export sketch follows this list).
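One way to pull user-created dashboards out as code is the Grafana HTTP API. A rough sketch, assuming a port-forward to the Grafana service and admin credentials in GRAFANA_USER / GRAFANA_PASS; the service name follows the default chart naming and may differ in your release:

kubectl -n monitoring port-forward svc/prometheus-stack-grafana 3000:80 &

# Export every dashboard as JSON, one file per UID
for uid in $(curl -s -u "$GRAFANA_USER:$GRAFANA_PASS" 'http://localhost:3000/api/search?type=dash-db' | jq -r '.[].uid'); do
  curl -s -u "$GRAFANA_USER:$GRAFANA_PASS" "http://localhost:3000/api/dashboards/uid/$uid" \
    | jq '.dashboard' > "dashboard-$uid.json"
done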

Rotating credentials

  • Grafana admin password: update grafana.adminPassword in your values and helm upgrade, then restart the Grafana pod; or point grafana.admin.existingSecret at a Secret and rotate that Secret (sketched below).
  • Alertmanager receiver tokens (Slack, PagerDuty): move them into a Kubernetes Secret referenced via alertmanager.alertmanagerSpec.secrets, then point the receiver config in alertmanager.config at the mounted files (the *_file variants of the credential fields) instead of inlining the tokens.
  • Ingress basic-auth: recreate the htpasswd Secret referenced by the Ingress auth annotation.
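A sketch of the existingSecret path for the Grafana password. The Secret name, key names, and deployment name are assumptions: the keys must match grafana.admin.userKey and grafana.admin.passwordKey, and the Secret name must match grafana.admin.existingSecret in your values.

# Recreate the admin Secret with a new password, then restart Grafana to pick it up
kubectl -n monitoring create secret generic grafana-admin \
  --from-literal=admin-user=admin \
  --from-literal=admin-password="$(openssl rand -base64 24)" \
  --dry-run=client -o yaml | kubectl apply -f -

kubectl -n monitoring rollout restart deployment prometheus-stack-grafana

If Grafana persistence is enabled, the admin password lives in Grafana's database and a restart alone will not change it; in that case run grafana-cli admin reset-admin-password inside the Grafana pod.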

Capacity sizing checks

Run these every quarter on a real production Prometheus:

# Active series
prometheus_tsdb_head_series

# Samples ingested per second
rate(prometheus_tsdb_head_samples_appended_total[5m])

# Bytes per sample (should be ~1.5 - 2 bytes)
rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[1h])
  / rate(prometheus_tsdb_compaction_chunk_samples_sum[1h])

# Memory pressure
process_resident_memory_bytes{job="prometheus"}

If process_resident_memory_bytes is regularly >70% of the pod's memory limit, either bump the limit or shed series.
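That comparison can be expressed as a query too; this assumes the kube-state-metrics instance shipped with the chart is being scraped and that the job and container labels below match your setup:

# Prometheus RSS as a fraction of its container memory limit
process_resident_memory_bytes{job="prometheus"}
  / on(namespace, pod)
max by (namespace, pod) (
  kube_pod_container_resource_limits{resource="memory", container="prometheus"}
)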

Documentation hygiene

Whenever you add, change, or remove a ServiceMonitor, PrometheusRule, or AlertmanagerConfig, make sure the annotations.runbook_url on the affected alerts still points at a current runbook. Operators paged at 3 AM rely on those links.
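A rough audit sketch for spotting alerts that are missing that annotation; it simply walks every PrometheusRule in the namespace:

# List alerting rules with no runbook_url annotation
kubectl -n monitoring get prometheusrules -o json \
  | jq -r '.items[].spec.groups[].rules[] | select(.alert != null and (.annotations.runbook_url // "") == "") | .alert' \
  | sort -u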