Configuration¶
The chart's values.yaml is large (~5000 lines, ~169 KB) because it composes
five subcharts plus its own templates. This page walks you through the keys
you almost always touch, and points you at sections you might touch
occasionally.
For an exhaustive reference, see the Values Reference page.
Values file map¶
| Top-level key | What it configures |
|---|---|
| nameOverride, namespaceOverride, fullnameOverride | Naming for chart-rendered resources |
| commonLabels | Labels added to every resource the chart renders |
| crds | Whether to install CRDs (crds.enabled) |
| defaultRules | Whether to ship the default PrometheusRules, and which groups |
| additionalPrometheusRulesMap | Inline custom PrometheusRules |
| global | Image registry, image pull secrets, RBAC settings shared by subcharts |
| windowsMonitoring | Toggle Windows node monitoring |
| prometheus-windows-exporter | Pass-through to the Windows exporter subchart |
| alertmanager | Alertmanager CR + ingress + Service + config |
| grafana | Pass-through to the Grafana subchart |
| kubernetesServiceMonitors | Toggle the bundle of kube-* ServiceMonitors |
| kubeApiServer, kubelet, kubeControllerManager, kubeScheduler, kubeProxy, kubeEtcd, coreDns, kubeDns | Per-component ServiceMonitors |
| kubeStateMetrics / kube-state-metrics | Toggle and pass-through to subchart |
| nodeExporter / prometheus-node-exporter | Toggle and pass-through to subchart |
| prometheusOperator | The operator Deployment, RBAC, webhooks, TLS |
| prometheus | The Prometheus CR + ingress + Service + scrape config selectors |
| thanosRuler | Optional ThanosRuler CR |
| cleanPrometheusOperatorObjectNames | Renaming knob for legacy installs |
| extraManifests | Free-form list of extra Kubernetes manifests |
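In practice a custom values file only touches a handful of these keys at once. A minimal sketch of how they compose (the hostname, password and retention below are placeholders, not recommendations):

nameOverride: "monitoring"
defaultRules:
  create: true
grafana:
  adminPassword: "<rotate-me>"
  ingress:
    enabled: true
    hosts:
      - grafana.example.com
prometheus:
  prometheusSpec:
    retention: 30d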
Naming and namespace¶
nameOverride: "monitoring"
namespaceOverride: "monitoring"
The OVES copy of the chart sets nameOverride: "monitoring", which yields
rendered resource names like monitoring-kube-prometheus-prometheus rather
than prometheus-stack-kube-prometheus-prometheus. Be careful when changing
this on an existing install: resource names will change.
Storage and retention¶
Prometheus retention and storage are the two settings most often misconfigured on a first install.
prometheus:
prometheusSpec:
retention: 30d # how long to keep samples
retentionSize: "" # set together with PVC size, e.g. "80GiB"
storageSpec:
volumeClaimTemplate:
spec:
storageClassName: gp3
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 100Gi
Rules of thumb:
- Pick retention based on how far back you need to query during incidents (often 14–30 days).
- Set retentionSize to ~80–90% of the PVC size so the WAL cannot fill the disk.
- Use a StorageClass with volumeBindingMode: WaitForFirstConsumer and SSD-class performance (see the sketch after this list).
- Do not leave storageSpec unset in production: without it, Prometheus uses an emptyDir and you lose data on every pod restart.
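If your cluster does not already have a suitable StorageClass, here is a sketch of one that matches these rules of thumb, assuming the AWS EBS CSI driver (adjust the provisioner and parameters for your platform):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com   # assumption: AWS EBS CSI driver is installed
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true     # lets you grow the Prometheus PVC later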
Replicas and HA¶
prometheus:
prometheusSpec:
replicas: 2
shards: 1 # only increase if you have >1M active series
podAntiAffinity: "soft"
alertmanager:
alertmanagerSpec:
replicas: 3 # 3 is the recommended HA size for gossip
podAntiAffinity: "soft"
The two Prometheus replicas run independently: each scrapes the same targets and stores its own copy of the data. For deduplicated long-term storage, integrate with Thanos, Cortex, or Mimir.
Grafana¶
grafana:
enabled: true
adminPassword: "<rotate-me>" # or grafana.admin.existingSecret
defaultDashboardsEnabled: true
persistence:
enabled: true
storageClassName: gp3
size: 10Gi
ingress:
enabled: true
ingressClassName: nginx
hosts:
- grafana.example.com
tls:
- secretName: grafana-tls
hosts:
- grafana.example.com
additionalDataSources:
- name: Loki
type: loki
url: http://loki-gateway.logging.svc:3100
access: proxy
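Instead of a plaintext adminPassword you can point the Grafana subchart at an existing Secret. A sketch, assuming a Secret named grafana-admin with the keys shown already exists:

grafana:
  admin:
    existingSecret: grafana-admin   # assumption: Secret created out of band
    userKey: admin-user
    passwordKey: admin-password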
Ingress¶
Each ingress (Prometheus, Alertmanager, Grafana) has its own ingress.*
block. Typical pattern:
prometheus:
ingress:
enabled: true
ingressClassName: nginx
hosts: [ "prometheus.example.com" ]
tls:
- secretName: prometheus-tls
hosts: [ "prometheus.example.com" ]
annotations:
nginx.ingress.kubernetes.io/auth-type: basic
nginx.ingress.kubernetes.io/auth-secret: prometheus-basic-auth
Prometheus and Alertmanager have no auth
The Prometheus and Alertmanager UIs ship with no authentication. Put
them behind an authenticating ingress (basic-auth Secret, OAuth2-proxy,
Cloudflare Access, or similar) or leave them unexposed and reach them via
kubectl port-forward.
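The basic-auth annotations shown above expect a Secret whose auth key holds htpasswd entries. A minimal sketch (namespace and hash are placeholders; generate the hash with htpasswd):

apiVersion: v1
kind: Secret
metadata:
  name: prometheus-basic-auth
  namespace: monitoring        # placeholder: use the chart's namespace
type: Opaque
stringData:
  auth: "admin:$apr1$..."      # placeholder htpasswd entry, not a real hash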
Alertmanager routing¶
alertmanager:
enabled: true
config:
global:
resolve_timeout: 5m
slack_api_url: "<your-slack-webhook>"
route:
receiver: "slack-default"
group_by: [ "alertname", "namespace" ]
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
- matchers:
- severity = "critical"
receiver: "pagerduty"
continue: true
receivers:
- name: "slack-default"
slack_configs:
- channel: "#alerts"
send_resolved: true
title: '{{ template "slack.default.title" . }}'
text: '{{ template "slack.default.text" . }}'
- name: "pagerduty"
pagerduty_configs:
- service_key: "<pd-routing-key>"
You can also leave the static config minimal and let teams manage their own
routes via namespaced AlertmanagerConfig CRs.
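A sketch of such a namespaced AlertmanagerConfig, assuming a Secret named team-a-slack with the webhook URL under the key url exists in the same namespace (the Operator automatically scopes the routes to alerts from that namespace):

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: team-a-routing
  namespace: team-a            # placeholder team namespace
spec:
  route:
    receiver: team-a-slack
    groupBy: [ "alertname" ]
  receivers:
    - name: team-a-slack
      slackConfigs:
        - apiURL:
            name: team-a-slack # assumption: Secret holding the webhook URL
            key: url
          channel: "#team-a-alerts"
          sendResolved: true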
Selecting custom rules and ServiceMonitors¶
By default the Operator only picks up ServiceMonitors and PrometheusRules
that carry the chart's release labels. Two common requirements:
1. Pick up resources from any namespace, with any labels:
prometheus:
prometheusSpec:
serviceMonitorSelectorNilUsesHelmValues: false
serviceMonitorSelector: {}
serviceMonitorNamespaceSelector: {}
podMonitorSelectorNilUsesHelmValues: false
podMonitorSelector: {}
podMonitorNamespaceSelector: {}
ruleSelectorNilUsesHelmValues: false
ruleSelector: {}
ruleNamespaceSelector: {}
2. Only pick up resources tagged team=platform:
prometheus:
prometheusSpec:
serviceMonitorSelectorNilUsesHelmValues: false
serviceMonitorSelector:
matchLabels:
team: platform
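For reference, a ServiceMonitor that the second selector would pick up; the name, namespace, label selector and port are placeholders:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-app
  labels:
    team: platform             # must match the matchLabels above
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app
  endpoints:
    - port: http-metrics       # named port on the target Service
      interval: 30s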
Resources¶
Always set Prometheus and Alertmanager resources for production. A starting point for a small cluster (~50 pods, ~10 nodes):
prometheus:
prometheusSpec:
resources:
requests: { cpu: 500m, memory: 2Gi }
limits: { cpu: "2", memory: 4Gi }
alertmanager:
alertmanagerSpec:
resources:
requests: { cpu: 50m, memory: 128Mi }
limits: { cpu: 200m, memory: 256Mi }
grafana:
resources:
requests: { cpu: 100m, memory: 256Mi }
limits: { cpu: 500m, memory: 512Mi }
Tune from there based on actual usage seen in
process_resident_memory_bytes / container_cpu_usage_seconds_total.
Disabling components¶
Common reasons to disable parts of the stack:
# Managed control plane (EKS / GKE / AKS): scheduler/controller-manager/etcd
# are not user-reachable
kubeControllerManager: { enabled: false }
kubeScheduler: { enabled: false }
kubeEtcd: { enabled: false }
# CNI without kube-proxy (e.g. Cilium kube-proxy replacement)
kubeProxy: { enabled: false }
# Already running an external Grafana
grafana: { enabled: false }
# Already running CRDs from another release / operator
crds: { enabled: false }
When you disable a component, also disable the matching default rule group:
defaultRules:
rules:
etcd: false
kubeControllerManager: false
kubeScheduler: false
kubeProxy: false
Otherwise the matching *Down alerts (e.g. KubeSchedulerDown, KubeControllerManagerDown) will fire forever.
Image overrides for air-gapped clusters¶
global:
imageRegistry: "my-internal-registry.example.com"
imagePullSecrets:
- name: registry-creds
prometheusOperator:
image:
registry: my-internal-registry.example.com
repository: quay.io/prometheus-operator/prometheus-operator
tag: v0.78.2
Each component (Prometheus, Alertmanager, Grafana, exporters) has its own
image.* override block.
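For example, the equivalent override for the Prometheus image itself; the tag placeholder should match the version pinned by your chart release:

prometheus:
  prometheusSpec:
    image:
      registry: my-internal-registry.example.com
      repository: quay.io/prometheus/prometheus
      tag: "<pinned-prometheus-tag>"   # placeholder: match the chart's default tag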