
Configuration

The chart's values.yaml is large (~5000 lines, ~169 KB) because it composes five subcharts plus its own templates. This page walks you through the keys you almost always touch, and points you at sections you might touch occasionally.

For an exhaustive reference, see the Values Reference page.

Values file map

| Top-level key | What it configures |
| --- | --- |
| nameOverride, namespaceOverride, fullnameOverride | Naming for chart-rendered resources |
| commonLabels | Labels added to every resource the chart renders |
| crds | Whether to install CRDs (crds.enabled) |
| defaultRules | Whether to ship the default PrometheusRules, and which groups |
| additionalPrometheusRulesMap | Inline custom PrometheusRules |
| global | Image registry, image pull secrets, RBAC settings shared by subcharts |
| windowsMonitoring | Toggle Windows node monitoring |
| prometheus-windows-exporter | Pass-through to the Windows exporter subchart |
| alertmanager | Alertmanager CR + ingress + Service + config |
| grafana | Pass-through to the Grafana subchart |
| kubernetesServiceMonitors | Toggle the bundle of kube-* ServiceMonitors |
| kubeApiServer, kubelet, kubeControllerManager, kubeScheduler, kubeProxy, kubeEtcd, coreDns, kubeDns | Per-component ServiceMonitors |
| kubeStateMetrics / kube-state-metrics | Toggle and pass-through to subchart |
| nodeExporter / prometheus-node-exporter | Toggle and pass-through to subchart |
| prometheusOperator | The operator Deployment, RBAC, webhooks, TLS |
| prometheus | The Prometheus CR + ingress + Service + scrape config selectors |
| thanosRuler | Optional ThanosRuler CR |
| cleanPrometheusOperatorObjectNames | Renaming knob for legacy installs |
| extraManifests | Free-form list of extra Kubernetes manifests |

Naming and namespace

nameOverride: "monitoring"
namespaceOverride: "monitoring"

The OVES copy of the chart sets nameOverride: "monitoring". This makes rendered resource names like monitoring-kube-prometheus-prometheus rather than prometheus-stack-kube-prometheus-prometheus. Be careful when changing this on an existing install: Helm will delete the old resources and create new ones under the new names.

Storage and retention

Prometheus retention and storage are the two settings most often wrong on first install.

prometheus:
  prometheusSpec:
    retention: 30d            # how long to keep samples
    retentionSize: ""         # set together with PVC size, e.g. "80GiB"
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 100Gi

Rules of thumb:

  • Pick retention based on how far back you need to query during incidents (often 14–30 days).
  • Set retentionSize to ~80–90% of the PVC size to avoid the WAL filling the disk.
  • Use a StorageClass with volumeBindingMode: WaitForFirstConsumer and SSD performance.
  • Do not leave storageSpec unset in production — without it, Prometheus uses an emptyDir and you lose data on every pod restart.
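Putting those rules of thumb together for the 100Gi volume above (the 85GiB figure is an example at ~85% of the PVC; adjust for your sizing):

```yaml
prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: "85GiB"    # ~85% of the 100Gi PVC below; leaves WAL headroom
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 100Gi  # note: PVC uses Gi, retentionSize uses GiB
```

Whichever limit is hit first (time or size) triggers deletion of the oldest blocks.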

Replicas and HA

prometheus:
  prometheusSpec:
    replicas: 2
    shards: 1                 # only increase if you have >1M active series
    podAntiAffinity: "soft"

alertmanager:
  alertmanagerSpec:
    replicas: 3               # 3 is the recommended HA size for gossip
    podAntiAffinity: "soft"

The two Prometheus replicas run independently (each scrapes the same targets and stores its own copy of the data). For deduplicated querying and long-term storage, integrate with Thanos, Cortex, or Mimir.
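For the Thanos route, the chart can inject the sidecar via the Prometheus CR. A minimal sketch, assuming you have already created a Secret named thanos-objstore holding an objstore.yml (both names are examples; check your chart version's values for the exact shape):

```yaml
prometheus:
  prometheusSpec:
    thanos:
      objectStorageConfig:      # SecretKeySelector pointing at the objstore config
        name: thanos-objstore
        key: objstore.yml
  thanosService:
    enabled: true               # headless Service exposing the sidecar's gRPC port
```

A Thanos Query deployment (managed outside this chart) then fans out to both sidecars and deduplicates by replica label.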

Grafana

grafana:
  enabled: true
  adminPassword: "<rotate-me>"   # or grafana.admin.existingSecret
  defaultDashboardsEnabled: true
  persistence:
    enabled: true
    storageClassName: gp3
    size: 10Gi
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - grafana.example.com
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.example.com
  additionalDataSources:
    - name: Loki
      type: loki
      url: http://loki-gateway.logging.svc:3100
      access: proxy
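Rather than keeping adminPassword in values, the Grafana subchart can read credentials from an existing Secret. A sketch, assuming a Secret named grafana-admin with admin-user and admin-password keys (names are examples):

```yaml
grafana:
  admin:
    existingSecret: grafana-admin   # must exist before install/upgrade
    userKey: admin-user
    passwordKey: admin-password
```

With this set, leave grafana.adminPassword unset so the chart does not render a password of its own.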

Ingress

Each ingress (Prometheus, Alertmanager, Grafana) has its own ingress.* block. Typical pattern:

prometheus:
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts: [ "prometheus.example.com" ]
    tls:
      - secretName: prometheus-tls
        hosts: [ "prometheus.example.com" ]
    annotations:
      nginx.ingress.kubernetes.io/auth-type: basic
      nginx.ingress.kubernetes.io/auth-secret: prometheus-basic-auth

Prometheus and Alertmanager have no auth

The Prometheus and Alertmanager UIs ship with no authentication. Put them behind an authenticating ingress (basic-auth Secret, OAuth2-proxy, Cloudflare Access, or similar) or leave them unexposed and reach them via kubectl port-forward.
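For the basic-auth pattern shown above, ingress-nginx expects the Secret to carry an htpasswd-format entry under the auth key. A sketch (the hash is a placeholder; generate a real one with htpasswd -nb admin <password>):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: prometheus-basic-auth
  namespace: monitoring           # same namespace as the Ingress
type: Opaque
stringData:
  auth: "admin:$apr1$PLACEHOLDER$ReplaceWithRealHash"  # htpasswd output
```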

Alertmanager routing

alertmanager:
  enabled: true
  config:
    global:
      resolve_timeout: 5m
      slack_api_url: "<your-slack-webhook>"
    route:
      receiver: "slack-default"
      group_by: [ "alertname", "namespace" ]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      routes:
        - matchers:
            - severity = "critical"
          receiver: "pagerduty"
          continue: true
    receivers:
      - name: "slack-default"
        slack_configs:
          - channel: "#alerts"
            send_resolved: true
            title: '{{ template "slack.default.title" . }}'
            text: '{{ template "slack.default.text" . }}'
      - name: "pagerduty"
        pagerduty_configs:
          - service_key: "<pd-routing-key>"

You can also leave the static config minimal and let teams manage their own routes via namespaced AlertmanagerConfig CRs.
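A sketch of such a namespaced CR (all names are examples; the webhook URL is read from a Secret in the same namespace). Note that the Alertmanager CR must be configured to select these objects, via alertmanager.alertmanagerSpec.alertmanagerConfigSelector and alertmanagerConfigNamespaceSelector, for them to take effect:

```yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: team-a-routing
  namespace: team-a
spec:
  route:
    receiver: team-a-slack
    groupBy: ["alertname"]
  receivers:
    - name: team-a-slack
      slackConfigs:
        - apiURL:
            name: team-a-slack-webhook   # Secret in namespace team-a
            key: url
          channel: "#team-a-alerts"
          sendResolved: true
```

The operator scopes each AlertmanagerConfig to its own namespace, so team-a's route only ever matches alerts from team-a.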

Selecting custom rules and ServiceMonitors

By default the Operator only picks up ServiceMonitors and PrometheusRules that carry the chart's release labels. Two common requirements:

1. Pick up resources from any namespace, with any labels:

prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}
    serviceMonitorNamespaceSelector: {}

    podMonitorSelectorNilUsesHelmValues: false
    podMonitorSelector: {}
    podMonitorNamespaceSelector: {}

    ruleSelectorNilUsesHelmValues: false
    ruleSelector: {}
    ruleNamespaceSelector: {}

2. Only pick up resources tagged team=platform:

prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelector:
      matchLabels:
        team: platform
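A ServiceMonitor that this selector would pick up just needs the team=platform label on its own metadata (app names and port are hypothetical):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: apps
  labels:
    team: platform               # matched by serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app   # labels on the app's Service
  endpoints:
    - port: http-metrics         # named port on the Service
      interval: 30s
```

Note the spec.selector matches the target Service's labels, while the metadata labels are what Prometheus's serviceMonitorSelector matches.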

Resources

Always set Prometheus and Alertmanager resources for production. A starting point for a small cluster (~50 pods, ~10 nodes):

prometheus:
  prometheusSpec:
    resources:
      requests: { cpu: 500m, memory: 2Gi }
      limits:   { cpu: "2",  memory: 4Gi }

alertmanager:
  alertmanagerSpec:
    resources:
      requests: { cpu: 50m,  memory: 128Mi }
      limits:   { cpu: 200m, memory: 256Mi }

grafana:
  resources:
    requests: { cpu: 100m, memory: 256Mi }
    limits:   { cpu: 500m, memory: 512Mi }

Tune from there based on actual usage seen in process_resident_memory_bytes / container_cpu_usage_seconds_total.

Disabling components

Common reasons to disable parts of the stack:

# Managed control plane (EKS / GKE / AKS): scheduler/controller-manager/etcd
# are not user-reachable
kubeControllerManager: { enabled: false }
kubeScheduler:         { enabled: false }
kubeEtcd:              { enabled: false }

# CNI without kube-proxy (e.g. Cilium kube-proxy replacement)
kubeProxy: { enabled: false }

# Already running an external Grafana
grafana: { enabled: false }

# Already running CRDs from another release / operator
crds: { enabled: false }

When you disable a component, also disable the matching default rule group:

defaultRules:
  rules:
    etcd: false
    kubeControllerManager: false
    kubeScheduler: false
    kubeProxy: false

Otherwise you'll get spurious *Down alerts forever.

Image overrides for air-gapped clusters

global:
  imageRegistry: "my-internal-registry.example.com"
  imagePullSecrets:
    - name: registry-creds

prometheusOperator:
  image:
    registry: my-internal-registry.example.com
    repository: prometheus-operator/prometheus-operator  # no upstream registry prefix here
    tag: v0.78.2

Each component (Prometheus, Alertmanager, Grafana, exporters) has its own image.* override block.
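For example, pointing Prometheus itself at the mirror might look like this (the tag is illustrative; pin whichever version you actually mirrored):

```yaml
prometheus:
  prometheusSpec:
    image:
      registry: my-internal-registry.example.com
      repository: prometheus/prometheus
      tag: v2.54.1               # example tag, must exist in your registry
```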