Monitoring

The cluster's observability stack uses Prometheus for metrics, Loki for logs (collected by Fluent Bit), and Grafana for visualization. All components run in the monitoring namespace and are deployed as Helm charts through ArgoCD.

Overview

flowchart LR
    subgraph Applications
        App1[Cilium / Hubble]
        App2[external-dns]
        App3[cloudflared]
        App4[Authelia]
        App5[Other Apps]
    end

    subgraph Metrics Pipeline
        SM[ServiceMonitors]
        Prom[Prometheus]
    end

    subgraph Logs Pipeline
        FB[Fluent Bit]
        Loki[Loki]
    end

    Grafana[Grafana]

    App1 & App2 & App3 & App4 & App5 -->|expose /metrics| SM
    SM -->|scrape targets| Prom
    Prom -->|query metrics| Grafana

    App1 & App2 & App3 & App4 & App5 -->|stdout/stderr| FB
    FB -->|forward logs| Loki
    Loki -->|query logs| Grafana

Observability Strategy

The monitoring stack follows a pull-based model for metrics and a push-based model for logs:

  • Metrics -- Applications expose Prometheus-compatible /metrics endpoints, and ServiceMonitor resources tell Prometheus which Services to scrape. The kube-prometheus-stack chart also ships built-in monitoring for Kubernetes internals (kubelet, API server, etcd, controller manager, scheduler).
  • Logs -- Fluent Bit runs as a DaemonSet on every node, tailing container log files from /var/log/containers/ and forwarding them to Loki. Logs are queryable via LogQL in Grafana.
  • Dashboards -- Grafana auto-provisions dashboards from ConfigMaps labeled grafana_dashboard: "true" and organizes them into folders using the grafana_folder annotation. Additional dashboards are loaded from Grafana.com and upstream project repositories.
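
As a rough illustration of the metrics and dashboard conventions described above, an application might ship manifests along these lines. This is a minimal sketch, not a manifest taken from this cluster: the example-app name, the example namespace, the metrics port name, the 30s interval, and the dashboard JSON are all hypothetical placeholders.

```yaml
# Hypothetical ServiceMonitor: asks Prometheus to scrape /metrics on the
# Service port named "metrics" of any Service carrying the selected label.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app            # hypothetical name
  namespace: example           # hypothetical namespace
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: example-app
  endpoints:
    - port: metrics            # must match a named port on the Service
      path: /metrics
      interval: 30s            # assumed scrape interval
---
# Hypothetical dashboard ConfigMap: the label triggers discovery and the
# annotation selects the target Grafana folder.
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-app-dashboard  # hypothetical name
  namespace: example
  labels:
    grafana_dashboard: "true"
  annotations:
    grafana_folder: Example    # hypothetical folder name
data:
  example-app.json: |
    {
      "title": "Example App",
      "panels": []
    }
```

Depending on how the Prometheus instance selects ServiceMonitors, an additional label on the ServiceMonitor (for example a Helm release label) may be required before the target is picked up; that detail is not stated here and should be checked against the chart values.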

No Alertmanager

Alertmanager is currently disabled in this cluster. Alerting can be enabled in the kube-prometheus-stack values when needed.
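
For reference, turning alerting back on would likely be a small change to the kube-prometheus-stack Helm values. The fragment below is a sketch based on the chart's alertmanager.enabled key; receivers, routes, and storage are left out and would need to be configured separately.

```yaml
# kube-prometheus-stack values (sketch): enable the bundled Alertmanager.
# Routing and receiver configuration is intentionally omitted.
alertmanager:
  enabled: true
```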

Components

| Component | Helm Chart | Version | Purpose |
| --- | --- | --- | --- |
| kube-prometheus-stack | prometheus-community/kube-prometheus-stack | 81.6.9 | Prometheus, node-exporter, kube-state-metrics, recording rules |
| Grafana | grafana/grafana | 10.5.15 | Dashboard visualization, SSO via Authelia |
| Loki | grafana/loki | 6.51.0 | Log aggregation and storage |
| Fluent Bit | fluent/fluent-bit | 0.55.0 | Log collection from all nodes |
| Gatus | app-template (bjw-s) | v5.17.0 | Uptime monitoring and status page |

Namespace Configuration

The monitoring namespace enforces the privileged Pod Security Standard because node-exporter and Fluent Bit require host-level access:

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
  labels:
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/warn: privileged

Key Endpoints

| Service | URL | Gateway |
| --- | --- | --- |
| Prometheus | https://prometheus.example.com | envoy-internal |
| Grafana | https://grafana.example.com | envoy-external |
| Gatus | https://status.pitower.link | envoy-internal |
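
If the Gateway column refers to Kubernetes Gateway API Gateway resources (an assumption; the routing mechanism is not described here), the Prometheus endpoint could be attached with an HTTPRoute roughly like the one below. The Gateway's namespace, the backend Service name, and the port are hypothetical and would need to match the actual deployment.

```yaml
# Sketch only: routes prometheus.example.com through the envoy-internal Gateway.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: prometheus
  namespace: monitoring
spec:
  parentRefs:
    - name: envoy-internal
      namespace: networking      # hypothetical Gateway namespace
  hostnames:
    - prometheus.example.com
  rules:
    - backendRefs:
        - name: prometheus-operated  # assumed backend Service name
          port: 9090                 # assumed port (Prometheus default)
```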

Key Design Decisions

  • Separate Grafana deployment -- Grafana is deployed as its own Helm release rather than the one bundled in kube-prometheus-stack, allowing independent upgrades and more flexible configuration.
  • Sidecar-based dashboard discovery -- The Grafana sidecar watches all namespaces for ConfigMaps with the grafana_dashboard label, so any application can ship its own dashboards.
  • SingleBinary Loki -- Loki runs in single-binary mode with filesystem storage on OpenEBS, keeping the deployment simple for a single-cluster setup.
  • Fluent Bit over Promtail -- Fluent Bit was chosen for log collection due to its low resource footprint and flexible filtering pipeline.
  • WAL compression -- Prometheus uses WAL compression to reduce disk usage on the 20Gi Ceph-backed PVC.
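
The Prometheus and Loki decisions above map onto a handful of Helm values. The fragments below are a sketch under stated assumptions rather than the values used in this cluster: both storage class names and the Loki replica count are hypothetical, and only the 20Gi size, WAL compression, single-binary mode, and filesystem storage come from the description above.

```yaml
# kube-prometheus-stack values (sketch): WAL compression and a 20Gi PVC.
prometheus:
  prometheusSpec:
    walCompression: true
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: ceph-block   # hypothetical Ceph storage class
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi
```

```yaml
# grafana/loki values (sketch): single-binary mode with filesystem storage.
deploymentMode: SingleBinary
loki:
  commonConfig:
    replication_factor: 1
  storage:
    type: filesystem
singleBinary:
  replicas: 1                            # assumed replica count
  persistence:
    storageClass: openebs-hostpath       # hypothetical OpenEBS storage class
# The scalable targets are not used in single-binary mode.
read:
  replicas: 0
write:
  replicas: 0
backend:
  replicas: 0
```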