Building an Observability Stack: Prometheus, Grafana, and Alertmanager for Infrastructure Monitoring

Modern infrastructure generates more telemetry than any human can parse manually. Prometheus, Grafana, and Alertmanager form the de facto open-source observability stack for infrastructure monitoring — each component handling a distinct responsibility: collection and storage, visualization, and alerting. This guide covers the full stack from initial Prometheus configuration through production-ready Alertmanager routing, with practical patterns for recording rules, dashboard design, and on-call integrations.

Architecture Overview

The data flow is straightforward:

  • Prometheus scrapes metrics from instrumented targets (exporters, application /metrics endpoints) on a pull model. It stores time-series data locally and evaluates alerting and recording rules.
  • Alertmanager receives alert notifications from Prometheus, applies deduplication, grouping, and silencing, then routes alerts to the correct receivers (PagerDuty, Slack, email).
  • Grafana queries Prometheus via PromQL to render dashboards. It can also query Alertmanager directly to display active alert states.

Prometheus Scrape Configuration

The core of Prometheus configuration is the scrape config, which defines what to collect and how:

global:
  scrape_interval:     15s
  evaluation_interval: 15s
  external_labels:
    environment: production
    datacenter:  dc1

scrape_configs:
  # Scrape Prometheus itself
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

  # Linux node exporters
  - job_name: node
    file_sd_configs:
      - files: ['/etc/prometheus/targets/nodes/*.yaml']
        refresh_interval: 5m
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '(.+):.*'
        replacement: '$1'

  # Kubernetes pods with prometheus.io/scrape annotation
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2

  # PostgreSQL exporter
  - job_name: postgres
    static_configs:
      - targets: ['db01.internal:9187', 'db02.internal:9187']
    params:
      auth_module: [pgbouncer]
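Relabel rules are a frequent source of silent misconfiguration. The address rewrite in the kubernetes-pods job above can be sanity-checked outside Prometheus; a rough sketch in Python (Prometheus uses RE2 and anchors relabel regexes fully, but this particular pattern behaves identically under Python's re):

```python
import re

# Same pattern as the kubernetes-pods relabel rule; Prometheus's $1:$2
# replacement is expressed here with Python backreferences.
PATTERN = re.compile(r"([^:]+)(?::\d+)?;(\d+)")

def relabel_address(address: str, port_annotation: str) -> str:
    # Prometheus joins source_labels with ';' before matching.
    source = f"{address};{port_annotation}"
    match = PATTERN.fullmatch(source)
    if match is None:
        # No match: relabeling leaves __address__ untouched.
        return address
    return f"{match.group(1)}:{match.group(2)}"

# Pod address already carries a port: the annotation's port replaces it.
print(relabel_address("10.0.3.17:8080", "3001"))  # 10.0.3.17:3001
# Pod address without a port: the annotated port is appended.
print(relabel_address("10.0.3.17", "3001"))       # 10.0.3.17:3001
```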

Service Discovery

Static target lists become unmanageable at scale. Use file-based service discovery for targets managed by configuration management (Puppet, Ansible) or your deployment tooling; Prometheus picks up changes to the target files without requiring a reload:

# /etc/prometheus/targets/nodes/production.yaml
- targets:
    - web01.internal:9100
    - web02.internal:9100
    - bastion01.internal:9100
    - db01.internal:9100
  labels:
    env: production
    team: platform

For Kubernetes, the built-in kubernetes_sd_configs discovers pods, services, and nodes dynamically. Use annotations on deployments to opt in:

annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "3001"
  prometheus.io/path: "/metrics"

Recording Rules

PromQL aggregations over large time ranges are expensive at query time. Recording rules pre-compute expensive expressions and store the result as a new metric series, making dashboard queries fast regardless of data volume:

groups:
  - name: node_recording_rules
    interval: 60s
    rules:
      # CPU utilization per node (1-minute rolling average)
      - record: job:node_cpu_utilization:avg1m
        expr: |
          1 - avg by (instance) (
            rate(node_cpu_seconds_total{mode="idle"}[1m])
          )

      # Memory utilization percentage
      - record: job:node_memory_utilization:ratio
        expr: |
          1 - (
            node_memory_MemAvailable_bytes /
            node_memory_MemTotal_bytes
          )

      # HTTP request rate per service (5-minute window)
      - record: job:http_requests:rate5m
        expr: sum by (job, status_code) (rate(http_requests_total[5m]))
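On the dashboard side, a panel query then switches from the raw expression to the pre-computed series. Both return the same data, but the second reads stored samples instead of re-aggregating raw counters on every refresh:

```
# Raw expression, re-evaluated over all matching series at query time:
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))

# Equivalent panel query against the recorded series: a cheap series lookup.
job:node_cpu_utilization:avg1m
```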

Alerting Rules

Structure alerting rules with consistent labels and annotations that give on-call engineers actionable context:

groups:
  - name: infrastructure.rules
    rules:
      - alert: NodeHighCPU
        expr: job:node_cpu_utilization:avg1m > 0.85
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU utilization is {{ $value | humanizePercentage }} on {{ $labels.instance }} for 5 minutes."
          runbook: "https://wiki.internal.example-corp.com/runbooks/node-high-cpu"

      - alert: NodeDiskFilling
        expr: |
          (
            node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}
          ) < 0.15
        for: 10m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Disk filling on {{ $labels.instance }} ({{ $labels.mountpoint }})"
          description: "Only {{ $value | humanizePercentage }} free space remaining."

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "{{ $labels.instance }} has been unreachable for 1 minute."
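Alerting rules can be unit-tested before deployment with promtool's rule-testing facility. A minimal sketch for the ServiceDown rule, assuming the rules above live in infrastructure.rules.yaml (file names here are illustrative; run with promtool test rules servicedown_test.yaml):

```yaml
# servicedown_test.yaml
rule_files:
  - infrastructure.rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Target stays down (up == 0) for every sample
      - series: 'up{job="node", instance="web01.internal:9100"}'
        values: '0x5'
    alert_rule_test:
      - eval_time: 2m   # past the 1m `for:` duration
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: web01.internal:9100
            exp_annotations:
              summary: "Service node is down"
              description: "web01.internal:9100 has been unreachable for 1 minute."
```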

Alertmanager Configuration

Alertmanager's power is in its routing tree, which applies deduplication, grouping, and routing logic before notifications are sent:

global:
  smtp_smarthost: 'smtp.internal.example-corp.com:587'
  smtp_from: '[email protected]'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

route:
  receiver: default-slack
  group_by: ['alertname', 'instance']
  group_wait:      30s
  group_interval:  5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
      continue: true  # Also send to Slack
    - match:
        severity: critical
      receiver: slack-critical
    - match:
        severity: warning
      receiver: slack-warning
    # Routes are evaluated top-down and stop at the first match unless
    # `continue: true` is set, so this route only catches platform alerts
    # that are neither critical nor warning.
    - match:
        team: platform
      receiver: platform-team-slack

receivers:
  - name: default-slack
    slack_configs:
      - channel: '#alerts-general'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_ROUTING_KEY'
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        severity: critical
        links:
          - href: '{{ .CommonAnnotations.runbook }}'
            text: Runbook

  - name: slack-critical
    slack_configs:
      - channel: '#alerts-critical'
        color: 'danger'
        title: 'CRITICAL: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ "\n" }}Runbook: {{ .Annotations.runbook }}{{ end }}'

inhibit_rules:
  # Suppress per-node CPU alerts while that same node's ServiceDown alert fires
  - source_match:
      alertname: ServiceDown
    target_match:
      alertname: NodeHighCPU
    equal: ['instance']
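During planned maintenance, silences keep known-noisy alerts from paging. Alertmanager's CLI, amtool, manages them; the matcher values and Alertmanager URL below are illustrative:

```
# Silence disk alerts on db01 for two hours during a planned disk expansion
amtool silence add \
  --alertmanager.url=http://alertmanager.internal:9093 \
  --duration=2h \
  --comment="disk expansion in progress" \
  alertname=NodeDiskFilling instance=db01.internal:9100

# List active silences; expire one by ID when maintenance ends early
amtool silence query --alertmanager.url=http://alertmanager.internal:9093
amtool silence expire <silence-id> --alertmanager.url=http://alertmanager.internal:9093
```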

Grafana Dashboard Principles

Effective dashboards answer the four golden signals at a glance: latency, traffic, errors, and saturation. Structure dashboards in rows from coarse to fine:

  • Row 1 — SLO Status: Large stat panels showing current SLO compliance (green/red). Engineers should be able to assess system health in under 5 seconds.
  • Row 2 — Request Rate and Error Rate: Time-series panels with alerting thresholds marked. Use job:http_requests:rate5m recording rules here for fast rendering.
  • Row 3 — Latency Percentiles: P50, P95, P99 latency from histogram metrics. Avoid averages — they mask tail latency that affects user experience.
  • Row 4 — Resource Utilization: CPU, memory, disk I/O per instance. Useful for capacity planning, not primary SLO tracking.
  • Row 5 — Dependency Health: Database connection pool usage, cache hit rate, queue depth. These are often the leading indicators before user-facing metrics degrade.
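The Prometheus datasource itself can be provisioned as code rather than configured through the UI. A minimal sketch using Grafana's provisioning format (the file path and internal hostname are assumptions):

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.internal:9090
    isDefault: true
    jsonData:
      timeInterval: "15s"  # match scrape_interval so rate() windows stay valid
```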

Retention and Storage Sizing

Prometheus's default retention is 15 days. Estimate storage requirements:

# Rough formula: bytes_per_sample * samples_per_second * retention_seconds
# ~2 bytes per compressed sample, scrape every 15s, 10,000 time series:
2 * (10000 / 15) * (15 * 24 * 3600) ≈ 1.7e9 bytes ≈ 1.7 GB
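The same estimate, parameterized so it can be re-run for other fleet sizes. The 2-bytes-per-sample figure is a rule of thumb for compressed TSDB chunks, not a guarantee; measure your own ingestion before committing to hardware:

```python
def prometheus_storage_bytes(num_series: int,
                             scrape_interval_s: int,
                             retention_days: int,
                             bytes_per_sample: float = 2.0) -> float:
    """Rough local-TSDB storage estimate: samples ingested over the
    retention window times average compressed bytes per sample."""
    samples_per_second = num_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 3600
    return bytes_per_sample * samples_per_second * retention_seconds

# 10,000 series scraped every 15s, kept for the default 15 days:
estimate = prometheus_storage_bytes(10_000, 15, 15)
print(f"{estimate / 1e9:.1f} GB")  # 1.7 GB
```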

For long-term retention, use Thanos or Cortex to ship blocks to object storage (S3, GCS). Grafana can query both the local Prometheus store and the long-term store transparently via a single datasource.

Conclusion

Prometheus, Grafana, and Alertmanager together cover the full observability lifecycle for infrastructure: metrics collection with flexible service discovery, pre-computed recording rules for query performance, expressive alerting with contextual runbook links, and intelligent alert routing that reaches the right person at the right time. The discipline of writing runbooks for every alert and structuring dashboards around golden signals transforms raw metrics into operational intelligence.
