Modern infrastructure generates more telemetry than any human can parse manually. Prometheus, Grafana, and Alertmanager form the de facto open-source observability stack for infrastructure monitoring — each component handling a distinct responsibility: collection and storage, visualization, and alerting. This guide covers the full stack from initial Prometheus configuration through production-ready Alertmanager routing, with practical patterns for recording rules, dashboard design, and on-call integrations.
## Architecture Overview
The data flow is straightforward:
- Prometheus scrapes metrics from instrumented targets (exporters, application /metrics endpoints) on a pull model. It stores time-series data locally and evaluates alerting and recording rules.
- Alertmanager receives alert notifications from Prometheus, applies deduplication, grouping, and silencing, then routes alerts to the correct receivers (PagerDuty, Slack, email).
- Grafana queries Prometheus via PromQL to render dashboards. It can also query Alertmanager directly to display active alert states.
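Under the hood, Grafana's panels issue PromQL over Prometheus's HTTP API (`GET /api/v1/query` for instant queries). A minimal sketch of issuing the same query yourself, assuming Prometheus listens on `localhost:9090`:

```python
import json
import urllib.parse
import urllib.request

def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url}/api/v1/query?{params}"

def instant_query(base_url: str, promql: str) -> dict:
    """Run an instant query and return the decoded JSON body."""
    with urllib.request.urlopen(build_query_url(base_url, promql)) as resp:
        return json.load(resp)

# The same kind of query a Grafana stat panel might issue:
url = build_query_url("http://localhost:9090", "up == 0")
```

The response JSON carries a `status` field and a `data.result` array of series; Grafana does the rest of the rendering.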
## Prometheus Scrape Configuration
The core of Prometheus configuration is the scrape config, which defines what to collect and how:
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    environment: production
    datacenter: dc1

scrape_configs:
  # Scrape Prometheus itself
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

  # Linux node exporters
  - job_name: node
    file_sd_configs:
      - files: ['/etc/prometheus/targets/nodes/*.yaml']
        refresh_interval: 5m
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '(.+):.*'
        replacement: '$1'

  # Kubernetes pods with prometheus.io/scrape annotation
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2

  # PostgreSQL exporter
  - job_name: postgres
    static_configs:
      - targets: ['db01.internal:9187', 'db02.internal:9187']
    params:
      auth_module: [pgbouncer]
```
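The last relabel rule in the `kubernetes-pods` job is the least obvious: Prometheus joins the listed `source_labels` with `;` (the default separator) before matching, and relabel regexes are fully anchored. The same transformation sketched in Python, to show what the regex actually does (Python uses `\1`/`\2` where Prometheus relabeling uses `$1`/`$2`):

```python
import re

# Prometheus concatenates the source labels with ';' before matching:
#   __address__ = "10.0.3.7:8080", port annotation = "3001"
joined = "10.0.3.7:8080;3001"

# Same pattern as the relabel rule; fullmatch mirrors Prometheus's
# fully-anchored matching.
pattern = re.compile(r"([^:]+)(?::\d+)?;(\d+)")
m = pattern.fullmatch(joined)
new_address = f"{m.group(1)}:{m.group(2)}"  # host keeps its annotated port
```

The optional `(?::\d+)?` group discards any port already present on `__address__`, so the annotated port always wins.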
## Service Discovery
Static target lists become unmanageable at scale. Use file-based service discovery for targets that Puppet or your deployment system manages:
```yaml
# /etc/prometheus/targets/nodes/production.yaml
- targets:
    - web01.internal:9100
    - web02.internal:9100
    - bastion01.internal:9100
    - db01.internal:9100
  labels:
    env: production
    team: platform
```
For Kubernetes, the built-in kubernetes_sd_configs discovers pods, services, and nodes dynamically. Use annotations on deployments to opt in:
```yaml
annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "3001"
  prometheus.io/path: "/metrics"
```
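Any HTTP endpoint that speaks the Prometheus text exposition format can be scraped. A hand-rolled sketch using only the standard library — real services would normally use an official client library such as `prometheus_client` instead, which handles registries, label escaping, and content negotiation for you:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def render_metrics(request_count: int) -> str:
    """Render one counter in the Prometheus text exposition format."""
    return (
        "# HELP http_requests_total Total HTTP requests served.\n"
        "# TYPE http_requests_total counter\n"
        f'http_requests_total{{path="/"}} {request_count}\n'
    )

class MetricsHandler(BaseHTTPRequestHandler):
    request_count = 0

    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics(MetricsHandler.request_count).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            # Every non-/metrics request bumps the counter.
            MetricsHandler.request_count += 1
            self.send_response(200)
            self.end_headers()

# To expose it on the annotated port:
#   HTTPServer(("", 3001), MetricsHandler).serve_forever()
```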
## Recording Rules
PromQL aggregations over large time ranges are expensive at query time. Recording rules pre-compute expensive expressions and store the result as a new metric series, making dashboard queries fast regardless of data volume:
```yaml
groups:
  - name: node_recording_rules
    interval: 60s
    rules:
      # CPU utilization per node (1-minute rolling average)
      - record: job:node_cpu_utilization:avg1m
        expr: |
          1 - avg by (instance) (
            rate(node_cpu_seconds_total{mode="idle"}[1m])
          )

      # Memory utilization percentage
      - record: job:node_memory_utilization:ratio
        expr: |
          1 - (
            node_memory_MemAvailable_bytes /
            node_memory_MemTotal_bytes
          )

      # HTTP request rate per service (5-minute window)
      - record: job:http_requests:rate5m
        expr: sum by (job, status_code) (rate(http_requests_total[5m]))
```
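A dashboard panel can then read the pre-computed series instead of re-aggregating raw data on every refresh:

```promql
# Query-time aggregation (recomputed on every dashboard refresh):
sum by (job, status_code) (rate(http_requests_total[5m]))

# Equivalent lookup against the recording rule (a cheap series read):
job:http_requests:rate5m
```

Both return the same values; the difference is that the second form does the expensive work once per evaluation interval rather than once per viewer.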
## Alerting Rules
Structure alerting rules with consistent labels and annotations that give on-call engineers actionable context:
```yaml
groups:
  - name: infrastructure.rules
    rules:
      - alert: NodeHighCPU
        expr: job:node_cpu_utilization:avg1m > 0.85
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU utilization is {{ $value | humanizePercentage }} on {{ $labels.instance }} for 5 minutes."
          runbook: "https://wiki.internal.example-corp.com/runbooks/node-high-cpu"

      - alert: NodeDiskFilling
        expr: |
          (
            node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}
          ) < 0.15
        for: 10m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Disk filling on {{ $labels.instance }} ({{ $labels.mountpoint }})"
          description: "Only {{ $value | humanizePercentage }} free space remaining."

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "{{ $labels.instance }} has been unreachable for 1 minute."
```
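Alerting rules deserve tests. `promtool test rules` replays synthetic series against a rule file; a sketch for the `ServiceDown` alert, assuming the rules above live in a file named `infrastructure.rules.yaml` (both filenames here are illustrative):

```yaml
# service_down_test.yaml -- run with: promtool test rules service_down_test.yaml
rule_files:
  - infrastructure.rules.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'up{job="postgres", instance="db01.internal:9187"}'
        values: '0 0 0'
    alert_rule_test:
      - eval_time: 2m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: postgres
              instance: db01.internal:9187
            exp_annotations:
              summary: "Service postgres is down"
              description: "db01.internal:9187 has been unreachable for 1 minute."
```

At `eval_time: 2m` the target has been down for longer than the rule's `for: 1m`, so the test expects the alert to be firing with fully rendered annotations.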
## Alertmanager Configuration
Alertmanager's power is in its routing tree, which applies deduplication, grouping, and routing logic before notifications are sent:
```yaml
global:
  smtp_smarthost: 'smtp.internal.example-corp.com:587'
  smtp_from: '[email protected]'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

route:
  receiver: default-slack
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
      continue: true  # Also send to Slack
    - match:
        severity: critical
      receiver: slack-critical
    - match:
        severity: warning
      receiver: slack-warning
    - match:
        team: platform
      receiver: platform-team-slack

receivers:
  - name: default-slack
    slack_configs:
      - channel: '#alerts-general'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_ROUTING_KEY'
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        severity: critical
        links:
          - href: '{{ .CommonAnnotations.runbook }}'
            text: Runbook
  - name: slack-critical
    slack_configs:
      - channel: '#alerts-critical'
        color: 'danger'
        title: 'CRITICAL: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ "\n" }}Runbook: {{ .Annotations.runbook }}{{ end }}'
  # Every receiver referenced by the routing tree must be defined,
  # or Alertmanager refuses to load the config.
  - name: slack-warning
    slack_configs:
      - channel: '#alerts-warning'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: platform-team-slack
    slack_configs:
      - channel: '#team-platform'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

inhibit_rules:
  # Don't alert for CPU when the whole node is down
  - source_match:
      alertname: ServiceDown
    target_match:
      alertname: NodeHighCPU
    equal: ['instance']
```
## Grafana Dashboard Principles
Effective dashboards answer the four golden signal questions at a glance: latency, traffic, errors, and saturation. Structure dashboards in rows from coarse to fine:
- Row 1 — SLO Status: Large stat panels showing current SLO compliance (green/red). Engineers should be able to assess system health in under 5 seconds.
- Row 2 — Request Rate and Error Rate: Time-series panels with alerting thresholds marked. Use the `job:http_requests:rate5m` recording rule here for fast rendering.
- Row 3 — Latency Percentiles: P50, P95, P99 latency from histogram metrics. Avoid averages — they mask tail latency that affects user experience.
- Row 4 — Resource Utilization: CPU, memory, disk I/O per instance. Useful for capacity planning, not primary SLO tracking.
- Row 5 — Dependency Health: Database connection pool usage, cache hit rate, queue depth. These are often the leading indicators before user-facing metrics degrade.
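For the latency row, percentiles come from histogram buckets rather than averages. The metric name below is an assumption — Prometheus client libraries conventionally expose `*_bucket` series for histograms:

```promql
# P95 request latency per service over a 5-minute window
histogram_quantile(0.95,
  sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
)
```

The `le` label must survive the aggregation, since `histogram_quantile` interpolates across bucket boundaries.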
## Retention and Storage Sizing
Prometheus's default retention is 15 days. Estimate storage requirements:
```
# Rough formula: bytes_per_sample * samples_per_second * retention_seconds
# ~2 bytes per sample, scrape every 15s, 10,000 time series:
2 * (10000 / 15) * (15 * 24 * 3600) ≈ 1.7e9 bytes ≈ 1.7 GB
```
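The same estimate as a small script, handy for plugging in your own series count and scrape interval (the ~2 bytes/sample figure is a rough average for Prometheus's compressed TSDB; actual usage varies with series churn):

```python
def estimate_storage_bytes(series: int, scrape_interval_s: int,
                           retention_days: int,
                           bytes_per_sample: float = 2.0) -> float:
    """Rough local-TSDB sizing: bytes/sample * samples/sec * retention."""
    samples_per_second = series / scrape_interval_s
    retention_seconds = retention_days * 24 * 3600
    return bytes_per_sample * samples_per_second * retention_seconds

# 10,000 series scraped every 15s, kept for the default 15 days:
gb = estimate_storage_bytes(10_000, 15, 15) / 1e9  # ≈ 1.7 GB
```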
For long-term retention, use Thanos or Cortex to ship blocks to object storage (S3, GCS). Grafana can query both the local Prometheus store and the long-term store transparently via a single datasource.
## Conclusion
Prometheus, Grafana, and Alertmanager together cover the full observability lifecycle for infrastructure: metrics collection with flexible service discovery, pre-computed recording rules for query performance, expressive alerting with contextual runbook links, and intelligent alert routing that reaches the right person at the right time. The discipline of writing runbooks for every alert and structuring dashboards around golden signals transforms raw metrics into operational intelligence.
