Prometheus and Grafana Monitoring Stack

Module 28: DevOps & Deployment

Introduction to Prometheus and Grafana

In our previous lectures, we explored application monitoring strategies and centralized logging. Now, we'll focus on implementing a complete monitoring solution using two powerful, open-source tools: Prometheus and Grafana.

This combination has become the de facto standard for cloud-native monitoring, especially in Kubernetes environments. If metrics and logs are the foundation of observability, the Prometheus-Grafana stack provides the tools to collect, store, analyze, and visualize this data effectively.

flowchart TD
    subgraph "Data Collection"
        A[Application Metrics] --> P[Prometheus]
        B[Node Metrics] --> P
        C[Kubernetes Metrics] --> P
        D[External System Metrics] --> P
        E[Alert Rules] --> P
    end
    subgraph "Visualization & Alerting"
        P --> G[Grafana]
        P --> AM[Alertmanager]
        L[Loki] --> G
        Other[Other Data Sources] --> G
        AM --> Emails[Email Notifications]
        AM --> Slack[Slack Notifications]
        AM --> PD[PagerDuty Alerts]
    end
    subgraph "Consumers"
        G --> U[Users/Dashboards]
        G --> Auto[Automated Systems]
    end

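Think of Prometheus and Grafana as complementary parts of a complete monitoring solution: Prometheus collects, stores, and evaluates metrics, while Grafana queries that data and turns it into dashboards.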

Understanding Prometheus

What is Prometheus?

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud, now a graduated project of the Cloud Native Computing Foundation (CNCF).

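Key features of Prometheus include a multi-dimensional data model built on metric names and labels, the PromQL query language, pull-based collection over HTTP, service discovery for finding targets, and alerting through Alertmanager.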

Prometheus Architecture

Let's examine the core components of the Prometheus architecture:

flowchart TD
    subgraph "Prometheus Server"
        S[Scraper] --> TSDB[Time Series Database]
        TSDB --> Q[PromQL Query Engine]
        SD[Service Discovery] --> S
        AR[Alert Rules] --> TSDB
    end
    subgraph "Targets"
        T1[Application Exporters] --> S
        T2[Node Exporter] --> S
        T3[Kubernetes Components] --> S
        T4[Other Exporters] --> S
    end
    subgraph "Alert Management"
        AR --> AM[Alertmanager]
        AM --> NM[Notification Methods]
    end
    subgraph "Visualization"
        Q --> WebUI[Prometheus Web UI]
        Q --> G[Grafana]
    end

How Prometheus Works

  1. Service Discovery: Prometheus discovers targets to monitor (static configuration, Kubernetes API, Consul, etc.)
  2. Data Collection (Scraping): Prometheus pulls metrics from HTTP endpoints exposed by applications and exporters
  3. Storage: Samples are written to Prometheus's built-in time-series database (TSDB)
  4. Querying: PromQL allows powerful querying of collected metrics
  5. Alerting: Prometheus evaluates alert rules and sends firing alerts to Alertmanager
  6. Alertmanager: Handles alert deduplication, grouping, and routing to notification channels
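
To make step 2 more concrete: a scrape is an HTTP GET that returns plain text, with one sample per line. The sketch below parses a small payload in that text format. The helper names parseSample and parseExposition are hypothetical, and the parser only handles the simple case (no escaped quotes in label values, no timestamps).

```javascript
// Minimal sketch: parse the Prometheus text exposition format.
// Assumption: simple lines only (no escaped quotes in label values, no timestamps).
function parseSample(line) {
  const match = line.match(/^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$/);
  if (!match) return null;
  const [, name, rawLabels, rawValue] = match;
  const labels = {};
  if (rawLabels) {
    // Each label looks like key="value"
    for (const pair of rawLabels.matchAll(/([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"/g)) {
      labels[pair[1]] = pair[2];
    }
  }
  return { name, labels, value: Number(rawValue) };
}

function parseExposition(text) {
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line !== '' && !line.startsWith('#')) // skip HELP/TYPE comments
    .map(parseSample)
    .filter((sample) => sample !== null);
}

// Made-up payload, shaped like what a /metrics endpoint returns
const payload = `
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
http_requests_total{method="POST",status="500"} 3
active_sessions 42
`;

const samples = parseExposition(payload);
console.log(samples);
```

Because the format is this simple, any process that can serve text over HTTP can become a scrape target.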

A real-world analogy for Prometheus is a health monitoring system in a hospital:

Prometheus Data Model and Metrics Types

The Time-Series Data Model

Prometheus stores all data as time series: streams of timestamped values belonging to the same metric and set of labeled dimensions.

Every time series is uniquely identified by its metric name and a set of key-value pairs called labels.

This results in a time series identifier like:

http_requests_total{method="GET", status="200", path="/api/users"}
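
As a rough illustration of what "uniquely identified" means, the sketch below builds an identity string from a metric name and its labels. The seriesKey helper is hypothetical and is not part of any Prometheus client library; it only shows that the order of labels does not matter, while any change in a label value produces a different series.

```javascript
// Minimal sketch: a series is identified by metric name plus its full label set.
// seriesKey is a hypothetical helper, not part of any Prometheus client library.
function seriesKey(name, labels = {}) {
  const pairs = Object.keys(labels)
    .sort() // label order must not affect identity
    .map((key) => `${key}="${labels[key]}"`);
  return `${name}{${pairs.join(',')}}`;
}

const first = seriesKey('http_requests_total', { method: 'GET', status: '200', path: '/api/users' });
const reordered = seriesKey('http_requests_total', { status: '200', path: '/api/users', method: 'GET' });
const otherStatus = seriesKey('http_requests_total', { method: 'GET', status: '500', path: '/api/users' });

console.log(first === reordered);   // true: same labels in a different order, same series
console.log(first === otherStatus); // false: a different status value is a different series
```

Since every distinct label combination is its own series, labels with many possible values (user IDs, for example) multiply the number of series Prometheus has to store.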

Metric Types

Prometheus supports four core metric types:

classDiagram
    class Counter {
        +Monotonically increasing value
        +Only goes up or resets to zero
        +Example: http_requests_total
    }
    class Gauge {
        +Value can go up and down
        +Snapshot of current state
        +Example: memory_usage_bytes
    }
    class Histogram {
        +Samples observations in buckets
        +Calculates sums and counts
        +Example: http_request_duration_seconds
    }
    class Summary {
        +Similar to histogram
        +Tracks quantiles over sliding time window
        +Example: http_request_duration_quantiles
    }

  1. Counter: A cumulative metric that represents a single monotonically increasing counter
    • Example: Number of requests, errors, or completed tasks
    • Only increases or resets to zero (e.g., on restart)
    • Typically used with the rate() function to calculate the per-second rate of change
  2. Gauge: A metric that represents a single numerical value that can go up and down
    • Example: Memory usage, CPU utilization, active connections
    • Snapshot of current state
    • Can be used directly or with functions like min(), max(), avg()
  3. Histogram: Samples observations and counts them in configurable buckets
    • Example: Request durations, response sizes
    • Automatically generates multiple time series:
      • Cumulative count of observations in each bucket (_bucket)
      • Sum of all observed values (_sum)
      • Count of events observed (_count)
    • Used with the histogram_quantile() function to calculate quantiles
  4. Summary: Similar to histogram, but calculates configurable quantiles over a sliding time window
    • Example: Request duration percentiles
    • Pre-calculates quantiles, unlike histograms which calculate them on query
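
The histogram is the least intuitive of the four types, so here is a small sketch of how observations become _bucket, _sum, and _count series. SketchHistogram is a hypothetical class written for illustration only; real client libraries do this bookkeeping for you. The point to notice is that buckets are cumulative: each one counts every observation at or below its upper bound.

```javascript
// Minimal sketch of how a histogram turns observations into _bucket, _sum and _count.
// SketchHistogram is a hypothetical class for illustration, not a client library API.
class SketchHistogram {
  constructor(upperBounds) {
    this.upperBounds = [...upperBounds].sort((x, y) => x - y);
    this.bucketCounts = new Array(this.upperBounds.length + 1).fill(0); // last slot is +Inf
    this.sum = 0;
    this.count = 0;
  }

  observe(value) {
    this.sum += value;
    this.count += 1;
    // Buckets are cumulative: an observation counts in every bucket whose bound is >= value.
    this.upperBounds.forEach((bound, i) => {
      if (value <= bound) this.bucketCounts[i] += 1;
    });
    this.bucketCounts[this.upperBounds.length] += 1; // le="+Inf" always matches
  }

  series(name) {
    const lines = this.upperBounds.map(
      (bound, i) => `${name}_bucket{le="${bound}"} ${this.bucketCounts[i]}`
    );
    lines.push(`${name}_bucket{le="+Inf"} ${this.bucketCounts[this.upperBounds.length]}`);
    lines.push(`${name}_sum ${this.sum}`);
    lines.push(`${name}_count ${this.count}`);
    return lines;
  }
}

const histogram = new SketchHistogram([0.1, 0.5, 1]);
[0.05, 0.3, 0.7, 2].forEach((seconds) => histogram.observe(seconds));
console.log(histogram.series('http_request_duration_seconds').join('\n'));
```

With these four observations the bucket counts come out as 1, 2, 3, and 4 (the last being +Inf), which is the shape histogram_quantile() expects.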

Examples in Code


// Node.js example with prom-client and Express
const express = require('express');
const prometheus = require('prom-client');

const app = express();

// Create a Registry to register metrics
const register = new prometheus.Registry();

// Counter example
const httpRequestsTotal = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'status', 'path'],
  registers: [register]
});

// Gauge example
const activeSessions = new prometheus.Gauge({
  name: 'active_sessions',
  help: 'Current number of active user sessions',
  registers: [register]
});

// Histogram example
const httpRequestDurationSeconds = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path'],
  buckets: [0.1, 0.3, 0.5, 1, 3, 5, 10], // in seconds
  registers: [register]
});

// Summary example
const httpRequestDurationSummary = new prometheus.Summary({
  name: 'http_request_duration_summary',
  help: 'HTTP request duration summary',
  labelNames: ['method', 'path'],
  percentiles: [0.5, 0.9, 0.95, 0.99],
  registers: [register]
});

// Usage in Express middleware
app.use((req, res, next) => {
  // Update active sessions gauge (example)
  activeSessions.set(getActiveSessions()); // Your function to get the count

  // Start timers for the histogram and the summary
  // Note: raw paths can create many series; prefer a route pattern where available
  const labels = { method: req.method, path: req.path };
  const end = httpRequestDurationSeconds.startTimer(labels);
  const endSummary = httpRequestDurationSummary.startTimer(labels);

  res.on('finish', () => {
    // The status code is only final once the response has been sent
    httpRequestsTotal.inc({
      method: req.method,
      path: req.path,
      status: res.statusCode
    });

    // End timers when the response is sent
    end();
    endSummary();
  });

  next();
});

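// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Port 3000 matches the web-app scrape target used in the Prometheus configuration
app.listen(3000);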
          

Setting Up Prometheus

Installation with Docker

The easiest way to get started with Prometheus is using Docker:


# Docker run command
docker run -d \
  -p 9090:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  --name prometheus \
  prom/prometheus
          

Basic Prometheus Configuration

Here's a simple prometheus.yml configuration file:


global:
  scrape_interval: 15s    # How often to scrape targets
  evaluation_interval: 15s # How often to evaluate rules

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager:9093

# Rule files to load
rule_files:
  - "rules/*.yml"

# Scrape configuration
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

  # Node Exporter for host metrics
  - job_name: 'node'
    static_configs:
    - targets: ['node-exporter:9100']

  # Application metrics
  - job_name: 'web-app'
    static_configs:
    - targets: ['web-app:3000']
          

Service Discovery

For dynamic environments, static configuration isn't sufficient. Prometheus supports several service discovery mechanisms:


# Kubernetes service discovery example
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with prometheus.io/scrape=true annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use the port specified in prometheus.io/port annotation if present
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Set the app label from the pod's app label
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: replace
        target_label: app
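
Relabeling is easier to follow when you trace a single target through the rules. The sketch below mimics the keep and replace actions from the configuration above, assuming the usual Prometheus semantics: source label values are joined with ";" and the regex has to match the whole string. applyRelabel and relabel are hypothetical helpers, not a Prometheus API, and the target labels are made-up sample data.

```javascript
// Minimal sketch of two relabel actions, assuming Prometheus semantics:
// source label values are joined with ";" and the regex must match the whole string.
// applyRelabel is a hypothetical helper, not a Prometheus API.
function applyRelabel(labels, rule) {
  const joined = rule.source_labels.map((name) => labels[name] ?? '').join(';');
  const match = joined.match(new RegExp(`^(?:${rule.regex ?? '(.*)'})$`));

  if (rule.action === 'keep') {
    return match ? labels : null; // null means the target is dropped
  }
  if (rule.action === 'replace') {
    if (!match) return labels; // no match leaves the labels untouched
    const template = rule.replacement ?? '$1';
    const value = template.replace(/\$(\d)/g, (_, group) => match[Number(group)] ?? '');
    return { ...labels, [rule.target_label]: value };
  }
  throw new Error(`unsupported action: ${rule.action}`);
}

const rules = [
  { source_labels: ['__meta_kubernetes_pod_annotation_prometheus_io_scrape'], action: 'keep', regex: 'true' },
  { source_labels: ['__meta_kubernetes_pod_label_app'], action: 'replace', target_label: 'app' },
];

function relabel(target) {
  let labels = target;
  for (const rule of rules) {
    labels = applyRelabel(labels, rule);
    if (labels === null) return null;
  }
  return labels;
}

const annotated = relabel({
  __address__: '10.0.0.5:3000',
  __meta_kubernetes_pod_annotation_prometheus_io_scrape: 'true',
  __meta_kubernetes_pod_label_app: 'web-app',
});
const unannotated = relabel({ __address__: '10.0.0.6:3000', __meta_kubernetes_pod_label_app: 'batch' });

console.log(annotated.app); // "web-app"
console.log(unannotated);   // null: the pod has no scrape annotation, so it is dropped
```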
          

Alert Rules

Alert rules define conditions that should trigger alerts:


# rules/alerts.yml
groups:
- name: example
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

  - alert: HighCPULoad
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High CPU load on {{ $labels.instance }}"
      description: "CPU load is above 80% for more than 10 minutes on {{ $labels.instance }}"

  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage on {{ $labels.instance }}"
      description: "Memory usage is above 90% for more than 5 minutes on {{ $labels.instance }}"

  - alert: HighRequestLatency
    expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High request latency"
      description: "95th percentile request latency is above 1 second for more than 5 minutes"
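
The for: clause in these rules is what keeps a brief blip from paging anyone. The sketch below models it as a small state machine: the condition has to hold on every evaluation for the whole duration before the alert goes from pending to firing. createAlertState is a hypothetical helper, times are plain seconds, and the 60-second evaluation step is an assumption chosen to keep the example short.

```javascript
// Minimal sketch of the `for:` clause: a condition must hold continuously
// for the configured duration before the alert moves from pending to firing.
// createAlertState is a hypothetical helper; times are plain seconds.
function createAlertState(forSeconds) {
  let activeSince = null;
  return function evaluate(nowSeconds, conditionTrue) {
    if (!conditionTrue) {
      activeSince = null; // any false evaluation resets the timer
      return 'inactive';
    }
    if (activeSince === null) activeSince = nowSeconds;
    return nowSeconds - activeSince >= forSeconds ? 'firing' : 'pending';
  };
}

// InstanceDown uses `for: 5m` (300 s); evaluations here run every 60 s (an assumption).
const evaluate = createAlertState(300);
const upSamples = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]; // value of `up` at each evaluation
const states = upSamples.map((up, i) => evaluate(i * 60, up === 0));
console.log(states.join(' '));
```

In this run the first outage recovers after two evaluations and never fires; only the second outage, which lasts the full five minutes, reaches the firing state.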
          

PromQL: Prometheus Query Language

PromQL is a powerful functional query language that allows you to select and aggregate time series data.

Basic Queries


# Simple metric selection
http_requests_total

# With label matcher
http_requests_total{status="200", method="GET"}

# Negative matching
http_requests_total{status!="500"}

# Regex matching
http_requests_total{path=~"/api/.*"}

# Range vector (last 5 minutes)
http_requests_total[5m]

# Offset (5 minutes ago)
http_requests_total offset 5m
          

Functions and Operators

PromQL includes numerous functions for time series analysis:


# Rate of increase (for counters)
rate(http_requests_total[5m])

# Increase over time period
increase(http_requests_total[1h])

# Aggregation functions
sum(rate(http_requests_total[5m])) by (status)
avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)
max(node_memory_MemFree_bytes) by (instance)
min(node_filesystem_free_bytes) by (mountpoint)

# Percentiles from histograms
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Arithmetic operators
(node_memory_MemTotal_bytes - node_memory_MemFree_bytes) / node_memory_MemTotal_bytes * 100

# Comparison operators for thresholds
rate(http_requests_total{status=~"5.."}[5m]) > 1
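
To see roughly what rate() computes from a counter, here is a simplified sketch. It sums the increases between consecutive samples, treats any drop in value as a counter reset, and divides by the elapsed time. This is a simplification: real Prometheus also extrapolates to the edges of the window, which this sketch leaves out. simpleRate and the sample data are made up for illustration.

```javascript
// Simplified sketch of rate(): per-second increase of a counter over a window,
// treating any drop in value as a counter reset.
// Assumption: no extrapolation to the window edges, which real Prometheus performs.
function simpleRate(samples) {
  if (samples.length < 2) return null; // need at least two points
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].value - samples[i - 1].value;
    // After a reset the counter restarts from zero, so the new value is the increase.
    increase += delta >= 0 ? delta : samples[i].value;
  }
  const seconds = samples[samples.length - 1].t - samples[0].t;
  return increase / seconds;
}

// Samples 15 s apart; the process restarts between t=30 and t=45.
const windowSamples = [
  { t: 0, value: 100 },
  { t: 15, value: 130 },
  { t: 30, value: 160 },
  { t: 45, value: 20 },
  { t: 60, value: 50 },
];
console.log(simpleRate(windowSamples)); // 110 / 60 requests per second
```

This is why counters should be queried through rate() or increase() rather than read directly: the raw value drops to zero on every restart, but the computed rate stays meaningful.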
          

Common Query Patterns

Here are some useful patterns for monitoring:


# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# Request rate by path
sum(rate(http_requests_total[5m])) by (path)

# 95th percentile response time
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# CPU usage per instance
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# Disk space usage percentage
(1 - node_filesystem_free_bytes / node_filesystem_size_bytes) * 100

# Container CPU usage in Kubernetes
# (Kubernetes 1.16+ label names; older clusters use container_name and pod_name)
sum(rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m])) by (pod)

# Container memory usage in Kubernetes
sum(container_memory_usage_bytes{container!="POD",container!=""}) by (pod)
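
The 95th percentile pattern above relies on histogram_quantile(), which estimates a quantile from bucket counts rather than from the raw observations. The sketch below shows the core idea under simplifying assumptions: buckets are cumulative, sorted by upper bound, and end with +Inf. quantileFromBuckets is a hypothetical helper and the bucket counts are made up.

```javascript
// Simplified sketch of histogram_quantile(): find the bucket containing the
// target rank, then interpolate linearly inside it.
// Assumptions: buckets are cumulative, sorted by upper bound, and end with +Inf.
function quantileFromBuckets(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const index = buckets.findIndex((bucket) => bucket.count >= rank);

  // A rank that lands in the +Inf bucket is reported as the highest finite bound.
  if (index === buckets.length - 1) return buckets[buckets.length - 2].le;

  const lowerBound = index === 0 ? 0 : buckets[index - 1].le;
  const below = index === 0 ? 0 : buckets[index - 1].count;
  const inBucket = buckets[index].count - below;
  return lowerBound + (buckets[index].le - lowerBound) * ((rank - below) / inBucket);
}

const buckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 98 },
  { le: Infinity, count: 100 },
];
console.log(quantileFromBuckets(0.95, buckets)); // falls in the 0.5 to 1 bucket
console.log(quantileFromBuckets(0.5, buckets));
```

Because the result is interpolated, its accuracy depends on where the bucket boundaries sit. If a latency threshold matters to you, it helps to have a bucket boundary at or near it.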
          

Understanding Grafana

What is Grafana?

Grafana is an open-source platform for monitoring and observability that enables you to query, visualize, alert on, and understand your metrics no matter where they are stored.

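Key features of Grafana include dashboards built from configurable panels, support for many data sources, template variables, alerting, and provisioning from configuration files.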

Grafana Architecture

Grafana follows a simple client-server architecture:

flowchart TD
    subgraph "Grafana Server"
        DB[Database] <--> BE[Backend Services]
        BE <--> API[HTTP API]
        BE <--> DS[Data Source Plugins]
        BE <--> AN[Alerting Engine]
    end
    subgraph "Data Sources"
        DS --> Prometheus
        DS --> Elasticsearch
        DS --> InfluxDB
        DS --> MySQL
        DS --> Other["Other Sources..."]
    end
    subgraph "Clients"
        UI[Web UI] <--> API
        MP[Mobile Apps] <--> API
        EX[External Systems] <--> API
    end
    subgraph "Notifications"
        AN --> Email
        AN --> Slack
        AN --> Webhooks
        AN --> PagerDuty
    end

Data Sources

Grafana can connect to various data sources, including Prometheus, Loki, Elasticsearch, InfluxDB, and MySQL.

This multi-source capability is one of Grafana's greatest strengths, allowing you to create unified dashboards that combine data from different systems.

Setting Up Grafana

Installation with Docker

Similar to Prometheus, Grafana can be easily deployed with Docker:


# Docker run command
docker run -d \
  -p 3000:3000 \
  --name grafana \
  -v grafana-storage:/var/lib/grafana \
  grafana/grafana
          

Docker Compose for Prometheus and Grafana

Here's a complete Docker Compose setup for the Prometheus-Grafana stack:


version: '3'
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    ports:
      - "9090:9090"
    restart: always

  alertmanager:
    image: prom/alertmanager
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    ports:
      - "9093:9093"
    restart: always

  node-exporter:
    image: prom/node-exporter
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    restart: always

  grafana:
    image: grafana/grafana
    depends_on:
      - prometheus
    ports:
      - "3000:3000"
    volumes:
      - grafana-storage:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin  # Change this in production!
      - GF_USERS_ALLOW_SIGN_UP=false
    restart: always

volumes:
  grafana-storage:
          

Initial Configuration

After installation, access Grafana at http://localhost:3000 (default credentials: admin/admin).

  1. Add Prometheus data source:
    • Go to Configuration > Data Sources > Add data source (Connections > Data sources in recent Grafana versions)
    • Select Prometheus
    • Set URL to http://prometheus:9090 (in Docker setup)
    • Click "Save & Test"
  2. Add Loki data source (for logs):
    • Go to Configuration > Data Sources > Add data source (Connections > Data sources in recent Grafana versions)
    • Select Loki
    • Set URL to http://loki:3100 (if you have Loki running)
    • Click "Save & Test"

Grafana Provisioning

For automated setup, Grafana supports provisioning through YAML files:


# ./grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

# ./grafana/provisioning/dashboards/dashboard.yml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards/json
          

Building Grafana Dashboards

Dashboard Structure

Grafana dashboards are composed of:

Panel Types

Grafana offers various panel types for visualizing different kinds of data:

Dashboard Variables

Variables make dashboards dynamic and reusable:


# Example variable query for Prometheus
label_values(node_cpu_seconds_total, instance)
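
Conceptually, label_values(metric, label) gives the dropdown the distinct values of one label across all series of a metric. The sketch below imitates that over made-up sample data. labelValues is a hypothetical helper, and the variable name instance is an assumption about what you would call the dashboard variable.

```javascript
// Minimal sketch of what label_values(metric, label) yields for a dropdown:
// the distinct values of one label across all series of a metric.
// The series list is made-up sample data.
function labelValues(series, metricName, label) {
  const values = series
    .filter((s) => s.name === metricName && label in s.labels)
    .map((s) => s.labels[label]);
  return [...new Set(values)].sort();
}

const knownSeries = [
  { name: 'node_cpu_seconds_total', labels: { instance: 'node-b:9100', mode: 'idle' } },
  { name: 'node_cpu_seconds_total', labels: { instance: 'node-a:9100', mode: 'idle' } },
  { name: 'node_cpu_seconds_total', labels: { instance: 'node-a:9100', mode: 'user' } },
  { name: 'node_load1', labels: { instance: 'node-c:9100' } },
];

const instances = labelValues(knownSeries, 'node_cpu_seconds_total', 'instance');
console.log(instances); // [ 'node-a:9100', 'node-b:9100' ]

// A panel query then references the variable, and Grafana substitutes the selected value:
// rate(node_cpu_seconds_total{instance="$instance"}[5m])
const chosen = instances[0];
console.log(`rate(node_cpu_seconds_total{instance="${chosen}"}[5m])`);
```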
          

Example Dashboards

Let's look at some essential dashboards you should create:

Infrastructure Overview


# CPU Usage Query
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory Usage Query
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# Disk Usage Query
(1 - node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"}) * 100

# Network Traffic Query
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])

# Load Average Query
node_load1
node_load5
node_load15
          

Application Performance Dashboard


# Request Rate Query
sum(rate(http_requests_total[5m])) by (path)

# Error Rate Query
sum(rate(http_requests_total{status=~"5.."}[5m])) by (path) / sum(rate(http_requests_total[5m])) by (path) * 100

# Response Time Percentiles Query
histogram_quantile(0.5, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path))
histogram_quantile(0.90, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path))

# Active Sessions Query
active_sessions

# Database Query Performance
histogram_quantile(0.95, sum(rate(database_query_duration_seconds_bucket[5m])) by (le, query_type))
          

Business Metrics Dashboard


# User Sign-ups Query
increase(user_signups_total[1d])

# Orders Query
sum(rate(orders_placed_total[1h])) * 3600

# Revenue Query
sum(rate(order_value_total[1h])) by (product_category) * 3600

# Active Users Query
active_users

# Conversion Rate Query
sum(rate(checkout_completed_total[1h])) / sum(rate(checkout_started_total[1h])) * 100
          

Alerting with Prometheus and Grafana

Prometheus Alerting

Prometheus follows a two-step alerting process:

  1. Alert Rules: Define conditions in Prometheus that trigger alerts
  2. Alertmanager: Handles deduplication, grouping, and routing of alerts

# Example Alertmanager configuration (alertmanager.yml)
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.org:587'
  smtp_from: 'alertmanager@example.org'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'team-emails'
  routes:
  - match:
      severity: critical
    receiver: 'pager'
  - match:
      severity: warning
    receiver: 'team-emails'

receivers:
- name: 'team-emails'
  email_configs:
  - to: 'team@example.org'

- name: 'pager'
  pagerduty_configs:
  - service_key: ''
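
To make the routing tree above less abstract, the sketch below walks a few alerts through it. It assumes a simplified model: equality matchers only, no nested routes, and no continue flag. The first child route whose labels all match wins; otherwise the root receiver is used. Alerts that share the same group_by label values end up in one notification. pickReceiver and groupKey are hypothetical helpers, and the alerts are made-up sample data.

```javascript
// Minimal sketch of the routing tree above: the first child route whose
// matchers all equal the alert's labels wins; otherwise the root receiver is used.
// Assumption: equality matchers only, no nested routes, no `continue`.
const route = {
  receiver: 'team-emails',
  group_by: ['alertname', 'job'],
  routes: [
    { match: { severity: 'critical' }, receiver: 'pager' },
    { match: { severity: 'warning' }, receiver: 'team-emails' },
  ],
};

function pickReceiver(alertLabels) {
  const child = route.routes.find((candidate) =>
    Object.entries(candidate.match).every(([key, value]) => alertLabels[key] === value)
  );
  return child ? child.receiver : route.receiver;
}

// Alerts sharing the same group_by label values are batched into one notification.
function groupKey(alertLabels) {
  return route.group_by.map((label) => `${label}=${alertLabels[label] ?? ''}`).join(',');
}

const alerts = [
  { alertname: 'InstanceDown', job: 'node', instance: 'node-a:9100', severity: 'critical' },
  { alertname: 'InstanceDown', job: 'node', instance: 'node-b:9100', severity: 'critical' },
  { alertname: 'HighCPULoad', job: 'node', instance: 'node-a:9100', severity: 'warning' },
];

const notifications = new Map();
for (const alert of alerts) {
  const key = `${pickReceiver(alert)} | ${groupKey(alert)}`;
  notifications.set(key, [...(notifications.get(key) ?? []), alert.instance]);
}
console.log(notifications);
```

Here the two InstanceDown alerts collapse into a single page listing both instances, which is the effect grouping is meant to have during a wider outage.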
          

Grafana Alerting

Grafana also has its own alerting system that works directly with dashboards:

  1. Alert Rules: Created from dashboard panels
    • Define conditions based on query results
    • Set evaluation intervals and periods
    • Add custom annotations for context
  2. Notification Channels: Configure destinations for alerts
    • Email, Slack, PagerDuty, webhook, etc.
    • Support for custom templates
  3. Alerting UI: Central place to view and manage alerts
    • Current status of all alerts
    • History of alert state changes
    • Silence and acknowledgment functions

Alert Best Practices

Integration with Logging

Combining metrics and logs provides a more complete observability solution:

Bringing It All Together

flowchart TD
    subgraph "Metrics Collection"
        A[Applications] --> PE[Prometheus Exporters]
        PE --> P[Prometheus]
    end
    subgraph "Log Collection"
        A --> LS[Log Shippers]
        LS --> L[Loki/Elasticsearch]
    end
    subgraph "Tracing"
        A --> T[Jaeger/Zipkin]
    end
    subgraph "Visualization & Analysis"
        P --> G[Grafana]
        L --> G
        T --> G
    end
    P --> AM[Alertmanager]
    AM --> N[Notifications]
    G --> GN[Grafana Notifications]

Correlating Metrics and Logs

Use these techniques to correlate metrics and logs:

Grafana Dashboard with Metrics and Logs

Create dashboards that show metrics with relevant logs:


# Example of adding a data link in Grafana from a metrics panel to logs
# 1. Create a variable for the instance/service
# 2. Add a data link to the panel with:
#    URL: /explore?orgId=1&left={"datasource":"Loki","queries":[{"expr":"{service=\"$service\"}"}]}
#    Link text: View logs for $service
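
The URL in the example above embeds raw JSON in the query string, which is fragile once quotes and braces are involved. As a hedged sketch, the snippet below builds the same kind of link with the JSON state URL-encoded. It assumes the data source is named "Loki" and that logs carry a service label, as in the example; buildExploreLink is a hypothetical helper, and the exact Explore URL format can differ between Grafana versions.

```javascript
// Minimal sketch: build the data-link URL with the JSON state URL-encoded.
// Assumptions: the data source is named "Loki" and logs carry a `service` label,
// as in the example above; buildExploreLink is a hypothetical helper.
function buildExploreLink(service, orgId = 1) {
  const left = {
    datasource: 'Loki',
    queries: [{ expr: `{service="${service}"}` }],
  };
  const params = new URLSearchParams({ orgId: String(orgId), left: JSON.stringify(left) });
  return `/explore?${params.toString()}`;
}

const link = buildExploreLink('web-app');
console.log(link);

// Decoding the parameter recovers the original query.
const decoded = JSON.parse(new URLSearchParams(link.split('?')[1]).get('left'));
console.log(decoded.queries[0].expr); // {service="web-app"}
```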
          

Scaling the Monitoring Stack

Prometheus Scaling Challenges

Prometheus has some limitations when scaling:

Scaling Strategies

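Address scaling challenges with approaches such as federation, where a global Prometheus server aggregates selected series from the per-data-center servers: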

flowchart TD
    subgraph "Data Center 1"
        P1[Prometheus] --> A1[Apps]
        P1 --> I1[Infrastructure]
    end
    subgraph "Data Center 2"
        P2[Prometheus] --> A2[Apps]
        P2 --> I2[Infrastructure]
    end
    subgraph "Global View"
        FP[Federation Prometheus]
    end
    P1 --> FP
    P2 --> FP
    FP --> G[Grafana]

Advanced Solutions

For large-scale environments, consider these solutions:

Practical Exercise

Let's put these concepts into practice:

  1. Set up the Prometheus-Grafana stack:
    • Use the Docker Compose configuration provided earlier
    • Configure basic alert rules for system metrics
    • Configure Alertmanager for email notifications
  2. Instrument an application:
    • Choose a web application in your preferred language (Node.js, Python, Java)
    • Add instrumentation using the appropriate client library
    • Expose metrics on a /metrics endpoint
    • Configure Prometheus to scrape the application
  3. Create comprehensive dashboards:
    • Infrastructure overview dashboard
    • Application performance dashboard
    • Business metrics dashboard
    • Add variables for dynamic filtering
    • Set up data links between metrics and logs
  4. Implement alerting:
    • Create alert rules for critical conditions
    • Configure different notification channels based on severity
    • Test alerts to ensure they work correctly

Best Practices for Production

Further Learning Resources

Summary

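In this lecture, we've explored the powerful combination of Prometheus and Grafana for monitoring: the Prometheus architecture, data model, and PromQL; setting up both tools with Docker; building dashboards; alerting; correlating metrics with logs; and scaling the stack.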

Remember that monitoring is not a set-it-and-forget-it task. It requires continuous refinement and adaptation as your systems evolve. The goal is not just to detect issues after they occur, but to identify trends and potential problems before they impact users.

By implementing the strategies and practices we've discussed in this module, you'll be well-equipped to maintain reliable, performant systems in production.