Prometheus and Grafana Monitoring Stack

Introduction to Prometheus and Grafana

In our previous lectures, we explored application monitoring strategies and centralized logging. Now, we'll focus on implementing a complete monitoring solution using two powerful, open-source tools: Prometheus and Grafana.

This combination has become the de facto standard for cloud-native monitoring, especially in Kubernetes environments. If metrics and logs are the foundation of observability, the Prometheus-Grafana stack provides the tools to collect, store, analyze, and visualize this data effectively.

flowchart TD
  subgraph "Data Collection"
    A[Application Metrics] --> P[Prometheus]
    B[Node Metrics] --> P
    C[Kubernetes Metrics] --> P
    D[External System Metrics] --> P
    E[Alert Rules] --> P
  end
  subgraph "Visualization & Alerting"
    P --> G[Grafana]
    P --> AM[Alertmanager]
    L[Loki] --> G
    Other[Other Data Sources] --> G
    AM --> Emails[Email Notifications]
    AM --> Slack[Slack Notifications]
    AM --> PD[PagerDuty Alerts]
  end
  subgraph "Consumers"
    G --> U[Users/Dashboards]
    G --> Auto[Automated Systems]
  end

Think of Prometheus and Grafana as complementary parts of a complete monitoring solution:

Prometheus is like your home's electrical meter, continuously measuring and recording data, and detecting when values exceed thresholds.
Grafana is like your smart home dashboard, visualizing that data in meaningful ways, and pulling in data from multiple sources to give you a complete picture.

Understanding Prometheus

What is Prometheus?

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud, now a graduated project of the Cloud Native Computing Foundation (CNCF).

Key features of Prometheus include:

Time-series database for storing metric data
Flexible query language (PromQL) for data analysis
Pull-based model for collecting metrics
Service discovery for dynamic environments
Built-in alerting based on metric thresholds
Multi-dimensional data model with key-value pairs (labels)

Prometheus Architecture

Let's examine the core components of the Prometheus architecture:

flowchart TD
  subgraph "Prometheus Server"
    S[Scraper] --> TSDB[Time Series Database]
    TSDB --> Q[PromQL Query Engine]
    SD[Service Discovery] --> S
    AR[Alert Rules] --> TSDB
  end
  subgraph "Targets"
    T1[Application Exporters] --> S
    T2[Node Exporter] --> S
    T3[Kubernetes Components] --> S
    T4[Other Exporters] --> S
  end
  subgraph "Alert Management"
    AR --> AM[Alertmanager]
    AM --> NM[Notification Methods]
  end
  subgraph "Visualization"
    Q --> WebUI[Prometheus Web UI]
    Q --> G[Grafana]
  end

How Prometheus Works

Service Discovery: Prometheus discovers targets to monitor (static configuration, Kubernetes API, Consul, etc.)
Data Collection (Scraping): Prometheus pulls metrics from HTTP endpoints exposed by applications and exporters
Storage: Scraped samples are written to Prometheus's local on-disk time-series database (TSDB)
Querying: PromQL allows powerful querying of collected metrics
Alerting: Prometheus evaluates alert rules and sends firing alerts to Alertmanager
Alertmanager: Handles alert deduplication, grouping, and routing to notification channels

A real-world analogy for Prometheus is a health monitoring system in a hospital:

Nurses (scrapers) collect vital signs (metrics) from patients (targets) at regular intervals
Patient records (TSDB) store the history of these measurements
Doctors (PromQL) analyze the data to diagnose issues
Monitors (alerting) sound alarms when readings exceed safe thresholds
The nursing station (Alertmanager) ensures the right medical personnel are notified
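
To make the scraping step more concrete, here is a minimal sketch of what a target returns when Prometheus pulls its metrics endpoint: plain text with HELP and TYPE comment lines followed by one line per series. The helper name renderCounter and the sample values are made up for illustration, and label-value escaping is left out; in practice a client library such as prom-client produces this output for you.

```javascript
// Hypothetical helper: render one counter family in the Prometheus text exposition format.
// Simplification: label values are not escaped, so avoid quotes and backslashes in them.
function renderCounter(name, help, samples) {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of samples) {
    const pairs = Object.entries(labels)
      .map(([key, val]) => `${key}="${val}"`)
      .join(',');
    lines.push(pairs ? `${name}{${pairs}} ${value}` : `${name} ${value}`);
  }
  return lines.join('\n') + '\n';
}

// Illustrative values only.
const scrapeBody = renderCounter('http_requests_total', 'Total number of HTTP requests', [
  { labels: { method: 'GET', status: '200' }, value: 1027 },
  { labels: { method: 'POST', status: '500' }, value: 3 },
]);

console.log(scrapeBody);
```

Each scrape reads the current values, and Prometheus attaches the timestamp itself, which is why targets only need to expose their latest state.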

Prometheus Data Model and Metrics Types

The Time-Series Data Model

Prometheus stores all data as time series: streams of timestamped values belonging to the same metric and set of labeled dimensions.

Every time series is uniquely identified by:

Metric name: Describes what is being measured (e.g., http_requests_total)
Labels: Key-value pairs providing dimensions (e.g., method="GET", status="200", path="/api/users")

This results in a time series identifier like:

http_requests_total{method="GET", status="200", path="/api/users"}
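
One practical consequence of this model is that every distinct combination of label values is its own time series. The following sketch, with a hypothetical seriesKey helper and made-up label sets, shows that label order does not matter while a new label value does create a new series.

```javascript
// Build a canonical identifier: labels are sorted so that key order does not matter.
function seriesKey(name, labels) {
  const pairs = Object.keys(labels)
    .sort()
    .map((key) => `${key}="${labels[key]}"`);
  return `${name}{${pairs.join(', ')}}`;
}

// Illustrative label sets as they might be recorded by an application.
const observedLabelSets = [
  { method: 'GET', status: '200', path: '/api/users' },
  { status: '200', path: '/api/users', method: 'GET' }, // same series, different key order
  { method: 'GET', status: '500', path: '/api/users' }, // new status value, so a new series
];

const distinctSeries = new Set(
  observedLabelSets.map((labels) => seriesKey('http_requests_total', labels))
);

console.log([...distinctSeries]);
console.log(distinctSeries.size); // 2
```

This is why labels with many possible values (user IDs, raw URLs) should be used with care: each value multiplies the number of series Prometheus has to store.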

Metric Types

Prometheus supports four core metric types:

classDiagram
  class Counter {
    +Monotonically increasing value
    +Only goes up or resets to zero
    +Example: http_requests_total
  }
  class Gauge {
    +Value can go up and down
    +Snapshot of current state
    +Example: memory_usage_bytes
  }
  class Histogram {
    +Samples observations in buckets
    +Calculates sums and counts
    +Example: http_request_duration_seconds
  }
  class Summary {
    +Similar to histogram
    +Tracks quantiles over sliding time window
    +Example: http_request_duration_quantiles
  }

Counter: A cumulative metric that represents a single monotonically increasing counter
- Example: Number of requests, errors, or completed tasks
- Only increases or resets to zero (e.g., on restart)
- Typically used with the rate() function to calculate the per-second rate of change
Gauge: A metric that represents a single numerical value that can go up and down
- Example: Memory usage, CPU utilization, active connections
- Snapshot of current state
- Can be used directly or with functions like min(), max(), avg()
Histogram: Samples observations and counts them in configurable buckets
- Example: Request durations, response sizes
- Automatically generates multiple time series:
  - Cumulative count of observations in each bucket (_bucket)
  - Sum of all observed values (_sum)
  - Count of events observed (_count)
- Used with the histogram_quantile() function to calculate quantiles at query time
Summary: Similar to histogram, but calculates configurable quantiles over a sliding time window
- Example: Request duration percentiles
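- Pre-calculates quantiles in the client, unlike histograms, which calculate them at query time
- Pre-calculated quantiles cannot be meaningfully aggregated across instances, so prefer histograms when you need a fleet-wide percentile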
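
To see why a single histogram turns into several time series, consider this minimal in-memory sketch. The SketchHistogram class is purely illustrative and is not how prom-client is implemented; it assumes cumulative buckets with an implicit +Inf bucket, which matches what Prometheus expects on the wire.

```javascript
// Illustrative only: a tiny histogram that tracks cumulative buckets, a sum and a count.
class SketchHistogram {
  constructor(upperBounds) {
    // Bucket bounds sorted ascending, with the implicit +Inf bucket appended.
    this.bounds = [...upperBounds].sort((a, b) => a - b).concat(Infinity);
    this.bucketCounts = this.bounds.map(() => 0);
    this.sum = 0;
    this.count = 0;
  }

  observe(value) {
    // Buckets are cumulative: an observation counts in every bucket whose bound is >= value.
    this.bounds.forEach((bound, i) => {
      if (value <= bound) this.bucketCounts[i] += 1;
    });
    this.sum += value;
    this.count += 1;
  }

  // Return the series this histogram would expose, keyed by series identifier.
  series(name) {
    const out = {};
    this.bounds.forEach((bound, i) => {
      const le = bound === Infinity ? '+Inf' : String(bound);
      out[`${name}_bucket{le="${le}"}`] = this.bucketCounts[i];
    });
    out[`${name}_sum`] = this.sum;
    out[`${name}_count`] = this.count;
    return out;
  }
}

const latencyHistogram = new SketchHistogram([0.1, 0.5, 1]);
[0.05, 0.3, 0.7, 2].forEach((seconds) => latencyHistogram.observe(seconds));

const latencySeries = latencyHistogram.series('http_request_duration_seconds');
console.log(latencySeries);
```

With three configured buckets the histogram exposes six series (four buckets including +Inf, plus _sum and _count), and that is per label combination, which is worth keeping in mind when choosing bucket counts.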

Examples in Code


// Node.js example with prom-client and Express
const express = require('express');
const prometheus = require('prom-client');

const app = express();

// Create a Registry to register metrics
const register = new prometheus.Registry();

// Counter example
const httpRequestsTotal = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'status', 'path'],
  registers: [register]
});

// Gauge example
const activeSessions = new prometheus.Gauge({
  name: 'active_sessions',
  help: 'Current number of active user sessions',
  registers: [register]
});

// Histogram example
const httpRequestDurationSeconds = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path'],
  buckets: [0.1, 0.3, 0.5, 1, 3, 5, 10], // in seconds
  registers: [register]
});

// Summary example
const httpRequestDurationSummary = new prometheus.Summary({
  name: 'http_request_duration_summary',
  help: 'HTTP request duration summary',
  labelNames: ['method', 'path'],
  percentiles: [0.5, 0.9, 0.95, 0.99],
  registers: [register]
});

// Usage in Express middleware
// Note: req.path can create many series if URLs embed IDs; prefer a route template where available
app.use((req, res, next) => {
  // Update active sessions gauge (example)
  activeSessions.set(getActiveSessions()); // getActiveSessions() is your function to get the count

  // Start timers for the histogram and the summary
  const end = httpRequestDurationSeconds.startTimer({
    method: req.method,
    path: req.path
  });
  const endSummary = httpRequestDurationSummary.startTimer({
    method: req.method,
    path: req.path
  });

  res.on('finish', () => {
    // Count the request once the response is sent, so the final status code is known
    httpRequestsTotal.inc({
      method: req.method,
      path: req.path,
      status: res.statusCode
    });

    // End timers when the response is sent
    end();
    endSummary();
  });

  next();
});

// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Listen on the port that Prometheus is configured to scrape
app.listen(3000);

Setting Up Prometheus

Installation with Docker

The easiest way to get started with Prometheus is using Docker:


# Docker run command
docker run -d \
  -p 9090:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  --name prometheus \
  prom/prometheus

Basic Prometheus Configuration

Here's a simple prometheus.yml configuration file:


global:
  scrape_interval: 15s    # How often to scrape targets
  evaluation_interval: 15s # How often to evaluate rules

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager:9093

# Rule files to load
rule_files:
  - "rules/*.yml"

# Scrape configuration
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

  # Node Exporter for host metrics
  - job_name: 'node'
    static_configs:
    - targets: ['node-exporter:9100']

  # Application metrics
  - job_name: 'web-app'
    static_configs:
    - targets: ['web-app:3000']

Service Discovery

For dynamic environments, static configuration isn't sufficient. Prometheus supports several service discovery mechanisms:


# Kubernetes service discovery example
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with prometheus.io/scrape=true annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use the port specified in prometheus.io/port annotation if present
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Copy the pod's "app" label onto the scraped series
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: replace
        target_label: app
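
The keep action in the relabeling rules above can be thought of as a filter over discovered targets. The sketch below mimics that behavior in plain JavaScript; keepTargets and the pod names are hypothetical, and it assumes (as Prometheus does) that the regex must match the whole label value and that a missing label behaves like an empty string.

```javascript
// Keep only targets whose source label fully matches the pattern.
function keepTargets(targets, sourceLabel, pattern) {
  // Prometheus anchors relabel regexes at both ends, so we do the same here.
  const anchored = new RegExp(`^(?:${pattern})$`);
  return targets.filter((labels) => anchored.test(labels[sourceLabel] ?? ''));
}

// Made-up discovery results: one label object per pod.
const discoveredPods = [
  { __meta_kubernetes_pod_name: 'api-1', __meta_kubernetes_pod_annotation_prometheus_io_scrape: 'true' },
  { __meta_kubernetes_pod_name: 'batch-1', __meta_kubernetes_pod_annotation_prometheus_io_scrape: 'false' },
  { __meta_kubernetes_pod_name: 'cache-1' }, // no annotation at all
];

const keptPods = keepTargets(
  discoveredPods,
  '__meta_kubernetes_pod_annotation_prometheus_io_scrape',
  'true'
);

console.log(keptPods.map((labels) => labels.__meta_kubernetes_pod_name)); // [ 'api-1' ]
```

Pods without the annotation are dropped just like pods that set it to false, so scraping stays opt-in.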

Alert Rules

Alert rules define conditions that should trigger alerts:


# rules/alerts.yml
groups:
- name: example
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

  - alert: HighCPULoad
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High CPU load on {{ $labels.instance }}"
      description: "CPU load is above 80% for more than 10 minutes on {{ $labels.instance }}"

  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage on {{ $labels.instance }}"
      description: "Memory usage is above 90% for more than 5 minutes on {{ $labels.instance }}"

  - alert: HighRequestLatency
    expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High request latency"
      description: "95th percentile request latency is above 1 second for more than 5 minutes"
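
The for clause is easy to misread, so here is a small sketch of how an alert moves between states across successive rule evaluations. The function name alertStates and the one-minute evaluation interval are assumptions for the example; the logic follows the documented behavior that an alert stays pending until its expression has been true for the full for duration, and that a single false evaluation resets it.

```javascript
// Walk through successive rule evaluations and report the alert state after each one.
function alertStates(results, intervalSeconds, forSeconds) {
  let activeSince = null; // evaluation time at which the condition first became true
  return results.map((conditionTrue, i) => {
    const now = i * intervalSeconds;
    if (!conditionTrue) {
      activeSince = null; // any false evaluation resets the pending timer
      return 'inactive';
    }
    if (activeSince === null) activeSince = now;
    return now - activeSince >= forSeconds ? 'firing' : 'pending';
  });
}

// Condition evaluated once a minute with "for: 5m" (300 seconds).
const evaluationResults = [true, true, false, true, true, true, true, true, true];
const stateTimeline = alertStates(evaluationResults, 60, 300);

console.log(stateTimeline);
// [ 'pending', 'pending', 'inactive', 'pending', 'pending',
//   'pending', 'pending', 'pending', 'firing' ]
```

The brief recovery at the third evaluation restarts the clock, which is how for filters out short-lived blips.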

PromQL: Prometheus Query Language

PromQL is a powerful functional query language that allows you to select and aggregate time series data.

Basic Queries

Instant vector selector: Selects time series at a single point in time
Range vector selector: Selects time series over a range of time
Offset modifier: Selects data from some time ago


# Simple metric selection
http_requests_total

# With label matcher
http_requests_total{status="200", method="GET"}

# Negative matching
http_requests_total{status!="500"}

# Regex matching
http_requests_total{path=~"/api/.*"}

# Range vector (last 5 minutes)
http_requests_total[5m]

# Offset (5 minutes ago)
http_requests_total offset 5m

Functions and Operators

PromQL includes numerous functions for time series analysis:


# Rate of increase (for counters)
rate(http_requests_total[5m])

# Increase over time period
increase(http_requests_total[1h])

# Aggregation functions
sum(rate(http_requests_total[5m])) by (status)
avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)
max(node_memory_MemFree_bytes) by (instance)
min(node_filesystem_free_bytes) by (mountpoint)

# Percentiles from histograms
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Arithmetic operators
(node_memory_MemTotal_bytes - node_memory_MemFree_bytes) / node_memory_MemTotal_bytes * 100

# Comparison operators for thresholds
rate(http_requests_total{status=~"5.."}[5m]) > 1
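
Since rate() appears in almost every query, it helps to see roughly what it computes. The sketch below is a simplified model, not the actual Prometheus implementation: it sums the increases between consecutive samples, treats any drop as a counter reset, and divides by the elapsed time. The real function additionally extrapolates to the edges of the range window, which this sketch leaves out. The name simpleRate and the sample values are made up.

```javascript
// samples: [{ t: secondsSinceStart, v: counterValue }, ...] ordered by time.
function simpleRate(samples) {
  if (samples.length < 2) return null; // a rate needs at least two samples
  let increase = 0;
  for (let i = 1; i < samples.length; i += 1) {
    const delta = samples[i].v - samples[i - 1].v;
    // A drop means the counter was reset (e.g. process restart): count from zero again.
    increase += delta >= 0 ? delta : samples[i].v;
  }
  const elapsedSeconds = samples[samples.length - 1].t - samples[0].t;
  return increase / elapsedSeconds;
}

const samplesInWindow = [
  { t: 0, v: 100 },
  { t: 15, v: 130 },
  { t: 30, v: 30 }, // counter reset between these two scrapes
  { t: 45, v: 60 },
  { t: 60, v: 90 },
];

console.log(simpleRate(samplesInWindow)); // 2 (requests per second)
```

Because resets are handled inside rate(), always apply rate() first and aggregate afterwards, as in sum(rate(...)); summing raw counters first would hide the resets.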

Common Query Patterns

Here are some useful patterns for monitoring:


# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# Request rate by path
sum(rate(http_requests_total[5m])) by (path)

# 95th percentile response time
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# CPU usage per instance
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# Disk space usage percentage
(1 - node_filesystem_free_bytes / node_filesystem_size_bytes) * 100

# Container CPU usage in Kubernetes
# (Kubernetes 1.16+ exposes the labels "container" and "pod"; older clusters used "container_name" and "pod_name")
sum(rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m])) by (pod)

# Container memory usage in Kubernetes
sum(container_memory_usage_bytes{container!="POD",container!=""}) by (pod)
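
The percentile queries above rely on histogram_quantile(), which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. Here is a minimal sketch of that idea for classic histograms, assuming 0 < q <= 1, buckets sorted by upper bound, and a lowest bucket that starts at zero. The function name and the bucket counts are illustrative.

```javascript
// buckets: [{ le: upperBound, count: cumulativeCount }], sorted by le, ending with +Inf.
function quantileFromBuckets(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = buckets.findIndex((bucket) => bucket.count >= rank);
  // If the quantile falls in the +Inf bucket, report the highest finite bound.
  if (i === buckets.length - 1) return buckets[buckets.length - 2].le;
  const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
  const countBelow = i === 0 ? 0 : buckets[i - 1].count;
  const countInBucket = buckets[i].count - countBelow;
  // Assume observations are spread evenly inside the bucket.
  return lowerBound + (buckets[i].le - lowerBound) * ((rank - countBelow) / countInBucket);
}

const durationBuckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 },
];

console.log(quantileFromBuckets(0.95, durationBuckets)); // 0.75
```

The result is only as precise as the bucket layout: here every request between 0.5s and 1s is assumed to be evenly spread, so placing bucket boundaries near your latency targets gives more trustworthy percentiles.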

Understanding Grafana

What is Grafana?

Grafana is an open-source platform for monitoring and observability that enables you to query, visualize, alert on, and understand your metrics no matter where they are stored.

Key features of Grafana include:

Support for multiple data sources (Prometheus, Elasticsearch, InfluxDB, etc.)
Rich visualization options (graphs, tables, heatmaps, etc.)
Interactive dashboards with templating and variables
Alerting system integrated with various notification channels
User authentication and authorization with fine-grained permissions
Extensibility through plugins

Grafana Architecture

Grafana follows a simple client-server architecture:

flowchart TD
  subgraph "Grafana Server"
    DB[Database] <--> BE[Backend Services]
    BE <--> API[HTTP API]
    BE <--> DS[Data Source Plugins]
    BE <--> AN[Alerting Engine]
  end
  subgraph "Data Sources"
    DS --> Prometheus
    DS --> Elasticsearch
    DS --> InfluxDB
    DS --> MySQL
    DS --> Other["Other Sources..."]
  end
  subgraph "Clients"
    UI[Web UI] <--> API
    MP[Mobile Apps] <--> API
    EX[External Systems] <--> API
  end
  subgraph "Notifications"
    AN --> Email
    AN --> Slack
    AN --> Webhooks
    AN --> PagerDuty
  end

Data Sources

Grafana can connect to various data sources, including:

Time Series Databases: Prometheus, InfluxDB, Graphite, OpenTSDB
Logs: Elasticsearch, Loki, CloudWatch Logs
SQL Databases: MySQL, PostgreSQL, Microsoft SQL Server
Cloud Services: AWS CloudWatch, Google Cloud Monitoring, Azure Monitor
Distributed Tracing: Jaeger, Zipkin, Tempo

This multi-source capability is one of Grafana's greatest strengths, allowing you to create unified dashboards that combine data from different systems.

Setting Up Grafana

Installation with Docker

Similar to Prometheus, Grafana can be easily deployed with Docker:


# Docker run command
docker run -d \
  -p 3000:3000 \
  --name grafana \
  -v grafana-storage:/var/lib/grafana \
  grafana/grafana

Docker Compose for Prometheus and Grafana

Here's a complete Docker Compose setup for the Prometheus-Grafana stack:


version: '3'
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    ports:
      - "9090:9090"
    restart: always

  alertmanager:
    image: prom/alertmanager
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    ports:
      - "9093:9093"
    restart: always

  node-exporter:
    image: prom/node-exporter
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    restart: always

  grafana:
    image: grafana/grafana
    depends_on:
      - prometheus
    ports:
      - "3000:3000"
    volumes:
      - grafana-storage:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin  # Change this in production!
      - GF_USERS_ALLOW_SIGN_UP=false
    restart: always

volumes:
  grafana-storage:

Initial Configuration

After installation, access Grafana at http://localhost:3000 (default credentials: admin/admin).

Add Prometheus data source:
- Go to Connections > Data sources > Add new data source (Configuration > Data Sources > Add data source in older Grafana versions)
- Select Prometheus
- Set URL to http://prometheus:9090 (in Docker setup)
- Click "Save & Test"
Add Loki data source (for logs):
- Follow the same navigation path to add another data source
- Select Loki
- Set URL to http://loki:3100 (if you have Loki running)
- Click "Save & Test"

Grafana Provisioning

For automated setup, Grafana supports provisioning through YAML files:


# ./grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

# ./grafana/provisioning/dashboards/dashboard.yml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards/json

Building Grafana Dashboards

Dashboard Structure

Grafana dashboards are composed of:

Panels: Individual visualizations of data
Rows: Logical groupings of panels
Variables: Dynamic values used for templating
Annotations: Event markers on graphs
Time Range: Period of time displayed

Panel Types

Grafana offers various panel types for visualizing different kinds of data:

Time series: For temporal data
Stat: For single value display
Gauge: For values within a range
Bar chart: For comparing categorical data
Table: For tabular data presentation
Logs: For displaying log entries
Heatmap: For showing data density
Node Graph: For service dependencies

Dashboard Variables

Variables make dashboards dynamic and reusable:

Query variables: Values from data source queries
Custom variables: User-defined values
Text box variables: Free-form input
Interval variables: Time ranges
Data source variables: Selection of data sources


# Example variable query for Prometheus
label_values(node_cpu_seconds_total, instance)
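
Once a variable exists, Grafana substitutes its current value into panel queries before sending them to the data source. The sketch below imitates that substitution in a simplified way: the interpolate function is hypothetical, and it assumes multi-value selections are rendered as a regex alternation for use with the =~ matcher. Grafana's real implementation also escapes regex special characters, which is omitted here.

```javascript
// Replace $name and ${name} placeholders; multi-value selections become a regex alternation.
function interpolate(query, variables) {
  return query.replace(/\$\{(\w+)\}|\$(\w+)/g, (match, braced, bare) => {
    const value = variables[braced ?? bare];
    if (value === undefined) return match; // leave unknown variables untouched
    return Array.isArray(value) ? `(${value.join('|')})` : String(value);
  });
}

const panelQuery = 'rate(node_cpu_seconds_total{instance=~"$instance", mode="idle"}[5m])';

// Instance names are made up for the example.
const renderedQuery = interpolate(panelQuery, { instance: ['node-1:9100', 'node-2:9100'] });

console.log(renderedQuery);
// rate(node_cpu_seconds_total{instance=~"(node-1:9100|node-2:9100)", mode="idle"}[5m])
```

Using =~ rather than = in the panel query is what allows the same dashboard to work for one selected instance or several.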

Example Dashboards

Let's look at some essential dashboards you should create:

Infrastructure Overview

CPU usage by node
Memory usage by node
Disk usage by node and mount point
Network traffic by node
System load average


# CPU Usage Query
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory Usage Query
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# Disk Usage Query
(1 - node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"}) * 100

# Network Traffic Query
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])

# Load Average Query
node_load1
node_load5
node_load15

Application Performance Dashboard

Request rate by endpoint
Error rate by endpoint
Response time percentiles
Active sessions
Database query performance


# Request Rate Query
sum(rate(http_requests_total[5m])) by (path)

# Error Rate Query
sum(rate(http_requests_total{status=~"5.."}[5m])) by (path) / sum(rate(http_requests_total[5m])) by (path) * 100

# Response Time Percentiles Query
histogram_quantile(0.5, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path))
histogram_quantile(0.90, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path))

# Active Sessions Query
active_sessions

# Database Query Performance
histogram_quantile(0.95, sum(rate(database_query_duration_seconds_bucket[5m])) by (le, query_type))

Business Metrics Dashboard

User sign-ups per day
Orders placed per hour
Revenue by product category
Active users
Conversion rate


# User Sign-ups Query
increase(user_signups_total[1d])

# Orders Query
sum(rate(orders_placed_total[1h])) * 3600

# Revenue Query
sum(rate(order_value_total[1h])) by (product_category) * 3600

# Active Users Query
active_users

# Conversion Rate Query
sum(rate(checkout_completed_total[1h])) / sum(rate(checkout_started_total[1h])) * 100

Alerting with Prometheus and Grafana

Prometheus Alerting

Prometheus follows a two-step alerting process:

Alert Rules: Define conditions in Prometheus that trigger alerts
Alertmanager: Handles deduplication, grouping, and routing of alerts


# Example Alertmanager configuration (alertmanager.yml)
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.org:587'
  smtp_from: 'alertmanager@example.org'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'team-emails'
  routes:
  - match:
      severity: critical
    receiver: 'pager'
  - match:
      severity: warning
    receiver: 'team-emails'

receivers:
- name: 'team-emails'
  email_configs:
  - to: 'team@example.org'

- name: 'pager'
  pagerduty_configs:
  - service_key: ''  # Placeholder: set this to your PagerDuty integration key
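
To clarify how the route tree in this configuration picks a receiver, here is a simplified model: Alertmanager walks the child routes in order, takes the first one whose matchers all agree with the alert's labels, and falls back to the parent's receiver when none match. The pickReceiver function is a hypothetical sketch that handles only equality matches and ignores options such as continue.

```javascript
// Return the receiver of the first (deepest) route whose match labels all equal the alert's labels.
function pickReceiver(route, labels) {
  for (const child of route.routes ?? []) {
    const matches = Object.entries(child.match ?? {}).every(
      ([key, value]) => labels[key] === value
    );
    // A child without its own receiver inherits the parent's.
    if (matches) return pickReceiver({ receiver: route.receiver, ...child }, labels);
  }
  return route.receiver;
}

// Mirrors the route section of the alertmanager.yml above.
const rootRoute = {
  receiver: 'team-emails',
  routes: [
    { match: { severity: 'critical' }, receiver: 'pager' },
    { match: { severity: 'warning' }, receiver: 'team-emails' },
  ],
};

console.log(pickReceiver(rootRoute, { alertname: 'InstanceDown', severity: 'critical' })); // pager
console.log(pickReceiver(rootRoute, { alertname: 'DiskInfo', severity: 'info' })); // team-emails
```

An alert with an unexpected severity still reaches someone through the root receiver, which is a good reason to keep that default pointed at a monitored channel.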

Grafana Alerting

Grafana also has its own alerting system, which can evaluate queries against any supported data source:

Alert Rules: Created in the Alerting section or starting from a dashboard panel
- Define conditions based on query results
- Set evaluation intervals and periods
- Add custom annotations for context
Contact Points and Notification Policies (called notification channels in Grafana's legacy alerting): Configure destinations for alerts
- Email, Slack, PagerDuty, webhook, etc.
- Support for custom templates
Alerting UI: Central place to view and manage alerts
- Current status of all alerts
- History of alert state changes
- Silence and acknowledgment functions

Alert Best Practices

Focus on symptoms, not causes: Alert on user-impacting issues
Reduce noise: Only alert on actionable conditions
Set appropriate thresholds: Use historical data to establish baselines
Add context: Include relevant information to help troubleshoot
Implement severity levels: Different notification channels for different severities
Group related alerts: Avoid alert storms during systemic issues
Test your alerts: Regularly verify that alerts work as expected

Integration with Logging

Combining metrics and logs provides a more complete observability solution than either signal alone.

Bringing It All Together

flowchart TD
  subgraph "Metrics Collection"
    A[Applications] --> PE[Prometheus Exporters]
    PE --> P[Prometheus]
  end
  subgraph "Log Collection"
    A --> LS[Log Shippers]
    LS --> L[Loki/Elasticsearch]
  end
  subgraph "Tracing"
    A --> T[Jaeger/Zipkin]
  end
  subgraph "Visualization & Analysis"
    P --> G[Grafana]
    L --> G
    T --> G
  end
  P --> AM[Alertmanager]
  AM --> N[Notifications]
  G --> GN[Grafana Notifications]

Correlating Metrics and Logs

Use these techniques to correlate metrics and logs:

Common labels: Use the same labels in metrics and logs
Trace IDs: Include trace IDs in both metrics and logs
Timestamps: Use precise timestamps for correlation
Service names: Consistent service naming across systems

Grafana Dashboard with Metrics and Logs

Create dashboards that show metrics with relevant logs:

Data links: Link from metrics to related logs
Split view: Show metrics and logs side by side
Annotations: Mark significant log events on metric graphs
Variables: Use the same variables to filter both metrics and logs


# Example of adding a data link in Grafana from a metrics panel to logs
# 1. Create a variable for the instance/service
# 2. Add a data link to the panel with:
#    URL: /explore?orgId=1&left={"datasource":"Loki","queries":[{"expr":"{service=\"$service\"}"}]}
#    Link text: View logs for $service
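
Because the data link URL embeds JSON, it is easy to get the escaping wrong when typing it by hand. A small sketch of generating the same link with proper URL encoding follows. The buildExploreLink helper and the service name are hypothetical, and the shape of the left parameter is taken from the example above; it may differ between Grafana versions.

```javascript
// Build an Explore link that opens the logs for one service.
// Assumption: the "left" query parameter carries a JSON description of the query.
function buildExploreLink(service, datasource = 'Loki', orgId = 1) {
  const left = {
    datasource,
    queries: [{ expr: `{service="${service}"}` }],
  };
  const params = new URLSearchParams({
    orgId: String(orgId),
    left: JSON.stringify(left), // URLSearchParams percent-encodes the JSON for us
  });
  return `/explore?${params.toString()}`;
}

const exploreLink = buildExploreLink('checkout');
console.log(exploreLink);
```

Inside Grafana itself you would keep the $service variable in the URL instead of a fixed name; the point here is only that quotes and braces must be encoded.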

Scaling the Monitoring Stack

Prometheus Scaling Challenges

Prometheus has some limitations when scaling:

Single-node architecture (not horizontally scalable)
Local storage (limited by disk space)
In-memory processing (limited by RAM)

Scaling Strategies

Address scaling challenges with these approaches:

Functional sharding: Multiple Prometheus servers monitoring different parts of the infrastructure
Hierarchical federation: Higher-level Prometheus servers scraping from lower-level ones
Remote storage: Writing data to long-term storage systems
Thanos/Cortex: Distributed Prometheus systems

flowchart TD
  subgraph "Data Center 1"
    P1[Prometheus] --> A1[Apps]
    P1 --> I1[Infrastructure]
  end
  subgraph "Data Center 2"
    P2[Prometheus] --> A2[Apps]
    P2 --> I2[Infrastructure]
  end
  subgraph "Global View"
    FP[Federation Prometheus]
  end
  P1 --> FP
  P2 --> FP
  FP --> G[Grafana]
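
In a federated setup like this, the global Prometheus scrapes each lower-level server's /federate endpoint and selects which series to pull with one or more match[] parameters. The sketch below only builds such a URL so the shape is visible; buildFederateUrl, the host name, and the selectors are assumptions for the example.

```javascript
// Build the URL a higher-level Prometheus would scrape on a lower-level one.
function buildFederateUrl(baseUrl, selectors) {
  const url = new URL('/federate', baseUrl);
  // Each selector becomes its own match[] parameter.
  for (const selector of selectors) url.searchParams.append('match[]', selector);
  return url.toString();
}

// Hypothetical host and selectors: pull node metrics plus pre-aggregated recording rules.
const federateUrl = buildFederateUrl('http://prometheus-dc1:9090', [
  '{job="node"}',
  '{__name__=~"job:.*"}',
]);

console.log(federateUrl);
```

Federating only aggregated or carefully selected series keeps the global server small; pulling everything from every data center recreates the scaling problem one level up.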

Advanced Solutions

For large-scale environments, consider these solutions:

Thanos: Set of components that create a highly available metric system with unlimited storage capacity
- Global query view
- Long-term storage in object storage
- High availability
Cortex: Horizontally scalable, highly available, multi-tenant Prometheus-as-a-Service
- Horizontally scalable
- Multi-tenancy support
- Long-term storage
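Grafana Mimir: Grafana Labs' open-source, horizontally scalable long-term store for Prometheus metrics, which started as a fork of Cortex
- Massive scale (Grafana Labs reports tests at around a billion active series)
- High cardinality support
- Simplified operation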

Practical Exercise

Let's put these concepts into practice:

Set up the Prometheus-Grafana stack:
- Use the Docker Compose configuration provided earlier
- Configure basic alert rules for system metrics
- Configure Alertmanager for email notifications
Instrument an application:
- Choose a web application in your preferred language (Node.js, Python, Java)
- Add instrumentation using the appropriate client library
- Expose metrics on a /metrics endpoint
- Configure Prometheus to scrape the application
Create comprehensive dashboards:
- Infrastructure overview dashboard
- Application performance dashboard
- Business metrics dashboard
- Add variables for dynamic filtering
- Set up data links between metrics and logs
Implement alerting:
- Create alert rules for critical conditions
- Configure different notification channels based on severity
- Test alerts to ensure they work correctly

Best Practices for Production

High availability: Deploy multiple instances of critical components
Backup strategies: Regularly back up Prometheus and Grafana data
Security considerations:
- Enable authentication and authorization
- Use TLS for all communications
- Apply least privilege principles for users and services
Resource management:
- Set appropriate retention periods based on needs
- Monitor the monitoring system itself
- Right-size memory and CPU allocations
Continuous improvement:
- Regularly review and refine dashboards
- Adjust alert thresholds based on experience
- Collect feedback from stakeholders
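
When choosing a retention period, a rough disk estimate helps. The Prometheus documentation suggests multiplying retention time, ingested samples per second, and bytes per sample, with roughly 1 to 2 bytes per sample after compression. The sketch below applies that rule of thumb; the function name and the input numbers are made up, and real usage depends on churn and label cardinality.

```javascript
// Rule-of-thumb disk estimate: retention seconds * samples per second * bytes per sample.
function estimateDiskBytes({ activeSeries, scrapeIntervalSeconds, retentionDays, bytesPerSample = 2 }) {
  const samplesPerSecond = activeSeries / scrapeIntervalSeconds;
  const retentionSeconds = retentionDays * 24 * 60 * 60;
  return retentionSeconds * samplesPerSecond * bytesPerSample;
}

// Illustrative inputs: 150,000 active series scraped every 15s, kept for 15 days.
const estimatedBytes = estimateDiskBytes({
  activeSeries: 150000,
  scrapeIntervalSeconds: 15,
  retentionDays: 15,
});

console.log(`${(estimatedBytes / 1e9).toFixed(1)} GB`); // 25.9 GB
```

Treat the result as a starting point and compare it against the actual size of the data directory once the server has been running for a while.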

Further Learning Resources

Prometheus Documentation - https://prometheus.io/docs/
Grafana Documentation - https://grafana.com/docs/
Awesome Prometheus Resources - https://github.com/roaldnefs/awesome-prometheus
Prometheus: Up & Running by Brian Brazil
Grafana Beginner's Guide - https://grafana.com/tutorials/
PromCon and GrafanaCon talks on YouTube

Summary

In this lecture, we've explored the powerful combination of Prometheus and Grafana for monitoring:

Prometheus provides a robust time-series database and query language for metrics
Grafana offers flexible visualization and dashboard creation
Together, they form a comprehensive monitoring solution for modern applications
Integration with logging systems provides complete observability
Various scaling options exist for enterprise deployments

Remember that monitoring is not a set-it-and-forget-it task. It requires continuous refinement and adaptation as your systems evolve. The goal is not just to detect issues after they occur, but to identify trends and potential problems before they impact users.

By implementing the strategies and practices we've discussed in this module, you'll be well-equipped to maintain reliable, performant systems in production.