Introduction to Prometheus and Grafana
In our previous lectures, we explored application monitoring strategies and centralized logging. Now, we'll focus on implementing a complete monitoring solution using two powerful, open-source tools: Prometheus and Grafana.
This combination has become the de facto standard for cloud-native monitoring, especially in Kubernetes environments. If metrics and logs are the foundation of observability, the Prometheus-Grafana stack provides the tools to collect, store, analyze, and visualize this data effectively.
Think of Prometheus and Grafana as complementary parts of a complete monitoring solution:
- Prometheus is like your home's electrical meter, continuously measuring and recording data, and detecting when values exceed thresholds.
- Grafana is like your smart home dashboard, visualizing that data in meaningful ways, and pulling in data from multiple sources to give you a complete picture.
Understanding Prometheus
What is Prometheus?
Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud, now a graduated project of the Cloud Native Computing Foundation (CNCF).
Key features of Prometheus include:
- Time-series database for storing metric data
- Flexible query language (PromQL) for data analysis
- Pull-based model for collecting metrics
- Service discovery for dynamic environments
- Built-in alerting based on metric thresholds
- Multi-dimensional data model with key-value pairs (labels)
Prometheus Architecture
Let's examine the core components of the Prometheus architecture:
How Prometheus Works
- Service Discovery: Prometheus discovers targets to monitor (static configuration, Kubernetes API, Consul, etc.)
- Data Collection (Scraping): Prometheus pulls metrics from HTTP endpoints exposed by applications and exporters
- Storage: Metrics are stored in a local time-series database (TSDB) on disk
- Querying: PromQL allows powerful querying of collected metrics
- Alerting: Prometheus evaluates alert rules and sends firing alerts to Alertmanager
- Alertmanager: Handles alert deduplication, grouping, and routing to notification channels
A real-world analogy for Prometheus is a health monitoring system in a hospital:
- Nurses (scrapers) collect vital signs (metrics) from patients (targets) at regular intervals
- Patient records (TSDB) store the history of these measurements
- Doctors (PromQL) analyze the data to diagnose issues
- Monitors (alerting) sound alarms when readings exceed safe thresholds
- The nursing station (Alertmanager) ensures the right medical personnel are notified
Prometheus Data Model and Metrics Types
The Time-Series Data Model
Prometheus stores all data as time series: streams of timestamped values belonging to the same metric and set of labeled dimensions.
Every time series is uniquely identified by:
- Metric name: Describes what is being measured (e.g., http_requests_total)
- Labels: Key-value pairs providing dimensions (e.g., method="GET", status="200", path="/api/users")
This results in a time series identifier like:
http_requests_total{method="GET", status="200", path="/api/users"}
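To make the identity rule concrete, here is a minimal sketch in plain Node.js. The helper name seriesKey is hypothetical and not part of any Prometheus library; it only illustrates that the metric name plus the full label set identifies a series, so changing a single label value yields a different series.

```javascript
// Build a canonical identifier for a time series from its metric name and labels.
// Sorting the label names makes the identifier independent of insertion order.
function seriesKey(name, labels = {}) {
  const pairs = Object.keys(labels)
    .sort()
    .map((key) => `${key}="${labels[key]}"`);
  return pairs.length > 0 ? `${name}{${pairs.join(',')}}` : name;
}

const a = seriesKey('http_requests_total', { method: 'GET', status: '200', path: '/api/users' });
const b = seriesKey('http_requests_total', { status: '200', path: '/api/users', method: 'GET' });
const c = seriesKey('http_requests_total', { method: 'POST', status: '200', path: '/api/users' });

console.log(a);       // http_requests_total{method="GET",path="/api/users",status="200"}
console.log(a === b); // true: same labels in a different order, same series
console.log(a === c); // false: one label value differs, so this is a separate series
```

This is also why labels with many possible values (user IDs, full URLs) are risky: every distinct value creates a new series.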
Metric Types
Prometheus supports four core metric types:
- Counter: A cumulative metric that represents a single monotonically increasing counter
  - Example: Number of requests, errors, or completed tasks
  - Only increases or resets to zero (e.g., on restart)
  - Typically used with rate() function to calculate rate of change
- Gauge: A metric that represents a single numerical value that can go up and down
  - Example: Memory usage, CPU utilization, active connections
  - Snapshot of current state
  - Can be used directly or with functions like min(), max(), avg()
- Histogram: Samples observations and counts them in configurable buckets
  - Example: Request durations, response sizes
  - Automatically generates multiple time series:
    - Count of observations in each bucket (_bucket)
    - Sum of all observed values (_sum)
    - Count of events observed (_count)
  - Used with histogram_quantile() function to calculate quantiles
- Summary: Similar to histogram, but calculates configurable quantiles over a sliding time window
  - Example: Request duration percentiles
  - Pre-calculates quantiles, unlike histograms which calculate them on query
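The histogram type is the least intuitive of the four, so a small sketch may help. It assumes bucket bounds of 0.25, 0.5 and 1 second (chosen only for this example) and shows two things: buckets are cumulative, and a quantile is estimated by linear interpolation inside the bucket that contains the target rank. The function names are hypothetical, and the code is a simplification of what histogram_quantile() does, not its actual implementation.

```javascript
// Upper bounds ("le" labels) of the buckets; +Inf is always the last bucket.
const bounds = [0.25, 0.5, 1, Infinity];

// Count observations cumulatively: each bucket counts values <= its bound.
function bucketCounts(observations) {
  return bounds.map((le) => observations.filter((v) => v <= le).length);
}

// Estimate quantile q by linear interpolation inside the bucket that
// contains the target rank.
function histogramQuantile(q, counts) {
  const total = counts[counts.length - 1];
  if (total === 0) return NaN;
  const rank = q * total;
  const i = counts.findIndex((c) => c >= rank);
  // A quantile that falls in the +Inf bucket is reported as the highest finite bound.
  if (bounds[i] === Infinity) return bounds[i - 1];
  const lower = i === 0 ? 0 : bounds[i - 1];
  const below = i === 0 ? 0 : counts[i - 1];
  return lower + (bounds[i] - lower) * ((rank - below) / (counts[i] - below));
}

const counts = bucketCounts([0.1, 0.3, 0.3, 0.4, 0.7, 0.8, 0.9, 2]);
console.log(counts);                          // [ 1, 4, 7, 8 ]
console.log(histogramQuantile(0.5, counts));  // 0.5
console.log(histogramQuantile(0.99, counts)); // 1 (falls in the +Inf bucket)
```

The last line shows why bucket bounds matter: a quantile that lands above the highest finite bound cannot be estimated more precisely than that bound.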
Examples in Code
// Node.js example with prom-client and Express
const express = require('express');
const prometheus = require('prom-client');

const app = express();

// Create a Registry to register metrics
const register = new prometheus.Registry();

// Counter example
const httpRequestsTotal = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'status', 'path'],
  registers: [register]
});

// Gauge example
const activeSessionsGauge = new prometheus.Gauge({
  name: 'active_sessions',
  help: 'Current number of active user sessions',
  registers: [register]
});

// Histogram example
const httpRequestDurationSeconds = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path'],
  buckets: [0.1, 0.3, 0.5, 1, 3, 5, 10], // in seconds
  registers: [register]
});

// Summary example
const httpRequestDurationSummary = new prometheus.Summary({
  name: 'http_request_duration_summary',
  help: 'HTTP request duration summary',
  labelNames: ['method', 'path'],
  percentiles: [0.5, 0.9, 0.95, 0.99],
  registers: [register]
});

// Placeholder: replace with your own function to get the session count
function getActiveSessions() {
  return 0;
}

// Usage in Express middleware
app.use((req, res, next) => {
  // Update active sessions gauge (example)
  activeSessionsGauge.set(getActiveSessions());

  // Measure request duration with histogram
  const end = httpRequestDurationSeconds.startTimer({
    method: req.method,
    path: req.path
  });
  // Measure with summary too
  const endSummary = httpRequestDurationSummary.startTimer({
    method: req.method,
    path: req.path
  });

  res.on('finish', () => {
    // Increment the counter here: the final status code is only known
    // once the response has been sent
    httpRequestsTotal.inc({
      method: req.method,
      path: req.path,
      status: res.statusCode
    });
    // End timers when the response is sent
    end();
    endSummary();
  });
  next();
});

// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Port 3000 matches the web-app:3000 scrape target used in prometheus.yml
app.listen(3000);
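The /metrics endpoint returns plain text in the Prometheus exposition format, one sample per line. As a rough illustration of what a scraper reads, the following sketch parses such lines in plain Node.js. The parseSample helper and the sample body are made up for this example, and the parser is deliberately simplified: it ignores timestamps, escaped quotes inside label values, and the # HELP / # TYPE metadata.

```javascript
// Parse one sample line of the text exposition format into { name, labels, value }.
// Returns null when the line does not look like a sample.
function parseSample(line) {
  const match = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$/.exec(line.trim());
  if (!match) return null;
  const [, name, labelText = '', rawValue] = match;
  const labels = {};
  for (const [, key, value] of labelText.matchAll(/([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"/g)) {
    labels[key] = value;
  }
  return { name, labels, value: Number(rawValue) };
}

// A hand-written response body, shaped like what the endpoint above would serve.
const body = [
  '# HELP http_requests_total Total number of HTTP requests',
  '# TYPE http_requests_total counter',
  'http_requests_total{method="GET",path="/api/users",status="200"} 42',
  'active_sessions 7',
].join('\n');

const samples = body
  .split('\n')
  .filter((line) => line !== '' && !line.startsWith('#'))
  .map(parseSample);

console.log(samples);
```

Each parsed sample corresponds to one time series; Prometheus attaches the scrape timestamp and target labels such as job and instance when it stores the value.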
Setting Up Prometheus
Installation with Docker
The easiest way to get started with Prometheus is using Docker:
# Docker run command
docker run -d \
  -p 9090:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  --name prometheus \
  prom/prometheus
Basic Prometheus Configuration
Here's a simple prometheus.yml configuration file:
global:
  scrape_interval: 15s # How often to scrape targets
  evaluation_interval: 15s # How often to evaluate rules

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Rule files to load
rule_files:
  - "rules/*.yml"

# Scrape configuration
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter for host metrics
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  # Application metrics
  - job_name: 'web-app'
    static_configs:
      - targets: ['web-app:3000']
Service Discovery
For dynamic environments, static configuration isn't sufficient. Prometheus supports several service discovery mechanisms:
# Kubernetes service discovery example
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with prometheus.io/scrape=true annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use the port specified in prometheus.io/port annotation if present
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Copy the pod's app label onto the scraped series
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: replace
        target_label: app
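Relabeling is easier to follow with a worked example. The sketch below imitates a relabel rule with action replace in plain Node.js: the values of the source labels are joined with ";", the regex must match the whole joined string, and the replacement is written to the target label. The relabelReplace helper is hypothetical, and the rule shown is the common pattern for honoring a prometheus.io/port annotation by rewriting __address__; treat it as an approximation of Prometheus behavior rather than a reference.

```javascript
// Apply a relabel rule with action "replace" to a target's label set.
function relabelReplace(labels, rule) {
  const joined = rule.sourceLabels.map((name) => labels[name] ?? '').join(';');
  const anchored = new RegExp(`^(?:${rule.regex})$`);
  if (!anchored.test(joined)) return labels; // no match: target left unchanged
  return { ...labels, [rule.targetLabel]: joined.replace(anchored, rule.replacement) };
}

const rule = {
  sourceLabels: ['__address__', '__meta_kubernetes_pod_annotation_prometheus_io_port'],
  regex: '([^:]+)(?::\\d+)?;(\\d+)',
  replacement: '$1:$2',
  targetLabel: '__address__',
};

// Pod with a port annotation: the scrape address is rewritten to that port.
const annotated = relabelReplace(
  { __address__: '10.0.0.5:8080', __meta_kubernetes_pod_annotation_prometheus_io_port: '9102' },
  rule
);
console.log(annotated.__address__); // 10.0.0.5:9102

// Pod without the annotation: the regex does not match, so nothing changes.
const unannotated = relabelReplace({ __address__: '10.0.0.5:8080' }, rule);
console.log(unannotated.__address__); // 10.0.0.5:8080
```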
Alert Rules
Alert rules define conditions that should trigger alerts:
# rules/alerts.yml
groups:
  - name: example
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

      - alert: HighCPULoad
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load on {{ $labels.instance }}"
          description: "CPU load is above 80% for more than 10 minutes on {{ $labels.instance }}"

      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 90% for more than 5 minutes on {{ $labels.instance }}"

      - alert: HighRequestLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High request latency"
          description: "95th percentile request latency is above 1 second for more than 5 minutes"
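The for clause in these rules means an alert does not fire the moment its expression becomes true. It first goes to a pending state and only fires once the condition has held for the whole duration. The following sketch models that state machine in plain Node.js; nextAlertState and the 60-second evaluation interval are assumptions made for the example, not Prometheus internals.

```javascript
// Track the state of one alert across rule evaluations.
// conditionTrue: result of the alert expression at this evaluation
// now, forSeconds: evaluation time and the rule's "for" duration, in seconds
function nextAlertState(previous, conditionTrue, now, forSeconds) {
  if (!conditionTrue) return { state: 'inactive', activeSince: null };
  const activeSince = previous.activeSince ?? now;
  const state = now - activeSince >= forSeconds ? 'firing' : 'pending';
  return { state, activeSince };
}

// Evaluate every 60s with "for: 5m" (300s).
// The target is down from t=60 until it recovers at t=480.
let alert = { state: 'inactive', activeSince: null };
const history = [];
for (let t = 0; t <= 480; t += 60) {
  const down = t >= 60 && t < 480;
  alert = nextAlertState(alert, down, t, 300);
  history.push(`${t}s:${alert.state}`);
}
console.log(history.join(' '));
// 0s:inactive 60s:pending 120s:pending 180s:pending 240s:pending
// 300s:pending 360s:firing 420s:firing 480s:inactive
```

A single evaluation where the condition is false resets the timer, which is why short blips never reach Alertmanager.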
PromQL: Prometheus Query Language
PromQL is a powerful functional query language that allows you to select and aggregate time series data.
Basic Queries
- Instant vector selector: Selects time series at a single point in time
- Range vector selector: Selects time series over a range of time
- Offset modifier: Selects data from some time ago
# Simple metric selection
http_requests_total
# With label matcher
http_requests_total{status="200", method="GET"}
# Negative matching
http_requests_total{status!="500"}
# Regex matching
http_requests_total{path=~"/api/.*"}
# Range vector (last 5 minutes)
http_requests_total[5m]
# Offset (5 minutes ago)
http_requests_total offset 5m
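The four label matchers (=, !=, =~, !~) can be pictured as filters over the label sets of all stored series. Here is a minimal sketch in plain Node.js with hypothetical helper names; one detail worth noticing is that PromQL regex matchers are anchored, so the pattern has to match the entire label value.

```javascript
// Each matcher is { label, op, value }, mirroring PromQL's =, !=, =~ and !~.
function matches(labels, matcher) {
  const actual = labels[matcher.label] ?? ''; // a missing label behaves like an empty value
  switch (matcher.op) {
    case '=': return actual === matcher.value;
    case '!=': return actual !== matcher.value;
    case '=~': return new RegExp(`^(?:${matcher.value})$`).test(actual);
    case '!~': return !new RegExp(`^(?:${matcher.value})$`).test(actual);
    default: throw new Error(`unknown operator: ${matcher.op}`);
  }
}

// Keep only the series whose labels satisfy every matcher.
function select(series, matchers) {
  return series.filter((labels) => matchers.every((m) => matches(labels, m)));
}

const series = [
  { method: 'GET', status: '200', path: '/api/users' },
  { method: 'GET', status: '500', path: '/api/orders' },
  { method: 'POST', status: '201', path: '/login' },
];

// Equivalent in spirit to: http_requests_total{path=~"/api/.*", status!="500"}
const apiOk = select(series, [
  { label: 'path', op: '=~', value: '/api/.*' },
  { label: 'status', op: '!=', value: '500' },
]);
console.log(apiOk); // only the GET /api/users series

// Anchoring: "/api" alone does not match "/api/users".
console.log(select(series, [{ label: 'path', op: '=~', value: '/api' }]).length); // 0
```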
Functions and Operators
PromQL includes numerous functions for time series analysis:
# Rate of increase (for counters)
rate(http_requests_total[5m])
# Increase over time period
increase(http_requests_total[1h])
# Aggregation functions
sum(rate(http_requests_total[5m])) by (status)
avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)
max(node_memory_MemFree_bytes) by (instance)
min(node_filesystem_free_bytes) by (mountpoint)
# Percentiles from histograms
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Arithmetic operators
(node_memory_MemTotal_bytes - node_memory_MemFree_bytes) / node_memory_MemTotal_bytes * 100
# Comparison operators for thresholds
rate(http_requests_total{status=~"5.."}[5m]) > 1
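Since rate() appears in almost every query in this lecture, it helps to see roughly what it computes. The sketch below, in plain Node.js, sums the increases between consecutive samples in the window, treats any drop in value as a counter reset, and divides by the elapsed time. The simpleRate name is hypothetical, and the real function also extrapolates to the window boundaries, which this simplified version leaves out.

```javascript
// samples: [{ t: seconds, v: counterValue }, ...] inside the range window, oldest first.
// Returns the per-second rate of increase.
function simpleRate(samples) {
  if (samples.length < 2) return NaN; // at least two samples are needed
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v;
    // After a reset the counter restarts from zero, so the new value is the increase.
    increase += delta >= 0 ? delta : samples[i].v;
  }
  const elapsed = samples[samples.length - 1].t - samples[0].t;
  return increase / elapsed;
}

const samples = [
  { t: 0, v: 100 },
  { t: 15, v: 130 },
  { t: 30, v: 160 },
  { t: 45, v: 30 }, // process restarted, counter reset
  { t: 60, v: 60 },
];
console.log(simpleRate(samples)); // 2 (120 requests over 60 seconds)
```

The reset handling is the reason counters should be queried through rate() or increase() rather than by subtracting raw values.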
Common Query Patterns
Here are some useful patterns for monitoring:
# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
# Request rate by path
sum(rate(http_requests_total[5m])) by (path)
# 95th percentile response time
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# CPU usage per instance
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Disk space usage percentage
(1 - node_filesystem_free_bytes / node_filesystem_size_bytes) * 100
# Container CPU usage in Kubernetes
# (clusters older than Kubernetes 1.16 use the labels container_name and pod_name)
sum(rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m])) by (pod)
# Container memory usage in Kubernetes
sum(container_memory_usage_bytes{container!="POD",container!=""}) by (pod)
Understanding Grafana
What is Grafana?
Grafana is an open-source platform for monitoring and observability that enables you to query, visualize, alert on, and understand your metrics no matter where they are stored.
Key features of Grafana include:
- Support for multiple data sources (Prometheus, Elasticsearch, InfluxDB, etc.)
- Rich visualization options (graphs, tables, heatmaps, etc.)
- Interactive dashboards with templating and variables
- Alerting system integrated with various notification channels
- User authentication and authorization with fine-grained permissions
- Extensibility through plugins
Grafana Architecture
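Grafana follows a simple client-server architecture: a backend server queries the configured data sources and serves the dashboards, which are rendered in the browser.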
Data Sources
Grafana can connect to various data sources, including:
- Time Series Databases: Prometheus, InfluxDB, Graphite, OpenTSDB
- Logs: Elasticsearch, Loki, CloudWatch Logs
- SQL Databases: MySQL, PostgreSQL, Microsoft SQL Server
- Cloud Services: AWS CloudWatch, Google Cloud Monitoring, Azure Monitor
- Distributed Tracing: Jaeger, Zipkin, Tempo
This multi-source capability is one of Grafana's greatest strengths, allowing you to create unified dashboards that combine data from different systems.
Setting Up Grafana
Installation with Docker
Similar to Prometheus, Grafana can be easily deployed with Docker:
# Docker run command
docker run -d \
  -p 3000:3000 \
  --name grafana \
  -v grafana-storage:/var/lib/grafana \
  grafana/grafana
Docker Compose for Prometheus and Grafana
Here's a complete Docker Compose setup for the Prometheus-Grafana stack:
version: '3'
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    ports:
      - "9090:9090"
    restart: always

  alertmanager:
    image: prom/alertmanager
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    ports:
      - "9093:9093"
    restart: always

  node-exporter:
    image: prom/node-exporter
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      # Older node_exporter releases name this flag --collector.filesystem.ignored-mount-points
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    restart: always

  grafana:
    image: grafana/grafana
    depends_on:
      - prometheus
    ports:
      - "3000:3000"
    volumes:
      - grafana-storage:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin # Change this in production!
      - GF_USERS_ALLOW_SIGN_UP=false
    restart: always

volumes:
  grafana-storage:
Initial Configuration
After installation, access Grafana at http://localhost:3000 (default credentials: admin/admin).
- Add Prometheus data source:
  - Go to Configuration > Data Sources > Add data source (in recent Grafana versions: Connections > Data sources)
  - Select Prometheus
  - Set URL to http://prometheus:9090 (in Docker setup)
  - Click "Save & Test"
- Add Loki data source (for logs):
  - Go to Configuration > Data Sources > Add data source
  - Select Loki
  - Set URL to http://loki:3100 (if you have Loki running)
  - Click "Save & Test"
Grafana Provisioning
For automated setup, Grafana supports provisioning through YAML files:
# ./grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

# ./grafana/provisioning/dashboards/dashboard.yml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards/json
Building Grafana Dashboards
Dashboard Structure
Grafana dashboards are composed of:
- Panels: Individual visualizations of data
- Rows: Logical groupings of panels
- Variables: Dynamic values used for templating
- Annotations: Event markers on graphs
- Time Range: Period of time displayed
Panel Types
Grafana offers various panel types for visualizing different kinds of data:
- Time series: For temporal data
- Stat: For single value display
- Gauge: For values within a range
- Bar chart: For comparing categorical data
- Table: For tabular data presentation
- Logs: For displaying log entries
- Heatmap: For showing data density
- Node Graph: For service dependencies
Dashboard Variables
Variables make dashboards dynamic and reusable:
- Query variables: Values from data source queries
- Custom variables: User-defined values
- Text box variables: Free-form input
- Interval variables: Time ranges
- Data source variables: Selection of data sources
# Example variable query for Prometheus
label_values(node_cpu_seconds_total, instance)
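To show how a variable turns one panel query into a reusable template, here is a simplified sketch of variable interpolation in plain Node.js. The interpolate and escapeRegex helpers are hypothetical. Grafana's real interpolation supports more formats and escaping rules, but the core idea is the same: a single value is substituted as is, and a multi-value selection becomes a regex alternation, which is why such variables are normally used with the =~ matcher.

```javascript
// Escape characters that are special in regular expressions.
function escapeRegex(value) {
  return value.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Substitute $name or ${name} in a query string.
// An array of values (a multi-value variable) becomes "(a|b|c)".
function interpolate(query, variables) {
  return query.replace(/\$\{(\w+)\}|\$(\w+)/g, (whole, braced, bare) => {
    const value = variables[braced ?? bare];
    if (value === undefined) return whole; // unknown variable: leave untouched
    return Array.isArray(value) ? `(${value.map(escapeRegex).join('|')})` : value;
  });
}

const query = 'rate(node_cpu_seconds_total{instance=~"$instance", mode="idle"}[5m])';

console.log(interpolate(query, { instance: 'node-1:9100' }));
// rate(node_cpu_seconds_total{instance=~"node-1:9100", mode="idle"}[5m])

console.log(interpolate(query, { instance: ['node-1:9100', 'node-2:9100'] }));
// rate(node_cpu_seconds_total{instance=~"(node-1:9100|node-2:9100)", mode="idle"}[5m])
```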
Example Dashboards
Let's look at some essential dashboards you should create:
Infrastructure Overview
- CPU usage by node
- Memory usage by node
- Disk usage by node and mount point
- Network traffic by node
- System load average
# CPU Usage Query
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory Usage Query
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Disk Usage Query
(1 - node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"}) * 100
# Network Traffic Query
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
# Load Average Query
node_load1
node_load5
node_load15
Application Performance Dashboard
- Request rate by endpoint
- Error rate by endpoint
- Response time percentiles
- Active sessions
- Database query performance
# Request Rate Query
sum(rate(http_requests_total[5m])) by (path)
# Error Rate Query
sum(rate(http_requests_total{status=~"5.."}[5m])) by (path) / sum(rate(http_requests_total[5m])) by (path) * 100
# Response Time Percentiles Query
histogram_quantile(0.5, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path))
histogram_quantile(0.90, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path))
# Active Sessions Query
active_sessions
# Database Query Performance
histogram_quantile(0.95, sum(rate(database_query_duration_seconds_bucket[5m])) by (le, query_type))
Business Metrics Dashboard
- User sign-ups per day
- Orders placed per hour
- Revenue by product category
- Active users
- Conversion rate
# User Sign-ups Query
increase(user_signups_total[1d])
# Orders Query
sum(rate(orders_placed_total[1h])) * 3600
# Revenue Query
sum(rate(order_value_total[1h])) by (product_category) * 3600
# Active Users Query
active_users
# Conversion Rate Query
sum(rate(checkout_completed_total[1h])) / sum(rate(checkout_started_total[1h])) * 100
Alerting with Prometheus and Grafana
Prometheus Alerting
Prometheus follows a two-step alerting process:
- Alert Rules: Define conditions in Prometheus that trigger alerts
- Alertmanager: Handles deduplication, grouping, and routing of alerts
# Example Alertmanager configuration (alertmanager.yml)
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.org:587'
  smtp_from: 'alertmanager@example.org'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'team-emails'
  routes:
    # "matchers" replaces the deprecated "match" field of older Alertmanager releases
    - matchers:
        - severity="critical"
      receiver: 'pager'
    - matchers:
        - severity="warning"
      receiver: 'team-emails'

receivers:
  - name: 'team-emails'
    email_configs:
      - to: 'team@example.org'
  - name: 'pager'
    pagerduty_configs:
      - service_key: ''
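The routing and grouping behavior of this configuration can be sketched in a few lines of plain Node.js. The route object is a trimmed-down, hand-written copy of the routing tree, and pickReceiver and groupAlerts are hypothetical helpers. The model is intentionally simplified: it ignores timing (group_wait, group_interval, repeat_interval), nested routes, and the continue option.

```javascript
// The first child route whose labels all match wins;
// otherwise the alert falls through to the root receiver.
const route = {
  receiver: 'team-emails',
  groupBy: ['alertname', 'job'],
  routes: [
    { match: { severity: 'critical' }, receiver: 'pager' },
    { match: { severity: 'warning' }, receiver: 'team-emails' },
  ],
};

function pickReceiver(alertLabels) {
  const child = route.routes.find((r) =>
    Object.entries(r.match).every(([key, value]) => alertLabels[key] === value)
  );
  return child ? child.receiver : route.receiver;
}

// Alerts with the same receiver and the same group_by label values
// are batched into one notification.
function groupAlerts(alerts) {
  const groups = new Map();
  for (const labels of alerts) {
    const key = [pickReceiver(labels), ...route.groupBy.map((l) => labels[l] ?? '')].join('|');
    groups.set(key, [...(groups.get(key) ?? []), labels]);
  }
  return groups;
}

const groups = groupAlerts([
  { alertname: 'InstanceDown', job: 'node', instance: 'a:9100', severity: 'critical' },
  { alertname: 'InstanceDown', job: 'node', instance: 'b:9100', severity: 'critical' },
  { alertname: 'HighCPULoad', job: 'node', instance: 'a:9100', severity: 'warning' },
]);
console.log([...groups.keys()]);
// [ 'pager|InstanceDown|node', 'team-emails|HighCPULoad|node' ]
```

The two InstanceDown alerts land in the same group, so the on-call engineer gets one page listing both instances instead of two separate pages.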
Grafana Alerting
Grafana also has its own alerting system that works directly with dashboards:
- Alert Rules: Define conditions based on query results
  - In legacy alerting, rules were created from dashboard panels; in current Grafana versions they are managed in the Alerting section and can be linked to a panel
  - Set evaluation intervals and periods
  - Add custom annotations for context
- Contact Points (called notification channels in legacy alerting): Configure destinations for alerts
  - Email, Slack, PagerDuty, webhook, etc.
  - Support for custom templates
- Alerting UI: Central place to view and manage alerts
  - Current status of all alerts
  - History of alert state changes
  - Silence and acknowledgment functions
Alert Best Practices
- Focus on symptoms, not causes: Alert on user-impacting issues
- Reduce noise: Only alert on actionable conditions
- Set appropriate thresholds: Use historical data to establish baselines
- Add context: Include relevant information to help troubleshoot
- Implement severity levels: Different notification channels for different severities
- Group related alerts: Avoid alert storms during systemic issues
- Test your alerts: Regularly verify that alerts work as expected
Integration with Logging
Combining metrics and logs provides a more complete observability solution.
Bringing It All Together
Correlating Metrics and Logs
Use these techniques to correlate metrics and logs:
- Common labels: Use the same labels in metrics and logs
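- Trace IDs: Include trace IDs in logs, and attach them to metrics as exemplars rather than labels (a label value per trace would create unbounded cardinality)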
- Timestamps: Use precise timestamps for correlation
- Service names: Consistent service naming across systems
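As a small illustration of these techniques, the sketch below builds a structured (JSON) log line in plain Node.js that carries the same labels as the metrics, a precise timestamp, and a trace ID. The label names, values, and the logLine helper are assumptions made for the example; the point is only that a log shipper can then index the same service and env fields that appear as labels in Prometheus.

```javascript
// Labels shared by metrics and logs. The names and values are illustrative.
const commonLabels = { service: 'web-app', env: 'production' };

// Build one JSON log line with the shared labels, an ISO 8601 timestamp,
// and the trace ID of the request being handled.
function logLine(level, message, traceId, timestamp = new Date()) {
  return JSON.stringify({
    ts: timestamp.toISOString(),
    level,
    message,
    trace_id: traceId,
    ...commonLabels,
  });
}

const line = logLine('error', 'payment failed', 'abc123', new Date('2024-01-01T12:00:00.000Z'));
console.log(line);
// {"ts":"2024-01-01T12:00:00.000Z","level":"error","message":"payment failed","trace_id":"abc123","service":"web-app","env":"production"}
```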
Grafana Dashboard with Metrics and Logs
Create dashboards that show metrics with relevant logs:
- Data links: Link from metrics to related logs
- Split view: Show metrics and logs side by side
- Annotations: Mark significant log events on metric graphs
- Variables: Use the same variables to filter both metrics and logs
# Example of adding a data link in Grafana from a metrics panel to logs
# 1. Create a variable for the instance/service
# 2. Add a data link to the panel with:
# URL: /explore?orgId=1&left={"datasource":"Loki","queries":[{"expr":"{service=\"$service\"}"}]}
# Link text: View logs for $service
Scaling the Monitoring Stack
Prometheus Scaling Challenges
Prometheus has some limitations when scaling:
- Single-node architecture (not horizontally scalable)
- Local storage (limited by disk space)
- In-memory processing (limited by RAM)
Scaling Strategies
Address scaling challenges with these approaches:
- Functional sharding: Multiple Prometheus servers monitoring different parts of the infrastructure
- Hierarchical federation: Higher-level Prometheus servers scraping from lower-level ones
- Remote storage: Writing data to long-term storage systems
- Thanos/Cortex: Distributed Prometheus systems
Advanced Solutions
For large-scale environments, consider these solutions:
- Thanos: Set of components that create a highly available metric system with unlimited storage capacity
  - Global query view
  - Long-term storage in object storage
  - High availability
- Cortex: Horizontally scalable, highly available, multi-tenant Prometheus-as-a-Service
  - Horizontally scalable
  - Multi-tenancy support
  - Long-term storage
- Grafana Mimir: Grafana Labs' open-source TSDB, which started as a fork of Cortex
  - Massive scale (Grafana Labs reports testing with up to a billion active series)
  - High cardinality support
  - Simplified operation
Practical Exercise
Let's put these concepts into practice:
- Set up the Prometheus-Grafana stack:
  - Use the Docker Compose configuration provided earlier
  - Configure basic alert rules for system metrics
  - Configure Alertmanager for email notifications
- Instrument an application:
  - Choose a web application in your preferred language (Node.js, Python, Java)
  - Add instrumentation using the appropriate client library
  - Expose metrics on a /metrics endpoint
  - Configure Prometheus to scrape the application
- Create comprehensive dashboards:
  - Infrastructure overview dashboard
  - Application performance dashboard
  - Business metrics dashboard
  - Add variables for dynamic filtering
  - Set up data links between metrics and logs
- Implement alerting:
  - Create alert rules for critical conditions
  - Configure different notification channels based on severity
  - Test alerts to ensure they work correctly
Best Practices for Production
- High availability: Deploy multiple instances of critical components
- Backup strategies: Regularly back up Prometheus and Grafana data
- Security considerations:
  - Enable authentication and authorization
  - Use TLS for all communications
  - Apply least privilege principles for users and services
- Resource management:
  - Set appropriate retention periods based on needs
  - Monitor the monitoring system itself
  - Right-size memory and CPU allocations
- Continuous improvement:
  - Regularly review and refine dashboards
  - Adjust alert thresholds based on experience
  - Collect feedback from stakeholders
Further Learning Resources
- Prometheus Documentation - https://prometheus.io/docs/
- Grafana Documentation - https://grafana.com/docs/
- Awesome Prometheus Resources - https://github.com/roaldnefs/awesome-prometheus
- Prometheus: Up & Running by Brian Brazil
- Grafana Beginner's Guide - https://grafana.com/tutorials/
- PromCon and GrafanaCon talks on YouTube
Summary
In this lecture, we've explored the powerful combination of Prometheus and Grafana for monitoring:
- Prometheus provides a robust time-series database and query language for metrics
- Grafana offers flexible visualization and dashboard creation
- Together, they form a comprehensive monitoring solution for modern applications
- Integration with logging systems provides complete observability
- Various scaling options exist for enterprise deployments
Remember that monitoring is not a set-it-and-forget-it task. It requires continuous refinement and adaptation as your systems evolve. The goal is not just to detect issues after they occur, but to identify trends and potential problems before they impact users.
By implementing the strategies and practices we've discussed in this module, you'll be well-equipped to maintain reliable, performant systems in production.