Introduction to Centralized Logging
In our previous lecture, we explored application monitoring focusing on metrics—quantitative measurements of system behavior. Now, we turn our attention to logging, which provides qualitative insights into application behavior through event-based records.
Imagine trying to diagnose a mysterious car problem. Metrics are like your dashboard gauges—they tell you if the engine is overheating or if you're low on fuel. Logs are like the detailed service history and the mechanic's notes—they tell you what happened, when, and in what sequence. Both are essential for a complete picture.
In modern distributed systems, logs from individual components must be brought together into a centralized system for effective analysis. This is what we call centralized logging.
Why Centralized Logging Matters
Centralized logging is not just a convenience—it's a necessity for modern applications. Here's why:
Challenges of Distributed Systems
- Volume: Applications can generate gigabytes or terabytes of logs daily
- Velocity: High-traffic systems produce logs at an extremely rapid rate
- Variety: Different components may produce different log formats
- Distribution: Logs are scattered across many services and servers
Benefits of Centralized Logging
- Unified View: See logs from all system components in one place
- Cross-Service Tracing: Follow requests as they travel through multiple services
- Faster Troubleshooting: Quickly search and filter logs to find relevant information
- Historical Analysis: Maintain logs beyond the lifetime of individual containers or instances
- Pattern Recognition: Identify trends and recurring issues across the system
- Automated Analysis: Apply machine learning to detect anomalies
A real-world example: A major e-commerce site implemented centralized logging and reduced its mean time to resolution (MTTR) for critical incidents by 45%. Problems that previously took hours to diagnose could be identified in minutes by correlating logs across services.
The Centralized Logging Architecture
A typical centralized logging system has several key components:
Log Collection
Collection involves capturing logs from various sources and forwarding them to the central system.
- Log Shippers/Agents: Software running on servers or containers to collect and forward logs
  - Examples: Filebeat, Fluentd, Logstash, Vector, Fluent Bit
- Direct API Integration: Applications directly sending logs to the centralized system
  - Examples: Applications using the Elasticsearch client, Logstash HTTP input
- Sidecar Pattern: In containerized environments, dedicated containers collecting logs from main containers
  - Examples: Fluent Bit as a Kubernetes sidecar
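To make the shipper idea concrete, here is a minimal sketch of the core loop such an agent performs: remember how far into a file it has read, pick up only complete new lines, and wrap each one with its origin before forwarding. The function names and the envelope fields (`message`, `source`, `host`) are assumptions for illustration, not the behavior of any of the tools listed above.

```python
import json
from pathlib import Path


def read_new_lines(path: Path, offset: int) -> tuple[list[str], int]:
    """Return complete lines appended after `offset`, plus the new offset."""
    with path.open("rb") as f:
        f.seek(offset)
        data = f.read()
    # Ship only complete lines; a partial trailing line waits for the next poll.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end


def to_envelope(line: str, source: str, host: str) -> str:
    """Wrap a raw line in a JSON envelope recording where it came from."""
    return json.dumps({"message": line, "source": source, "host": host})
```

A real agent would also persist the offset across restarts, handle file rotation, and batch and retry network sends, which is a large part of why dedicated shippers exist.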
Log Processing
Processing transforms raw logs into a structured, searchable format:
- Parsing: Converting unstructured text into structured data
- Filtering: Removing unnecessary or sensitive information
- Enrichment: Adding contextual information (e.g., geographic data for IP addresses)
- Normalization: Standardizing fields across different log sources
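The four processing steps can be sketched in a few lines of Python. The raw line format (`<timestamp> <LEVEL> <message> key=value ...`), the list of sensitive keys, and the level aliases are all assumptions made for this example; a production pipeline would do the same work with Logstash filters or a similar tool.

```python
import re

# Hypothetical raw format: "<timestamp> <LEVEL> <free text> key=value ..."
LINE_RE = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Za-z]+) (?P<rest>.*)$")
KV_RE = re.compile(r"(\w+)=(\S+)")
SENSITIVE_KEYS = {"password", "token"}                 # assumption
LEVEL_ALIASES = {"warning": "warn", "err": "error"}    # assumption


def process(raw: str, service: str) -> dict:
    match = LINE_RE.match(raw)
    if match is None:
        # Keep unparseable lines, flagged, rather than dropping them silently.
        return {"service": service, "message": raw, "parse_error": True}

    # Parsing: split the free text from the key=value pairs.
    fields = dict(KV_RE.findall(match["rest"]))
    message = KV_RE.sub("", match["rest"]).strip()

    # Filtering: drop fields that must never reach storage.
    fields = {k: v for k, v in fields.items() if k not in SENSITIVE_KEYS}

    # Normalization: one spelling per level across all sources.
    level = match["level"].lower()
    level = LEVEL_ALIASES.get(level, level)

    # Enrichment: attach context the raw line did not carry.
    return {"timestamp": match["ts"], "level": level,
            "message": message, "service": service, **fields}
```

For example, `process("2025-05-11T15:23:45Z WARNING Login failed user_id=42 password=hunter2", "auth-service")` yields a record with `level` set to `warn`, a `user_id` field, and no `password` field.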
Log Storage
Storage solutions need to handle high-volume write operations while supporting fast queries:
- Elasticsearch: Distributed search engine optimized for log data
- InfluxDB: Time-series database suitable for metrics and structured logs
- Loki: Horizontally-scalable, highly-available log aggregation system
- Cloud Solutions: AWS CloudWatch Logs, Google Cloud Logging, Azure Monitor Logs
Log Analysis
Analysis tools help extract insights from collected logs:
- Search Interfaces: Kibana, Grafana, Graylog Web Interface
- Visualization: Dashboards showing log patterns, error rates, etc.
- Alerting: Notification systems based on log patterns or anomalies
Popular Centralized Logging Stacks
The ELK Stack
The ELK Stack is one of the most popular open-source logging solutions, consisting of:
- Elasticsearch: For storing and searching logs
- Logstash: For collecting, processing, and forwarding logs
- Kibana: For visualizing and analyzing logs
- Beats: Lightweight data shippers (often included as part of the stack)
The PLG Stack
A newer alternative using:
- Promtail: Log collector designed for Loki
- Loki: Horizontally-scalable log aggregation system
- Grafana: Visualization platform (same as used for metrics)
Fluentd + Elasticsearch + Kibana
Popular in Kubernetes environments:
- Fluentd: Unified logging layer with plugins for various inputs and outputs
- Elasticsearch: For storing and searching logs
- Kibana: For visualizing and analyzing logs
Cloud-based Solutions
Managed services offer logging without the operational overhead:
- AWS: CloudWatch Logs + CloudWatch Logs Insights
- Google Cloud: Cloud Logging + Cloud Monitoring
- Azure: Azure Monitor Logs + Log Analytics
- Third-party Services: Datadog, New Relic, Splunk, Sumo Logic
The choice between these stacks depends on your specific requirements, existing infrastructure, team expertise, and budget. Many organizations use a combination of solutions.
Implementing Structured Logging
Structured logging is the practice of formatting logs as structured data (typically JSON) rather than plain text. This makes logs more machine-readable and easier to parse, search, and analyze.
Benefits of Structured Logging
- Consistent Format: Standardized structure across all services
- Improved Searchability: Easy to search by specific fields
- Better Analysis: Simplified aggregation and statistical analysis
- Reduced Processing Overhead: Less complex parsing rules needed
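A short sketch shows why field-level search and aggregation become simple once every line is JSON. The log lines and field names below are made up for illustration.

```python
import json
from collections import Counter

# Hypothetical structured log lines, one JSON object per line.
LOG_LINES = [
    '{"level": "info", "event": "payment_processed", "amount": 99.99}',
    '{"level": "error", "event": "payment_failed", "error_code": "CARD_EXPIRED"}',
    '{"level": "error", "event": "payment_failed", "error_code": "PAYMENT_DECLINED"}',
    '{"level": "error", "event": "payment_failed", "error_code": "CARD_EXPIRED"}',
]


def search(lines, **criteria):
    """Yield parsed records whose fields equal every given criterion."""
    for line in lines:
        record = json.loads(line)
        if all(record.get(key) == value for key, value in criteria.items()):
            yield record


# Search by field, then aggregate on another field. No regular expressions needed.
failures = list(search(LOG_LINES, level="error", event="payment_failed"))
failures_by_code = Counter(record["error_code"] for record in failures)
```

With plain-text logs, the same question would need a parsing rule for every message format that can mention an error code.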
Structured Logging Implementation
Node.js Example with Winston
const winston = require('winston');

// Define the logger
const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  defaultMeta: { service: 'user-service' },
  transports: [
    new winston.transports.Console(),
    new winston.transports.File({ filename: 'error.log', level: 'error' }),
    new winston.transports.File({ filename: 'combined.log' })
  ]
});

// Usage examples
logger.info('User logged in', {
  userId: '12345',
  username: 'johndoe',
  loginTime: new Date().toISOString()
});

logger.error('Payment failed', {
  userId: '12345',
  amount: 99.99,
  errorCode: 'PAYMENT_DECLINED',
  errorMessage: 'Insufficient funds'
});
Sample output:
{
  "level": "info",
  "message": "User logged in",
  "service": "user-service",
  "timestamp": "2025-05-11T15:23:45.678Z",
  "userId": "12345",
  "username": "johndoe",
  "loginTime": "2025-05-11T15:23:45.678Z"
}
Python Example with structlog
import logging
import sys

import structlog

# Configure the standard library logger that structlog wraps. Without this,
# info-level messages are dropped by the default WARNING threshold.
logging.basicConfig(format="%(message)s", stream=sys.stdout, level=logging.INFO)

# Set up structlog
structlog.configure(
    processors=[
        structlog.stdlib.add_log_level,
        structlog.stdlib.add_logger_name,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer()
    ],
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
    wrapper_class=structlog.stdlib.BoundLogger,
    cache_logger_on_first_use=True,
)

# Create logger
logger = structlog.get_logger("payment-service")

# Usage examples
logger.info("payment_processed",
    user_id="67890",
    payment_id="PMT123456",
    amount=99.99,
    currency="USD"
)

logger.error("payment_failed",
    user_id="67890",
    payment_id="PMT123457",
    amount=149.99,
    currency="USD",
    error_code="CARD_EXPIRED",
    error_message="Credit card has expired"
)
Java Example with Logback and logstash-logback-encoder
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import net.logstash.logback.argument.StructuredArguments;

// JSON output assumes logback.xml configures an appender with
// net.logstash.logback.encoder.LogstashEncoder
public class PaymentService {
    private static final Logger logger = LoggerFactory.getLogger(PaymentService.class);

    public void processPayment(String userId, String paymentId, double amount, String currency) {
        // Process payment logic

        // Log successful payment
        logger.info("Payment processed successfully",
            StructuredArguments.kv("event", "payment_processed"),
            StructuredArguments.kv("user_id", userId),
            StructuredArguments.kv("payment_id", paymentId),
            StructuredArguments.kv("amount", amount),
            StructuredArguments.kv("currency", currency)
        );
    }

    public void handlePaymentFailure(String userId, String paymentId, double amount,
                                     String currency, String errorCode, String errorMessage) {
        logger.error("Payment processing failed",
            StructuredArguments.kv("event", "payment_failed"),
            StructuredArguments.kv("user_id", userId),
            StructuredArguments.kv("payment_id", paymentId),
            StructuredArguments.kv("amount", amount),
            StructuredArguments.kv("currency", currency),
            StructuredArguments.kv("error_code", errorCode),
            StructuredArguments.kv("error_message", errorMessage)
        );
    }
}
Distributed Tracing Integration
Distributed tracing adds context to logs by tracking requests as they flow through microservices. It helps answer questions like:
- Which services did a request pass through?
- How long did each service take to process the request?
- Where did the failure occur in a chain of services?
Implementing Tracing with OpenTelemetry
OpenTelemetry is an open-source observability framework that provides standardized ways to generate, collect, and export telemetry data (traces, metrics, and logs).
// Node.js example with OpenTelemetry
// Note: these package names match older OpenTelemetry JS releases. Newer releases
// publish the same classes as @opentelemetry/sdk-trace-node and @opentelemetry/sdk-trace-base.
const { SpanStatusCode } = require('@opentelemetry/api'); // needed for span.setStatus below
const { NodeTracerProvider } = require('@opentelemetry/node');
const { SimpleSpanProcessor } = require('@opentelemetry/tracing');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');

// Create a tracer provider
const provider = new NodeTracerProvider();

// Configure span processor and exporter
const exporter = new JaegerExporter({
  serviceName: 'user-service',
});
provider.addSpanProcessor(new SimpleSpanProcessor(exporter));

// Register the provider
provider.register();

// Register instrumentations (before requiring express, so it can be patched)
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
  ],
});

// Your Express app
const express = require('express');
const app = express();

// Now your application is instrumented, and traces will be sent to Jaeger
app.get('/users/:id', (req, res) => {
  // Custom spans can be added for specific operations
  const tracer = provider.getTracer('user-service');
  const span = tracer.startSpan('fetch-user-data');

  // Add attributes to the span
  span.setAttribute('user.id', req.params.id);

  // Simulate database operation
  setTimeout(() => {
    // If there's an error, you can record it
    if (Math.random() > 0.8) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: 'Failed to fetch user data',
      });
      res.status(500).json({ error: 'Internal server error' });
    } else {
      res.json({ id: req.params.id, name: 'John Doe' });
    }

    // End the span
    span.end();
  }, 100);
});

app.listen(3000);
Once logs include trace IDs, you can correlate them across services to follow the path of a request, making troubleshooting much faster, especially in complex microservices architectures.
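Getting the trace ID into every log line is the step that makes this correlation possible. The sketch below shows the mechanism with the Python standard library only: a context variable holds the current ID, and a logging filter copies it onto each record. The context variable is a stand-in for a real tracing library's active span context, and names such as `trace_id_var` and `handle_request` are hypothetical.

```python
import contextvars
import json
import logging
import uuid

# Stand-in for the active trace context. A real setup would read the ID
# from the tracing library instead of managing it by hand.
trace_id_var = contextvars.ContextVar("trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every record that passes through."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "trace_id": record.trace_id,
        })


def build_logger(stream):
    logger = logging.getLogger("user-service")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, incoming_trace_id=None):
    # Reuse the caller's ID when one was propagated, otherwise start a new trace.
    token = trace_id_var.set(incoming_trace_id or uuid.uuid4().hex)
    try:
        logger.info("fetching user")
        logger.info("user fetched")
    finally:
        trace_id_var.reset(token)
```

Every line logged while a request is being handled then carries the same `trace_id`, so a single search on that field returns the request's full path across services that follow the same convention.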
Log Management Best Practices
What to Log
Deciding what to log requires balancing detail with performance and storage concerns:
- Essential Information:
  - Request metadata (ID, IP, user agent, timestamps)
  - Authentication and authorization events
  - Business-critical operations
  - Errors and exceptions with context
  - System state changes
- Context Information:
  - User/account identifiers (but not PII)
  - Correlation IDs and trace IDs
  - Relevant business context (order IDs, transaction IDs)
- Avoid Logging:
  - Personally identifiable information (PII)
  - Credentials or secrets
  - Payment card details (card numbers, CVVs)
  - Sensitive business data
  - Excessive debug information in production
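One way to enforce the "avoid logging" list is to pass every event through a redaction step before it is written. The sketch below is a minimal version; the set of sensitive keys and the card-number pattern are assumptions and would need tuning for a real application.

```python
import re

SENSITIVE_KEYS = {"password", "token", "card_number", "email"}  # assumption
CARD_RE = re.compile(r"\b\d{13,19}\b")  # rough match for card-like digit runs


def redact(event: dict) -> dict:
    """Return a copy of `event` with sensitive values masked."""
    clean = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)  # recurse into nested context
        elif isinstance(value, str):
            clean[key] = CARD_RE.sub("[REDACTED]", value)
        else:
            clean[key] = value
    return clean
```

Key-based masking catches fields that are known to be sensitive, while the pattern pass is a safety net for values that leak into free-text messages. Neither replaces the habit of not passing such values to the logger in the first place.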
Log Levels and When to Use Them
Using appropriate log levels helps filter logs based on importance:
- ERROR: System errors that require immediate attention
  - Example: Database connection failure, payment processing error
- WARN: Potentially harmful situations that don't prevent operation
  - Example: Deprecated API usage, retry attempts, configuration issues
- INFO: Normal operational messages, milestones, state changes
  - Example: Service startup, user login, order placed
- DEBUG: Detailed information for debugging
  - Example: Function entry/exit, variable values, detailed flow
- TRACE: Very detailed debugging, typically only used during development
  - Example: Internal function calls, lowest-level debugging details
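The practical effect of levels is that a single threshold setting decides which messages are emitted. The following sketch uses Python's standard `logging` module, which names the WARN level `WARNING` and has no built-in TRACE level; the helper `emit_all` and the messages are made up for the demonstration.

```python
import io
import logging


def emit_all(level_name: str) -> list[str]:
    """Log one message per level and return the lines that pass the threshold."""
    stream = io.StringIO()
    logger = logging.getLogger(f"demo.{level_name}")
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(level_name)  # the threshold, e.g. "INFO" in production

    logger.debug("cache lookup key=user:12345")
    logger.info("order placed")
    logger.warning("retrying payment gateway call (attempt 2)")
    logger.error("database connection failed")
    return stream.getvalue().splitlines()
```

Calling `emit_all("INFO")` returns the INFO, WARNING, and ERROR lines and omits the DEBUG line, while `emit_all("DEBUG")` returns all four. Reading the threshold from configuration, as the Winston example does with `LOG_LEVEL`, lets you raise verbosity temporarily without a code change.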
Log Retention and Rotation
Manage log volume with proper retention policies:
- Hot Storage: Recent logs (1-7 days) for immediate analysis
- Warm Storage: Medium-term logs (1-4 weeks) for ongoing investigations
- Cold Storage: Historical logs (months/years) for compliance and infrequent access
- Log Rotation: Automatically archive or delete old logs
- Compression: Reduce storage requirements for older logs
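A retention policy of this kind boils down to a function from log age to storage tier. The sketch below uses the upper bounds from the list above for hot and warm storage; the one-year limit for cold storage is an assumption, since real limits depend on compliance requirements.

```python
from datetime import datetime, timedelta, timezone

HOT_DAYS = 7      # upper end of the 1-7 day range
WARM_DAYS = 28    # upper end of the 1-4 week range
COLD_DAYS = 365   # assumption: keep cold data for one year


def storage_tier(written_at: datetime, now: datetime) -> str:
    """Classify a log batch by age into hot, warm, cold, or delete."""
    age = now - written_at
    if age <= timedelta(days=HOT_DAYS):
        return "hot"
    if age <= timedelta(days=WARM_DAYS):
        return "warm"
    if age <= timedelta(days=COLD_DAYS):
        return "cold"
    return "delete"
```

In practice the storage backend usually applies such a policy itself (for example through index lifecycle rules), so application code rarely implements it directly. The function is only meant to make the tiers concrete.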
Security Considerations
Protect your logs from unauthorized access and tampering:
- Encryption: Encrypt logs in transit and at rest
- Access Control: Implement strict access controls to log data
- Audit Trails: Log access to logs themselves
- Data Masking: Automatically mask sensitive information
- Tamper Protection: Ensure logs cannot be modified after writing
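One common technique behind tamper protection is a hash chain, in which each entry's hash covers the previous entry's hash. The sketch below illustrates the idea with SHA-256; the function names and record layout are assumptions. Note that it provides tamper evidence rather than prevention, and only if the latest hash is also stored somewhere the attacker cannot rewrite.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, entry: dict) -> str:
    payload = prev_hash + json.dumps(entry, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(chain: list, entry: dict) -> None:
    """Add an entry whose hash also covers the previous link's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"entry": entry, "hash": _digest(prev_hash, entry)})


def verify(chain: list) -> bool:
    """Recompute every hash; any edited or removed entry breaks the chain."""
    prev_hash = GENESIS
    for link in chain:
        if link["hash"] != _digest(prev_hash, link["entry"]):
            return False
        prev_hash = link["hash"]
    return True
```

Changing any stored entry after the fact makes `verify` return `False` from that point on, because every later hash depends on the original content.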
Real-world Implementation: ELK Stack
Let's walk through setting up an ELK stack with Docker Compose:
version: '3'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.16.3
    environment:
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ports:
      - "9200:9200"
    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data
    networks:
      - elk

  logstash:
    image: docker.elastic.co/logstash/logstash:7.16.3
    volumes:
      - ./logstash/pipeline:/usr/share/logstash/pipeline
    ports:
      - "5044:5044"
      - "5000:5000/tcp"
      - "5000:5000/udp"
      - "9600:9600"
    depends_on:
      - elasticsearch
    networks:
      - elk

  kibana:
    image: docker.elastic.co/kibana/kibana:7.16.3
    ports:
      - "5601:5601"
    environment:
      ELASTICSEARCH_URL: http://elasticsearch:9200
      ELASTICSEARCH_HOSTS: http://elasticsearch:9200
    depends_on:
      - elasticsearch
    networks:
      - elk

  filebeat:
    image: docker.elastic.co/beats/filebeat:7.16.3
    volumes:
      - ./filebeat/config/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    depends_on:
      - elasticsearch
      - logstash
    networks:
      - elk

networks:
  elk:
    driver: bridge

volumes:
  elasticsearch-data:
Logstash configuration example (pipeline/logstash.conf):
input {
  beats {
    port => 5044
  }
  tcp {
    port => 5000
    codec => json
  }
}

filter {
  if [fields][log_type] == "access_log" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
    date {
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }
    geoip {
      source => "clientip"
    }
  }

  if [fields][log_type] == "app_log" {
    json {
      source => "message"
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "%{[fields][log_type]}-%{+YYYY.MM.dd}"
  }
}
Filebeat configuration example (config/filebeat.yml):
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/nginx/access.log
    fields:
      log_type: access_log

  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log
    fields:
      log_type: app_log
    # JSON decoding is left to the Logstash json filter, which parses the
    # "message" field for app_log events. Decoding here as well would hand
    # Logstash an already-parsed event and make that filter fail.

filebeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false

setup.template.enabled: false
setup.kibana.enabled: false

output.logstash:
  hosts: ["logstash:5044"]
Advanced Log Analysis Techniques
Log Query Languages
Each logging platform has its own query language for searching and analyzing logs:
- Elasticsearch Query DSL: Powerful but complex JSON-based query language
- Kibana Query Language (KQL): Simpler, more user-friendly syntax for Elasticsearch
- LogQL: Loki's query language, inspired by PromQL
- Splunk Search Processing Language (SPL): Powerful language for searching and manipulating Splunk data
Example Kibana Query Language (KQL) queries:
# Find all logs with an error level
level:error
# Find logs for a specific user
user.id:"12345"
# Find payment errors with a specific code
level:error AND event:payment_failed AND error_code:PAYMENT_DECLINED
# Find slow requests (taking more than 500ms)
response.time > 500
# Complex query with time range
service:"user-service" AND level:error AND @timestamp > "2025-05-10T00:00:00Z"
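For comparison, the payment-error query above can also be expressed in Elasticsearch Query DSL. The sketch below builds the request body as a Python dictionary; the helper name is hypothetical, and the `term` clauses assume that `level`, `event`, and `error_code` are mapped as keyword fields in the index.

```python
import json


def payment_error_query(error_code: str, since: str) -> dict:
    """Build a Query DSL body equivalent to ANDing the conditions in KQL."""
    return {
        "query": {
            "bool": {
                # Every clause under "filter" must match, like AND in KQL.
                "filter": [
                    {"term": {"level": "error"}},
                    {"term": {"event": "payment_failed"}},
                    {"term": {"error_code": error_code}},
                    {"range": {"@timestamp": {"gt": since}}},
                ]
            }
        }
    }


body = payment_error_query("PAYMENT_DECLINED", "2025-05-10T00:00:00Z")
request_json = json.dumps(body, indent=2)
```

The DSL version is considerably longer than the one-line KQL query, which is the trade-off noted in the list above: more verbose, but usable from any HTTP client and composable in code.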
Log Visualization
Visualization helps identify patterns and anomalies that might not be obvious in raw logs:
- Time Series Charts: Show log volume trends over time
- Pie and Bar Charts: Visualize distribution of log levels, error types, etc.
- Heat Maps: Show intensity of log events across dimensions
- Data Tables: For detailed inspection of log entries
- Dashboards: Combine multiple visualizations for a comprehensive view
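Behind a time series chart sits a simple aggregation: group events into fixed time buckets and count them. The sketch below does this per minute and per level for a list of already-parsed events; the event layout (an ISO 8601 `timestamp` with an explicit offset, plus a `level`) is an assumption.

```python
from collections import Counter
from datetime import datetime


def volume_per_minute(events) -> Counter:
    """Count events per (minute, level) bucket."""
    counts = Counter()
    for event in events:
        ts = datetime.fromisoformat(event["timestamp"])
        # Truncate to the start of the minute to form the bucket key.
        bucket = ts.replace(second=0, microsecond=0).isoformat()
        counts[(bucket, event["level"])] += 1
    return counts


events = [
    {"timestamp": "2025-05-11T15:23:05+00:00", "level": "info"},
    {"timestamp": "2025-05-11T15:23:45+00:00", "level": "error"},
    {"timestamp": "2025-05-11T15:23:59+00:00", "level": "error"},
    {"timestamp": "2025-05-11T15:24:10+00:00", "level": "error"},
]
counts = volume_per_minute(events)
```

Plotting the error counts bucket by bucket gives the error-rate line that a dashboard panel would show. Log platforms compute the same thing server-side with histogram aggregations.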
Alerting on Logs
Set up alerts to be notified of important events or patterns in your logs:
- Threshold Alerts: Trigger when log counts exceed thresholds
  - Example: Alert if more than 10 payment errors occur in 5 minutes
- Pattern Matching: Alert on specific log patterns
  - Example: Alert on "database connection failed" messages
- Anomaly Detection: Alert on unusual log patterns
  - Example: Unusual spike in 404 errors compared to baseline
Elasticsearch Watcher alert example (simplified):
{
  "trigger": {
    "schedule": {
      "interval": "5m"
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": ["app_log-*"],
        "body": {
          "query": {
            "bool": {
              "must": [
                { "match": { "level": "error" } },
                { "match": { "event": "payment_failed" } },
                { "range": { "@timestamp": { "gte": "now-5m" } } }
              ]
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "gt": 10
      }
    }
  },
  "actions": {
    "send_email": {
      "email": {
        "to": "alerts@example.com",
        "subject": "Payment Error Alert",
        "body": "There have been {{ctx.payload.hits.total}} payment errors in the last 5 minutes."
      }
    }
  }
}
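The logic of a threshold alert is small enough to sketch directly. The class below keeps the timestamps of matching events in a sliding window and reports when the count exceeds the threshold, mirroring the "more than 10 payment errors in 5 minutes" rule. The class name is hypothetical, and the sketch assumes events arrive in time order.

```python
from collections import deque
from datetime import datetime, timedelta


class ThresholdAlert:
    """Fire when more than `threshold` events fall inside the time window."""

    def __init__(self, threshold: int, window: timedelta):
        self.threshold = threshold
        self.window = window
        self._times = deque()

    def record(self, when: datetime) -> bool:
        """Register one matching event; return True if the alert should fire."""
        self._times.append(when)
        cutoff = when - self.window
        # Drop events that have slid out of the window.
        while self._times and self._times[0] <= cutoff:
            self._times.popleft()
        return len(self._times) > self.threshold
```

A scheduled query, as in the Watcher example, checks the count every few minutes, whereas this streaming form evaluates the rule on every event. The rule itself is the same in both cases.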
Practical Exercise
Let's put these concepts into practice:
- Set Up ELK Stack:
  - Use the Docker Compose configuration provided above
  - Configure Logstash to process logs from different sources
  - Set up Filebeat to collect logs from your application
- Implement Structured Logging:
  - Choose a language (Node.js, Python, Java) and implement structured logging
  - Ensure logs include consistent fields (timestamp, level, service name, message)
  - Add context-specific fields based on the type of event
- Create Kibana Dashboards:
  - Set up index patterns in Kibana
  - Create visualizations for common log metrics
  - Build a dashboard combining multiple visualizations
  - Configure alerts for critical error conditions
Further Learning Resources
- Elastic Stack Documentation - https://www.elastic.co/guide/index.html
- Grafana Loki Documentation - https://grafana.com/docs/loki/latest/
- Fluentd Documentation - https://docs.fluentd.org/
- "Logging in Action" by Phil Wilkins
- "Distributed Systems Observability" by Cindy Sridharan
Summary
Centralized logging is a critical component of modern application observability. We've covered key concepts including:
- The architecture and components of centralized logging systems
- Popular logging stacks like ELK, PLG, and cloud-based solutions
- Structured logging implementation in different languages
- Integration with distributed tracing
- Best practices for log management
- Advanced log analysis techniques
Remember that effective logging is not just about collecting data—it's about making that data actionable and insightful. By implementing the strategies we've discussed, you'll be better equipped to troubleshoot issues, understand system behavior, and improve your applications.
In our next lecture, we'll explore how to combine metrics and logs in a comprehensive monitoring solution using Prometheus and Grafana.