Centralized Logging Implementation

Introduction to Centralized Logging

In our previous lecture, we explored application monitoring focusing on metrics—quantitative measurements of system behavior. Now, we turn our attention to logging, which provides qualitative insights into application behavior through event-based records.

Imagine trying to diagnose a mysterious car problem. Metrics are like your dashboard gauges—they tell you if the engine is overheating or if you're low on fuel. Logs are like the detailed service history and the mechanic's notes—they tell you what happened, when, and in what sequence. Both are essential for a complete picture.

In modern distributed systems, logs from individual components must be brought together into a centralized system for effective analysis. This is what we call centralized logging.

Why Centralized Logging Matters

Centralized logging is not just a convenience—it's a necessity for modern applications. Here's why:

Challenges of Distributed Systems

Volume: Applications can generate gigabytes or terabytes of logs daily
Velocity: High-traffic systems produce logs at an extremely rapid rate
Variety: Different components may produce different log formats
Distribution: Logs are scattered across many services and servers

Benefits of Centralized Logging

Unified View: See logs from all system components in one place
Cross-Service Tracing: Follow requests as they travel through multiple services
Faster Troubleshooting: Quickly search and filter logs to find relevant information
Historical Analysis: Maintain logs beyond the lifetime of individual containers or instances
Pattern Recognition: Identify trends and recurring issues across the system
Automated Analysis: Apply machine learning to detect anomalies

A real-world example: A major e-commerce site implemented centralized logging and reduced their mean time to resolution (MTTR) for critical incidents by 45%. What previously took hours to diagnose could now be identified in minutes by correlating logs across services.

The Centralized Logging Architecture

A typical centralized logging system has several key components:

flowchart LR
    subgraph Sources
        A[Application Logs]
        B[System Logs]
        C[Network Logs]
    end
    subgraph Collection
        D[Log Shippers/Agents]
        E[Aggregators/Buffers]
    end
    subgraph Processing
        F[Parsing]
        G[Enrichment]
        H[Transformation]
    end
    subgraph Storage
        I[Indexes]
        J[Archives]
    end
    subgraph Analysis
        K[Search]
        L[Visualization]
        M[Alerting]
    end
    Sources --> Collection
    Collection --> Processing
    Processing --> Storage
    Storage --> Analysis

Log Collection

Collection involves capturing logs from various sources and forwarding them to the central system.

Log Shippers/Agents: Software running on servers or containers to collect and forward logs
- Examples: Filebeat, Fluentd, Logstash, Vector, Fluent Bit
Direct API Integration: Applications directly sending logs to the centralized system
- Examples: Applications using the Elasticsearch client, Logstash HTTP input
Sidecar Pattern: In containerized environments, dedicated containers collecting logs from main containers
- Examples: Fluent Bit as a Kubernetes sidecar
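
To make the shipper idea concrete, here is a minimal sketch of the buffering step that agents such as Filebeat or Fluent Bit perform internally. The class name `BatchingShipper` and the `send` callback are hypothetical, chosen for illustration; a real agent also persists its read position and retries failed sends, which this sketch leaves out.

```python
from typing import Callable, List


class BatchingShipper:
    """Buffers log lines and forwards them in batches (illustrative only)."""

    def __init__(self, send: Callable[[List[str]], None], batch_size: int = 3):
        self._send = send              # hypothetical transport, e.g. an HTTP POST
        self._batch_size = batch_size
        self._buffer: List[str] = []

    def collect(self, line: str) -> None:
        """Accept one raw log line; flush once the batch is full."""
        line = line.rstrip("\n")
        if not line:
            return  # skip blank lines
        self._buffer.append(line)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered. No retry: a failed send loses the batch."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send(batch)


if __name__ == "__main__":
    shipper = BatchingShipper(send=print, batch_size=2)
    for raw in ["first\n", "second\n", "third\n"]:
        shipper.collect(raw)
    shipper.flush()  # forward the partial final batch
```

Batching trades a little latency for far fewer network round trips, which is why most agents expose both a batch size and a flush interval.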

Log Processing

Processing transforms raw logs into a structured, searchable format:

Parsing: Converting unstructured text into structured data
Filtering: Removing unnecessary or sensitive information
Enrichment: Adding contextual information (e.g., geographic data for IP addresses)
Normalization: Standardizing fields across different log sources
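
The four processing steps can be sketched as a small pipeline. This is an illustration under stated assumptions, not how Logstash or Fluentd implement them: the regular expression covers only a simplified access-log line, and `REGION_BY_PREFIX` is a hypothetical lookup table standing in for a real GeoIP database.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Simplified access-log pattern: client IP, timestamp, request line, status, size.
ACCESS_RE = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

# Hypothetical stand-in for a GeoIP database (documentation-only IP ranges).
REGION_BY_PREFIX = {"203.0.113.": "example-region-a", "198.51.100.": "example-region-b"}


def parse(line: str) -> Optional[dict]:
    """Parsing: turn an unstructured line into named fields."""
    match = ACCESS_RE.match(line)
    return match.groupdict() if match else None


def filter_sensitive(event: dict) -> dict:
    """Filtering: drop the query string, which may carry tokens or emails."""
    event["path"] = event["path"].split("?", 1)[0]
    return event


def enrich(event: dict) -> dict:
    """Enrichment: add context derived from an existing field."""
    ip = event["client_ip"]
    event["region"] = next(
        (region for prefix, region in REGION_BY_PREFIX.items() if ip.startswith(prefix)),
        "unknown",
    )
    return event


def normalize(event: dict) -> dict:
    """Normalization: consistent types and an ISO 8601 UTC timestamp."""
    ts = datetime.strptime(event.pop("ts"), "%d/%b/%Y:%H:%M:%S %z")
    event["@timestamp"] = ts.astimezone(timezone.utc).isoformat()
    event["status"] = int(event["status"])
    event["bytes"] = 0 if event["bytes"] == "-" else int(event["bytes"])
    return event


def process(line: str) -> Optional[dict]:
    event = parse(line)
    if event is None:
        return None  # a real pipeline would tag or quarantine unparseable lines
    return normalize(enrich(filter_sensitive(event)))
```

Keeping each step a separate function mirrors how pipeline tools chain filters, and makes it easy to test one step in isolation.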

Log Storage

Storage solutions need to handle high-volume write operations while supporting fast queries:

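Elasticsearch: Distributed search and analytics engine that indexes the full content of each log entry
InfluxDB: Time-series database, best suited to metrics but also usable for structured log events
Loki: Horizontally-scalable, highly-available log aggregation system that indexes only labels, not the log content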
Cloud Solutions: AWS CloudWatch Logs, Google Cloud Logging, Azure Monitor Logs

Log Analysis

Analysis tools help extract insights from collected logs:

Search Interfaces: Kibana, Grafana, Graylog Web Interface
Visualization: Dashboards showing log patterns, error rates, etc.
Alerting: Notification systems based on log patterns or anomalies

Popular Centralized Logging Stacks

The ELK Stack

The ELK Stack is one of the most popular open-source logging solutions, consisting of:

Elasticsearch: For storing and searching logs
Logstash: For collecting, processing, and forwarding logs
Kibana: For visualizing and analyzing logs
Beats: Lightweight data shippers such as Filebeat (with Beats included, the combination is usually called the Elastic Stack)

flowchart LR
    A[Application Logs] --> B[Filebeat]
    B --> C[Logstash]
    C --> D[Elasticsearch]
    D --> E[Kibana]

The PLG Stack

A newer alternative using:

Promtail: Log collector designed for Loki
Loki: Horizontally-scalable log aggregation system
Grafana: Visualization platform (same as used for metrics)

flowchart LR
    A[Application Logs] --> B[Promtail]
    B --> C[Loki]
    C --> D[Grafana]

Fluentd + Elasticsearch + Kibana

Often called the EFK stack, this combination is popular in Kubernetes environments:

Fluentd: Unified logging layer with plugins for various inputs and outputs
Elasticsearch: For storing and searching logs
Kibana: For visualizing and analyzing logs

flowchart LR
    A[Application Logs] --> B[Fluentd]
    B --> C[Elasticsearch]
    C --> D[Kibana]

Cloud-based Solutions

Managed services offer logging without the operational overhead:

AWS: CloudWatch Logs + CloudWatch Insights
Google Cloud: Cloud Logging + Cloud Monitoring
Azure: Azure Monitor Logs + Log Analytics
Third-party Services: Datadog, New Relic, Splunk, Sumo Logic

The choice between these stacks depends on your specific requirements, existing infrastructure, team expertise, and budget. Many organizations use a combination of solutions.

Implementing Structured Logging

Structured logging is the practice of formatting logs as structured data (typically JSON) rather than plain text. This makes logs more machine-readable and easier to parse, search, and analyze.

Benefits of Structured Logging

Consistent Format: Standardized structure across all services
Improved Searchability: Easy to search by specific fields
Better Analysis: Simplified aggregation and statistical analysis
Reduced Processing Overhead: Less complex parsing rules needed
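
Before reaching for a library, it can help to see how little is needed to emit structured logs. The sketch below uses only Python's standard `logging` and `json` modules. The class name `JsonFormatter` and the field names are illustrative choices, not a standard.

```python
import json
import logging
from datetime import datetime, timezone

# Attributes every LogRecord carries; anything else was passed via `extra=`.
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None))) | {
    "message",
    "asctime",
}


class JsonFormatter(logging.Formatter):
    """Renders each record as one JSON object per line."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            "message": record.getMessage(),
        }
        # Copy caller-supplied context fields (from `extra=`) into the event.
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS:
                event[key] = value
        if record.exc_info:
            event["exception"] = self.formatException(record.exc_info)
        return json.dumps(event, default=str)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service="user-service"))
    logger = logging.getLogger("user-service")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info("User logged in", extra={"user_id": "12345"})
```

Because every event is a single JSON line with the same core fields, a collector can index it without grok patterns or other custom parsing rules.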

Structured Logging Implementation

Node.js Example with Winston


const winston = require('winston');

// Define the logger
const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  defaultMeta: { service: 'user-service' },
  transports: [
    new winston.transports.Console(),
    new winston.transports.File({ filename: 'error.log', level: 'error' }),
    new winston.transports.File({ filename: 'combined.log' })
  ]
});

// Usage examples
logger.info('User logged in', { 
  userId: '12345', 
  username: 'johndoe', 
  loginTime: new Date().toISOString() 
});

logger.error('Payment failed', { 
  userId: '12345', 
  amount: 99.99, 
  errorCode: 'PAYMENT_DECLINED',
  errorMessage: 'Insufficient funds'
});

Sample output:


{
  "level": "info",
  "message": "User logged in",
  "service": "user-service",
  "timestamp": "2025-05-11T15:23:45.678Z",
  "userId": "12345",
  "username": "johndoe",
  "loginTime": "2025-05-11T15:23:45.678Z"
}

Python Example with structlog


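import logging
import sys

import structlog

# Route structlog output through the standard library. Without a configured
# handler and level, logger.info() calls below would be silently dropped.
logging.basicConfig(format="%(message)s", stream=sys.stdout, level=logging.INFO)

# Set up structlog
structlog.configure(
    processors=[
        structlog.stdlib.add_log_level,
        structlog.stdlib.add_logger_name,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer()
    ],
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
    wrapper_class=structlog.stdlib.BoundLogger,
    cache_logger_on_first_use=True,
)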

# Create logger
logger = structlog.get_logger("payment-service")

# Usage examples
logger.info("payment_processed", 
    user_id="67890", 
    payment_id="PMT123456",
    amount=99.99, 
    currency="USD"
)

logger.error("payment_failed",
    user_id="67890",
    payment_id="PMT123457",
    amount=149.99,
    currency="USD",
    error_code="CARD_EXPIRED",
    error_message="Credit card has expired"
)

Java Example with Logback and logstash-logback-encoder


import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import net.logstash.logback.argument.StructuredArguments;

public class PaymentService {
    private static final Logger logger = LoggerFactory.getLogger(PaymentService.class);
    
    public void processPayment(String userId, String paymentId, double amount, String currency) {
        // Process payment logic
        
        // Log successful payment
        logger.info("Payment processed successfully", 
            StructuredArguments.kv("event", "payment_processed"),
            StructuredArguments.kv("user_id", userId),
            StructuredArguments.kv("payment_id", paymentId),
            StructuredArguments.kv("amount", amount),
            StructuredArguments.kv("currency", currency)
        );
    }
    
    public void handlePaymentFailure(String userId, String paymentId, double amount, 
                                    String currency, String errorCode, String errorMessage) {
        logger.error("Payment processing failed", 
            StructuredArguments.kv("event", "payment_failed"),
            StructuredArguments.kv("user_id", userId),
            StructuredArguments.kv("payment_id", paymentId),
            StructuredArguments.kv("amount", amount),
            StructuredArguments.kv("currency", currency),
            StructuredArguments.kv("error_code", errorCode),
            StructuredArguments.kv("error_message", errorMessage)
        );
    }
}

Distributed Tracing Integration

Distributed tracing adds context to logs by tracking requests as they flow through microservices. It helps answer questions like:

Which services did a request pass through?
How long did each service take to process the request?
Where did the failure occur in a chain of services?

sequenceDiagram
    participant User
    participant API Gateway
    participant Auth Service
    participant Product Service
    participant Cart Service
    participant Logging System
    User->>API Gateway: GET /cart
    API Gateway->>Auth Service: Validate token
    Auth Service-->>API Gateway: Token valid
    API Gateway->>Product Service: Get product details
    Product Service-->>API Gateway: Product details
    API Gateway->>Cart Service: Get cart items
    Cart Service-->>API Gateway: Cart items
    API Gateway-->>User: Cart response
    Note over API Gateway,Logging System: All services send logs with trace ID
    API Gateway->>Logging System: Logs with trace ID
    Auth Service->>Logging System: Logs with trace ID
    Product Service->>Logging System: Logs with trace ID
    Cart Service->>Logging System: Logs with trace ID

Implementing Tracing with OpenTelemetry

OpenTelemetry is an open-source observability framework that provides standardized ways to generate, collect, and export telemetry data (traces, metrics, and logs).


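// Node.js example with OpenTelemetry
// Note: '@opentelemetry/node' and '@opentelemetry/tracing' are the package names
// used by early releases. Later releases publish them as
// '@opentelemetry/sdk-trace-node' and '@opentelemetry/sdk-trace-base', so match
// the names (and the exporter options) to the SDK version you install.
const { SpanStatusCode } = require('@opentelemetry/api'); // needed for span.setStatus below
const { NodeTracerProvider } = require('@opentelemetry/node');
const { SimpleSpanProcessor } = require('@opentelemetry/tracing');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');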

// Create a tracer provider
const provider = new NodeTracerProvider();

// Configure span processor and exporter
const exporter = new JaegerExporter({
  serviceName: 'user-service',
});
provider.addSpanProcessor(new SimpleSpanProcessor(exporter));

// Register the provider
provider.register();

// Register instrumentations
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
  ],
});

// Your Express app
const express = require('express');
const app = express();

// Now your application is instrumented, and traces will be sent to Jaeger
app.get('/users/:id', (req, res) => {
  // Custom spans can be added for specific operations
  const tracer = provider.getTracer('user-service');
  const span = tracer.startSpan('fetch-user-data');
  
  // Add attributes to the span
  span.setAttribute('user.id', req.params.id);
  
  // Simulate database operation
  setTimeout(() => {
    // If there's an error, you can record it
    if (Math.random() > 0.8) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: 'Failed to fetch user data',
      });
      res.status(500).json({ error: 'Internal server error' });
    } else {
      res.json({ id: req.params.id, name: 'John Doe' });
    }
    
    // End the span
    span.end();
  }, 100);
});

app.listen(3000);

Once logs include trace IDs, you can correlate them across services to follow the path of a request, making troubleshooting much faster, especially in complex microservices architectures.
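
The Node.js example creates spans but does not show a trace ID reaching the logs. The following sketch illustrates that step in Python using only the standard library: a context variable holds the trace ID for the current request, and a logging filter copies it onto every record. The header name `x-trace-id` is a hypothetical choice for illustration (the W3C Trace Context standard uses a `traceparent` header), and in practice an OpenTelemetry logging integration would usually do this for you.

```python
import contextvars
import json
import logging
import uuid

# Holds the trace ID of the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Copies the current trace ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True


class TraceJsonFormatter(logging.Formatter):
    """Emits a small JSON event that includes the trace ID."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })


def handle_request(headers: dict, logger: logging.Logger) -> str:
    """Reuse the caller's trace ID if one was propagated, else start a new trace."""
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex  # hypothetical header name
    token = trace_id_var.set(trace_id)
    try:
        logger.info("Fetching cart items")
        return trace_id
    finally:
        trace_id_var.reset(token)  # do not leak the ID into the next request
```

Every service that applies the same pattern writes the same `trace_id` value, so a single search on that field returns the request's full path across services.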

Log Management Best Practices

What to Log

Deciding what to log requires balancing detail with performance and storage concerns:

Essential Information:
- Request metadata (ID, IP, user agent, timestamps)
- Authentication and authorization events
- Business-critical operations
- Errors and exceptions with context
- System state changes
Context Information:
- User/account identifiers (opaque IDs rather than PII such as names or email addresses)
- Correlation IDs and trace IDs
- Relevant business context (order IDs, transaction IDs)
Avoid Logging:
- Personally identifiable information (PII)
- Credentials or secrets
- Payment information
- Sensitive business data
- Excessive debug information in production
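
One way to enforce the "avoid logging" list is to redact at the logging layer, so a careless log call cannot leak data. The sketch below is a rough illustration: the key names in `SENSITIVE_KEYS` are hypothetical, and the card-number pattern is deliberately simple, so it will also match other long digit runs.

```python
import logging
import re

# Hypothetical field names that must never reach the log output.
SENSITIVE_KEYS = {"password", "token", "card_number", "ssn"}

# Rough pattern for 13 to 16 digit card numbers, optionally separated by spaces or dashes.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")


class RedactingFilter(logging.Filter):
    """Masks card-like numbers in the message and blanks sensitive extra fields."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message first so values passed as arguments are covered too.
        record.msg = CARD_RE.sub("[REDACTED]", record.getMessage())
        record.args = ()
        for key in SENSITIVE_KEYS & set(vars(record)):
            setattr(record, key, "[REDACTED]")
        return True  # keep the record, now sanitized


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
    logger = logging.getLogger("payment-service")
    logger.addFilter(RedactingFilter())
    logger.info("Charging card %s", "4111 1111 1111 1111")
```

Redaction in the application is a safety net, not a substitute for keeping sensitive values out of log calls in the first place.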

Log Levels and When to Use Them

Using appropriate log levels helps filter logs based on importance:

ERROR: System errors that require immediate attention
- Example: Database connection failure, payment processing error
WARN: Potentially harmful situations that don't prevent operation
- Example: Deprecated API usage, retry attempts, configuration issues
INFO: Normal operational messages, milestones, state changes
- Example: Service startup, user login, order placed
DEBUG: Detailed information for debugging
- Example: Function entry/exit, variable values, detailed flow
TRACE: Very detailed debugging, typically only used during development
- Example: Internal function calls, lowest-level debugging details
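
As a small illustration of level-based filtering, the sketch below sets a logger's threshold from a `LOG_LEVEL` environment variable, the same variable name the Winston example reads. The helper name `configure_level` is hypothetical. Note that Python's standard library has no built-in TRACE level, so DEBUG is its most verbose setting.

```python
import logging
import os


def configure_level(logger: logging.Logger, default: str = "INFO") -> int:
    """Set the logger threshold from LOG_LEVEL; unknown names fall back to INFO."""
    name = os.environ.get("LOG_LEVEL", default).upper()
    level = logging.getLevelName(name)  # returns an int for registered level names
    if not isinstance(level, int):
        level = logging.INFO
    logger.setLevel(level)
    return level


if __name__ == "__main__":
    logging.basicConfig(format="%(levelname)s %(message)s")
    logger = logging.getLogger("order-service")
    configure_level(logger)
    logger.debug("Cart contents loaded")      # emitted only when LOG_LEVEL=DEBUG
    logger.info("Order placed")               # emitted at INFO and below
    logger.error("Payment processing error")  # emitted at every standard threshold
```

Reading the threshold from the environment lets you raise verbosity on a single instance during an investigation without redeploying code.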

Log Retention and Rotation

Manage log volume with proper retention policies:

Hot Storage: Recent logs (1-7 days) for immediate analysis
Warm Storage: Medium-term logs (1-4 weeks) for ongoing investigations
Cold Storage: Historical logs (months/years) for compliance and infrequent access
Log Rotation: Automatically archive or delete old logs
Compression: Reduce storage requirements for older logs
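
On a single host, rotation is often handled by the logging library itself. Here is a minimal sketch using `RotatingFileHandler` from Python's standard library. The helper name `build_rotating_logger` and the size limits are illustrative assumptions, not recommendations.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_rotating_logger(path: str, max_bytes: int = 1_000_000, backups: int = 5) -> logging.Logger:
    """Create a logger that rolls `path` over once it reaches `max_bytes`.

    At most `backups` old files are kept (path.1, path.2, ...); the oldest
    is deleted on each rollover.
    """
    handler = RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backups, encoding="utf-8"
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger = logging.getLogger(f"rotating:{path}")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate output through the root logger
    return logger


if __name__ == "__main__":
    log = build_rotating_logger("app.log", max_bytes=10_000, backups=3)
    for i in range(1000):
        log.info("processed item %d", i)
```

Local rotation only bounds disk usage on the host. Retention tiers such as hot, warm, and cold storage are configured in the central store.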

Security Considerations

Protect your logs from unauthorized access and tampering:

Encryption: Encrypt logs in transit and at rest
Access Control: Implement strict access controls to log data
Audit Trails: Log access to logs themselves
Data Masking: Automatically mask sensitive information
Tamper Protection: Ensure logs cannot be modified after writing
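
Tamper protection is often implemented as tamper evidence: each entry is hashed together with the hash of the previous entry, so any later modification breaks the chain. The sketch below illustrates the idea with hypothetical function names. It detects changes but does not prevent them, and it assumes the most recent hash is also stored somewhere an attacker cannot rewrite.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, entry: dict) -> str:
    """Hash the entry together with the previous link's hash."""
    payload = prev_hash + json.dumps(entry, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(chain: list, entry: dict) -> None:
    """Append a log entry, linking it to the entry before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"entry": entry, "hash": _digest(prev_hash, entry)})


def verify_chain(chain: list) -> bool:
    """Recompute every hash; any edited, removed, or reordered entry fails."""
    prev_hash = GENESIS
    for link in chain:
        if link["hash"] != _digest(prev_hash, link["entry"]):
            return False
        prev_hash = link["hash"]
    return True


if __name__ == "__main__":
    audit_log = []
    append_entry(audit_log, {"event": "login", "user_id": "12345"})
    append_entry(audit_log, {"event": "payment_processed", "amount": 99.99})
    print(verify_chain(audit_log))
    audit_log[0]["entry"]["user_id"] = "99999"  # simulate tampering
    print(verify_chain(audit_log))
```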

Real-world Implementation: ELK Stack

Let's walk through setting up an ELK stack with Docker Compose:


version: '3'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.16.3
    environment:
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ports:
      - "9200:9200"
    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data
    networks:
      - elk

  logstash:
    image: docker.elastic.co/logstash/logstash:7.16.3
    volumes:
      - ./logstash/pipeline:/usr/share/logstash/pipeline
    ports:
      - "5044:5044"
      - "5000:5000/tcp"
      - "5000:5000/udp"
      - "9600:9600"
    depends_on:
      - elasticsearch
    networks:
      - elk

  kibana:
    image: docker.elastic.co/kibana/kibana:7.16.3
    ports:
      - "5601:5601"
    environment:
      # Kibana 7.x reads the Elasticsearch address from ELASTICSEARCH_HOSTS
      ELASTICSEARCH_HOSTS: http://elasticsearch:9200
    depends_on:
      - elasticsearch
    networks:
      - elk

  filebeat:
    image: docker.elastic.co/beats/filebeat:7.16.3
    volumes:
      - ./filebeat/config/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    depends_on:
      - elasticsearch
      - logstash
    networks:
      - elk

networks:
  elk:
    driver: bridge

volumes:
  elasticsearch-data:

Logstash configuration example (pipeline/logstash.conf):


input {
  beats {
    port => 5044
  }
  tcp {
    port => 5000
    codec => json
  }
}

filter {
  if [fields][log_type] == "access_log" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
    date {
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }
    geoip {
      source => "clientip"
    }
  }
  
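  if [fields][log_type] == "app_log" {
    json {
      source => "message"
    }
  }

  # Events from the tcp input carry no [fields][log_type]; assign one so the
  # index name in the output section resolves.
  if ![fields][log_type] {
    mutate {
      add_field => { "[fields][log_type]" => "tcp_log" }
    }
  }
}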

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "%{[fields][log_type]}-%{+YYYY.MM.dd}"
  }
}

Filebeat configuration example (config/filebeat.yml):


filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/nginx/access.log
  fields:
    log_type: access_log

- type: log
  enabled: true
  paths:
    - /var/log/app/*.log
  fields:
    log_type: app_log
  # Lines are shipped as raw text; the Logstash json filter decodes them,
  # so no json.* decoding options are set here.

filebeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false

setup.template.enabled: false
setup.kibana.enabled: false

output.logstash:
  hosts: ["logstash:5044"]

Advanced Log Analysis Techniques

Log Query Languages

Each logging platform has its own query language for searching and analyzing logs:

Elasticsearch Query DSL: Powerful but complex JSON-based query language
Kibana Query Language (KQL): Simpler, more user-friendly syntax for Elasticsearch
LogQL: Loki's query language, inspired by PromQL
Splunk Search Processing Language: Powerful language for searching and manipulating Splunk data

Example Kibana Query Language (KQL) queries:


# Find all logs with an error level
level:error

# Find logs for a specific user
user_id:"12345"

# Find payment errors with a specific code
level:error AND event:payment_failed AND error_code:PAYMENT_DECLINED

# Find slow requests (taking more than 500ms)
response.time > 500

# Complex query with time range
service:"user-service" AND level:error AND @timestamp > "2025-05-10T00:00:00Z"
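
For comparison, here is roughly how the payment-error KQL query could be expressed in Elasticsearch Query DSL, built as a Python dictionary. The function name `build_payment_error_query` is hypothetical, and the field names (`level`, `event`, `error_code`) follow the structured logging examples earlier. Whether `match` is the right clause depends on how your index maps those fields.

```python
import json


def build_payment_error_query(error_code: str, since: str) -> dict:
    """Build a Query DSL body for failed payments with a given error code."""
    return {
        "query": {
            "bool": {
                # filter clauses restrict results without affecting relevance scores
                "filter": [
                    {"match": {"level": "error"}},
                    {"match": {"event": "payment_failed"}},
                    {"match": {"error_code": error_code}},
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        }
    }


if __name__ == "__main__":
    body = build_payment_error_query("PAYMENT_DECLINED", "now-1h")
    print(json.dumps(body, indent=2))
```

The one-line KQL expression expands to a nested JSON structure, which is the trade-off the list above describes: Query DSL is more verbose but exposes every option of the search engine.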

Log Visualization

Visualization helps identify patterns and anomalies that might not be obvious in raw logs:

Time Series Charts: Show log volume trends over time
Pie and Bar Charts: Visualize distribution of log levels, error types, etc.
Heat Maps: Show intensity of log events across dimensions
Data Tables: For detailed inspection of log entries
Dashboards: Combine multiple visualizations for a comprehensive view

Alerting on Logs

Set up alerts to be notified of important events or patterns in your logs:

Threshold Alerts: Trigger when log counts exceed thresholds
- Example: Alert if more than 10 payment errors occur in 5 minutes
Pattern Matching: Alert on specific log patterns
- Example: Alert on "database connection failed" messages
Anomaly Detection: Alert on unusual log patterns
- Example: Unusual spike in 404 errors compared to baseline


# Elasticsearch Watcher alert example (simplified)
{
  "trigger": {
    "schedule": {
      "interval": "5m"
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": ["app_log-*"],
        "body": {
          "query": {
            "bool": {
              "must": [
                { "match": { "level": "error" } },
                { "match": { "event": "payment_failed" } },
                { "range": { "@timestamp": { "gte": "now-5m" } } }
              ]
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "gt": 10
      }
    }
  },
  "actions": {
    "send_email": {
      "email": {
        "to": "alerts@example.com",
        "subject": "Payment Error Alert",
        "body": "There have been {{ctx.payload.hits.total}} payment errors in the last 5 minutes."
      }
    }
  }
}
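
The threshold logic in the watch above (more than 10 payment errors within 5 minutes) can be sketched in a few lines of Python to show what the alerting system evaluates. The class name `ThresholdAlert` is hypothetical. Unlike the watch, which runs on a schedule, this sketch evaluates a sliding window on every incoming event.

```python
from collections import deque
from datetime import datetime, timedelta


class ThresholdAlert:
    """Fires when matching events within the window exceed the threshold."""

    def __init__(self, threshold: int = 10, window: timedelta = timedelta(minutes=5)):
        self.threshold = threshold
        self.window = window
        self._timestamps = deque()  # arrival times of matching events

    def observe(self, event: dict, now: datetime) -> bool:
        """Record one log event; return True when the alert should fire."""
        if event.get("level") == "error" and event.get("event") == "payment_failed":
            self._timestamps.append(now)
        # Drop matching events that have fallen out of the window.
        while self._timestamps and now - self._timestamps[0] > self.window:
            self._timestamps.popleft()
        return len(self._timestamps) > self.threshold


if __name__ == "__main__":
    alert = ThresholdAlert()
    start = datetime(2025, 5, 11, 12, 0, 0)
    failure = {"level": "error", "event": "payment_failed"}
    for i in range(12):
        fired = alert.observe(failure, start + timedelta(seconds=10 * i))
        print(i + 1, fired)
```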

Practical Exercise

Let's put these concepts into practice:

Set Up ELK Stack:
- Use the Docker Compose configuration provided above
- Configure Logstash to process logs from different sources
- Set up Filebeat to collect logs from your application
Implement Structured Logging:
- Choose a language (Node.js, Python, Java) and implement structured logging
- Ensure logs include consistent fields (timestamp, level, service name, message)
- Add context-specific fields based on the type of event
Create Kibana Dashboards:
- Set up index patterns in Kibana
- Create visualizations for common log metrics
- Build a dashboard combining multiple visualizations
- Configure alerts for critical error conditions

Further Learning Resources

Elastic Stack Documentation - https://www.elastic.co/guide/index.html
Grafana Loki Documentation - https://grafana.com/docs/loki/latest/
Fluentd Documentation - https://docs.fluentd.org/
"Logging in Action" by Phil Wilkins
"Distributed Systems Observability" by Cindy Sridharan

Summary

Centralized logging is a critical component of modern application observability. We've covered key concepts including:

The architecture and components of centralized logging systems
Popular logging stacks like ELK, PLG, and cloud-based solutions
Structured logging implementation in different languages
Integration with distributed tracing
Best practices for log management
Advanced log analysis techniques

Remember that effective logging is not just about collecting data—it's about making that data actionable and insightful. By implementing the strategies we've discussed, you'll be better equipped to troubleshoot issues, understand system behavior, and improve your applications.

In our next lecture, we'll explore how to combine metrics and logs in a comprehensive monitoring solution using Prometheus and Grafana.