Monitoring Training - Observability & Infrastructure Monitoring

Overview

Monitoring and observability are critical components of modern infrastructure management. This training covers the tools, techniques, and best practices for implementing comprehensive monitoring solutions that provide visibility into system health, performance, and availability across cloud and on-premises environments.

Observability Fundamentals

The three pillars of observability are metrics, logs, and traces. Understanding them, along with the alerting built on top of them, is essential for building an effective monitoring strategy.

Metrics - Numerical measurements of system behavior over time
Logs - Event records with context and timestamps
Traces - Request flow tracking across distributed systems
Alerting - Notification systems for anomalies and incidents
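
To make the distinction concrete, the sketch below describes a single HTTP request as each of the three signal types. The class names, field names, metric name, and route are illustrative assumptions, not the schema of any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class MetricSample:
    """A numeric measurement at a point in time."""
    name: str
    value: float
    timestamp: float
    labels: dict = field(default_factory=dict)


@dataclass
class LogRecord:
    """A timestamped event with free-form context."""
    timestamp: float
    level: str
    message: str
    context: dict = field(default_factory=dict)


@dataclass
class Span:
    """One timed step of a request; spans sharing a trace_id form a trace."""
    trace_id: str
    name: str
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def observe_request(trace_id: str, start: float, end: float, status: int):
    """Describe one request as a metric, a log record, and a span."""
    # Hypothetical metric name and route, chosen only for illustration.
    metric = MetricSample("http_requests_total", 1.0, end, {"status": str(status)})
    log = LogRecord(end, "INFO", "request handled",
                    {"trace_id": trace_id, "status": status})
    span = Span(trace_id, "GET /checkout", start, end)
    return metric, log, span
```

The metric carries only a number and a few labels, the log carries context for one event, and the span carries timing. The shared `trace_id` is what lets the log and the span be joined later.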

Prometheus

Prometheus is an open-source monitoring system built for cloud-native environments. It scrapes metrics from targets over HTTP, stores them as time series, and exposes them through a powerful query language, PromQL.

Architecture - Time series database, pull-based collection, service discovery
PromQL - Prometheus Query Language for data analysis
Exporters - Node exporter, blackbox exporter, custom exporters
Alertmanager - Alert routing, grouping, and silencing
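
Because collection is pull-based, an exporter only has to serve its current values as text over HTTP. As a rough sketch of what a custom exporter returns, the function below renders one metric family in the Prometheus text exposition format using only the standard library. The function names and the `queue_depth` metric are hypothetical; production exporters would normally use an official client library instead.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, help_text, metric_type, samples):
    """Render one metric family as exposition text.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{escape_label_value(str(val))}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metric("queue_depth", "Jobs waiting in the queue.", "gauge",
                        [({"queue": "email"}, 7), ({"queue": "sms"}, 2)]))
```

Serving this text at a path such as `/metrics` is what makes a process scrapeable by a `scrape_configs` job.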

# prometheus.yml - Configuration example
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "alert_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node1:9100', 'node2:9100']

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
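
The `kubernetes-pods` job relies on relabeling to keep only pods annotated with `prometheus.io/scrape: "true"`. The sketch below approximates that `keep` step in Python to show the logic. The helper names are hypothetical, and Python's `re` module is only a stand-in for the RE2 syntax Prometheus actually uses.

```python
import re


def annotation_to_meta_label(annotation: str) -> str:
    """Map a pod annotation name to its service-discovery meta label.

    Characters that are not valid in label names become underscores.
    """
    sanitized = re.sub(r"[^a-zA-Z0-9_]", "_", annotation)
    return "__meta_kubernetes_pod_annotation_" + sanitized


def keep(targets, source_labels, regex, separator=";"):
    """Approximate `action: keep`.

    A target survives only if its source label values, joined by the
    separator, match the regex in full (relabel regexes are anchored).
    """
    pattern = re.compile(regex)
    kept = []
    for labels in targets:
        joined = separator.join(labels.get(name, "") for name in source_labels)
        if pattern.fullmatch(joined):
            kept.append(labels)
    return kept


if __name__ == "__main__":
    label = annotation_to_meta_label("prometheus.io/scrape")
    pods = [
        {label: "true", "pod": "api"},
        {label: "false", "pod": "batch"},
        {"pod": "db"},  # no annotation at all
    ]
    print([p["pod"] for p in keep(pods, [label], "true")])
```

Pods without the annotation produce an empty joined value, which does not match `true`, so they are dropped along with pods that set it to anything else.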

Grafana

Grafana is a widely used open-source platform for visualizing monitoring data. It supports many data sources and offers flexible dashboards.

Dashboards - Creating and organizing visualizations
Panels - Time series, gauges, tables, and custom visualizations
Data Sources - Prometheus, InfluxDB, Elasticsearch, CloudWatch
Alerting - Grafana alert rules and notification channels (called contact points in current versions)

# Grafana Dashboard JSON snippet
{
  "dashboard": {
    "title": "Infrastructure Overview",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "timeseries",
        "targets": [
          {
            "expr": "100 - (avg(irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "CPU %"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 70, "color": "yellow" },
                { "value": 90, "color": "red" }
              ]
            }
          }
        }
      }
    ]
  }
}
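
To show what the panel's expression computes, the sketch below reproduces the arithmetic on hand-made samples: a per-second idle rate from the last two samples of each core, averaged, then subtracted from 100. It also maps the result onto the threshold steps. The function names and sample values are illustrative assumptions, and counter resets are ignored for brevity.

```python
def irate(samples):
    """Per-second increase between the last two (timestamp, value) samples."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)


def cpu_usage_percent(idle_series):
    """100 minus the average idle fraction across cores, as a percentage."""
    idle_rates = [irate(samples) for samples in idle_series]
    return 100 - (sum(idle_rates) / len(idle_rates)) * 100


def threshold_color(value, steps):
    """Return the color of the highest step whose value is <= the reading."""
    color = steps[0]["color"]
    for step in steps:
        if value >= step["value"]:
            color = step["color"]
    return color


STEPS = [
    {"value": 0, "color": "green"},
    {"value": 70, "color": "yellow"},
    {"value": 90, "color": "red"},
]

if __name__ == "__main__":
    # Idle-seconds counters for two cores, sampled 16 seconds apart.
    core0 = [(0, 100.0), (16, 104.0)]   # idle 25% of the time
    core1 = [(0, 200.0), (16, 212.0)]   # idle 75% of the time
    usage = cpu_usage_percent([core0, core1])
    print(usage, threshold_color(usage, STEPS))
```

Because `avg()` in the panel has no `by` clause, every core on every instance collapses into a single line. Grouping by instance would give one series per host.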

AWS CloudWatch

CloudWatch provides monitoring and observability services for AWS resources and applications running on AWS.

Metrics - Built-in and custom metrics for AWS services
Logs - Centralized log management with CloudWatch Logs Insights queries
Alarms - Metric alarms, composite alarms, and anomaly detection
Dashboards - Custom dashboards and automatic dashboards
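
As one possible way to publish a custom metric, the sketch below builds a `MetricData` entry and passes it to the CloudWatch `put_metric_data` call through boto3. The namespace, metric name, and dimension are hypothetical, and boto3 is imported lazily so the payload builder can be tried without AWS credentials.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit="Count", dimensions=None, timestamp=None):
    """Build one MetricData entry for a custom metric."""
    return {
        "MetricName": name,
        "Dimensions": [
            {"Name": key, "Value": val}
            for key, val in sorted((dimensions or {}).items())
        ],
        "Timestamp": timestamp or datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }


def publish(namespace, data, client=None):
    """Send the entries to CloudWatch; a client may be injected for testing."""
    if client is None:
        import boto3  # third-party; needs AWS credentials and a region
        client = boto3.client("cloudwatch")
    client.put_metric_data(Namespace=namespace, MetricData=data)


if __name__ == "__main__":
    datum = build_metric_datum("QueueDepth", 7, dimensions={"Queue": "email"})
    print(datum)
    # publish("Training/Example", [datum])  # uncomment with credentials configured
```

Dimensions play the same role as labels in Prometheus. Each distinct combination of dimension values is stored as its own metric.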

Azure Monitor

Azure Monitor provides comprehensive monitoring for applications and infrastructure running on Azure and hybrid environments.

Metrics & Logs - Platform metrics and Log Analytics workspace
Application Insights - APM for application monitoring
Alerts - Metric alerts, log alerts, and action groups
Workbooks - Interactive reports and visualizations

ELK Stack

The ELK Stack (Elasticsearch, Logstash, Kibana) provides log aggregation, search, and visualization. With Beats added as lightweight shippers, it is also known as the Elastic Stack.

Elasticsearch - Distributed search and analytics engine
Logstash - Log processing and transformation pipelines
Kibana - Visualization and exploration interface
Beats - Lightweight data shippers (Filebeat, Metricbeat)

# Logstash configuration example
input {
  beats {
    port => 5044
  }
}

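filter {
  # Only parse events whose type field is "nginx"; the shipper must set it
  if [type] == "nginx" {
    # Split the combined access-log line into named fields
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
    # Use the request time from the log line as the event @timestamp
    date {
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }
    # Add location data; clientip is the legacy (non-ECS) grok field name
    geoip {
      source => "clientip"
    }
  }
}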

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}
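
To illustrate what the grok, date, and output stages do to one access-log line, here is a rough Python equivalent. The regular expression is a simplified stand-in for `COMBINEDAPACHELOG`, and the function names and sample line are made up for the example.

```python
import re
from datetime import datetime, timezone

# Simplified pattern for the combined access-log format; the real grok
# pattern accepts more variations than this one.
COMBINED = re.compile(
    r'(?P<clientip>\S+) (?P<ident>\S+) (?P<auth>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) HTTP/(?P<httpversion>[\d.]+)" '
    r'(?P<response>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(message):
    """Return the extracted fields, or None when the line does not match."""
    match = COMBINED.match(message)
    if match is None:
        return None
    event = match.groupdict()
    # dd/MMM/yyyy:HH:mm:ss Z corresponds to this strptime format.
    event["@timestamp"] = datetime.strptime(
        event["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    return event


def index_name(event):
    """Daily index name derived from the event time in UTC."""
    utc_time = event["@timestamp"].astimezone(timezone.utc)
    return utc_time.strftime("logs-%Y.%m.%d")


if __name__ == "__main__":
    line = ('203.0.113.7 - - [10/Oct/2023:23:30:00 -0700] '
            '"GET /index.html HTTP/1.1" 200 2326 "-" "curl/8.0"')
    event = parse_line(line)
    print(event["clientip"], event["response"], index_name(event))
```

The sample request is logged late on 10 October local time but lands in the index for 11 October, because the date in the index name is taken from the event timestamp in UTC.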

Distributed Tracing

Distributed tracing helps track requests across microservices to identify performance bottlenecks and failures.

Jaeger - Open-source distributed tracing platform
OpenTelemetry - Vendor-neutral observability framework
AWS X-Ray - Distributed tracing for AWS applications
Correlation - Connecting traces with logs and metrics
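
Correlation depends on every service passing the same trace identifier downstream. A minimal sketch of that propagation, assuming the W3C Trace Context `traceparent` header format (version, trace ID, parent span ID, flags), might look like this. The function names are hypothetical; in practice an OpenTelemetry SDK would handle this.

```python
import re
import secrets

# version "00", 32 hex chars of trace ID, 16 of parent span ID, 2 of flags
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace at the edge of the system."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return the header's parts, or None if it is malformed."""
    match = TRACEPARENT.match(header)
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero identifiers are defined as invalid.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {"trace_id": trace_id, "parent_id": parent_id, "flags": flags}


def child_traceparent(header):
    """Header for an outgoing call: same trace ID, fresh span ID."""
    context = parse_traceparent(header)
    if context is None:
        return new_traceparent()
    return f"00-{context['trace_id']}-{secrets.token_hex(8)}-{context['flags']}"


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(child_traceparent(incoming))
```

Writing the parsed `trace_id` into each log record, and attaching it to metrics as an exemplar where the backend supports that, is what allows a jump from a slow trace to the matching logs.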

Best Practices

Implement these best practices for effective monitoring and observability.

SLIs/SLOs - Define service level indicators and objectives
Alert Fatigue - Reduce it by keeping alerts actionable and thresholds carefully tuned
Infrastructure as Code - Version-controlled dashboards and alerts
Runbooks - Documented response procedures for alerts
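
As a worked example of turning an SLO into something an alert can act on, the sketch below computes an error budget and a burn rate from request counts. The 99% target and the request numbers are made-up inputs, and the function names are illustrative.

```python
def error_budget(slo_target, total_requests):
    """Requests allowed to fail in the window without breaching the SLO."""
    return (1 - slo_target) * total_requests


def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative if overspent)."""
    return 1 - failed_requests / error_budget(slo_target, total_requests)


def burn_rate(slo_target, error_ratio):
    """How many times faster than sustainable the budget is being spent.

    A burn rate of 1 spends exactly the whole budget over the SLO window.
    """
    return error_ratio / (1 - slo_target)


if __name__ == "__main__":
    slo = 0.99            # SLI: fraction of requests served successfully
    total = 1_000_000     # requests in the SLO window so far
    failed = 2_500
    print(round(error_budget(slo, total)))                  # allowed failures
    print(round(budget_remaining(slo, total, failed), 2))   # share of budget left
    print(round(burn_rate(slo, 0.05), 1))                   # 5% errors right now
```

Alerting on burn rate instead of raw error counts is one way to limit alert fatigue: a brief blip barely moves the budget, while a sustained burn well above 1 signals that the objective is at risk.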
