# Observability & Monitoring Skill

A comprehensive skill for implementing production-grade observability and monitoring using Prometheus, Grafana, and the cloud-native monitoring ecosystem.

## Overview

This skill provides complete guidance for building robust monitoring systems that enable teams to understand system behavior, detect issues early, and maintain high reliability. It covers the full monitoring stack from metric collection through visualization and alerting.

## What This Skill Covers

### Core Technologies

- **Prometheus**: Time-series database and monitoring system
- **Grafana**: Visualization and dashboarding platform
- **Alertmanager**: Alert routing and notification management
- **Exporters**: Metric collection agents (Node, PostgreSQL, custom)
- **PromQL**: Powerful query language for metric analysis

### Key Concepts

**The Four Pillars of Observability**
1. Metrics - Numerical measurements over time
2. Logs - Discrete event records
3. Traces - Request flow through systems
4. Events - Significant system occurrences

**Monitoring Methodologies**
- RED Method (Request rate, Error rate, Duration)
- USE Method (Utilization, Saturation, Errors)
- Golden Signals (Latency, Traffic, Errors, Saturation)
- SLIs, SLOs, and Error Budgets

### What You'll Learn

**Prometheus Setup**
- Installation and configuration
- Service discovery (static, Kubernetes, cloud providers)
- Storage and retention strategies
- Remote write/read for long-term storage
- High availability and federation

**PromQL Mastery**
- Instant and range vectors
- Aggregation operators (sum, avg, max, min, count)
- Mathematical operations and comparisons
- Rate calculations and derivatives
- Histogram quantiles for latency percentiles
- Prediction and anomaly detection

**Alerting Strategy**
- Writing effective alert rules
- Multi-window multi-burn-rate alerts for SLOs
- Alertmanager routing and receivers
- Integration with PagerDuty, Slack, email
- Alert grouping and inhibition
- Reducing alert fatigue

**Grafana Dashboards**
- Dashboard design principles
- Template variables for flexibility
- Panel types and visualizations
- RED and USE method dashboards
- SLO tracking dashboards
- Annotations for deployments and incidents

**Custom Exporters**
- Building application-specific exporters
- Metric types: Counter, Gauge, Histogram, Summary
- Best practices for metric naming and labels
- Managing cardinality
- Client libraries (Python, Go, Java, Node.js)

**Production Best Practices**
- Metric naming conventions
- Label design for low cardinality
- Recording rules for performance
- Capacity planning and forecasting
- Cost monitoring and optimization
- Security and authentication

## Quick Start

### 1. Install Prometheus

```bash
# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvfz prometheus-2.45.0.linux-amd64.tar.gz
cd prometheus-2.45.0.linux-amd64

# Create basic configuration
cat > prometheus.yml <<EOF
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
EOF

# Start Prometheus
./prometheus --config.file=prometheus.yml
```

Access Prometheus UI at http://localhost:9090

### 2. Install Node Exporter

```bash
# Download and run node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
cd node_exporter-1.6.1.linux-amd64
./node_exporter
```

Add to prometheus.yml:
```yaml
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
```

### 3. Install Grafana

```bash
# Using Docker
docker run -d -p 3000:3000 --name=grafana grafana/grafana

# Or download binary
wget https://dl.grafana.com/oss/release/grafana-10.0.0.linux-amd64.tar.gz
tar -zxvf grafana-10.0.0.linux-amd64.tar.gz
cd grafana-10.0.0
./bin/grafana-server
```

Access Grafana at http://localhost:3000 (default: admin/admin)

### 4. Configure Grafana Data Source

1. Navigate to Configuration > Data Sources
2. Add Prometheus data source
3. URL: http://localhost:9090
4. Click "Save & Test"

### 5. Create Your First Dashboard

Import pre-built Node Exporter dashboard:
1. Click "+" > Import
2. Enter dashboard ID: 1860
3. Select Prometheus data source
4. Click Import

## Common Use Cases

### Monitoring a Web Application

```yaml
# Add application to Prometheus
scrape_configs:
  - job_name: 'webapp'
    static_configs:
      - targets: ['app-1:8080', 'app-2:8080']
        labels:
          env: 'production'
          tier: 'frontend'
```

**Key Metrics to Track:**
- Request rate: `rate(http_requests_total[5m])`
- Error rate: `rate(http_requests_total{status=~"5.."}[5m])`
- Latency: `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))`
- Active connections: `http_active_connections`

### Database Monitoring

```bash
# Run PostgreSQL exporter
docker run -d \
  -p 9187:9187 \
  -e DATA_SOURCE_NAME="postgresql://user:password@localhost:5432/dbname?sslmode=disable" \
  prometheuscommunity/postgres-exporter
```

**Key Metrics:**
- Query rate: `rate(pg_stat_database_xact_commit[5m])`
- Connection pool: `pg_stat_database_numbackends`
- Replication lag: `pg_stat_replication_lag`
- Cache hit ratio: `pg_stat_database_blks_hit / (pg_stat_database_blks_hit + pg_stat_database_blks_read)`

### Kubernetes Monitoring

```yaml
# Kubernetes pod discovery
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: (\d+)
        replacement: ${1}:${2}
```

**Annotate your pods:**
```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
```

### Implementing SLOs

```yaml
# Define SLO: 99.9% availability
groups:
  - name: api_slo
    rules:
      - record: api:sli:availability
        expr: |
          sum(rate(http_requests_total{status=~"2.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

      - alert: SLOBudgetBurn
        expr: api:sli:availability < 0.999
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API availability below SLO target"
          description: "Current availability: {{ $value | humanizePercentage }}"
```

### Creating Alerts

```yaml
# High error rate alert
groups:
  - name: alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"
```

## Architecture Patterns

### Single Prometheus Instance

**Best for:**
- Small deployments (< 1000 targets)
- Single cluster/datacenter
- Simple infrastructure

**Setup:**
```
Application Servers
    ↓ (scrape)
Prometheus
    ↓ (query)
Grafana
    ↓ (alerts)
Alertmanager
```

### Prometheus with Remote Storage

**Best for:**
- Long-term metric retention
- High query load
- Multiple teams querying data

**Setup:**
```
Prometheus (local TSDB)
    ↓ (remote_write)
Long-term Storage (Thanos, Cortex, Mimir)
    ↓ (query)
Grafana
```

### Federated Prometheus

**Best for:**
- Multiple clusters/regions
- Hierarchical monitoring
- Aggregated global view

**Setup:**
```
Cluster Prometheus 1 ──┐
Cluster Prometheus 2 ──┼─→ Global Prometheus → Grafana
Cluster Prometheus 3 ──┘
```

### High Availability Setup

**Best for:**
- Production systems requiring 99.9%+ uptime
- Critical monitoring infrastructure
- Large-scale deployments

**Setup:**
```
Prometheus Instance 1 ──┐
Prometheus Instance 2 ──┼─→ Load Balancer → Grafana
(Identical config)      │
                        └─→ Thanos (deduplication)
```

## Metric Collection Strategies

### Pull-Based (Prometheus Default)

**Advantages:**
- Centralized control of scraping
- Easy to detect down targets
- Better for service discovery

**When to use:**
- Internal services you control
- Kubernetes environments
- Standard HTTP endpoints

### Push-Based (with Pushgateway)

**Advantages:**
- Works for batch jobs
- Firewall-friendly
- Short-lived processes

**When to use:**
- Batch jobs and cron tasks
- Serverless functions
- Behind NAT/firewalls

```bash
# Push metrics to Pushgateway
echo "job_last_success_time $(date +%s)" | curl --data-binary @- \
  http://pushgateway:9091/metrics/job/backup
```

### Hybrid Approach

Combine pull and push for comprehensive coverage:
- Pull for long-running services
- Push for batch jobs and serverless
- Remote write for cross-datacenter

## Visualization Best Practices

### Dashboard Organization

**1. Executive Dashboard**
- High-level business metrics
- Overall system health
- SLO compliance
- Cost trends

**2. Service Dashboard**
- RED method (Request, Error, Duration)
- Service-specific metrics
- Dependencies and downstream services
- Recent deployments

**3. Infrastructure Dashboard**
- USE method (Utilization, Saturation, Errors)
- Resource consumption
- Capacity planning
- Hardware health

**4. Debugging Dashboard**
- Detailed metrics for troubleshooting
- Logs correlation
- Trace links
- Historical comparisons

### Panel Types

**Time Series Graph**: Trends over time
- Request rates, latency, resource usage
- Best for continuous metrics

**Gauge**: Current state
- % disk usage, active connections
- Best for point-in-time values

**Stat**: Single number
- Total requests, uptime
- Best for aggregated metrics

**Table**: Multiple dimensions
- Per-service metrics, resource breakdown
- Best for comparative analysis

**Heatmap**: Distribution visualization
- Latency distributions
- Best for understanding spread

## Alerting Philosophy

### Alert Tiers

**Critical (Page immediately)**
- User-facing service down
- Data loss in progress
- Security breach detected
- SLO budget exhausted

**Warning (Review within hours)**
- Degraded performance
- Resource approaching limits
- High error rates (not critical yet)
- SLO budget burning fast

**Info (Review during business hours)**
- Capacity planning triggers
- Optimization opportunities
- Deployment notifications
- Unusual but not urgent patterns

### Alert Attributes

Every alert should have:
1. **Clear summary**: What's wrong?
2. **Detailed description**: What's the impact?
3. **Runbook link**: How to fix it?
4. **Severity level**: How urgent?
5. **Team label**: Who's responsible?

### Reducing Alert Fatigue

**Techniques:**
- Use `for` duration to avoid flapping
- Set appropriate thresholds based on data
- Group related alerts
- Inhibit lower-priority alerts when critical fires
- Regular alert review and tuning
- Dead man's switch for monitoring health

## Integration Ecosystem

### Popular Integrations

**Notification Channels:**
- Slack, Microsoft Teams
- PagerDuty, Opsgenie
- Email, SMS
- Webhooks for custom integrations

**Log Aggregation:**
- Loki (Grafana's log system)
- Elasticsearch + Kibana
- Splunk

**Tracing:**
- Jaeger
- Tempo (Grafana's tracing)
- Zipkin

**APM Tools:**
- OpenTelemetry
- New Relic
- Datadog

**Cloud Platforms:**
- AWS CloudWatch
- Google Cloud Monitoring
- Azure Monitor

## Troubleshooting Guide

### Prometheus Not Scraping Targets

**Check:**
1. Target reachability: `curl http://target:port/metrics`
2. Service discovery config
3. Firewall rules
4. Label matchers in scrape config
5. Prometheus logs for errors

### High Cardinality Issues

**Symptoms:**
- Prometheus using excessive memory
- Slow queries
- High CPU usage

**Solutions:**
- Identify high-cardinality metrics
- Drop or relabel problematic labels
- Use recording rules for pre-aggregation
- Set shorter retention periods

### Missing Metrics

**Causes:**
- Metric not exposed by exporter
- Scrape interval too long
- Metric expired (removed by application)
- Relabeling dropping metrics

**Debug:**
```promql
# Check if metric exists at all
{__name__=~".*your_metric.*"}

# Check specific target's metrics
up{instance="target:port"}
```

### Alert Not Firing

**Verify:**
1. Alert rule syntax: `promtool check rules alerts.yml`
2. Rule evaluation: Check "Alerts" tab in Prometheus UI
3. Alert state: pending → firing transition
4. Alertmanager receives alert
5. Routing configuration in Alertmanager

## Performance Tuning

### Prometheus Optimization

```yaml
# Tune for high-cardinality environments
global:
  scrape_interval: 30s  # Increase if too many targets
  evaluation_interval: 30s

storage:
  tsdb:
    min-block-duration: 2h  # Default
    max-block-duration: 2h  # Keep same as min
    retention.time: 15d     # Adjust based on needs
    retention.size: 50GB    # Limit storage growth
```

### Query Optimization

```promql
# Bad: High cardinality, expensive
sum(rate(http_requests_total[5m])) by (user_id)

# Good: Pre-aggregated, efficient
sum(rate(http_requests_total[5m])) by (service, status_class)

# Use recording rules for expensive queries
job:http_requests:rate5m
```

### Grafana Dashboard Performance

- Use shorter time ranges when possible
- Limit number of series per panel
- Use query caching (Grafana Enterprise)
- Create separate dashboards for different use cases
- Use dashboard folder organization

## Security Considerations

### Authentication and Authorization

```yaml
# Enable basic auth in Prometheus
basic_auth_users:
  admin: $2y$10$hashed_password_here

# TLS configuration
tls_server_config:
  cert_file: server.crt
  key_file: server.key
```

### Network Security

- Use TLS for scraping sensitive targets
- Implement firewall rules
- Use VPN for cross-datacenter federation
- Restrict Prometheus API access
- Enable authentication on exporters

### Data Privacy

- Avoid collecting PII in metrics
- Use aggregation to anonymize data
- Implement data retention policies
- Secure remote storage credentials
- Regular security audits

## Resources and Learning

### Official Documentation

- Prometheus: https://prometheus.io/docs/
- Grafana: https://grafana.com/docs/
- Alertmanager: https://prometheus.io/docs/alerting/latest/alertmanager/
- PromQL: https://prometheus.io/docs/prometheus/latest/querying/basics/

### Community Resources

- Prometheus Mailing List
- CNCF Slack #prometheus channel
- Grafana Community Forums
- GitHub repositories for exporters

### Books and Guides

- "Prometheus: Up & Running" by Brian Brazil
- "Site Reliability Engineering" by Google
- "The Site Reliability Workbook" by Google
- CNCF Cloud Native Landscape

### Practice Labs

- Prometheus Demo: https://demo.prometheus.io
- Grafana Play: https://play.grafana.org
- Katacoda Prometheus Scenarios
- Local Kubernetes with kind/minikube

## Next Steps

1. **Start Small**: Install Prometheus and Node Exporter locally
2. **Learn PromQL**: Practice queries on your metrics
3. **Build Dashboards**: Create a simple RED method dashboard
4. **Add Alerts**: Define one critical alert for your system
5. **Iterate**: Gradually expand coverage and sophistication
6. **Share Knowledge**: Document runbooks and share with team

---

**Skill Version**: 1.0.0
**Maintained By**: Observability Community
**License**: MIT
