Health Monitoring

Teabar continuously monitors the health of your environments, providing real-time status updates, automated health checks, and configurable alerting to help you identify and resolve issues quickly.

Health Status

Status Levels

Status      Description                          Action
healthy     All components operational           None required
degraded    Partial functionality, some issues   Investigate
unhealthy   Significant problems detected        Immediate action
unknown     Unable to determine status           Check connectivity

Viewing Health Status

# Check all environments
teabar health

# Check specific environment
teabar health my-feature-env

# Detailed health report
teabar health my-feature-env --verbose

Example output:

Environment Health: my-feature-env

Overall Status: HEALTHY

Components:
  web        healthy    Response time: 45ms
  api        healthy    Response time: 23ms
  database   healthy    Connections: 12/100
  cache      healthy    Hit rate: 94%
  worker     healthy    Queue depth: 5

Last Check: 2024-01-15 14:32:01 UTC
Next Check: 2024-01-15 14:33:01 UTC
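
For scripting, the component lines in the example output can be parsed into a dictionary. The following is a minimal sketch that assumes the plain-text layout shown above (indented name, status, then a free-form detail). `parse_components` is a hypothetical helper, not part of Teabar, and the real output format may differ.

```python
import re

# Matches lines such as "  web        healthy    Response time: 45ms".
# The layout is taken from the example output and is an assumption.
COMPONENT_LINE = re.compile(
    r"^\s+(\S+)\s+(healthy|degraded|unhealthy|unknown)\s+(.*)$"
)


def parse_components(report: str) -> dict:
    """Return {component: {"status": ..., "detail": ...}} from a health report."""
    components = {}
    for line in report.splitlines():
        match = COMPONENT_LINE.match(line)
        if match:
            name, status, detail = match.groups()
            components[name] = {"status": status, "detail": detail.strip()}
    return components


SAMPLE = """\
Environment Health: my-feature-env

Overall Status: HEALTHY

Components:
  web        healthy    Response time: 45ms
  api        healthy    Response time: 23ms
  database   healthy    Connections: 12/100
"""

if __name__ == "__main__":
    for name, info in parse_components(SAMPLE).items():
        print(name, info["status"], info["detail"])
```

Unindented lines such as the overall status are skipped, so only per-component entries end up in the result.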

Detailed Health Report

teabar health my-feature-env --verbose

Output:

Environment Health Report: my-feature-env
Generated: 2024-01-15 14:32:01 UTC

SUMMARY
  Status: HEALTHY
  Uptime: 7d 14h 32m
  Last Incident: 2024-01-08 (resolved)

COMPONENTS

  web (container)
    Status: healthy
    CPU: 45% (threshold: 80%)
    Memory: 1.2GB / 2GB (60%)
    Restarts: 0 (last 24h)
    Health Check: HTTP GET /health -> 200 OK (45ms)

  api (container)
    Status: healthy
    CPU: 32% (threshold: 80%)
    Memory: 890MB / 2GB (44%)
    Restarts: 0 (last 24h)
    Health Check: HTTP GET /api/health -> 200 OK (23ms)

  database (postgres)
    Status: healthy
    Connections: 12/100 (12%)
    Disk: 4.5GB / 20GB (22%)
    Replication Lag: 0ms
    Health Check: TCP connect -> success (5ms)

  cache (redis)
    Status: healthy
    Memory: 256MB / 1GB (25%)
    Hit Rate: 94%
    Connected Clients: 8
    Health Check: PING -> PONG (2ms)

RECENT EVENTS
  2024-01-15 12:00:00  Scheduled health check passed
  2024-01-15 06:00:00  Scheduled health check passed
  2024-01-14 18:00:00  Scheduled health check passed

RECOMMENDATIONS
  None - all components are healthy

Health Checks

Built-in Health Checks

Teabar automatically performs health checks based on component type:

Component Type   Default Check     Interval
HTTP Service     GET /health       30s
TCP Service      TCP connect       30s
Database         Connection test   60s
Cache            PING command      30s
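
To illustrate what a "TCP connect" check amounts to, here is a minimal sketch using Python's standard `socket.create_connection`. The function name `tcp_check` and the default timeout are assumptions for illustration. This shows the general technique, not Teabar's own implementation.

```python
import socket


def tcp_check(host: str, port: int, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

A check like this only proves that the port accepts connections. It does not confirm that the service behind it can answer queries, which is why a database or cache usually gets a protocol-level check as well.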

Custom Health Checks

Configure custom health checks in your blueprint:

# blueprint.yaml
components:
  api:
    image: myapp/api:latest
    health_check:
      type: http
      path: /api/v1/health
      port: 8080
      interval: 30s
      timeout: 10s
      retries: 3
      success_threshold: 1
      failure_threshold: 3

  worker:
    image: myapp/worker:latest
    health_check:
      type: exec
      command: ["./healthcheck.sh"]
      interval: 60s
      timeout: 30s

  database:
    image: postgres:15
    health_check:
      type: tcp
      port: 5432
      interval: 60s
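
The `success_threshold` and `failure_threshold` settings can be pictured as a small state machine. The sketch below assumes the thresholds count consecutive results, as in many health-check systems. Teabar's exact semantics may differ, and `ThresholdTracker` is a hypothetical name used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ThresholdTracker:
    """Flip between healthy and unhealthy only after enough consecutive results.

    Assumption: thresholds count consecutive passes or failures.
    """

    success_threshold: int = 1
    failure_threshold: int = 3
    healthy: bool = True
    _successes: int = field(default=0, init=False, repr=False)
    _failures: int = field(default=0, init=False, repr=False)

    def record(self, passed: bool) -> bool:
        """Record one check result and return the resulting healthy state."""
        if passed:
            self._successes += 1
            self._failures = 0
            if not self.healthy and self._successes >= self.success_threshold:
                self.healthy = True
        else:
            self._failures += 1
            self._successes = 0
            if self.healthy and self._failures >= self.failure_threshold:
                self.healthy = False
        return self.healthy


tracker = ThresholdTracker(success_threshold=1, failure_threshold=3)
states = [tracker.record(result) for result in (False, False, False, True)]
print(states)  # [True, True, False, True]
```

With the values from the `api` example, two failed checks leave the component healthy, the third marks it unhealthy, and a single pass restores it.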

Health Check Types

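The examples above use the http, exec, and tcp check types. An http check accepts additional request and response options:

health_check: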
  type: http
  path: /health
  port: 8080
  method: GET
  headers:
    Authorization: Bearer ${HEALTH_TOKEN}
  expected_status: [200, 201]
  expected_body: '"status":"ok"'
  interval: 30s
  timeout: 10s
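
The way `expected_status` and `expected_body` combine can be sketched as a pure function. This sketch assumes that both conditions must hold and that `expected_body` is a substring match against the response body. Neither is confirmed by this page, and `http_check_passes` is a hypothetical name.

```python
from typing import Optional, Sequence


def http_check_passes(
    status: int,
    body: str,
    expected_status: Sequence[int] = (200,),
    expected_body: Optional[str] = None,
) -> bool:
    """Return True if a response satisfies the configured expectations.

    Assumptions: the status must be one of `expected_status`, and
    `expected_body`, when set, must appear somewhere in the body.
    """
    if status not in expected_status:
        return False
    if expected_body is not None and expected_body not in body:
        return False
    return True


# Mirrors the configuration above: status 200 or 201, body containing "status":"ok".
print(http_check_passes(200, '{"status":"ok"}', [200, 201], '"status":"ok"'))  # True
```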

Alerting

Configuring Alerts

# teabar.yaml
alerts:
  # Health-based alerts
  - name: environment-unhealthy
    condition: health_status == "unhealthy"
    duration: 2m
    severity: critical
    channels:
      - pagerduty:platform-oncall
      - slack:#incidents

  - name: environment-degraded
    condition: health_status == "degraded"
    duration: 5m
    severity: warning
    channels:
      - slack:#platform-alerts

  # Component-specific alerts
  - name: high-cpu
    condition: cpu_usage > 80%
    duration: 5m
    severity: warning
    channels:
      - slack:#platform-alerts

  - name: database-connections-high
    condition: database_connections > 80%
    duration: 2m
    severity: warning
    channels:
      - slack:#platform-alerts
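
The `duration` field suggests that a condition must hold for a sustained period before an alert fires. The sketch below shows one plausible reading, where the condition must hold continuously up to the latest sample. That is an assumption rather than documented Teabar behavior, and `should_fire` is a hypothetical helper.

```python
from typing import Callable, Iterable, Optional, Tuple


def should_fire(
    samples: Iterable[Tuple[float, float]],
    condition: Callable[[float], bool],
    duration_s: float,
) -> bool:
    """Return True if `condition` has held continuously for at least `duration_s`.

    `samples` are (timestamp_seconds, value) pairs in ascending time order.
    Any sample that fails the condition resets the clock.
    """
    breach_start: Optional[float] = None
    last_ts: Optional[float] = None
    for ts, value in samples:
        if condition(value):
            if breach_start is None:
                breach_start = ts
        else:
            breach_start = None
        last_ts = ts
    if breach_start is None or last_ts is None:
        return False
    return last_ts - breach_start >= duration_s


# CPU sampled once a minute; the high-cpu rule uses "> 80%" for 5m (300s).
cpu = [(0, 70), (60, 85), (120, 90), (180, 88), (240, 91), (300, 86), (360, 83)]
print(should_fire(cpu, lambda v: v > 80, 300))  # True
```

Under this reading, a brief dip below the threshold restarts the timer, which keeps short spikes from paging anyone.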

Alert Channels

channels:
  slack:
    webhook_url: https://hooks.slack.com/services/xxx
    default_channel: "#platform-alerts"

Alert format:

🔴 CRITICAL: environment-unhealthy
Environment: my-feature-env
Project: frontend
Status: unhealthy for 2m
Components affected: api, worker
Time: 2024-01-15 14:32:01 UTC

Alert Lifecycle

# View active alerts
teabar alerts list

# Acknowledge an alert
teabar alerts ack alert_abc123

# Resolve an alert
teabar alerts resolve alert_abc123

# View alert history
teabar alerts history --since 7d

Incident Management

Incident Timeline

When health issues are detected, Teabar creates an incident timeline:

teabar incidents show inc_xyz789

Output:

Incident: inc_xyz789
Environment: my-feature-env
Status: resolved

Timeline:
  2024-01-15 14:30:00  Health check failed: api component
  2024-01-15 14:30:30  Health check failed: api component (retry 1)
  2024-01-15 14:31:00  Health check failed: api component (retry 2)
  2024-01-15 14:31:00  Status changed: healthy -> unhealthy
  2024-01-15 14:31:01  Alert triggered: environment-unhealthy
  2024-01-15 14:31:05  Alert sent to: pagerduty:platform-oncall
  2024-01-15 14:31:06  Alert sent to: slack:#incidents
  2024-01-15 14:35:00  Alert acknowledged by: [email protected]
  2024-01-15 14:42:00  Health check passed: api component
  2024-01-15 14:42:30  Status changed: unhealthy -> healthy
  2024-01-15 14:42:30  Incident auto-resolved

Duration: 12m 30s
Root Cause: OOM kill due to memory leak in /api/reports endpoint
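
The reported duration matches the span between the first and last timeline entries. As a minimal sketch, assuming the timestamp layout shown in the output above, it can be recomputed like this. `incident_duration` and `format_duration` are hypothetical helpers, not Teabar functions.

```python
from datetime import datetime, timedelta

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def incident_duration(timeline: str) -> timedelta:
    """Return the time between the earliest and latest timestamped entries."""
    stamps = []
    for line in timeline.splitlines():
        try:
            # Each entry is assumed to start with a 19-character timestamp.
            stamps.append(datetime.strptime(line.strip()[:19], TIMESTAMP_FORMAT))
        except ValueError:
            continue  # Skip headers and blank lines.
    if not stamps:
        raise ValueError("timeline contains no timestamped entries")
    return max(stamps) - min(stamps)


def format_duration(delta: timedelta) -> str:
    """Format a timedelta as minutes and seconds."""
    minutes, seconds = divmod(int(delta.total_seconds()), 60)
    return f"{minutes}m {seconds}s"


TIMELINE = """\
Timeline:
  2024-01-15 14:30:00  Health check failed: api component
  2024-01-15 14:31:00  Status changed: healthy -> unhealthy
  2024-01-15 14:42:30  Incident auto-resolved
"""

print(format_duration(incident_duration(TIMELINE)))  # 12m 30s
```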

Auto-Remediation

Configure automatic remediation actions:

# teabar.yaml
remediation:
  - trigger: health_status == "unhealthy"
    component: api
    actions:
      - type: restart
        max_attempts: 3
        cooldown: 5m
      - type: notify
        message: "Auto-restart triggered for api component"
        channels: [slack:#platform-alerts]

  - trigger: cpu_usage > 90%
    duration: 10m
    actions:
      - type: scale
        replicas: "+1"
        max_replicas: 5
      - type: notify
        message: "Auto-scaled due to high CPU"
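
The interaction between a relative `replicas` value such as "+1" and `max_replicas` can be sketched as follows. This assumes a leading sign means a change relative to the current count and that the result never drops below one replica. Both are assumptions, and `apply_scale` is a hypothetical name.

```python
def apply_scale(current: int, replicas: str, max_replicas: int) -> int:
    """Return the new replica count for a relative ("+1") or absolute ("3") value.

    Assumption: the result is clamped to the range 1..max_replicas.
    """
    if replicas.startswith(("+", "-")):
        target = current + int(replicas)  # int() accepts a leading sign
    else:
        target = int(replicas)
    return max(1, min(target, max_replicas))


print(apply_scale(3, "+1", 5))  # 4
print(apply_scale(5, "+1", 5))  # 5, already at max_replicas
```

Capping at `max_replicas` keeps a sustained CPU alert from scaling the component without bound.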

Status Page

Internal Status Page

View aggregated health across all environments:

# View status summary
teabar status

# View specific project status
teabar status --project frontend

Public Status Page

Configure a public status page that exposes selected environments:

# teabar.yaml
status_page:
  enabled: true
  public: true
  url: status.example.com
  components:
    - name: "Production API"
      environment: production
    - name: "Staging"
      environment: staging
  show_metrics: true
  show_incidents: true

Best Practices

  1. Configure health checks for all components - Don’t rely on default checks alone
  2. Set appropriate thresholds - Avoid alert fatigue with realistic thresholds
  3. Use escalation policies - Start with Slack, escalate to PagerDuty
  4. Enable auto-remediation cautiously - Start with notifications, add actions gradually
  5. Review incidents regularly - Conduct post-mortems for recurring issues

Troubleshooting

Health Checks Failing

# View health check logs
teabar logs my-feature-env --component health-agent

# Test health check manually
teabar health check my-feature-env --component api --verbose

# View recent health check results
teabar health history my-feature-env --since 1h

False Positives

If health checks report failures for components that are actually healthy, relax the check settings:

health_check:
  # Increase failure threshold
  failure_threshold: 5  # Require 5 failures before unhealthy
  # Increase timeout
  timeout: 30s  # Allow more time for response
  # Add retry delay
  retry_delay: 5s  # Wait between retries