Health Monitoring

Teabar continuously monitors the health of your environments, providing real-time status updates, automated health checks, and configurable alerting to help you identify and resolve issues quickly.

Health Status

Status Levels

Status	Description	Action
`healthy`	All components operational	None required
`degraded`	Partial functionality, some issues	Investigate
`unhealthy`	Significant problems detected	Immediate action
`unknown`	Unable to determine status	Check connectivity
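
For scripts that consume these statuses, the table above can be encoded as a small lookup. This is a minimal sketch, not part of Teabar itself: the `HealthStatus` enum and `recommended_action` helper are hypothetical names, and treating unrecognized strings as `unknown` is an assumption.

```python
from enum import Enum


class HealthStatus(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"
    UNKNOWN = "unknown"


# Recommended operator action per status, copied from the table above.
RECOMMENDED_ACTION = {
    HealthStatus.HEALTHY: "None required",
    HealthStatus.DEGRADED: "Investigate",
    HealthStatus.UNHEALTHY: "Immediate action",
    HealthStatus.UNKNOWN: "Check connectivity",
}


def recommended_action(status: str) -> str:
    """Return the recommended action for a status string.

    Assumption of this sketch: unrecognized strings fall back to `unknown`.
    """
    try:
        return RECOMMENDED_ACTION[HealthStatus(status.lower())]
    except ValueError:
        return RECOMMENDED_ACTION[HealthStatus.UNKNOWN]
```

The lookup is case-insensitive, so it accepts both the lowercase table values and the uppercase form used in CLI output.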

Viewing Health Status

# Check all environments
teabar health

# Check specific environment
teabar health my-feature-env

# Detailed health report
teabar health my-feature-env --verbose

Example output:

Environment Health: my-feature-env

Overall Status: HEALTHY

Components:
  web        healthy    Response time: 45ms
  api        healthy    Response time: 23ms
  database   healthy    Connections: 12/100
  cache      healthy    Hit rate: 94%
  worker     healthy    Queue depth: 5

Last Check: 2024-01-15 14:32:01 UTC
Next Check: 2024-01-15 14:33:01 UTC

Detailed Health Report

teabar health my-feature-env --verbose

Output:

Environment Health Report: my-feature-env
Generated: 2024-01-15 14:32:01 UTC

SUMMARY
  Status: HEALTHY
  Uptime: 7d 14h 32m
  Last Incident: 2024-01-08 (resolved)

COMPONENTS

  web (container)
    Status: healthy
    CPU: 45% (threshold: 80%)
    Memory: 1.2GB / 2GB (60%)
    Restarts: 0 (last 24h)
    Health Check: HTTP GET /health -> 200 OK (45ms)

  api (container)
    Status: healthy
    CPU: 32% (threshold: 80%)
    Memory: 890MB / 2GB (44%)
    Restarts: 0 (last 24h)
    Health Check: HTTP GET /api/health -> 200 OK (23ms)

  database (postgres)
    Status: healthy
    Connections: 12/100 (12%)
    Disk: 4.5GB / 20GB (22%)
    Replication Lag: 0ms
    Health Check: TCP connect -> success (5ms)

  cache (redis)
    Status: healthy
    Memory: 256MB / 1GB (25%)
    Hit Rate: 94%
    Connected Clients: 8
    Health Check: PING -> PONG (2ms)

RECENT EVENTS
  2024-01-15 12:00:00  Scheduled health check passed
  2024-01-15 06:00:00  Scheduled health check passed
  2024-01-14 18:00:00  Scheduled health check passed

RECOMMENDATIONS
  None - all components are healthy

Health Checks

Built-in Health Checks

Teabar automatically performs health checks based on component type:

Component Type	Default Check	Interval
HTTP Service	GET /health	30s
TCP Service	TCP connect	30s
Database	Connection test	60s
Cache	PING command	30s
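
To make the "TCP connect" check in the table concrete, here is a minimal Python sketch of what such a probe typically does. It illustrates the technique only and is not Teabar's implementation; the function name `tcp_check` and the 10-second default timeout are assumptions.

```python
import socket


def tcp_check(host: str, port: int, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

A probe like this confirms only that the port accepts connections. It says nothing about whether the service behind it can answer queries, which is one reason to add a custom check for critical components.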

Custom Health Checks

Configure custom health checks in your blueprint:

# blueprint.yaml
components:
  api:
    image: myapp/api:latest
    health_check:
      type: http
      path: /api/v1/health
      port: 8080
      interval: 30s
      timeout: 10s
      retries: 3
      success_threshold: 1
      failure_threshold: 3

  worker:
    image: myapp/worker:latest
    health_check:
      type: exec
      command: ["./healthcheck.sh"]
      interval: 60s
      timeout: 30s

  database:
    image: postgres:15
    health_check:
      type: tcp
      port: 5432
      interval: 60s
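
The `success_threshold` and `failure_threshold` settings are easier to reason about as a small state machine. The sketch below assumes the thresholds count consecutive results, which this page does not state explicitly; `ThresholdTracker` is a hypothetical name used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ThresholdTracker:
    """Track consecutive check results and flip status at the thresholds."""

    success_threshold: int = 1
    failure_threshold: int = 3
    status: str = "healthy"
    _successes: int = 0  # consecutive passes since the last failure
    _failures: int = 0   # consecutive failures since the last pass

    def record(self, passed: bool) -> str:
        """Record one check result and return the resulting status."""
        if passed:
            self._successes += 1
            self._failures = 0
            if self._successes >= self.success_threshold:
                self.status = "healthy"
        else:
            self._failures += 1
            self._successes = 0
            if self._failures >= self.failure_threshold:
                self.status = "unhealthy"
        return self.status
```

Under this reading, the `api` settings above (`failure_threshold: 3`, `success_threshold: 1`) mean that two isolated failures never change the status, while a single passing check is enough to recover.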

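Health Check Types

The blueprint above uses the `http`, `exec`, and `tcp` check types. A fully specified `http` check looks like this: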

health_check:
  type: http
  path: /health
  port: 8080
  method: GET
  headers:
    Authorization: Bearer ${HEALTH_TOKEN}
  expected_status: [200, 201]
  expected_body: '"status":"ok"'
  interval: 30s
  timeout: 10s
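
The pass/fail decision implied by `expected_status` and `expected_body` can be sketched as a pure function. This assumes `expected_body` is matched as a plain substring of the response body; the page does not say whether Teabar uses substring or pattern matching, and `http_check_passed` is a hypothetical helper.

```python
from typing import Optional, Sequence


def http_check_passed(
    status_code: int,
    body: str,
    expected_status: Sequence[int],
    expected_body: Optional[str] = None,
) -> bool:
    """Return True if the response satisfies both expectations."""
    if status_code not in expected_status:
        return False
    # Assumption: a substring match, not a regular expression.
    if expected_body is not None and expected_body not in body:
        return False
    return True
```

If matching is indeed substring-based, a response such as `{"status": "ok"}` (with a space after the colon) would not match `'"status":"ok"'`, so the expected body should mirror the exact serialization the endpoint produces.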

Alerting

Configuring Alerts

# teabar.yaml
alerts:
  # Health-based alerts
  - name: environment-unhealthy
    condition: health_status == "unhealthy"
    duration: 2m
    severity: critical
    channels:
      - pagerduty:platform-oncall
      - slack:#incidents

  - name: environment-degraded
    condition: health_status == "degraded"
    duration: 5m
    severity: warning
    channels:
      - slack:#platform-alerts

  # Component-specific alerts
  - name: high-cpu
    condition: cpu_usage > 80%
    duration: 5m
    severity: warning
    channels:
      - slack:#platform-alerts

  - name: database-connections-high
    condition: database_connections > 80%
    duration: 2m
    severity: warning
    channels:
      - slack:#platform-alerts
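
The `duration` field is most naturally read as "the condition must hold continuously for this long before the alert fires". The sketch below models that reading; it is an assumption about the semantics, and `DurationGate` is a hypothetical name.

```python
from typing import Optional


class DurationGate:
    """Fire only after a condition has held continuously for `duration_s` seconds."""

    def __init__(self, duration_s: float) -> None:
        self.duration_s = duration_s
        self._since: Optional[float] = None  # when the condition first became true

    def update(self, condition_true: bool, now_s: float) -> bool:
        """Feed one observation; return True if the alert should be firing."""
        if not condition_true:
            self._since = None  # any false observation resets the clock
            return False
        if self._since is None:
            self._since = now_s
        return now_s - self._since >= self.duration_s
```

With `duration: 2m` (120 seconds), an environment that turns unhealthy and recovers after 90 seconds would not page anyone under this model, which is the usual motivation for pairing a condition with a duration.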

Alert Channels

channels:
  slack:
    webhook_url: https://hooks.slack.com/services/xxx
    default_channel: "#platform-alerts"

Alert format:

🔴 CRITICAL: environment-unhealthy
Environment: my-feature-env
Project: frontend
Status: unhealthy for 2m
Components affected: api, worker
Time: 2024-01-15 14:32:01 UTC

Alert Lifecycle

# View active alerts
teabar alerts list

# Acknowledge an alert
teabar alerts ack alert_abc123

# Resolve an alert
teabar alerts resolve alert_abc123

# View alert history
teabar alerts history --since 7d

Incident Management

Incident Timeline

When Teabar detects a health issue, it records an incident timeline that you can view by incident ID:

teabar incidents show inc_xyz789

Output:

Incident: inc_xyz789
Environment: my-feature-env
Status: resolved

Timeline:
  2024-01-15 14:30:00  Health check failed: api component
  2024-01-15 14:30:30  Health check failed: api component (retry 1)
  2024-01-15 14:31:00  Health check failed: api component (retry 2)
  2024-01-15 14:31:00  Status changed: healthy -> unhealthy
  2024-01-15 14:31:01  Alert triggered: environment-unhealthy
  2024-01-15 14:31:05  Alert sent to: pagerduty:platform-oncall
  2024-01-15 14:31:06  Alert sent to: slack:#incidents
  2024-01-15 14:35:00  Alert acknowledged by: [email protected]
  2024-01-15 14:42:00  Health check passed: api component
  2024-01-15 14:42:30  Status changed: unhealthy -> healthy
  2024-01-15 14:42:30  Incident auto-resolved

Duration: 12m 30s
Root Cause: OOM kill due to memory leak in /api/reports endpoint
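
The reported duration is consistent with measuring from the first failed check (14:30:00) to the auto-resolve event (14:42:30). A short sketch of that arithmetic, assuming this is how the duration is derived and using a hypothetical helper name:

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def incident_duration(first_event: str, last_event: str) -> str:
    """Format the time between two timeline timestamps as 'Xm Ys'."""
    start = datetime.strptime(first_event, TIMESTAMP_FORMAT)
    end = datetime.strptime(last_event, TIMESTAMP_FORMAT)
    minutes, seconds = divmod(int((end - start).total_seconds()), 60)
    return f"{minutes}m {seconds}s"


print(incident_duration("2024-01-15 14:30:00", "2024-01-15 14:42:30"))  # 12m 30s
```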

Auto-Remediation

Configure automatic remediation actions:

# teabar.yaml
remediation:
  - trigger: health_status == "unhealthy"
    component: api
    actions:
      - type: restart
        max_attempts: 3
        cooldown: 5m
      - type: notify
        message: "Auto-restart triggered for api component"
        channels: [slack:#platform-alerts]

  - trigger: cpu_usage > 90%
    duration: 10m
    actions:
      - type: scale
        replicas: "+1"
        max_replicas: 5
      - type: notify
        message: "Auto-scaled due to high CPU"
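
The `scale` action combines a relative `replicas` value with a `max_replicas` cap. The sketch below shows one plausible way to resolve the two. It is not Teabar's implementation: `apply_scale` is a hypothetical helper, and both the handling of absolute values and the floor of one replica are assumptions.

```python
def apply_scale(current: int, replicas: str, max_replicas: int) -> int:
    """Resolve a `replicas` value such as "+1" against the current replica count."""
    if replicas.startswith(("+", "-")):
        target = current + int(replicas)  # relative change, e.g. "+1"
    else:
        target = int(replicas)            # assumption: a bare number is absolute
    # Cap at max_replicas; never go below one replica (assumption).
    return max(1, min(target, max_replicas))
```

For example, `apply_scale(2, "+1", 5)` yields 3, while `apply_scale(5, "+1", 5)` stays at 5, so repeated triggers cannot scale past the cap.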

Status Page

Internal Status Page

View aggregated health across all environments:

# View status summary
teabar status

# View specific project status
teabar status --project frontend

Public Status Page

Configure a public status page that exposes the health of selected environments:

# teabar.yaml
status_page:
  enabled: true
  public: true
  url: status.example.com
  components:
    - name: "Production API"
      environment: production
    - name: "Staging"
      environment: staging
  show_metrics: true
  show_incidents: true

Best Practices

1. Configure health checks for all components: don’t rely on default checks alone.
2. Set appropriate thresholds: avoid alert fatigue with realistic thresholds.
3. Use escalation policies: start with Slack, escalate to PagerDuty.
4. Enable auto-remediation cautiously: start with notifications, add actions gradually.
5. Review incidents regularly: conduct post-mortems for recurring issues.

Note

Health check results are stored for 30 days and can be used for SLA reporting.
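
As an illustration of SLA reporting from stored results, availability can be computed as the share of passing checks. This sketch assumes the results have already been exported as a sequence of booleans; the export step and the `availability_percent` helper are hypothetical.

```python
from typing import Sequence


def availability_percent(results: Sequence[bool]) -> float:
    """Return the percentage of passing checks in `results`."""
    if not results:
        raise ValueError("no health check results to report on")
    return 100.0 * sum(results) / len(results)
```

At a 30-second check interval, 30 days of history is roughly 86,400 results per component, so a single failed check moves the figure by about 0.001 percentage points.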

Troubleshooting

Health Checks Failing

# View health check logs
teabar logs my-feature-env --component health-agent

# Test health check manually
teabar health check my-feature-env --component api --verbose

# View recent health check results
teabar health history my-feature-env --since 1h

False Positives

If health checks report failures for components that are actually healthy, relax the check settings:

# Apply one or more of these settings under a component's health_check
health_check:
  failure_threshold: 5  # Require 5 failures before unhealthy
  timeout: 30s          # Allow more time for response
  retry_delay: 5s       # Wait between retries
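
Raising `failure_threshold` trades fewer false positives for slower detection. A rough estimate of that cost, assuming checks run once per `interval` and that failures must be consecutive (`detection_delay_s` is a hypothetical helper):

```python
def detection_delay_s(interval_s: float, failure_threshold: int) -> float:
    """Seconds between the first failed check and the switch to unhealthy."""
    # The first failure counts immediately; each further one needs another interval.
    return (failure_threshold - 1) * interval_s


print(detection_delay_s(30, 3))  # 60
print(detection_delay_s(30, 5))  # 120
```

The first figure matches the incident timeline earlier on this page, where the first failure at 14:30:00 led to a status change at 14:31:00. Moving to a threshold of 5 at the same interval would double that delay to two minutes.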