# Health Monitoring
Teabar continuously monitors the health of your environments, providing real-time status updates, automated health checks, and configurable alerting to help you identify and resolve issues quickly.
## Health Status

### Status Levels

| Status | Description | Action |
|---|---|---|
| `healthy` | All components operational | None required |
| `degraded` | Partial functionality, some issues | Investigate |
| `unhealthy` | Significant problems detected | Immediate action |
| `unknown` | Unable to determine status | Check connectivity |
### Viewing Health Status

```bash
# Check all environments
teabar health

# Check specific environment
teabar health my-feature-env

# Detailed health report
teabar health my-feature-env --verbose
```

Example output:
```
Environment Health: my-feature-env

Overall Status: HEALTHY

Components:
  web       healthy   Response time: 45ms
  api       healthy   Response time: 23ms
  database  healthy   Connections: 12/100
  cache     healthy   Hit rate: 94%
  worker    healthy   Queue depth: 5

Last Check: 2024-01-15 14:32:01 UTC
Next Check: 2024-01-15 14:33:01 UTC
```

### Detailed Health Report

```bash
teabar health my-feature-env --verbose
```

Output:
```
Environment Health Report: my-feature-env
Generated: 2024-01-15 14:32:01 UTC

SUMMARY
  Status: HEALTHY
  Uptime: 7d 14h 32m
  Last Incident: 2024-01-08 (resolved)

COMPONENTS
  web (container)
    Status: healthy
    CPU: 45% (threshold: 80%)
    Memory: 1.2GB / 2GB (60%)
    Restarts: 0 (last 24h)
    Health Check: HTTP GET /health -> 200 OK (45ms)

  api (container)
    Status: healthy
    CPU: 32% (threshold: 80%)
    Memory: 890MB / 2GB (44%)
    Restarts: 0 (last 24h)
    Health Check: HTTP GET /api/health -> 200 OK (23ms)

  database (postgres)
    Status: healthy
    Connections: 12/100 (12%)
    Disk: 4.5GB / 20GB (22%)
    Replication Lag: 0ms
    Health Check: TCP connect -> success (5ms)

  cache (redis)
    Status: healthy
    Memory: 256MB / 1GB (25%)
    Hit Rate: 94%
    Connected Clients: 8
    Health Check: PING -> PONG (2ms)

RECENT EVENTS
  2024-01-15 12:00:00  Scheduled health check passed
  2024-01-15 06:00:00  Scheduled health check passed
  2024-01-14 18:00:00  Scheduled health check passed

RECOMMENDATIONS
  None - all components are healthy
```

## Health Checks

### Built-in Health Checks
Teabar automatically performs health checks based on component type:
| Component Type | Default Check | Interval |
|---|---|---|
| HTTP Service | GET /health | 30s |
| TCP Service | TCP connect | 30s |
| Database | Connection test | 60s |
| Cache | PING command | 30s |
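The checks in the table above boil down to a simple probe-and-classify step. The sketch below shows roughly what an HTTP-type check does on each tick: issue a GET with a timeout, then classify the result. This is illustrative only; `classify_response` and `http_health_check` are hypothetical names, not part of Teabar.

```python
import urllib.error
import urllib.request

def classify_response(status, body, expected_status=(200,), expected_body=None):
    """Map an HTTP response to a health status."""
    if status not in expected_status:
        return "unhealthy"
    if expected_body is not None and expected_body not in body:
        return "unhealthy"
    return "healthy"

def http_health_check(url, timeout=10.0, **kwargs):
    """Perform one GET probe; any transport error counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_response(resp.status,
                                     resp.read().decode("utf-8", "replace"),
                                     **kwargs)
    except urllib.error.HTTPError as e:  # non-2xx raises, but is still an answer
        return classify_response(e.code, "", **kwargs)
    except OSError:                      # refused, DNS failure, timeout, ...
        return "unhealthy"
```

Note that a timeout or refused connection is treated the same as a bad response: from the monitor's point of view, both mean the component is not serving.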
### Custom Health Checks
Configure custom health checks in your blueprint:
```yaml
# blueprint.yaml
components:
  api:
    image: myapp/api:latest
    health_check:
      type: http
      path: /api/v1/health
      port: 8080
      interval: 30s
      timeout: 10s
      retries: 3
      success_threshold: 1
      failure_threshold: 3

  worker:
    image: myapp/worker:latest
    health_check:
      type: exec
      command: ["./healthcheck.sh"]
      interval: 60s
      timeout: 30s

  database:
    image: postgres:15
    health_check:
      type: tcp
      port: 5432
      interval: 60s
```

### Health Check Types
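The `failure_threshold` and `success_threshold` fields above determine when a component's reported status actually flips: a single failed probe doesn't mark it unhealthy, and (with `success_threshold: 1`) a single pass brings it back. A minimal state-machine sketch of that logic, under assumed semantics (consecutive counts, reset by the opposite result):

```python
class HealthState:
    """Flip between healthy/unhealthy based on consecutive check results."""

    def __init__(self, failure_threshold=3, success_threshold=1):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.status = "healthy"
        self._fails = 0
        self._passes = 0

    def record(self, passed):
        """Feed one check result; return the (possibly updated) status."""
        if passed:
            self._fails = 0
            self._passes += 1
            if self.status == "unhealthy" and self._passes >= self.success_threshold:
                self.status = "healthy"
        else:
            self._passes = 0
            self._fails += 1
            if self._fails >= self.failure_threshold:
                self.status = "unhealthy"
        return self.status

# Two failures are tolerated; the third flips the status; one pass restores it.
s = HealthState(failure_threshold=3)
[s.record(ok) for ok in (False, False, False, True)]
# -> ["healthy", "healthy", "unhealthy", "healthy"]
```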
```yaml
health_check:
  type: http
  path: /health
  port: 8080
  method: GET
  headers:
    Authorization: Bearer ${HEALTH_TOKEN}
  expected_status: [200, 201]
  expected_body: '"status":"ok"'
  interval: 30s
  timeout: 10s
```

## Alerting
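A detail worth knowing before configuring alerts: each alert pairs a `condition` with a `duration`, and the alert fires only once the condition has held continuously for that long, so brief blips don't page anyone. A sketch of that evaluation (assumed semantics, simplified to plain seconds):

```python
class DurationAlert:
    """Fire only after the condition has held continuously for `duration` seconds."""

    def __init__(self, duration):
        self.duration = duration
        self._since = None  # when the condition first became true, or None

    def evaluate(self, condition_true, now):
        if not condition_true:
            self._since = None          # condition cleared; reset the clock
            return False
        if self._since is None:
            self._since = now
        return now - self._since >= self.duration

# duration: 2m, as in the environment-unhealthy alert below.
alert = DurationAlert(duration=120)
```

If the condition clears even briefly, the clock resets and the full `duration` must elapse again before the alert fires.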
### Configuring Alerts
```yaml
# teabar.yaml
alerts:
  # Health-based alerts
  - name: environment-unhealthy
    condition: health_status == "unhealthy"
    duration: 2m
    severity: critical
    channels:
      - pagerduty:platform-oncall
      - slack:#incidents

  - name: environment-degraded
    condition: health_status == "degraded"
    duration: 5m
    severity: warning
    channels:
      - slack:#platform-alerts

  # Component-specific alerts
  - name: high-cpu
    condition: cpu_usage > 80%
    duration: 5m
    severity: warning
    channels:
      - slack:#platform-alerts

  - name: database-connections-high
    condition: database_connections > 80%
    duration: 2m
    severity: warning
    channels:
      - slack:#platform-alerts
```

### Alert Channels

```yaml
channels:
  slack:
    webhook_url: https://hooks.slack.com/services/xxx
    default_channel: "#platform-alerts"
```

Alert format:
```
🔴 CRITICAL: environment-unhealthy
Environment: my-feature-env
Project: frontend
Status: unhealthy for 2m
Components affected: api, worker
Time: 2024-01-15 14:32:01 UTC
```

### Alert Lifecycle
```bash
# View active alerts
teabar alerts list

# Acknowledge an alert
teabar alerts ack alert_abc123

# Resolve an alert
teabar alerts resolve alert_abc123

# View alert history
teabar alerts history --since 7d
```

## Incident Management
### Incident Timeline
When health issues are detected, Teabar creates an incident timeline:
```bash
teabar incidents show inc_xyz789
```

Output:
```
Incident: inc_xyz789
Environment: my-feature-env
Status: resolved

Timeline:
  2024-01-15 14:30:00  Health check failed: api component
  2024-01-15 14:30:30  Health check failed: api component (retry 1)
  2024-01-15 14:31:00  Health check failed: api component (retry 2)
  2024-01-15 14:31:00  Status changed: healthy -> unhealthy
  2024-01-15 14:31:01  Alert triggered: environment-unhealthy
  2024-01-15 14:31:05  Alert sent to: pagerduty:platform-oncall
  2024-01-15 14:31:06  Alert sent to: slack:#incidents
  2024-01-15 14:35:00  Alert acknowledged by: [email protected]
  2024-01-15 14:42:00  Health check passed: api component
  2024-01-15 14:42:30  Status changed: unhealthy -> healthy
  2024-01-15 14:42:30  Incident auto-resolved

Duration: 12m 30s
Root Cause: OOM kill due to memory leak in /api/reports endpoint
```

### Auto-Remediation
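Restart-style remediation is normally self-limiting: attempts are capped and spaced out so a crash-looping component isn't bounced forever. The `max_attempts`/`cooldown` gating configured below behaves roughly like this sketch (hypothetical logic, not Teabar's internals):

```python
class RestartGate:
    """Allow at most `max_attempts` restarts, spaced `cooldown` seconds apart."""

    def __init__(self, max_attempts=3, cooldown=300.0):
        self.max_attempts = max_attempts
        self.cooldown = cooldown
        self.attempts = 0
        self._last = None  # time of the last permitted restart

    def may_restart(self, now):
        if self.attempts >= self.max_attempts:
            return False          # budget exhausted; escalate to humans instead
        if self._last is not None and now - self._last < self.cooldown:
            return False          # still cooling down
        self.attempts += 1
        self._last = now
        return True
```

Once the budget is spent, the gate stays closed; at that point a notification channel (as in the config below) is the sensible fallback.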
Configure automatic remediation actions:
```yaml
# teabar.yaml
remediation:
  - trigger: health_status == "unhealthy"
    component: api
    actions:
      - type: restart
        max_attempts: 3
        cooldown: 5m
      - type: notify
        message: "Auto-restart triggered for api component"
        channels: [slack:#platform-alerts]

  - trigger: cpu_usage > 90%
    duration: 10m
    actions:
      - type: scale
        replicas: "+1"
        max_replicas: 5
      - type: notify
        message: "Auto-scaled due to high CPU"
```

## Status Page
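A status page rolls many component and environment statuses up into a single headline status; a common rule is worst-status-wins. A sketch under an assumed severity ordering (placing `unknown` between `healthy` and `degraded` is a judgment call, not documented Teabar behavior):

```python
# Assumed severity ordering; higher value = worse.
SEVERITY = {"healthy": 0, "unknown": 1, "degraded": 2, "unhealthy": 3}

def overall_status(statuses):
    """Worst-status-wins rollup across components or environments."""
    if not statuses:
        return "unknown"
    return max(statuses, key=lambda s: SEVERITY[s])

overall_status(["healthy", "degraded", "healthy"])  # -> "degraded"
```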
### Internal Status Page
View aggregated health across all environments:
```bash
# View status summary
teabar status

# View specific project status
teabar status --project frontend
```

### Public Status Page
Configure a public status page for your team:
```yaml
# teabar.yaml
status_page:
  enabled: true
  public: true
  url: status.example.com
  components:
    - name: "Production API"
      environment: production
    - name: "Staging"
      environment: staging
  show_metrics: true
  show_incidents: true
```

## Best Practices
- **Configure health checks for all components** - Don't rely on default checks alone
- **Set appropriate thresholds** - Avoid alert fatigue with realistic thresholds
- **Use escalation policies** - Start with Slack, escalate to PagerDuty
- **Enable auto-remediation cautiously** - Start with notifications, add actions gradually
- **Review incidents regularly** - Conduct post-mortems for recurring issues
> **Note:** Health check results are stored for 30 days and can be used for SLA reporting.
## Troubleshooting

### Health Checks Failing
```bash
# View health check logs
teabar logs my-feature-env --component health-agent

# Test health check manually
teabar health check my-feature-env --component api --verbose

# View recent health check results
teabar health history my-feature-env --since 1h
```

### False Positives
If health checks are triggering incorrectly:
```yaml
# Increase failure threshold
health_check:
  failure_threshold: 5   # Require 5 failures before unhealthy

# Increase timeout
health_check:
  timeout: 30s           # Allow more time for response

# Add retry delay
health_check:
  retry_delay: 5s        # Wait between retries
```
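Raising `failure_threshold` trades detection speed for stability: with probes every `interval`, the gap between the first failed probe and the status flipping to unhealthy grows linearly. A quick sanity check of that arithmetic (the helper name is illustrative):

```python
def detection_delay(interval_s, failure_threshold):
    """Approximate delay from the first failed probe to the unhealthy flip:
    the remaining (failure_threshold - 1) probes are spaced interval_s apart."""
    return (failure_threshold - 1) * interval_s

detection_delay(30, 3)  # default-style settings -> 60 seconds
detection_delay(30, 5)  # failure_threshold: 5   -> 120 seconds
```

So a threshold of 5 roughly doubles worst-case detection time versus the default of 3; per-probe `timeout` adds on top of this if failures manifest as hangs rather than errors.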