Alerts & Thresholds
Alerts decide when humans should intervene. Poor alerting causes either missed incidents or constant noise.
Why Alerting Matters
Without alerts:
- Incidents are discovered by users
- SLAs are violated silently
- Small issues become outages
Too many alerts are equally dangerous.
Alerts vs Metrics
- Metrics show trends
- Alerts demand action
Every alert should be actionable.
Alert Design Principles
Good alerts are:
- Actionable
- Specific
- Timely
- Rare
If an alert doesn't require action, remove it.
Critical Jenkins Alerts
Must-have alerts:
- Jenkins controller down
- Queue wait time exceeds threshold
- No available agents
- Authentication failures spike
- Disk almost full
These indicate real impact.
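The queue wait alert above can be sketched as a pure check over the items returned by Jenkins' `/queue/api/json` endpoint, whose entries carry an `inQueueSince` timestamp in epoch milliseconds. The threshold value here is illustrative, not a recommendation:

```python
import time

# Illustrative threshold; tune it from your own baseline.
QUEUE_WAIT_THRESHOLD_SECONDS = 300

def max_queue_wait_seconds(queue_items, now_ms=None):
    """Return the longest wait (in seconds) among queued items.

    Each item is expected to carry `inQueueSince` in epoch milliseconds,
    as in the JSON returned by Jenkins' /queue/api/json endpoint.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    if not queue_items:
        return 0.0
    return max((now_ms - item["inQueueSince"]) / 1000.0 for item in queue_items)

def should_alert(queue_items, threshold=QUEUE_WAIT_THRESHOLD_SECONDS):
    """Fire when any item has waited longer than the threshold."""
    return max_queue_wait_seconds(queue_items) > threshold
```

Keeping the comparison in a pure function like this makes the alert condition itself easy to unit-test, separate from the HTTP call that fetches the queue.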
Warning-Level Alerts
Useful warnings:
- Increasing queue length
- Rising build failure rate
- High JVM memory usage
- Agent startup delays
Warnings allow proactive fixes.
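A rising build failure rate is a good example of a warning computed over a sliding window rather than a single event. A minimal sketch, with window size and warning ratio as assumptions to tune:

```python
from collections import deque

class FailureRateMonitor:
    """Warn when the failure rate over a sliding window of recent
    builds crosses a tunable warning threshold."""

    def __init__(self, window=20, warn_ratio=0.3):
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.warn_ratio = warn_ratio

    def record(self, success):
        self.results.append(success)

    def failure_rate(self):
        if not self.results:
            return 0.0
        return sum(1 for ok in self.results if not ok) / len(self.results)

    def should_warn(self):
        # Require a minimally filled window so one early failure
        # does not trip the warning on its own.
        return len(self.results) >= 5 and self.failure_rate() >= self.warn_ratio
```

The windowed ratio is what makes this a warning rather than a page: a single red build is normal, a sustained 30% failure rate is a trend worth fixing proactively.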
Threshold Selection
Thresholds should be:
- Based on baselines
- Tuned over time
- Different for prod vs non-prod
Avoid static, arbitrary thresholds.
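One common way to derive a threshold from a baseline is mean plus k standard deviations over a recent sample window; both k and the window are assumptions to tune per environment:

```python
import statistics

def baseline_threshold(samples, k=3.0):
    """Derive an alert threshold from a metric's recent history:
    mean + k standard deviations. Use a larger k (looser threshold)
    in non-prod, a smaller k in prod."""
    mean = statistics.fmean(samples)
    stdev = statistics.pstdev(samples)
    return mean + k * stdev
```

Recomputing this periodically from fresh samples is one way to satisfy "tuned over time": the threshold tracks the baseline instead of staying static and arbitrary.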
Alert Routing
Best practices:
- Route alerts to correct team
- Use escalation policies
- Separate infra vs pipeline alerts
Misrouted alerts delay response and blur ownership.
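Routing by labels keeps the infra/pipeline split and the escalation split in one place. The team names and channels below are purely illustrative:

```python
# Hypothetical routing table; keys are (category, severity) labels.
ROUTES = {
    ("infra", "critical"): "pagerduty:infra-oncall",
    ("infra", "warning"): "slack:#infra-alerts",
    ("pipeline", "critical"): "pagerduty:build-team",
    ("pipeline", "warning"): "slack:#build-warnings",
}

def route(alert):
    """Pick a destination from an alert's category and severity labels.
    Unmatched alerts fall back to a catch-all channel so nothing
    silently disappears."""
    key = (alert.get("category"), alert.get("severity"))
    return ROUTES.get(key, "slack:#alerts-unrouted")
```

The explicit fallback channel matters: an alert with missing or misspelled labels should surface somewhere visible, not vanish.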
Alert Fatigue Prevention
Techniques:
- Alert on symptoms, not causes
- Aggregate related alerts
- Suppress during maintenance
Noise kills alert effectiveness.
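Two of the techniques above, suppression during maintenance and aggregation of related alerts, can be sketched directly; the window dates and grouping keys are illustrative:

```python
from datetime import datetime, timezone

# Hypothetical maintenance windows: (start, end) pairs in UTC.
MAINTENANCE_WINDOWS = [
    (datetime(2024, 1, 6, 2, 0, tzinfo=timezone.utc),
     datetime(2024, 1, 6, 4, 0, tzinfo=timezone.utc)),
]

def suppressed(now, windows=MAINTENANCE_WINDOWS):
    """True if an alert firing at `now` falls inside a maintenance window."""
    return any(start <= now < end for start, end in windows)

def aggregate(alerts):
    """Collapse related alerts into one notification per (category, name),
    keeping a count instead of paging once per instance."""
    grouped = {}
    for a in alerts:
        key = (a["category"], a["name"])
        grouped[key] = grouped.get(key, 0) + 1
    return grouped
```

Aggregating before notifying is what turns "fifty disk alerts" into "one disk alert affecting fifty agents", which is the symptom-level view a responder actually needs.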
Testing Alerts
Always:
- Test alerts periodically
- Verify delivery channels
- Confirm runbooks exist
Untested alerts will fail.
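The runbook check is easy to automate as a lint over your alert definitions. The `runbook_url` field name here is an assumption; adapt it to however your rules are stored:

```python
def missing_runbooks(alert_rules):
    """Return the names of alert rules that lack a runbook link.
    Each rule is assumed to be a dict with at least a `name` field;
    `runbook_url` is the hypothetical field being checked."""
    return [rule["name"] for rule in alert_rules
            if not rule.get("runbook_url")]
```

Running a check like this in CI means a new alert cannot ship without a runbook, instead of the gap being discovered mid-incident.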
Common Alerting Failures
- Alerting on every metric
- No ownership defined
- No runbooks
- Ignoring alert history
Best Practices
- Start with few critical alerts
- Add warnings gradually
- Review alerts quarterly
- Tie alerts to SLOs
Interview Focus Areas
- What makes a good alert
- Difference between alerts and monitoring
- Queue-based alerting importance