Use this as your team runbook for deciding whether to promote the canary or roll it back.

Daily checklist (5 minutes)

  • Is canary traffic close to the target percent?
  • Are error and timeout rates within limits?
  • Is disagreement stable (not spiking)?
  • Any obvious quality regressions from reviewers?
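
The first three checks can be pulled in one query against the analytics.label_flywheel_events table assumed in the starter SQL section below. The lane names and the provider error/timeout conventions are carried over from those queries; treat them as assumptions about your schema:

-- Daily health snapshot per lane over the last 24 hours
select
  model_lane,
  count(*) as events,
  100.0 * count(*) / nullif(sum(count(*)) over (), 0) as traffic_share_pct,
  100.0 * sum(case when provider like 'error:%' or provider like 'http-%' then 1 else 0 end) / nullif(count(*), 0) as error_rate_pct,
  100.0 * sum(case when provider like '%timeout%' or provider like '%abort%' then 1 else 0 end) / nullif(count(*), 0) as timeout_rate_pct,
  100.0 * sum(case when model_overrode_deterministic then 1 else 0 end) / nullif(count(*), 0) as override_rate_pct
from analytics.label_flywheel_events
where ts >= now() - interval '1 day'
group by 1
order by 1;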

Weekly checklist (15 minutes)

  • Compare canary vs control on quality metrics
  • Review top disagreement examples manually (query sketch after this list)
  • Decide: promote one step, hold, or roll back
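
For the manual review step, a sketch that surfaces recent type disagreements on the canary lane. The issue_id column is a hypothetical identifier; substitute whatever key your events actually carry:

-- Latest canary events where model and deterministic type labels differ
select
  ts,
  issue_id,  -- hypothetical identifier column; replace with your actual key
  deterministic_type_label,
  model_type_label,
  deterministic_risk_label,
  model_risk_label
from analytics.label_flywheel_events
where model_lane = 'canary'
  and ts >= now() - interval '7 day'
  and model_type_label is not null
  and model_type_label <> deterministic_type_label
order by ts desc
limit 50;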

Core metrics and targets

Metric               | Target
---------------------|----------------------------
Canary traffic share | near the configured percent
Error rate           | < 1%
Timeout rate         | < 2%
Override rate        | stable vs control
Type disagreement    | stable vs control
Risk disagreement    | stable vs control

Alert policy

Create alerts for:
  • error rate > 1% for 15m
  • timeout rate > 2% for 15m
  • disagreement rate > 2x the 7-day baseline (comparison query after this list)
  • data ingestion completeness < 99.5% daily
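
The baseline alert is the only one that is not a simple threshold. A sketch of the comparison for type disagreement, assuming your alerting tool can fire on a boolean column:

-- Compare today's type disagreement rate to the trailing 7-day baseline
with daily as (
  select
    date_trunc('day', ts) as day,
    100.0 * sum(case when model_type_label is not null and model_type_label <> deterministic_type_label then 1 else 0 end) /
      nullif(sum(case when model_type_label is not null then 1 else 0 end), 0) as type_disagreement_pct
  from analytics.label_flywheel_events
  where ts >= date_trunc('day', now()) - interval '7 day'
  group by 1
)
select
  max(type_disagreement_pct) filter (where day = date_trunc('day', now())) as today_pct,
  avg(type_disagreement_pct) filter (where day < date_trunc('day', now())) as baseline_pct,
  -- coalesce so the alert stays quiet when either side has no data yet
  coalesce(
    max(type_disagreement_pct) filter (where day = date_trunc('day', now())) >
      2 * avg(type_disagreement_pct) filter (where day < date_trunc('day', now())),
    false) as should_alert
from daily;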

Promotion ladder

  1. 5% for 2-7 days
  2. 15% for 2-7 days
  3. 30% for 2-7 days
  4. 50% for 2-7 days
  5. 100% only after stable metrics + human review

Rollback switch (sends all traffic back to the control lane):
GITHUB_LABELER_CANARY_PERCENT=0
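
After flipping the switch, confirm canary traffic actually drained. Query 1 below shows the hourly trend; this spot check against the same assumed table covers the last hour:

-- Canary share over the last hour; should trend toward 0 after rollback
select
  100.0 * sum(case when model_lane = 'canary' then 1 else 0 end) / nullif(count(*), 0) as canary_share_pct
from analytics.label_flywheel_events
where ts >= now() - interval '1 hour';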

Sample Metabase SQL (starter set)

Assume table: analytics.label_flywheel_events

1) Canary traffic share by hour

select
  date_trunc('hour', ts) as hour,
  count(*) as total_events,
  sum(case when model_lane = 'canary' then 1 else 0 end) as canary_events,
  100.0 * sum(case when model_lane = 'canary' then 1 else 0 end) / nullif(count(*), 0) as canary_share_pct
from analytics.label_flywheel_events
where ts >= now() - interval '7 day'
group by 1
order by 1;

2) Error + timeout rate by lane

select
  model_lane,
  count(*) as events,
  100.0 * sum(case when provider like 'error:%' or provider like 'http-%' then 1 else 0 end) / nullif(count(*), 0) as error_rate_pct,
  100.0 * sum(case when provider like '%timeout%' or provider like '%abort%' then 1 else 0 end) / nullif(count(*), 0) as timeout_rate_pct
from analytics.label_flywheel_events
where ts >= now() - interval '7 day'
group by 1
order by 1;

3) Override + disagreement by lane

select
  model_lane,
  count(*) as events,
  100.0 * sum(case when model_overrode_deterministic then 1 else 0 end) / nullif(count(*), 0) as override_rate_pct,
  100.0 * sum(case when model_type_label is not null and model_type_label <> deterministic_type_label then 1 else 0 end) /
    nullif(sum(case when model_type_label is not null then 1 else 0 end), 0) as type_disagreement_pct,
  100.0 * sum(case when model_risk_label is not null and model_risk_label <> deterministic_risk_label then 1 else 0 end) /
    nullif(sum(case when model_risk_label is not null then 1 else 0 end), 0) as risk_disagreement_pct
from analytics.label_flywheel_events
where ts >= now() - interval '7 day'
group by 1
order by 1;
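
4) Daily ingestion completeness (rough proxy)

The alert policy above also watches ingestion completeness, which has no starter query yet. One rough proxy, assuming completeness can be approximated as yesterday's event count against the prior 7-day daily average; if you track expected counts explicitly, compare against those instead:

with daily_counts as (
  select date_trunc('day', ts) as day, count(*) as events
  from analytics.label_flywheel_events
  where ts >= date_trunc('day', now()) - interval '8 day'
    and ts < date_trunc('day', now())  -- exclude today's partial data
  group by 1
)
select
  100.0 * max(events) filter (where day = date_trunc('day', now() - interval '1 day')) /
    nullif(avg(events) filter (where day < date_trunc('day', now() - interval '1 day')), 0) as completeness_proxy_pct
from daily_counts;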

Keep this maintainable

  • Keep one canonical telemetry table/view for dashboard queries (view sketch after this list).
  • Version your schema before adding/removing fields.
  • Keep promotion thresholds in this page only (single source of truth).
  • Save one weekly decision note: promote / hold / rollback + reason.
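
A minimal sketch of the canonical-view idea, assuming raw events land in a table named analytics.raw_label_flywheel_events (a hypothetical name). Dashboards query the view, so raw-table changes only need one fix here:

-- Canonical view: dashboard queries point here, not at the raw table
create or replace view analytics.label_flywheel_events as
select
  ts,
  model_lane,
  provider,
  model_overrode_deterministic,
  deterministic_type_label,
  model_type_label,
  deterministic_risk_label,
  model_risk_label
from analytics.raw_label_flywheel_events;  -- hypothetical raw table name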

Related:
  • Dojo → Label flywheel canary