Operationalizing Worker Health and Failure Recovery

We strengthened worker observability and recovery patterns so that media operations stay stable during load spikes and failures.

Linked project: GLMediaCMS

Problem

Background processing systems can silently degrade without clear health indicators and actionable recovery workflows.

Constraints

- Keep monitoring lightweight and practical.
- Detect stalled/failed jobs early.
- Support operator intervention without complex tooling.
- Avoid noisy alerts that reduce signal quality.

Decisions

- Added clearer worker/job status visibility.
- Introduced stale-job detection and recovery paths.
- Improved retry/backoff handling for transient failures.
- Standardized operational checks for daily review.
- Aligned status model with UI for faster diagnosis.
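
The stale-job detection and retry/backoff decisions could look roughly like the sketch below. It is illustrative only and not taken from the GLMediaCMS code. The `Job` shape, the status names, the heartbeat field, and every threshold (`STALE_AFTER`, `MAX_ATTEMPTS`, `BASE_DELAY`, `MAX_DELAY`) are assumptions made for the example.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# All values below are assumed for illustration, not taken from the project.
STALE_AFTER = timedelta(minutes=5)   # no heartbeat for this long => stalled
MAX_ATTEMPTS = 5                     # after this many retries, hand off to an operator
BASE_DELAY = 2.0                     # seconds
MAX_DELAY = 300.0                    # seconds


@dataclass
class Job:
    id: str
    status: str                      # hypothetical model: queued, running, failed, done, dead
    last_heartbeat: datetime
    attempts: int = 0


def is_stale(job: Job, now: datetime) -> bool:
    """A running job is stale when its last heartbeat is older than STALE_AFTER."""
    return job.status == "running" and now - job.last_heartbeat > STALE_AFTER


def backoff_delay(attempts: int) -> float:
    """Exponential backoff with full jitter, capped at MAX_DELAY seconds."""
    cap = min(MAX_DELAY, BASE_DELAY * (2 ** attempts))
    return random.uniform(0, cap)


def recover(job: Job, now: datetime) -> str:
    """Decide what to do with one job: 'ok', 'retry', or 'escalate'."""
    if job.status != "failed" and not is_stale(job, now):
        return "ok"
    if job.attempts >= MAX_ATTEMPTS:
        job.status = "dead"          # stop retrying; surface to an operator
        return "escalate"
    job.attempts += 1
    job.status = "queued"            # caller re-enqueues after backoff_delay(job.attempts)
    return "retry"
```

A periodic sweep would call `recover` on each open job and re-enqueue retried jobs after `backoff_delay(job.attempts)` seconds. Capping attempts and escalating only at the limit is one way to keep alerts low in volume, in line with the constraint on noisy alerts.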

Outcome

Operations teams gained better control over processing reliability and could resolve issues earlier, reducing impact on publishing timelines.

Metrics

- Reduced time-to-detect for processing incidents.
- Faster time-to-recovery for failed/stalled jobs.
- Higher worker stability during peak load windows.
- Fewer publishing delays caused by hidden failures.
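
For teams that want to track the first two metrics numerically, one possible calculation is sketched below. The incident record shape (`started`, `detected`, `recovered`) is hypothetical, and measuring time-to-recovery from the moment of detection is an assumption rather than the project's stated definition.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(pairs):
    """Average gap in minutes across (start, end) datetime pairs."""
    return mean((end - start).total_seconds() / 60 for start, end in pairs)


def incident_metrics(incidents):
    """Summarize incidents given as dicts with 'started', 'detected', 'recovered' datetimes.

    The record shape is hypothetical; adapt the keys to whatever the job log stores.
    """
    return {
        "mean_time_to_detect_min": mean_minutes(
            (i["started"], i["detected"]) for i in incidents
        ),
        # Assumption: recovery time is counted from detection, not from the start.
        "mean_time_to_recover_min": mean_minutes(
            (i["detected"], i["recovered"]) for i in incidents
        ),
    }
```

Comparing these averages over equal periods before and after a change gives a simple trend line without extra tooling.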