On September 3, 2025 (US‑East), we experienced elevated queue times and, at times, stalled job processing due to Redis connection exhaustion. The issue was stabilized the same day through a hotpatch and configuration changes. This post explains what happened, how we mitigated it, and what we are doing to prevent a recurrence.

Summary

An internal service endpoint leaked Redis connections when heavily exercised. One customer triggered an unusually large burst of requests to this endpoint. Although our traffic shaper isolated the burst to a backup queue, the connection leak propagated pressure across shared Redis servers. Once the servers reached their connection limits, legitimate workers and API paths in the region were starved, resulting in slow queues and intermittent job stalls. We deployed a hotpatch to close leaked connections and adjusted limits, after which processing returned to normal.

Impact

  • Timeframe: September 3, 2025
    • 12:00 UTC: Slow encoding observed in US‑East, with a growing queue backlog
    • 19:50 UTC: Elevated queue times and an abnormally high number of queued Redis connections observed
    • 21:24 UTC: Systems stabilized; root cause investigation continued
    • 23:06 UTC: Culprit identified and hotpatch deployed; recovery confirmed
  • Region: us-east-1
  • User impact:
    • Higher‑than‑normal queue times for Assemblies
    • In some cases, stalled jobs when Redis servers refused additional connections
    • Webhooks/notifications and some API interactions in the region were delayed

Root cause

  • A code path in a high‑traffic internal endpoint failed to reliably close Redis connections under specific error conditions (a simplified sketch follows this list).
  • A single customer workload unintentionally over‑exercised this endpoint, increasing the connection churn and exposing the leak.
  • Redis server connection limits were eventually exhausted, blocking new connections from legitimate workers and API nodes.
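
To make the failure mode concrete, here is a minimal sketch in Python using redis-py. The host, pool size, and handler names are hypothetical stand‑ins for our internal code: the connection is only returned to the pool on the success path, so every error strands one connection until the server hits its limit.

```python
import redis

# Hypothetical pool; host and limits are placeholders, not our production values.
pool = redis.ConnectionPool(host="redis.internal", port=6379, max_connections=500)

def handle_status_request(job_id):
    # Buggy pattern: the connection is checked out of the pool manually ...
    conn = pool.get_connection("GET")
    conn.send_command("GET", f"job:{job_id}")
    result = conn.read_response()  # raises on timeouts and protocol errors
    # ... but only released on the success path. Any exception above strands
    # the connection; under a burst, the server's maxclients is exhausted.
    pool.release(conn)
    return result
```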

Contributing factors:

  • Shared Redis pools increased blast radius once connection limits were hit.
  • The connection leak was not surfaced by existing dashboards: success‑path metrics were healthy, while error‑path connection accounting was not.

Detection

We detected the issue via queue latency alerts and worker heartbeat anomalies. Engineers correlated spikes in Redis connected_clients, rising connection churn, and ECONNREFUSED / "max number of clients reached" errors on the affected servers.
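
As a rough illustration of the signals we correlated, the sketch below polls Redis client statistics with redis-py; the host and threshold are placeholders, not our production alerting rules.

```python
import time
import redis

r = redis.Redis(host="redis.internal", port=6379, socket_timeout=2)
CONNECTED_CLIENTS_ALERT = 9000  # placeholder threshold, not our real limit

def sample_connection_pressure():
    clients = r.info("clients")  # includes connected_clients, blocked_clients
    stats = r.info("stats")      # includes total_connections_received (churn)
    connected = clients["connected_clients"]
    churn = stats["total_connections_received"]
    print(f"connected_clients={connected} total_connections_received={churn}")
    if connected > CONNECTED_CLIENTS_ALERT:
        print("ALERT: connection count approaching the server's maxclients")

while True:
    sample_connection_pressure()
    time.sleep(30)
```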

Mitigation and recovery

  • Traffic from the offending workload was already being trickled into an isolated backup queue, limiting but not eliminating impact.
  • We temporarily raised Redis connection limits to create headroom during diagnosis.
  • We shipped a hotpatch to ensure connections are always released on both success and error paths for the endpoint in question (see the sketch after this list).
  • We recycled affected processes to drain leaked connections and verified normal operation.
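
Conceptually, the hotpatch guarantees the release on every path. Continuing the hypothetical sketch from the root cause section, the fix amounts to a finally block (or an equivalent context manager) around the checkout:

```python
import redis

pool = redis.ConnectionPool(host="redis.internal", port=6379, max_connections=500)

def handle_status_request_fixed(job_id):
    conn = pool.get_connection("GET")
    try:
        conn.send_command("GET", f"job:{job_id}")
        return conn.read_response()
    finally:
        # Always return the connection to the pool, on success and on error.
        pool.release(conn)
```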

At 21:24 UTC, systems stabilized. At 23:06 UTC, the leak was confirmed and the hotpatch deployed fleet‑wide. Systems have remained healthy since.

Customer communication

We posted updates on our status page throughout the incident and reached out to affected customers. After mitigation, we advised attempting Assembly replay to recover otherwise lost results where input files were still available.

What we are doing next

  • Add connection‑leak SLOs and dashboards that track connection lifecycle on success and error paths.
  • Introduce per‑endpoint connection budgets and circuit breakers to contain leaks (sketched after this list).
  • Strengthen Redis client pooling with strict timeouts and lintable finally blocks around acquisition.
  • Split critical Redis roles (queues, locks, metadata) across distinct pools with separate limits.
  • Tighten rate limiting on the affected endpoint, with backpressure that degrades gracefully.
  • Add chaos tests to simulate partial Redis outages and connection floods.
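
To illustrate two of these items, per‑endpoint connection budgets and split pools with strict timeouts, here is a sketch in Python with redis-py. Pool names, hosts, and limits are illustrative, not our production configuration.

```python
import threading
import redis

# Separate pools per Redis role, each with its own hard cap and strict
# timeouts, so one misbehaving consumer cannot starve the others.
queue_pool = redis.ConnectionPool(
    host="redis-queues.internal", port=6379,
    max_connections=200, socket_timeout=1.0, socket_connect_timeout=0.5)
lock_pool = redis.ConnectionPool(
    host="redis-locks.internal", port=6379,
    max_connections=50, socket_timeout=1.0, socket_connect_timeout=0.5)

# A crude per-endpoint connection budget: shed load early instead of
# piling more connections onto Redis once the endpoint is at its budget.
status_endpoint_budget = threading.BoundedSemaphore(value=32)

def status_endpoint(job_id):
    if not status_endpoint_budget.acquire(blocking=False):
        raise RuntimeError("endpoint over its connection budget; shedding load")
    try:
        client = redis.Redis(connection_pool=queue_pool)
        return client.get(f"job:{job_id}")
    finally:
        status_endpoint_budget.release()
```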

If you were affected

  • Re‑run critical jobs where feasible. If your input files are still available, you can replay a failed Assembly via our API (see Replay an Assembly, and the sketch below). If you need help identifying impacted jobs, please contact support and we will assist.
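
For reference, a minimal sketch of an Assembly replay over HTTP with Python's requests. The exact route, authentication parameters, and request signing are defined in the Replay an Assembly documentation; the URL and fields below are assumptions for illustration only.

```python
import json
import requests

# Hypothetical values; consult the Replay an Assembly docs for the exact
# endpoint, required auth parameters, and request signing.
ASSEMBLY_ID = "YOUR_ASSEMBLY_ID"
AUTH_KEY = "YOUR_TRANSLOADIT_AUTH_KEY"

resp = requests.post(
    f"https://api2.transloadit.com/assemblies/{ASSEMBLY_ID}/replay",
    data={"params": json.dumps({"auth": {"key": AUTH_KEY}})},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```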

Update September 4, 2025

One day later, we experienced another incident. While we had successfully plugged one leak source, another remained. This time we detected it earlier and mitigated it within an hour.

We completed additional hardening to reduce the chance of a repeat. We added targeted rate limits to the endpoint involved, improved connection hygiene across the code paths that handle errors, and split some Redis responsibilities into separate pools with clearer limits. Together, these changes shrink blast radius and make the system more resilient under bursty workloads.

We also expanded monitoring and alerting to focus on early signals—such as abnormal connection churn and queueing pressure—so we can react faster if similar patterns emerge. These improvements are now live in US‑East and will be rolled out to other regions as a precaution.

Closing

We are sincerely sorry for the disruption. Reliability is our top priority, and we did not meet our own standards in US‑East on September 3. The fixes listed above are already in motion, and we will report back if there are further material changes.