Root cause
Our cloud networking layer had insufficient capacity to handle the volume of outbound connections from our services. As traffic grew, we periodically exceeded the available connection limits, causing some requests to be dropped.
What we fixed
We restructured our internal network architecture to significantly reduce the load on our outbound connection layer, which freed up connection capacity.
Ensuring it won’t happen again
The re-architecture is permanent and prevents this specific issue from recurring. We’ve also added monitoring and alerts to catch similar network errors.
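
For illustration only, here is a minimal sketch of the kind of check such an alert could perform. It assumes the outbound layer exposes an active connection count, a connection limit, and a count of dropped requests. The names `ConnectionSample` and `alert_level` and the 80% / 95% thresholds are hypothetical and do not describe the actual monitoring configuration.

```python
from dataclasses import dataclass


@dataclass
class ConnectionSample:
    """One reading from the outbound connection layer (hypothetical shape)."""
    active: int   # outbound connections currently open
    limit: int    # maximum connections the layer allows
    dropped: int  # requests dropped since the previous reading


def alert_level(sample: ConnectionSample,
                warn_ratio: float = 0.80,
                critical_ratio: float = 0.95) -> str:
    """Classify a reading as 'ok', 'warning', or 'critical'.

    The thresholds are assumed values chosen for the example.
    """
    if sample.limit <= 0:
        raise ValueError("limit must be positive")

    utilization = sample.active / sample.limit

    # Any dropped request, or near-exhaustion of the limit, is critical.
    if sample.dropped > 0 or utilization >= critical_ratio:
        return "critical"
    # Warn early so capacity can be added before requests are dropped.
    if utilization >= warn_ratio:
        return "warning"
    return "ok"
```

Warning well below the limit is the point of the design: the failure described above only became visible once requests were already being dropped, so an alert on utilization gives time to react before that happens.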