Downtime

Disruption to outbound messages

Jul 31 at 08:05am UTC
Affected services
Inbox

Resolved
Aug 01 at 12:39pm UTC

Update: Further investigation revealed that a customer, using our single send API, sent a large number of messages to the same phone number. The service responsible for placing these messages onto the outbound queue did not apply a suitable grouping key to them, circumventing an architectural protection mechanism.
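
For illustration only, here is a minimal sketch of the kind of grouping key that should have been applied, assuming a partitioned outbound queue; the partition count and field names below are hypothetical. Keying each message by its destination number keeps all messages to one recipient on a single partition, and therefore a single consumer, which is the protection that was bypassed here.

```python
import hashlib

NUM_PARTITIONS = 32  # hypothetical partition count for the outbound queue

def partition_for(message: dict) -> int:
    """Derive a stable partition from the destination number so that all
    messages to the same recipient are handled by the same consumer."""
    key = message["to"].encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Without a grouping key, messages to one number fan out across every
# meta-adapter instance, which is what happened in this incident.
print(partition_for({"to": "+15555550100", "body": "hello"}))
```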

The messages were therefore made available for processing across all meta-adapter service instances. This in turn triggered our rate-limiting protection, which prevents messages to the same user from being sent too quickly so that message order is preserved, as required by our upstream provider.

This rate limiter relies on a cache that is not event-loop safe, so all instances of our meta-adapter effectively came to a standstill. Restarting the pods flushed this state; by that point, the affected messages had been retried enough times to be moved to our dead-letter queue (DLQ), and normal service could resume.
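
As a rough illustration of this failure mode (a simplified sketch, not our actual meta-adapter code), a synchronous cache call made inside an asyncio service blocks the entire event loop, so every in-flight message on that instance stalls behind a single slow lookup:

```python
import asyncio
import time

def blocking_cache_get(key: str) -> str | None:
    """Stand-in for a cache client that is not event-loop safe: it blocks
    the calling thread (here, the event loop) while waiting on I/O or a lock."""
    time.sleep(2)  # simulated slow round trip / lock contention
    return None

async def handle_message(msg_id: int) -> None:
    # The synchronous call below freezes the event loop for its full
    # duration, so no other message handler can make progress.
    blocking_cache_get(f"rate:{msg_id}")
    print(f"message {msg_id} processed")

async def main() -> None:
    start = time.monotonic()
    await asyncio.gather(*(handle_message(i) for i in range(5)))
    # Runs serially (~10s) instead of concurrently (~2s).
    print(f"took {time.monotonic() - start:.1f}s")

asyncio.run(main())
```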

Mitigations to be added:
- Improve message grouping key. (Done 31/7/2025)
- Clean up the double retry mechanism in meta-adapter to reduce total delays/retries. (currently in code review)
- Use a new event-loop-safe, asynchronous Redis cache. (currently in code review; see the sketch below)
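
As a rough sketch of the third mitigation, assuming the redis-py asyncio client (the key names and limits below are illustrative, not our actual configuration), the per-recipient rate check becomes an awaitable call that yields the event loop rather than blocking it:

```python
import redis.asyncio as redis

# Hypothetical per-recipient rate limit: at most N sends per window.
MAX_SENDS_PER_WINDOW = 5
WINDOW_SECONDS = 10

async def allow_send(cache: redis.Redis, recipient: str) -> bool:
    """Event-loop-safe rate check: each call awaits Redis rather than
    blocking the loop, so other message handlers keep running."""
    key = f"rate:{recipient}"
    count = await cache.incr(key)
    if count == 1:
        await cache.expire(key, WINDOW_SECONDS)
    return count <= MAX_SENDS_PER_WINDOW

# Usage (inside an async context):
#   cache = redis.Redis(host="localhost", port=6379)
#   if await allow_send(cache, "+15555550100"):
#       ...enqueue for the upstream provider...
```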

Updated
Jul 31 at 08:21am UTC

Initial investigations suggest a circuit breaker may have been tripped in our message queue management process, which is usually triggered by a surge in traffic. A restart of the services has reset it and messages are flowing normally. We are investigating and will post further updates.
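
For context, a circuit breaker in this sense is a guard that stops a process from attempting further sends once consecutive failures cross a threshold, until a cool-down period passes. A simplified sketch, not our production implementation:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after too many consecutive failures
    and rejects work until a cool-down period has elapsed."""

    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after:
            return True  # half-open: let a trial request through
        return False

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```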

Created
Jul 31 at 08:05am UTC

We have received reports of outbound messages not being sent. The incident lasted for about 16 minutes. We are investigating and will post further updates.