At 10:13 on Thursday 26/07/2018 our monitoring reported a sharp spike in the number of calls being dialled on our platform (the number of call attempts per second), with the rate doubling within 20 seconds.
This significantly increased the load on the servers performing vital call setup tasks, leaving them too busy to accept new outbound call connections. By 10:20, widespread call failures were occurring.
Investigation began immediately and identified a set of large customers responsible for the increase in traffic. They were contacted straight away and the traffic stopped; however, by this point the load on the servers was already at a critical level.
Calls that had failed were automatically re-attempting to connect to the servers, which kept the number of call attempts per second relatively high. Between 10:25 and 10:40 the load gradually decreased, allowing new calls to connect intermittently. By 11:00 the load had decreased sufficiently for the majority of calls to succeed.
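This pattern, where failed calls immediately retry and keep the offered load high even after the original traffic stops, is sometimes called a retry storm. As an illustrative sketch only (not our platform's actual client logic; the dial function here is a hypothetical placeholder), retrying with exponential backoff and random jitter spreads re-attempts out over time rather than hammering a recovering server:

    import random
    import time

    def dial_with_backoff(dial, max_attempts=5, base_delay=1.0, cap=30.0):
        """Attempt a call, backing off exponentially (with jitter) on failure.

        `dial` is a placeholder for whatever function initiates the call;
        it should return True on success and False on failure.
        """
        for attempt in range(max_attempts):
            if dial():
                return True
            # Full jitter: sleep a random time up to the exponential cap,
            # so many failed callers do not all retry at the same instant.
            delay = random.uniform(0, min(cap, base_delay * 2 ** attempt))
            time.sleep(delay)
        return False

With backoff like this, the re-attempt rate decays after an outage begins instead of holding the servers at a critical load.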
The sheer volume of traffic seen in this incident flooded the system faster than our protection measures could respond. We also identified further bottlenecks in database capacity that widened the impact of the incident.
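One common form of protection measure for this kind of overload is a rate limiter in front of the call setup servers. Purely as a hedged sketch (our own protections differ in detail, and the rate and capacity figures below are illustrative only), a token bucket caps sustained call attempts per second while still absorbing short bursts:

    import time

    class TokenBucket:
        """Simple token-bucket limiter: `rate` tokens/second, bursts up to `capacity`."""

        def __init__(self, rate: float, capacity: float):
            self.rate = rate
            self.capacity = capacity
            self.tokens = capacity
            self.last = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            # Refill tokens for the time elapsed since the last check.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            # Reject (or queue) the call attempt rather than overloading setup servers.
            return False

    # Example: allow a sustained 100 attempts/second with bursts up to 200.
    limiter = TokenBucket(rate=100, capacity=200)

Shedding excess attempts early like this keeps the setup servers within capacity, so calls that are admitted still succeed during a traffic spike.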
We are working internally to deploy our VoIP platform across multiple data centres, and a new VoIP environment is now in customer testing.