
Central Queue Server knocked offline
Started 3 May at 12:09pm IST, last updated 3 May at 03:56pm IST.
Updated
Post-mortem
At 1310 hrs, a manual change to the network configuration took the central RabbitMQ server offline for 4-5 minutes. The situation was identified immediately, and corrective measures were taken to divert traffic to the standby queue. In the intervening period, because the API servers couldn't establish a connection to the queue, their CPUs got blocked on network I/O.
Once the standby queue took over, the services recovered automatically within a few seconds.
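The blocking behaviour described above is typical of clients that wait indefinitely on an unreachable broker. As a minimal sketch only, and not the code running on the API servers, the following shows one way to bound that wait with a connect timeout and fall through to a standby endpoint. The function name, the timeout value, and the hostnames in the usage comment are assumptions made for illustration; it uses plain TCP sockets from the Python standard library rather than any particular RabbitMQ client.

```python
import socket


def connect_with_failover(endpoints, timeout_s=2.0):
    """Try each (host, port) in order and return (socket, endpoint)
    for the first one that accepts a TCP connection.

    The timeout bounds how long a caller can block on an endpoint
    that is offline or unreachable.
    """
    last_error = None
    for host, port in endpoints:
        try:
            sock = socket.create_connection((host, port), timeout=timeout_s)
            return sock, (host, port)
        except OSError as exc:  # covers refused connections and timeouts
            last_error = exc
    raise ConnectionError(f"no queue endpoint reachable: {last_error}")


# Hypothetical usage (hostnames are placeholders, not the real servers):
# sock, active = connect_with_failover(
#     [("queue-primary.internal", 5672), ("queue-standby.internal", 5672)]
# )
```

With a short timeout, a caller spends at most roughly `timeout_s` on each unreachable endpoint before moving on to the next one, instead of staying blocked until the primary returns.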
Corrective actions
- The primary queue server was found to have no termination protection, which is an incorrect setup for any critical piece of infrastructure.
- Many similarly named servers were also found in the stopped state. The cleanup of these servers ended up triggering the issue. These servers have now been removed.
Resolved
Aggregator API Services recovered.

Re-appeared
Aggregator API Services went down.
Resolved
Meraki API Services recovered.
Updated
Aggregator API Services recovered.
Updated
Merchant API Services recovered.

Updated
Merchant API Services went down.
Updated
Aggregator API Services went down.

Created
Meraki API Services went down.