Downtime

Central Queue Server knocked offline

Started 3 May at 12:09pm IST, last updated 3 May at 03:56pm IST.

Aggregator API Services, Merchant API Services
Updated

Post-mortem

At 1310 hrs, a manual change to the network configuration took the central RabbitMQ server offline for 4 to 5 minutes. The problem was identified immediately, and traffic was diverted to the standby queue. In the intervening period, the API servers could not establish a connection to the queue, so their CPUs were blocked on network I/O.

As soon as the standby queue kicked in, the services recovered automatically within a few seconds.
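
The blocking described above is the kind of failure a bounded connect timeout with fallback to a standby is meant to limit. The following is a minimal sketch of that idea, not the setup used on these servers: the function name `connect_with_failover`, the endpoint list, and the timeout value are all hypothetical, and it uses plain TCP sockets rather than a RabbitMQ client library.

```python
import socket


def connect_with_failover(endpoints, timeout_s=2.0):
    """Return a connected socket to the first reachable endpoint.

    endpoints: ordered list of (host, port) pairs, primary first (assumed layout).
    Each attempt gives up after timeout_s seconds instead of waiting indefinitely.
    """
    last_error = None
    for host, port in endpoints:
        try:
            # Raises OSError (including timeouts) if the endpoint is unreachable.
            return socket.create_connection((host, port), timeout=timeout_s)
        except OSError as exc:
            last_error = exc  # remember the failure and try the next endpoint
    raise ConnectionError(f"no queue endpoint reachable: {last_error}")
```

With a short timeout, a worker that cannot reach the primary moves on to the standby after a bounded wait rather than staying blocked on the failed connection.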

Corrective actions
- The primary queue server was found to have no termination protection, which is an incorrect setup for any critical piece of infrastructure.
- Many similarly named servers were also found in the stopped state. The cleanup of these servers is what triggered the issue. These servers have now been removed.
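
A cleanup of similarly named servers can be guarded by matching exact names and refusing anything that is protected or not stopped. The sketch below only illustrates that check under assumed data: the `Server` record, its fields, and `safe_to_remove` are hypothetical and do not correspond to any cloud provider's API.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    state: str                    # assumed values: "running" or "stopped"
    termination_protected: bool   # assumed flag mirroring the provider setting


def safe_to_remove(server, approved_names):
    """Allow removal only for explicitly approved, stopped, unprotected servers."""
    return (
        server.name in approved_names      # exact match, never a name prefix
        and server.state == "stopped"
        and not server.termination_protected
    )
```

Under this check, a running primary with a name close to the stale servers is skipped unless its exact name was explicitly approved for removal.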

Posted 3 May at 03:56pm IST.
Resolved

Aggregator API Services recovered.

Posted 3 May at 01:21pm IST.
Re-appeared

Aggregator API Services went down.

Posted 3 May at 01:13pm IST.
Resolved

Meraki API Services recovered.

Posted 3 May at 12:16pm IST.
Updated

Aggregator API Services recovered.

Posted 3 May at 12:15pm IST.
Updated

Merchant API Services recovered.

Posted 3 May at 12:14pm IST.
Updated

Merchant API Services went down.

Posted 3 May at 12:10pm IST.
Updated

Aggregator API Services went down.

Posted 3 May at 12:09pm IST.
Created

Meraki API Services went down.

Posted 3 May at 12:09pm IST.