Today, one of our cloud providers experienced major connectivity issues affecting most of our servers across several datacenters. This led to our API services being affected and access being degraded between 7:15AM UTC and 9:37AM UTC.
While we built our API to withstand outages and datacenter connectivity issues, today's incident involved the simultaneous loss of two groups of datacenters operated by one of our providers. The first group went down at 6:17AM UTC due to electrical outages. At this stage, our services continued to operate, using the second group of datacenters as a backup. Then, at 7:15AM UTC, the second group of datacenters became unreachable because of connectivity issues on the provider's optical network. At that point, too many of our processing servers were unreachable for the API to continue working normally.
While we worked to restore access as quickly as possible, it took until 9:37AM UTC for the API to return to normal. We will now take a step back and rethink some of the fallback choices we made, to make sure we can withstand the simultaneous failure of two separate groups of datacenters and that this does not happen again.
This is, by far, the largest incident we have had, and we would like to sincerely apologize for the impact it may have had on your operations. Customers covered by an SLA will receive refunds as stated in our SLA terms.