On May 31, 2018, between 21:53 UTC and 23:22 UTC, we experienced a major outage of the metrics APIs for both the AppOptics and Librato products. The API returned 5xx responses for a majority of incoming metric data, which also impacted APM metrics and traces submitted from AppOptics APM. Alerts were impacted as well, and many notifications were delayed. We realize customers rely on us to monitor their infrastructure and expect a better experience from their monitoring provider. This is a description of what happened in our infrastructure, our understanding of the causes of the incident, and how we plan to prevent it from occurring again.
This outage correlated with a power event in one of AWS’s availability zones in the US-East-1 region. The AppOptics and Librato infrastructure runs across multiple availability zones (AZs), including the impacted one, and we briefly lost access to nodes in several critical pieces of infrastructure. We run services across multiple AZs to safeguard against this type of incident, but discovered that, due to a Kafka/Zookeeper failure mode, we didn’t have enough isolation to avoid an outage during this event.
Our metrics pipeline, which includes alerting, is driven by a stream processing model built on top of Apache Kafka, and all incoming data is written to one or more topics in Kafka. During this incident we lost one node in the Zookeeper (ZK) cluster that coordinates the brokers in our Kafka clusters. That node was the leader at the time, which forced the ZK ensemble into a leader election. We manually restarted one node in the cluster to invoke a leader election, and the ensemble regained quorum. By 22:01 UTC we had verified the ZK ensemble was active again with a newly elected leader.
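As background for the quorum loss described above: a Zookeeper ensemble can serve requests only while a strict majority of its voting members can reach each other, and it pauses while a leader election is in progress. The sketch below shows only the majority arithmetic. The function names and ensemble sizes are illustrative assumptions, not a description of our actual configuration.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of voting members that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def has_quorum(ensemble_size: int, reachable: int) -> bool:
    """True when enough voting members are reachable to form a majority."""
    return reachable >= quorum_size(ensemble_size)


if __name__ == "__main__":
    # Hypothetical ensemble sizes, chosen only to illustrate the arithmetic.
    for size in (3, 5):
        print(size, "nodes need", quorum_size(size), "reachable members")
```

By this arithmetic alone, an ensemble of three or more still has a majority after losing a single node, which is why the loss of the leader and the resulting election mattered more here than the raw node count.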
At the same time, several Kafka brokers in the affected availability zone were impacted and were restarted. This should normally be fine, as we run Kafka with replicas and the replicas should be able to take over as the partition leaders. However, in this scenario many partitions did not fail over to their replicas automatically and went offline immediately. This meant that writes to and reads from the Kafka topics in our metrics pipeline were unavailable, and we dropped requests, returning 500 error codes. It’s not yet clear what impact losing quorum in the ZK cluster at the same time had on the recoverability of the Kafka cluster.
By 22:07 UTC we had identified that many of our Kafka partitions were offline and did not have a leader. They did not appear to be recovering on their own, so we decided to start a rolling restart of each Kafka broker. As we restarted each broker and observed its behavior, we saw slight improvements in our availability as more partitions came online. It wasn’t until 23:45 UTC that we had fully restarted each broker and verified that all partitions were back online. At this point metric posts were successful, APM data was being successfully processed, and alert notifications could be processed again.
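Identifying offline partitions amounts to scanning partition metadata for entries that have no leader. The following is a minimal sketch, assuming the metadata has already been fetched into plain Python records. The `PartitionInfo` type, its field names, and the topic name are hypothetical, and treating a leader ID of -1 as "no leader" is an assumption based on how Kafka metadata commonly reports leaderless partitions.

```python
from dataclasses import dataclass
from typing import List, Tuple

NO_LEADER = -1  # assumed sentinel for a partition without a leader


@dataclass(frozen=True)
class PartitionInfo:
    """Hypothetical record holding one partition's metadata."""
    topic: str
    partition: int
    leader: int            # broker ID of the leader, or NO_LEADER
    isr: Tuple[int, ...]   # broker IDs of the in-sync replicas


def offline_partitions(partitions: List[PartitionInfo]) -> List[PartitionInfo]:
    """Return the partitions that currently have no leader."""
    return [p for p in partitions if p.leader == NO_LEADER]


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        PartitionInfo("metrics", 0, 1, (1, 2)),
        PartitionInfo("metrics", 1, NO_LEADER, ()),
    ]
    for p in offline_partitions(sample):
        print(f"{p.topic}-{p.partition} has no leader")
```

Running a check like this before and after each broker restart gives a simple measure of whether a rolling restart is making progress.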
We run more than one cluster of Kafka brokers with the intent of maintaining a high-availability configuration if a given cluster goes offline. Producers are able to fail over to another cluster if they find that a given partition is offline when they go to write a record. Unfortunately, this did not isolate us from failure in this scenario, as several partitions in different clusters had ended up in the same availability zones. The Kafka clusters also shared a Zookeeper ensemble, so coordination was broken on both clusters when the ZK quorum was briefly lost.
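The producer failover described above can be sketched roughly as follows. This is not our producer code: the `PartitionOffline` exception, the `send_with_failover` function, and the idea of representing each cluster as a plain send callable are all hypothetical names introduced for illustration.

```python
from typing import Callable, Sequence


class PartitionOffline(Exception):
    """Hypothetical error raised when the target partition has no leader."""


def send_with_failover(
    record: bytes,
    clusters: Sequence[Callable[[bytes], None]],
) -> int:
    """Try each cluster in order and return the index of the one that accepted.

    Raises PartitionOffline if every cluster rejects the record.
    """
    for index, send in enumerate(clusters):
        try:
            send(record)
            return index
        except PartitionOffline:
            continue  # fall through to the next cluster
    raise PartitionOffline("no cluster could accept the record")


if __name__ == "__main__":
    def offline_cluster(record: bytes) -> None:
        raise PartitionOffline("partition has no leader")

    def healthy_cluster(record: bytes) -> None:
        pass  # stands in for a successful write

    print(send_with_failover(b"metric", [offline_cluster, healthy_cluster]))
```

The sketch also shows the limit of this approach: when the target partitions are offline in every cluster at once, the loop runs out of candidates and the write fails.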
This is a type of failure we expect to be isolated from, but in this scenario we were not. One of the clearest action items from this incident is that we need to better isolate our Kafka clusters across availability zones, and even regions, and that ZK ensembles should not be shared across Kafka clusters. We believe that Kafka should have regained partition leadership without manual intervention once the ZK cluster had established quorum again. We plan to gameday this scenario more and determine whether this is fixed in newer versions of Kafka.
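The two isolation rules in this action item (no shared ZK ensemble, no shared availability zones between clusters) lend themselves to an automated check. The following is a minimal sketch under stated assumptions: the `KafkaCluster` record, its fields, and the cluster, ensemble, and zone names are hypothetical, and a real check would need to read this inventory from wherever it actually lives.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class KafkaCluster:
    """Hypothetical inventory record for one Kafka cluster."""
    name: str
    zk_ensemble: str
    broker_zones: FrozenSet[str]


def isolation_violations(clusters: List[KafkaCluster]) -> List[Tuple[str, str, str]]:
    """Return (cluster_a, cluster_b, reason) for each pair that shares a dependency."""
    problems = []
    for a, b in combinations(clusters, 2):
        if a.zk_ensemble == b.zk_ensemble:
            problems.append((a.name, b.name, "shared ZK ensemble"))
        shared = a.broker_zones & b.broker_zones
        if shared:
            zones = ", ".join(sorted(shared))
            problems.append((a.name, b.name, "shared zones: " + zones))
    return problems


if __name__ == "__main__":
    # Made-up inventory for illustration only.
    inventory = [
        KafkaCluster("kafka-a", "zk-1", frozenset({"az-1", "az-2"})),
        KafkaCluster("kafka-b", "zk-1", frozenset({"az-2", "az-3"})),
    ]
    for problem in isolation_violations(inventory):
        print(problem)
```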
There are also additional operational monitoring opportunities that would have helped us in firefighting this incident and led to a faster recovery. We’ve put in place more checks that will notify us earlier of subsystems that need additional assistance in recovering from an outage like this one.
We know our customers rely on us to provide always-available monitoring, especially during large cloud operational incidents such as this one. We expect better of ourselves and are working to ensure we deliver a better product.