Investigating increased latency and failures
Incident Report for Librato
Postmortem

At Librato we strive to deliver a service that our customers can depend on to monitor their infrastructure. If our service is not available or functioning properly, we haven’t met the standard expected by our customers or ourselves. We take our customers’ success seriously and regret that we did not meet that standard on Friday, December 16th from approximately 22:00 until December 17th, 06:15 UTC. We are continually and incrementally working on improvements to the Librato product and infrastructure, but last Friday we fell short.

At approximately 22:00 UTC our alerting systems triggered to notify us of a number of problems in the Librato infrastructure: issues with the alerting system, with our RDS database replicas, and possibly with Cassandra. We were able to quickly rule out Cassandra, and after some investigation we determined that the alerting notifications could also ultimately be traced back to the RDS database system.

Things that went well:

  • We were able to isolate the different subsystems so that metrics storage, alerting, and others were partitioned from the problem. At no point were we at any risk of losing customer data and we were able to continue to send out alerts on the data received.
  • We’ve been improving visibility into our infrastructure through the use of our “dogfood” environment, and it performed very well: we maintained complete visibility throughout the incident.
  • We have database tools that continually monitor our RDS instances and we were able to use those to characterize a number of aspects of the outage.

Things that did not go well:

  • The API used for querying metrics from Librato is scaled using a per-process request model, which requires a larger number of database connections. As API processes were respawned, connection turnover increased until the MySQL process list on the server was exhausted (see the sketch after this list).
  • We have been planning a move to AWS Aurora for a number of months because, like the API, the RDS MySQL instances have been with us for a long time. After being upgraded a number of times, they have some lingering issues (such as with live in-place index creation) that can only be fixed by a brand-new database.
  • We’ve made great strides in improving the distribution of reads across read replicas, but, as the old adage about too much of a good thing goes, we may have spread our read load a bit too evenly. When a problem with the API affected the read replicas in a major way, it affected multiple subsystems as a result. Only by redistributing the read load were we able to achieve the isolation we wanted.
  • Despite our efforts, our RDS MySQL monitoring was still inadequate: a number of things (e.g. the process list) could only be discovered by manually inspecting the MySQL instances themselves.
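To make the last two points concrete, the kind of inspection we had to perform by hand looks roughly like the following. This is a minimal sketch, assuming the pymysql client and placeholder endpoints and credentials rather than our actual tooling: it reports connection head-room on a replica and breaks the process list down by client host, which is how connection churn from respawning API processes shows up.

    import pymysql

    # Placeholder endpoint and credentials -- not our real replica or accounts.
    conn = pymysql.connect(host="rds-replica.example.internal",
                           user="ops_ro", password="...",
                           cursorclass=pymysql.cursors.DictCursor)

    with conn.cursor() as cur:
        # How close is the server to its connection limit?
        cur.execute("SHOW GLOBAL VARIABLES LIKE 'max_connections'")
        max_conn = int(cur.fetchone()["Value"])
        cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
        connected = int(cur.fetchone()["Value"])
        print(f"{connected}/{max_conn} connections in use")

        # Break the process list down by client host and command so churn from
        # a single tier (e.g. respawning API processes) stands out.
        cur.execute("""
            SELECT SUBSTRING_INDEX(host, ':', 1) AS client,
                   command,
                   COUNT(*)  AS connections,
                   MAX(time) AS max_seconds_in_state
              FROM information_schema.PROCESSLIST
             GROUP BY client, command
             ORDER BY connections DESC
        """)
        for row in cur.fetchall():
            print(row)

    conn.close()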

Once we fully understood the relationship between the API and the MySQL problems, we were able to flush the process lists so that the database was no longer overwhelmed by stale connections.
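For illustration, “flushing the process lists” amounts to finding connections that have been idle far too long and terminating them. The sketch below again assumes the pymysql client, a placeholder endpoint, and an arbitrary 300-second idle cutoff rather than the value we actually used; on RDS, terminating another session is typically done through the mysql.rds_kill procedure rather than a plain KILL statement.

    import pymysql

    IDLE_CUTOFF_SECONDS = 300  # illustrative threshold, not the value we used

    # Placeholder endpoint and credentials -- not our real replica or accounts.
    conn = pymysql.connect(host="rds-replica.example.internal",
                           user="ops_admin", password="...",
                           cursorclass=pymysql.cursors.DictCursor)

    with conn.cursor() as cur:
        # Find connections that have been sitting idle ("Sleep") past the cutoff.
        cur.execute("""
            SELECT ID AS thread_id,
                   USER AS user,
                   SUBSTRING_INDEX(HOST, ':', 1) AS client,
                   TIME AS idle_seconds
              FROM information_schema.PROCESSLIST
             WHERE COMMAND = 'Sleep' AND TIME > %s
        """, (IDLE_CUTOFF_SECONDS,))

        for row in cur.fetchall():
            print(f"killing {row['thread_id']} "
                  f"({row['user']}@{row['client']}, idle {row['idle_seconds']}s)")
            # mysql.rds_kill is RDS's supported way to terminate another session.
            cur.execute("CALL mysql.rds_kill(%s)", (row["thread_id"],))

    conn.close()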

We sincerely apologize for this incident and are taking steps to make sure this doesn't happen again. We want our customers to continue to rely on the same highly available service they've grown accustomed to with Librato. Specifically, we've taken the following measures:

  • We will be moving to AWS Aurora, which will allow us to apply indices that help reduce the load from some of our more complex queries.
  • We will isolate related subsystems in a more measured way, balancing high availability and performance while containing damage when it does occur.
  • We will be investigating an improved API runtime server that will allow us to pool DB connections better across multiple concurrent requests.
  • We will be adding additional monitoring of MySQL internals that we will feed into Librato so that abnormal situations can be detected and avoided more readily.
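As a sketch of that last item, the shape of the monitoring is to sample a handful of MySQL status counters and submit them to Librato as gauges. The example below assumes the pymysql and requests libraries, placeholder credentials and source names, and the v1 metrics gauge endpoint; the counters chosen are illustrative, not the final list.

    import pymysql
    import requests

    LIBRATO_USER  = "ops@example.com"   # placeholder account email
    LIBRATO_TOKEN = "..."               # placeholder API token
    COUNTERS = ("Threads_connected", "Threads_running", "Aborted_connects")

    # Placeholder endpoint and credentials -- not our real replica or accounts.
    conn = pymysql.connect(host="rds-replica.example.internal",
                           user="ops_ro", password="...",
                           cursorclass=pymysql.cursors.DictCursor)

    gauges = []
    with conn.cursor() as cur:
        for counter in COUNTERS:
            cur.execute("SHOW GLOBAL STATUS LIKE %s", (counter,))
            row = cur.fetchone()
            gauges.append({
                "name": "mysql." + row["Variable_name"].lower(),
                "value": float(row["Value"]),
                "source": "rds-replica-1",  # placeholder source name
            })
    conn.close()

    # Submit the samples as gauges so alerts can fire on connection pressure.
    resp = requests.post("https://metrics-api.librato.com/v1/metrics",
                         auth=(LIBRATO_USER, LIBRATO_TOKEN),
                         json={"gauges": gauges},
                         timeout=10)
    resp.raise_for_status()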

We take the success of your business very seriously at Librato. We've returned to our normal levels of reliability and will continue to work every day to earn the level of trust our customers place in us.

Posted Dec 21, 2016 - 18:22 UTC

Resolved
All systems continue to be fully operational. After some additional investigation and checks we believe the immediate incident to be resolved. We will publish a public post-mortem once we've completed a comprehensive review.
Posted Dec 17, 2016 - 06:50 UTC
Monitoring
Alerts with a "stops reporting" threshold are now functioning again. At this point all functionality is operational and we are continuing to monitor the situation.
Posted Dec 17, 2016 - 06:29 UTC
Update
We believe we have identified the intersecting set of causes behind this incident. Some initial changes have been applied, and both API and web application performance have improved as a result. We are continuing to monitor performance and investigate. Alerts with a "stops reporting" threshold are still unavailable; all other alert types are functional.
Posted Dec 17, 2016 - 06:15 UTC
Update
Still working on reducing latency on API routes.
Posted Dec 17, 2016 - 05:05 UTC
Update
We are continuing to work through high latency on the API routes and will continue to update this issue.
Posted Dec 17, 2016 - 04:25 UTC
Update
We are continuing to work through high latency on certain API routes. Data submission is still being accepted and the alerting pipeline is isolated from current latency.
Posted Dec 17, 2016 - 03:40 UTC
Identified
We are investigating a regression in API behavior that is leading to failures.
Posted Dec 17, 2016 - 01:52 UTC
Update
Alerts are processing again and we have isolated problematic resources.
Posted Dec 17, 2016 - 00:52 UTC
Update
Alert processing delay has increased; we are still investigating.
Posted Dec 16, 2016 - 23:56 UTC
Monitoring
Alert processing has returned to real time. We are continuing to monitor the situation.
Posted Dec 16, 2016 - 23:31 UTC
Identified
We have identified the source of the issue and are working on a fix.
Posted Dec 16, 2016 - 22:58 UTC
Update
Alerts may be delayed.
Posted Dec 16, 2016 - 22:26 UTC
Investigating
We are currently investigating this issue.
Posted Dec 16, 2016 - 22:19 UTC