Metrics, rolled up metric data and alerts are delayed
Incident Report for Librato
Postmortem

Summary

A subset of measurements for several legacy (single-dimension, e.g. source) time series aggregation workloads (1 minute, 15 minute and 1 hour), along with the alerts associated with those measurements, was delayed on Jan 4th from 20:15:00 UTC to 23:30:00 UTC. One workload, the 1 minute aggregations, took an additional 1.75 hours to work through its backlog and finished on Jan 5th at 01:15:00 UTC. During this time, historical rollups for the affected periods were not available in the UI. Real-time measurements (Last 60 minutes views) remained operational.
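The durations quoted in the summary can be checked directly from the timestamps. The following is a minimal sketch, assuming the year is 2018 (taken from the "Posted" dates on this report); the helper and variable names are illustrative only.

```python
from datetime import datetime, timezone


def utc(month, day, hour, minute):
    # Year 2018 is an assumption based on the "Posted" dates of this report.
    return datetime(2018, month, day, hour, minute, tzinfo=timezone.utc)


delay_start = utc(1, 4, 20, 15)      # delays begin
delay_end = utc(1, 4, 23, 30)        # most workloads caught up
backlog_cleared = utc(1, 5, 1, 15)   # 1 minute aggregations caught up

# Convert each interval to hours.
main_delay_hours = (delay_end - delay_start).total_seconds() / 3600
extra_hours = (backlog_cleared - delay_end).total_seconds() / 3600
total_hours = (backlog_cleared - delay_start).total_seconds() / 3600

print(main_delay_hours, extra_hours, total_hours)
```

The second value matches the "additional 1.75 hours" stated above, and the third gives the full span during which the 1 minute workload was behind.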

Trigger and Root Cause

While performing a rolling restart of one of our primary Kafka 0.10.2.0 clusters, we encountered https://issues.apache.org/jira/browse/KAFKA-5413 (Log cleaner fails due to large offset in segment file) on 2 brokers. On restart, this blocked consumers on some topics and partitions for nearly 40 minutes while the Kafka Log Cleaner compacted a large backlog of __consumer_offsets log segments.

Actions

All Kafka clusters are being upgraded to Kafka 0.11.2. We are setting up alerts on several key metrics that will warn us when log compaction stops running. We plan to publish a more detailed blog post in the coming days.
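One way a "log compaction stopped" warning could work is a staleness check: alert when too much time has passed since the cleaner last completed a run on a broker. The following is a minimal sketch, assuming a per-broker "seconds since last compaction run" value is already being collected by some other means. The function name, the broker names, and the 15 minute threshold are hypothetical and do not describe our actual alert configuration.

```python
def stalled_brokers(seconds_since_last_run, threshold_seconds=900):
    """Return the brokers whose log cleaner looks stalled.

    seconds_since_last_run maps a broker name to the number of seconds
    since its cleaner last completed a run, or None if nothing was reported.
    The 900 second default is a hypothetical threshold.
    """
    stalled = []
    for broker, idle in sorted(seconds_since_last_run.items()):
        # A missing reading counts as stalled: a cleaner thread that has
        # died may stop reporting altogether.
        if idle is None or idle > threshold_seconds:
            stalled.append(broker)
    return stalled


# Hypothetical readings for three brokers.
sample = {"broker-1": 42.0, "broker-2": 5400.0, "broker-3": None}
print(stalled_brokers(sample))
```

Treating a missing reading as a failure is a deliberate choice in this sketch: a check that only compares reported values can stay silent when the thing it monitors has stopped reporting.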

Posted Jan 11, 2018 - 17:28 UTC

Resolved
This incident has been resolved.
Posted Jan 05, 2018 - 01:25 UTC
Identified
We are working on the backlog.
Posted Jan 04, 2018 - 22:33 UTC
Update
Metrics, rolled up metric data and alerts are delayed
Posted Jan 04, 2018 - 21:47 UTC
Investigating
We are currently investigating this issue.
Posted Jan 04, 2018 - 21:29 UTC