Degraded performance on customer pages and the primary application.
Detailed Incident Report: Service Outages on Aug 22-23

At LaunchNotes, we strive to offer the most efficient and reliable software service possible. We are committed to full transparency with our user community, which is why we're sharing a comprehensive report detailing our recent outage.

Background & Prelude to Incident

In an effort to improve the performance and accuracy of our content viewer analytics, we've recently invested time and resources in reworking the queries and underlying database schema. Additionally, a significant portion of this transition involved moving our workload from the primary Postgres database to a more scalable, managed TimescaleDB instance that is better suited to the type of analytics data we’re storing and reporting.

As a result of our investigation, we uncovered a few areas for improvement:

  • Several surprisingly inefficient queries were slowing down our systems, most notably unnecessary joins and problematic indices on content views (see the sketch following this list).

  • A caching misconfiguration was generating unnecessary requests and load on our databases.

  • At times of higher load, this caused us to bump up against our database connection limits, occasionally leading to dropped connections and a poor user experience.
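
To make the index issue concrete, here is a minimal sketch of the kind of change involved. The table and column names (content_views, announcement_id, viewer_id, viewed_at) are assumptions for illustration, not our actual schema:

    # Hypothetical example: adding a covering index so per-viewer,
    # per-announcement lookups no longer scan the whole table.
    import psycopg2

    conn = psycopg2.connect("postgresql://localhost/analytics")
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE INDEX IF NOT EXISTS idx_content_views_announcement_viewer
                ON content_views (announcement_id, viewer_id, viewed_at DESC);
        """)
    conn.close()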

Over the past few days, we had seen some diminished performance with our embeddable widget, which served as a revealing gauge of the lurking problems. We started looking deeper into it and noticed a few things:

  • Bloat in our Redis cache key count and storage size, along with slower read and write times.

  • Excessive analytics queries: for every announcement returned by the embed call, we captured both the first and last viewed timestamps. This amounted to 21 queries, a sizable load by default (a sketch of this pattern follows this list).

  • More programmatic traffic than expected, as we noticed several agents with automated pings to this endpoint.
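
To illustrate the query volume, here is a sketch of the per-announcement shape described above versus a single grouped query that retrieves the same first/last viewed timestamps in one round trip. Table, column, and function names are assumptions for illustration:

    # Hypothetical sketch: one query per announcement (roughly the
    # pattern behind the 21 queries mentioned above) versus one
    # grouped query returning the same information.
    import psycopg2

    conn = psycopg2.connect("postgresql://localhost/analytics")

    def first_last_viewed_per_announcement(announcement_ids, viewer_id):
        # N+1 shape: a separate database round trip for each announcement.
        results = {}
        with conn.cursor() as cur:
            for ann_id in announcement_ids:
                cur.execute(
                    "SELECT MIN(viewed_at), MAX(viewed_at) "
                    "FROM content_views "
                    "WHERE announcement_id = %s AND viewer_id = %s",
                    (ann_id, viewer_id),
                )
                results[ann_id] = cur.fetchone()
        return results

    def first_last_viewed_bulk(announcement_ids, viewer_id):
        # Same information in a single round trip.
        with conn.cursor() as cur:
            cur.execute(
                "SELECT announcement_id, MIN(viewed_at), MAX(viewed_at) "
                "FROM content_views "
                "WHERE announcement_id = ANY(%s) AND viewer_id = %s "
                "GROUP BY announcement_id",
                (list(announcement_ids), viewer_id),
            )
            return {row[0]: (row[1], row[2]) for row in cur.fetchall()}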

However, we were still up and running, largely due to an aggressive caching strategy at this endpoint. Our Redis instance was bailing us out, ensuring that a majority of requests did not directly burden the database, providing ongoing, albeit brittle, stability.
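
For context, the caching in front of this endpoint follows a standard cache-aside pattern. The sketch below is a simplified illustration using redis-py; the key name, TTL, and fetch function are assumptions rather than our exact implementation:

    # Minimal cache-aside sketch: serve from Redis when possible, fall
    # back to the database only on a miss, and write the result back
    # with an expiry.
    import json
    import redis

    cache = redis.Redis(host="localhost", port=6379)
    EMBED_CACHE_TTL = 300  # seconds; illustrative value only

    def embed_payload(project_id, fetch_from_db):
        # The key encodes the project; if the key format changes, every
        # warm entry becomes unreachable and requests fall through to
        # the database until the cache repopulates.
        key = f"embed:v1:{project_id}"
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)          # hit: no database work
        payload = fetch_from_db(project_id)    # miss: query the database
        cache.set(key, json.dumps(payload), ex=EMBED_CACHE_TTL)
        return payload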

The Outage

The incident expanded in scope and impact on Monday, August 21. As part of the previously described ongoing effort, we pushed a change on the evening of the 21st that altered the cache key for the primary call for the embed. This call is responsible for every piece of data that renders in the embed and fires on page load, so the load was significant.

With this change, we started seeing an increase in cache misses and, with it, a significant increase in load on our Timescale database. As it was late in the day, this did not yet take any of our systems down.

The next day (Tuesday, Aug 22), as the majority of our traffic came online, we started to experience intermittent downtime, which culminated in a 23-minute hard-down period for all services early in the afternoon (EDT). By this time, an incident had already been declared, and the engineering team was all hands on deck responding to the outage.

This scenario exposed a series of overlapping issues:

  • TimescaleDB, though robust, was ill-prepared to manage the raw query volume and the demanding connection count without the cache's protective layer.

  • Redis, our in-memory data structure store, began to lag. An overextended configuration and missing Time-To-Live (TTL) settings on some keys meant that the system wasn't evicting keys efficiently or predictably (see the sketch following this list).

  • Our liveness & health checks on a subset of our servers were misconfigured and aggressively killed instances as they tried to boot under load.
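
The sketch below shows the two Redis knobs referenced above: per-key TTLs and the instance-wide eviction policy. The specific values are illustrative assumptions, not our production settings:

    import redis

    r = redis.Redis(host="localhost", port=6379)

    # Without an expiry, a key lives until it is explicitly deleted or
    # evicted under memory pressure; with one, Redis can reclaim it
    # predictably.
    r.set("embed:v1:some-project", "{}", ex=300)

    # The eviction policy decides what happens when maxmemory is hit.
    # "noeviction" makes writes fail under pressure, while an LRU policy
    # such as allkeys-lru drops the coldest keys first.
    r.config_set("maxmemory-policy", "allkeys-lru")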

The combination of these factors left us struggling to spin up new resources to absorb the incoming load. To get out from under this, we had to do several things.

Actions Taken

Our immediate response encompassed several strategic and technical steps:

  • Traffic Management: To alleviate the frontend pod strain, we redirected traffic from the primary embed endpoint to a 404 error page, ensuring our remaining services had room to spin up and recover.

  • Infrastructure Scaling: Our Timescale instance was promptly scaled up, doubling the CPU cores to better accommodate the uptick in load. Alongside this, we implemented connection pooling, which allows us to support a much larger volume of connections (a pooling sketch follows this list).

  • Rate Limiting: We introduced some temporary, though lenient, rate limiting. We did not want to take an aggressive approach here without advance communication with our customer base. In the future, we’ll be revisiting these limits, but plan to first consult with customers and communicate any changes ahead of time so as not to disrupt service.

  • Embed Refactoring: A pivotal action we took was to separate the embed data into two distinct calls. This split drastically reduced the DB hits, streamlining data retrieval while ensuring efficient caching.

    As previously mentioned, the (now) legacy version of the Embed relied on a single call on page load to query all of the data for the service. This created an unnecessarily large volume of calls and pulled more data than necessary, especially considering many of these calls were automated on a short interval.

    With this came a new release of our Embed, version 1.0.0. This version is more selective and performant about when and what data is pulled, and we recommend all clients upgrade to it. We will continue to maintain backward compatibility for older versions until further notice.

  • Content Analytics Service Revamp: With a focus on bulk data processing, the refactored service utilizes far fewer database connections, considerably reducing pressure on Timescale. We also improved error handling and retry logic, which will bolster data quality. Finally, we identified and added a missing index on our content viewers table, significantly reducing the time of the query that powers the unread indicator. This was rolled out on Wednesday evening (Aug 23).

  • Cache Management: We have upgraded the capacity of our Redis instance, ensured TTLs are applied to all cache keys, and adopted a more aggressive eviction policy to avoid bloat and ensure faster read/write times. We have also purged all old and/or irrelevant records from the cache, reducing our total cache size by nearly a factor of five.

  • Health Check Overhaul: A deep dive revealed our health checks were counterproductively aggressive. We extended check intervals and switched to a different liveness check, giving our infrastructure more leeway when spinning up instances under increased load.
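
As a companion to the Infrastructure Scaling item above, here is a minimal connection-pooling sketch. The pool sizes, DSN, and helper function are assumptions for illustration; the point is that a bounded pool reuses a small set of connections instead of opening a new one per request, keeping us comfortably under the database's connection limit.

    # Hypothetical pooling sketch using psycopg2's built-in pool.
    from psycopg2 import pool

    timescale_pool = pool.ThreadedConnectionPool(
        minconn=2,
        maxconn=20,  # stays well below the server-side connection cap
        dsn="postgresql://localhost/analytics",
    )

    def run_query(sql, params=None):
        conn = timescale_pool.getconn()
        try:
            with conn, conn.cursor() as cur:
                cur.execute(sql, params)
                return cur.fetchall()
        finally:
            timescale_pool.putconn(conn)  # always return it for reuse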

Impact

While the majority of the impact was felt in the embedded widget, we also experienced ripple effects in our other services:

  • Short, sporadic bursts of downtime caused intermittent service disruptions across all of our main services.

  • The legacy embed version bore the brunt of this outage, remaining inaccessible for nearly a full day.

  • The analytics feature within the management app took a hit. As Timescale grappled with connection hiccups, users were left with a compromised dashboard experience. We have since ensured that all page views and clicks during this time have been recorded and backfilled. All analytics should be correct and complete at this point.

Conclusion & Forward Path

We have implemented several updated processes to enhance the overall performance of our applications. Our engineering and support teams have raised the priority of improving visibility and monitoring across our infrastructure, so that we can proactively identify and address potential issues at an earlier stage. And to keep our users better informed, we plan to increase the frequency of updates on our Statuspage and provide a greater level of detail in the event of any future incidents.