On December 6, the thing we always dread the most happened. At 13:23, we noticed capacity issues with our platform. It soon turned out to be a serious issue that resulted in downtime for many use cases. We are extremely sorry that we allowed this to happen. Uptime and reliability are essential to the service we provide to our customers, so we work hard to make outages as few and far between as possible.
Instead of just offering you an apology, though, we’d like to give you a look behind the scenes and show you exactly what we did to fix the issue as quickly as we could.
Let’s dive into a detailed post-mortem of events.
Timeline of events:
13:12 - We deployed preparations for switching to a VPC-based infrastructure. This included new environment variables holding configuration specific to the new VPC we’re creating for each region, such as the subnets, security groups, and load balancers to add machines to. It also contained changes to GoInstance, a tool we wrote in Go that manages launching and activating machines, and taking them away again. The change did not include the actual switch to VPC; it just laid the groundwork, while remaining backwards compatible with our current non-VPC setup. Moving to a VPC allows us to improve security and leverage VPC-only AWS features, such as instance types that offer higher performance at lower cost. The change was thoroughly tested, both locally and in CI, but because the environment variables in production differ from those in dev and staging, a mismatch could still slip through unseen.
13:23 - The first pager hit came in, indicating that we were running fewer uploader instances than we should be. The issue went unnoticed for a while: loading the new environment requires new processes, and it took time for our blue/green deployment to replace all active ones.
13:26 - Emergency Response Team (ERT) dispatched.
13:32 - It was established that we couldn't launch new machines due to invalid environment variables (a mismatch in AMI IDs and security group names). In some cases, GoInstance was referring to values that didn't exist.
13:39 - As old machines were rotated away or became unreachable, we were soon down to a single machine per region, which the autoscalers refused to rotate away.
13:40 - The ERT fixed the mismatch and kicked off a build.
13:45 - Twitter and statuspage updated.
13:49 - The build was deployed.
13:51 - It turned out the changes did not go live, due to an unrelated bug in how GoInstance is built.
13:55 - Puzzled by this, the team tried to roll back the bad build completely.
14:01 - Due to the same bug, the rollback did not activate the changes either. As it later turned out, rolling back further in time would have worked. To explain: a few months ago, when we switched our stack to Nix, we introduced a bug that would only rebuild GoInstance if changes to it were bundled with changes to other parts of our stack. When we tried to patch GoInstance on its own, the changes weren’t picked up and deployed to production. It took the ERT some time to figure this out.
14:08 - The ERT patched the aforementioned issue in our Nix setup, restoring our ability to roll out changes. We kicked off another build and deployed.
14:17 - Both the EU and US fleets were at their desired capacity again. A big piece of our stack had recently been added that was not yet baked into our AMIs, so machine launches were slower than we are used to.
14:34 - All services verified to be restored, all related customer conversations resolved, Twitter and statuspage updated.
What steps have we taken?
New AMIs have been built for all regions and machine types, making our machine launch times approach three minutes again
The bug in our Nix setup was fixed, so that changes to GoInstance are always reflected in our build
The mismatch in our environment was fixed
Together, these measures got everything running smoothly again. We realize, however, that going forward, we'll need additional measures to make sure something like this can never happen again.
What further steps will we take?
As part of our deploy, we have the autoscalers launch machines as a preflight test; this should have stopped the deploy in its tracks before it caused any damage. Unfortunately, due to an unrelated issue, this preflight test ran against the previous environment rather than the one being deployed. We'll address this so that the test works reliably going forward and can catch issues like this.
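The essence of that fix can be sketched as follows. This is a simplified illustration, not our actual deploy code: the types and function names are hypothetical, and in reality the preflight is a real machine launch via the autoscalers rather than a field check. The key point is which config the test receives.

```go
package main

import (
	"errors"
	"fmt"
)

// launchConfig is a simplified stand-in for the environment a deploy
// ships; the fields are hypothetical.
type launchConfig struct {
	AMI           string
	SecurityGroup string
}

// preflight simulates launching a machine with the given config.
func preflight(c launchConfig) error {
	if c.AMI == "" || c.SecurityGroup == "" {
		return errors.New("preflight failed: incomplete launch configuration")
	}
	return nil
}

// deploy must preflight the *candidate* environment, the one about to
// go live. The bug was equivalent to calling preflight(current) here:
// the test passed because the old environment was still valid.
func deploy(current, candidate launchConfig) error {
	if err := preflight(candidate); err != nil {
		return fmt.Errorf("aborting deploy: %w", err)
	}
	fmt.Println("preflight passed, proceeding with deploy")
	return nil
}

func main() {
	current := launchConfig{AMI: "ami-old", SecurityGroup: "uploaders"}
	broken := launchConfig{} // candidate carrying the bad environment
	if err := deploy(current, broken); err != nil {
		fmt.Println(err)
	}
}
```

With the test pointed at the candidate config, a broken environment aborts the deploy instead of silently passing against the old one.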
Other parts of our automated test suite could also have caught the invalid environment, but they weren't testing for it because the staging environment differs from the production environment. We're currently investigating whether we can make our environments more similar.
We are sorry
Again, we are very sorry about the trouble we have caused you. Hopefully, this post-mortem was able to shine some light on what exactly caused the outage and what we are doing to make sure it never happens again. If you were affected by this outage, please reach out to our support team and we’ll try to make things right!