In the midst of an outage at our IaaS provider (AWS), I reflect on what a crazy year it has been and how fortunate we are to be going through it together.
A fitting end to 2020
Sometimes the world throws something at you and, despite your best preparation and skills training, you get hit square in the face and fall on your ass.
– 2020 in a nutshell.
In what is now month 9 or 10 (depending on who you ask) of a global pandemic, we’ve been holding more non-traditional Zoom meetings, as much of the world has: cocktail hours, games of Among Us, and the tournament-style Dad Joke contest that our SupportOps Manager orchestrated for the team last month.
Zoom has become the substitute for so much.
A recent Friday morning was our first yoga-over-Zoom session. Not everyone could make it, but those who did followed along for around an hour as the instructor demonstrated various stretching and breathing exercises. It was a touch of team building with a dose of self-care, and it should have been a relaxing ride into the weekend.
Everyone on the planet has had to evaluate and examine old notions and learn new coping strategies this year as we wrestle with the metric ton of chaos around us.
At nearly the exact moment the yoga session ends, PagerDuty goes bananas. Alerts bubble into Slack and all the calm in the room evaporates as the team springs into action. An Aurora RDS database cluster has failed, and failed hard. Even the Amazon cloud is going to fail sometimes, and when it does, most of the world feels it. This failure, however, was isolated and seemed to affect only us, and even then just a single cluster of the many that we manage.
To the best of our abilities, we’ve architected around the inevitable failure; nothing is ever 100% reliable, and we account for that in our stack design. We’ve made decisions that cost more but yield automated recovery and multiple redundancies to shelter our customers’ assets.
- Our automated recovery systems: Failed
- Our manual recovery attempts: Failed
- AWS’s automated recovery routines: Failed
... and the clock is ticking along.
To make a long story and a 3-hour outage short: a bug in the AWS software that runs the Aurora database platform had the cluster stuck in a repeating loop of segfault -> recovery attempt -> segfault.
All our training and our disaster recovery gamedays had prepared us for this, because sometimes even the world’s largest cloud provider has an unpatched bug in its codebase. We started executing our plans B, C, and D.
See the status thread here: https://status.pagely.com/incidents/22ggyk6m3vg1
Thankfully the AWS engineers were able to recover the cluster and get everything back online.
Sometimes managed database services fail. Despite your best preparation and skills training, you get knocked around a bit, but your team is there with you, working through the problem in real time.
– our Friday in a nutshell.
Did I mention we did yoga earlier that day?
Sally and I are in the enviable position, and thankfully have been for some time now, of most certainly NOT being the most important people in our company. So while this was all going down we chimed in a bit here and there, but mostly we just watched our #warroom Slack channel and marveled at the expertise and professionalism on display.
There was no single hero of the day – there was only a team of peers dynamically working together as a single unit – and it was glorious to behold.
I’ll save you the rest of the corporate platitudes and thought leadership BS and just say that who you work with makes all the difference in the end.
In a global context, this year could have gone any number of ways; sadly, it went about as poorly as imaginable. At the center of the bad, and of the occasional bright spots of good, were people.
2020 has taught me that the people we choose to lead us, the people we choose to follow, the people we choose as business partners, the people we surround ourselves with, and the people we invest our time in are the difference – for better or worse.
I am humbled and grateful that our people here at Pagely are ones I am proud and thankful to work with on a daily basis, and I look forward to seeing what opportunities 2021 will bring us.
I wish all of you a safe and happy holiday season and a bright beginning to the New Year ahead.
— Joshua Strebel