Over the weekend Feedly, the RSS aggregator used by millions, had to perform a massive upgrade. Feedly announced on their blog and on Twitter that this would be occurring. Shortly before the upgrade began, Feedly announced on Twitter, “Starting maintenance. The http://feedly.com and the Feedly mobile applications will be offline for the next 2 hours. Thank you for your patience.”
And of course things didn’t go quite as planned. As a very avid Feedly user, I’m surprised I hadn’t heard about the upgrade until, well, nothing worked. I don’t use the official Feedly apps; I use 3rd party RSS readers instead. So from the get-go this was an unpleasant surprise. I wonder why a default message couldn’t have been pushed out as a feed item to say that the upgrade was occurring?
The bigger problem is that, while realtime updates were provided, the original estimate for downtime was off. Way off! From two hours to approximately seven hours. I can only imagine the slightly insane chaos that was occurring when trying to get service back up. I really feel for all those involved.
Twitter was not so kind. Many of the comments, not worth reprinting, were mean-spirited, derogatory, and just unnecessary. I’m going to guess that the worst of the comments came from people who aren’t even paying for the service. As far as I know, Feedly doesn’t monetize any user-generated content (i.e. which feeds I’m subscribing to and which ones I actually read). I could be wrong, but at the moment it’s a fair assumption. I’m going to ignore those rants overall, but Feedly could’ve done a number of things to mitigate the response.
The smartest thing to do when attempting any upgrade is to communicate the service outage timespan very clearly. There are two things that should be communicated: the service window and the expected outage. The first is the service/migration/upgrade window. This is the timespan in which outages may occur. Typically one would say that there will be an eight hour upgrade window on a specific day. There is no reason to cut it close here. The shortest upgrade window I have ever used in business is four hours. You can’t go wrong by overestimating the amount of time required. Just be reasonable. A 24 hour window isn’t fair but an eight hour window is definitely ok.
How do you go about estimating the service window? I think a good rule of thumb is the estimated outage time multiplied by four. In the Feedly example, if they planned on having everything unavailable for two hours, then the total service window should be eight hours. Hey, if you get done early, everyone is even happier. It’s about setting expectations. Oh, and if you go over the two hours, guess what: you’ve got plenty of leeway since the whole service window is eight hours, and well, things happen.
So the original message should’ve been:
Hello customers, we’ll be upgrading our servers starting 10am PT. The upgrade window is eight hours and we expect no more than two hours of downtime, though delays may occur and we wanted to set expectations accordingly with the larger service window. We will update on Twitter and our expected completion is 6pm PT.
Now that would’ve been the right way to do it.
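If you’d rather not do the math by hand every time, the rule of thumb is easy to script. Here’s a minimal sketch in Python, nothing official: the function names (`service_window`, `fmt_hour`, `announcement`) are made up for illustration, the times are naive and assumed to already be in the timezone you announce, and the 10am start comes from my example message above rather than from anything Feedly published.

```python
from datetime import datetime, timedelta

# Rule of thumb from the post: service window = expected outage x 4.
WINDOW_MULTIPLIER = 4


def service_window(start, expected_outage_hours, multiplier=WINDOW_MULTIPLIER):
    """Return (window_hours, window_end) for a planned outage."""
    window_hours = expected_outage_hours * multiplier
    return window_hours, start + timedelta(hours=window_hours)


def fmt_hour(dt):
    """Format an on-the-hour time like '10am' or '6pm' (minutes are ignored)."""
    hour12 = dt.hour % 12 or 12
    suffix = "am" if dt.hour < 12 else "pm"
    return f"{hour12}{suffix}"


def announcement(start, expected_outage_hours, tz_label="PT"):
    """Build a customer-facing message that states both numbers up front."""
    window_hours, end = service_window(start, expected_outage_hours)
    return (
        f"We'll be upgrading our servers starting {fmt_hour(start)} {tz_label}. "
        f"The upgrade window is {window_hours} hours and we expect no more than "
        f"{expected_outage_hours} hours of downtime. "
        f"Expected completion is {fmt_hour(end)} {tz_label}."
    )


# Hypothetical start time: 10am on the day of the maintenance.
print(announcement(datetime(2021, 3, 27, 10, 0), expected_outage_hours=2))
```

With a two hour expected outage starting at 10am, this gives an eight hour window ending at 6pm, which matches the message above. The multiplier is a parameter, so you can tune it if four turns out to be too generous or too tight for your team.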
Last but not least, if everything goes well (and definitely if things don’t), be very, very clear about what you did that took your SaaS offline. In fact this “document” should be pretty much a rehash of the original, extra clear description of what was going to happen. Most people will skip it, but geeks will appreciate the effort. Any issues that arose and affected the timing should be over-communicated, e.g. Scott tripped over the cable and temporarily turned off the internet. Again, the people who care to check in will appreciate and respect the care that’s gone into communicating the details. The Feedly Update Log doesn’t cut the mustard. Its entry, “Maintenance on March 27th, 2021 completed,” reads: “Thanks again for your patience. The service is back online. A big thank you to the DevOps team for pulling off this major upgrade. Have a wonderful weekend!”
Oh, and yes, this isn’t just about SaaS companies 😉 If you run a WordPress or other digital agency, you should apply the same Service Window and Expected Outage formula when working with your clients.