When I get paged, my first step is to calmly(-ish) assess the situation. What is the problem? Our app metrics have, in many cases, disappeared. Identify and confirm it: yep, a bunch of dashboards are gone.
Usually I start debugging at this point. What are the possible reasons for that? Did someone deploy a change? Maybe an update to the metrics libraries? Nope, too early: today’s deploy logs are empty. Did app servers get scaled up, which might cause rate-limiting? Nah, all looks normal. Did our credentials get changed? Doesn’t look like it, none of our tokens have been revoked.
All of that would have been a waste of time. Our stats aggregation & dashboard service, Librato, was affected by a wide-scale DNS outage. Somebody DDoS’d Dyn, one of the largest DNS providers in the US. Librato had all kinds of problems, because their DNS servers were unavailable.
We figured that out almost immediately, without having to look for any potential problems with our system. It’s easy for me to forget to check status pages before diving into an incident, but I’ve found a way to make it easier. I made a channel in our Slack called #statuspages. Slack has a nifty slash command for subscribing to RSS feeds within a channel. Just type
/feed subscribe http://status.whatever.com/feed-url.rss
and boom! Any incident updates will appear as public posts in the channel.
Lots of services we rely on use StatusPage.io, which provides RSS and Atom feeds for incidents and updates. The status pages for Heroku and AWS also offer RSS feeds - one for each service and region in AWS’ case. I subscribed to everything that might affect site and app functionality, as well as development & business operations - GitHub, npm, RubyGems, Atlassian (Jira / Confluence / etc.), Customer.io, and so on.
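For a sense of what those incident feeds contain, here is a minimal sketch that pulls the title, link, and publish date out of an RSS 2.0 document using only Python's standard library. The sample feed, its URLs, and the function name parse_incidents are all made up for illustration; real feeds vary, and Atom feeds use different element names, so treat this as a starting point rather than how any particular provider structures its feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, invented for illustration only.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Service Status</title>
    <item>
      <title>Delayed connections</title>
      <link>http://status.example.com/incidents/1</link>
      <pubDate>Mon, 01 Jan 2024 12:00:00 +0000</pubDate>
    </item>
    <item>
      <title>DNS resolution failures</title>
      <link>http://status.example.com/incidents/2</link>
      <pubDate>Mon, 01 Jan 2024 13:30:00 +0000</pubDate>
    </item>
  </channel>
</rss>
"""


def parse_incidents(rss_text):
    """Return one dict per RSS <item>, in the order the feed lists them."""
    root = ET.fromstring(rss_text)
    incidents = []
    for item in root.iter("item"):
        incidents.append({
            # findtext returns None for a missing element, so fall back to "".
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
        })
    return incidents


if __name__ == "__main__":
    for incident in parse_incidents(SAMPLE_RSS):
        print(incident["published"], "|", incident["title"], "|", incident["link"])
```

In practice you would fetch the feed text over HTTP first; the parsing step is kept separate here so it can be tried without a network connection.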
Every time one of these services reports an issue, it appears almost immediately in the channel. When something’s up with our app, a quick check in #statuspages can abort the whole debugging process. It can also be an early warning system: when a hosted service says they’re experiencing “delayed connections” or “intermittent issues,” you can be on guard in case that service goes down entirely.
Unfortunately, not all status pages have an RSS feed. Salesforce doesn’t provide one. Neither does any status page powered by Pingdom: it’s not a feature they offer. I can’t add Optimizely because they use Pingdom. C’mon y’all - get on it!
I’ve “pinned” links to these dashboards in #statuspages so they’re at least easy to find. Theoretically, I could use a service like IFTTT to get notified whenever the page changes - I haven’t tried, but I’m betting that would be too noisy to be worth it. Some quick glue code in our chat bot to scrape the page would work, but then the code has to be maintained, and who has time?
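For the curious, here is roughly what that glue code might look like: a minimal sketch, assuming the page's visible text is a good enough signal. It hashes the text (ignoring script and style contents and collapsing whitespace, to cut down on noise) and compares the result against the last fingerprint seen. The names page_fingerprint and has_changed are hypothetical, fetching the page and posting to chat are left out, and a page that embeds timestamps in its visible text would still trigger on every check.

```python
import hashlib
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def page_fingerprint(html_text):
    """Return a SHA-256 hex digest of the page's whitespace-normalized text."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()
    # Join the text chunks, then collapse every run of whitespace to one space.
    visible = " ".join(" ".join(parser.chunks).split())
    return hashlib.sha256(visible.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint, html_text):
    """True if the page's visible text differs from the stored fingerprint."""
    return page_fingerprint(html_text) != previous_fingerprint
```

A bot would store the fingerprint between polls and post to the channel only when has_changed returns True. That is still code somebody has to maintain, which is the trade-off mentioned above.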
We currently have 45 feeds in #statuspages. It’s kind of a disaster today with all the DNS issues, but it certainly keeps us up-to-date. Thankfully Slack isn’t down for us - that’s a whole different dumpster fire. But I could certainly use an RSS service as an alternative, such as my favorite, Feedbin. That’s the great thing about RSS: the old-school style of blogging really represented the open, decentralized web.
I’m sure I’m not the first person to think of this, but hopefully it will help & inspire some of you fine folks out there.