John Allspaw Jan 16
Replying to @jhscott
Yep definitely don’t page on signals indicating ‘user pain’ is likely to come soon; you should 100% wait for actual users to have pain before reacting to it. Don’t try to anticipate the pain as a means of heading off and mitigating the path of pain—users need to feel it!
Jacob Jan 16
Replying to @allspaw
I deserve this :-D. To be clear, metrics are not the whole story of user happiness, safety, or reliability, which are discussed at length in the blog posts linked! The point I was trying to summarize was to page on signals that are closely linked to user pain. (1/X)
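One standard way to make "signals closely linked to user pain" concrete is an SLO burn-rate page (the multiwindow pattern from the SRE Workbook). A minimal sketch, with the metrics-store queries stubbed out and the thresholds illustrative:

```python
SLO_TARGET = 0.999                 # 99.9% of requests succeed
ERROR_BUDGET = 1.0 - SLO_TARGET

def count_requests(window_minutes: int) -> int:
    """Stub: total requests in the window, from your metrics store."""
    return 120_000

def count_errors(window_minutes: int) -> int:
    """Stub: failed requests in the window, from your metrics store."""
    return 300

def burn_rate(window_minutes: int) -> float:
    """Error-budget burn rate: 1.0 means exactly on budget; 14.4
    sustained would exhaust a 30-day budget in about two days."""
    total = count_requests(window_minutes)
    if total == 0:
        return 0.0
    return (count_errors(window_minutes) / total) / ERROR_BUDGET

# Require both a long and a short window to burn hot, so a brief
# blip does not page anyone (precision over recall).
if burn_rate(60) > 14.4 and burn_rate(5) > 14.4:
    print("PAGE: error budget burning ~14x faster than sustainable")
```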
Jacob Jan 16
Replying to @allspaw
Or perhaps, are observable to users (latency is observable to users, CPU usage is not). In the general case, non-user-observable signals can be quite noisy -- I can of course think of counterexamples though, like disk becoming full (or OOMs, like the incident linked) (2/X)
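Disk-full is the friendly case: a hard limit you can extrapolate toward and page on before anything user-visible fails. A minimal sketch of the usual time-to-exhaustion check (the growth rate and threshold here are illustrative, not from the thread):

```python
import shutil

def hours_until_full(path: str, bytes_per_hour: float) -> float:
    """Linear extrapolation: current free space divided by the recent
    growth rate. In practice bytes_per_hour comes from metrics history;
    here it is a caller-supplied assumption."""
    free = shutil.disk_usage(path).free
    if bytes_per_hour <= 0:
        return float("inf")
    return free / bytes_per_hour

# Page while there is still runway, not when writes start failing.
if hours_until_full("/", bytes_per_hour=50 * 1024**3) < 4:
    print("PAGE: disk projected to fill within 4 hours")
```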
Jacob Jan 16
Replying to @allspaw
In general, I agree with the prevailing sentiment that noisy pages are bad, because they sabotage trust in the system. I think high precision is important for pages and am willing to trade off some recall, especially since you can't depend on pages for everything anyway. (3/X)
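For concreteness, that precision/recall trade can be measured from a post-hoc review of pages and incidents (the numbers below are made up):

```python
def paging_precision_recall(pages_actionable: int, pages_total: int,
                            incidents_paged: int, incidents_total: int):
    """Precision: fraction of pages that were real, actionable problems.
    Recall: fraction of real incidents that a page caught first; the
    rest arrived via tickets, support, or someone noticing."""
    precision = pages_actionable / pages_total if pages_total else 1.0
    recall = incidents_paged / incidents_total if incidents_total else 1.0
    return precision, recall

# Favoring precision: 9-in-10 pages being real keeps on-call trust,
# even if a quarter of incidents surface through other channels.
print(paging_precision_recall(18, 20, 18, 24))   # (0.9, 0.75)
```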
Jacob Jan 16
Replying to @allspaw
I'm super interested if you think this argument in general has giant holes, or if the critique is "when users are unhappy" vs "on signals tied to user pain". And of course in my subsequent tweet, I mention reliability beyond SLOs and tag ACL :-P.
Current Mortem Jan 16
Replying to @jhscott @allspaw
How does this apply to something like data loss? It seems crazy to wait for data loss to occur before paging but maybe that is not a good example.
Jacob Jan 16
Replying to @Petey5K @allspaw
what does data loss look like, I guess? Related, how does this apply to security? I can think of cases where there's a clear limit you're going to hit (cert expiration, I mentioned disk free above).
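Cert expiration is the cleanest "clear limit" case: the deadline is known exactly, so the page can fire with runway to spare. A small standard-library sketch (host and threshold illustrative):

```python
import socket
import ssl
import time

def days_until_cert_expiry(host: str, port: int = 443) -> float:
    """Fetch the server's certificate and return days until notAfter."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires - time.time()) / 86400

# Page two weeks out, long before users see TLS errors.
if days_until_cert_expiry("example.com") < 14:
    print("PAGE: certificate expires in under two weeks")
```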
Jacob Jan 16
Luckily for me I mention that "[SLOs] won't save you from everything" -- perhaps it boils down to heavily favoring precision over recall? Pages that are 100% actionable every time, even if not user visible, might not be that bad? cc
Liz Fong-Jones (方禮真) Jan 16
Data loss usually has leading indicators you can set an SLO on e.g. underreplication.
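A toy version of that idea: treat the fraction of fully replicated shards as the SLI and page on the leading indicator, not on loss itself. The replica counts here would come from the storage system, and the target is illustrative:

```python
REPLICATION_FACTOR = 3
SLO_FULLY_REPLICATED = 0.9999     # illustrative target

def underreplication_ratio(replica_counts: list[int]) -> float:
    """replica_counts: live replicas per shard, as reported by the
    storage system (hypothetical input, not a real API)."""
    if not replica_counts:
        return 0.0
    short = sum(1 for n in replica_counts if n < REPLICATION_FACTOR)
    return short / len(replica_counts)

counts = [3] * 9_998 + [2, 1]     # two of 10,000 shards under-replicated
if underreplication_ratio(counts) > 1.0 - SLO_FULLY_REPLICATED:
    print("PAGE: under-replication exceeds SLO; data-loss risk rising")
```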
Shelby Spees (she/her) Jan 16
This might be oversimplifying, but I feel like system resiliency can follow a similar pattern to the traditional advice for commenting your code: "Write your code so it doesn't need comments, and then comment it anyway." (Not trying to start a flame war here, bear with me.)
Shelby Spees (she/her) Jan 16
…'s thought experiment was essentially that: Instrument and observe your code as if you don't have pager alerts. Then add pager alerts back in.
Shelby Spees (she/her) Jan 16
Alerts, like comments, are static. It takes extra cognitive resources to validate their value and correctness (compared to code, which can at least be run and tested). Comments and documentation don't get updated when code does. The same is true for many pager alerts.
Shelby Spees (she/her) Jan 16
My team has been getting a weird alert all week: numInputRows is too low. Systems were behaving fine, data was streaming fine. Each day when it triggered, we'd query the DB to make sure data was arriving, and it was.
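One fix for an alert like that is to automate the manual check the team was already doing: page on staleness in the destination table, not on numInputRows alone. A sketch, with the database query stubbed out and the limit illustrative:

```python
import time

STALENESS_LIMIT_S = 15 * 60   # illustrative: 15 minutes

def latest_row_timestamp() -> float:
    """Stub for a query against the destination table, e.g.
    SELECT max(event_time) FROM events."""
    return time.time() - 120   # pretend data landed 2 minutes ago

def should_page() -> bool:
    """Page only if downstream data is actually stale; a dip in
    numInputRows by itself may just be quiet traffic."""
    return time.time() - latest_row_timestamp() > STALENESS_LIMIT_S

if should_page():
    print("PAGE: no new rows in destination table for 15+ minutes")
```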