Scaling to a Billion (Part 2)
Know Your Enemy
As the scale of your systems increases, the number and variety of exceptional or unexpected behaviors increases. At the same time, your understanding of what the system is actually doing decreases.
As the system scales, you need logging that is both more complete and more actionable. Those two requirements are sometimes in conflict: the more verbose logs become, the harder they are to understand. To address this tension, invest in log filtering and analysis tools.
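Filtering and analysis tools are much simpler to build when every log line shares a machine-readable shape. A minimal sketch of one such scheme, using Python's standard `logging` module with a JSON formatter (the field names and the "orders" logger are illustrative assumptions, not a prescribed schema):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render every log record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),   # timestamp
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("order accepted")   # emits one JSON object on one line
```

Because every line is a JSON object with the same keys, downstream tools can filter by level, logger, or message content without fragile regex parsing.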
One approach is to implement Structured Logging, where all logging statements share essentially the same format and are therefore machine-readable.

A close cousin of logging is an effective monitoring system. Some events require human investigation and intervention, but involving the team is distracting, demoralizing, and expensive. A good monitoring system needs a low false positive rate (do not alarm when nothing is wrong) and a very low false negative rate (do not miss real problems), but tuning alarm criteria to meet both goals is difficult.

Every customer order is important, but it is impractical and demoralizing to wake a developer at 2:00am to fix a single order, so monitoring systems must prioritize events by the number of customers or requests impacted.
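The impact-based prioritization above can be sketched as a small triage policy. The thresholds, the `Incident` fields, and the action names are illustrative assumptions; a real system would tune these against its own traffic and SLAs:

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    PAGE = "page the on-call immediately"
    TICKET = "file a ticket for business hours"
    LOG = "log only"

@dataclass
class Incident:
    failed_requests: int
    impacted_customers: int

# Illustrative thresholds: page only on broad impact.
PAGE_CUSTOMERS = 100
PAGE_REQUESTS = 1000

def triage(incident: Incident) -> Action:
    """Route an incident by how many customers or requests it impacts."""
    if (incident.impacted_customers >= PAGE_CUSTOMERS
            or incident.failed_requests >= PAGE_REQUESTS):
        return Action.PAGE      # broad impact: wake someone up
    if incident.failed_requests > 0:
        return Action.TICKET    # a single failed order waits for morning
    return Action.LOG
```

Under this policy, one failed order at 2:00am becomes a ticket for the morning rather than a page, while a failure touching hundreds of customers still wakes the on-call.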
As the system grows, manual handling of exceptional events becomes less reasonable. A machine failure is a likely event in any fairly large cluster, so it should not require immediate human intervention. Failures of less-trustworthy components should be treated similarly: the system should maintain its SLAs without immediate intervention. As cluster sizes grow, more classes of errors should fall into the ‘automatic handling’ category.

Specifically, a failure of one request (or one type of request) should be automatically isolated so it does not impact future requests, or at least not requests of a different type. If the system suffers from known issues (which may take weeks or months to address), the automatic handling system should make it easy to add an appropriate mitigation. For example, suppose a new bug causes purchase orders with more than 50 different products to fail more often, and the bug is not immediately fixed. Since such orders are rare, the team might want those failing orders to be automatically ‘parked’ and handled during normal business hours rather than trigger the existing alarms, since those failures are not an indication of a new problem.
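The parking mitigation described above can be sketched as a rule table that routes failures matching a known, tracked issue into a business-hours queue instead of the alarm path. The bug identifier, the `Order` fields, and the rule itself are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Order:
    order_id: str
    product_count: int

# Known-issue rules: each predicate returns True when a failed order
# matches a bug the team is already tracking. The BUG-1234 rule is
# illustrative: orders with more than 50 products fail more often.
KNOWN_ISSUES: dict[str, Callable[[Order], bool]] = {
    "BUG-1234-large-orders": lambda o: o.product_count > 50,
}

parked: list[tuple[str, Order]] = []   # handled during business hours
alarms: list[Order] = []               # unexpected failures alarm as usual

def handle_failure(order: Order) -> None:
    """Park failures that match a known issue; alarm on anything new."""
    for issue_id, matches in KNOWN_ISSUES.items():
        if matches(order):
            parked.append((issue_id, order))
            return
    alarms.append(order)
```

The key design point is that adding a mitigation is a one-line rule entry, and removing it once the bug is fixed restores the normal alarm path, so known issues never mask genuinely new failures.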