Scaling to a Billion (Part 2)
Know Your Enemy
As the scale of your systems increases, the number and types of exceptional or
unexpected behaviors increase with it. At the same time, your understanding of
what the system is actually doing decreases.
As the system scales, it needs logging that is both more complete and more
actionable. Those two requirements are sometimes in conflict: the more verbose
the logs are, the harder they are to understand. To address this problem,
invest in log filtering and analysis tools.
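Filtering becomes much simpler when every log line is machine-readable. The following is a minimal sketch, assuming each log line is a JSON object; the helper names (`log_event`, `filter_logs`) and the field names (`level`, `event`, `order_id`, `items`) are hypothetical and chosen only for illustration.

```python
import json

def log_event(level, event, **fields):
    """Build one log line as a JSON object with the same core keys every time."""
    record = {"level": level, "event": event, **fields}
    return json.dumps(record, sort_keys=True)

def filter_logs(lines, **criteria):
    """Yield parsed records whose fields equal every key=value criterion."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # free-form lines cannot be filtered by field, so skip them
        if not isinstance(record, dict):
            continue  # valid JSON, but not a structured record
        if all(record.get(key) == value for key, value in criteria.items()):
            yield record

# Hypothetical sample data: three structured lines and one free-form line.
lines = [
    log_event("INFO", "order_placed", order_id=1001, items=3),
    log_event("ERROR", "order_failed", order_id=1002, items=57),
    "free-form text that a parser cannot use",
    log_event("ERROR", "payment_timeout", order_id=1003),
]

errors = list(filter_logs(lines, level="ERROR"))
print([record["event"] for record in errors])  # ['order_failed', 'payment_timeout']
```

Because every record has the same shape, the same filter works for any field, which is what makes verbose logs manageable.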
One approach is to implement structured logging, where all logging statements
share essentially the same format and are therefore machine-readable.
Monitoring
A close cousin of logging is an effective monitoring system. Some events
require human investigation and intervention, but involving the team can be
distracting, demoralizing, and expensive. A good monitoring system needs a low
false positive rate (it does not alarm when nothing is wrong) and a very low
false negative rate (it does not miss real problems), but tuning alarm criteria
to meet both goals is difficult. Every customer order is important, but it is
impractical and demoralizing to wake up a developer at 2:00am to fix a single
order, so monitoring systems must prioritize events based on the number of
customers or requests impacted.
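Prioritizing by impact can be sketched as a small routing function. This is only an illustration: the `Incident` type, the `route` function, and both thresholds are assumptions, and real values would depend on your traffic and on-call policy.

```python
from dataclasses import dataclass

# Hypothetical thresholds; tune them to your own traffic and on-call policy.
PAGE_MIN_CUSTOMERS = 100       # page if at least this many customers are affected
PAGE_MIN_FAILURE_RATE = 0.05   # or if at least 5% of requests are failing

@dataclass
class Incident:
    name: str
    customers_impacted: int
    failed_requests: int
    total_requests: int

def failure_rate(incident):
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if incident.total_requests == 0:
        return 0.0
    return incident.failed_requests / incident.total_requests

def route(incident):
    """Return 'page' for wide-impact incidents and 'ticket' for everything else."""
    wide_impact = incident.customers_impacted >= PAGE_MIN_CUSTOMERS
    high_rate = failure_rate(incident) >= PAGE_MIN_FAILURE_RATE
    return "page" if wide_impact or high_rate else "ticket"

single_order = Incident("one order failed", 1, 1, 20000)
checkout_outage = Incident("checkout errors", 450, 900, 10000)
print(route(single_order), route(checkout_outage))  # ticket page
```

The single failed order still produces a ticket, so it is not lost; it simply waits for business hours instead of waking someone up.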
Expansion
As the system grows, manual handling of exceptional events becomes less
reasonable. A machine failure is a likely event in any fairly large cluster, so
it should not require immediate human intervention. Failures of less
trustworthy components should be treated similarly: the system should maintain
its SLAs without immediate intervention. As cluster sizes grow, more classes of
errors should fall into the ‘automatic handling’ category. Specifically, a
failure of one request (or one type of request) should be automatically
isolated so that it does not impact future requests, or at least does not
impact requests of a different type. If the system suffers from known issues
(which may take weeks or months to address), it should be easy to add an
appropriate mitigation to the automatic handling system. For example, suppose a
new bug causes purchase orders with more than 50 different products to fail
more often, and the bug is not fixed immediately. Since such orders are rare,
the team might want them (or just the failing ones) to be automatically
‘parked’ and handled during normal business hours, rather than trigger the
existing alarms, since those failures do not indicate a new problem.
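The parking idea from the example above can be sketched as a list of rules that the failure handler consults before raising an alarm. All names here (`Order`, `ParkingRule`, `FailureHandler`) are hypothetical, and the sketch assumes failures arrive one at a time at a single handler.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Order:
    order_id: int
    distinct_products: int

@dataclass
class ParkingRule:
    """A known issue: failed orders matching `matches` are parked, not alarmed."""
    issue: str
    matches: Callable[[Order], bool]

@dataclass
class FailureHandler:
    rules: List[ParkingRule] = field(default_factory=list)
    parked: List[Order] = field(default_factory=list)   # review in business hours
    alarms: List[Order] = field(default_factory=list)   # would notify on-call

    def handle_failure(self, order: Order) -> str:
        """Park failures explained by a known issue; alarm on everything else."""
        for rule in self.rules:
            if rule.matches(order):
                self.parked.append(order)
                return "parked"
        self.alarms.append(order)
        return "alarm"

handler = FailureHandler()
# Adding a mitigation for a known bug is a one-line change.
handler.rules.append(
    ParkingRule("orders with more than 50 products fail more often",
                lambda order: order.distinct_products > 50)
)

print(handler.handle_failure(Order(order_id=1, distinct_products=72)))  # parked
print(handler.handle_failure(Order(order_id=2, distinct_products=4)))   # alarm
```

Keeping rules as data rather than code paths means a mitigation can be added when the bug is discovered and removed once the fix ships.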