Scaling to a Billion (Part 4)
Nothing is Free
Understanding architectural tradeoffs is important. For example, consider
queuing solutions and an Enterprise Message Bus to enhance reliability or
enable batching, but use direct Web Service calls when turnaround time matters.
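As a rough illustration of this tradeoff, the sketch below contrasts a direct call with a queued, batched delivery path. It is a minimal in-process sketch using Python's standard `queue` module, not a real message bus; the names `call_direct`, `drain_in_batches`, `service`, and `service_batch` are hypothetical and stand in for whatever transport and endpoints a real system would use.

```python
import queue
from typing import Callable, List


def call_direct(service: Callable[[dict], dict], payload: dict) -> dict:
    """Direct call: the caller blocks and gets the answer right away,
    but a service outage is felt by the caller immediately."""
    return service(payload)


def drain_in_batches(
    pending: "queue.Queue[dict]",
    service_batch: Callable[[List[dict]], None],
    batch_size: int = 100,
) -> int:
    """Queued path: deliver pending items in batches.

    If a batch fails, its items are put back on the queue for a later
    retry, so nothing is lost; the cost is added latency.
    Returns the number of items delivered in this run.
    """
    delivered = 0
    while True:
        batch: List[dict] = []
        while len(batch) < batch_size:
            try:
                batch.append(pending.get_nowait())
            except queue.Empty:
                break
        if not batch:
            return delivered
        try:
            service_batch(batch)
        except Exception:
            # Keep the undelivered items for the next attempt.
            for item in batch:
                pending.put(item)
            return delivered
        delivered += len(batch)
```

The queued path trades latency for resilience: producers keep working while the downstream service is slow or down, and the consumer amortizes per-call overhead across a batch. The direct path is simpler and faster per request, which is why it suits calls where the caller is waiting on the answer.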
Exceptional Service
With millions of transactions, each involving multiple touchpoints, the
unexpected is the norm. Subtle bugs may affect only a small percentage of
requests and slip past testing; requirement analysis or implementation may miss
some input combinations; and timing issues or hardware failures may cause
unexpected states to manifest. All these unexpected issues require good, tiered
exception handling.
First, code must be constructed with error handling in mind. Exceptions need to
be caught and handled, issues should be automatically isolated to the minimal
set of related requests, and nothing should bring down the system or stop the
processing train.
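One way to picture this isolation is a processing loop that catches failures per request, so a bad request is set aside while the rest continue. This is only a sketch under simple assumptions: `process_all`, `handler`, and the `"id"` key on each request are hypothetical names, and a real system would likely route failed requests to a retry or dead-letter mechanism rather than a list.

```python
import logging
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger("processing")


def process_all(
    requests: Iterable[dict],
    handler: Callable[[dict], object],
) -> Tuple[List[Tuple[dict, object]], List[Tuple[dict, Exception]]]:
    """Run handler on every request; a failure affects only that request."""
    succeeded: List[Tuple[dict, object]] = []
    failed: List[Tuple[dict, Exception]] = []
    for request in requests:
        try:
            succeeded.append((request, handler(request)))
        except Exception as exc:
            # Record the failure with its traceback and move on, so one
            # bad request never stops the processing of the others.
            logger.exception("request %r failed", request.get("id"))
            failed.append((request, exc))
    return succeeded, failed
```

Catching `Exception` this broadly is normally discouraged, but at the outermost per-request boundary it is a deliberate choice: the failure is logged in full and contained, while anything more specific is better handled closer to where it occurs.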
But even with the best coding practices in place, issues will crop up at the
most inconvenient times. All large-scale systems require a tiered level of
off-hours support (on-call), and the participation of trained engineers and
programmers in the process. Alarm levels should be set appropriately to balance
reducing system risk against support personnel workload. However, handling
one-off errors and investigating suspected issues can waste precious time, cause
employee dissatisfaction and work-life balance issues, and increase churn in
the team. Best people practices include investing in continuously improving the
system, automatically deferring the handling of issues that impact only a few
transactions to business hours (e.g., by setting a high threshold for alarms),
and allowing employees to receive comp time if on-call issues result in work
outside business hours.
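The idea of a high alarm threshold can be sketched as a small routing rule: page the on-call engineer only when the failure rate is high enough, and otherwise file the issue for business hours. The 1% threshold, the `AlarmPolicy` class, the `route_alarm` function, and the returned labels are all illustrative assumptions, not values or names from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmPolicy:
    # Hypothetical threshold: page only if at least 1% of transactions fail.
    page_threshold: float = 0.01


def route_alarm(failed: int, total: int, policy: AlarmPolicy = AlarmPolicy()) -> str:
    """Decide how urgently a batch of failures should be handled."""
    if total <= 0 or failed <= 0:
        return "ignore"
    failure_rate = failed / total
    if failure_rate >= policy.page_threshold:
        return "page_on_call"          # wake someone up now
    return "business_hours_ticket"     # defer to the next working day
```

For example, under the assumed 1% threshold, 3 failures out of 1,000,000 transactions would become a business-hours ticket, while 50,000 failures out of 1,000,000 would page the on-call engineer. Choosing the threshold is the balancing act described above: too low and the team is woken for one-off errors, too high and a real outage goes unnoticed.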