Wednesday, August 14, 2013

Scaling to a Billion - Part IV

(An expanded version of this article first published in Software Developer's Journal)


Nothing is Free

Understanding architectural tradeoffs is important. For example, consider a queuing solution or an enterprise message bus to enhance reliability or enable batching, but use direct web service calls when turnaround time matters.
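To make the tradeoff concrete, here is a minimal sketch of the queuing side. The class name `BatchingQueue`, the `batch_size` parameter, and the in-memory buffer are all assumptions made for illustration; a production queue or message bus would persist messages durably and retry failed deliveries, which this sketch does not attempt.

```python
from collections import deque
from typing import Callable, Deque, List


class BatchingQueue:
    """Hypothetical in-memory queue that hands requests to a handler in batches."""

    def __init__(self, handler: Callable[[List[str]], None], batch_size: int) -> None:
        self._handler = handler
        self._batch_size = batch_size
        self._pending: Deque[str] = deque()

    def submit(self, request: str) -> None:
        # The caller returns right away; the request waits until a batch fills.
        self._pending.append(request)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        # Deliver whatever is buffered as one batch (no-op when empty).
        if not self._pending:
            return
        batch = list(self._pending)
        self._pending.clear()
        self._handler(batch)
```

The cost is visible in `submit`: a request may sit in the buffer until enough others arrive, which is acceptable for batch-friendly work but not when the caller is waiting on the answer. In that case, calling the handler directly keeps turnaround time short at the price of losing batching.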

Exceptional Service

With millions of transactions, each involving multiple touchpoints, the unexpected is the norm. Subtle bugs may affect only a small percentage of requests and slip through testing; requirement analysis or implementation may miss some input combinations; and timing issues or hardware failures may cause unexpected states to manifest. All of these issues call for good, tiered exception handling.
First, code must be written with error handling in mind. Exceptions need to be caught and handled, issues should be automatically isolated to the minimal set of related requests, and nothing should bring down the system or stop the processing train.
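As a rough illustration of isolating a failure to the request that caused it, consider the sketch below. The function `process_all`, the `handle` callback, and the dictionary-shaped requests are hypothetical names chosen for this example, not part of any particular system.

```python
import logging
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger("processing")


def process_all(
    requests: Iterable[dict],
    handle: Callable[[dict], None],
) -> Tuple[int, List[Tuple[dict, Exception]]]:
    """Process every request; a failure affects only the request that raised."""
    succeeded = 0
    failed: List[Tuple[dict, Exception]] = []
    for request in requests:
        try:
            handle(request)
            succeeded += 1
        except Exception as exc:
            # Record the failure for later investigation and keep going,
            # so one bad request does not stop the rest of the stream.
            logger.exception("request failed: %r", request)
            failed.append((request, exc))
    return succeeded, failed
```

The failed requests are returned rather than discarded, so they can be retried or reviewed later without holding up the requests that succeeded.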
But even with the best coding practices in place, issues will crop up at the most inconvenient times. All large-scale systems require tiered off-hours (on-call) support, with trained engineers and programmers taking part in the process. Alarm levels should be set to balance system risk against the workload on support personnel. Handling one-off errors and investigating suspected issues can waste precious time, cause employee dissatisfaction and work-life balance problems, and increase churn in the team. Best people practices include investing in continuously improving the system, automatically deferring issues that affect only a few transactions to business hours (e.g., by setting a high threshold for alarms), and giving employees comp time when on-call issues result in work outside business hours.
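One way to picture the alarm threshold is the sketch below. The 1% paging threshold, the 9:00 to 17:00 weekday business hours, and the names `route_alarm` and `is_business_hours` are assumptions made for illustration; real values depend on the system's risk profile and the team's schedule.

```python
from datetime import datetime, time

# Hypothetical settings, chosen only for this example.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)
PAGE_THRESHOLD = 0.01  # page on-call when at least 1% of transactions fail


def is_business_hours(now: datetime) -> bool:
    # Monday to Friday (weekday() 0-4), between the start and end times.
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def route_alarm(failed: int, total: int, now: datetime) -> str:
    """Decide who, if anyone, hears about a batch of failures right now."""
    if total == 0 or failed == 0:
        return "ignore"
    if failed / total >= PAGE_THRESHOLD:
        # Widespread impact: wake someone up regardless of the hour.
        return "page_on_call"
    if is_business_hours(now):
        return "notify_team"
    # Only a few transactions are affected: hold it until the team is in.
    return "defer_to_business_hours"
```

Under these assumptions, five failures out of a million transactions at 3 a.m. are held until business hours, while a 5% failure rate pages the on-call engineer at any hour.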
