Saturday, August 17, 2013

Scaling to a Billion - Part V

(An expanded version of this article first published in Software Developer's Journal)

Scaling to a Billion (Part 5)

Nothing is Free

Optimizing engineering metrics is never sufficient. To support a high-transaction-value system, the architects must understand and optimize the metrics the business cares about. In some businesses, a low error rate is crucial; in others, low turnaround time. Some businesses care about average response time, while others cannot tolerate slower responses even for a small fraction of requests. The metrics you measure - average latency or P90 latency (a measure of the experience of the worst 10% of requests), request error rate or downtime - must fit the business the company is in.
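The gap between average and P90 latency is easy to see with a small sketch. The numbers below are invented for illustration: most requests are fast, but a slow tail drags down the worst 10%. Using a simple nearest-rank percentile:

```python
import math

def percentile(samples, p):
    """Return the p-th percentile (0-100) of samples using nearest-rank on sorted data."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical latency samples in milliseconds: 85% fast, a 15% slow tail.
latencies_ms = [20] * 85 + [500] * 15

average = sum(latencies_ms) / len(latencies_ms)  # 92 ms - looks tolerable
p90 = percentile(latencies_ms, 90)               # 500 ms - the tail the average hides

print(f"average: {average:.0f} ms, P90: {p90} ms")
```

An average of 92 ms sounds healthy, yet one request in seven waits half a second - which metric matters depends on the business.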
And above all, understand your assumptions and state them clearly. Handling ‘five times the order volume’ may sound specific - but do your tests assume that each order has fewer than 10 items? That orders arrive at a constant rate over an hour, rather than in bursts caused by batching in other systems? Or that other systems do not cause load or lock contention on the database? Misunderstanding your requirements or assumptions may result in a perfectly engineered system that does not help the business grow as planned.
Finally, remember that scaling a system is hard. Systems do not scale linearly, and in many cases handling twice the load requires more than twice the computing resources, developer time, and stabilization period. Advantages to scale exist - but they take time to materialize.

About the Author

Yaniv Pessach is a software architect living in Bellevue, WA. Over the years he has worked for several S&P 500 companies as well as smaller firms, and he received his graduate degree from Harvard University, where his research focused on distributed systems. You can find more about Yaniv on his website, www.yanivpessach.com, or contact him through his LinkedIn page at www.linkedin.com/in/yanivpessach

Wednesday, August 14, 2013

Scaling to a Billion - Part IV

(An expanded version of this article first published in Software Developer's Journal)

Scaling to a Billion (Part 4)

Nothing is Free

Understanding architectural tradeoffs is important. For example, consider queuing solutions or an Enterprise Message Bus to enhance reliability or enable batching, but use direct Web Service calls when turnaround time matters.
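A minimal sketch of that tradeoff, with hypothetical service names (charging a card, sending a receipt email) standing in for your real dependencies: the latency-sensitive call stays synchronous, while the reliability-oriented work goes through a queue and is drained in batches.

```python
from queue import Queue

def charge_card_direct(order):
    """Latency-sensitive path: call the downstream service synchronously,
    so the caller learns the outcome immediately."""
    return {"order": order, "status": "charged"}  # stand-in for a real web service call

email_queue = Queue()

def enqueue_receipt_email(order):
    """Reliability/batching path: hand off to a queue, so a slow mail
    server never delays checkout and retries are possible."""
    email_queue.put(order)

def drain_email_batch(max_batch=100):
    """A background worker drains the queue in batches."""
    batch = []
    while not email_queue.empty() and len(batch) < max_batch:
        batch.append(email_queue.get())
    return batch  # in a real system, this batch goes to the mail service

result = charge_card_direct("order-1")
enqueue_receipt_email("order-1")
enqueue_receipt_email("order-2")
batch = drain_email_batch()
```

In production the in-process `Queue` would be a durable broker or message bus, but the shape of the decision - synchronous where latency matters, queued where reliability and batching matter - is the same.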

Exceptional Service

With millions of transactions, each involving multiple touchpoints, the unexpected is the norm. Subtle bugs may impact only a small percentage of requests and slip through testing; requirement analysis or implementation may miss some input combinations; and timing issues or hardware failures may cause unexpected states to manifest. All these unexpected issues require good, tiered exception handling.
First, code must be constructed with error handling in mind. Exceptions need to be caught and handled, issues should be automatically isolated to the minimal set of related requests, and nothing should bring down the system or stop the processing train.
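One common shape for that isolation, sketched with an invented handler: process each request independently, record failures, and keep the batch moving.

```python
def process_batch(requests, handler):
    """Process each request independently: a failure is isolated to the
    offending request and recorded, rather than aborting the whole batch."""
    succeeded, failed = [], []
    for req in requests:
        try:
            succeeded.append(handler(req))
        except Exception as exc:  # catch, record, and keep the train moving
            failed.append((req, exc))
    return succeeded, failed

def handler(req):
    """Hypothetical handler: rejects negative ids, doubles the rest."""
    if req < 0:
        raise ValueError("negative id")
    return req * 2

ok, bad = process_batch([1, -2, 3], handler)
# ok == [2, 6]; only the offending request (-2) lands in `bad` for later triage
```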
But even with the best coding practices in place, issues will crop up at the most inconvenient times. All large-scale systems require a tiered level of off-hours support (on-call), with trained engineers and programmers participating in the process. Alarm levels should be set to balance system risk against support personnel workload: handling one-off errors and investigating suspected issues can waste precious time, cause employee dissatisfaction and work-life balance problems, and increase churn in the team. Good people practices include investing in continuously improving the system, automatically deferring issues that impact only a few transactions to business hours (e.g., by setting high thresholds for alarms), and allowing employees to receive comp time when on-call issues result in off-hours work.
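The "high threshold" idea can be sketched in a few lines. The threshold values here are illustrative, not recommendations - tune them to your traffic and your SLA:

```python
def should_page(error_count, total_count, min_errors=50, min_rate=0.01):
    """Page on-call only when errors are both numerous in absolute terms
    and a meaningful fraction of traffic; anything below that is logged
    and deferred to business hours. Thresholds are illustrative."""
    if total_count == 0:
        return False
    rate = error_count / total_count
    return error_count >= min_errors and rate >= min_rate

# 3 one-off failures out of 100,000 requests: log it, triage tomorrow.
assert should_page(3, 100_000) is False
# 500 failures at a 5% error rate: wake someone up.
assert should_page(500, 10_000) is True
```

Requiring both an absolute count and a rate avoids paging on a handful of stragglers during quiet hours, while still catching a genuine outage quickly.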

Sunday, August 04, 2013

Scaling to a Billion - Part III

(An expanded version of this article first published in Software Developer's Journal)

Scaling to a Billion (Part 3)

Up Up and Away

When asked to scale a solution, the first concept that leaps to a developer's mind is momentary load. Measured in ‘requests per second’, scaling consists of handling more requests in the same amount of time. But how many more? And when? The load on your system likely varies through the day, between weekdays and weekends, and between rush periods and ordinary days. For drugstore, like many other eCommerce retailers, Black Friday (and Cyber Monday) represented an annual peak. Yours may differ - but you should know when your requests hit maximum load, and what load you expect to handle. Backend requests can often be queued, but queuing introduces delays - your service design must be guided by your SLA (Service Level Agreement): is it OK to delay processing some requests for 15 seconds at peak times? How about 15 minutes?
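A back-of-the-envelope sketch can connect peak load, capacity, and that SLA question. All the numbers below are hypothetical: during a burst that exceeds capacity, a backlog builds up at the difference between arrival rate and service rate; once the burst ends, the last request in line waits roughly backlog divided by capacity.

```python
def peak_queue_delay_s(peak_rps, capacity_rps, burst_s):
    """Rough sizing estimate: a burst of `burst_s` seconds at `peak_rps`
    against `capacity_rps` of capacity builds a backlog of
    (peak - capacity) * burst_s requests; draining it at full capacity
    takes backlog / capacity seconds for the last request queued."""
    backlog = max(0.0, (peak_rps - capacity_rps) * burst_s)
    return backlog / capacity_rps

# Hypothetical Cyber Monday burst: 60 seconds at 1,500 rps against
# 1,000 rps of capacity -> 30,000 queued requests -> ~30 s worst-case delay.
delay = peak_queue_delay_s(1500, 1000, 60)
sla_s = 15
print(f"worst-case delay {delay:.0f}s; meets {sla_s}s SLA: {delay <= sla_s}")
```

If a 30-second delay violates the SLA, the options fall out directly: add capacity, shorten the burst (smooth upstream batching), or negotiate the SLA.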

Tips

A few tricks to help optimize systems quickly:
- Minimize data flow. Moving data around, in memory or from disk, takes time. Check the columns fetched in your SQL queries and trim them. Verify that your database tables contain only required rows, and that unnecessary (or old) rows are purged or archived. Consider covering indexes for common queries. And reduce the amount of data (such as unnecessary fields) passed in web service calls.
- Scaling horizontally means adding more machines. It is easier to design a system for horizontal scaling if your services are stateless. Soft state (cached data) is usually OK, but consider a shared or distributed cache if the same data ends up cached on multiple machines.
- Seek out and eliminate all single points of failure. The same search will likely identify some of your choke points - the servers that have to handle -every- request. Consider alternatives.
- Scale horizontally or partition your data. Either plan for many machines to process your workload at once, where each machine has access to the entire data, or divide your data into partitions, and have a separate set of machines process each. Each approach has pros and cons - understand your tradeoffs.
- Simplify your solutions. Complex solutions are hard to maintain or even get right. And they are harder to optimize.
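The partitioning option above hinges on every machine agreeing, without coordination, on which partition owns which key. A minimal sketch of stable hash partitioning, with an invented partition count and key format:

```python
import hashlib

NUM_PARTITIONS = 8  # illustrative; real systems often use many more

def partition_for(key: str) -> int:
    """Map a key to a partition with a stable hash (MD5 here, chosen only
    because it is deterministic across processes and machines, unlike
    Python's built-in hash())."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Every machine computes the same partition for the same order id,
# so the set of machines owning that partition handles all its requests.
p = partition_for("order-12345")
assert p == partition_for("order-12345")  # stable across calls and hosts
assert 0 <= p < NUM_PARTITIONS
```

Note the simplicity tradeoff from the last tip: plain modulo is easy to get right, but changing `NUM_PARTITIONS` remaps most keys - schemes like consistent hashing fix that at the cost of complexity.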