Saturday, August 17, 2013

Scaling to a Billion - Part V

(An expanded version of this article first published in Software Developer's Journal)

Nothing is Free

Optimizing engineering metrics is never sufficient. To support a system with high transaction value, the architects must understand and optimize the metrics the business cares about. In some businesses, a low error rate is crucial; in others, low turnaround time. Some businesses care about average response time, while others cannot tolerate slow responses even for a small fraction of requests. The metrics measured - average latency or P90 latency (the latency that 90% of requests stay under, which reflects the experience of the slowest 10% of requests), request error rate or downtime - must fit the business the company is in.
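To make the difference between average and P90 latency concrete, here is a minimal sketch. The latency samples are made up for illustration, and `percentile` is a hypothetical helper using the nearest-rank definition; real monitoring systems may interpolate differently.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds: 85 fast requests, 15 slow ones.
latencies_ms = [20] * 85 + [400] * 15

mean_ms = statistics.mean(latencies_ms)
p90_ms = percentile(latencies_ms, 90)

print(f"average latency: {mean_ms} ms")
print(f"P90 latency:     {p90_ms} ms")
```

With these assumed samples, the average looks moderate while the P90 exposes the slow tail. A business that cannot tolerate slow responses for even a small fraction of requests would want to track the second number, not the first.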
Just as important, understand your assumptions and state them clearly. Handling ‘five times the order volume’ may sound specific - but do your tests assume that each order has fewer than 10 items? That orders arrive at a constant rate over an hour rather than in bursts caused by batching in other systems? Or that other systems do not cause load or lock contention on the database? Misunderstanding your requirements or assumptions may result in perfectly engineered systems that do not help the business grow as planned.
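The arrival-rate assumption can be illustrated with a small sketch. The order volume and the batching schedule below are invented for the example, and `peak_per_minute` is a hypothetical helper; the point is only that the same hourly volume can produce very different peak loads.

```python
from collections import Counter


def peak_per_minute(arrival_minutes):
    """Return the largest number of orders arriving in any single minute."""
    return max(Counter(arrival_minutes).values())


orders_per_hour = 6000  # assumed volume, identical in both scenarios

# Steady arrivals: orders spread evenly across all 60 minutes.
steady = [i % 60 for i in range(orders_per_hour)]

# Bursty arrivals: an upstream system batches orders and releases them
# at minutes 0, 15, 30, and 45 (an assumed schedule).
bursty = [(i % 4) * 15 for i in range(orders_per_hour)]

print("steady peak:", peak_per_minute(steady), "orders/minute")
print("bursty peak:", peak_per_minute(bursty), "orders/minute")
```

Under these assumptions, both scenarios handle the same number of orders per hour, yet the bursty one must absorb a peak that is 15 times higher. A load test that only models the steady case would say little about the batched one.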
And above all, remember that scaling a system is hard. Systems do not scale linearly, and in many cases, handling twice the load requires more than twice the computing resources, developer time, and stabilization period. Economies of scale exist - but they take time to materialize.

About the Author

Yaniv Pessach is a software architect living in Bellevue, WA. He has worked for multiple S&P 500 companies as well as smaller firms over the years, and received his graduate degree from Harvard University, where his research focused on distributed systems. You can find more about Yaniv on his website, or contact him through his LinkedIn page.
