Saturday, August 17, 2013

Scaling to a Billion - Part V

(An expanded version of this article first published in Software Developer's Journal)

Scaling to a Billion (Part 5)

Nothing is Free

Optimizing engineering metrics is never sufficient. To support a high-transaction-value system, the architects must understand and optimize the metrics the business cares about. In some businesses, a low error rate is crucial; in others, low turnaround time. Some businesses care about average response time, while others cannot tolerate slower responses even for a small fraction of requests. The metrics measured - average latency or P90 latency (a measure of the experience of the worst 10% of requests), request error rate or downtime - must fit the business the company is in.
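As a quick sketch of why the choice of metric matters, here is the difference between average and P90 latency computed with Python's standard library (the latency numbers are made up for illustration):

```python
import statistics

def p90(latencies_ms):
    """90th-percentile latency: 90% of requests were at least this fast."""
    # quantiles(n=10) returns 9 cut points; the last one is the 90th percentile
    return statistics.quantiles(sorted(latencies_ms), n=10)[-1]

latencies = [12, 15, 14, 13, 200, 16, 12, 14, 250, 13]
print(f"average: {statistics.mean(latencies):.1f} ms")  # pulled up by two outliers
print(f"P90:     {p90(latencies):.1f} ms")              # exposes the slow tail
```

A system can have an excellent average while its P90 tells a very different story about the worst-served customers.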
And above all, understand your assumptions and state them clearly. Handling ‘five times the order volume’ may sound specific - but do your tests assume that each order has fewer than 10 items? That orders arrive at a constant rate over an hour rather than in a bursty manner due to batching in other systems? Or that other systems do not cause load or lock contention on the database? Misunderstanding your requirements or assumptions may result in perfectly engineered systems that do not help the business grow as planned.
Above all, remember that scaling a system is hard. Systems do not scale linearly, and in many cases, handling twice the load requires more than twice the computing resources, developer time, and stabilization period. Advantages of scale exist - but they take time to materialize.

About the Author

Yaniv Pessach is a software architect living in Bellevue, WA. Over the years he has worked for multiple S&P 500 companies as well as smaller firms, and he received his graduate degree from Harvard University, where his research focused on distributed systems. You can find more about Yaniv on his website, or contact him through his LinkedIn page.

Wednesday, August 14, 2013

Scaling to a Billion - Part IV

(An expanded version of this article first published in Software Developer's Journal)

Scaling to a Billion (Part 4)

Nothing is Free

Understanding architectural tradeoffs is important. For example, consider queuing solutions and an Enterprise Message Bus to enhance reliability or enable batching, but use direct Web Service calls when turnaround time matters.
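The tradeoff can be sketched in a few lines: a direct call gives the caller an answer immediately, while a queue decouples the caller from the worker and enables batching at the cost of turnaround time. This is only an in-memory illustration; a real deployment would use a durable broker such as MSMQ or a message bus:

```python
import queue

# Direct call: caller blocks for the result; lowest turnaround time.
def place_order_direct(order, process):
    return process(order)  # synchronous round trip

# Queued: caller returns immediately; a worker drains the queue in batches,
# trading turnaround time for reliability and throughput.
order_queue = queue.Queue()

def place_order_queued(order):
    order_queue.put(order)  # a durable broker would persist the message here

def drain_batch(process, max_batch=100):
    batch = []
    while not order_queue.empty() and len(batch) < max_batch:
        batch.append(order_queue.get())
    return [process(o) for o in batch]

place_order_queued({"id": 1})
place_order_queued({"id": 2})
results = drain_batch(lambda o: o["id"] * 10)
```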

Exceptional Service

With millions of transactions, each involving multiple touchpoints, the unexpected is the norm. Subtle bugs may impact only a small percentage of requests and escape testing; requirements analysis or implementation may miss some input combinations; and timing issues or hardware failures may cause unexpected states to manifest. All these unexpected issues require good, tiered exception handling.
Firstly, code must be constructed with error handling in mind. Exceptions need to be caught and handled, issues should be automatically isolated to the minimal set of related requests, and nothing should bring down the system or stop the processing train.
But even with the best coding practices in place, issues will crop up at the most inconvenient times. All large-scale systems require a tiered level of off-hours support (on-call), and the participation of trained engineers and programmers in the process. Alarm levels should be set appropriately to balance reducing system risk against support personnel workload. However, handling one-off errors and investigating suspected issues can waste precious time, cause employee dissatisfaction and work-life balance issues, and increase churn in the team. Best people practices include investing in continuously improving the system, automatically deferring issues that impact only a few transactions to business hours (e.g., by setting high thresholds for alarms), and allowing employees to receive comp time when on-call issues result in off-hours work.
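A minimal sketch of the 'nothing should stop the processing train' principle: each request is handled independently, and a failure is isolated and parked rather than aborting the batch (the request shape and handler are made up for illustration):

```python
import logging

def process_batch(requests, handle):
    """Process each request independently; a failure is isolated to that
    request (logged and parked for later review) instead of stopping the
    whole batch."""
    results, parked = {}, []
    for req in requests:
        try:
            results[req["id"]] = handle(req)
        except Exception:
            logging.exception("request %s failed; parking it", req["id"])
            parked.append(req)  # retried or escalated out of band
    return results, parked

reqs = [{"id": 1, "v": 2}, {"id": 2, "v": 0}, {"id": 3, "v": 5}]
results, parked = process_batch(reqs, lambda r: 10 // r["v"])
```

Here the second request fails (division by zero) but the third is still processed; the parked list feeds whatever retry or investigation tier is appropriate.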

Sunday, August 04, 2013

Scaling to a Billion - Part III

(An expanded version of this article first published in Software Developer's Journal)

Scaling to a Billion (Part 3)

Up Up and Away

When asked to scale a solution, the first concept that leaps to the minds of developers is momentary load. Measured in ‘requests per second’, scaling consists of handling more requests in the same time. But how many more? And when? It is likely that the load on your system varies through the day, between weekdays and weekends, and between rush times and ordinary days. For drugstore, like many other eCommerce retailers, Black Friday (and Cyber Monday) represented an annual peak. Yours may differ - but you should know when your requests hit max load, and what load you expect to handle. Backend requests can often be queued, but queuing introduces delays - your service development must be guided by your SLA (Service Level Agreement): is it OK to delay processing some requests for 15 seconds at peak times? How about 15 minutes?


A few tricks to help optimize systems quickly:
- Minimize data flow. Moving data around in memory or from disk takes time. Check the columns fetched in your SQL query and slim them. Verify that your database table only contains required rows, and that unnecessary (or old) rows are purged or archived. Consider covering indexes for common queries. And reduce the amount of data (such as unnecessary fields) passed in web service calls.
- Scaling horizontally means adding more machines. It is easier to design a system for horizontal scaling if your services are stateless. Soft state (cached data) is usually OK, but consider a shared or distributed cache if the same data ends up cached on multiple machines.
- Seek out and eliminate all single points of failure. The same search will likely identify some of your choke points - the servers that have to handle -every- request. Consider alternatives.
- Scale horizontally or partition your data. Either plan for many machines to process your workload at once, where each machine has access to the entire data, or divide your data into partitions, and have a separate set of machines process each. Either approach has pros and cons - understand your tradeoffs.
- Simplify your solutions. Complex solutions are hard to maintain or even get right. And they are harder to optimize.
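The 'minimize data flow' advice above can be illustrated with SQLite (table and column names are invented for the example): fetch only the columns you need, and add a covering index so common queries never touch the wide table rows at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, "
             "total REAL, shipping_blob TEXT)")
conn.executemany(
    "INSERT INTO orders (status, total, shipping_blob) VALUES (?, ?, ?)",
    [("open", 10.0, "x" * 100), ("shipped", 25.0, "y" * 100)])

# Fetch only the columns you need instead of SELECT *
rows = conn.execute(
    "SELECT id, total FROM orders WHERE status = 'open'").fetchall()

# A covering index lets this query be answered from the index alone,
# without reading the (wide) table rows. In SQLite the rowid (here, id)
# is stored in every index, so (status, total) covers the query.
conn.execute("CREATE INDEX idx_orders_status_total ON orders (status, total)")
plan = conn.execute("EXPLAIN QUERY PLAN SELECT id, total FROM orders "
                    "WHERE status = 'open'").fetchall()
```

Inspecting the query plan confirms SQLite answers the query from the covering index.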

Monday, July 15, 2013

Managing Software Architecture & Architects - Webinar

This live webinar is open (and free) to PMI (Project Management Institute) members only - contact me if you are not a PMI member but interested in the subject matter.

Live webinar capacity is limited to 1000 (on a first-join basis), but the session is recorded for later viewing.

Free registration at:

The PM Guide to Managing Architecture in Software Development Projects
Presenter: Yaniv Pessach,  PMP®, CSM

July 24, 2013 • 12 PM - 1 PM EDT (UTC-4)
Software architecture differs from other aspects of planning and executing a project in multiple ways. It is critical to the success of complex IT projects, but managing it is less understood than the management of requirements, implementation, and testing, and it involves unique difficulties such as requiring output from senior technical staff members who may be spread across multiple projects. In this presentation, different methods of managing the architecture process and working well with software architects are presented and discussed, relying on techniques from both waterfall and agile principles.
We will discuss:
  • What is software architecture and when should it take place
  • Common approaches to software architectures, and when to pursue them
  • The role of the PM in architecture
  • Difficulties in managing the creation of software architecture
  • Common architecture frameworks

Topics Covered in the Presentation
  • What is software architecture and when should it take place?
    • Architecture has a role in the Initiating, Planning, Executing, and even the Monitoring process groups
  • What are difficulties in managing the creation of software architecture?
    • Interdependence on emerging requirements and technical understanding
    • Working with senior ICs and partially-allocated resources
    • Lack of a crystal ball
  • What are common architecture frameworks?
    • Big Design Up Front
    • Emerging
    • Good Enough Architecture
    • Time Boxed Architecture
  • What are the similarities (and conflicts) between the role of the Architect and the role of the PM?
    • Coordination with multiple stakeholders
    • Overall project oversight

Key Takeaways

  • Why does software architecture differ from either planning or execution?
  • What are common approaches to software architectures, and when to pursue them?
  • What is the role of the PM in architecture?

Saturday, July 13, 2013

Scaling to a Billion - Part II

(An expanded version of this article first published in Software Developer's Journal)

Scaling to a Billion (Part 2)

Know Your Enemy

As the scale of your systems increases, the number and types of exceptional or unexpected behaviors increase. At the same time, your understanding of what the system is actually doing decreases.
As the system scales, there is a need for more complete and more actionable logging. Those two requirements are sometimes in conflict - the more verbose logs are, the harder they are to understand. To address this problem, invest in log filtering and analysis tools. 


One approach is to implement Structured Logging, where all logging statements have essentially the same format and are therefore machine-readable. A close cousin to logging is an effective monitoring system. Some events require human investigation and intervention, but involving the team can be distracting, demoralizing, and expensive. A good monitoring system requires a low false positive rate (do not alarm when nothing is wrong) and a very low false negative rate (do not miss alarms), but tuning alarm criteria to meet both of those goals is difficult. Every customer order is important, but it is impractical and demoralizing to wake up a developer at 2:00am to fix a single order, so monitoring systems must prioritize events based on the number of customers or requests impacted.
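One way to sketch Structured Logging with Python's standard library (the event and field names are invented for the example) is a formatter that emits every record in one machine-readable JSON shape:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit every log record in the same machine-readable shape."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "event": record.getMessage(),
            **getattr(record, "fields", {}),  # structured key/value payload
        })

logger = logging.getLogger("orders")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

# Emits: {"level": "WARNING", "event": "payment_retry", "order_id": 1234, "attempt": 3}
logger.warning("payment_retry",
               extra={"fields": {"order_id": 1234, "attempt": 3}})
```

Because every line is valid JSON with the same keys, filtering and aggregation tools can consume the logs directly instead of parsing free-form text.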


As the system grows, manual handling of exceptional events becomes less reasonable. A machine failure is a likely event in any fairly large cluster, so it should not require immediate human intervention. Failure of some less-trustworthy components should be treated similarly, and the system should maintain SLAs without immediate intervention. As cluster sizes grow, more classes of errors should fall into the ‘automatic handling’ category. Specifically, a failure of one request (or one type of request) should be automatically isolated so it does not impact future requests, or at least does not impact requests of a different type. If the system suffers from known issues (which may take weeks or months to address), the automatic handling system should make it easy to add appropriate mitigation. For example, if a new bug is introduced where purchase orders with more than 50 different products fail more often, but the bug is not immediately fixed - since those are rare, the team might want such orders (or such failing orders) to be automatically ‘parked’ and handled during normal business hours only, rather than triggering the existing alarms, since those failures are not an indication of a new problem.
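A toy sketch of this impact-based triage (the issue names and threshold are made up for illustration): page the on-call only when an issue impacts enough requests, and park everything else for business hours.

```python
def triage(failures, page_threshold=50):
    """Route each failure class by impact: page on-call only when enough
    requests are affected; otherwise park it for business-hours follow-up."""
    page, park = [], []
    for issue, impacted_count in failures.items():
        (page if impacted_count >= page_threshold else park).append(issue)
    return page, park

page, park = triage({"gateway_timeouts": 400,
                     "orders_over_50_products": 3})
```

A production system would key this on customer impact and SLA risk rather than a raw count, but the shape of the decision is the same.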

Another Blog!

I am opening/branching another blog for distributed computing (and distributed systems) research-related topics. I think that blog appeals to a different, more specialized audience - so the separation makes sense. And, not surprisingly.. the blog name is... drumroll...

Tuesday, July 02, 2013

Scaling to a Billion - Part I

(An expanded version of this article first published in Software Developer's Journal)

Scaling to a Billion (Part I)

In 2011, the late-stage startup I was with sold and fulfilled eCommerce orders at an annual rate of half a billion dollars. After being purchased by a major brick-and-mortar retailer, our backend fulfillment systems were enhanced to fulfill not only the orders coming from our own site, but also to handle most of the fulfillment of orders placed on the retailer's site. With multiple sources of orders, we had to be ready for considerably more orders placed and processed. As the Principal Engineer of the group responsible for Order Processing and Payment systems, I led the technical design of our systems toward the elusive and quite challenging goal of enabling our systems to support one billion dollars in annual sales. This article is about some of the lessons that the team and I learned in the process.

Reducing Uncertainty Under Pressure

When your application is directly responsible for revenue, you and your team will find yourselves constantly under a microscope. When issues occur, escalations are distracting and time consuming; therefore, part of the effort in designing a system must go toward minimizing outliers and possible escalations. A stable, reliable system may be preferable to a more efficient system that runs in fits and starts. When looking at performance and SLAs, it is not enough to minimize the average response or cycle time - minimizing the variance of those numbers is crucial.

(To Be Continued)

Sunday, May 26, 2013

Why I prefer to think of myself as a computer -scientist- :)

A guy wanted to know the volume of a red rubber ball.

First he took it to a mathematician, who measured its radius and used the formula V=4/3*pi*r^3 to find its volume.
Next, our guy went to a physicist, who immersed the ball in a bowl full of water. He then measured the amount of water which overflowed and calculated the volume of the ball.

Still not satisfied, our guy takes the ball to a mechanical engineer. The engineer says, "Wait a moment, I've got this." He gets up and skims through the books laid out on his shelf. "Ah, this should do it," he says, and pulls out a big fat hardbound book titled "The Mechanical Engineer's Handbook to Red Rubber Balls".

Sunday, May 19, 2013

Read my article in SDJ: Scaling to a Billion

Software Developer's Journal just published my article 'Scaling to a Billion', detailing my experience and tips in scaling an eCommerce backend system to be able to process a billion dollars in annual transactions.
Here's a quick outline:

  • Reducing Uncertainty Under Pressure
  • Know Your Enemy
  • Up Up and Away
  • Nothing Is Free
  • Exceptional Service
  • Final Words

more details to come...

Thursday, April 04, 2013

Simplified Software Development Failure Mode and Effect Analysis (FMEA)

Classically, FMEA consists of identifying the components of a system, enumerating their failure modes, the possible causes for each failure, risk, and mitigation.

The same principle applies in software design, but the types of failure modes tend to be very different, and may apply per component or per component interaction or call. An example may help.
Consider a system with a web frontend (web page) with a button, a business rules middleware, and database. Let's ignore multiple instances.
Frontend: Failure mode may be 'frontend is nonresponsive', 'frontend returns wrong html' etc.
Middleware (for the 'buy' interaction): Failure modes may be 'request times out', 'request non responsive', 'request fails', 'request in inconclusive state', 'request updates partial data', and 'request updates wrong data'.
For each of those there would be one or more possible causes. For example, 'request non responsive' may be due to 'middleware not running', 'database connectivity not available', or 'database query not responsive' - the latter is likely to have a fan-out of causes such as 'high request load' and 'table deadlocks'. A complete modeling would include impact, error rate, and detection rate as parameters for each cause and occurrence.
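One common way to prioritize the resulting list is a Risk Priority Number (RPN), the product of severity, occurrence, and detection ratings. A small sketch using the components above (the ratings themselves are invented for illustration):

```python
def rpn(severity, occurrence, detection):
    """Risk Priority Number: each factor rated 1-10; higher means riskier.
    Detection is rated high when the failure is HARD to detect."""
    return severity * occurrence * detection

failure_modes = [
    # (component, failure mode, cause, severity, occurrence, detection)
    ("middleware", "request non responsive", "database query not responsive", 8, 4, 3),
    ("middleware", "request updates wrong data", "table deadlocks", 9, 2, 7),
    ("frontend", "frontend returns wrong html", "template bug", 5, 3, 2),
]

# Address the highest-RPN failure modes first
ranked = sorted(failure_modes, key=lambda f: rpn(*f[3:]), reverse=True)
```

Note how 'request updates wrong data' outranks the more frequent 'request non responsive': it is both severe and hard to detect, which is exactly the kind of risk FMEA is meant to surface.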

Comparing Options with Pugh Matrix

This is pretty much the pros-and-cons comparison we all do intuitively, standardized.

Simply write down your 'base' option, and all other options as columns, and all relevant aspects as rows.
Then, for each feature/option combo, specify whether it is better (+) or worse (-) than the baseline, and by how much. The 'base' column is by definition all '0's.

Feature               Base   Option B   Option C
Transaction speed     0      +          0
Storage requirement   0
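The scoring step can be sketched in a few lines of Python; the +1/0/-1 ratings below are made up for illustration:

```python
def pugh_score(ratings):
    """Sum an option's +1/0/-1 ratings against the baseline."""
    return sum(ratings.values())

options = {
    "Base":     {"transaction speed": 0,  "storage requirement": 0},
    "Option B": {"transaction speed": +1, "storage requirement": -1},
    "Option C": {"transaction speed": 0,  "storage requirement": +1},
}

best = max(options, key=lambda name: pugh_score(options[name]))
```

In practice the rows are often weighted by importance before summing, but even the unweighted version makes the intuitive pros-and-cons comparison explicit and repeatable.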

Management of Failing Project - the Busywork Spiral

Managers, like all of us, respond to rewards, and both use and elicit signals.
In most cases, that process is benign, ensuring the managers of an organization are aligned (through incentives) with the organizational goals.
In failing projects, however, a curious phenomenon can be observed. As the project drifts further and further into late territory and employee time becomes more of a rare commodity, more (rather than less) time is consumed by managerial overhead - meetings, status reports, and similar artifacts.

One explanation for this phenomenon is the managers' reasonable desire not to appear neglectful. If the project is late and status was NOT collected, the manager is at fault, at least in the eyes of his superiors. If all controls were implemented, however, the responsible party is not as clear, and the manager may avoid being penalized personally for the project delay or failure.

And of course, since The Mythical Man-Month we have all known that adding people to a project mid-way will slow it down... and yet, twelve times out of every dozen projects, management will 'help' a delayed project with additional assigned resources.

Some of this is discussed in Why Software Projects are Terrible and How Not To Fix Them 

Innovation with Morphological Matrix and Copycatting

At the heart of the Morphological Matrix is the idea that each solution proposal is composed of solutions to sub-features. First, those are organized into a table, and then all the combinations can be explored.
For example:
Component       Option A     Option B          Option C
Communication   SOAP call    Queued/MSMQ       REST
Storage         Relational   Key-Value store   Flat files

With the Morphological Matrix approach, all combinations of features (3*3 in the example above) are explored and evaluated. This can be daunting, so my personal variation is 'copycatting': starting from each proposal, for each feature, consider the alternative implementations proposed in the competing solutions. If any of those are an improvement, adopt them. For example, when trying to improve Option A, analysis may show that using queued calls would improve system behavior. We then create option A', and 'copycat' that feature. We get:
Component       Option A     Option B          Option C     Option A'
Communication   SOAP call    Queued/MSMQ       REST         Queued/MSMQ
Storage         Relational   Key-Value store   Flat files   Relational

Both those approaches allow a methodical way of merging the best elements of competing approaches into a better solution.
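Enumerating the full morphological matrix is a one-liner with itertools, using the component options from the tables above:

```python
from itertools import product

design_options = {
    "communication": ["SOAP call", "Queued/MSMQ", "REST"],
    "storage": ["Relational", "Key-Value store", "Flat files"],
}

# Each candidate architecture is one choice per component: 3 * 3 = 9 combinations
candidates = [dict(zip(design_options, combo))
              for combo in product(*design_options.values())]

# 'Copycatting' Option A into A': keep A's storage, adopt B's communication
option_a_prime = {"communication": "Queued/MSMQ", "storage": "Relational"}
```

The copycatted Option A' is, of course, one of the nine combinations the full matrix would have surfaced; copycatting is just a cheaper, proposal-driven walk through the same space.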

Sunday, March 24, 2013

Startup funding ACID test

What does it take to get a deal funded (Henry H Wang) -
1/ Team. Have a team with the right skills and experience, and that gets along.
2/ Market. Address a huge market; addressing a small market is risky if you miss the perfect sweet spot.
3/ Technology. Your technology must be a barrier to entry, either by difficulty or existing IP.
4/ Customers. Can you name anyone who would want your product?
5/ Connections. Do you have any special connections to potential customers?
6/ Financials. Why would the numbers work?

Saturday, March 16, 2013

Why Good Ideas Fail

Having good ideas is one thing, but gaining the required support within a corporation is quite another.
In most organizations, you'd have to 'run the gauntlet' to get your idea accepted.
Some ideas to make this more likely:

  • Let your opposition speak. They will anyhow. Show respect for their dissenting voice.
  • Do not attack back
  • Expect the common ways your ideas will be attacked
    • Fear-mongering
    • Delay
    • Confusion
    • Ridicule
  • Expect the common not-so-innocent questions meant to derail you, including:
    • Another Problem: "Money is the only real issue"; "What about ?";
    • Inertia: "We're successful, why change?";"The problem is not that bad";"You're implying we've been failing"
    • Solution: "Your proposal goes too far/doesn’t go far enough.";"You’re abandoning our core values.";"No one else does this.";"People have too many concerns.";" It puts us on a slippery slope."
    • and more...
  • Be prepared!
 Interesting read @HBR @Forbes @cbsnews

Tuesday, March 12, 2013

SQL vs. NoSQL (part i)

When is SQL better than NoSQL? It is true that in large deployments, NoSQL may deliver better performance than SQL solutions. Also, most NoSQL solutions are free. On the flip side, NoSQL stores make reporting, alarming, and similar tasks much harder.

Sunday, March 10, 2013

More Innovative and Productive Meetings with the Six Thinking Hats system

The Six Thinking Hats system has been around for a while. It is described in this book by DeBono.
In short: conceptualize six distinct 'thinking modes' (metaphorical hats). Specifically ask the team to 'wear' one at a time, resulting in everyone reasoning from the same 'mode'.
The six hats are:

Gather Information (White)
Express Emotions and Intuition (Red)
Devil's Advocate/Negative Logic (Black)
Positive Logic/Benefits/Harmony (Yellow)
Creativity (Green)
Process Control/Organizing the Thinking (Blue)

Saturday, March 09, 2013

On short range thinking

This short Twitter exchange between me and Kent Beck, and especially his insightful response, says everything you need to know about managerial shortsightedness.

Yaniv Pessach @YanivPessach : @logosity @KentBeck Sometimes 'began solving it' and 'began making it worse' are indistinguishable to the naked eye.
Kent Beck @KentBeck : @YanivPessach true, because you're generally escaping a local optimum.

Friday, March 08, 2013

Also see my other blog on Wordpress : Stochastic Thoughts
Why Stochastic? Because in a stochastic system the next state is determined both by the system's predictable actions and by a random event. A bit like life. Or me.

Wednesday, March 06, 2013

Innovation Technique: SCAMPER

SCAMPER stands for
Substitute
Combine
Adapt
Modify (or Magnify)
Put (to other purpose)
Eliminate
Reverse (or Rearrange)

When you have several ideas in the same general area, trying each of those strategies to create variants may hit the nail on the head with a variant that makes sense or works well.