Saturday, August 17, 2013

Scaling to a Billion - Part V

(An expanded version of this article first published in Software Developer's Journal)

Scaling to a Billion (Part 5)

Nothing is Free

Optimizing engineering metrics is never sufficient. To support a high transaction value system, the architects must understand and optimize the metrics the business cares about. In some businesses, a low error rate is crucial; in others, low turnaround time. Some businesses care about average response time, while others cannot tolerate slower responses even for a small fraction of requests. The metrics measured - average latency or P90 latency (a measure of the experience of the worst 10% of requests), request error rate or downtime - must fit the business the company is in.
And above all, understand your assumptions and state them clearly. Handling 'five times the order volume' may sound specific - but do your tests assume that each order has fewer than 10 items? That orders arrive at a constant rate over an hour rather than in bursts caused by batching in other systems? Or that other systems do not cause load or lock contention on the database? Misunderstanding your requirements or assumptions may result in perfectly engineered systems that do not help the business grow as planned.
Finally, remember that scaling a system is hard. Systems do not scale linearly, and in many cases, handling twice the load requires more than twice the computing resources, developer time, and stabilization period. Advantages of scale exist - but they take time to materialize.

About the Author

Yaniv Pessach is a software architect living in Bellevue, WA. Over the years he has worked for several S&P 500 companies as well as smaller firms, and he received his graduate degree from Harvard University, where his research focused on distributed systems. You can find more about Yaniv on his website, or contact him through his LinkedIn page.

Wednesday, August 14, 2013

Scaling to a Billion - Part IV

(An expanded version of this article first published in Software Developer's Journal)

Scaling to a Billion (Part 4)

Nothing is Free

Understanding architectural tradeoffs is important. For example, consider queuing solutions and an Enterprise Message Bus to enhance reliability or enable batching, but use direct Web Service calls when turnaround time matters.
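As a hedged sketch of that tradeoff (the class and function names here are illustrative, not any particular product's API): a direct call minimizes turnaround time but makes the caller wait out any downstream failure, while a queue decouples the caller and enables batching at the cost of latency.

```python
import queue

# Direct call: lowest turnaround time, but the caller waits and absorbs
# any downstream failure or slowdown synchronously.
def direct_call(order):
    return {"order": order, "status": "processed"}

# Queued: the caller returns immediately; work is deferred and batched,
# trading turnaround time for reliability and throughput.
class QueuedProcessor:
    def __init__(self):
        self.q = queue.Queue()

    def submit(self, order):
        self.q.put(order)  # returns at once; processing happens later

    def drain_batch(self, max_items=100):
        batch = []
        while not self.q.empty() and len(batch) < max_items:
            batch.append(self.q.get())
        return [direct_call(o) for o in batch]  # handle as one batch

qp = QueuedProcessor()
for order in ("A", "B", "C"):
    qp.submit(order)
print(len(qp.drain_batch()))  # 3
```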

Exceptional Service

With millions of transactions, each involving multiple touchpoints, the unexpected is the norm. Subtle bugs may impact only a small percentage of requests and escape testing; requirements analysis or implementation may miss some input combinations; and timing issues or hardware failures may cause unexpected states to manifest. All these unexpected issues require good, tiered exception handling.
Firstly, code must be constructed with error handling in mind. Exceptions need to be caught and handled, issues should be automatically isolated to the minimal set of related requests, and nothing should bring down the system or stop the processing train.
But even with the best coding practices in place, issues will crop up at the most inconvenient times. All large-scale systems require a tiered level of off-hours support (on-call), and the participation of trained engineers and programmers in the process. Alarm levels should be set to balance reducing system risk against support personnel workload. Handling one-off errors and investigating suspected issues can waste precious time, cause employee dissatisfaction and work-life balance problems, and increase churn in the team. Best people practices include investing in continuously improving the system, automatically deferring issues that impact only a few transactions to business hours (e.g., by setting high alarm thresholds), and allowing employees to receive comp time when on-call issues result in off-hours work.
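A minimal sketch of what "nothing brings down the processing train" can look like in code - the request shape and field names are assumptions for illustration: each failure is caught, recorded, and parked, and the loop continues.

```python
# One malformed request must not halt the batch.
def process(request):
    if request.get("items") is None:
        raise ValueError("malformed request")
    return {"id": request["id"], "status": "ok"}

def process_batch(requests):
    succeeded, parked = [], []
    for req in requests:
        try:
            succeeded.append(process(req))
        except Exception as exc:
            # Isolate the failure to this request; record it for later
            # (business-hours) investigation instead of stopping the train.
            parked.append({"id": req.get("id"), "error": str(exc)})
    return succeeded, parked

ok, bad = process_batch([
    {"id": 1, "items": [42]},
    {"id": 2},                  # malformed: triggers the handler
    {"id": 3, "items": []},
])
print(len(ok), len(bad))  # 2 1
```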

Sunday, August 04, 2013

Scaling to a Billion - Part III

(An expanded version of this article first published in Software Developer's Journal)

Scaling to a Billion (Part 3)

Up Up and Away

When asked to scale a solution, the first concept that leaps to the minds of developers is momentary load. Measured in 'requests per second', scaling consists of handling more requests in the same time. But how many more? And when? It is likely that the load on your system varies through the day, between weekdays and weekends, and between rush times and ordinary days. For drugstore, like many other eCommerce retailers, Black Friday (and Cyber Monday) represented an annual peak. Yours may differ - but you should know when your requests hit maximum load, and what load you expect to handle. Backend requests can often be queued, but queuing introduces delays - your service development must be guided by your SLA (Service Level Agreement): is it OK to delay processing some requests for 15 seconds at peak times? How about 15 minutes?


A few tricks to help optimize systems quickly:
- Minimize data flow. Moving data around, in memory or from disk, takes time. Check the columns fetched in your SQL queries and slim them. Verify that your database tables contain only required rows, and that unnecessary (or old) rows are purged or archived. Consider covering indexes for common queries. And reduce the amount of data (such as unnecessary fields) passed in web service calls.
- Scaling horizontally means adding more machines. It is easier to design a system for horizontal scaling if your services are stateless. Soft state (cached data) is usually OK, but consider a shared or distributed cache if the same data ends up cached on multiple machines.
- Seek out and eliminate all single points of failure. The same search will likely identify some of your choke points - the servers that have to handle -every- request. Consider alternatives.
- Scale horizontally or partition your data. Either plan for many machines to process your workload at once, where each machine has access to the entire data, or divide your data into partitions and have a separate set of machines process each. Each approach has pros and cons - understand your tradeoffs.
- Simplify your solutions. Complex solutions are hard to maintain or even get right. And they are harder to optimize.
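The partitioning point can be sketched as follows; the partition count and key choice are illustrative assumptions, not a recommendation for any specific system.

```python
import hashlib

# Route each order to one of N partitions by hashing a stable key, so each
# set of machines owns one slice of the data.
NUM_PARTITIONS = 8

def partition_for(order_id: str) -> int:
    # Use a stable digest (not Python's randomized built-in hash()) so the
    # routing decision survives process restarts.
    digest = hashlib.md5(order_id.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# The same key always lands on the same partition:
assert partition_for("order-12345") == partition_for("order-12345")
print(partition_for("order-12345") in range(NUM_PARTITIONS))  # True
```

Note that with simple modulo routing, changing NUM_PARTITIONS moves most keys; consistent hashing is a common refinement when repartitioning must be cheap.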

Monday, July 15, 2013

Managing Software Architecture & Architects - Webinar

This live webinar is open (and free) to PMI (Project Management Institute) members only - contact me if you are not a PMI member but interested in the subject matter.

Live webinar capacity is limited to 1000 (on a first-join basis), but the session is recorded for later viewing.

Free registration at:

The PM Guide to Managing Architecture in Software Development Projects
Presenter: Yaniv Pessach,  PMP®, CSM

July 24, 2013 • 12 PM - 1 PM EDT (UTC-4)
Software architecture differs from other aspects of planning and executing a project in multiple ways. It is critical to the success of complex IT projects, but managing it is less understood than the management of requirements, implementation, and testing, and it involves unique difficulties such as requiring output from senior technical staff members who may be spread across multiple projects. In this presentation, different methods of managing the architecture process and working well with software architects are presented and discussed, relying on techniques from both waterfall and agile principles.
We will discuss:
  • What is software architecture and when should it take place
  • Common approaches to software architectures, and when to pursue them
  • The role of the PM in architecture
  • Difficulties in managing the creation of software architecture
  • Common architecture frameworks

Topics Covered in the Presentation
  • What is software architecture and when should it take place?
    • Architecture has a role in the Initiating, Planning, Executing, and even the Monitoring process groups
  • What are difficulties in managing the creation of software architecture?
    • Interdependence on emerging requirements and technical understanding
    • Working with senior ICs and partially-allocated resources
    • Lack of a crystal ball
  • What are common architecture frameworks?
    • Big Design Up Front
    • Emerging
    • Good Enough Architecture
    • Time Boxed Architecture
  • What are the similarities (and conflicts) between the role of the Architect and the role of the PM?
    • Coordination with multiple stakeholders
    • Overall project oversight

Key Takeaways

  • Why does software architecture differ from either planning or execution?
  • What are common approaches to software architecture, and when should you pursue them?
  • What is the role of the PM in architecture?

Saturday, July 13, 2013

Scaling to a Billion - Part II

(An expanded version of this article first published in Software Developer's Journal)

Scaling to a Billion (Part 2)

Know Your Enemy

As the scale of your systems increases, the number and types of exceptional or unexpected behaviors increase. At the same time, your understanding of what the system is actually doing decreases.
As the system scales, there is a need for more complete and more actionable logging. Those two requirements are sometimes in conflict - the more verbose logs are, the harder they are to understand. To address this problem, invest in log filtering and analysis tools. 


One approach is to implement Structured Logging, where all logging statements have essentially the same format and are therefore machine-readable. A close cousin to logging is an effective monitoring system. Some events require human investigation and intervention, but involving the team can be distracting, demoralizing, and expensive. A good monitoring system requires a low false positive rate (do not alarm when nothing is wrong) and a very low false negative rate (do not miss alarms), but tuning alarm criteria to meet both of those goals is difficult. Every customer order is important, but it is impractical and demoralizing to wake up a developer at 2:00am to fix a single order, so monitoring systems must prioritize events based on the number of customers or requests impacted.
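A hedged sketch of Structured Logging using Python's standard logging module (the field names and formatter are illustrative, not a specific product): every record is emitted as one JSON object, so filtering and analysis tools can query fields instead of grepping free text.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render every log record as a single machine-readable JSON object."""
    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Attach structured fields passed via the `extra=` keyword.
        for key in ("order_id", "latency_ms"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

logger = logging.getLogger("orders")
logger.handlers.clear()
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.warning("slow order", extra={"order_id": "A17", "latency_ms": 950})
# emits one JSON line, e.g. {"level": "WARNING", "logger": "orders", ...}
```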


As the system grows, manual handling of exceptional events becomes less reasonable. A machine failure is a likely event in any fairly large cluster, so it should not require immediate human intervention. Failure of some less-trustworthy components should be treated similarly, and the system should maintain SLAs without immediate intervention. As cluster sizes grow, more classes of errors should fall into the 'automatic handling' category. Specifically, a failure of one request (or one type of request) should be automatically isolated so it does not impact future requests, or at least not requests of a different type. If the system suffers from known issues (which may take weeks or months to address), the automatic handling system should make it easy to add appropriate mitigation. For example, suppose a new bug is introduced where purchase orders with more than 50 different products fail more often, but the bug is not immediately fixed. Since those orders are rare, the team might want such orders (or such failing orders) to be automatically 'parked' and handled during normal business hours only, rather than trigger the existing alarms, since those failures are not an indication of a new problem.
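One way to sketch that impact-based triage (the threshold value and issue names are hypothetical, chosen only to illustrate the routing logic): page a human only when enough requests are affected, and park rare known-issue failures for business hours.

```python
PAGE_THRESHOLD = 50          # distinct requests impacted in the window
KNOWN_ISSUES = {"orders_with_50_plus_products"}

def route_failure(issue: str, impacted_requests: int) -> str:
    if issue in KNOWN_ISSUES:
        return "park"        # known issue: handle during business hours only
    if impacted_requests >= PAGE_THRESHOLD:
        return "page"        # wide impact: wake up the on-call engineer
    return "ticket"          # small impact: queue for the next business day

print(route_failure("orders_with_50_plus_products", 3))   # park
print(route_failure("db_connection_errors", 120))         # page
print(route_failure("single_order_glitch", 1))            # ticket
```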

Another Blog!

I am branching off another blog for distributed computing (and distributed systems) research-related topics. I think that blog appeals to a different, more specialized audience - so the separation makes sense. And, not surprisingly... the blog name is... drumroll...

Tuesday, July 02, 2013

Scaling to a Billion - Part I

(An expanded version of this article first published in Software Developer's Journal)

Scaling to a Billion (Part I)

In 2011, the late-stage startup I was with sold and fulfilled eCommerce orders at an annual rate of half a billion dollars. After we were purchased by a major brick-and-mortar retailer, our backend fulfillment systems were enhanced to fulfill not only the orders coming from our own site, but also most of the orders placed on the retailer's site. With multiple sources of orders, we had to be ready for considerably more orders placed and processed. As the Principal Engineer of the group responsible for Order Processing and Payment systems, I led the technical design of our systems toward that elusive and quite challenging goal of enabling them to support one billion dollars in annual sales. This article is about some of the lessons the team and I learned in the process.

Reducing Uncertainty Under Pressure

When your application is directly responsible for revenue, you and your team will find yourselves constantly under a microscope. When issues occur, escalations are distracting and time consuming; therefore, part of the effort in designing a system must go toward minimizing outliers and possible escalations. A stable, reliable system may be preferable to a more efficient system that runs in fits and starts. When looking at performance and SLAs, it is not enough to minimize the average response or cycle time - minimizing the variance of those numbers is crucial.

(To Be Continued)

Sunday, May 26, 2013

Why I prefer to think of myself as a computer -scientist- :)

A guy wanted to know the volume of a red rubber ball.

First he took it to a mathematician, who measured its radius and used the formula V=4/3*pi*r^3 to find its volume.
Next, our guy went to a physicist, who immersed the ball in a bowl full of water. He then measured the amount of water which overflowed and calculated the volume of the ball.

Still not satisfied, our guy takes the ball to a mechanical engineer. The engineer says, "Wait a moment, I got this." He gets up and skims through the books laid out on his shelf. "Ah, this should do it," he says, and pulls out a big fat hardbound book titled "The Mechanical Engineer's Handbook to Red Rubber Balls".

Sunday, May 19, 2013

Read my article in SDJ: Scaling to a Billion

Software Developer's Journal just published my article 'Scaling to a Billion', detailing my experience and tips in scaling an eCommerce backend system to process a billion dollars in annual transactions.
Here's a quick outline:

  • Reducing Uncertainty Under Pressure
  • Know Your Enemy
  • Up Up and Away
  • Nothing Is Free
  • Exceptional Service
  • Final Words

more details to come...

Thursday, April 04, 2013

Simplified Software Development Failure Mode and Effect Analysis (FMEA)

Classically, FMEA consists of identifying the components of a system, enumerating their failure modes, the possible causes for each failure, risk, and mitigation.

The same principle applies in software design, but the types of failure modes tend to be very different, and may apply per component or per component interaction or call. An example may help.
Consider a system with a web frontend (web page) with a button, a business rules middleware, and database. Let's ignore multiple instances.
Frontend: failure modes may be 'frontend is nonresponsive', 'frontend returns wrong HTML', etc.
Middleware (for the 'buy' interaction): failure modes may be 'request times out', 'request non-responsive', 'request fails', 'request in inconclusive state', 'request updates partial data', and 'request updates wrong data'.
For each of those there would be one or more possible causes. For example, 'request non-responsive' may be due to 'middleware not running', 'database connectivity not available', or 'database query not responsive' - the latter is likely to fan out into causes such as 'high request load', 'table deadlocks', etc. A complete model would include impact, error rate, and detection rate as parameters for each cause and occurrence.
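The bookkeeping above can be sketched with the classic FMEA Risk Priority Number (severity × occurrence × detection, each scored 1-10); the rows and scores below are illustrative, not measured values.

```python
from dataclasses import dataclass

@dataclass
class FailureCause:
    component: str
    failure_mode: str
    cause: str
    severity: int      # impact, 1 (minor) .. 10 (catastrophic)
    occurrence: int    # frequency, 1 (rare) .. 10 (constant)
    detection: int     # 1 (always caught) .. 10 (invisible)

    @property
    def rpn(self) -> int:
        # Risk Priority Number: mitigate the highest-RPN causes first.
        return self.severity * self.occurrence * self.detection

causes = [
    FailureCause("middleware", "request non-responsive",
                 "database query not responsive", 8, 4, 3),
    FailureCause("middleware", "request non-responsive",
                 "middleware not running", 9, 2, 2),
]

for c in sorted(causes, key=lambda c: c.rpn, reverse=True):
    print(c.cause, c.rpn)  # highest risk first
```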

Comparing Options with Pugh Matrix

This is pretty much the pros-and-cons comparison we all do intuitively, standardized.

Simply write down your 'base' option, and all other options as columns, and all relevant aspects as rows.
Then, for each feature/option combination, specify whether it is better (+) or worse (-) than the baseline, and by how much. The 'base' column, by definition, is all '0's.

Feature             | Base | Option B | Option C
Transaction speed   | 0    | +        | 0
Storage requirement | 0    |          |
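Totaling a Pugh matrix is then mechanical: each option's cells are scored +1/0/-1 relative to the base, and the column sums rank the options. A small sketch (the scores are illustrative):

```python
criteria = ["transaction speed", "storage requirement"]
options = {
    "Base":     {"transaction speed": 0,  "storage requirement": 0},
    "Option B": {"transaction speed": +1, "storage requirement": -1},
    "Option C": {"transaction speed": 0,  "storage requirement": +1},
}

# Sum each option's scores across all criteria.
totals = {name: sum(scores[c] for c in criteria)
          for name, scores in options.items()}
print(totals)  # {'Base': 0, 'Option B': 0, 'Option C': 1}
```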

Management of Failing Project - the Busywork Spiral

Managers, like all of us, respond to rewards, and both use and elicit signals.
In most cases, that process is benign, ensuring that the managers of an organization are aligned (through incentives) with the organizational goals.
In failing projects, however, a curious phenomenon can be observed. As the project drifts further into late territory and employee time becomes a scarcer commodity, more (rather than less) time is consumed by managerial overhead - meetings, status reports, and similar artifacts.

One explanation for this phenomenon is the manager's reasonable desire not to appear neglectful. If the project is late and status was NOT collected, he is at fault, at least in the eyes of his superiors. If all controls were implemented, however, the responsible party is not as clear, and the manager may avoid being penalized personally for the project's delay or failure.

And of course, since The Mythical Man-Month we have all known that adding people to a project midway will slow it down... and yet, twelve times out of every dozen projects, management will 'help' a delayed project with additional resources.

Some of this is discussed in Why Software Projects are Terrible and How Not To Fix Them 

Innovation with Morphological Matrix and Copycatting

At the heart of Morphological Matrix is the idea that each solution proposal is composed of solutions to sub-features. First, those are organized into a table, and then all the combinations may be explored.
For example:
Component     | Option A   | Option B    | Option C
Communication | SOAP call  | Queued/MSMQ | REST
Storage       | Relational | Key-Value store | Flat files

With the Morphological Matrix approach, all combinations of features (3*3 in the example above) are explored and evaluated. This can be daunting, so my personal variation is 'copycatting': starting from each proposal, for each feature, consider the alternative implementations proposed in the competing solutions. If any of those is an improvement, adopt it. For example, when trying to improve Option A, analysis may show that using queued calls would improve system behavior. We then create Option A' and 'copycat' that feature. We get:
Component     | Option A   | Option B    | Option C   | Option A'
Communication | SOAP call  | Queued/MSMQ | REST       | Queued/MSMQ
Storage       | Relational | Key-Value store | Flat files | Relational

Both those approaches allow a methodical way of merging the best elements of competing approaches into a better solution.
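The exhaustive exploration step can be sketched in a few lines: every combination of per-component options (3*3 = 9 above) is generated for evaluation.

```python
from itertools import product

matrix = {
    "Communication": ["SOAP call", "Queued/MSMQ", "REST"],
    "Storage": ["Relational", "Key-Value store", "Flat files"],
}

components = list(matrix)
# One dict per candidate solution, mapping each component to its chosen option.
combinations = [dict(zip(components, combo))
                for combo in product(*matrix.values())]
print(len(combinations))  # 9
print(combinations[0])    # {'Communication': 'SOAP call', 'Storage': 'Relational'}
```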

Sunday, March 24, 2013

Startup funding ACID test

What does it take to get a deal funded (Henry H Wang):
1/ Team. Have a team with the right skills and experience, and that gets along.
2/ Market. Address a huge market; addressing a small market is risky if you miss the perfect sweet spot.
3/ Technology. Your technology must be a barrier to entry, either by difficulty or by existing IP.
4/ Customers. Can you name anyone who would want your product?
5/ Connections. Do you have any special connections to potential customers?
6/ Financials. Why would the numbers work?

Saturday, March 16, 2013

Why Good Ideas Fail

Having good ideas is one thing, but gaining the required support within a corporation is quite another.
In most organizations, you'd have to 'run the gauntlet' to get your idea accepted.
Some ideas to make this more likely:

  • Let your opposition speak. They will anyhow. Show respect for their dissenting voice.
  • Do not attack back
  • Expect the common ways your ideas will be attacked
    • Fear-mongering
    • Delay
    • Confusion
    • Ridicule
  • Expect the common not-so-innocent questions meant to derail you, including:
    • Another Problem: "Money is the only real issue"; "What about ?";
    • Inertia: "We're successful, why change?";"The problem is not that bad";"You're implying we've been failing"
    • Solution: "Your proposal goes too far/doesn’t go far enough.";"You’re abandoning our core values.";"No one else does this.";"People have too many concerns.";" It puts us on a slippery slope."
    • and more...
  • Be prepared!
 Interesting read @HBR @Forbes @cbsnews

Tuesday, March 12, 2013

SQL vs. NoSQL (part i)

When is SQL better than NoSQL? It is true that in large deployments, NoSQL may deliver better performance than SQL solutions. Also, most NoSQL solutions are free. On the flip side, NoSQL stores make reporting, alarming, etc. much harder.

Sunday, March 10, 2013

More Innovative and Productive Meetings with the Six Thinking Hats system

The Six Thinking Hats system has been around for a while. It is described in this book by DeBono.
In short: conceptualize six distinct 'thinking modes' (metaphorical hats), and specifically ask the team to 'wear' one at a time, so that everyone reasons in the same 'mode'.
The six hats are:

Gather Information (White)
Express Emotions and Intuition (Red)
Devil's Advocate/Negative Logic (Black)
Positive Logic/Benefits/Harmony (Yellow)
Creativity (Green)
Manage the Process (Blue)

Saturday, March 09, 2013

On short range thinking

This short Twitter exchange between me and Kent Beck, and especially his insightful response, says everything you need to know about managerial shortsightedness.

Yaniv Pessach @YanivPessach : @logosity @KentBeck Sometimes 'began solving it' and 'began making it worse' are indistinguishable to the naked eye.
Kent Beck @KentBeck : @YanivPessach true, because you're generally escaping a local optimum.

Friday, March 08, 2013

Also see my other blog on Wordpress : Stochastic Thoughts
Why Stochastic? Because in a stochastic system the next state is determined both by the system's predictable actions and by a random event. A bit like life. Or me.

Wednesday, March 06, 2013

Innovation Technique: SCAMPER

SCAMPER stands for Substitute, Combine, Adapt, Modify, Put (to other purpose), Eliminate, and Reverse.

When you have several ideas in the same general area, trying each of those strategies to create variants may hit on a variant that makes sense or works well.

Tuesday, March 05, 2013

Innovation Technique: HIT matrix

HIT stands for Heuristic Ideation Technique.

The critical observation is that new products are often a combination of two (or more) of existing products.

How to:

  • Choose existing products (not too similar).
  • List the characteristics/features that make up each item.
  • Create a matrix. For each 2-feature combination, evaluate the benefits of combining the features.

E.g. Scratch resistant pan + Car paint leads to considering 'scratch resistant car paint'
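The matrix step can be sketched as follows; the products and features are illustrative, and the filter simply skips pairs drawn from the same product (the interesting combinations cross product boundaries).

```python
from itertools import combinations

# Feature -> the product it came from.
features = {
    "scratch resistant coating": "pan",
    "weatherproof paint": "car",
    "non-stick surface": "pan",
}

# Enumerate 2-feature combinations, skipping same-product pairs.
pairs = [p for p in combinations(features, 2)
         if features[p[0]] != features[p[1]]]
for a, b in pairs:
    print(f"combine '{a}' with '{b}'")
```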

Innovation Technique: E/R/A

Eliminate, Reason, Alternatives.

For an existing process or product, choose several important features or aspects, and answer:
1. Can we eliminate the feature?
Example: "credit card and signature are carried with you"
Usually - no.
2. Reason - why can't we eliminate it? What is the crucial benefit provided?
Example: "credit card possession and signature are used to authorize purchases"
3. Alternative(s): How else can we achieve that goal?
Example: "use a fingerprint to identify the card carrier"

Doing so for multiple features may hint at areas of innovation.

Monday, March 04, 2013

What is 'Rude Q&A'

Rude Q&A is a process I commonly encountered at Microsoft.
The idea is to prepare for difficult and possibly hostile questions as part of the review of a product, design, or presentation.
Ask yourself: what are the meanest, smartest people going to ask you?

Rude Q&A is not just about testing the quality of your ideas. Rude Q&A forces you to think through your feature set, requirements, customers, and market assumptions.


  • If you introduce 'Rude Q&A' into someone else's design review preparation, be sure to explain the "rules of the game" so as not to surprise bystanders.
  • You may want to reserve part of the Q&A time for 'Rude Q&A' after other questions have been answered.
  • Make sure to include questions that are unfair or based on erroneous information.
  • The hardest part can be coming up with the questions.

The challenges of a software API Design

Designing APIs poses multiple challenges, among them:
  • APIs are targeted at developers
  • APIs have 'stickiness'. A bad API is a nightmare for years as many internal and external apps come to depend on it.
  • APIs are shared by many applications, so problems can impact multiple applications (differently)
  • APIs must be discoverable, intuitive, and use consistent conventions
  • Versioning and backwards compatibility on change
  • Good (and updated/correct) documentation is required
  • Likewise, automated testing is required

But there are benefits to exposing APIs, even internally:

  • hide implementation
  • reuse code
  • reduce duplication
  • easier to optimize
Exposing APIs externally has many market benefits that we'll discuss separately. Just note that all major platforms (Facebook, Linkedin, Windows/Win32) are really APIs; and being a 'platform' can be very beneficial to a provider.
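On the 'stickiness' point, one common mitigation (a general practice, sketched here with hypothetical function and parameter names) is to evolve an API only in backward-compatible ways: add optional parameters with safe defaults rather than changing required signatures.

```python
def get_orders(customer_id, include_cancelled=False, page_size=None):
    """v2 of a hypothetical API. v1 callers pass only customer_id and see
    unchanged behavior; new callers opt in to the new parameters."""
    orders = [
        {"id": 1, "customer": customer_id, "status": "shipped"},
        {"id": 2, "customer": customer_id, "status": "cancelled"},
    ]
    if not include_cancelled:
        # Preserve the original (v1) default: cancelled orders are hidden.
        orders = [o for o in orders if o["status"] != "cancelled"]
    if page_size is not None:
        orders = orders[:page_size]
    return orders

print(len(get_orders("c1")))                          # 1 (v1 behavior intact)
print(len(get_orders("c1", include_cancelled=True)))  # 2
```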

Saturday, March 02, 2013

Good Writing using the Madman Architect Carpenter Judge system

Write in four phases:

- Madman: jot down all ideas.
- Architect: look at your ideas, try to think of a sensible order, and identify three main points - now you have an outline.
- Carpenter: rapidly write paragraphs in support of the outline; don't edit in this phase.
- Judge: review everything you wrote and try to judge it. How would someone who is not friendly to me look at it? Would it look self-serving? Insincere? Am I making unjustified claims? Etc.

HT to guide to better business writing

Optimize Your Life with Lean Software Development

Lean (a concept in manufacturing optimization, and now in software development) has some general 'life-lessons', focusing on "Add Nothing But Value - eliminate waste".

* Eliminate Overproduction
In software, unneeded features.
In life - eliminate hoarding, interruptions, unneeded focus? Or is this stretching the analogy too far?

Friday, March 01, 2013

Innovation Technique: Job Scoping

For a specific Job To Be Done, try to answer:
1. What is a broader problem, and why is it important?
2. What is a narrower problem, and what is the barrier that makes this narrower problem important to the JTBD?

Do those answers provide any useful insights?

Internal vs. External Quality

Internal vs external quality OR why software design matters.
Internal quality (known bugs, code quality, design, extensibility, refactoring) doesn't matter to the customer NOW. The customer may not be able to conduct any (black-box) test to determine the internal quality of the software anyhow, so it may as well not exist.
So... marketing/management may be tempted to concur.
They would be wrong... because, over time, code quality impacts bug rates, performance, the cost of developing new features, and time to market for new features. The current code base carries debt. Like all debt, you can pay off the principal early or keep paying interest on an ongoing basis. Either decision is fine in some circumstances, as long as it is the right decision for the organization.
==> It is most important to keep the changing and critical areas 'clean'.

Sources of tech debt
=> some debt is taken on by design
==> some by being reckless
===> and some debt is only visible in retrospect, as we learn more about the problem space.

Predicting the next new song

A few interesting questions in the cross of 'big data' and music:
1/ Predicting the next song a -particular- user will like. (e.g. Pandora)
2/ Predicting the next song that will become popular
Last I heard, Pandora uses analysis of the music itself. But there is so much data out there:
* The singer
* The lyrics
* Social networking and Google mentions
* Radio plays
* The label (a valid datapoint)

Thursday, February 28, 2013

Innovation technique: Nine Windows

Consider the matrix {past, present, future} x {supersystem, system, subsystem}. Fill in the center (present, system). Complete the grid. Observe for insights.

Innovation Technique: Outcome Expectation

1. Identify Jobs To Be Done.
2. Identify 4 quadrants: desired/undesired outcomes to client/provider.
3. Create an outcome statement: include the direction of action (minimize/maximize), unit (time/cost/etc.), object, and context. E.g.: "Decrease the likelihood of delayed customer orders."
Software industry relevance: often there are 3 players: the end user (functionality), the provider as a financial entity (monetization), and the provider as a technical entity (design integrity). You need to recognize those disparate needs and identify each role/entity's preferences. Based in part on The Innovator's Toolkit.

Wednesday, February 27, 2013

Innovation Technique: JTBD (Jobs To Be Done)

Jobs To Be Done: instead of focusing on what you are doing, focus on the jobs your customers (internal or external) are trying to accomplish. Instead of improving the lawnmowers you are manufacturing, maybe you should genetically engineer grass? (The job to be done: keep the lawn looking tidy.) Software industry relevance: similar to User Stories, but at a much higher level than most people use User Stories. Based in part on The Innovator's Toolkit.

Saturday, February 23, 2013

Twitter - at last.

For short and succinct words of wisdom... use a dictionary. For everything else, there's @yanivpessach

Saturday, February 16, 2013

Distributed Storage eBook available

Good News Everyone! My 'Distributed Storage: Concepts, Algorithms, and Implementation' eBook is now available on Amazon. The eBook is a short, academic-level introduction to the topic.