Thursday, April 04, 2013

Simplified Software Development Failure Mode and Effect Analysis (FMEA)

Classically, FMEA consists of identifying the components of a system and enumerating their failure modes, the possible causes of each failure, the associated risk, and its mitigation.

The same principle applies in software design, but the types of failure modes tend to be very different, and they may apply per component or per component interaction (call). An example may help.
Consider a system with a web frontend (a web page with a button), a business-rules middleware, and a database. For simplicity, let's ignore multiple instances.
Frontend: failure modes may include 'frontend is non-responsive', 'frontend returns wrong HTML', etc.
Middleware (for the 'buy' interaction): failure modes may include 'request times out', 'request non-responsive', 'request fails', 'request in inconclusive state', 'request updates partial data', and 'request updates wrong data'.
Each of those has one or more possible causes. For example, 'request non-responsive' may be due to 'middleware not running', 'database connectivity not available', or 'database query not responsive'; the latter is likely to fan out into further causes such as 'high request load' and 'table deadlocks'. A complete model would include impact, error rate, and detection rate as parameters for each cause and occurrence.
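A minimal sketch of how the 'request non-responsive' causes above could be captured as an FMEA worksheet. It assumes, as in classic FMEA, that impact, error rate (occurrence), and detection are each scored 1-10, with a higher detection score meaning the failure is *harder* to detect, and that causes are ranked by the product of the three (the Risk Priority Number). The `Cause` class, its field names, and every score below are hypothetical illustrations, not measured values.

```python
from dataclasses import dataclass


@dataclass
class Cause:
    """One possible cause of a failure mode, scored 1-10 on each axis (hypothetical scale)."""
    component: str
    failure_mode: str
    cause: str
    impact: int      # severity if the failure occurs
    error_rate: int  # how often the cause is expected to occur
    detection: int   # 10 = very hard to detect, 1 = almost certainly detected

    @property
    def rpn(self) -> int:
        # Classic FMEA Risk Priority Number: severity x occurrence x detection.
        return self.impact * self.error_rate * self.detection


# Illustrative rows for the middleware 'buy' interaction; all scores are made up.
worksheet = [
    Cause("middleware", "request non-responsive", "middleware not running", 9, 2, 2),
    Cause("middleware", "request non-responsive", "database connectivity not available", 9, 3, 3),
    Cause("middleware", "request non-responsive",
          "database query not responsive: high request load", 6, 5, 4),
    Cause("middleware", "request non-responsive",
          "database query not responsive: table deadlocks", 7, 3, 7),
]


def ranked(rows: list[Cause]) -> list[Cause]:
    """Return causes ordered from highest to lowest RPN, i.e. what to mitigate first."""
    return sorted(rows, key=lambda c: c.rpn, reverse=True)


for row in ranked(worksheet):
    print(f"{row.rpn:4d}  {row.failure_mode}: {row.cause}")
```

With these made-up scores the table deadlocks rank first: they are not the most severe cause, but they are the hardest to detect.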

Comparing Options with a Pugh Matrix

This is essentially the pros-and-cons comparison we all do intuitively, in standardized form.

Simply write down your 'base' option and all other options as columns, and all relevant aspects as rows.
Then, for each aspect/option combination, specify whether the option is better (+) or worse (-) than the base, and by how much. By definition, the base scores 0 on every row.

Feature             | Base | Option B | Option C
Transaction speed   | 0    | +        | 0
Storage requirement | 0    |          |
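The same bookkeeping can be sketched in a few lines. This assumes that "by how much" is expressed as an integer from -2 (much worse) to +2 (much better) relative to the base, and that options are compared by their plain, unweighted sum. The criteria names follow the table; the numeric scores are hypothetical.

```python
# Pugh matrix: rows are criteria, columns are options; every score is relative to the base.
# The -2..+2 scale and all scores below are hypothetical illustrations.
BASE = "Base"

matrix = {
    "Transaction speed":   {"Option B": +1, "Option C": 0},
    "Storage requirement": {"Option B": -2, "Option C": +1},
}


def totals(matrix: dict[str, dict[str, int]]) -> dict[str, int]:
    """Sum each option's scores across all criteria; the base is 0 by definition."""
    result = {BASE: 0}
    for scores in matrix.values():
        for option, score in scores.items():
            if not -2 <= score <= 2:
                raise ValueError(f"score {score} for {option} is outside -2..+2")
            result[option] = result.get(option, 0) + score
    return result


print(totals(matrix))  # {'Base': 0, 'Option B': -1, 'Option C': 1}
```

Here Option C would come out ahead of the base and Option B behind it. A weighted variant would multiply each row by an importance factor before summing.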

Managing a Failing Project - the Busywork Spiral

Managers, like all of us, respond to rewards, and both use and elicit signals.
In most cases, that process is benign, ensuring that an organization's managers are aligned (through incentives) with the organizational goals.
In failing projects, however, a curious phenomenon can be observed. As the project drifts further into late territory and employee time becomes scarcer, more (rather than less) time is consumed by managerial overhead: meetings, status reports, and similar artifacts.

One explanation for this phenomenon is the manager's reasonable desire not to appear neglectful. If the project is late and status was NOT collected, he is at fault, at least in the eyes of his superiors. If all controls were implemented, however, the responsible party is less clear, and the manager may avoid being penalized personally for the project's delay or failure.

And of course, ever since The Mythical Man-Month we all know that adding people to a project mid-way will slow it down. And yet, twelve times out of every dozen, management will 'help' a delayed project by assigning it more resources.

Some of this is discussed in Why Software Projects are Terrible and How Not To Fix Them.

Innovation with the Morphological Matrix and Copycatting

At the heart of the Morphological Matrix is the idea that each solution proposal is composed of solutions to sub-features. First, those sub-solutions are organized into a table; then all of their combinations can be explored.
For example:
Component     | Option A   | Option B        | Option C
Communication | SOAP call  | Queued/MSMQ     | REST
Storage       | Relational | Key-Value store | Flat files
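Exploring the matrix exhaustively amounts to taking the Cartesian product of each component's options. A minimal sketch using the options from the table; the dictionary layout is just one convenient representation, not part of the method itself.

```python
from itertools import product

# Morphological matrix: each component maps to its candidate implementations.
matrix = {
    "Communication": ["SOAP call", "Queued/MSMQ", "REST"],
    "Storage": ["Relational", "Key-Value store", "Flat files"],
}


def all_designs(matrix: dict[str, list[str]]) -> list[dict[str, str]]:
    """Return every candidate design that picks exactly one option per component."""
    components = list(matrix)
    return [dict(zip(components, choice)) for choice in product(*matrix.values())]


designs = all_designs(matrix)
print(len(designs))  # 9
print(designs[0])    # {'Communication': 'SOAP call', 'Storage': 'Relational'}
```

Adding a third component with three options would already raise the count to 27.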

With the Morphological Matrix approach, all combinations of features (3 × 3 = 9 in the example above) are explored and evaluated. This can be daunting, so my personal variation is 'copycatting': starting from each proposal, for each feature, consider the alternative implementations proposed in the competing solutions, and adopt any that is an improvement. For example, when trying to improve Option A, analysis may show that using queued calls would improve system behavior. We then create Option A' and 'copycat' that feature. We get:
Component     | Option A   | Option B        | Option C   | Option A'
Communication | SOAP call  | Queued/MSMQ     | REST       | Queued/MSMQ
Storage       | Relational | Key-Value store | Flat files | Relational
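Copycatting can be read as a greedy, one-feature-at-a-time improvement: start from one proposal, try each competitor's choice for a single component, and keep it only if the evaluation improves. The sketch assumes a hypothetical `evaluate` function backed by made-up per-choice scores; in practice the evaluation would be the actual design analysis. With these scores it reproduces Option A' from the table.

```python
# Proposals from the morphological matrix: component -> chosen implementation.
proposals = {
    "Option A": {"Communication": "SOAP call", "Storage": "Relational"},
    "Option B": {"Communication": "Queued/MSMQ", "Storage": "Key-Value store"},
    "Option C": {"Communication": "REST", "Storage": "Flat files"},
}

# Hypothetical desirability scores per choice (higher is better); stand-ins for real analysis.
SCORES = {
    "SOAP call": 1, "Queued/MSMQ": 3, "REST": 2,
    "Relational": 3, "Key-Value store": 2, "Flat files": 1,
}


def evaluate(design: dict[str, str]) -> int:
    """Hypothetical evaluation: the sum of the per-choice scores."""
    return sum(SCORES[choice] for choice in design.values())


def copycat(start: str, proposals: dict[str, dict[str, str]]) -> dict[str, str]:
    """Improve one proposal by adopting competitors' choices, one feature at a time."""
    design = dict(proposals[start])  # copy, so the original proposal is left untouched
    for component in list(design):
        for name, other in proposals.items():
            if name == start:
                continue
            candidate = {**design, component: other[component]}
            if evaluate(candidate) > evaluate(design):
                design = candidate  # the alternative is an improvement: adopt it
    return design


print(copycat("Option A", proposals))
# {'Communication': 'Queued/MSMQ', 'Storage': 'Relational'}
```

Because each feature is judged on its own, this is cheaper than scoring every combination, but it can miss improvements that only appear when two features change together.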

Both techniques provide a methodical way to merge the best elements of competing proposals into a better solution.