Thursday, April 04, 2013

Simplified Software Development Failure Mode and Effect Analysis (FMEA)

Classically, FMEA consists of identifying the components of a system, enumerating their failure modes, the possible causes for each failure, risk, and mitigation.

The same principle applies in software design, but the types failure modes tend to be very different, and may apply per component or per component interaction or call. An example may help.
Consider a system with a web frontend (web page) with a button, a business rules middleware, and database. Let's ignore multiple instances.
Frontend: Failure mode may be 'frontend is nonresponsive', 'frontend returns wrong html' etc.
Middleware: (for the 'buy' interaction): Failure modes may be 'request times out', 'requests non responsive', 'request fails', 'requests in inconclusive state', 'request updates partial data', and 'request updates wrong data'.
For each of those there would be one or more possible causes. for example 'request non responsive' may be due to 'middleware not running', 'database connectivity not available', or 'database query not responsive' - the latter is likely to have a fan out of causes such as 'high request load' and 'table deadlocks' etc. A complete modeling would include impact, error rate, and detection rate as parameters for each cause and occurance.

No comments: