Monday, July 15, 2013

Managing Software Architecture & Architects - Webinar

This live webinar is open (and free) to PMI (Project Management Institute) members only - contact me if you are not a PMI member but are interested in the subject matter.

Live webinar capacity is limited to 1000 (on a first-join basis), but the session is recorded for later viewing.

Free registration at:
http://is.vc.pmi.org/Community/Blogs/tabid/2423/entryid/1968/ISCOP-July-2013-webinar.aspx

The PM Guide to Managing Architecture in Software Development Projects
Presenter: Yaniv Pessach,  PMP®, CSM

July 24, 2013 • 12 PM - 1 PM EDT (UTC -4)
Software architecture differs from other aspects of planning and executing a project in several ways. It is critical to the success of complex IT projects, yet managing it is less well understood than managing requirements, implementation, or testing, and it poses unique difficulties, such as requiring output from senior technical staff who may be spread across multiple projects. This presentation discusses different methods of managing the architecture process and working well with software architects, drawing on both waterfall and agile techniques.
We will discuss:
  • What is software architecture and when should it take place
  • Common approaches to software architectures, and when to pursue them
  • The role of the PM in architecture
  • Difficulties in managing the creation of software architecture
  • Common architecture frameworks

Topics Covered in the Presentation
  • What is software architecture and when should it take place?
    • Architecture has a role in the Initiating, Planning, Executing, and even the Monitoring process groups
  • What are difficulties in managing the creation of software architecture?
    • Interdependence on emerging requirements and technical understanding
    • Working with senior ICs and partially-allocated resources
    • Lack of a crystal ball
  • What are common architecture frameworks?
    • Big Design Up Front
    • Emerging
    • Good Enough Architecture
    • Time Boxed Architecture
  • What are the similarities (and conflicts) between the role of the Architect and the role of the PM?
    • Coordination with multiple stakeholders
    • Overall project oversight

Key Takeaways

  • Why does software architecture differ from both planning and execution?
  • What are common approaches to software architectures, and when to pursue them?
  • What is the role of the PM in architecture?

Saturday, July 13, 2013

Scaling to a Billion - Part II

(An expanded version of this article first published in Software Developer's Journal)

Scaling to a Billion (Part 2)

Know Your Enemy

As the scale of your systems increases, the number and variety of exceptional or unexpected behaviors increases. At the same time, your understanding of what the system is actually doing decreases.
As the system scales, logging must become both more complete and more actionable. Those two goals are sometimes in conflict - the more verbose logs are, the harder they are to understand. To address this problem, invest in log filtering and analysis tools.

Monitoring

One approach is to implement Structured Logging, where all logging statements have essentially the same format and are therefore machine-readable. A close cousin to logging is an effective monitoring system. Some events require human investigation and intervention, but involving the team can be distracting, demoralizing, and expensive. A good monitoring system requires a low false positive rate (do not alarm when nothing is wrong) and a very low false negative rate (do not miss alarms), but tuning alarm criteria to meet both of those goals is difficult. Every customer order is important, but it is impractical and demoralizing to wake up a developer at 2:00am to fix a single order, so monitoring systems must prioritize events based on the number of customers or requests impacted.
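One way to sketch structured logging is a custom formatter that renders every record as a single machine-readable JSON line. The class name, field names, and the `order_id` context field below are illustrative choices, not part of any particular system:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render every log record as one machine-readable JSON line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Context attached by callers via the `extra` kwarg, if any
            "order_id": getattr(record, "order_id", None),
        })


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Every statement now shares the same shape, so filtering and
# analysis tools can parse logs without per-message regexes.
logger.info("payment captured", extra={"order_id": "A-1234"})
```

Because every line is valid JSON with the same keys, the filtering and analysis tools mentioned above reduce to simple queries over fields rather than brittle text matching.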

Expansion

As the system grows, manual handling of exceptional events becomes less reasonable. A machine failure is a likely event in any fairly large cluster, so it should not require immediate human intervention. Failure of some less-trustworthy components should be treated similarly, and the system should maintain SLAs without immediate intervention. As cluster sizes grow, more classes of errors should fall into the ‘automatic handling’ category. Specifically, a failure of one request (or one type of request) should be automatically isolated so that it does not impact future requests, or at least not requests of a different type. If the system suffers from known issues (which may take weeks or months to address), the automatic handling system should make it easy to add an appropriate mitigation. For example, suppose a new bug causes purchase orders with more than 50 different products to fail more often, and the bug is not immediately fixed. Since those orders are rare, the team might want them (or the failing ones) to be automatically ‘parked’ and handled during normal business hours only, rather than trigger the existing alarms, since those failures are not an indication of a new problem.
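A minimal sketch of routing known issues away from the alarm path, with hypothetical names (`classify_failure`, the over-50-products rule) standing in for real failure signatures:

```python
# Map of known failure signatures to mitigations. In a real system this
# table would be configurable so mitigations can be added without a deploy.
KNOWN_ISSUES = {
    "oversized_order": "park",  # known bug: orders with >50 products fail more often
}


def classify_failure(order):
    """Map a failed order to a failure signature (hypothetical rule)."""
    if len(order["products"]) > 50:
        return "oversized_order"
    return "unknown"


def handle_failure(order, alarms, parked):
    """Park known issues for business-hours handling; only unexplained
    failures reach the alarm queue and can page a human."""
    signature = classify_failure(order)
    if KNOWN_ISSUES.get(signature) == "park":
        parked.append(order)
    else:
        alarms.append(order)


alarms, parked = [], []
handle_failure({"id": 1, "products": list(range(60))}, alarms, parked)  # known issue
handle_failure({"id": 2, "products": [1, 2]}, alarms, parked)           # unexplained
```

The design choice worth noting is that parking is data-driven: adding a mitigation for a newly discovered bug means adding a classification rule and a table entry, not rewriting the alarm logic.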

Another Blog!

I am opening/branching another blog for distributed computing (and distributed systems) research-related topics. I think that blog appeals to a different, more specialized audience, so the separation makes sense. And, not surprisingly... the blog name is... drumroll... http://distributedcomputingresearch.blogspot.com/

Tuesday, July 02, 2013

Scaling to a Billion - Part I

(An expanded version of this article first published in Software Developer's Journal)

Scaling to a Billion (Part I)

In 2011, the late-stage startup I was with sold and fulfilled eCommerce orders at an annual rate of half a billion dollars. After the company was acquired by a major brick-and-mortar retailer, our backend fulfillment systems were enhanced to fulfill not only the orders coming from our own site, but also most of the orders placed on the retailer's site. With multiple sources of orders, we had to be ready for considerably more orders placed and processed. As the Principal Engineer of the group responsible for Order Processing and Payment systems, I led the technical design of our systems toward the elusive and quite challenging goal of supporting one billion dollars in annual sales. This article is about some of the lessons that the team and I learned in the process.


Reducing Uncertainty Under Pressure

When your application is directly responsible for revenue, you and your team will find yourselves constantly under a microscope. When issues occur, escalations are distracting and time-consuming; therefore, part of the effort in designing a system must go toward minimizing outliers and possible escalations. A stable, reliable system may be preferable to a more efficient one that runs in fits and starts. When looking at performance and SLAs, it is not enough to minimize the average response or cycle time - minimizing the variance of those numbers is crucial.
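A tiny illustration of why the average alone is misleading, using two made-up latency samples that share the same mean but behave very differently at the tail:

```python
import statistics

# Two hypothetical services, response times in milliseconds.
steady = [100, 102, 98, 101, 99, 100, 100, 100]  # mean 100 ms, low variance
spiky = [50, 50, 50, 50, 50, 50, 50, 450]        # mean 100 ms, one 450 ms outlier

for name, samples in [("steady", steady), ("spiky", spiky)]:
    print(name,
          "mean:", statistics.mean(samples),
          "stdev:", round(statistics.pstdev(samples), 1),
          "worst:", max(samples))
```

Both services would look identical on a mean-latency dashboard, but only the high-variance one generates the outliers and escalations the paragraph above warns about, which is why tracking variance (or tail percentiles) matters alongside the mean.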

(To Be Continued)