Driving continuous improvement with PDCA and measurements
Continuous improvement has always been an implicit part of being agile, and with the growing awareness of work by Deming, amongst others, plus the increasing popularity of Lean thinking, it is rightly becoming the centre of attention. And yet I suspect that much of the continuous improvement that happens is not actual improvement, and I wonder whether most of it proves to be negligible in the grand scheme of things.
When thinking about continuous improvement it's important to understand the work as a system. Systems exist within other systems so decide where to define the boundaries. At Energized Work we work in product streams. Each product stream is a system of production geared to the market and includes the customers.
Most people are probably aware of Deming's Plan-Do-Check-Act cycle. But in many cases it's probably used only partially. Mostly change is planned without defining measurable expectations for the outcome. Almost never is the outcome checked against those expectations and rarely is the outcome understood in terms of its causes. The vital step of checking is often missed, and planning and doing without checking is just decision-making that’s not validated. Checking reveals all kinds of things we should know about and it helps us understand and remove root causes of problems. But without a way to measure the impact of changes on the system we can only rely on a qualitative assessment as to whether an improvement was achieved. Qualitative assessment is useful but having data allows us to conduct analysis and gain better insight. Measuring the impact of changes using quantitative instrumentation provides us with knowledge that enables us to learn faster and more effectively. Planning decisions are transformed into experiments that help us determine the optimal operating conditions of the system.
Systems exhibit both exceptional and routine variation. When using PDCA listen to the 'voice of the system' (Wheeler’s voice of the process) because it defines what can be obtained from the system. Statistical process control charts can help understand the system's behavior and identify signals amongst the noise.
Exceptional (special cause) variation
Exceptional variation occurs beyond the natural limits and signals special causes, i.e. things outside the system that are significant enough to skew its output. It indicates dominant cause-and-effect relationships that affect the system that are not being controlled. According to Shewhart's definition, with the presence of such exceptional variation the system is unpredictable. The system should not be changed to accommodate special causes. Rather the special causes should be eliminated to make the system predictable. Get timely data, react quickly to the signals and conduct root cause analysis to identify the causes. Then take immediate and effective action to remove the causes forever, which may involve making changes to the system (or more likely to the system outside the system :).
The first chart below shows a single data point beyond a natural limit signaling a special cause. The second chart isn't a signal as such but when 3 out of 4 consecutive data points are closer to one of the natural limits than they are to the average it may suggest a possible special cause. Certainly it warrants a closer look.
Joiner proposes additional tests for special causes hidden within the natural limits. These are shown in the last three charts: 8 successive data points on the same side of the average may indicate the system has been 'bumped'; 14 or more data points alternating up and down may indicate the system is moving back and forth with regularity (e.g. perhaps due to on and off-shore teams working opposing hours); 6 or more points in a row steadily rising or falling suggests the emergence of a trend (although Joiner says you seldom detect a special cause with this test).
[Figure: Signals of possible special cause variations. Originally uploaded by energizr.]
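The tests described above could be automated along the following lines. This is only a minimal sketch: it assumes the average and the natural limits have already been calculated, and the function names and their default run lengths are hypothetical, chosen to mirror the wording above rather than taken from Wheeler or Joiner. Each function returns the index of the data point at which a signal completes.

```python
from typing import List, Sequence


def _sign(value: float) -> int:
    """Return 1, -1 or 0 for a positive, negative or zero value."""
    return (value > 0) - (value < 0)


def beyond_limits(xs: Sequence[float], lower: float, upper: float) -> List[int]:
    """Indices of points that fall outside the natural limits."""
    return [i for i, x in enumerate(xs) if x < lower or x > upper]


def three_of_four_near_limit(xs: Sequence[float], average: float,
                             lower: float, upper: float) -> List[int]:
    """End indices of 4-point windows with 3+ points nearer one limit than the average."""
    high_mid = (average + upper) / 2  # above this, a point is nearer the upper limit
    low_mid = (average + lower) / 2   # below this, a point is nearer the lower limit
    hits = []
    for end in range(3, len(xs)):
        window = xs[end - 3:end + 1]
        if (sum(x > high_mid for x in window) >= 3
                or sum(x < low_mid for x in window) >= 3):
            hits.append(end)
    return hits


def run_same_side(xs: Sequence[float], average: float, length: int = 8) -> List[int]:
    """End indices of runs of `length` successive points on one side of the average."""
    hits, run, side = [], 0, 0
    for i, x in enumerate(xs):
        s = _sign(x - average)
        if s != 0 and s == side:
            run += 1
        else:
            # A point on the average, or on the other side, restarts the run.
            side, run = s, (1 if s != 0 else 0)
        if run >= length:
            hits.append(i)
    return hits


def steady_trend(xs: Sequence[float], length: int = 6) -> List[int]:
    """End indices where `length` points in a row steadily rise or fall."""
    hits, run, direction = [], 1, 0
    for i in range(1, len(xs)):
        d = _sign(xs[i] - xs[i - 1])
        if d != 0 and d == direction:
            run += 1
        else:
            # A tie or a reversal starts a new run from the previous point.
            direction, run = d, (2 if d != 0 else 1)
        if run >= length:
            hits.append(i)
    return hits


def alternating(xs: Sequence[float], length: int = 14) -> List[int]:
    """End indices where `length` points in a row alternate up and down."""
    hits, run, previous = [], 1, 0
    for i in range(1, len(xs)):
        d = _sign(xs[i] - xs[i - 1])
        if d != 0 and d == -previous:
            run += 1
        else:
            run = 2 if d != 0 else 1
        previous = d
        if run >= length:
            hits.append(i)
    return hits
```

A hit from any of these is only a prompt to go and look; the judgment about whether a special cause is really present still belongs to the people doing the work.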
Routine (common cause) variation
Routine variation occurs within the natural limits over a reasonably long period of time, appears well behaved and is entirely due to common causes that typically reside deep within the system. In this situation the system is predictable within those limits and, unless something is changed, the system continues to behave this way. A predictable system can only be improved by changing the actual system in some fundamental manner, e.g. by introducing new process settings or a new technology or new ways to do the work. Whether to act on common causes of variation is a judgment call. Although the system is stable it may be worthwhile reducing common cause variation to improve predictability even further and increase what the system can deliver. Unfortunately, unlike special causes, there are no obvious signals for common causes of variation. Therefore to reduce its effects run small experiments using PDCA to pinpoint the common causes and then develop changes in the system to counteract them.
Labels: common cause variation, pdca, special cause variation, systems thinking
Dealing with organizational complexity goes in the 'too hard' box
Our purpose is to improve the quality of service for customers. Quite simply, our goal is to delight customers. But Goldratt said 'the goal of every company is to make money'. Making money is mandatory but fixation on profit and obsession with costs is a sure way to become detached from customers. Our goal is not to delight shareholders. Delighted customers become loyal customers and loyal customers provide repeat business. They even do marketing for us. They tell their friends and family who then give us their business and they’re delighted so they tell their friends and family.
Making money is a consequence of delighted customers!
There are lots of people who can be considered customers and they have different needs so satisfying them all is difficult. Let's not get hung up on the definition of the word customer. Just think of these people as deriving some form of value from what we deliver. So our deliveries must somehow balance their needs. This isn’t easy but we have a chance when they share a vision, have common goals and are able to provide consensus on priorities.
There aren't many delighted customers out there!
But we know this. And yet, change is all the rage. IT is awash with improvements. It still doesn't feel like it's any easier to get stuff done. Consider for a moment all the things that, in your experience, might be needed to get something out the door. There are a lot of different things, right?
Organizations are over-organized!
We get separated by specialty. These days it's common to have core platforms and infrastructure with centralized teams providing services like build, testing and release management. The problem with centralization is that it comes at the cost of complexity introducing dependencies, bottlenecks, waste, etc. It's difficult to understand the big picture when all we see is part of the product through a window in our silo. Living in a silo we just end up chasing local targets creating local optimizations and not really helping the product as a whole.
It's all too complicated!
Most of the changes we've observed organizations making attempt to tackle symptoms. The underlying problems remain and consequently people go back to how they were doing things before, like zombies back to the shopping mall. Dealing with complexity goes in the 'too hard' box. Complexity is not a setup for success.
The unspoken truth is that failing conventionally is the status quo!
Labels: complexity, status quo
Come and see us at LSSC10 in Atlanta on 21st April
Come and hear us talk at the Lean Software and Systems Conference, Atlanta. Our session is at 2:30pm on Wednesday 21st April in the Business and Management track. We'll be talking about Product Development in the Land of the Free.
Abstract:
Creating and sustaining a system for effective product development is neither easy nor commonplace. If we were to pull together the lessons we've learned from eXtreme Programming and Scrum with systems approaches such as Lean Thinking and the Theory of Constraints what would such a system look like? Where would we start? How would we organize ourselves? And what would be our approach?
The fact that so many information technology projects are still failing tells us that we should be doing something very different. This session will explore some of the things we've been doing beyond the agile comfort zone to improve the effectiveness and throughput of product development and realize business agility.
Effectiveness of a real product stream
I've pulled together some data for the first year of a product stream we created and plotted it as charts for throughput, rework and effectiveness.
The first chart shows the weekly rework. I've talked about rework previously so I won't cover it again here. The blue line indicates the remaining technical debt and the blue bars the repaid technical debt. The pink line indicates the remaining defects and the pink bars the fixed defects. Week by week, it can be seen that defects were fixed as soon as they were discovered to reduce the remaining defect count to zero. Also the technical team continuously repaid technical debt to keep the remaining amount of rework small.
The second chart shows the amount of throughput every week.
The peak at week ending 18/12, without any throughput during the 8 preceding weeks, demonstrates a flush of inventory amounting to 104 cards and corresponds to the alpha release of the product. In the rework chart, a small increase in fixed defects can be observed during the same week. Inventory then built up again for two weeks, as improvements were made to the automated deployment system, until the next peak at week ending 15/01, at which point 48 completed cards were flushed. Releases then occurred every week and, while some variation is observed, the throughput remained stable.
To improve the performance of the product for the beta release during week ending 26/02, 7 technical debt cards were completed. As the system experienced more rigorous use by editorial users, 12 defects were fixed plus a further 8 the following week due to increased traffic. The official launch was completed in the week ending 18/03. During that week some data inconsistencies were encountered following migration from the old content management system, which resulted in 9 defects. In response to traffic loads the load balancers were tuned with 5 technical debt cards. This effort continued the following week with a further 8 technical debt cards and 7 fixed defects as traffic increased to approximately 180 million page impressions and 3.7 million unique users per month with an average page weight of 2 MB. Further peaks in technical debt of 20, 16 and 10 can be seen during the weeks ending 06/05, 17/06 and 24/06, respectively. This work concentrated on the expansion of the product with reconfiguration of the production environment to support additional channels.
It's worth me recapping on effectiveness. Effectiveness is used as a measure of the product stream's ability to sustain throughput and minimize failure demand, which allows capacity to focus on meeting value demand. It was inspired by the First-Time-Through (FTT) measurement used in Lean manufacturing to measure the effectiveness of a cell's standardized work as a percentage of product made without any need for rework or scrap.
The effectiveness of the product stream is defined as:
Effectiveness = ( Throughput - Rework ) / Throughput
where
Throughput = the number of cards released to production (excluding completed rework)
and
Rework = the number of technical debt and defect cards in inventory and work-in-process
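For anyone wanting to try this on their own numbers, here is a minimal sketch of the calculation. The function name and the weekly figures are hypothetical, not the product stream's real data, and returning `None` for a week with no releases is my assumption: the formula divides by throughput, so it has no defined value when nothing was released.

```python
from typing import Optional


def effectiveness(throughput: int, rework: int) -> Optional[float]:
    """(Throughput - Rework) / Throughput, or None when nothing was released."""
    if throughput == 0:
        return None  # assumption: a week without releases has no defined value
    return (throughput - rework) / throughput


# Hypothetical weekly figures as (cards released, remaining rework cards).
weeks = [(0, 3), (20, 2), (10, 0), (5, 6)]
for released, remaining_rework in weeks:
    score = effectiveness(released, remaining_rework)
    print("n/a" if score is None else f"{score:.0%}")
```

Note that the result goes negative whenever the remaining rework outnumbers the cards released that week.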
The final chart shows the weekly effectiveness of the product stream.
The lows at weeks ending 29/01, 25/03 and 01/04 can be attributed to marked dips in throughput. At 29/01, 12 cards were queued as inventory whereas a small increase in the amount of remaining rework was present at 25/03 and 01/04. Clearly the product stream is most effective when the completed rework is small compared to the throughput and is enough to keep the remaining rework small compared to the throughput.
Labels: effectiveness, product stream, rework
Come and see us at QCon London on 11th March
Come and hear us talk at QCon London. Our session is at 3pm on Thursday 11th March in the Agile Evolution track. We'll be talking about Product Development in the Land of the Free.
Abstract:
Creating and sustaining a system for effective product development is neither easy nor commonplace. If we were to pull together the lessons we've learned from eXtreme Programming and Scrum with systems approaches such as Lean Thinking and the Theory of Constraints what would such a system look like? Where would we start? How would we organize ourselves? And what would be our approach?
The fact that so many information technology projects are still failing tells us that we should be doing something very different. This session will explore some of the things we've been doing beyond the agile comfort zone to improve the effectiveness and throughput of product development and realize business agility.
PS. Thanks to Jesper for inviting us along.
PPS. 33 people attended. 21 provided feedback. 19 voted green and 2 yellow, where green = great presentation, yellow = average presentation, red = poor presentation.
Visual Hospital
We've been working with the boys from Visual Healthcare Solutions (who are faculty members at the Lean Enterprise Academy) to create a touchscreen solution called Visual Hospital, which supports the methods they talk about in their book Making Hospitals Work. It's been a lot of fun using an iterative approach to building the interaction design with the users.
We're very excited to collaborate with these lean consultants as they continue their work in healthcare.
Inevitable and avoidable rework
Without really thinking about it until now, I've been seeing two types of technical debt. The first is the quick solution implemented with dirty code. I consider this to be irresponsible. That's not to say I won't do it, just that if I decide I should do it I make sure the necessary people understand the consequences and that it's an irresponsible action to take.
The second is a natural byproduct of emergent design and YAGNI (yet) decisions. It's the debt that surfaces when a system outgrows implementations resulting from previous decisions, which were the right ones to make at the time based on the information available (because they did not compromise quality or the health of the code in any way). Irresponsible debt creates avoidable rework; it's failure demand. It's bad, it smells and it needs to be cleaned up because, if left to fester, it's going to slow us down and divert capacity away from meeting the value demand.
The debt that surfaces because the system is maturing creates inevitable rework. It's necessary to do this rework on a regular basis to keep the emergent design relevant, the code habitable, to prevent obsolescence, perhaps increase reuse, and reduce risks and medium to long-term costs. I think most people try to roll this debt into feature cards and that's the right policy. We prefer to do that if we can. However, we've become too good at writing cards to be less than 2 days (which helps smooth the flow) and sometimes it's not possible to absorb inevitable rework into a feature card and keep it under 2 days (the way we like it). And of course, sometimes, the rework just doesn't relate to any features, e.g. upgrading to the latest Grails framework. So this gets written on a blue card. By definition this is failure demand too. But that's harsh, don't you think? I have a weird take on this because I insist that the system is recognized and treated as a stakeholder, and as such it values certain things and makes its own demands. One of the things it values is to be kept healthy. But I'm not hung up on this rework being classified as failure demand providing it's being managed effectively.
As I mentioned in my previous post, completing some inevitable rework on a regular basis (and assuming you're not being irresponsible ;) helps reduce the remaining rework. We can see this in action in the chart below.
[Figure: Rework. Originally uploaded by energizr.]
The blue and pink lines show the remaining technical debt and defects that are either work-in-process or queued inventory (i.e. completed but not released). The blue and pink bars show the technical debt that has been repaid and the defects that have been fixed. Think of these bars pulling the remaining rework down keeping it small and preferably fairly steady. And, of course, assuming there's throughput satisfying the value demand then the team is effective.
It's useful to track the remaining technical debt and defects in statistical process control charts. The natural process limits help to distinguish signs of system instability from normal variation. When the limits are breached investigate what's happened to understand how the system may have changed. Watch for trending beyond the breach as it's likely to reveal more information to help you. Use these events to identify improvements.
(Incidentally, the process limits were calculated between weeks 7 and 14 because week 6 saw the system change. Up to the end of week 6 all the software completed became queued inventory. This was then flushed to throughput and released, enabling a weekly release from then on.)
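As a rough illustration of how natural process limits might be calculated from a baseline window like that one, here is a minimal sketch. It assumes an XmR (individuals and moving range) chart in the style Wheeler describes, where the limits sit 2.66 average moving ranges either side of the average. The function name and the weekly counts are hypothetical, not the data behind the chart above.

```python
from statistics import mean
from typing import Sequence, Tuple


def natural_process_limits(values: Sequence[float]) -> Tuple[float, float, float]:
    """Return (lower, centre, upper) for an XmR individuals chart."""
    if len(values) < 2:
        raise ValueError("need at least two values to form a moving range")
    centre = mean(values)
    # Moving ranges are the absolute differences between successive values.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    spread = 2.66 * mean(moving_ranges)  # 2.66 is the XmR scaling constant
    return centre - spread, centre, centre + spread


# Hypothetical remaining-rework counts for weeks 1 to 14 (not the real data).
remaining = [0, 2, 5, 9, 12, 15, 4, 5, 3, 4, 6, 5, 4, 5]
baseline = remaining[6:14]  # weeks 7 to 14 inclusive, after the system changed
lower, centre, upper = natural_process_limits(baseline)
lower = max(lower, 0.0)  # assumption: a card count cannot fall below zero
print(f"limits: {lower:.2f} to {upper:.2f}, average {centre:.2f}")
```

Restricting the baseline to the weeks after the change matters: including the earlier weeks would widen the limits and hide signals from the system as it behaves now.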
Labels: rework, spc, technical debt
A simple measure of effectiveness
In the Lean manufacturing world there's a measurement called First-Time-Through (FTT), which monitors whether a cell is making products right the first time. It's a measurement of the effectiveness of the cell's standardized work and shows the percentage of product made without any need for rework or scrap.
FTT = ( Total units processed - Rejects or Reworks ) / Total units processed
If the standardized work is adhered to, the product will be made right first time and FTT will be 100%. However, flawed materials, faulty components and operator error all contribute to rework and scrap.
Who cares about parallels between manufacturing and software development? I was just interested to read about FTT because I've been thinking for a while now about the effectiveness of software teams ... at an operational level, let's say. I've long considered an effective team as one that is able to sustain throughput (i.e. the number of cards released to production that deliver value) while fixing defects immediately and repaying technical debt to keep the amount of rework small.
I consider technical debt and defects to be rework, and technical debt to be a natural byproduct of software development. It stems from earlier decisions, based on what we knew at the time, and requires attention later when the system has outgrown the outcomes of those decisions. It is necessary rework that keeps the emerging design relevant and the software healthy and habitable, reducing risks and medium to long-term costs. Defects are basically mistakes. They happen. How we create software determines whether we have a small and manageable amount of rework or a crippling amount of rework. If we're responsible, skilled and bake quality into code we can minimize rework to technical debt and occasional defects. If we're irresponsible and cut corners, or we're rubbish and write crap code, then rework can become so large that the only viable option is to cancel or start again.
Technical debt requires careful management and continuous investment while defects should be fixed as soon as they are found. A proportion of a team's capacity is therefore always expended doing an amount of rework. That's a good thing providing:
- the completed rework is small compared to the throughput so that capacity mostly focuses on value demand, and
- the completed rework is enough to keep the remaining rework small compared to the throughput, thus minimizing further failure demand.
(Throughput excludes repaid technical debt and fixed defects that went live).
On a weekly basis then, the throughput in relation to the remaining technical debt and defects might be a useful measure of a team's effectiveness.
Effectiveness = ( Throughput - Rework ) / Throughput
where
Throughput = Number of cards released to production that deliver value
Rework = Number of technical debt and defect cards in inventory and work-in-process
I’ve pushed various teams’ data through and the charts seem to correlate with the events described in my historical notes. Here's a chart based on a small, experienced team working on a small project for 3 months.
[Figure: Effectiveness. Originally uploaded by energizr.]
You can see there wasn't any throughput in the first 4 weeks as completed cards queued up in inventory. In week 5 that inventory was flushed to become throughput as the first cut was released. Effectiveness then varied with the weekly releases until week 10, which saw the team 100% effective with no rework cards in inventory or work-in-process. In week 12, however, effectiveness dropped to -33% because 1 technical debt card was work-in-process and 3 fixed defects were queued in inventory while only 3 cards were released.
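As a check on the week 12 figure, the arithmetic works out as follows, counting the work-in-process technical debt card and the three fixed defects still queued in inventory as rework:

```latex
\text{Rework} = 1 + 3 = 4, \qquad
\text{Effectiveness} = \frac{\text{Throughput} - \text{Rework}}{\text{Throughput}}
                     = \frac{3 - 4}{3} = -\frac{1}{3} \approx -33\%
```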
Although it's perhaps a simplistic indicator, do you think it's useful as a measure of effectiveness (i.e. a team's ability to deliver value and stay healthy)? Or is it utter tosh? Can it be refined (without complicating it)?