Problem Of The Month

November 2002---Crow/AMSAA Reliability Growth Plots

                                                                                  

Download this problem of the month as a PDF file (450 KB) .

Overview:

Reliability growth plots go by a variety of names: Duane plots, Crow plots, Crow AMSAA plots, Crow-AMSAA plots, Crow/AMSAA plots, C/A plots, and C-A plots.  They are log-log plots showing reliability trends of improvement, deterioration, or no change.  The most common plot is cumulative failures versus cumulative time.  Often the Y-axis is transformed to plot cumulative mean time between failures (MTBF) versus cumulative time, which makes the plot easy to interpret: when the line slopes upward and to the right, reliability is improving; when it trends downward and to the right, reliability is deteriorating.

 

The plots are “show me, don’t tell me” how failures are occurring with time.  You can use your maintenance data records to forecast future failures.  Also you can see the results of improvement programs and easily calculate the changes from the straight lines and the cusps produced by improvement programs.

 

Reliability growth plots show how reliability changes over time with simple graphics plotted in a log-log format.  Fortunately, the trend lines often have straight line segments, and this makes predictions of future failures a simple matter.  See Figure 1 for an example of a simple plot of cumulative failures versus cumulative time made using WinSMITH Visual software.

 

In Figure 1, the literal value of beta>1 may mean failures are increasing or it may also mean that for practical purposes, the system shows no improvement or no deterioration.  Crow/AMSAA plots with their key indicator slopes function like yardsticks/meter sticks rather than as a micrometer. 

 

Your explanations are never going to be simpler than cumulative failures versus cumulative time shown in Figure 1.  The straight trend line offers a methodology for making fearless forecasts of future failures even when your data contains mixed failure modes.

 

In “real life” things change because of improvements or deteriorations made to the system.  We need artifacts and analogs to show us what’s happening in some simple manner.  For example, we need thermometers to indicate rise/fall in temperature.  We need scales/balances to show changes in physical mass.  We need relationships that provide an analog of physical and human experiences.  Thus we need a reliability growth plot to give us clues about changes in failure rates.  Consequently we can use the reliability growth tool to make “fearless forecasts” about when future failures will occur on the cumulative failure versus cumulative time plot by simply extrapolating the trend line into the future.  As reliability engineers, our task is to put a cusp on the trend line by making cost effective improvements so the cumulative failure versus cumulative time trend line has a flatter slope (i.e., beta is less than 1).  Thus reliability growth plots are helpful for reflecting the changes in failure modes by “digesting” the data from mixed failure modes.

 

Reliability changes usually occur in steps.  Reliability improvements involve elapsed time and failures.  Longer elapsed times between failures mean improved reliability, and the components and/or the system will then display more reliability.  When cumulative time (plotted on the X-axis) and cumulative failures (plotted on the Y-axis) are plotted on uniformly divided graphs, they provide an analog of physical experience as a curved plot.

 

Experience shows that curved analog plots of improvement efforts can often be transformed into straight line analog plots by use of simple two-axis logarithmic plots.  With the log-log plot, reliability improvements can often be observed as a straight line.  When engineers have an X-Y plot with a straight line, they have a fundamental grasp of what’s happening in the real world and can explain the phenomena.

 

The task of most reliability engineers is to force cusps (a break in the straight line trend line of cumulative failures on the Y-axis versus cumulative time on the X-axis) on the reliability growth lines so that longer intervals of time occur between failures.  Reliability improvement efforts should occur until the cost of making improvements is no longer justified or until objectives of the client have been reached.

 

Why do Crow/AMSAA plots produce straight lines on log-log plots when cumulative failures are plotted versus cumulative time?  The forerunner of the concept has parallel roots in manufacturing and has been exhaustively demonstrated as a true log-log phenomenon.  It’s a natural occurrence of learning/improving.  Consider the following parallel.

T. P. Wright (1936) pioneered the idea that improvements in the time to manufacture an airplane could be described mathematically, a very helpful concept for management production planning.  Wright’s findings showed that, as airplanes were produced in sequence, the direct labor input per airplane decreased in a mathematical pattern that forms a straight line when plotted on log-log paper.  If the rate of improvement is 20% (the learning percentage is 80%), then each time the production quantity is doubled, even for large processes and complicated operations, the time required to complete a unit is 20% less.  Thus the time per unit of production decreases by a constant percentage each time the production quantity is doubled.

Wright’s method in the 1940’s was a helpful concept for the USA War Production Board in estimating the number of airplanes that can be produced for a given complement of men and machines.  After the end of World War II, the US Government employed the Stanford Research Institute (SRI) to validate improvement curve concepts.  SRI studied all USA airframe WWII production data (see table at bottom of this page) to validate the concept and SRI developed a slightly different version than the simple case offered by Wright (DOD 2003) which also plotted on a log-log plot as a straight line.   

Today Wright’s log-log concept is known as learning curves, cost improvement curves, progress function, Crawford curves (J. R. Crawford was on the SRI validation team—Crawford’s model is considered less technical than Wright’s model), Boeing curves, Northrop curves and so forth to represent the findings of each manufacturer of airframes who each developed a variation on T. P. Wright’s simple equation.

The simple improvement curve was \(Y = A X^{B}\), which produces a straight line on log-log paper, where Y is the unit cost (hours/unit or $’s/unit), X is the unit number, A is the theoretical cost of the first unit (hours or $’s), and B is a line slope constant related to the rate of improvement [B is literally equal to ln(learning percent)/ln(2), where the learning percent = 100% - (rate of improvement), expressed as a fraction].  For example, if the first unit took 100 hours to complete (A = 100) and the improvement rate is 20%, the learning percentage is 80%, so that B = ln(1.00 - 0.20)/ln(2) = -0.32193.  Thus we would expect production of the 2nd item to require 80 hours and the 4th item to require 64 hours, and so forth: each time the production quantity doubles we shave 20% from the production time.  Some typical learning curve slopes are described at the NASA Cost Estimating Website (NASA 2003); the learning % varies from 96% for raw materials (the slowest improvement) to 75% for repetitive electrical operations (the fastest improvement), with most values around 80-90%.  The plots have three different formats: 1) hours/unit or $/unit versus cumulative production, 2) cumulative (hours or $’s) versus cumulative production, or 3) cumulative average (hours or $’s) versus cumulative production.
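The arithmetic in the example above can be checked with a few lines of Python. This is only a sketch of the simple unit-cost form described in the text; the function names `learning_slope` and `unit_cost` are made up for illustration and do not come from any handbook or library.

```python
import math


def learning_slope(learning_percent):
    """Slope B of Y = A * X**B for a learning percentage such as 80."""
    return math.log(learning_percent / 100.0) / math.log(2.0)


def unit_cost(first_unit_cost, unit_number, learning_percent):
    """Cost (hours or $) of a given unit number on the simple improvement curve."""
    b = learning_slope(learning_percent)
    return first_unit_cost * unit_number ** b


# Example from the text: first unit takes 100 hours, 80% learning curve.
slope = learning_slope(80)
print(f"B = {slope:.5f}")
for unit in (1, 2, 4, 8):
    print(f"unit {unit}: {unit_cost(100.0, unit, 80):.1f} hours")
```

Each doubling of the unit number multiplies the cost by the learning percentage, which is why the points fall on a straight line on log-log paper.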

Learning curves were used extensively by General Electric, and a GE reliability engineer made log-log plots of cumulative MTBF versus cumulative time which gave a straight line for reliability issues (Duane 1964).  Duane argued that all failure data should be used on complex electromechanical systems.  He recommended that the Y-axis be the cumulative failure rate, (cumulative failures)/(cumulative time) = \(K T^{-\alpha}\), where K is a constant that depends upon equipment complexity, design margins, and design objectives for reliability; T is cumulative time; and \(\alpha \approx 0.5\), with the expectation that some designs would be better (\(\alpha > 0.5\)) and some would be worse (\(\alpha < 0.5\)).  Duane drew his conclusions from studying 5 different data sets and found remarkable similarity in the patterns of the curves (meaning the line slopes were about the same).  Duane also rearranged his equation to show cumulative failures \(F = K T^{1-\alpha}\), which allowed forecasting of future failures based on past results.  James Duane had a deterministic postulate for monitoring failures and failure rates of a complex system over time using a log-log plot with straight lines.

At the US Army Materiel Systems Analysis Activity (AMSAA) during the mid 1970’s, Larry Crow converted Duane’s postulate into a mathematical and statistical proof via Weibull statistics in MIL-HDBK-189 (DOD 1981).  The military handbook addressed:

reliability growth: The positive improvement in a reliability parameter over a period of time due to changes in product design or the manufacturing process, and
reliability growth management: The systematic planning for reliability achievement as a function of time and other resources, and controlling the ongoing rate of achievement by reallocation of resources based on comparisons between planned and assessed reliability values.

The ultimate goal of the improvement program was to make reliability grow so as to meet the system reliability and performance requirements by managing the development program.  The management effort required making reliability: 1) visible, and 2) a manageable characteristic.  A reliability growth program required goals and forecasts of progress.  The failure data usually produced straight line segments on log-log plots with \(N(t) = \lambda t^{\beta}\), where N is the expected number of failures, \(\lambda\) is the intercept at time t = 1 (the cumulative failure rate at t = 1), t is cumulative time, and \(\beta\) is the line slope for cumulative failures versus cumulative time (and \(\beta = 1 - \alpha\) from Duane’s equation).  Scientific principles determine that failure data fit \(N(t) = \lambda t^{\beta}\) and thus failure data trends can produce a straight line on log-log paper.
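A short derivation may help show why this relationship plots as a straight line and how it ties back to Duane’s form.  It uses only the equations quoted above, written with \(\lambda\) for the intercept and \(\beta\) for the slope; the symbol \(\rho(t)\) for the instantaneous failure rate is notation introduced here for illustration.

```latex
\begin{align*}
N(t) &= \lambda t^{\beta} \\
\log N(t) &= \log \lambda + \beta \log t
  && \text{(straight line with slope } \beta \text{ on log-log axes)} \\
\frac{N(t)}{t} &= \lambda t^{\beta - 1} = K T^{-\alpha}
  && \text{(Duane's form with } K = \lambda,\ T = t,\ \alpha = 1 - \beta\text{)} \\
\rho(t) &= \frac{dN(t)}{dt} = \lambda \beta t^{\beta - 1}
  && \text{(decreasing in } t \text{ when } \beta < 1\text{)}
\end{align*}
```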

Plotting data from maintenance failure databases on a log-log plot builds a Crow/AMSAA relationship: the Y-axis intercept at t = 1 gives \(\lambda\), and the slope of the line gives \(\beta\), which reflects changes in the programs.  Thus future failures can be forecasted, and cusps on the data trends will tell if the system is improving (failures are coming more slowly, \(\beta < 1\)), deteriorating (failures are coming more quickly, \(\beta > 1\)), or without improvement/deterioration (failure rates are unchanged, \(\beta \approx 1\)).

Recently AMSAA has updated the information from Military Handbook MIL-HDBK-189 and produced the AMSAA Reliability Growth Guide TR-652 (DOD 2000).

Two excellent documents on the subject of reliability growth are:

MIL-HDBK-189, Reliability Growth Management, 13 February 1981 
       Download from ASSIST Quick Search as a PDF file using the title for the search.

       This PDF is 8.1 Meg in size.
The AMSAA Technical Report No. TR-652 is called the AMSAA Reliability Growth Guide.  You can download this September 2000 document as a PDF from this site:

            Cover pages through Section 1-Introduction: Pages Cover-24 (1.6 Meg)

            Section 2-Reliability Growth Planning: Pages 18-47 (2.1 Meg)

            Section 3-Reliability Growth Tracking: Pages 48-86 (2.2 Meg)

            Section 4-Reliability Growth Projection: Pages 87-133 (2.5 Meg)

            Appendix A-Background: Pages A1-A5 (0.3 Meg)

            Appendix B-Tables For Section 2: Pages B1-B43 (3.2 Meg)

            Appendix C-Derivations For Section 2: Pages C1-C8 (0.2 Meg)

            Appendix D-Derivations For Section 4: Pages D1-D12 (0.4 Meg)

            Appendix E-Distribution List: Pages E1-D3 (0.1 Meg)

This publication is not listed at http://www.ntis.gov although space exists in the reference to HDBK-A-1 documentation for inclusion of TR-652.  Here is the abstract for TR-652:

Reliability growth is the improvement in a reliability parameter over a period of time due to changes in product design or the manufacturing process.  It occurs by surfacing failure modes and implementing effective corrective actions.  Reliability growth management is the systematic planning for reliability achievement as a function of time and other resources, and controlling the ongoing rate of achievement by reallocation of these resources based on comparisons between planned and assessed reliability values.  To help manage these reliability activities throughout the development life cycle, AMSAA has developed reliability growth methodology for all phases of the process, from planning to tracking to projection.  The report presents this methodology and associated reliability growth concepts.

      

Both MIL-HDBK-189 and TR-652 are methodologies and concepts to assist in reliability growth planning.  They provide a structured approach for reliability growth assessments.  In general, they are considered from the standpoint that you must begin with some new components and grow the reliability of a system with a development program.

 

Another source of reliability growth information is IEC 61164 (this document was previously numbered as IEC 1164) Reliability Growth-Statistical Test and Estimation Methods.  The IEC-61164 document is a product of TC-56 work group which has provided about 50 documents pertaining to reliability and dependability.

 

You can also download a reliability growth paper from the Internet written by David W. Coit, titled “Economic allocation of test times for subsystem-level reliability growth testing”, which was published in IIE Transactions (1998) 30, 1143-1151.  This paper contains numerous references.

 

For plant equipment and operation, the reliability details described below are a little different:

·       The primary purpose of our business activities is to run our production facilities to make money and not to make an improvement program

·       We have old equipment that can only be improved at specific time intervals IF the improvement is truly cost effective

·       Reliability growth occurs on the device as we use the equipment for its primary purpose as a link in the money making machine, without time or resources for validating claims for improvements.

·       We lack staff and we lack verified knowledge that our planned improvements will function as forecasted

·       We need to forecast when the next failure is expected so we can plan for replacement/enhancements during scheduled turnarounds

·       The reliability improvement process competes for limited funding with every other program within the production/maintenance organization

In real production plants the reliability improvement program is clearly a question of which comes first, the chicken or the egg.  This requires reliability engineers to have numbers and then sell the numbers to management in 60 second sound bites based on 1) describing the issue and 2) telling how we will resolve the issue in time and money.  The 60 second sound bites require good sales tools, and the graphics of Crow/AMSAA plots help us sell the program.

Several simple examples of Crow/AMSAA plots:

Consider the following simple discrete examples to illustrate the plotting and calculation concept. 

Example 1: Suppose we had a system that failed every 60 days for a total of 5 failures.  Each corrective maintenance action was a repair (replacement components have the same length of life).  Following the fifth failure, we added a fix (replacement with a longer life component) with a life of 300 days/failure.  Subsequent failures will also be replaced with longer life components.  The data and calculations are shown in Table 1.
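Because Table 1 is an image on the original page, the following sketch rebuilds its columns from the description above.  It assumes five failures after the fix, which is consistent with the 1800 cumulative days and "failure number 11" quoted later in the text; the helper name `build_table` and the column layout are illustrative only.

```python
def build_table(intervals):
    """Return rows of (cumulative failures, cumulative time, cumulative MTBF)."""
    rows = []
    cumulative_time = 0.0
    for failure_number, interval in enumerate(intervals, start=1):
        cumulative_time += interval
        rows.append((failure_number, cumulative_time, cumulative_time / failure_number))
    return rows


# Five repairs at 60 days each, then (assumed) five failures at 300 days each.
intervals = [60.0] * 5 + [300.0] * 5
table = build_table(intervals)

print("failures  cum_days  cum_MTBF")
for failures, cum_days, cum_mtbf in table:
    print(f"{failures:8d}  {cum_days:8.0f}  {cum_mtbf:8.1f}")
```

The cumulative MTBF column is flat at 60 days through the fifth failure and then climbs, which is the plateau-and-rise pattern discussed for Figure 3.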

 

Of course in real life, the failures would not occur at the same time interval.  Thus real life results lack the clarity of Table 1.  You should expect to see much variability in ages to failure as they will occur with randomness from their family of failure characteristics.

 

Fortunately Crow/AMSAA plots allow mixed failure modes.  This is unlike Weibull plots which require singular failure modes for analysis.

 

Figure 2 shows a plot of cumulative failures versus cumulative time.  The altitude of this curve always rises according to the equation \(N(t) = \lambda t^{\beta}\), where N is cumulative failures, \(\lambda\) is the y-intercept at time = 1, and \(\beta\) is the indicator of reliability improvement (\(\beta < 1\)), reliability deterioration (\(\beta > 1\)), or no reliability change (\(\beta = 1\)).

 

The simple equation \(N(t) = \lambda t^{\beta}\) can be used to make a “fearless forecast” of when the next failure will occur (that is failure number 11 for this case):  \(t = (11/0.1645)^{1/0.548} = 66.869^{1.8248} = 2141.28\) cumulative time.  The “fearless forecast” of the time to the next failure is \(\Delta t = 2141.28 - 1800 = 341\) days compared to the 300 days expected from the discrete data in Table 1 (remember in real life you would not have discrete data!).  Crow/AMSAA plots are very useful for predicting future failures based on your data.  The technique provides a methodology, the equations are simple, the failure forecast is based on your data, and you can make a reasonable forecast of future events.  Remember, our task as reliability engineers is to make improvements so that we do not incur the predicted future failures!
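The forecast can be reproduced with a short script.  The text does not say how the values 0.1645 and 0.548 were estimated, so the fitting step below is an assumption: a failure-terminated maximum likelihood estimate with an (n - 2) bias correction, chosen only because it appears to reproduce those two values from the failure times implied by Example 1.  The function names are illustrative, and the failure times are reconstructed from the example description rather than copied from Table 1.

```python
import math


def fit_power_law(times, bias_correction=2):
    """Estimate (lam, beta) of N(t) = lam * t**beta from cumulative failure times.

    Assumed method: failure-terminated MLE with an (n - bias_correction)
    numerator. The source article does not state which method it used.
    """
    n = len(times)
    t_end = times[-1]
    log_sum = sum(math.log(t_end / t) for t in times)
    beta = (n - bias_correction) / log_sum
    lam = n / t_end ** beta
    return lam, beta


def time_of_failure(k, lam, beta):
    """Cumulative time at which the fitted line reaches k failures."""
    return (k / lam) ** (1.0 / beta)


# Cumulative failure times reconstructed from Example 1 (days).
times = [60, 120, 180, 240, 300, 600, 900, 1200, 1500, 1800]

lam, beta = fit_power_law(times)
t11 = time_of_failure(11, lam, beta)
print(f"lam = {lam:.4f}, beta = {beta:.3f}")
print(f"failure 11 forecast at {t11:.0f} cumulative days, "
      f"{t11 - times[-1]:.0f} days after the last failure")
```

Small differences from the figures in the text come from rounding the parameters before extrapolating; calling `time_of_failure(11, 0.1645, 0.548)` uses the rounded values directly.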

 

The object of our reliability improvement is to find ways to prevent failures.  When you know the approximate time for the next failure (based on the fearless forecast) you need to find ways to prevent the failure. 

 

The first human reaction to Figure 2 is you cannot forecast failures.  The second human reaction based on actual experience is wow—this technique really works.  The third human reaction is to search for which item will fail next.  Finally the human reaction is to “get with the improvement program” to prevent failures. 

 

Unfortunately, it takes considerable time for humans to “buy into” the improvement program because they fail to acknowledge that such a simple equation can be a reasonably good predictor for single or mixed failure modes.  [As a side note, please recognize that many physical phenomena are described by simple equations: \(F = ma\), \(E = mc^{2}\), \(S = F/A\), etc.  Since most of you cannot derive or explain the theory behind these well known equations, why would you doubt that \(N(t) = \lambda t^{\beta}\) also describes important physical relationships in the field of reliability?]

 

Figure 3 shows the data in Figure 2 transformed by dividing the cumulative time by the cumulative failures.  In Figure 3, notice the clarity of the change in Cum-MTBF from the earlier plateau. 

 

The altitude of Figure 3 can go up (reliability improves), down (reliability deteriorates), or sideways (reliability is not changing).  Note that Figure 3 still carries the statistics from Figure 2.  The actual line slope of Figure 3 is usually represented by \(\alpha = 1 - \beta\).  The y-intercept of Figure 3 is \(1/\lambda\).  The trend line in Figure 3 can also be used for making “fearless forecasts” into the future to establish goals for the cumulative MTBF.
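The slope and intercept quoted for Figure 3 follow directly from dividing cumulative time by the cumulative failure equation.  The worked numbers below reuse the Example 1 values of 0.1645 and 0.548 for the intercept \(\lambda\) and slope \(\beta\) of the cumulative failure line; the label \(\mathrm{MTBF}_{c}\) is shorthand introduced here for cumulative MTBF.

```latex
\begin{align*}
\mathrm{MTBF}_{c}(t) &= \frac{t}{N(t)} = \frac{t}{\lambda t^{\beta}}
  = \frac{1}{\lambda}\, t^{\,1-\beta} \\
\text{slope} &= 1 - \beta = 1 - 0.548 = 0.452, \qquad
\text{intercept} = \frac{1}{\lambda} = \frac{1}{0.1645} \approx 6.08 \\
\mathrm{MTBF}_{c}(1800) &\approx 6.08 \times 1800^{0.452} \approx 180 \text{ days}
  \quad \text{(compare } 1800/10 = 180\text{)}
\end{align*}
```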

Table 2 shows many important equations describing various aspects of Crow/AMSAA plots.  The equations are described in The New Weibull Handbook.  The Crow/AMSAA plots are made with WinSMITH Visual software.

 

Notice each equation is described by means of the specific option in WinSMITH Visual software.  Which equation you use depends upon your specific interest and need.  I find the cumulative failure events plot most useful for my interest, followed by the cumulative MTBF plots.  Of course you should remember that my clients’ primary interest is producing a product for sale; use of these techniques for predicting the expected failure rate and deciding how to interpret the statistics is a secondary interest.

 

Other practitioners will have different needs, different interests, and thus will use different equations.

 

Consider the three precise trend lines in Figure 4.  All three trend lines have the first failure occurring at the same time. 

·  The line of no improvement/deterioration of course carries a beta = 1 for the line slope. 

·  The second line with beta < 1 shows an improvement as cumulative time data is stretched to longer time intervals. 

·  The third line with beta > 1 shows deterioration and the cumulative time data is compressed to shorter time intervals. 

These thoughts will be useful for a Monte Carlo model which will produce random times to failure as it is generally considered easy to create a model for beta =1 but not so easy to create models for betas different than 1.

 

Table 3 quantifies the multipliers for the stretch/compression in cum times.  The key to this method is taking the case of beta = 1 (which is easy to produce with random numbers) and transforming that simple case into other beta values by stretching or compressing the results.  The method shown in Table 3 will avoid hooked cumulative curves.
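Table 3 is not reproduced here, so the sketch below is not a copy of its multipliers.  It shows one standard way to carry out the same idea under stated assumptions: generate cumulative failure times for the easy beta = 1 case at one failure per unit time, then map each time s to \(t = (s/\lambda)^{1/\beta}\) so that the transformed times follow \(N(t) = \lambda t^{\beta}\) on average.  The function names, the seed, and the parameter values are illustrative.

```python
import random


def unit_rate_arrivals(n, rng):
    """Cumulative failure times for beta = 1 at one failure per unit time."""
    total = 0.0
    arrivals = []
    for _ in range(n):
        total += rng.expovariate(1.0)  # exponential gap between failures
        arrivals.append(total)
    return arrivals


def stretch_compress(unit_times, lam, beta):
    """Map beta = 1 cumulative times s to t = (s / lam) ** (1 / beta)."""
    return [(s / lam) ** (1.0 / beta) for s in unit_times]


rng = random.Random(2002)  # fixed seed so the run is repeatable
base = unit_rate_arrivals(10, rng)

# Same random draws pushed through two different slopes.
improving = stretch_compress(base, lam=0.1645, beta=0.548)
deteriorating = stretch_compress(base, lam=0.1645, beta=1.5)

for s, t_imp, t_det in zip(base, improving, deteriorating):
    print(f"{s:8.3f}  {t_imp:12.1f}  {t_det:8.2f}")
```

Because the mapping is monotonic, the transformed times stay in order, and reusing the same base draws makes it easy to compare how different slopes stretch or compress the same sequence of failures.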

 

Other Monte Carlo Crow/AMSAA simulation models are available at http://www.itl.nist.gov/div898/handbook/apr/section1/apr191.htm#Simulating NHPP Power Law Data, and in particular, the simulation details for Crow/AMSAA plots are given at http://www.itl.nist.gov/div898/handbook/apr/section1/apr191.htm.

 

You can see the detailed simulation of the stretch-compress method and the NIST method by downloading an Excel spreadsheet Crow/AMSAA simulation which has both methods illustrated.  The spreadsheet will allow you to examine a data set of 10 data points and 100 data points.  Clearly the NIST Crow/AMSAA simulation method is easier to use than the stretch-compression.  As you watch the Crow/AMSAA simulations you will see:

  1. The rank regression method is highly susceptible to the early failures which cause the slope to vary substantially from the true (precise) value—this is one reason some cases in MIL-HDBK-189 drop the first few data points from the regression analysis—more to follow on this subject at a later date. 
  2. “Unbiased” MLE methods (using less than say 500 data points) are biased—again, more to follow on this subject at a later date.
  3. Often (say 1 time in ~10) the simulation develops cusps on the cumulative trend line which says in real life cusps need to be evaluated carefully to distinguish between random events and significant events which should remind you of the TV advertisement “Is it real or is it Memorex?”.
  4. Usually the data points are clustered tightly around the trend line so that the confidence limits will be fairly tight, however, the confidence limits on the trend line will be very wide as the lines bounce around the true (precise) value which drives the simulation—again some of this bias can be removed by eliminating a few of the early data points from the regression so the line is allowed to gain “mass” so the inertia of the data points provides stabilization—stay tuned for more details to follow from a “kabillion” Monte Carlo Crow/AMSAA simulations.
  5. This type of Crow/AMSAA simulation, when automated, will generate the details needed for confidence intervals and critical correlation coefficients as published in The New Weibull Handbook.  More information to follow in subsequent Problems Of The Month.
  6. WinSMITH Visual analysis of the data sets in the Excel spreadsheet Crow/AMSAA simulation matches the results obtained by Excel and provides an excellent Monte Carlo simulation (Version 4.0T and above) which can be observed with fidelity even on the demonstration software.

 

 

USA Aircraft production delivered during WWII is reported at http://www.cr.nps.gov/nr/travel/aviation/modernaviation.htm as:

Type                     1940    1941    1942    1943    1944    1945    Total
Very Heavy Bombers          0       0       4      91   1,147   2,657    3,899
Heavy Bombers              19     181   2,241   8,695  13,057   3,681   27,874
Medium Bombers             24     326   2,429   3,989   3,636   1,432   11,836
Light Bombers              16     373   1,153   2,247   2,276   1,720    7,785
Fighters                  187   1,727   5,213  11,766  18,291  10,591   47,775
Reconnaissance             10     165     195     320     241     285    1,216
Transports                  5     133   1,264   5,072   6,430   3,043   15,947
Trainers                  948   5,585  11,004  11,246   4,861     825   34,469
Communication/Liaison       0     233   2,945   2,463   1,608   2,020    9,269
Total                   1,209   8,723  26,448  45,889  51,547  26,254  160,070


Refer to the caveats on the Problem Of The Month Page about the limitations of the solution above. Maybe you have a better idea on how to solve the problem. Maybe you find where I've screwed-up the solution and you can point out my errors as you check my calculations. E-mail your comments, criticism, and corrections to Paul Barringer.

Technical tools are only interesting toys for engineers until results are converted into a business solution involving money and time. Complete your analysis with a bottom line which converts $'s and time so you have answers that will interest your management team!



Last revised September 2, 2013
© Barringer & Associates, Inc. 2002-2004