Problem Of The Month
November 2002: Crow/AMSAA Reliability Growth Plots
Download this problem of the month as a PDF file (450 KB).
Overview:
Reliability growth plots go by a variety of names: Duane plots, Crow plots, Crow AMSAA plots, Crow-AMSAA plots, Crow/AMSAA plots, C/A plots, and C-A plots. They are log-log plots showing reliability trends of improvement, deterioration, or no change. The most common plot is cumulative failures versus cumulative time. Often the Y-axis is transformed to plot cumulative mean time between failures (MTBF) versus cumulative time, which is easy to interpret: when the line slopes upward and to the right, reliability is improving; when it trends downward and to the right, reliability is deteriorating.
The plots take a “show me, don’t tell me” approach to how failures are occurring with time. You can use your maintenance data records to forecast future failures. You can also see the results of improvement programs and easily calculate the changes from the straight lines and the cusps produced by improvement programs.
Reliability growth plots show how reliability changes over time with simple graphics plotted in a log-log format. Fortunately, the trend lines often have straight line segments, and this makes predictions of future failures a simple matter. See Figure 1 for an example of a simple plot of cumulative failures versus cumulative time made using WinSMITH Visual software.
In Figure 1, the literal value of beta>1 may mean failures are increasing or it may also mean that for practical purposes, the system shows no improvement or no deterioration. Crow/AMSAA plots with their key indicator slopes function like yardsticks/meter sticks rather than as a micrometer.
Your explanations are never going to be simpler than cumulative failures versus cumulative time shown in Figure 1. The straight trend line offers a methodology for making fearless forecasts of future failures even when your data contains mixed failure modes.
In “real life” things change from improvements or deteriorations which are made to the system. We need artifacts and analogs to show us what’s happening in some simple manner. For example, we need thermometers to indicate rise/fall in temperature. We need scales/balances to show changes in physical mass. We need relationships that provide an analog of physical and human experiences. Thus we need a reliability growth plot to give us clues about changes in failure rates. Consequently we can use the reliability growth tool to make “fearless forecasts” about when future failures will occur on the cumulative failure versus cumulative time plot by simply extrapolating the trend line into the future. As reliability engineers, our task is to put a cusp on the trend line by making cost effective improvements so the cumulative failure versus cumulative time trend line has a flatter slope (i.e., beta is less than 1). Thus reliability growth plots are helpful for reflecting the changes in failure modes by “digesting” the data from mixed failure modes.
Reliability changes usually occur in steps. Reliability improvements involve elapsed time and failures. Longer elapsed times between failures mean reliability has improved: the components and/or the system then display more reliability. When cumulative time (plotted on the X-axis) and cumulative failures (plotted on the Y-axis) are plotted on uniformly divided graphs, they provide us an analogy of physical experiences as a curved plot.
Experience shows that curved analog plots of improvement efforts can often be transformed into straight line analog plots by use of simple two-axis logarithmic plots. With the log-log plot, reliability improvements can often be observed as a straight line. When engineers have an X-Y plot with a straight line, they have a fundamental grasp of what’s happening in the real world and can explain the phenomena.
The task of most reliability engineers is to force cusps (a break in the straight line trend line of cumulative failures on the Y-axis versus cumulative time on the X-axis) on the reliability growth lines so that longer intervals of time occur between failures. Reliability improvement efforts should occur until the cost of making improvements is no longer justified or until objectives of the client have been reached.
Why do Crow/AMSAA plots produce straight lines on log-log plots when cumulative failures are plotted versus cumulative time? The forerunner of the concept has parallel roots in manufacturing and has been exhaustively demonstrated as a true log-log phenomenon. It’s a natural occurrence of learning/improving. Consider the following parallel.
T. P. Wright (1936) pioneered the idea that improvements in the time to manufacture an airplane could be described mathematically, a very helpful concept for management production planning. Wright’s findings showed that, as airplanes were produced in sequence, the direct labor input per airplane decreased in a mathematical pattern that forms a straight line when plotted on log-log paper. If the rate of improvement is 20% (the learning percentage is 80%), then each time the production quantity of a large process or complicated operation is doubled, the time required to complete a unit is 20% less. Thus the time per unit of production decreases by a constant percentage each time the production quantity is doubled.
Wright’s method in the 1940’s was a helpful concept for the USA War Production Board in estimating the number of airplanes that could be produced for a given complement of men and machines. After the end of World War II, the US Government employed the Stanford Research Institute (SRI) to validate improvement curve concepts. SRI studied all USA airframe WWII production data (see table at bottom of this page) to validate the concept, and SRI developed a slightly different version than the simple case offered by Wright (DOD 2003), which also plotted as a straight line on a log-log plot.
Today Wright’s log-log concept is known as learning curves, cost improvement curves, progress functions, Crawford curves (J. R. Crawford was on the SRI validation team; Crawford’s model is considered less technical than Wright’s model), Boeing curves, Northrop curves, and so forth, representing the findings of the airframe manufacturers, each of whom developed a variation on T. P. Wright’s simple equation.
The simple improvement curve was Y = AX^{B}, which produces a straight line on log-log paper, where Y is the unit cost (hours/unit or $’s/unit), X is the unit number, A is the theoretical cost of the first unit (hours or $’s), and B is a line slope constant related to the rate of improvement [B is literally equal to ln(learning percent/100)/ln(2), where the learning percent = 100 - (rate of improvement)]. For example, if the first unit took 100 hours to complete (A = 100) and we had an improvement rate of 20%, the learning percentage would be 80%, so that B = ln(1.00 - 0.20)/ln(2) = -0.32193. Thus we would expect production of the 2^{nd} item to require 80 hours and the 4^{th} item to require 64 hours, and so forth: each time the production quantity doubles we shave 20% from the production time.
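The arithmetic of the improvement curve can be checked with a few lines of Python. This is only a minimal sketch of Y = AX^{B} using the 100-hour first unit and 80% learning percentage from the example; the function names are illustrative and are not taken from any of the cited sources.

```python
import math


def learning_exponent(learning_percent):
    """Slope B of the log-log line for a learning percentage such as 80."""
    return math.log(learning_percent / 100.0) / math.log(2.0)


def unit_cost(first_unit_cost, unit_number, learning_percent):
    """Wright-style unit cost Y = A * X**B."""
    return first_unit_cost * unit_number ** learning_exponent(learning_percent)


B = learning_exponent(80)
print("B =", round(B, 5))

# Each doubling of the unit number should shave 20% from the unit time.
for x in (1, 2, 4, 8):
    print("unit", x, "hours", round(unit_cost(100.0, x, 80), 2))
```

Changing the learning percentage to 90 or 75 shows how quickly the curve flattens or steepens over the same number of doublings.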
Some typical learning curve slopes are described at the NASA Cost Estimating Website (NASA 2003), and the learning % varies from 96% for raw materials (little improvement) to 75% for repetitive electrical operations (rapid improvement), with most values around 80-90%. The plots have three different formats: 1) hours/unit or $/unit versus cumulative production, 2) cumulative (hours or $’s) versus cumulative production, or 3) cumulative average (hours or $’s) versus cumulative production.
Learning curves were used extensively by General Electric, and a GE reliability engineer made log-log plots of cumulative MTBF versus cumulative time which gave a straight line for reliability issues (Duane 1964). Duane argued that all failure data should be used on complex electromechanical systems. He recommended the Y-axis should be Y = (cumulative failures)/(cumulative time) = KT^{-α}, where K is a constant which depends upon equipment complexity, design margins, and design objectives for reliability, T is cumulative time, and α ≈ 0.5, with the expectation that some designs would be better (meaning α > 0.5) and some would be worse (meaning α < 0.5). Duane drew his conclusions from studying 5 different data sets and found remarkable similarity in the patterns of the curves (meaning the line slopes were about the same). Duane also rearranged his equations and showed cumulative failures F = KT^{(1-α)}, which allowed forecasting of future failures based on past results. James Duane had a deterministic postulate for monitoring failures and failure rates of a complex system over time using a log-log plot with straight lines.
At the US Army Materiel Systems Analysis Activity (AMSAA) during the mid 1970’s, Larry Crow converted Duane’s postulate into a mathematical and statistical proof via Weibull statistics, published in MIL-HDBK-189 (DOD 1981). The military handbook addressed:
· Reliability growth: the positive improvement in a reliability parameter over a period of time due to changes in product design or the manufacturing process.
· Reliability growth management: the systematic planning for reliability achievement as a function of time and other resources, and controlling the ongoing rate of achievement by reallocation of resources based on comparisons between planned and assessed reliability values.
The ultimate goal of the improvement program was to make reliability grow so as to meet the system reliability and performance requirements by managing the development program. The management effort required making reliability 1) visible and 2) a manageable characteristic. A reliability growth program required goals and forecasts of progress. The failure data usually produced straight line segments on log-log plots with N(t) = λt^{β}, where N is the expected number of failures, λ is the scale parameter (the expected number of failures at time t = 1, i.e. the Y-axis intercept), t is cumulative time, and β is the line slope for cumulative failures versus cumulative time (β = 1 - α, with α the exponent in Duane’s equation). Because log N(t) = log λ + β log t, failure data that fit N(t) = λt^{β} plot as a straight line on log-log paper.
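The link between Duane’s postulate and the Crow/AMSAA form can be written out in a few lines. This is a sketch of the algebra only, with λ the intercept, β the slope, α Duane’s exponent, and Duane’s cumulative time T taken to be the same quantity as t:

```latex
\begin{align*}
N(t) &= \lambda t^{\beta}
  && \text{expected cumulative failures} \\
\log N(t) &= \log \lambda + \beta \log t
  && \text{straight line on log-log axes, slope } \beta \\
\frac{N(t)}{t} &= \lambda t^{\beta - 1} = K T^{-\alpha}
  && \Rightarrow\; K = \lambda,\quad \alpha = 1 - \beta \\
\frac{t}{N(t)} &= \frac{1}{\lambda}\, t^{\,1 - \beta}
  && \text{cumulative MTBF, slope } 1 - \beta
\end{align*}
```

The last line is why a cumulative MTBF plot rises when β < 1 and falls when β > 1.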
Plotting data from maintenance failure databases on a log-log plot builds a Crow/AMSAA relationship: the Y-axis intercept at t = 1 gives λ, and the slope of the line gives β, which reflects changes in the programs. Thus future failures can be forecasted, and cusps on the data trends will tell if the system is improving (failures are coming more slowly, β<1), deteriorating (failures are coming more quickly, β>1), or without improvement/deterioration (failure rates are unchanged, β≈1).
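One simple way to get the intercept and slope from a list of cumulative times and cumulative failure counts is ordinary least squares on the logarithms. The sketch below makes that assumption; it is not necessarily the fitting method used by WinSMITH Visual or the military handbook, which may give somewhat different values on the same data. The 0.1 tolerance band around a slope of 1 and the demonstration data are hypothetical, chosen only to show the round trip.

```python
import math


def fit_power_law(cum_times, cum_failures):
    """Least-squares fit of log N = log(lam) + beta * log(t).

    Returns (lam, beta). Regression on logs is an assumed, simple
    technique; other estimators will not give identical numbers.
    """
    xs = [math.log(t) for t in cum_times]
    ys = [math.log(n) for n in cum_failures]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    lam = math.exp(mean_y - beta * mean_x)
    return lam, beta


def classify(beta, tolerance=0.1):
    """Coarse yardstick reading of the slope (tolerance is hypothetical)."""
    if beta < 1.0 - tolerance:
        return "improving"
    if beta > 1.0 + tolerance:
        return "deteriorating"
    return "no change"


# Synthetic points lying exactly on N(t) = 0.5 * t**0.7, used only to
# confirm that the fit recovers the parameters it was given.
demo_times = [10.0, 50.0, 100.0, 500.0, 1000.0]
demo_failures = [0.5 * t ** 0.7 for t in demo_times]
lam_hat, beta_hat = fit_power_law(demo_times, demo_failures)
print(round(lam_hat, 4), round(beta_hat, 4), classify(beta_hat))
```

With real maintenance data the points scatter around the line, so the slope should be read as a yardstick rather than a micrometer.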
Recently AMSAA has updated the information from Military Handbook MIL-HDBK-189 and produced the AMSAA Reliability Growth Guide TR-652 (DOD 2000).
Two excellent documents on the subject of reliability growth are:
MIL-HDBK-189, Reliability Growth Management, 13 February 1981. Download from ASSIST Quick Search as a PDF file using the title for the search. This PDF is 8.1 Meg in size.
The AMSAA Technical Report No. TR-652 is called the AMSAA Reliability Growth Guide. You can download this September 2000 document as a PDF from this site:
Cover pages through Section 1-Introduction: Pages Cover-24 (1.6 Meg)
Section 2-Reliability Growth Planning: Pages 18-47 (2.1 Meg)
Section 3-Reliability Growth Tracking: Pages 48-86 (2.2 Meg)
Section 4-Reliability Growth Projection: Pages 87-133 (2.5 Meg)
Appendix A-Background: Pages A1-A5 (0.3 Meg)
Appendix B-Tables For Section 2: Pages B1-B43 (3.2 Meg)
Appendix C-Derivations For Section 2: Pages C1-C8 (0.2 Meg)
Appendix D-Derivations For Section 4: Pages D1-D12 (0.4 Meg)
Appendix E-Distribution List: Pages E1-E3 (0.1 Meg)
This publication is not listed at http://www.ntis.gov although space exists in the reference to HDBK-A-1 documentation for inclusion of TR-652. Here is the abstract for TR-652:
Reliability growth is the improvement in a reliability parameter over a period of time due to changes in product design or the manufacturing process. It occurs by surfacing failure modes and implementing effective corrective actions. Reliability growth management is the systematic planning for reliability achievement as a function of time and other resources, and controlling the ongoing rate of achievement by reallocation of these resources based on comparisons between planned and assessed reliability values. To help manage these reliability activities throughout the development life cycle, AMSAA has developed reliability growth methodology for all phases of the process, from planning to tracking to projection. The report presents this methodology and associated reliability growth concepts.
Both MIL-HDBK-189 and TR-652 present methodologies and concepts to assist in reliability growth planning. They provide a structured approach for reliability growth assessments. In general, they are written from the standpoint that you begin with some new components and grow the reliability of a system with a development program.
Another source of reliability growth information is IEC 61164 (this document was previously numbered as IEC 1164) Reliability Growth-Statistical Test and Estimation Methods. The IEC-61164 document is a product of TC-56 work group which has provided about 50 documents pertaining to reliability and dependability.
You can also download a reliability growth paper from the Internet written by David W. Coit, titled “Economic allocation of test times for subsystem-level reliability growth testing”, published in IIE Transactions (1998) 30, 1143-1151. This paper contains numerous references.
For plant equipment and operation, the reliability details described below are a little different:
· The primary purpose of our business activities is to run our production facilities to make money and not to make an improvement program
· We have old equipment that can only be improved at specific time intervals IF the improvement is truly cost effective
· Reliability growth occurs on the device while we are using the equipment for its primary purpose as a link in the money making machine, without time or resources for validating claims for improvements.
· We lack staff and we lack verified knowledge that our planned improvements will function as forecasted
· We need to forecast when the next failure is expected so we can plan for replacement/enhancements during scheduled turnarounds
· The reliability improvement process competes for limited funding with every other program within the production/maintenance organization
In real production plants the reliability improvement program is clearly a question of which comes first, the chicken or the egg. This requires reliability engineers to have numbers and then sell the numbers to management in 60 second sound bites based on 1) describing the issue and 2) telling how we will resolve the issue in time and money. The 60 second sound bites require that we have good sales tools, and the graphics of Crow/AMSAA plots help us sell the program.
Several simple examples of Crow/AMSAA plots
Consider the following simple discrete examples to illustrate the plotting and calculation concept.
Example 1: Suppose we had a system that failed every 60 days for a total of 5 failures. Each corrective maintenance action was a repair (replacement components have the same length of life). Following the fifth failure, we added a fix (replacement with a longer life component) with a life of 300 days/failure. Components that fail subsequently will also be replaced with longer life components. The data and calculations are shown in Table 1.
Of course in real life, the failures would not occur at the same time interval. Thus real life results lack the clarity of Table 1. You should expect to see much variability in ages to failure as they will occur with randomness from their family of failure characteristics.
Fortunately Crow/AMSAA plots allow mixed failure modes. This is unlike Weibull plots which require singular failure modes for analysis.
Figure 2 shows a plot of cumulative failures versus cumulative time. The altitude of this curve always rises according to the equation N(t) = λt^{β}, where N is cumulative failures, λ is the y-intercept at time = 1, and β is the indicator of reliability improvement (β<1), reliability deterioration (β>1), or no reliability change (β=1).
The simple equation N(t) = λt^{β} can be used to make a “fearless forecast” of when the next failure will occur (that is, failure number 11 for this case): t = (11/0.1645)^{(1/0.548)} = 66.869^{1.8248} = 2141.28 days of cumulative time. The “fearless forecast” of the time to the next failure is Δt = 2141.28 - 1800 = 341 days, compared to the 300 days expected from the discrete data in Table 1 (remember, in real life you would not have discrete data!). Crow/AMSAA plots are very useful for predicting future failures based on your data. The technique provides a methodology, the equations are simple, the failure forecast is based on your data, and you can make reasonable forecasts of future events. Remember, our task as reliability engineers is to make improvements so that we do not incur the predicted future failures!
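The forecast arithmetic is easy to reproduce. The sketch below inverts the power law for the time of the n-th failure, using the intercept 0.1645 and slope 0.548 quoted above and the 1800 days of cumulative time at the tenth failure; the function name is illustrative. Because the hand calculation rounds its intermediate values, carrying full precision may shift the result by a fraction of a day.

```python
def time_of_failure(n, lam, beta):
    """Cumulative time at which N(t) = lam * t**beta reaches n failures."""
    return (n / lam) ** (1.0 / beta)


lam, beta = 0.1645, 0.548      # trend-line values quoted in the text
time_at_last_failure = 1800.0  # cumulative days at failure number 10

t_next = time_of_failure(11, lam, beta)
delta = t_next - time_at_last_failure
print("forecast cumulative time of failure 11:", round(t_next, 1))
print("days until the next failure:", round(delta, 1))
```

The same function gives later failures too, for example `time_of_failure(12, lam, beta)`, although extrapolating further ahead assumes the trend line holds with no new cusp.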
The object of our reliability improvement is to find ways to prevent failures. When you know the approximate time for the next failure (based on the fearless forecast) you need to find ways to prevent the failure.
The first human reaction to Figure 2 is that you cannot forecast failures. The second human reaction, based on actual experience, is: wow, this technique really works. The third human reaction is to search for which item will fail next. The final human reaction is to “get with the improvement program” to prevent failures.
Unfortunately, it takes considerable time for humans to “buy into” the improvement program because they fail to acknowledge that such a simple equation can be a reasonably good predictor for single or mixed failure modes. [As a side note, please recognize that many physical phenomena are described by simple equations: F = ma, E = mc^{2}, S = F/A, etc. Since most of you cannot derive or explain the theory behind these well known equations, why would you doubt that N(t) = λt^{β} also describes important physical relationships in the field of reliability?]
Figure 3 shows the data in Figure 2 transformed by dividing the cumulative time by the cumulative failures. In Figure 3, notice the clarity of the change in Cum-MTBF from the earlier plateau.
The altitude of Figure 3 can go up (reliability improves), down (reliability deteriorates), or sideways (reliability is not changing). Note that Figure 3 still carries the statistics from Figure 2. The actual line slope of Figure 3 is usually represented by α = 1-β. The y-intercept of Figure 3 is 1/λ. The trend line in Figure 3 can also be used for making “fearless forecasts” into the future to establish goals for the cumulative MTBF.
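The transformation to cumulative MTBF is just cumulative time divided by cumulative failures. The sketch below applies it to the Example 1 failure times as described in the text (five failures 60 days apart, then five more 300 days apart) and compares the last point with the trend line t^{(1-β)}/λ using the quoted intercept 0.1645 and slope 0.548. The variable and function names are illustrative.

```python
# Cumulative failure times for Example 1, rebuilt from the description:
# five failures at 60-day intervals, then five at 300-day intervals.
failure_times = [60, 120, 180, 240, 300, 600, 900, 1200, 1500, 1800]

# Cumulative MTBF after the n-th failure = cumulative time / n.
cum_mtbf = [t / n for n, t in enumerate(failure_times, start=1)]


def cum_mtbf_trend(t, lam, beta):
    """Trend-line value t / N(t) = t**(1 - beta) / lam."""
    return t ** (1.0 - beta) / lam


for n, (t, m) in enumerate(zip(failure_times, cum_mtbf), start=1):
    print(n, t, round(m, 2))

trend_at_end = cum_mtbf_trend(1800.0, 0.1645, 0.548)
print("trend line at 1800 days:", round(trend_at_end, 1))
```

The flat run of 60 days followed by a steady climb is the plateau and change that the cumulative MTBF plot makes visible.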
Table 2 shows many important equations describing various aspects of Crow/AMSAA plots. The equations are described in The New Weibull Handbook. The Crow/AMSAA plots are made with WinSMITH Visual software.
Notice each equation is described by means of the specific option in WinSMITH Visual software. Which equation you use depends upon your specific interest and need. I find the cumulative failure events plot most useful for my interests, followed by the cumulative MTBF plot. Of course you should remember that my clients’ primary interest is producing a product for sale; using these techniques to predict the expected failure rate and decide how to interpret the statistics is a secondary interest.
Other practitioners will have different needs, different interests, and thus will use different equations.
Consider the three precise trend lines in Figure 4. All three trend lines have the first failure occurring at the same time.
· The line of no improvement/deterioration of course carries a beta = 1 for the line slope.
· The second line with beta < 1 shows an improvement as cumulative time data is stretched to longer time intervals.
· The third line with beta > 1 shows deterioration and the cumulative time data is compressed to shorter time intervals.
These thoughts will be useful for a Monte Carlo model which will produce random times to failure as it is generally considered easy to create a model for beta =1 but not so easy to create models for betas different than 1.
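The stretching and compressing of cumulative times can be illustrated directly from the power law. If all lines share the same first failure time t_1, then N(t_1) = 1 fixes the intercept, and the n-th failure falls at t_n = t_1 n^{(1/β)}. The sketch below works from that derivation; it is not a reproduction of Table 3, and the 100-day first failure and the beta values are hypothetical.

```python
def nth_failure_time(n, first_failure_time, beta):
    """Time of failure n on a trend line N(t) = lam * t**beta whose first
    failure occurs at first_failure_time, so t_n = t_1 * n**(1/beta)."""
    return first_failure_time * n ** (1.0 / beta)


first_failure = 100.0  # hypothetical first failure time, in days
for beta in (0.5, 1.0, 2.0):
    times = [round(nth_failure_time(n, first_failure, beta), 1)
             for n in range(1, 5)]
    print("beta =", beta, "cumulative times:", times)
```

Relative to the beta = 1 line, a beta below 1 pushes each later failure further out in time and a beta above 1 pulls it in.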
Table 3 quantifies the multipliers for the stretch/compression in cumulative times. The key to this method is taking the beta = 1 case (random numbers for beta = 1 are easy to produce) and transforming that simple case into other beta values by stretching or compressing the results. The method shown in Table 3 will avoid hooked cumulative curves.
Other Monte Carlo Crow/AMSAA simulation models are available at http://www.itl.nist.gov/div898/handbook/apr/section1/apr191.htm#Simulating NHPP Power Law Data, and in particular, the simulation details for Crow/AMSAA plots are given specifically at http://www.itl.nist.gov/div898/handbook/apr/section1/apr191.htm.
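For readers who want to experiment without the spreadsheet, the sketch below simulates failure times whose expected cumulative count follows the power law. It uses the standard time-transformation idea: generate arrival times of a beta = 1 process with unit rate (running sums of exponential random numbers), then map each arrival s to t = (s/λ)^{(1/β)}. Whether this matches the NIST procedure or the spreadsheet step for step is an assumption; the function name, seed, and parameter values are illustrative.

```python
import random


def simulate_power_law_failures(n_failures, lam, beta, rng=None):
    """Simulate cumulative failure times with expected count lam * t**beta.

    Arrivals of a unit-rate process (beta = 1) are transformed to the
    power-law time scale; this is a generic technique, not a copy of any
    particular published worksheet.
    """
    rng = rng or random.Random()
    unit_rate_arrival = 0.0
    times = []
    for _ in range(n_failures):
        # Exponential gaps with mean 1 give the beta = 1 arrival times.
        unit_rate_arrival += rng.expovariate(1.0)
        # Stretch or compress onto the power-law time scale.
        times.append((unit_rate_arrival / lam) ** (1.0 / beta))
    return times


sim_times = simulate_power_law_failures(10, 0.1645, 0.548, random.Random(1))
for n, t in enumerate(sim_times, start=1):
    print(n, round(t, 1))
```

Running it with different seeds shows how much individual data sets of 10 points can wander around the same underlying trend line.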
You can see the detailed simulation of the stretch-compress method and the NIST method by downloading an Excel spreadsheet Crow/AMSAA simulation which has both methods illustrated. The spreadsheet will allow you to examine a data set of 10 data points and 100 data points. Clearly the NIST Crow/AMSAA simulation method is easier to use than the stretch-compression. As you watch the Crow/AMSAA simulations you will see:
USA Aircraft production delivered during WWII is reported at http://www.cr.nps.gov/nr/travel/aviation/modernaviation.htm as:
Type | 1940 | 1941 | 1942 | 1943 | 1944 | 1945 | Total
Very Heavy Bombers | 0 | 0 | 4 | 91 | 1,147 | 2,657 | 3,899
Heavy Bombers | 19 | 181 | 2,241 | 8,695 | 13,057 | 3,681 | 27,874
Medium Bombers | 24 | 326 | 2,429 | 3,989 | 3,636 | 1,432 | 11,836
Light Bombers | 16 | 373 | 1,153 | 2,247 | 2,276 | 1,720 | 7,785
Fighters | 187 | 1,727 | 5,213 | 11,766 | 18,291 | 10,591 | 47,775
Reconnaissance | 10 | 165 | 195 | 320 | 241 | 285 | 1,216
Transports | 5 | 133 | 1,264 | 5,072 | 6,430 | 3,043 | 15,947
Trainers | 948 | 5,585 | 11,004 | 11,246 | 4,861 | 825 | 34,469
Communication/Liaison | 0 | 233 | 2,945 | 2,463 | 1,608 | 2,020 | 9,269
Total | 1,209 | 8,723 | 26,448 | 45,889 | 51,547 | 26,254 | 160,070
Refer to the caveats on the Problem Of The Month Page about the limitations of the solution above. Maybe you have a better idea on how to solve the problem. Maybe you will find where I've screwed up the solution and can point out my errors as you check my calculations. E-mail your comments, criticism, and corrections to Paul Barringer.
Technical tools are only interesting toys for engineers until results are converted into a business solution involving money and time. Complete your analysis with a bottom line which converts the results into $'s and time so you have answers that will interest your management team!