Mean Time Between Failures, Confidence Intervals, Number of Samples to Test, Warranty Failures, Weibull Analysis, and Monte Carlo Test Results
Things that can be repaired have a metric called mean time between failures (MTBF). Things that can't be repaired have a metric called mean time to failure (MTTF). In casual conversation we often mix the two terms together. For example, consider a powered chain saw for cutting trees: 1) the cutting edges on the chain have an MTBF, because we periodically sharpen dull cutters, whereas 2) the physical links in the chain have an MTTF, because we run the chain until it fails rather than replacing one link at a time. Of course both issues are failures because of the loss of utility.
Confidence intervals are a type of interval estimate of a population value, representing the uncertainty of results that arises from sample-to-sample variation. One sample tested will have very wide confidence intervals. If 1,000 independent samples are tested, the confidence interval will be much narrower, because repeated tests provide more evidence about the variability. Confidence intervals (some may prefer confidence levels) describe the probability of the true value lying between two bounds based on the testing evidence. Many test results will give smaller intervals than just a few tests based on factual test evidence.
A 99% confidence interval for a specific test will be wider than a 90% confidence interval for the same number of tests, given the inherent variability of the test outcomes. A two-sided 99% confidence interval leaves a zone of 0% to 0.5% of the probability below the lower bound and 99.5% to 100% above the upper bound. 80%, 90%, 95%, 99%, or 99.9% confidence intervals have correspondingly small probabilities of the true value falling outside the confidence interval. Small zones of uncertainty may be selected for medical issues because of the high $-risk consequences of failing to achieve the expected results, whereas large zones of uncertainty are acceptable when the $-consequences of each failure are small.
For casual industrial applications, an 80% confidence interval would have 10% uncertainty on the low side and 10% on the high side. A more common industrial choice is 90% confidence with 5% uncertainty on the low side and 5% on the high side. This means the true value is expected to lie within the confidence interval with that probability. The 80%, 90%, 95%, 99%, or 99.9% confidence interval represents the expected reliability of the true value lying within the confidence limits established; it is not a guarantee of never finding results outside the confidence interval in repeated testing.
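The meaning of a confidence interval can be checked by simulation. The sketch below is illustrative and not from the article: it draws repeated samples from a known distribution (the normal distribution, sample size, and z-value are assumptions) and counts how often a rough 90% interval for the mean actually contains the true mean.

```python
import random
import statistics

# Illustrative sketch: repeated samples from a known distribution,
# counting how often a rough two-sided 90% interval covers the true mean.
random.seed(1)
TRUE_MEAN = 100.0
TRIALS = 2000
covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, 15.0) for _ in range(30)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / len(sample) ** 0.5
    lo, hi = m - 1.645 * se, m + 1.645 * se  # 1.645 = z for 90%, normal approx
    if lo <= TRUE_MEAN <= hi:
        covered += 1
print(f"coverage: {covered / TRIALS:.2f}")  # close to, but not exactly, 0.90
```

Roughly 90% of the intervals cover the true mean, and roughly 10% do not, which is exactly the "not a guarantee" point above.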
More testing is required to show that results lie within the range of a higher confidence interval, because each test adds some variability. More testing also means higher expense for proving the case. Testing can become a wickedly expensive venture for proving the case about variability in results.
Monte Carlo techniques and Weibull statistics can help provide the details we need for making the life calculations. Modern versions of Excel have an excellent random number generator. Drawing random values with Excel's random number generator and putting them into the Weibull equations gives us the life of a component.
From the law of large numbers (LLN), described as early as the 1500s by Cardano, we can get a good approximation of the expected life values for ages to failure. The LLN has two variations: 1) the strong law of large numbers, which converges almost surely to the expected value (we'll use this), and 2) the weak law of large numbers, which converges in probability toward the expected value.
Random numbers can (as the Wichmann and Hill details explained by Carl Tarum's demonstration) vary from 0 to 1, as does the Excel random number generator. The Weibull cumulative distribution function (CDF) also varies from 0 to 1, which makes Excel's random numbers easy to use for calculating ages to failure. The CDF for the 2-parameter Weibull distribution as a function of time is:
F(t) = 1 − e^−(t/η)^β
where F(t) = CDF(t) varies from 0 to 1, t is time to failure, β is the shape factor (Weibull plot line slope), η is the characteristic life (the location where the Weibull line crosses 63.2%), and e is the mathematical constant 2.71828 18284 59045… .
Solving the Weibull CDF equation for time to failure, t, we get:
t = η*(ln(1/(1 − CDF)))^(1/β)
where the CDF is supplied as RAND() in Excel Monte Carlo simulations to get the age to failure, t.
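The inversion above translates directly into code. Below is a minimal Python sketch of the same calculation, with `random.random()` playing the role of Excel's RAND(); the β = 2.0 and η = 1000.0 values are illustrative assumptions.

```python
import math
import random

def weibull_age(beta, eta):
    """Draw one age to failure via t = η*(ln(1/(1 − CDF)))^(1/β)."""
    u = random.random()  # stands in for Excel's RAND(); this is the CDF value
    return eta * (math.log(1.0 / (1.0 - u))) ** (1.0 / beta)

# Illustrative parameters, not from the article:
random.seed(42)
ages = [weibull_age(beta=2.0, eta=1000.0) for _ in range(100_000)]
mean_age = sum(ages) / len(ages)
# For β = 2 the theoretical mean is η*Γ(1 + 1/β) = 1000*Γ(1.5) ≈ 886.2
print(round(mean_age, 1))
```

With 100,000 draws the sample mean lands close to the theoretical Weibull mean, which is the law of large numbers at work.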
The MTBF is the (summation of failure times)/(summation of failures). The MTTF for most applications is reasonably close to the MTBF. As you will see later, the MTBF from the simulation is actually the median value, for ease of calculation.
By considering the different Weibull failure modes and using a series system model, we can find the life values for each test of the system. When the cumulative life is stored for sample sizes of 1, 2, … and 50, we then sort the mean life data from low to high for each of the sample sizes 1 through 50.
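The series system model means each test draws an age for every failure mode, and the system fails at the minimum of those draws (the pacing failure mode). A minimal sketch, assuming three made-up (β, η) pairs rather than the article's actual inputs:

```python
import math
import random

# Illustrative (β, η) pairs; the article's model allows up to 10 modes.
MODES = [(1.2, 900_000.0), (2.5, 600_000.0), (0.8, 1_500_000.0)]

def system_life():
    """One series-system test: the first failure mode to occur ends the test."""
    draws = [eta * (-math.log(1.0 - random.random())) ** (1.0 / beta)
             for beta, eta in MODES]
    return min(draws)  # series system: minimum draw is the pacing failure

random.seed(7)
lives = sorted(system_life() for _ in range(50))  # one column of 50 system tests
print(f"shortest life in 50 system tests: {lives[0]:,.0f}")
```

Note that −ln(1 − u) is the same quantity as ln(1/(1 − u)) in the inversion formula.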
From the sorted data of the 1- to 50-sample columns we select the midpoint of each column as the mean, the 5% position of the column as the lower bound, and the 95% position as the upper bound. Thus we find the 90% confidence interval and the mean for our confidence data. The column of data for each of the 50 sample sizes will give the typical mean and confidence intervals when repeated for up to, say, 500,000 iterations. If you run fewer iterations, say 50, you'll get silly information, so err on the side of running a larger number of iterations, say 50,000+, to get better results.
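The percentile step above can be sketched in a few lines of Python. The exponential column of data below is an illustrative stand-in for one column of simulation results:

```python
import random

def ci_90(values):
    """5th, 50th, and 95th percentile positions of a sorted column:
    lower bound, central value, and upper bound of a 90% interval."""
    s = sorted(values)
    n = len(s)
    return s[int(0.05 * n)], s[n // 2], s[int(0.95 * n)]

# Illustrative column of 50,000 simulated lives (not the article's data):
random.seed(3)
column = [random.expovariate(1 / 300_000.0) for _ in range(50_000)]
low, mid, high = ci_90(column)
print(f"90% interval: {low:,.0f} .. {mid:,.0f} .. {high:,.0f}")
```

With 50,000 entries per column, the percentile positions are stable; with only 50 entries they would jump around badly, which is the "silly information" warning above.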
You should expect the confidence intervals to have a curved, funnel-shaped contour around the mean as results converge toward the true values. Of course a few data points (meaning a few Monte Carlo simulations) will not be very accurate, with large variation for each run, while a large number of data points (meaning many Monte Carlo simulations) will converge toward a more accurate value with smaller variances.
You can download a Monte Carlo simulation in Excel that allows 10 different failure modes and up to 500,000 iterations. Of course you must add your own Weibull statistics to get your data-driven results. Generally speaking, 100,000 iterations will get results consistent to two decimal places from the Monte Carlo Excel model, 200,000 iterations will get your output accurate to three decimal places, and 500,000 iterations to maybe 3-4 decimal places; use your good engineering judgment on these figures along with reasonably accurate Weibull analysis results. The Monte Carlo Excel model represents a single device with up to 10 separate failure modes, or up to 10 devices in series with individual primary failure modes. You do not have to use all 10 failure modes for the Monte Carlo Excel model to function; just leave the input cells for β and η without numbers.
How do you do Weibull analysis? Use SuperSMITH Weibull software, one of three reliability programs in the SuperSMITH package by Wes Fulton. The input to SuperSMITH Weibull includes ages to failure and suspension ages, i.e., censored data, as explained in The New Weibull Handbook by Dr. Robert B. Abernethy. From the software you will get the beta (Weibull line slope) and eta (Weibull characteristic life) values, as shown in Figure 1 for the Monte Carlo model. Of course, if you maintain a Weibull library of your failures, you are ahead of the game in knowing what to expect in the design and operation of your equipment.

Figure 1: Input Data For Monte Carlo Model With Ten Different Failure Modes 
Figure 1 shows results from the first 3 system tests of the 50 lives in one simulation. Red figures show the minimum life for each system tested, which is the pacing age to failure for the system, also collected in the right-hand column of Figure 1. The cumulative MTBF is 371,212/1 = 371,212 KM for the first system test, (371,212 + 252,818)/2 = 312,015 KM for the second, (371,212 + 252,818 + 232,547)/3 = 285,526 KM for the third, and so on up to a count of 50 for the first iteration. This continues until the required number of simulations, 250,000, has been reached, which means (250,000*50) = 12,500,000 ages to failure will be stored for the Monte Carlo solution in Figure 1.
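The cumulative MTBF arithmetic from the Figure 1 walk-through is simply a running average of the pacing ages, as this short reproduction shows:

```python
# Cumulative MTBF after k system tests = (sum of the first k lives) / k.
# The three lives are the pacing ages quoted from Figure 1.
lives_km = [371_212, 252_818, 232_547]
cumulative_mtbf = []
total = 0
for k, life in enumerate(lives_km, start=1):
    total += life
    cumulative_mtbf.append(round(total / k))
print(cumulative_mtbf)  # [371212, 312015, 285526]
```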
In Figure 1 the first warranty failure occurs in the third system tested, with failure mode 7 as the culprit, since the failure age is less than the warranty period. Warranty failures tell us that product quality has not been achieved for reaching the minimum guaranteed life. Warranty failures are an added cost for the manufacturer plus an irritant for the end user, who must take the system out of service because it failed to meet the minimum life guarantee. Generally speaking, manufacturers don't want to oversupply life because of high manufacturing expense. Manufacturers also don't want warranty costs to be too high, as this destroys both reputations and profits. Of course, manufacturers can also lose their marketplace reputation by furnishing too many short-life products covered by warranties.
End users must also consider the risk and cost of both early failures and late failures. Failures are not free of costs. The risk issue is $Risk = (probability of failure)*($consequence of failure). Manufacturers must appreciate end user costs to control warranty cost impacts both for their own facility and for the end user's facility. Over 100 years ago John Ruskin summarized the cost issue when he wrote: "It's foolish to spend too much money but it's unwise to spend too little."
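The $Risk relationship above is worth making concrete. The numbers below are illustrative assumptions, not from the article:

```python
def dollar_risk(p_failure, consequence):
    """$Risk = (probability of failure) * ($consequence of failure)."""
    return p_failure * consequence

# A hypothetical 2% warranty-failure probability with a $5,000
# consequence per failure carries an expected risk of $100 per unit:
print(dollar_risk(0.02, 5_000.0))  # 100.0
```

The same $100-per-unit risk arises from a 0.1% probability with a $100,000 consequence, which is why high-consequence failures demand the tighter warranty limits discussed below.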
For throwaway consumer products there may be no warranty period. For some inexpensive, low-grade consumer products the warranty period may anticipate ~10% failures. For expensive products the warranty period may be planned for <1% failures. For high-risk failures, including human injury or pollution problems, warranty failures must be <0.1%. Where failures during the warranty period could be fatal to humans, the conditions may require warranty failures to be less than 0.01% to 0.001%. In short: engineers, don't make foolish decisions about the consequences of failures within the warranty period. You may be the named responsible party in legal proceedings by having signed your approval to drawings, and when that happens ask: how will your family and neighbors be affected?
Figure 2
shows the percentage of failures occurring inside the warranty period of
250,000 KM. Can you and your company
afford this? What are your design
criteria for percent of warranty failures allowed? How will the warranty failures affect you and
your company’s reputation?

Figure 2: Warranty Percentage 
Figure 3 shows the failure modes driving the percentages of warranty failures. The beta values are generally driven by the physics of failure, and it is difficult to significantly change them. The eta values are driven by the robustness of the resistance to failure. Generally, you can add or take away the strength of the resistance to a failure mode to make changes. It is far easier to change the strength of resistance to failure, η, than to change β. If the components in this case are used in industrial or consumer applications, the large number of warranty failures clearly indicates greater strength is necessary.

Figure 3: Warranty Failures By Failure Mode 
Figure 4 shows the mean time between failures as the center trend line with the boundaries for 90% confidence intervals. It's clear: when you base projections of life on only one test item, you can have huge variations, which means very wide confidence intervals, as you can see in Figure 4. However, testing 50 items can give good estimates of MTBF life with small variations based on the data. For many expensive items with long test times, this may be prohibitive in both time and money. So get ready for the necessary compromises for budgets and time-to-market decisions.

Figure 4: Overall Cumulative MTBF With Confidence Intervals 
Figure 5 shows the confidence interval uncertainty is seven times larger for testing one item than for testing 50 items. The gap decreases rapidly to a factor of ~2 at 10 to 12 test items. Then the confidence interval gap slowly decreases from two to the datum gap of one at 50 test items. In life there is never enough time or money to do everything you want to accomplish, thus compromises and trade-offs are required.

Figure 6 shows details about the MTBF and confidence intervals for warranty failures. Note that testing 10 to 12 items until 250,000 KM may give a false sense of security if no failures are discovered, as shown by the lower confidence interval in Figure 6. Adding component robustness will increase the cumulative MTBF from the current low value of roughly 175,000 KM for warranty performance, which is far below the warranty period of 250,000 KM. Note the Y-axis in Figure 6 commences at zero, because many of the warranty ages will actually have zero warranty failures.
Figure 6: MTBF and Confidence Intervals For
Warranties
Figure 7 shows the robustness of parts that have survived the warranty period, with their MTBF of roughly 450,000 KM. How do we find the larger eta values needed to reduce the warranty failures? The pragmatic answer is to do a little experimenting with the simulation by increasing eta values by a factor of, say, 2 to 10 times. Look at the warranty percentages by running a smaller number of simulations to get a quick (but not accurate) answer. Then make many simulations for accuracy. You have to make warranty problems disappear; they seldom disappear on their own! It's less expensive to run the computer Monte Carlo simulation than to fiddle around in the test lab trying to get the numbers right.
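The eta experiment described above can be sketched as a simple sweep. The failure modes below are illustrative stand-ins, not the article's actual inputs; only the 250,000 KM warranty period comes from the text:

```python
import math
import random

# Illustrative (β, η) pairs; multiply every η by a factor and watch the
# warranty-failure fraction fall.
MODES = [(1.2, 900_000.0), (2.5, 600_000.0), (0.8, 1_500_000.0)]
WARRANTY_KM = 250_000.0

def warranty_fraction(eta_factor, trials=20_000):
    """Fraction of simulated series-system lives below the warranty period."""
    fails = 0
    for _ in range(trials):
        life = min(eta * eta_factor *
                   (-math.log(1.0 - random.random())) ** (1.0 / beta)
                   for beta, eta in MODES)
        if life < WARRANTY_KM:
            fails += 1
    return fails / trials

random.seed(11)
for factor in (1, 2, 5):
    print(f"eta x{factor}: {warranty_fraction(factor):.1%} warranty failures")
```

A small sweep like this gives the quick (but not accurate) answer; accuracy comes from rerunning the chosen factor with many more trials.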

Figure 7: MTBF and Confidence Intervals For
NonWarranties 
Figure 8 shows the simulation times for various iteration counts. With only 1,000 iterations you will get clues (but not definitive answers) about warranty failures. Go for the facts by way of simulation. Too often engineers dwell on 'shoulda, coulda, woulda', which is no solution to performance deficiencies; compare that to running simple Monte Carlo simulations to get facts in a short period of time.
Figure 8: Run Times 
Download a copy of this problem as a PDF file. The size of the Excel file, devoid of stored data, is 250 Meg, and the ZIP version is 25 Meg. It may be too large for your Internet service, so the smaller ZIP file may assist you; it will take at least 15 minutes to download the uncompressed version or 3 minutes for the ZIP version. If you need the CD instead of a download, the price of the Excel Monte Carlo simulation for multiple failure modes is US$50 plus shipping cost; export shipments will be marked for payment of duties and taxes on your end. Order by email with your credit card information.
Send your comments or criticisms about the
Excel model or this webpage to Paul
Barringer.
Last
revised 3/2/2016
© Barringer & Associates, Inc. 2016