Mean Time Between Failures,
Confidence Intervals,
Number of Samples to Test,
Warranty Failures, Weibull Analysis, and
Monte Carlo Test Results

Things that can be repaired have a metric called mean time between failures (MTBF).  Things that can’t be repaired have a metric called mean time to failure (MTTF).  In casual conversation we often mix the two terms.  For example, consider a powered chain saw for cutting trees: 1) the cutters on the chain are sharpened periodically when they become dull, so MTBF applies to the cutting edges, whereas 2) the physical links in the chain are run until the chain fails and we don’t replace one link at a time, so MTTF applies to the links.  Of course both issues are failures because of the loss of utility.

A confidence interval is an interval estimate of a population value.  Its width reflects sample-to-sample variation, i.e., the uncertainty of the results.  One sample tested will have a very wide confidence interval.  If 1000 independent samples are tested, the confidence interval will be much narrower because repeating the test provides more evidence about the variability.  The confidence level (some may prefer this term) describes the probability that the true value lies between the two limits based on the testing evidence.  Many test results will give narrower intervals than just a few tests.

A 99% confidence interval for a specific test will be wider than a 90% confidence interval for the same number of tests, given the inherent variability of the test outcomes.  A 99% confidence interval leaves a zone of 0% to 0.5% outside the interval on the low side and a zone of 99.5% to 100% outside the interval on the high side.  80%, 90%, 95%, 99% or 99.9% confidence intervals all leave a small probability of results falling outside the confidence interval.  Small zones of uncertainty may be selected for medical issues because of the high $risk consequences from failure to achieve the expected results, whereas large zones of uncertainty are used when the $consequences of each failure are small.

For casual industrial applications 80% confidence intervals would have 10% uncertainty on the low side and 10% uncertainty on the high side.  A more common industrial application would be 90% confidence with 5% uncertainty on the low side and 5% uncertainty on the high side.  This means the true value of a single test result is expected to lie within the confidence interval on the basis of chance.  The 80%, 90%, 95%, 99%, or 99.9% confidence interval represents the expected reliability of the true value lying within the confidence limits established—it is not a guarantee of never finding results outside the confidence interval based on repeated results.
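The split between the low side and the high side can be sketched in a few lines of Python.  This is a minimal illustration that assumes a two-sided interval with the uncertainty divided equally between the two tails; the function name is hypothetical.

```python
def tail_probabilities(confidence_level):
    """Return the lower and upper cut points of a two-sided interval.

    Assumes the probability outside the interval is split equally
    between the low side and the high side.
    """
    alpha = 1.0 - confidence_level          # total probability outside the interval
    return alpha / 2.0, 1.0 - alpha / 2.0   # (low-side cut, high-side cut)


for level in (0.80, 0.90, 0.95, 0.99, 0.999):
    low, high = tail_probabilities(level)
    print(f"{level:.1%} interval: low-side cut {low:.2%}, high-side cut {high:.2%}")
```

For the 90% case this gives cut points at 5% and 95%, which are the same locations used later for the sorted Monte Carlo columns.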

More testing is required for proving results lie within the range expressed for higher confidence intervals because each test results in some variability.  More testing requires higher expenses for proving the case because of the variability of actual results.  Testing can become a wickedly expensive venture for proving the case about variability in results.

Monte Carlo techniques and Weibull statistics can help provide the details we need for making the life calculations.  Modern versions of Excel have an excellent random number generator.  Using the random number generator in Excel to draw random chances to put into Weibull calculations will give us the life of a component.  From the law of large numbers (LLN), first stated informally by Cardano in the 1500’s, we can get a good approximation of the expected life values for ages to failure.  The LLN has two variations:
     1) the strong law of large numbers, in which the sample average converges almost surely to the expected value (we’ll use this) and
     2) the weak law of large numbers, in which the sample average converges in probability towards the expected value.
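The practical effect of the LLN can be seen with a small Python sketch.  It uses Python’s own random number generator rather than Excel’s, and the function name and seed are hypothetical choices for illustration.  Uniform random numbers between 0 and 1 have an expected value of 0.5, and the average of the draws moves toward 0.5 as the number of draws grows.

```python
import random


def sample_average(n_draws, seed=2016):
    """Average of n_draws uniform random numbers in [0, 1)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return sum(rng.random() for _ in range(n_draws)) / n_draws


# The expected value is 0.5; larger samples should land closer to it.
for n_draws in (10, 1_000, 100_000):
    print(n_draws, round(sample_average(n_draws), 4))
```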

Random numbers (as in the Wichmann and Hill details explained by Carl Tarum’s demonstration) vary from 0 to 1, as does the Excel random number generator.  The Weibull cumulative distribution function (CDF) also varies from 0 to 1, which makes it easy to use Excel’s random numbers to calculate ages to failure.  The CDF for the 2-parameter Weibull distribution as a function of time is:
     F(t) = 1 - e^(-(t/η)^β)
where F(t) = CDF(t) varies from 0 to 1, t is time to failure, β is the shape factor (Weibull plot line slope), η is the characteristic life (the age by which 63.2% of the population has failed), and e is the mathematical constant 2.71828 18284 59045… .
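A minimal Python sketch of the 2-parameter Weibull CDF, written with the conventional negative exponent, F(t) = 1 - e^(-(t/η)^β).  The function name and the example values of β and η are assumptions for illustration only.

```python
import math


def weibull_cdf(t, beta, eta):
    """Fraction of the population failed by age t for a 2-parameter Weibull."""
    return 1.0 - math.exp(-((t / eta) ** beta))


# At t = eta the CDF equals 1 - 1/e (about 63.2%) for any beta.
for beta in (0.5, 1.0, 3.0):
    print(beta, round(weibull_cdf(1000.0, beta, eta=1000.0), 4))
```

Because (η/η)^β = 1 for every β, the characteristic life always falls at the same 63.2% location on the Weibull plot.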

Solving the Weibull CDF equation for time to failure, t, we get:
     t = η*(ln(1/(1-CDF)))^(1/β)
where the CDF is selected as RAND() for Excel Monte Carlo simulations to get t, the age to failure.  The MTBF is the (summation of failure times)/(number of failures).  The MTTF for most applications is reasonably close to MTBF.  As you will see later, the MTBF from the simulation is actually the median value for ease of calculation.
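The same inversion can be sketched in Python, with random.random() standing in for Excel’s RAND().  The function name, seed, and the values of β and η are hypothetical.  As a sanity check, the median of many simulated ages should sit near the theoretical Weibull median, η*(ln 2)^(1/β).

```python
import math
import random
import statistics


def weibull_age_to_failure(u, beta, eta):
    """Invert the Weibull CDF: u is a random number in [0, 1)."""
    return eta * math.log(1.0 / (1.0 - u)) ** (1.0 / beta)


rng = random.Random(42)  # fixed seed so the run is repeatable
ages = [weibull_age_to_failure(rng.random(), beta=2.0, eta=1000.0)
        for _ in range(20_000)]

print("simulated median  :", round(statistics.median(ages), 1))
print("theoretical median:", round(1000.0 * math.log(2.0) ** (1.0 / 2.0), 1))
```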

By considering the different Weibull failure modes and using a series system model, we can find the life values for each test of the system.  The cumulative mean life is stored for a sample size of 1, a sample size of 2, ... and a sample size of 50.  Then we sort the mean life data from low to high for each sample size from 1 to 50.
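A series system fails when its first failure mode fails, so the system life for one test is the minimum of the ages drawn for each mode.  The sketch below assumes independent failure modes and uses three made-up (β, η) pairs, not the ten modes of the Excel model.

```python
import math
import random

# Hypothetical (beta, eta in KM) pairs for illustration only.
FAILURE_MODES = [(1.5, 900_000.0), (3.0, 600_000.0), (0.8, 2_000_000.0)]


def draw_age(rng, beta, eta):
    """One random Weibull age to failure by inverting the CDF."""
    return eta * math.log(1.0 / (1.0 - rng.random())) ** (1.0 / beta)


def simulate_system(rng, modes=FAILURE_MODES):
    """Return (system age, number of the pacing failure mode, all mode ages)."""
    ages = [draw_age(rng, beta, eta) for beta, eta in modes]
    system_age = min(ages)  # series system: the earliest failure ends the test
    return system_age, ages.index(system_age) + 1, ages


rng = random.Random(1)
for test_number in range(1, 4):
    age, mode, _ = simulate_system(rng)
    print(f"system {test_number}: fails at {age:,.0f} KM from failure mode {mode}")
```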

From the sorted columns of data for 1 to 50 samples we will select the midpoint of each column as the mean, the 5% location in the column will represent the lower limit, and the 95% location will represent the upper limit.  Thus we will find the 90% confidence interval and the mean for our confidence data.  Each of the 50 columns will give the typical mean and confidence limits when the simulation is repeated up to, say, 500,000 iterations.  If you run fewer iterations, say 50 iterations, you’ll get silly information, so err on the side of running a larger number of iterations to get better results, say 50,000+ iterations.
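The sorting procedure can be sketched in Python as follows.  This is a simplified illustration, not the Excel model: it assumes a single failure mode with made-up β and η values and only 2,000 iterations so it runs quickly.

```python
import math
import random


def cumulative_means(rng, n_samples, beta, eta):
    """Cumulative mean life after 1, 2, ..., n_samples simulated failures."""
    total, means = 0.0, []
    for count in range(1, n_samples + 1):
        total += eta * math.log(1.0 / (1.0 - rng.random())) ** (1.0 / beta)
        means.append(total / count)
    return means


def confidence_bands(n_iterations=2000, n_samples=50, beta=2.0, eta=1000.0, seed=7):
    """Return (5% value, midpoint, 95% value) for each sample size 1..n_samples."""
    rng = random.Random(seed)
    runs = [cumulative_means(rng, n_samples, beta, eta) for _ in range(n_iterations)]
    bands = []
    for k in range(n_samples):
        column = sorted(run[k] for run in runs)  # one column per sample size
        last = n_iterations - 1
        bands.append((column[int(0.05 * last)],
                      column[int(0.50 * last)],
                      column[int(0.95 * last)]))
    return bands


bands = confidence_bands()
for size in (1, 10, 50):
    low, mid, high = bands[size - 1]
    print(f"samples={size:2d}  5%={low:7.1f}  midpoint={mid:7.1f}  95%={high:7.1f}")
```

The gap between the 5% and 95% values narrows as the sample size grows, which gives the bands their funnel shape.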

You should expect the confidence intervals to have a curved funnel shaped contour around the mean as results converge toward the true values.  Of course a few data points (meaning a few Monte Carlo simulations) will not be very accurate with large variation for each run while a large number of data points (meaning many Monte Carlo simulations) will converge toward a more accurate value with smaller variances.

You can download a Monte Carlo simulation in Excel that will allow 10 different failure modes and you can run up to 500,000 iterations.  Of course you must add your own Weibull statistics to get your data driven results.  Generally speaking, 100,000 iterations will get results consistent to two decimals from the Monte Carlo Excel model.  200,000 iterations will get your output data accurate to three decimals, and 500,000 iterations will get your results accurate to maybe 3-4 decimals.  Use your good engineering judgment in these comments along with reasonably accurate Weibull analysis results.  The Monte Carlo Excel model represents a single device with up to 10 separate failure modes or up to 10 devices in series with individual primary failure modes.  You do not have to use all 10 failure modes for the Monte Carlo Excel model to function; just leave the input data for β and η without numbers.

How do you do Weibull analysis?  Use SuperSMITH Weibull software which is one of three reliability programs in the SuperSMITH package by Wes Fulton.  The input to SuperSMITH Weibull will include ages to failure, and suspension ages, i.e., censored data, as explained in The New Weibull Handbook by Dr. Robert B. Abernethy.  From the software you will get the beta (Weibull line slope) and eta values (Weibull characteristic life value) as shown in Figure 1 for the Monte Carlo model.  Of course if you have and maintain a Weibull library of your failures, you are ahead of the game in knowing what to expect in the design and operation of your equipment.

Figure 1: Input Data For Monte Carlo Model With Ten Different Failure Modes


Figure 1 shows results for the first 3 of the 50 system lives simulated in one iteration.  Red figures show the minimum life for each system tested, which is the pacing age to failure for the system and is also collected in the right hand column of Figure 1.  The cumulative MTBF will be 371,212/1 = 371,212 KM for the first system tested, (371,212 + 252,818)/2 = 312,015 KM for the second system tested, (371,212 + 252,818 + 232,547)/3 = 285,526 KM for the third system tested and so on up to a count of 50 for the first iteration.  This continues until the required number of iterations, 250,000, has been reached, which means (250,000*50) = 12,500,000 ages to failure will be stored for this Monte Carlo solution in Figure 1.
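The cumulative MTBF arithmetic for the three system ages quoted from Figure 1 can be reproduced with a short Python sketch.  The function name is hypothetical; the three ages are the ones given in the text.

```python
def cumulative_mtbf(ages):
    """Running mean of the ages to failure: (sum of ages so far) / (count so far)."""
    total = 0.0
    result = []
    for count, age in enumerate(ages, start=1):
        total += age
        result.append(total / count)
    return result


ages_km = [371_212, 252_818, 232_547]  # first three system ages from Figure 1
for count, mtbf in enumerate(cumulative_mtbf(ages_km), start=1):
    print(f"after system {count}: cumulative MTBF = {mtbf:,.0f} KM")
```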

In Figure 1 the first warranty failure has occurred in the third system tested, with failure mode 7 as the culprit, since the failure age is less than the warranty period.  Warranty failures tell us that product quality has not been achieved for reaching the minimum guaranteed life.  Warranty failures become an added cost for the manufacturer plus an irritant for the end user, who must take the system out of service because of failure to meet the minimum life guarantee.  Generally speaking, manufacturers don’t want to oversupply life values because of high manufacturing expense.  Manufacturers also don’t want warranty costs to be too high as this destroys both reputations and profits.  Of course manufacturers can also lose their market place reputation for furnishing too many short life problems covered by warranties.

End users must also consider risk and cost of both early failures and late failures.  Failures are not free of costs.  The risk issue is $Risk = (probability of failure)*($consequence of failure).  Manufacturers must appreciate end user costs to control the impact of warranty costs on both their own facility and the end user’s facility.  Over 100 years ago John Ruston summarized the cost issue when he wrote: “It’s foolish to spend too much money but it’s unwise to spend too little.”
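The $Risk calculation is a single multiplication, sketched here in Python.  The probability and cost figures are made-up numbers for illustration, not data from the simulation.

```python
def dollar_risk(probability_of_failure, consequence_dollars):
    """$Risk = (probability of failure) * ($consequence of failure)."""
    return probability_of_failure * consequence_dollars


# Hypothetical case: 10% chance of a warranty failure costing $5,000 each.
print(f"$Risk per unit: ${dollar_risk(0.10, 5_000.0):,.2f}")
```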

For throwaway consumer products there may not be a warranty period.  For some inexpensive, low grade consumer products the warranty period may expect ~10% failures.  For expensive products the warranty period may be planned for <1% failures.  For high risk failures, including human injury or pollution problems, the warranty failures must be <0.1%.  For failures that can be fatal to humans during the warranty period, the condition may require warranty failures to be <0.01% to <0.001%.  In short: engineers, don’t make foolish decisions about the consequences of failures within the warranty period.  You may be named as the responsible party in legal proceedings because you signed your approval on the drawings.  When that happens ask, how will your family and neighbors be affected?

Figure 2 shows the percentage of failures occurring inside the warranty period of 250,000 KM.  Can you and your company afford this?  What are your design criteria for percent of warranty failures allowed?  How will the warranty failures affect you and your company’s reputation?
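For a quick cross-check of a simulated warranty percentage, the fraction of series systems failing inside the warranty period can also be computed directly.  The sketch below assumes independent Weibull failure modes, so the system survival is the product of the mode survivals.  The (β, η) pairs are hypothetical and are not the Figure 1 inputs.

```python
import math

# Hypothetical (beta, eta in KM) pairs for illustration only.
MODES = [(1.5, 900_000.0), (3.0, 600_000.0), (0.8, 2_000_000.0)]


def warranty_failure_fraction(warranty_km, modes):
    """Fraction of series systems failing before warranty_km.

    System survival = product of exp(-(t/eta)^beta) over all modes,
    which assumes the failure modes are independent.
    """
    exposure = sum((warranty_km / eta) ** beta for beta, eta in modes)
    return 1.0 - math.exp(-exposure)


fraction = warranty_failure_fraction(250_000.0, MODES)
print(f"Expected warranty failures at 250,000 KM: {fraction:.1%}")
```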

Figure 2: Warranty Percentage

Figure 3 shows the failure modes driving the percentages of warranty failures.  The beta values are generally driven by the physics of failure, and they are difficult to change significantly.  The eta values are driven by the robustness of the resistance to failure.  Generally, you can add or take away strength of resistance to the failure mode to make changes.  It is far easier to change the strength of resistance to failure, η, than to change β.  If the components in this case are used in industrial or consumer applications, the large number of warranty failures clearly indicates greater strength is necessary.

 

Figure 3: Warranty Failures By Failure Mode

 

Figure 4 shows the mean time between failures as the center trend line with the confidence boundaries for 90% confidence intervals.  It’s clear: when you base projections of life on only one test item, you can have huge variations, which means very wide confidence intervals, as you can see from Figure 4.  However, testing 50 items can give good estimates of MTBF life with small variations based on the data.  For many expensive items with long test times, testing 50 items may be incompatible with the time and money available.  So get ready for the necessary compromises for budgets and time to market decisions.

 

Figure 4: Overall Cumulative MTBF With Confidence Intervals

 

Figure 5 shows confidence interval uncertainty is seven times larger for testing one item than testing 50 items.  The gap decreases rapidly to a factor of ~2 at 10 to 12 test items.  Then the confidence interval gap slowly decreases from two to the datum gap of one with 50 test items.  In life, there is never enough time or money to do everything you want to accomplish, thus compromise and trade off are required.


Figure 5: Confidence Interval Gap And Samples Tested

 

Figure 6 shows details about the MTBF and confidence intervals for warranty failures.  Note testing 10 to 12 items until 250,000 KM may give a false sense of security if no failures are discovered, as shown by the lower confidence interval of Figure 6.  Adding component robustness will increase the cumulative MTBF from the current low value of roughly 175,000 KM for warranty performance, which is far below the warranty period of 250,000 KM.  Note the Y-axis in Figure 6 commences at zero because, for the warranty ages, many runs will actually have zero warranty failures.

 
Figure 6: MTBF and Confidence Intervals For Warranties

Figure 7 shows the robustness of parts that have survived the warranty period with their MTBF of roughly 450,000 KM.  How do we find the larger eta values needed to reduce the warranty failures?  The pragmatic answer is to do a little experimenting with the simulation by increasing eta values by a factor of say 2 to 10 times.  Look at the warranty percentages by running a smaller number of simulations to get a quick (but not accurate) answer.  Then make many simulations for accuracy.  You have to make warranty problems disappear; they seldom disappear on their own!  It’s less expensive to run the computer Monte Carlo simulation than to fiddle around in the test lab trying to get the numbers right.
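The eta experiment can be sketched in Python before committing to long simulation runs.  This is a rough screening calculation that assumes independent Weibull failure modes in series and scales every η by the same factor.  The (β, η) pairs, the 10% target, and the function names are all hypothetical.

```python
import math

# Hypothetical (beta, eta in KM) pairs for illustration only.
MODES = [(1.5, 900_000.0), (3.0, 600_000.0), (0.8, 2_000_000.0)]
WARRANTY_KM = 250_000.0


def warranty_failure_fraction(eta_scale, modes=MODES, warranty_km=WARRANTY_KM):
    """Fraction failing inside the warranty when every eta is multiplied by eta_scale."""
    exposure = sum((warranty_km / (eta_scale * eta)) ** beta for beta, eta in modes)
    return 1.0 - math.exp(-exposure)


def smallest_scale_meeting_target(target_fraction, scales):
    """First scale factor (in the order given) whose warranty fraction meets the target."""
    for scale in scales:
        if warranty_failure_fraction(scale) <= target_fraction:
            return scale
    return None  # none of the trial factors was large enough


for scale in range(1, 11):
    print(f"eta x {scale:2d}: warranty failures = {warranty_failure_fraction(scale):.1%}")
print("smallest factor for <=10%:", smallest_scale_meeting_target(0.10, range(1, 11)))
```

A screening pass like this only narrows the range of η values worth trying; the full Monte Carlo run is still needed for the confidence intervals.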

Figure 7: MTBF and Confidence Intervals For Non-Warranties

Figure 8 shows the simulation times for iterations.  With only 1,000 iterations you will get clues (but not definitive answers) about warranty failures.  Go for the facts by way of simulation.  Too often engineers dwell on ‘shoulda, coulda, woulda’, which is no solution to performance deficiencies.  Compare that to running simple Monte Carlo simulations to get facts in a short period of time.

Figure 8: Run Times

Download a copy of this problem as a PDF file.

The size of the Excel file, devoid of stored data, is 250 Meg and the ZIP version is 25 Meg.  It may be too large for your Internet service so the small ZIP file may assist you—it will take at least 15 minutes for the uncompressed version or 3 minutes for the ZIP version.  If you need the CD instead of a download, the price of the Excel Monte Carlo simulation for multiple failure modes is US$50 plus shipping cost and for export the shipment will be marked for payment of duties and taxes on your end.  Order by Email with your credit card information.

Send your comments or criticisms about the Excel model or this webpage to Paul Barringer.


Last revised 3/2/2016
© Barringer & Associates, Inc. 2016