Problem Of The Month

April 2003---Will it last until the next turnaround?



You must know when to hold the course and when to abandon the course.  This problem is about conditional reliability calculations. 


The Problem:

We have a new system in operation.  We’ve experienced two failures, at 4717 hours and 17221 hours, and the current system has completed 25000 hours and is still running (this data point is a suspension, often referred to as censored information).  Will it survive one more year (8760 hours) until the next plant turnaround?  What action should we take?

The system is redundant.  If it fails, the backup units will pick up the load.  Thus the failure cost is simply the repair cost: the planned repair cost is $5000 and the unplanned repair cost is also $5000.  The replacement strategy is simply run to failure.  (This does not mean running the system to failure and then continuing to run it into the ground, so that the final repair costs more because of the abuse from leaving a failed system in service when the replacement cost is small.)

Background Information:

Kececioglu, in his Reliability Engineering Handbook, Volume 1, page 108 (see the reading list), says the five most important functions in reliability engineering are those shown in Figure 1.  The equations are written in Weibull format where h (eta) is the characteristic life and b (beta) is the shape factor or line slope.  Kececioglu argues that most reliability engineering problems can be solved with these five functions.
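
Figure 1 itself is not reproduced in this text.  As a reference, the standard two-parameter Weibull forms of the five functions are sketched below, writing η for the characteristic life (h in the text), β for the shape factor (b in the text), and Γ for the gamma function (G in the text).  The symbol λ(t) for the hazard rate and T for the age already survived are notation assumed here, and the layout of Kececioglu's figure may differ.

```latex
\begin{align*}
f(t) &= \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta-1}
        e^{-(t/\eta)^{\beta}}
     && \text{probability density function (PDF)} \\
\lambda(t) &= \frac{f(t)}{R(t)}
        = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta-1}
     && \text{hazard (instantaneous failure rate)} \\
R(t) &= e^{-(t/\eta)^{\beta}}
     && \text{reliability} \\
R(t \mid T) &= \frac{R(T+t)}{R(T)}
        = \exp\!\left[-\left(\frac{T+t}{\eta}\right)^{\beta}
                      +\left(\frac{T}{\eta}\right)^{\beta}\right]
     && \text{conditional reliability} \\
\bar{t} &= \eta\,\Gamma\!\left(1+\frac{1}{\beta}\right)
     && \text{mean life}
\end{align*}
```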

The area under the PDF curve, by definition, equals 1.  It shows the shape of the failure distribution curve as would occur if you made a tally sheet of failures occurring at a specific time.  It gives the density of the probability for failure.  The area under the curve between t1 and t2 gives the probability of failure in that interval.

The hazard curve, also known as the instantaneous failure rate, gives the relationship between age and failure frequency, as in the composite curve known as the bathtub curve for failure rates.  The Weibull failure rate is constant only when b=1.  It declines with time when b<1 and increases with time when b>1.

The reliability function gives the relationship between the age of a unit and the probability for survival given the unit starts the mission with zero time.  It is a measure for the chances for success over the mission period.

The conditional reliability function gives the probability of successfully completing an incremental mission time, given that the unit has already survived a given time interval before starting the incremental mission.  The numerator in the conditional reliability calculation is always smaller than the denominator, because the numerator is the reliability at the larger total time and reliability falls as age increases.

The mean life for the Weibull distribution is h*G(1+1/b), where G is the gamma function.  When b=1, G(2) = 1 and thus the mean life equals h.  When b<1, G(1+1/b) is greater than 1; when b>1, it is less than 1.  In Excel, the gamma function is “=EXP(GAMMALN(1+1/b))”.  The MTBF for repairable units or the MTTF for non-repairable units gives a measure of the average time of operation to a failure.
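
The five functions described above can be sketched in a few lines of Python.  This is a minimal illustration rather than anything taken from WinSMITH Weibull: the function names are hypothetical, eta and beta stand for h and b, and the times passed in are assumed to be greater than zero.

```python
import math


def weibull_pdf(t, eta, beta):
    """Probability density of failure at age t."""
    return (beta / eta) * (t / eta) ** (beta - 1) * math.exp(-((t / eta) ** beta))


def weibull_hazard(t, eta, beta):
    """Instantaneous failure rate at age t (constant only when beta == 1)."""
    return (beta / eta) * (t / eta) ** (beta - 1)


def weibull_reliability(t, eta, beta):
    """Probability of surviving to age t, starting from zero time."""
    return math.exp(-((t / eta) ** beta))


def conditional_reliability(age, mission, eta, beta):
    """Probability of surviving `mission` more hours, given survival to `age`."""
    return (weibull_reliability(age + mission, eta, beta)
            / weibull_reliability(age, eta, beta))


def weibull_mean_life(eta, beta):
    """Mean life eta * Gamma(1 + 1/beta), same form as the Excel expression."""
    return eta * math.exp(math.lgamma(1 + 1 / beta))


if __name__ == "__main__":
    # Hypothetical parameters, chosen only to exercise the functions.
    eta, beta = 1000.0, 2.0
    print(weibull_reliability(500.0, eta, beta))
    print(conditional_reliability(500.0, 100.0, eta, beta))
    print(weibull_mean_life(eta, beta))
```

Using lgamma followed by exp mirrors the Excel formula in the text and avoids overflow when 1/beta is large.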

The Calculations:

Make the data talk: We don’t have much data to use in the decision-making process, which is typical of engineering problems.  So prepare a Weibull plot using the three pieces of data (two failures and one suspension) to get the WinSMITH Weibull plot in Figure 2.  Figure 2 will help you say something rather than shrugging your shoulders and looking sheepish.  {Please note that if you use the data in the WinSMITH Weibull demo program you will not get the same numbers, because the demo program slightly randomizes your input data; however, you will be able to follow the calculations.  To get the exact answers, download the authentic file, which will allow you to replicate the results.}

Notice that in Figure 2 the X-axis is age-to-failure and the Y-axis is probability of failure, which is a statement of unreliability.  We need reliability values to determine the conditional reliability.  We could convert by taking the complement of the unreliability, but in this case we will use the predict feature of the software to give us the reliabilities directly.

What is the probability for survival?
We need the reliability at t=25,000 hours and at t=25,000+8,760 = 33,760* hours.

R(25,000) = 21.75% and R(33,760) = 11.72%

The conditional reliability = R(33,760)/R(25,000) = 11.72%/21.75% = 53.87%, which is the probability of surviving one more year given that 25,000 hours have already been survived.

The conditional reliability shows better than even odds that we’ll survive one more year, given completion of 25,000 hours without failure.  The mean time to failure from the Weibull plot (use the calculator icon in WinSMITH Weibull) is
h*G(1+1/b) = 16,462 hours.
So, we’re already well beyond the mean time to failure before we begin the next one-year mission, and because b>1 the failure rate keeps increasing with age.
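
The Weibull parameters behind Figure 2 are not quoted in the text.  As a rough cross-check, the sketch below back-solves h and b from the two reliabilities quoted above and then recomputes the conditional reliability and the mean life.  It rests on an assumption: the rounded percentages are treated as exact points on the fitted line.  The function name is hypothetical, and this is not how WinSMITH Weibull fits the data.

```python
import math


def weibull_from_two_reliabilities(t1, r1, t2, r2):
    """Back-solve (eta, beta) from two points on a Weibull reliability curve.

    Uses ln(-ln R) = beta*ln(t) - beta*ln(eta), which is a straight line
    in ln(t), so two points fix the slope beta and the scale eta.
    """
    h1, h2 = -math.log(r1), -math.log(r2)
    beta = math.log(h2 / h1) / math.log(t2 / t1)
    eta = t1 / h1 ** (1.0 / beta)
    return eta, beta


AGE = 25_000.0      # hours already survived (from the text)
MISSION = 8_760.0   # one more year (from the text)
R_AGE = 0.2175      # R(25,000) as quoted, rounded
R_END = 0.1172      # R(33,760) as quoted, rounded

eta, beta = weibull_from_two_reliabilities(AGE, R_AGE, AGE + MISSION, R_END)
cond_rel = R_END / R_AGE                              # conditional reliability
mttf = eta * math.exp(math.lgamma(1.0 + 1.0 / beta))  # eta * Gamma(1 + 1/beta)

print(f"beta = {beta:.3f}, eta = {eta:.0f} hours")
print(f"conditional reliability = {cond_rel:.2%}")
print(f"mean life = {mttf:.0f} hours")
```

With these rounded inputs the shape factor comes out a little above 1.1, consistent with a mild wear-out mode, and the mean life lands close to the 16,462 hours quoted above.  Small differences in the last digits are expected, since the software works from unrounded values.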

Run or replace? Should we take the unit out of service and replace the existing seal, given that we’re using a really old component from a wear-out distribution (the Weibull plot tells us we have a wear-out failure mode because b>1), and that the planned replacement cost equals the unplanned cost of $5000 because we have a spare device to keep the overall system in operation?

The answer is no.  We have no financial motivation to remove the seal and throw away the unused life.  A timed replacement is justified only when two conditions are met: 1) you must have a wear-out failure mode, and 2) the planned replacement cost must be much less than the unplanned replacement cost.  We meet condition 1 but not condition 2.  Therefore run the system to failure.

Repair on overtime? The next question that usually arises is whether we should repair the failed unit on overtime for an additional repair cost of $2000, which makes the total cost $7000.  The answer is usually no, but it depends upon: 1) how long it takes to get the repaired unit operational, and 2) the consequences of having both units in the failed state.

Suppose the spare unit has completed 10,000 hours of service and we can get the failed unit repaired in one week (168 hours of elapsed time).  The probability of the spare unit failing during that interval can also be obtained from the conditional reliability calculation.  R(10,000) = 58.27% and R(10,168) = 57.67%, so the conditional reliability = 57.67%/58.27% = 98.97%.  The probability of failure of the spare unit during the repair interval is therefore 100% - 98.97% = 1.03%.

Assume the consequence of both units failing simultaneously is $25000.  The money at risk is (probability of failure)*($ consequence) = 1.03%*$25000 = $257.42 (computed with the unrounded probability).  Would you spend $2000 of overtime costs to save $257.42?  Not with my money, but if you’re willing to donate $2000 for the unneeded overtime repair, then I’ll accept your donation.
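
The overtime decision reduces to a few lines of arithmetic, sketched here in Python.  The inputs are the figures quoted in the text (the two spare-unit reliabilities, the $25000 consequence, and the $2000 overtime premium from the overtime question); the function and variable names are hypothetical.

```python
def money_at_risk(r_start, r_end, consequence):
    """Return (probability of failure during the interval, expected loss).

    r_start and r_end are the reliabilities at the start and end of the
    repair interval, so 1 - r_end/r_start is the conditional probability
    that the spare fails while the other unit is being repaired.
    """
    p_fail = 1.0 - r_end / r_start
    return p_fail, p_fail * consequence


R_SPARE_START = 0.5827      # R(10,000) for the spare unit
R_SPARE_END = 0.5767        # R(10,168) after one week of repair time
CONSEQUENCE = 25_000.0      # cost if both units are down at once
OVERTIME_PREMIUM = 2_000.0  # extra cost of repairing on overtime

p_fail, risk = money_at_risk(R_SPARE_START, R_SPARE_END, CONSEQUENCE)
overtime_pays = risk > OVERTIME_PREMIUM

print(f"probability of spare failing during repair = {p_fail:.2%}")
print(f"money at risk = ${risk:.2f}")
print(f"overtime justified: {overtime_pays}")
```

Keeping the probability unrounded until the final multiplication is what reproduces the $257.42 figure; multiplying the rounded 1.03% by $25000 gives $257.50 instead.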

The bottom line:

Continue to run the system, which now has 25000 hours on it, for the next year.  Make the replacement when the unit fails or when predictive information shows failure is imminent.  Make repairs on normal time rather than on overtime.

Will we have better data in the future after we get more failure data?  Of course, if you can live long enough.


Refer to the caveats on the Problem Of The Month Page about the limitations of this solution. Maybe you have a better idea on how to solve the problem. Maybe you find where I've screwed-up the solution and you can point out my errors as you check my calculations. E-mail your comments, criticism, and corrections to Paul Barringer.

* Thanks to Tom Tuttle of DuPont who identified that I cannot add correctly!  25000 hrs + 8760 hrs = 33760 hrs rather than the earlier incorrect addition that resulted in 32860 hrs which produced the wrong conditional reliability of 57.47%.

Last revised 08/20/2003
© Barringer & Associates, Inc. 2003
