Problem Of The Month
April 2003: Will it last until the next turnaround?
You must know when to hold the
course and when to abandon the course.
This problem is about conditional reliability calculations.
The Problem:
We have a new system in operation. We’ve experienced two failures, at 4717 hours
and 17221 hours, and the current system has completed 25000 hours and is still
running (this data point is a suspension, often referred to as censored
information).
Will it survive for one more year (8760 hours) until the next plant
turnaround? What action should we take?
The system is redundant. If it fails, the backup units will pick up
the load. Thus the failure cost is simply the repair cost: the planned repair
cost is $5000 and the unplanned repair cost is also $5000. The replacement
strategy is simply run to failure. (This does not mean leaving the failed system
in service and running it into the ground, so that abuse drives up the final
repair cost, when the cost of replacement is small.)
Background Information:
Kececioglu, in his Reliability Engineering
Handbook, Volume 1 (see the reading list at
http://www.barringer1.com/read.htm), page 108, says the five most important
functions in reliability engineering are those shown in Figure 1. The equations are
written in Weibull format where h (the Greek letter eta) is the characteristic life
and b (the Greek letter beta) is the shape factor or line slope. Kececioglu argues
that most reliability engineering problems can be solved with these five functions.
The area under the PDF curve, by definition, equals 1. The PDF shows the shape
of the failure distribution, as you would get from a tally sheet of failures
against the age at which each occurred. It gives the density of the probability
for failure. The area under the curve between t1 and t2 directly gives
the probability for failure in that interval.
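As a minimal sketch of the area statement above, assuming a two-parameter Weibull distribution: the area under the PDF between t1 and t2 equals the difference of the cumulative failure probabilities at those ages. The names `eta` and `beta` stand for h and b in the text, and the numeric values are hypothetical, chosen only for illustration.

```python
import math

def weibull_cdf(t, eta, beta):
    """Probability of failure by age t: F(t) = 1 - exp(-(t/eta)**beta)."""
    return 1.0 - math.exp(-((t / eta) ** beta))

def prob_failure_between(t1, t2, eta, beta):
    """Area under the Weibull PDF from t1 to t2, computed as F(t2) - F(t1)."""
    return weibull_cdf(t2, eta, beta) - weibull_cdf(t1, eta, beta)

# Hypothetical parameters, not taken from the problem data.
eta, beta = 1000.0, 2.0
p_interval = prob_failure_between(500.0, 1000.0, eta, beta)
p_total = prob_failure_between(0.0, 1.0e6, eta, beta)  # approaches 1

print(f"P(failure between 500 and 1000 h) = {p_interval:.4f}")
print(f"Area over essentially all ages    = {p_total:.4f}")
```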
The hazard curve, known as the instantaneous failure rate, gives the relationship
between age and failure frequency. The bathtub curve for failure rates is a
composite of such curves. The Weibull failure rate curve is constant
only for the condition where b=1. The instantaneous failure rate curve declines
with time when b<1 and increases with
time when b>1.
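The three regimes can be checked numerically with a short sketch. It assumes the standard two-parameter Weibull hazard, (beta/eta)*(t/eta)^(beta-1), with `eta` and `beta` standing for h and b; the characteristic life and the two ages compared are hypothetical.

```python
def weibull_hazard(t, eta, beta):
    """Instantaneous failure rate: (beta/eta) * (t/eta)**(beta - 1)."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)

eta = 1000.0  # hypothetical characteristic life, hours
for beta in (0.5, 1.0, 2.0):
    early = weibull_hazard(100.0, eta, beta)
    late = weibull_hazard(900.0, eta, beta)
    if late < early:
        trend = "declines"
    elif late > early:
        trend = "rises"
    else:
        trend = "is constant"
    print(f"beta={beta}: failure rate {trend} with age "
          f"({early:.6f} -> {late:.6f} per hour)")
```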
The reliability function
gives the relationship between the age of a unit and the probability for
survival given the unit starts the mission with zero time. It is a measure for the chances for success
over the mission period.
The conditional reliability function gives the probability for successfully
completing an incremental mission given the unit has already survived its
previous cumulative mission time. It is the ratio of the reliability at the end
of the new mission to the reliability at the current age. The numerator in the
conditional reliability calculation can never exceed the denominator (because the
total time in the numerator is always larger than in the denominator, and
reliability does not increase with age).
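A minimal sketch of the conditional reliability ratio, assuming a two-parameter Weibull reliability function R(t) = exp(-(t/eta)^beta). The function names and the numeric values are hypothetical; `eta` and `beta` stand for h and b.

```python
import math

def reliability(t, eta, beta):
    """Probability of surviving to age t, starting from zero time."""
    return math.exp(-((t / eta) ** beta))

def conditional_reliability(age, mission, eta, beta):
    """Probability of completing `mission` more hours given survival to `age`."""
    return reliability(age + mission, eta, beta) / reliability(age, eta, beta)

# Hypothetical wear out example: the same 200 hour mission is riskier
# for a unit that already has 800 hours on it than for a new unit.
eta, beta = 1000.0, 2.0
new_unit = conditional_reliability(0.0, 200.0, eta, beta)
old_unit = conditional_reliability(800.0, 200.0, eta, beta)
print(f"new unit: {new_unit:.4f}, 800 hour unit: {old_unit:.4f}")
```

With b=1 the age drops out of the ratio, which is why a constant failure rate is sometimes described as memoryless.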
The mean life function for the Weibull distribution is h*G(1+1/b), where G is the
gamma function. When b=1, G(2) = 1 and thus h = mean life. When b<1,
the gamma function G(1+1/b) is greater than 1; when b>1, it is less than
1. In Excel, the gamma function is
“=exp(gammaln(1+1/b))”. The MTBF for repairable units or the MTTF for
non-repairable units gives a measure of the average time of operation to a
failure.
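The same calculation can be sketched outside Excel. The example below uses Python's standard `math.gamma`, plus an `exp(lgamma(...))` variant that mirrors the Excel formula quoted above; the function names and the value of `eta` are hypothetical.

```python
import math

def weibull_mean_life(eta, beta):
    """Mean life = eta * Gamma(1 + 1/beta)."""
    return eta * math.gamma(1.0 + 1.0 / beta)

def weibull_mean_life_via_lgamma(eta, beta):
    """Same value via exp(lgamma(.)), mirroring =exp(gammaln(1+1/b))."""
    return eta * math.exp(math.lgamma(1.0 + 1.0 / beta))

eta = 1000.0  # hypothetical characteristic life, hours
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: mean life = {weibull_mean_life(eta, beta):.1f} hours")
```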
The
Calculations:
Make the data talk: We don’t have much data to use in the decision making
process, and this is a very typical engineering problem. So prepare a
Weibull plot using the three pieces of data (two failures and one suspension)
to get the WinSMITH Weibull plot in Figure 2. Figure 2 will help you
say something rather than giving a shrug of the shoulders and looking
sheepish. (Please note that if you use the data in the WinSMITH Weibull demo
program you will not get the same numbers, because the demo program slightly
randomizes your input data. You will still be able to follow the calculations.
To get the exact answers, download the authentic file, which will allow you to
replicate the results.)
Notice the X-axis is in
ages-to-failure and the Y-axis is given in probability for failure in Figure 2
which is a statement of unreliability.
Convert the Y-axis to reliability by taking the complement of the
unreliability. We need reliability
values to determine the conditional reliability. We can take the complement or in this case we
will use the predict feature of the software to give us the results
directly.
What is the probability for survival?
We need the reliability at t = 25,000
hours and at t = 25,000 + 8,760 = 33,760* hours:
R(25,000) = 21.75% and
R(33,760) = 11.72%
The conditional reliability =
R(33760)/R(25000) = 11.72%/21.75% = 53.87%,
the probability for surviving one more year given 25,000 hours of
survival so far.
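The arithmetic can be reproduced in a few lines. This sketch uses only the two rounded reliabilities quoted above, so it returns about 53.9%; the 53.87% in the text presumably comes from the software's unrounded values, which is an assumption on my part. The variable names are hypothetical.

```python
# Reliabilities as quoted in the text (rounded to two decimals of a percent).
r_at_current_age = 0.2175     # R(25,000 hours)
r_at_end_of_mission = 0.1172  # R(33,760 hours)

# Probability of surviving 8,760 more hours given survival to 25,000 hours.
conditional = r_at_end_of_mission / r_at_current_age
print(f"Conditional reliability for one more year: {conditional:.2%}")
```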
The conditional reliability shows
better than even odds that we’ll survive one more year, given completion of
25,000 hours without failure. The mean
time to failure from the Weibull plot (use the calculator icon in WinSMITH
Weibull) is
MTTF = h*G(1+1/b) = 16,462 hours.
So we’re clearly on the far side of the mean time to failure before we begin
the next mission of one year, and with a wear out failure mode our odds for
failure are increasing.
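The Weibull parameters themselves appear only on the plot, but as a rough consistency check they can be back-solved from the two reliabilities quoted above. This is not the fitting procedure WinSMITH Weibull uses; it is a sketch that assumes a two-parameter Weibull, R(t) = exp(-(t/eta)^beta), passing exactly through the two rounded points. The function name is hypothetical, and `eta` and `beta` stand for h and b.

```python
import math

def fit_weibull_from_two_points(t1, r1, t2, r2):
    """Solve R(t) = exp(-(t/eta)**beta) for (eta, beta) from two points."""
    y1 = -math.log(r1)  # equals (t1/eta)**beta
    y2 = -math.log(r2)  # equals (t2/eta)**beta
    beta = math.log(y2 / y1) / math.log(t2 / t1)
    eta = t1 / y1 ** (1.0 / beta)
    return eta, beta

# Rounded reliabilities quoted in the text.
eta, beta = fit_weibull_from_two_points(25000.0, 0.2175, 33760.0, 0.1172)
mttf = eta * math.gamma(1.0 + 1.0 / beta)

print(f"back-solved eta  = {eta:.0f} hours")
print(f"back-solved beta = {beta:.3f}")
print(f"implied MTTF     = {mttf:.0f} hours")
```

Because the inputs are rounded, the implied MTTF should land close to, but not necessarily exactly on, the 16,462 hours quoted.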
Run or replace? Should we
take the unit out of service and replace the existing seal? We’re using a
really old component from a wear out distribution (the Weibull plot tells us we
have a wear out failure mode because b>1),
and we know the planned replacement cost equals the unplanned cost of $5000
because we have a spare device to keep the overall system in operation.
The answer is no. We have no financial motivation to remove the
seal and throw away the unused life. A timed replacement is justified only when
two conditions hold: 1) you must have a wear
out failure mode, and 2) the planned replacement cost must be much less than the
unplanned replacement cost. We meet condition 1
but not condition 2. Therefore run the
system to failure.
Repair on overtime? The next
question that usually comes up is whether we should repair the failed unit on
overtime for an additional repair cost of $2000, which makes the total cost
$7000. The answer is usually no, but it
depends upon: 1) how long it takes to get the repaired unit operational, and 2) the
consequences of having both units in the failed state.
Suppose the spare unit has completed
10,000 hours of service and we can get the failed unit repaired in one week
(168 hours of elapsed time). The
probability of the spare system failing during the interval can also be
obtained from the conditional reliability calculation methodology.
R(10,000) = 58.27% and R(10,168) = 57.67%, so
the conditional reliability = 57.67%/58.27% = 98.97%. The probability of failure
of the spare unit during the repair interval is therefore
100% - 98.97% = 1.03%.
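The spare unit calculation follows the same pattern as the one year mission. A short sketch using only the two reliabilities quoted above (variable names are hypothetical):

```python
# Reliabilities of the spare unit as quoted in the text.
r_spare_now = 0.5827           # R(10,000 hours)
r_spare_after_repair = 0.5767  # R(10,168 hours)

# Probability the spare survives the 168 hour repair window, and its complement.
p_survive_repair_window = r_spare_after_repair / r_spare_now
p_fail_repair_window = 1.0 - p_survive_repair_window

print(f"Spare survives the repair window: {p_survive_repair_window:.2%}")
print(f"Spare fails during the window:    {p_fail_repair_window:.2%}")
```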
Assume the consequence of both units
failing simultaneously is $25000. The
money at risk is (probability of failure)*($ consequence) = 1.03%*$25000 =
$257.42 (from the unrounded probability). Would you spend $2000 of
overtime costs to save $257.42? Not with
my money, but if you’re willing to donate $2000 for the unneeded overtime repair,
then I’ll accept your donation.
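The comparison can be written as a small expected-cost check. This is a sketch: the function name is hypothetical, the overtime premium is the $2000 additional cost quoted at the start of this discussion, and the rounded 1.03% probability gives about $257.50 rather than the $257.42 in the text.

```python
def money_at_risk(p_failure, consequence):
    """Expected loss: probability of the event times its cost."""
    return p_failure * consequence

p_spare_failure = 0.0103       # spare fails during the 168 hour repair window
double_failure_cost = 25000.0  # assumed consequence of both units being down
overtime_premium = 2000.0      # additional cost of repairing on overtime

risk = money_at_risk(p_spare_failure, double_failure_cost)
overtime_justified = overtime_premium < risk

print(f"Money at risk: ${risk:.2f}")
print(f"Overtime justified on expected cost alone: {overtime_justified}")
```

On these numbers the premium would pay for itself only if the chance of the spare failing during the repair exceeded $2000/$25000 = 8%.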
The bottom line:
Continue to run the system, now at
25000 hours, for the next year. Make the
replacement when the unit fails or when failure is imminent based on predictive
information. Use normal repair times
rather than making the repairs on overtime.
Will we have better data in the
future after we get more failure data?
Of course, if you can live long enough.
Comments:
Refer to the caveats on the Problem
Of The Month Page about the limitations of the following solution.
Maybe you have a better idea on how to solve the problem. Maybe you find where
I've screwed-up the solution and you can point out my errors as you check my
calculations. E-mail your comments, criticism, and corrections to Paul
Barringer.
* Thanks to Tom Tuttle of DuPont who identified that I cannot add correctly! 25000 hrs + 8760 hrs = 33760 hrs rather than the earlier incorrect addition that resulted in 32860 hrs which produced the wrong conditional reliability of 57.47%.