Weibull Analysis of Pump Seal Life
Two pumps operate functionally in parallel. (Most pumps give distress signals before they fail, so the failing pump can be taken off line and replaced by the unfailed pump, which restores the system to service and prevents system failure, assuming perfect switching.) Pump A is the primary device until it fails. Pump B is the secondary device that stands and waits until pump A fails; it is then brought into service and runs until it fails. The pumps operate in a 1-out-of-2 configuration. Failures of these pumps are paced by seal failures, and when a pump is down for seal replacement, other maintenance activities are performed, such as replacing bearings, housings, etc.
Here the seal ages-to-failure are recorded and used for the Weibull analysis (these are in-service hours; see http://www.barringer1.com/jul07prb.htm for reliability data demands). The age-to-failure data are shown below in rank order, with the column totals in the last row:

Pump A (Runs) | Pump B (Stands/waits)
14400 | 2200
19800 | 3800
21300 | 4600
23600 | 5000
24300 | 6700
28100 | 7500
29600 | 7800
29600 | 8000
34200 | 11000
38000 | 12900
Total: 262900 | Total: 69500
The data were acquired over a very long time period, almost 38 years, and show a system MTBF = 332400 hours/20 failures = 16620 hours/failure, or 1.9 years/failure, along with MTBF_{A} = 262900 hours/10 failures = 26290 hours/failure and MTBF_{B} = 69500 hours/10 failures = 6950 hours/failure.
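
The arithmetic MTBF figures can be checked directly from the data table. This is a minimal sketch; the variable and function names are illustrative and not part of the original analysis.

```python
# Ages-to-failure in in-service hours, copied from the data table above.
pump_a = [14400, 19800, 21300, 23600, 24300, 28100, 29600, 29600, 34200, 38000]
pump_b = [2200, 3800, 4600, 5000, 6700, 7500, 7800, 8000, 11000, 12900]

def mtbf(hours):
    """Arithmetic MTBF: total in-service hours divided by the number of failures."""
    return sum(hours) / len(hours)

mtbf_a = mtbf(pump_a)
mtbf_b = mtbf(pump_b)
mtbf_system = mtbf(pump_a + pump_b)   # pooled: 20 failures

print(f"MTBF A = {mtbf_a:.0f} h, MTBF B = {mtbf_b:.0f} h, system = {mtbf_system:.0f} h")
```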
Note the data above has been rounded to hundreds of hours; this is a minor problem and not a heart attack. Seldom do you have a complete set of data, and seldom does it span such a long period of time. Often the data includes suspensions (censored data) and the quantity of data is smaller, so in this case we're data rich.
All data was erroneously combined (pooled data) into a single column of 20 data points to make the Weibull plot in Figure 1.
This plot shows a good curve fit, with the PVE (p-value estimate, a goodness-of-fit criterion) at 64.87%. You need a minimum p-value estimate of 10% for a good curve fit. The characteristic life is 18259 hours (with a mean life of 16673 hours based on beta and eta), and the beta value (a shape factor) is greater than 1, which suggests a wear-out failure mode.
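
For readers without WinSMITH Weibull, the kind of fit behind Figure 1 can be sketched with median rank regression. The sketch below is an assumption about the method (Benard's median rank approximation, regressing ln(age) on the Weibull plotting position); it is not claimed to reproduce the software's exact beta, eta, or PVE values.

```python
import math

def weibull_rank_regression(ages):
    """Fit a 2-parameter Weibull by median rank regression (X on Y)."""
    t = sorted(ages)
    n = len(t)
    # Benard's approximation to the median rank of the i-th ordered failure
    ranks = [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]
    x = [math.log(v) for v in t]                        # ln(age)
    y = [math.log(-math.log(1.0 - f)) for f in ranks]   # Weibull plotting position
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / syy            # model: ln(t) = ln(eta) + y / beta
    beta = 1.0 / slope
    eta = math.exp(mx - slope * my)
    return beta, eta

# The 20 pooled ages-to-failure (the pooling that the article goes on to reject).
pooled = [14400, 19800, 21300, 23600, 24300, 28100, 29600, 29600, 34200, 38000,
          2200, 3800, 4600, 5000, 6700, 7500, 7800, 8000, 11000, 12900]

beta, eta = weibull_rank_regression(pooled)
print(f"beta = {beta:.3f}, eta = {eta:.0f} hours")
```

The same function can be applied to each pump's column separately, which is the treatment the article recommends.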
Just because you get a good Weibull curve fit DOES NOT mean you have a valid Weibull plot, and Figure 1 has fatal flaws. The flaws are shown in Figure 2, where the flaws of the pooled data stand out. In short, Figure 1 and Figure 2 produce junk information (or, if you're interested, they tell you how the system is responding!).
Pump A (Runs): 14400_A 19800_A 21300_A 23600_A 24300_A 28100_A 29600_A 34200_A 38000_A
Pump B (Stands/Waits): 3800_B 4600_B 5000_B 6700_B 7500_B 7800_B 8000_B 11000_B 12900_B
In Figure 2 each data point carries a label to show the data is stratified, and it is not homogeneous! If the data were homogeneous, we would expect the A and B points to be scattered randomly up and down the trend line.
How did we label the data in Figure 2? The input is shown above; all points were put into a single column in WinSMITH Weibull, and under the magnifying glass icon the Point Symbol Type was toggled to “Point Label” to activate the symbols on the plot.
Note
the stratified data in Figure 2 tells us the data should NOT have been pooled as the B-data fails at a young age while
A-data fails at a much older age—of course, that is also obvious when you look
at the two columns of data.
The seal life data is correctly plotted in Figure 3. Each pump's seal results must go into a different column to get two different trend lines.
Note that eta_{A}/eta_{B} = 28995/7928 ≈ 3.7, so it is roughly 4 times more severe service to stand and wait than it is to run.
In Figure 3, the
Y-axis shows the cumulative distribution function, which is a statement of unreliability. The X-axis shows the age-to-failure. In short, Figure 3 tells you what percent of
each population is expected to be dead by a given number of running hours.
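
The reading of Figure 3 described above can be sketched with the 2-parameter Weibull cumulative distribution function, F(t) = 1 - exp(-(t/eta)^beta). The beta and eta values are the ones quoted for the two trend lines later in the article; the ages chosen for the printout are arbitrary illustrations.

```python
import math

def weibull_cdf(t, beta, eta):
    """Fraction of the population expected to have failed by age t (hours)."""
    return 1.0 - math.exp(-((t / eta) ** beta))

# (beta, eta in hours) as quoted in the article for the Figure 3 trend lines
pumps = {"A": (3.943, 28995.0), "B": (2.145, 7928.0)}

for name, (beta, eta) in pumps.items():
    for hours in (5000, 10000, 20000):   # illustrative ages only
        print(f"Pump {name}: {weibull_cdf(hours, beta, eta):.1%} failed by {hours} h")
```

By definition, 63.2% of each population has failed when the age equals that population's eta, which is why eta is called the characteristic life.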
Take the data from the separate lines as shown in Figure 3 and run a formal test of significance to find if the trend lines are significantly different, as inferred from the segregated data in Figure 2. Use the likelihood ratio test (see The New Weibull Handbook, 5th edition), which is available in WinSMITH Weibull, to aid the decision.
Figure 4 shows the likelihood ratio test results. Lack of overlap of the contours shows significant differences at 90% confidence. You cannot pool datasets with significant differences! The triangle symbol inside each contour of Figure 4 represents the top of the likelihood mountain, and this is the reported beta/eta for each dataset.
Often people want to
see the probability density functions (PDF) of the curves in Figure 3; they are
shown in Figure 5.
Figure 5 shows tally
sheet contours of the number of failures expected to occur at any time. The area under a PDF curve is 1. The Y-axis shows the relative occurrences of
failures and the X-axis shows ages-to-failure.
The long life seal has a rather symmetrical curve but the short life
curve has a long tail to the right.
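
For reference, the standard 2-parameter Weibull forms behind Figures 3 and 5 are given below, with beta as the shape factor and eta as the characteristic life (location t_{0} = 0 assumed, as in the models used later):

```latex
% Cumulative distribution function (Figure 3) and probability density function (Figure 5)
F(t) = 1 - \exp\!\left[-\left(\frac{t}{\eta}\right)^{\beta}\right],
\qquad
f(t) = \frac{dF}{dt}
     = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta-1}
       \exp\!\left[-\left(\frac{t}{\eta}\right)^{\beta}\right],
\qquad
\int_{0}^{\infty} f(t)\,dt = 1 .
```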
So we have the Weibull analysis details. What are we going to do with them?
· We can quantify average times to failure for the existing system (which will be different than determined by arithmetic).
· We can determine how many repairs we will make in a 5-year interval (43800 hours) as the turnaround period with the existing system (which will be different than determined by arithmetic).
· We can determine the strategy for when to switch pumps into/out of service.
· We can build system models for determining the risk for system failure and reliability of the system given different conditions, and how many replacements we will need to make in a 5-year turnaround interval.
· We can make a financial decision regarding making repairs on overtime or at regular time. For example, working repairs on overtime achieves restoration of service in 40 hours (total repair costs = $10,000), while not working overtime achieves restoration of service in 730 hours, which is one month (total repair costs = $5,000), given an outage cost of the system of $10,000 per hour of downtime. Which course of action should we follow?
Weibull average times to failure for the existing system (which will be slightly different than determined by arithmetic).
Pump A’s Mean Time = eta*Γ(1+1/beta) = 28665*0.90568 = 25961 hours/failure
Pump B’s Mean Time = eta*Γ(1+1/beta) = 7928*0.88562 = 7021 hours/failure
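
The Weibull mean, eta times the gamma function of (1 + 1/beta), can be evaluated with the standard library. This sketch uses the beta values quoted for Figure 3 (3.943 and 2.145) and the eta values exactly as they appear in the two lines above; note that the pump A eta used here (28665) differs from the 28995 quoted elsewhere in the article.

```python
import math

def weibull_mean(beta, eta):
    """Mean life of a 2-parameter Weibull: eta * Gamma(1 + 1/beta)."""
    return eta * math.gamma(1.0 + 1.0 / beta)

mean_a = weibull_mean(3.943, 28665.0)   # eta as used in the mean-time line above
mean_b = weibull_mean(2.145, 7928.0)

print(f"Pump A mean life = {mean_a:.0f} h, Pump B mean life = {mean_b:.0f} h")
```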
How many repairs to make in a 5-year turnaround interval (43800 hours) with the existing system.
Arithmetic average number of repairs: 43800 hours = 25961 hours/failure (pump A) + 7021 hours/failure (pump B) + 10818 hours/partial failure (pump A again) → 1 + 1 + 10818/25961 = 1 + 1 + 0.416 = 2.416 failures, using the two Weibull curves in a deterministic fashion.
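
The deterministic count above can be sketched as follows: pump A runs one mean life, pump B runs one mean life, and the hours left in the mission are charged against a second pump A life. The variable names are illustrative.

```python
mission = 5 * 8760              # 43800 hours in the 5-year turnaround interval
mean_a, mean_b = 25961, 7021    # Weibull mean lives from the text, hours/failure

remaining = mission - mean_a - mean_b   # hours left after one A life and one B life
partial = remaining / mean_a            # fraction of a second pump A life consumed
repairs = 1 + 1 + partial

print(remaining, round(partial, 3), round(repairs, 3))
```

The article truncates the partial failure to 0.416; rounding instead gives 0.417, which does not change the conclusion of roughly 2.4 repairs.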
What strategy should we follow for when to switch pumps into/out of service?
The route to longer life (and thus fewer maintenance interventions) on this system is to rotate the pumps into and out of service on a regular basis to prevent deterioration in this ammonia system because of standing and waiting. You rotate equipment into and out of service primarily to maintain the competence of the workers and secondarily for the equipment. (Think about this: Why do you do frequent fire drills?) Successful rotation of equipment into/out of service requires a written procedure that is maintained and followed.
First quartile companies have
written procedures for rotating equipment into/out of service without failure
based on a disciplined approach where employees are carefully drilled in
effective operation of equipment. They
can tolerate longer periods of time between swapping equipment—say every 3 to 4
months.
Second quartile companies need a
little more drill and perhaps they will swap equipment every 2 to 3
months.
Third quartile companies may need
much more frequent drill for refreshers about procedures and processes, so they
may have a cycle of every 1 to 2 months.
Fourth quartile companies seldom
have written procedures. If they have
the written procedures, they often can’t find them. They are lax about carefully following the
written details. Thus, they seldom have
functional standby equipment, which means they often run to system failure with
the higher costs.
Often people are concerned
with rotating equipment into/out of service because if you have two old pieces
of equipment both may die at the same time.
Remember equipment dies in a probabilistic manner, and not in a
deterministic manner. If you truly believe
equipment dies in a deterministic manner, then tell me precisely how much life
remains in each of your pieces of equipment.
Of course you can’t tell me that as only the Old One knows the answer!
We can build system models for determining the risk for system failure and reliability of the system given different conditions.
Consider the use of RAPTOR reliability block diagram modeling software, which is available at no cost for small systems. The RAPTOR model is shown below in Figure 6; it approaches the details in a probabilistic manner (like real life) rather than a deterministic manner (like idealistic life). The models for Figure 6 and Figure 13 can be downloaded for RAPTOR version 7.0.
If you double click on each block in the RAPTOR model it will open for more details obtained from the life curves from Figure 3, where:
Pump A: life data = Weibull, shape factor (beta) = 3.943, scale factor (eta) = 28995 hours, location (t_{0})* = 0; repair data = Lognormal**, mean = 730 hours (MuAL)**, standard deviation = 2.0 (SigF)** (see Table 1 below). See Figure 7.
Pump B: life data = Weibull, shape factor (beta) = 2.145, scale factor (eta) = 7928 hours, location (t_{0})* = 0; repair data = Lognormal**, mean = 730 hours (MuAL)**, standard deviation = 2.0 (SigF)**. See Figure 8.
Set up the standby node so block A starts first (while B is idle), as shown in
Figure 9.
Under the Run icon, choose a simulation mission time = 5 years*8760 hours/year
= 43800 hours. Repeat the simulation
1000 times (without graphics, this requires ~3 minutes to complete). If you use graphics, you can see live
equipment (green), failed equipment (red), and standby equipment (blue)—the simulation requires ~7 minutes with
graphics.
You will find the system availability is very high, 99.9979%, and the reliability is also very high, at 0.996 (this means you have a 0.004 chance for failure on a 5-year mission, as shown in Figure 10).
Figure 11 shows how the time is used. About 96.7% of the time dual equipment is available, while for ~1020 hours out of the 43800-hour mission you are operating on solo equipment. On the average you lose 0.883 hours each 5-year mission, based on 1000 iterations.
Figure 12 shows to expect on the average ~2 maintenance interventions to occur
during the 5-year mission, although in some simulations up to 4 maintenance
interventions were required, while in other cases no interventions
occurred.
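
For readers without RAPTOR, the idea of the simulation can be sketched as a small Monte Carlo model. Everything about the model logic here is an assumption and not RAPTOR's algorithm: switching is perfect, a standby pump does not age while waiting, a repaired pump is as good as new, MuAL is treated as the lognormal median with SigF = exp(sigma), and whichever pump is available first takes over after a failure. The numbers it prints should not be expected to match Figures 10 to 12.

```python
import math
import random

# (beta, eta) in in-service hours; index 0 = pump A, index 1 = pump B
PUMPS = [(3.943, 28995.0), (2.145, 7928.0)]
MISSION = 5 * 8760.0          # 43800 hours
REPAIR_MEDIAN = 730.0         # MuAL, treated here as the lognormal median (assumption)
REPAIR_SIGF = 2.0             # SigF, treated here as exp(sigma) (assumption)

def simulate_mission(rng):
    """Simulate one mission; return (system_failed, downtime_hours, repairs)."""
    t = 0.0
    running = 0                # pump A starts while pump B stands by
    ready_at = [0.0, 0.0]      # time at which each pump is available to run
    failed, downtime, repairs = False, 0.0, 0
    while True:
        beta, eta = PUMPS[running]
        t_fail = t + rng.weibullvariate(eta, beta)   # arguments are (scale, shape)
        if t_fail >= MISSION:
            break                                    # survived to end of mission
        repairs += 1
        repair = rng.lognormvariate(math.log(REPAIR_MEDIAN), math.log(REPAIR_SIGF))
        ready_at[running] = t_fail + repair
        other = 1 - running
        # Hypothetical rule: the pump that is available first takes over.
        nxt = other if ready_at[other] <= ready_at[running] else running
        start = max(t_fail, ready_at[nxt])
        if start > t_fail:     # no pump available at the moment of failure
            failed = True
            downtime += min(start, MISSION) - t_fail
        if start >= MISSION:
            break
        running, t = nxt, start
    return failed, downtime, repairs

def run(iterations=1000, seed=1):
    """Repeat the mission and summarize reliability, downtime, and repairs."""
    rng = random.Random(seed)
    results = [simulate_mission(rng) for _ in range(iterations)]
    reliability = 1.0 - sum(f for f, _, _ in results) / iterations
    mean_downtime = sum(d for _, d, _ in results) / iterations
    mean_repairs = sum(r for _, _, r in results) / iterations
    return reliability, mean_downtime, mean_repairs

reliability, mean_downtime, mean_repairs = run()
print(f"reliability = {reliability:.3f}, "
      f"mean downtime = {mean_downtime:.2f} h, mean repairs = {mean_repairs:.2f}")
```

Changing the second entry of `PUMPS` to match the first gives a rough counterpart of the A = B case discussed next.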
If rotating equipment into and out of service, at very little extra cost, prevents deterioration of idle devices, we can change the model as shown in Figure 13.
The results for A = B show, in Figure 14, substantial improvement compared to Figure 10, simply by avoiding deterioration by standing and waiting.
The details of how time is spent are shown in Figure 15, which can be
contrasted to Figure 11 where A ≠ B.
While the system does not incur failures, individual pieces of equipment do
require maintenance interventions, as shown in Figure 16 where the assumption
is that the failure data on A and B are the same. Notice the reduction in maintenance
interventions in Figure 12 where A ≠B (compared to Figure 16).
We can make a financial decision: repairs on overtime or at regular time.
Figures 13, 14, 15, and 16 show excellent results when repair times are, on the average, 730 hours and when pumps are maintained in superior condition; thus, no motivation exists for repairing them on overtime.
Figures 10 and 11 for the good/bad pump condition may look different using the Monte Carlo simulation risks. The choice is between working repairs on overtime to achieve restoration of service in 40 hours (total repair costs = $10,000) and not working overtime and achieving restoration of service in 730 hours (total repair costs = $5,000), given an outage cost of the system of $10,000 per hour of downtime.
From Figure 11, the system is projected to be down 0.883 hours in 5 years, which calculates to a loss of 0.883 hr*$10,000/hr = $8830 for a 5-year period. Figure 12 says to expect 1.988 maintenance events in 5 years, with an extra overtime cost of $5,000 per incident, which calculates to 1.988 incidents*$5,000 = $9940.
Put another way: Would you spend $9940 to save at most $8830 in a 5-year period? The answer is NO overtime repairs unless the system is down!
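
The comparison can be sketched in a few lines. The inputs are the simulation results and costs quoted above; the variable names are illustrative.

```python
downtime_hours = 0.883          # expected system downtime per 5-year mission (Figure 11)
outage_cost_per_hour = 10_000   # $ per hour of system downtime
interventions = 1.988           # expected maintenance events per mission (Figure 12)
overtime_premium = 10_000 - 5_000   # extra $ per repair when worked on overtime

# Upper bound on what overtime could save: all of the expected downtime cost.
max_savings = downtime_hours * outage_cost_per_hour
# What overtime costs if every repair is worked on overtime.
overtime_cost = interventions * overtime_premium

print(f"overtime premium ${overtime_cost:,.0f} vs. at most ${max_savings:,.0f} saved")
print("work overtime" if overtime_cost < max_savings else "no routine overtime")
```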
Once you have the
statistics, the answers are rather obvious.
Without the statistics, we have many arguments and oftentimes we take
the wrong actions that cost us money.
_________
*Please note: The 3rd Weibull parameter, t_{0}, is called the location parameter. Use of this parameter has four strict requirements (see The New Weibull Handbook, 5th edition), and all 4 must be met:
1. You must have a physical reason for use of a location offset. (Simply making a better curve fit to the data is not one of the reasons!!)
2. You must see curvature in the raw age-to-failure data on a Weibull plot.
3. You must have more than 21 failure data points, maybe more than 100 for subtle offsets.
4. You must get a better goodness of fit statistic after use of the t_{0}.
** Roughly 85% to 95% of life data is adequately represented by a Weibull distribution. Similarly, 85% to 95% of all repair data is adequately represented by a Lognormal distribution. In a Lognormal distribution the mean value (MuAL) is the value at 50% probability, and the slope of the trend line is determined by a shape factor (SigF), which is a measure of how consistently the job can be repaired.
If the job is always finished in the same amount of time, the shape factor would be a perfect 1.0. A well-controlled repair time would demonstrate a shape factor of 2. A less orderly repair time would demonstrate a shape factor of 3. A highly variable repair time would show a shape factor of 4, or, if the organization demonstrates a Keystone Cops disorder, greater than 4!
Here are some typical
amounts of scatter in repair times shown in Table 1.
Table 1: Lognormal Repair Data Has Long Tails To The Right

If The Repair Time (MuAL) Is 8 Hours, What’s The Repair Time Scatter?
SigF | 50% Completed Within (hours) | 80% Completed Within (hours) | 90% Completed Within (hours) | 98% Completed Within (hours)
1.0 | 8.0 to 8.0 | 8.0 to 8.0 | 8.0 to 8.0 | 8.0 to 8.0
1.5 | 6.1 to 10.5 | 4.8 to 13.5 | 4.1 to 15.6 | 3.1 to 20.5
2 | 5.0 to 12.8 | 3.3 to 19.4 | 2.6 to 25.0 | 1.6 to 40.1
3 | 3.8 to 16.8 | 2.0 to 32.7 | 1.3 to 48.8 | 0.6 to 103
4 | 3.1 to 20.4 | 1.4 to 47.3 | 0.8 to 78.3 | 0.3 to 201

If The Repair Time (MuAL) Is 730 Hours and SigF Is 2 For The Model
2 | 476 to 1445 | 467 to 2053 | 378 to 2533 | 255 to 3753
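
The scatter in the 8-hour rows of Table 1 can be approximated with a short calculation. The sketch assumes that MuAL is the lognormal median and that SigF = exp(sigma), so a central interval runs from MuAL/SigF^z to MuAL*SigF^z, where z is the standard normal quantile for the chosen coverage. Under that assumption the 8-hour rows are reproduced to within rounding; the sketch is not claimed to reproduce the 730-hour row.

```python
from statistics import NormalDist

def repair_interval(mual, sigf, coverage):
    """Central interval containing `coverage` of lognormal repair times.

    Assumes mual is the median and sigf = exp(sigma) of the lognormal.
    """
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)   # e.g. 0.80 -> z of the 90th percentile
    spread = sigf ** z
    return mual / spread, mual * spread

for sigf in (1.0, 1.5, 2.0, 3.0, 4.0):
    cells = ["%.1f to %.1f" % repair_interval(8.0, sigf, c)
             for c in (0.50, 0.80, 0.90, 0.98)]
    print(sigf, " | ".join(cells))
```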
Comments:
Refer to the caveats on the Problem
of the Month Page about the limitations of the following solution.
Maybe you have a better idea on how to solve the problem. Maybe you find where
I've screwed up the solution and you can point out my errors as you check my
calculations. E-mail your comments, criticism, and corrections to Paul
Barringer.