Weibull Analysis of Pump Seal Life

 

Two pumps are installed functionally in parallel.  Most pumps give distress signals before they fail, so a failing pump can be taken off line and the standby pump brought on line to keep the system in service, which prevents system failure (assuming perfect switching).  Pump A is the primary device and runs until it fails.  Pump B is the secondary device; it stands and waits until pump A fails, then it is brought into service and runs until it fails.  The pumps operate in a 1-out-of-2 configuration.  Failures of these pumps are paced by seal failures, and when a pump is down for seal replacement, other maintenance activities are performed as well, such as replacing bearings, housings, etc.

 

   Pump A       Pump B
   (Runs)       (Stands/waits)
    14400         2200
    19800         3800
    21300         4600
    23600         5000
    24300         6700
    28100         7500
    29600         7800
    29600         8000
    34200        11000
    38000        12900
   ------       ------
   262900        69500   (total hours)

Here the seal ages-to-failure are recorded and used for the Weibull analysis (these are in-service hours; see http://www.barringer1.com/jul07prb.htm for reliability data demands).  The age-to-failure data are shown in the table above in rank order.

 

Data was acquired over a very long time period, almost 38 years, and shows a system MTBF = 332400 hours/20 failures = 16620 hours/failure or 1.9 years/failure, along with MTBF_A = 262900 hours/10 failures = 26,290 hrs/failure and MTBF_B = 69500 hours/10 failures = 6,950 hrs/failure.
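The arithmetic behind these MTBF figures can be checked with a few lines of Python.  This is only a sketch; the variable and function names are illustrative and not part of the original analysis.

```python
# Ages-to-failure in hours, copied from the seal life table.
pump_a = [14400, 19800, 21300, 23600, 24300, 28100, 29600, 29600, 34200, 38000]
pump_b = [2200, 3800, 4600, 5000, 6700, 7500, 7800, 8000, 11000, 12900]

HOURS_PER_YEAR = 8760


def mtbf(ages):
    """Arithmetic MTBF: total hours divided by the number of failures."""
    return sum(ages) / len(ages)


total_hours = sum(pump_a) + sum(pump_b)
system_mtbf = mtbf(pump_a + pump_b)

print("record length, years :", round(total_hours / HOURS_PER_YEAR, 1))
print("system MTBF, hours   :", system_mtbf)
print("system MTBF, years   :", round(system_mtbf / HOURS_PER_YEAR, 1))
print("pump A MTBF, hours   :", mtbf(pump_a))
print("pump B MTBF, hours   :", mtbf(pump_b))
```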

 

Note the data in the table above has been rounded to hundreds of hours; this is a minor problem and not a heart attack.  Seldom do you have a complete set of data and seldom does it span such a long period of time.  Often the data includes suspensions (censored data) and the quantity of data is smaller, so in this case we’re data rich.

 

All data was erroneously combined (pooled data) into a single column of 20 data points to make the Weibull plot in Figure 1.

 

This plot shows a good curve fit with the PVE (p-value estimate, which is a goodness-of-fit criterion) at 64.87%.  You need a minimum p-value estimate of 10% for a good curve fit.  The characteristic life is 18259 hours (with a mean life of 16673 hours based on beta and eta), and the beta value (a shape factor) is greater than 1, which suggests a wear-out failure mode.
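For readers without WinSMITH Weibull, a pooled fit of this kind can be sketched in plain Python.  The sketch assumes median rank regression with Benard's approximation for the plotting positions and regresses ln(age) on the transformed rank (X on Y); those choices are assumptions about how the plot was fitted, and the function name is hypothetical.  It does not compute the PVE.

```python
import math


def median_rank_fit(ages):
    """Fit a 2-parameter Weibull by median rank regression (X on Y).

    Returns (beta, eta).  Assumes complete data with no suspensions.
    """
    t = sorted(ages)
    n = len(t)
    x = [math.log(v) for v in t]
    # Benard's approximation to the median rank, then the Weibull transform.
    y = [math.log(-math.log(1 - (i - 0.3) / (n + 0.4))) for i in range(1, n + 1)]
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    syy = sum((b - y_bar) ** 2 for b in y)
    slope = sxy / syy                       # ln(age) regressed on the rank
    beta = 1 / slope
    eta = math.exp(x_bar - slope * y_bar)   # age where the line crosses 63.2%
    return beta, eta


pump_a = [14400, 19800, 21300, 23600, 24300, 28100, 29600, 29600, 34200, 38000]
pump_b = [2200, 3800, 4600, 5000, 6700, 7500, 7800, 8000, 11000, 12900]

beta, eta = median_rank_fit(pump_a + pump_b)   # the (erroneous) pooled fit
mean_life = eta * math.gamma(1 + 1 / beta)
print(f"pooled beta = {beta:.3f}, eta = {eta:.0f} h, mean = {mean_life:.0f} h")
```

The printed eta and mean can be compared with the 18259 and 16673 hours quoted for Figure 1; small differences would point to a different regression option in the software.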

 

Just because you get a good Weibull curve fit DOES NOT mean you have a valid Weibull plot and Figure 1 has fatal flaws.  The flaws are shown in Figure 2 where the flaws of the pooled data stand out.  In short, Figure 1 and Figure 2 produce junk information—or if you’re interested, this tells you how the system is responding!

 

 

   Pump A       Pump B
   (Runs)       (Stands/Waits)
   14400_A       2200_B
   19800_A       3800_B
   21300_A       4600_B
   23600_A       5000_B
   24300_A       6700_B
   28100_A       7500_B
   29600_A       7800_B
   29600_A       8000_B
   34200_A      11000_B
   38000_A      12900_B

In Figure 2 the data has labels assigned to each data point to show the data is stratified and not homogeneous!  If the data were homogeneous, we would expect the A and B points to be scattered randomly up and down the trend line.

 

How did we label the data in Figure 2?  The inputs shown above were all put into a single column in WinSMITH Weibull, and under the magnifying glass icon the Point Symbol Type was toggled to “Point Label” to activate the symbols on the plot.

 

Note the stratified data in Figure 2 tells us the data should NOT have been pooled as the B-data fails at a young age while A-data fails at a much older age—of course, that is also obvious when you look at the two columns of data.
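The stratification can also be seen without a plot.  As a rough check (not a formal test, and not something the original analysis performed), sort the pooled, labeled ages and count the runs of identical labels; homogeneous data would interleave A and B and give many runs.

```python
pump_a = [14400, 19800, 21300, 23600, 24300, 28100, 29600, 29600, 34200, 38000]
pump_b = [2200, 3800, 4600, 5000, 6700, 7500, 7800, 8000, 11000, 12900]

# Pool the data with its labels and put it in rank order, as on the Weibull plot.
labeled = sorted([(t, "A") for t in pump_a] + [(t, "B") for t in pump_b])
sequence = "".join(label for _, label in labeled)

# A "run" is an unbroken stretch of the same label.
runs = 1 + sum(1 for prev, cur in zip(sequence, sequence[1:]) if prev != cur)

print(sequence)  # BBBBBBBBBBAAAAAAAAAA
print(runs)      # 2
```

Only two runs out of 20 points means every B failure is younger than every A failure, which is the stratification visible in Figure 2.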

 

 

 

 

The seal life data is correctly plotted in Figure 3.  Each seal’s results must go into a different column to get two different trend lines.

 

 

Note that ηA/ηB = 28995/7928 ≈ 3.7, so it is roughly 4 times more severe service to stand and wait than it is to run.
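The two separate trend lines can be approximated with the same kind of regression, one column per pump.  This is a sketch that assumes median rank regression (X on Y) with Benard's median ranks; the function name is hypothetical, and the results can be compared with the beta and eta values quoted for each pump in the RAPTOR inputs.

```python
import math


def median_rank_fit(ages):
    """2-parameter Weibull by median rank regression (X on Y), complete data."""
    t = sorted(ages)
    n = len(t)
    x = [math.log(v) for v in t]
    y = [math.log(-math.log(1 - (i - 0.3) / (n + 0.4))) for i in range(1, n + 1)]
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((b - y_bar) ** 2 for b in y))
    return 1 / slope, math.exp(x_bar - slope * y_bar)   # (beta, eta)


pump_a = [14400, 19800, 21300, 23600, 24300, 28100, 29600, 29600, 34200, 38000]
pump_b = [2200, 3800, 4600, 5000, 6700, 7500, 7800, 8000, 11000, 12900]

beta_a, eta_a = median_rank_fit(pump_a)
beta_b, eta_b = median_rank_fit(pump_b)

print(f"Pump A: beta = {beta_a:.3f}, eta = {eta_a:.0f} hours")
print(f"Pump B: beta = {beta_b:.3f}, eta = {eta_b:.0f} hours")
print(f"eta_A / eta_B = {eta_a / eta_b:.2f}")
```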

 

In Figure 3, the Y-axis shows the cumulative distribution function, which is a statement of unreliability.  The X-axis shows the age-to-failure.  In short, Figure 3 tells you what percent of each population is expected to be dead by a given number of running hours.
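Reading a value off the Y-axis is the same as evaluating the Weibull cumulative distribution function, F(t) = 1 - exp(-(t/η)^β).  The short sketch below uses the beta and eta values quoted for each pump in this article; the 10,000-hour age is an arbitrary illustration.

```python
import math


def unreliability(t, beta, eta):
    """Weibull CDF: fraction of the population expected to have failed by age t."""
    return 1.0 - math.exp(-((t / eta) ** beta))


PUMP_A = {"beta": 3.943, "eta": 28995.0}   # running seal
PUMP_B = {"beta": 2.145, "eta": 7928.0}    # standing/waiting seal

age = 10_000  # hours, chosen only as an example
f_a = unreliability(age, **PUMP_A)
f_b = unreliability(age, **PUMP_B)
print(f"By {age} hours: {100 * f_a:.1f}% of A seals and {100 * f_b:.1f}% of B seals have failed")

# By definition, 63.2% of any Weibull population has failed at t = eta.
print(round(unreliability(PUMP_A["eta"], **PUMP_A), 3))  # 0.632
```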

 

Take the data from the separate lines as shown in Figure 3 and run a formal test of significance to find if the trend lines are significantly different as inferred with the segregated data from Figure 2.  Use the likelihood ratio test (See The New Weibull Handbook, 5th edition) and the likelihood ratio test in WinSMITH Weibull to aid the decision. 

 

Figure 4 shows the likelihood ratio test results.  Lack of overlap of the contours shows significant differences at 90% confidence.
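A related check can be sketched without the contour plot.  The code below is not the WinSMITH contour method; it swaps in the classical likelihood ratio statistic, comparing the maximum log-likelihood of two separate Weibull fits with that of one pooled fit, and refers the result to a chi-square distribution with 2 degrees of freedom (an asymptotic approximation that can be optimistic for small samples).  All function names are hypothetical.

```python
import math


def weibull_mle(ages):
    """Maximum likelihood (beta, eta) for complete data, found by bisection."""
    t_max = max(ages)
    s = [t / t_max for t in ages]            # scale ages to avoid overflow
    logs = [math.log(v) for v in s]
    mean_log = sum(logs) / len(s)

    def score(beta):
        w = [v ** beta for v in s]
        return sum(wi * li for wi, li in zip(w, logs)) / sum(w) - 1 / beta - mean_log

    lo, hi = 0.05, 50.0                      # score(lo) < 0 < score(hi) for this data
    for _ in range(100):
        mid = (lo + hi) / 2
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    beta = (lo + hi) / 2
    eta = t_max * (sum(v ** beta for v in s) / len(s)) ** (1 / beta)
    return beta, eta


def log_likelihood(ages, beta, eta):
    """Sum of ln f(t) for the 2-parameter Weibull density."""
    return sum(math.log(beta / eta) + (beta - 1) * math.log(t / eta) - (t / eta) ** beta
               for t in ages)


pump_a = [14400, 19800, 21300, 23600, 24300, 28100, 29600, 29600, 34200, 38000]
pump_b = [2200, 3800, 4600, 5000, 6700, 7500, 7800, 8000, 11000, 12900]
pooled = pump_a + pump_b

ll_separate = (log_likelihood(pump_a, *weibull_mle(pump_a))
               + log_likelihood(pump_b, *weibull_mle(pump_b)))
ll_pooled = log_likelihood(pooled, *weibull_mle(pooled))

lr_statistic = 2 * (ll_separate - ll_pooled)
critical_90 = -2 * math.log(1 - 0.90)        # chi-square, 2 dof, 90% level (4.605)
print(f"LR statistic = {lr_statistic:.1f}, 90% critical value = {critical_90:.3f}")
print("significantly different" if lr_statistic > critical_90 else "not significantly different")
```

The maximum likelihood estimates differ somewhat from the rank regression values on the plot, so the statistic is an independent cross-check rather than a reproduction of Figure 4.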

 

 

You cannot pool datasets with significant differences!  Inside the contours is a triangle.  The triangle symbol inside the contour lines of Figure 4 represents the top of the likelihood mountain, and this is the reported beta/eta for each dataset.

 

Often people want to see the probability density functions (PDF) of the curves in Figure 3; they are shown in Figure 5.

 

 

Figure 5 shows tally sheet contours of the number of failures expected to occur at any time.  The area under a PDF curve is 1.  The Y-axis shows the relative occurrences of failures and the X-axis shows ages-to-failure.  The long life seal has a rather symmetrical curve but the short life curve has a long tail to the right.

 

So we have the Weibull analysis details.  What are we going to do with them?

·       We can quantify average times to failure for the existing system (which will be different than determined by arithmetic),

·       We can determine how many repairs we will make in a 5-year interval (43800 hours) as the turnaround period with the existing system (which will be different than determined by arithmetic),

·       We can determine the strategy for when to switch pumps into/out of service

·       We can build system models for determining the risk for system failure and reliability of the system given different conditions and how many replacements we will need to make in a 5 year turnaround interval.

·       We can make a financial decision about making repairs on overtime or at regular time.  For example, working repairs on overtime restores service in 40 hours (total repair costs = $10,000), while not working overtime restores service in 730 hours, which is one month (total repair costs = $5,000).  Given an outage cost for the system of $10,000 per hour of downtime, which course of action should we follow?

Weibull average times to failure for the existing system (which will be slightly different than determined by arithmetic).
Pump A’s Mean Time =
η*Γ(1+1/β) = 28995*0.90568 = 26260 hours/failure
Pump B’s Mean Time =
η*Γ(1+1/β) = 7928*0.88562 = 7021 hours/failure

How many repairs to make in a 5-year turnaround interval (43800 hours) with the existing system.
Arithmetic average number of repairs: 43800 hours = 26260 hours/failure (pump A) + 7021 hours/failure (pump B) + 10519 hours/partial failure (pump A again, 10519/26260 = 0.401)
→ 1+1+0.401 = 2.401 failures using the two Weibull curves in a deterministic fashion
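The mean-life formula and the deterministic repair count are easy to evaluate directly.  This sketch takes beta and eta for each pump from the life data quoted for the RAPTOR blocks (eta = 28995 hours for pump A and 7928 hours for pump B) and assumes, as the hand calculation does, that A runs one mean life, B runs one mean life, and A then runs for the rest of the interval.

```python
import math


def weibull_mean(beta, eta):
    """Mean life of a 2-parameter Weibull: eta * Gamma(1 + 1/beta)."""
    return eta * math.gamma(1 + 1 / beta)


mean_a = weibull_mean(3.943, 28995.0)
mean_b = weibull_mean(2.145, 7928.0)

turnaround = 5 * 8760                          # 43800 hours
remaining = turnaround - mean_a - mean_b       # hours left for the next A seal
repairs = 1 + 1 + remaining / mean_a           # one A, one B, part of another A

print(f"Pump A mean life = {mean_a:.0f} hours/failure")
print(f"Pump B mean life = {mean_b:.0f} hours/failure")
print(f"Deterministic repairs in 5 years = {repairs:.2f}")
```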

What strategy should we follow for when to switch pumps into/out of service?
The route to longer life (and thus fewer maintenance interventions) on this system is to rotate the pumps into and out of service on a regular basis to prevent deterioration in this ammonia system because of standing and waiting.  You rotate equipment into and out of service primarily to maintain competence in the workers and secondarily for the equipment.  (Think about this:  Why do you do frequent fire drills?)  Successful rotation of equipment into/out of service requires a written procedure that is maintained and followed.


First quartile companies have written procedures for rotating equipment into/out of service without failure based on a disciplined approach where employees are carefully drilled in effective operation of equipment.  They can tolerate longer periods of time between swapping equipment—say every 3 to 4 months.


Second quartile companies need a little more drill and perhaps they will swap equipment every 2 to 3 months. 


Third quartile companies may need much more frequent drill for refreshers about procedures and processes, so they may have a cycle of every 1 to 2 months.

 
Fourth quartile companies seldom have written procedures.  If they have the written procedures, they often can’t find them.  They are lax about carefully following the written details.  Thus, they seldom have functional standby equipment, which means they often run to system failure with the higher costs.

 

      Often people are concerned with rotating equipment into/out of service because if you have two old pieces of equipment both may die at the same time.  Remember equipment dies in a probabilistic manner, and not in a deterministic manner.  If you truly believe equipment dies in a deterministic manner, then tell me precisely how much life remains in each of your pieces of equipment.  Of course you can’t tell me that as only the Old One knows the answer!

We can build system models for determining the risk for system failure and reliability of the system given different conditions.
Consider using RAPTOR, a reliability block diagram modeling system that is available as no-cost software for small systems.  The RAPTOR model is shown below in Figure 6; it approaches the details in a probabilistic manner (like real life) rather than a deterministic manner (like idealistic life).  The models for Figure 6 and Figure 13 can be downloaded for RAPTOR version 7.0.

 

 

     If you double click on each block in the RAPTOR model it will open for more details obtained from the life curves from Figure 3 where
Pump A: life data = Weibull, shape factor (beta) = 3.943, scale factor (eta) = 28995 hours, location (t0)* = 0; repair data = Lognormal**, mean = 730 hours (MuAL)**, standard deviation = 2.0 (SigF)** (see Table 1 below).  See Figure 7.



Pump B: life data = Weibull, shape factor (beta) = 2.145, scale factor (eta) = 7928 hours, location (t0)* = 0; repair data = Lognormal**, mean = 730 hours (MuAL)**, standard deviation = 2.0 (SigF)**.  See Figure 8.



Set up the standby node so block A starts first (while B is idle), as shown in Figure 9.




Under the Run icon, choose a simulation mission time = 5 years*8760 hours/year = 43800 hours.  Repeat the simulation 1000 times (without graphics, this requires ~3 minutes to complete).  If you use graphics, you can see live equipment (green), failed equipment (red), and standby equipment (blue); the simulation then requires ~7 minutes.


You will find the system availability is very high at 99.9979%, and the reliability is also very high at 0.996 (this means you have a 0.004 chance of system failure on a 5-year mission), as shown in Figure 10.




Figure 11 shows how the time is used.  About 96.7% of the time dual equipment is available, while for ~1020 hours out of the 43800-hour mission you are operating on solo equipment.  On the average you lose 0.883 hours each 5-year mission based on 1000 iterations.


Figure 12 shows that you should expect, on the average, ~2 maintenance interventions during the 5-year mission, although some simulations required up to 4 maintenance interventions while others required none.
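For readers who want to experiment without RAPTOR, the standby arrangement can be roughed out as a small Monte Carlo simulation.  This is a simplified sketch, not the RAPTOR model: it assumes perfect switching, that a unit ages only while running, that a repaired unit is as good as new and becomes the standby, and that SigF = 2 means the standard deviation of ln(repair time) is ln(2) with MuAL = 730 hours as the median.  Because of these assumptions its output will not match Figures 10 through 12 exactly.

```python
import math
import random

MISSION_HOURS = 43800.0
LIFE = {"A": (3.943, 28995.0), "B": (2.145, 7928.0)}   # (beta, eta) per pump
REPAIR_MU = math.log(730.0)      # assumed: MuAL is the median repair time
REPAIR_SIGMA = math.log(2.0)     # assumed: SigF is a multiplicative spread


def simulate_mission(rng):
    """Run one mission; return (system downtime in hours, number of repairs)."""
    clock, downtime, repairs = 0.0, 0.0, 0
    active, standby = "A", "B"
    standby_ready_at = 0.0       # time at which the idle unit is available
    while clock < MISSION_HOURS:
        beta, eta = LIFE[active]
        failure_time = clock + rng.weibullvariate(eta, beta)
        if failure_time >= MISSION_HOURS:
            break
        repairs += 1
        repair_done = failure_time + rng.lognormvariate(REPAIR_MU, REPAIR_SIGMA)
        # The system is down only if the other unit is still being repaired.
        restart = max(failure_time, standby_ready_at)
        downtime += min(restart, MISSION_HOURS) - failure_time
        clock = restart
        active, standby = standby, active
        standby_ready_at = repair_done
    return downtime, repairs


rng = random.Random(1)           # fixed seed so the run is repeatable
results = [simulate_mission(rng) for _ in range(1000)]

mean_downtime = sum(d for d, _ in results) / len(results)
mean_repairs = sum(r for _, r in results) / len(results)
availability = 1 - mean_downtime / MISSION_HOURS
reliability = sum(1 for d, _ in results if d == 0) / len(results)

print(f"availability = {availability:.6f}, reliability = {reliability:.3f}")
print(f"mean repairs per mission = {mean_repairs:.2f}")
```

Changing `LIFE["B"]` to the pump A values gives a quick feel for the A = B case.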



If rotating equipment into and out of service, at very little extra cost, prevents deterioration of idle devices, we can change the model so that A = B, as shown in Figure 13.



The results for A = B show, in Figure 14, substantial improvement compared to Figure 10, simply by avoiding deterioration from standing and waiting.



The details of how time is spent are shown in Figure 15, which can be contrasted to Figure 11 where A ≠ B.



While the system does not incur failures, individual pieces of equipment do require maintenance interventions, as shown in Figure 16 where the assumption is that the failure data on A and B are the same.  Notice the reduction in maintenance interventions in Figure 12 where A ≠B (compared to Figure 16).



We can make a financial decision: repairs on overtime or at regular time.
Figures 13, 14, 15, and 16 show excellent results when repair times are, on the average, 730 hours and when pumps are maintained in superior condition; thus, no motivation exists for repairing them on overtime.

Figures 10 and 11 for the good/bad pump condition may look different.  Use the Monte Carlo simulation risks to compare working repairs on overtime to achieve restoration of service in 40 hours (total repair costs = $10,000) against not working overtime and achieving restoration of service in 730 hours (total repair costs = $5,000), given an outage cost for the system of $10,000 per hour of downtime.

From Figure 11, the system is projected to be down 0.883 hours in 5 years, which calculates to a loss of 0.883 hr*$10,000/hr = $8830 for a 5-year period.  Figure 12 says to expect 1.988 maintenance events in 5 years with an extra cost of $5,000 per incident for overtime, which calculates to 1.988 incidents*$5,000 = $9940.

      Put another way:  Would you spend $9940 to save at most $8830 in a 5-year period?  The answer is NO overtime repairs unless the system is down!
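The overtime comparison is a two-line calculation once the simulation results are in hand.  The sketch below uses the downtime and intervention counts quoted from Figures 11 and 12; the variable names are illustrative.

```python
OUTAGE_COST_PER_HOUR = 10_000      # $ per hour of system downtime
OVERTIME_REPAIR_COST = 10_000      # $ per repair, service restored in 40 hours
REGULAR_REPAIR_COST = 5_000        # $ per repair, service restored in 730 hours

expected_downtime_hours = 0.883    # per 5-year mission, from Figure 11
expected_interventions = 1.988     # per 5-year mission, from Figure 12

# The most overtime could save is the entire expected downtime loss.
downtime_loss = expected_downtime_hours * OUTAGE_COST_PER_HOUR
# What overtime costs extra over the same period.
overtime_premium = expected_interventions * (OVERTIME_REPAIR_COST - REGULAR_REPAIR_COST)

print(f"Maximum downtime loss avoided: ${downtime_loss:,.0f}")
print(f"Extra cost of overtime repairs: ${overtime_premium:,.0f}")
print("Work overtime" if overtime_premium < downtime_loss else "Repair at regular time")
```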

 

Once you have the statistics, the answers are rather obvious.  Without the statistics, we have many arguments and oftentimes we take the wrong actions that cost us money.

_________

*Please note: 

The 3rd Weibull parameter, t0, is called the location parameter.  This parameter has four strict requirements for when it can be used (see The New Weibull Handbook, 5th edition), and all 4 must be met:

1.     You must have a physical reason for use of a location offset.  (Simply making a better curve fit to the data is not one of the reasons!!)

2.     You must see curvature in the raw age-to-failure data on a Weibull plot

3.     You must have more than 21 failure data—maybe more than 100 for subtle offsets

4.     You must get a better goodness of fit statistic after use of the t0

 

** Roughly 85% to 95% of life data is adequately represented by a Weibull distribution.  Similarly 85 to 95% of all repair data is adequately represented by a Lognormal distribution.  In a Lognormal distribution the mean value (MuAL) is represented at 50% probability, and the slope of the trend line (SigF) is determined by a shape factor, which is a measure of how consistently the job can be repaired. 

 

If the job is repaired and is always finished in the same amount of time the shape factor would be a perfect 1.0.  A well-controlled repair time would demonstrate a shape factor = 2.  A less orderly repair time would demonstrate a shape factor of 3.  A highly variable repair time would show a shape factor of 4—or if the organization demonstrates a Keystone Cops disorder then it’s greater than 4! 

 

Here are some typical amounts of scatter in repair times shown in Table 1.

 

Table 1:  Lognormal Repair Data Has Long Tails To The Right

If The Repair Time (MuAL) Is 8 Hours, What’s The Repair Time Scatter?

SigF   50% Completed    80% Completed    90% Completed    98% Completed
       Within (hours)   Within (hours)   Within (hours)   Within (hours)
1.0    8.0 to 8.0       8.0 to 8.0       8.0 to 8.0       8.0 to 8.0
1.5    6.1 to 10.5      4.8 to 13.5      4.1 to 15.6      3.1 to 20.5
2      5.0 to 12.8      3.3 to 19.4      2.6 to 25.0      1.6 to 40.1
3      3.8 to 16.8      2.0 to 32.7      1.3 to 48.8      0.6 to 103
4      3.1 to 20.4      1.4 to 47.3      0.8 to 78.3      0.3 to 201

If The Repair Time (MuAL) Is 730 Hours and SigF Is 2 For The Model

2      476 to 1445      467 to 2053      378 to 2533      255 to 3753
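The 8-hour rows of Table 1 can be approximated with a short calculation.  The sketch assumes that MuAL is the median repair time and that SigF is a multiplicative spread, so that ln(SigF) is the standard deviation of ln(repair time); that interpretation is an assumption, chosen because it agrees with the 8-hour rows to within rounding.  The function name is hypothetical.

```python
import math
from statistics import NormalDist


def repair_interval(median_hours, sigf, coverage):
    """Central interval that contains `coverage` of lognormal repair times."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2)    # standard normal quantile
    spread = math.exp(z * math.log(sigf))           # multiplicative half-width
    return median_hours / spread, median_hours * spread


for sigf in (1.0, 1.5, 2.0, 3.0, 4.0):
    cells = []
    for coverage in (0.50, 0.80, 0.90, 0.98):
        low, high = repair_interval(8.0, sigf, coverage)
        cells.append(f"{low:.1f} to {high:.1f}")
    print(f"SigF {sigf}: " + " | ".join(cells))
```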

Comments:
Refer to the caveats on the Problem of the Month Page about the limitations of the following solution. Maybe you have a better idea on how to solve the problem. Maybe you find where I've screwed up the solution and you can point out my errors as you check my calculations. E-mail your comments, criticism, and corrections to Paul Barringer.


Last revised July 23, 2013
© Barringer & Associates, Inc. 2007