Reliability and Data


Dictionary definitions for the common man:

From Microsoft Encarta Reference Library 2005®, © 1993–2004 Microsoft Corporation:

 

  re·li·a·ble 

adjective

1.  dependable:  able to be trusted to do what is expected or has been promised
● She is extremely reliable and a hard worker.

2.  likely to be accurate:  able to be trusted to be accurate or correct or to provide a correct result
● I don’t think that clock is very reliable.


- re·li·a·bil·i·ty, noun
- re·li·a·ble·ness, noun
- re·li·a·bly, adverb

  da·ta

noun (takes either a singular or plural verb)

1.  factual information: information, often in the form of facts or figures obtained from experiments or surveys, used as a basis for making calculations or drawing conclusions.

2.  COMPUTING  information for computer processing: information, for example, numbers, text, images, and sounds, in a form that is suitable for storage in or processing by a computer.

_________________________________________________________________

Working definitions for reliability professionals:

  reliability

an engineering definition for reliability

1.     The duration or probability of failure-free performance under stated conditions.

2.     The probability that an item can perform its intended function for a specified interval under stated conditions.  (For non-redundant items this is equivalent to definition 1.  For redundant items this is equivalent to the definition of mission reliability.)  See MIL-HDBK-338, page 56.

a business definition for reliability

3.     Reliability is the probability that a device, system, or process will perform its prescribed duty without failure for a given time when operated correctly in a specified environment.  It costs money to achieve high reliability, and it costs money when things become unreliable; thus money is a big motivator for reliability.

 

  data 

an engineering definition for reliability data

1.     “Failure must be precisely defined in practice.  For dealings between producers and consumers, it is essential that the definition of a failure be agreed upon in advance to minimize disputes.  For many products, failure is catastrophic, and it is clear when failure occurs.  For some products, performance slowly degrades, and there is no clear end of life.  One can then define that a failure occurs when performance degrades below a specified value.  Of course, one can analyze data according to each of a number of definitions of failures.  One must decide whether time is calendar time or operating hours or some other measure of exposure, for example, the number of start-ups, miles traveled, energy output, cycles of operation, etc.  Also, one must decide whether to measure time of exposure starting at time of manufacture, time of installation, or whatever.  Engineers define failure and exposure.”

2.     “Most non-life data are complete; that is, the value of each sample unit is observed.  Such life data consist of the time to failure of each sample unit.  Much life data are incomplete.  That is, the exact failure times of some units are unknown, and there is only partial information on their failure times.”

3.     “Sometimes when life data are analyzed, some units are unfailed, and their failure times are known only to be beyond their present running times.  Such data are said to be censored on the right.  Unfailed units are called run-outs, survivors, removals, and suspended units.  Similarly, a failure time known only to be before a certain time is said to be censored on the left.  If all unfailed units have a common running time and all failure times are earlier, the data are said to be singly censored on the right.  Singly censored data arise when units are started on test together and the data are analyzed before all units fail.  Such data are singly time censored if the censoring time is fixed; then the number of failures in that fixed time is random.  Time censored data are also called Type I censored.  Data are singly failure censored if the test is stopped when a specified number of failures occurs, the time to that fixed number of failures being random.  Time censoring is more common in practice; failure censoring is more common in the literature, as it is mathematically more tractable.”  For the direct quote of engineering definitions 1, 2, and 3 see: Nelson, Wayne; Applied Life Data Analysis, John Wiley & Sons, New York, NY, ISBN: 0-471-09458-7, 1982, pages 6–7.

4.     “Measured data like age-to-failure data is much more precise because there is more information in each data point.  Measured data provides much better precision, so smaller sample sizes are acceptable.  Ideally, each Weibull plot depicts a single failure mode.  Data requirements are described by D.R. Cox [1984] who said to determine failure time precisely, there are three requirements:
       1) a time origin must be unambiguously defined,
       2) a scale for measuring the passage of time must be agreed to, and finally,
       3) the meaning of failure must be entirely clear. 
The age of each part is required, both failed and unfailed.  The units of age depend on the part usage and the failure mode, for example, low cycle and high cycle fatigue may produce cracks leading to rupture.  The age units would be fatigue cycles.  The age of a starter may be the number of starts.  Burner and turbine parts may fail as a function of time at high temperature or as the number of cold to hot to cold cycles.  Usually, knowledge of the physics-of-failure will provide the age scale.  When there is uncertainty, several age scales are tried to determine the best fit.  This is not difficult with good software.  The “best” aging parameter data may not exist and substitutes are tried.  For example, the only data on air conditioner compressors may be the date shipped and the date returned.  The “best” data, operating time or cycles, is unobtainable, so based on the dates above, a calendar interval is used as a substitute.  These inferior data will increase the uncertainty, but the resulting Weibull plot may still be accurate enough to provide valuable analysis.  The data fit will tell us if the Weibull is good enough.”  For the direct quote of engineering definition 4 see: Abernethy, Robert B.; The New Weibull Handbook, self-published by Dr. Robert B. Abernethy, 536 Oyster Road, North Palm Beach, Florida 33408-4328, 5th edition, ISBN-13: 978-0-9653062-3-2, 2006, pages 1–4.

a business definition for reliability data

5.     Failure data is required for reliability analysis.  Failure data is similar to tombstone data, i.e., the age “date” at failure (age-to-failure) minus the birth “date” (time zero for the time origin).  Failure data requires elapsed service time/cycles/wait-time/etc., using the units that motivate the failure.  Failure data includes actual failures (failures require practical criteria for defining what constitutes a failure) along with censored (suspended) data for items that are aging toward failure and have not yet failed.  Censored data also comes from the ages of items removed from service without failure, such as timed replacements from a preventive maintenance action, or replacements made as good maintenance practice: for example, when pump bearings are replaced for having reached a failure criterion such as excessive vibration or excessive heat, and a new pump seal is installed while the pump is disassembled, the seal’s life is censored whereas the bearings reached a failure criterion.  So you must record (a minimal record sketch follows this list):
      1)  The age-to-failure for each failed item, in the “time” units motivating the failures, along with the failure mode, and
      2)  The age-to-removal for each unfailed (censored or suspended) item, along with the reason for removal.
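
For illustration only, here is a minimal Python sketch (with hypothetical field names of my choosing, not a format from this page) of the two kinds of records the list above calls for; the pump bearing/seal example mirrors the censoring case described in the text:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class LifeRecord:
        item_id: str
        age: float                       # age in the units that motivate the failure
        age_units: str                   # e.g., "hours", "starts", "cycles"
        failed: bool                     # True = failure; False = censored/suspended
        failure_mode: Optional[str] = None    # record when failed is True
        removal_reason: Optional[str] = None  # record when failed is False

    records = [
        LifeRecord("pump-101 bearing", 8760.0, "hours", True,
                   failure_mode="excessive vibration"),
        LifeRecord("pump-101 seal", 8760.0, "hours", False,
                   removal_reason="replaced as good practice while pump was open"),
    ]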

Why two names, censored and suspended, for the same data?  Because of the litigious society in the United States (for example, it is claimed that the city of Houston has more lawyers than the entire nation of Japan!) and the large quantity of lawsuits, where lawyers use the word censored to conjure up
        secrecy,
        hiding information, and
        making some information unavailable,
which generates mistrust in the eyes of the jury.  Other lawyers use the word suspended as a more benign word, which avoids the mental taint in the eyes of the jury in trials.  See http://www.barringer1.com/reliability.htm

 

Notice that the definitions used by reliability professionals are far more detailed and far more complex than ordinary language usage!

 

We talk about reliability—the absence of failures.  However, we quantify reliability with failure data due to unreliability.  Reliability is the sweet side of the coin.  Unreliability is the sour side of the coin.

 

Unreliable equipment generates much failure data.  Highly reliable equipment generates little or no failure data.  Immediately we have a love/hate arrangement:  We want the failure data, but we hate to pay for the unreliability that goes with the data!  So, how many failures would you like to purchase out of your pocket to have all the data you need?

If you have failure data, what are you going to do with it?  You must convert the failure details into workable information for quantifying reliability and particularly the cost of unreliability, which goes into the long term cost of ownership.  Simply put:
        acquire age-to-failure data,
        give the data a voice so the facts quantify and tell about the failures, and
        find a way to decrease the costs of failures.

Many companies have much data in their maintenance systems, but since the data cannot speak, it thus conveys little information.  The data needs engineers who give the data a voice and make the failure data speak!  You need the data to speak for management/engineering/process metrics of key performance indicators (KPIs)—you can’t operate a modern business as Mr. and Ms. Clueless for very long!  We live in a fact-based environment and need clear data for making decisions.

 

Likewise, equipment suppliers have much data in their spare-parts sales systems for replacements sold (usually at very high prices, as a profit stream for their businesses), and that data needs a voice to speak to design engineers about the unreliability of their designs.  Businesses often have a specific spare-parts sales strategy:
        Some businesses expect to make their money from first sales, and their equipment is usually durable with few spare-parts sales, whereas
        Other businesses expect to make their money from spare-parts sales, as their equipment is less durable with many spare-parts sales.


Of course, some spare-parts sales in any business occur because of consumption of life following the natural law of entropy; for example, batteries always run down because of consumption of their energy, and they never run up to full charge.  Similarly, buildings and bridges always fall down by consumption of their life; they never fall up.  If you acquire any physical item, the first price paid is not the last price, as you must sustain the equipment over its life span.  Generally speaking, life cycle cost studies show the sustaining cost is 2 to 20 times the acquisition cost of most equipment and processes.

 

Some general concepts for data

Niederman and Boyum, in the 2004 paperback book What The Numbers Say: A Field Guide to Mastering Our Numerical World, give the following chart on page 48, which they call The Ten Habits of Highly Effective Quantitative Thinkers:

 

Ten Habits of Highly Effective
Quantitative Thinkers

Attitude Is Everything

1.  Only Trust Numbers  
(Be fact driven—hold back your emotions.)

2.  Never Trust Numbers 
(There are many reasons numbers are wrong—particularly with a biased agenda.)

Navigational Tools

3.  Play Jeopardy
(Phrase all answers as questions; the numbers are useful when considered as an answer.)

4.  Live by Pareto’s Laws
(Winners always solve the vital few problems with the major financial impact taking first priority.  Ignore the trivia till later.)

Illuminating Numbers

 5.  Play 20 Questions 
(Are we lying or misdirected about the numbers?  Ask the yes/no questions to test validity.)

 6.  Build Models 
(Build numerical models to simulate the numbers you’re trying to understand as the models help us gain insight.)

Uncertainty

 7.  Play the Odds 
(No numbers or results are certain!  Every decision has a consequence.  Judge decisions on how they were made and the numbers used for the decisions.)

 8.  Know What You Know and Don’t Know 
(View the data probabilistically.  Most data is imperfect and incomplete.  Look for the signal within the noise.)

Estimation

9.  Go Figure 
(Does the outcome of the numbers look reasonable?  Use common sense [which is an uncommon ability!!].  In the world of slide rules you had to ask whether the numbers [and the decimal places] were practical; in the world of computers, ask if the “precise” number makes sense or if it is automated nonsense.)

10. Look for the Easy Way Out 
(Find the easiest approach to a difficult problem.  Stay alert and inquisitive when it matters the most.)

The short meanings are shown in the parentheses.  Read the book for more insight and practical examples.  Keep these ten habits in mind as we look at the use of numbers for reliability below.

 

Don’t Be Fooled By Randomness

Everyone wants their data to be better behaved than Mother Nature can provide.  Nassim Taleb has an interesting 2004 paperback book titled Fooled By Randomness: The Hidden Role of Chance in Life and in the Markets. His points are: 
     1) We want to live in a deterministic world, but all around us are huge amounts of randomness and uncertainty,
     2) Our brains are fooled by the uncertainty of randomness, and
     3) We want to see signals of a deterministic, certain world where none exists, which leads to both false positives and false negatives. 

Of course, the randomness of chance always favors those who are prepared to seize the opportunity when a signal comes through the fog.  Taleb argues that most people underestimate the likelihood of seeing a “black swan,” an extreme, highly disruptive event that comes out of nowhere as a surprise. 

 

Many in the West believed all swans were white until European explorers discovered that black swans are native to Australia.  Taleb says that it takes the discovery of only one black swan to burst the bubble that all swans are white, and thus the old rule no longer holds. 

 

Taleb defines a black swan event as a highly improbable event with three key issues:

1)      It is unpredictable,

2)      It has a massive impact, and

3)      After the fact we concoct an explanation that makes it appear less random and more predictable than it was.

Taleb names black swan events after the era when a “black swan” was assumed not to exist.  In life, many people underestimate the odds that such rare events will or can occur.  As an example, consider the occasional bursting of the Wall Street bubble, where records are set for the Dow Jones stock index one week, followed by an enormous decline in the market.  Another example of a black swan event is the destruction of the twin towers in New York City on September 11, 2001. 

 

Black swan events occur in industry just as in the stock markets:  Smooth operation is suddenly ended with events such as vessels that explode or pipes that rupture from hidden effects of corrosion because no one went searching for them by means of an inspection.  The purpose of inspections is to discover what isn’t known.  We get into trouble by not learning what we didn’t know!

 

Life data for reliability has much randomness!  If you can clearly understand the message signal coming through the randomness and if you are prepared to deal with the probabilistic data, you can receive the signal that cuts through the noise of life data.  Knowledgeable practitioners of reliability are prepared to separate sense from the apparent nonsense.  When you can see the signal and you are prepared for the opportunity, then you can have success while others only see chaos and lack of success. 

 

Taleb has a table comparing terms that are commonly confused with one another, which he calls his Table of Confusion. 

Table of Confusion
(each item in the left column is commonly mistaken for the item on its right)

General
    Luck                      Skills
    Randomness                Determinism
    Probability               Certainty
    Belief, conjecture        Knowledge, certitude
    Theory                    Reality
    Anecdote, coincidence     Causality, law
    Forecast                  Prophecy

Market Performance
    Lucky idiot               Skilled investor
    Survivorship bias         Market outperformance

Finance
    Volatility                Return (or drift)
    Stochastic variable       Deterministic variable

Physics and Engineering
    Noise                     Signal

Literary Criticism
    None*                     Symbol

Philosophy of Science
    Epistemic probability     Physical probability
    Induction                 Deduction
    Synthetic proposition     Analytic proposition

General Philosophy
    Contingent                Certain
    Contingent                Necessary (in the Kripke sense)**
    Contingent                True in all possible worlds

* Literary critics do not seem to have a name for things they do not understand.
** “Necessary in the Kripke sense” refers to Saul Kripke’s possible-worlds semantics for modal logic, in which a necessary proposition is one that is true in every possible world.


The point of the table is simple:  We wander back and forth between the two columns in search of comments that justify what we want to believe.  We want the signal but we’re overwhelmed by the noise that comes with the signal.  This means we must have the ways and the means to cut through the fog of the noise to understand what’s really going on.  The fog cutter is making the data talk to avoid the sinkhole of thinking we know more than we actually do know by the use of myths and mysticism!

 

Mean Time Between Failures (MTBF)

The MTBF metric is the bottom of the “food chain.”  MTBF is the easiest data to acquire. 
                                    MTBF = (summation of life)/(summation of failures). 
The summation of life is the total of all life hours accumulated by every item (both items that have survived and items that have died); dividing by the total number of failures gives the MTBF.  Of course, this implies that early in the life of a component you can infer the trivial result of infinite life as you accumulate life hours without any failures!  Because the data is easy to acquire, be careful about doing the wrong thing just because the data is easy to acquire.  For example, if you have redundant equipment on which you alternate the service, do not count the idle life as if the equipment were operating, because that will make your MTBF look very large when, in fact, the idle equipment is not toting up life hours.  In short, do not “Enronize” your metrics just to look good and just because it’s easy!

 

Section 5 of MIL-HDBK-338 reminds us that MTBF only has meaning for repairable items and, for that case, MTBF represents exactly the same parameter as mean life, as it is based on the assumption of a constant failure rate.  In Appendix A of The New Weibull Handbook, Abernethy reminds us that the MTBF becomes asymptotic to the MTTF with many replacements.  Consider this ultra-simple example of MTBF for 5 items (components), each working 25 hours/month.  (For simplicity, assume repair time is very small.)


Table 1: Example of MTBF for 5 Items Each Working 25 Hours/month With Recorded Failures

Month --->      1      2      3      4       5      6       7       8      9
Item 1          0      0      1      0       0      0       1       1      0
Item 2          0      0      0      0       1      1       0       0      0
Item 3          0      0      0      1       0      0       0       1      0
Item 4          0      1      0      0       0      1       0       0      1
Item 5          0      0      0      0       1      0       0       0      0
Fail./mo.       0      1      1      1       2      2       1       2      1
Σ Hrs life      125    250    375    500     625    750     875     1000   1125
Σ Failures      0      1      2      3       5      7       8       10     11
MTBF*           **     250    187.5  166.67  125    107.14  109.38  100    102.27

* Units are hours/failure, which are often tersely reported as hours

** In month 1, the real answer is infinite because of division by zero; this is a trivial answer that changes with time as the MTBF approaches the MTTF asymptotically.
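
As a check on the arithmetic, here is a minimal Python sketch (mine, not from the source) that reproduces the Σ Hrs life, Σ Failures, and MTBF rows of Table 1 from the monthly failure counts:

    failures_per_month = [0, 1, 1, 1, 2, 2, 1, 2, 1]   # Fail./mo. row of Table 1

    hours = 0
    fails = 0
    for month, f in enumerate(failures_per_month, start=1):
        hours += 5 * 25     # all five items are credited with 25 hours every month
        fails += f
        mtbf = hours / fails if fails else float("inf")   # infinite until a failure
        print(f"Month {month}: {hours} hrs, {fails} failures, MTBF = {mtbf:.2f}")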

 

The steady-state value toward which the MTBF converges is known as the mean time to failure (MTTF).

 

Mean Time To Failure (MTTF)

This metric is one notch up the “food chain” for reliability statistics.  MTTF requires more careful data acquisition.  Also,

                                    MTTF = (summation of life)/(summation of failures)
but watch the catch shown in Table 2 below: once an item fails it accrues no more life, so the summation of life uses each item’s actual age, including partial months.

 

Table 2: Example of MTTF for 5 Items Each Working 25 Hours/month With Recorded Failures

Month --->      1      2         3         4         5         6–9
Item 1          0      0         1 (5)     No more life
Item 2          0      0         0         0         1 (1)     No more life
Item 3          0      0         0         1 (18)    No more life
Item 4          0      1 (10*)   No more life
Item 5          0      0         0         0         1 (20)    No more life
Fail./mo.       0      1         1         1         2
Σ Hrs life      125    235       315       383       404
Σ Failures      0      1         2         3         5
MTTF**          ***    235       157.5     127.67    80.8

* The age acquired during the month in which the failure occurred, shown in hours within the parentheses

** The units are hours/failure, which are often tersely reported as hours.

*** In month 1, the real answer is infinite because of division by zero; the MTTF then changes with time and quickly approaches the true answer of 80.8 hours/failure (as best we can understand with only 5 samples) under the assumption of chance failures.
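
A matching sketch (again mine, not from the source) reproduces the Σ Hrs life and MTTF rows of Table 2; the difference from the MTBF sketch is that a failed item accrues only its partial-month hours and then no more life:

    fail_month = {1: 3, 2: 5, 3: 4, 4: 2, 5: 5}           # month each item fails
    partial_hours = {1: 5, 2: 1, 3: 18, 4: 10, 5: 20}     # hours into that month

    hours = 0
    fails = 0
    for month in range(1, 6):
        for item in range(1, 6):
            if month < fail_month[item]:
                hours += 25                    # a full month of life
            elif month == fail_month[item]:
                hours += partial_hours[item]   # partial month, then the item is dead
                fails += 1
        mttf = hours / fails if fails else float("inf")
        print(f"Month {month}: {hours} hrs, {fails} failures, MTTF = {mttf:.2f}")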


Age To Failure

From Table 2, the ages-to-failure (in rank order) for each item at the end of month 5 are:

Item 4 = 35 hours

Item 1 = 55 hours

Item 3 = 93 hours

Item 2 = 101 hours

Item 5 = 120 hours

This data required the total elapsed time of almost 5 months to acquire. 

 

Weibull Analysis

About 85% to 95% of life data is adequately fit by a Weibull distribution, so, in general, you’re safe in playing the odds that your life data will be well represented by a Weibull distribution.  Always try the Weibull distribution first unless you know from the physics of failure that a different distribution is preferred.

 

A Weibull distribution measures the weakest link in the chain, and Item 4 was the weakest link as it failed at the youngest age, followed by Item 1 and so forth.  The suspended/censored data will be included in each Weibull probability plot along with the ages-to-failure to produce a regressed line fit with a slope beta (β) and a characteristic life eta (η), which occurs at the 63.2% probability of failure.  Life distributions are rarely symmetrical curves as described by the bell-shaped normal or Gaussian curve.  They are usually skewed curves with long tails to the right for most life data.  The Weibull distribution can handle tailed data. 

 

If we know the Weibull eta (sort of like knowing the Gaussian x-bar), then all fitted lines pass through the mathematical point of eta at 63.2% unreliability, regardless of the Weibull beta (sort of like knowing the Gaussian 1/sigma).  Thus the eta value is known as the characteristic life.  Furthermore, when β = 1 (chance failures), the eta value is also the MTTF for components.
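
For reference, the standard two-parameter Weibull cumulative distribution function (the textbook form, not quoted from this page) is

            F(t) = 1 − exp[−(t/η)^β],

and F(η) = 1 − e^(−1) = 0.632 for any β, which is why every fitted line passes through eta at the 63.2% unreliability level.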

 

Using the entire age-to-failure dataset, we can make the following Weibull probability plot, using WinSMITH Weibull software to automate the grunt work of ranking each data point, calculating a median rank plotting position, and performing the regression.  From the Weibull plot of component failures we can learn the failure mode.  In this case, the beta value indicates the failure mechanism is a wear-out failure mode, not the assumed chance failure mode driven by random events!  For components and individual failure modes, the beta value has a physical interpretation when the data is separated by component (these conclusions do not hold for mixtures of failure modes):

            β < 1 infers infant mortality, where failure rates decline with time/usage,

            β ≈ 1 infers chance failures, where failure rates are constant with time/usage, and

            β > 1 infers wear-out failures, where failure rates increase with time/usage.

So, Weibull plots of components with the same failure modes can be very instructive in explaining how failures are occurring by interpreting the beta values so that you can apply the correct “medicine” to solve the problem.  Interpreting the eta values can tell you how long they will survive.  Figure 1 shows a typical Weibull plot with some explanations.

 

Figure 1: A Weibull Probability Plot
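
As an aside, here is a minimal Python sketch of textbook median rank regression (my illustration, not WinSMITH’s actual code) applied to the five ages-to-failure from Table 2; it lands near the β = 2.081 and η = 93.45 values reported with Figure 7 below:

    import math

    ages = [35, 55, 93, 101, 120]        # ranked ages-to-failure from Table 2
    n = len(ages)

    # Benard's median rank approximation for the plotting position of rank i
    F = [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]

    # Weibull plot coordinates: x = ln(-ln(1 - F)), y = ln(age).
    # On these axes a Weibull sample falls on a straight line with slope 1/beta
    # when life (y) is regressed onto rank (x), the handbook convention.
    xs = [math.log(-math.log(1.0 - f)) for f in F]
    ys = [math.log(t) for t in ages]

    mx = sum(xs) / n
    my = sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

    beta = 1.0 / slope
    eta = math.exp(my - slope * mx)      # x = 0 is the 63.2% level, so t there is eta

    print(f"beta = {beta:.3f}, eta = {eta:.1f} hours")   # about 2.09 and 93.2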

 

Remember that when β = 1, η = MTTF.  For other values of beta, MTTF = η·Γ(1 + 1/β), where Γ(1 + 1/β) is known as the gamma function, which is shown in Figure 2.
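
As an arithmetic check using the Figure 7 values reported below (β = 2.081, η = 93.45):

            MTTF = η·Γ(1 + 1/β) = 93.45·Γ(1.4805) ≈ 93.45·0.8857 ≈ 82.8 hours/failure,

which agrees with the MTTF = 82.77 hours/failure quoted with Figure 7.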

Figure 2:  The Gamma Function Versus Weibull Beta Values

 

Let’s see what we can learn from Weibull analysis at the end of each month.

            End of month 1:

25 hours of suspended life on each of the 5 items (5 suspension data points).

Assume β = 1 (chance failures), remembering that this is the underlying assumption for Table 1.  If we combine Weibull analysis with Bayesian analysis, we get a Weibayes solution (see The New Weibull Handbook for details) that produces a probability plot from the suspensions alone, without any failure points, as shown in Figure 3.

 

Figure 3: At the End of Month 1, Beta = 1, Eta = 180.34, and MTTF = 180.34 hours/failure


            End of month 2:

One failure at age 35 hours, and

Four suspensions at age 50 hours.

Again, assume β = 1 (random failures) for a second Weibayes estimate.

 

Figure 4: At the End of Month 2, Beta = 1, Eta = 140.014, and MTTF = 140.01 hours/failure
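
The eta values in Figures 3 and 4 are consistent with a 50%-confidence Weibayes estimate of the form η = 2·(total accumulated time)/χ²₀.₅₀(2r + 2), where r is the number of failures; this is my reading of the numbers, not a formula quoted from this page.  A minimal sketch, assuming SciPy is available:

    from scipy.stats import chi2

    def weibayes_eta(total_time: float, failures: int) -> float:
        """50%-confidence Weibayes characteristic life, assuming beta = 1."""
        # eta = 2 * (total accumulated life) / chi-square median with 2r + 2 dof;
        # with zero failures this reduces to total_time / ln(2).
        return 2.0 * total_time / chi2.ppf(0.5, 2 * failures + 2)

    # End of month 1: five suspensions at 25 hours each, no failures yet.
    print(weibayes_eta(5 * 25, 0))       # ~180.3 hours, matching Figure 3
    # End of month 2: one failure at 35 hours plus four suspensions at 50 hours.
    print(weibayes_eta(35 + 4 * 50, 1))  # ~140.0 hours, matching Figure 4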

 

            End of month 3:

One failure at age 35 hours,

One failure at age 55 hours, and

Three suspensions at age 75 hours.

 

Figure 5: At The End Of Month 3, Beta =2.216, Eta = 85.30 and MTTF = 75.5 hours/failure


                End of month 4:

One failure at age 35 hours,

One failure at age 55 hours,

One failure at 93 hours, and

Two suspensions at age 100 hours.

 

Figure 6: At the End of Month 4, Beta = 1.689, Eta = 108.391, and MTTF = 96.76 hours/failure


            End of month 5:

One failure at age 35 hours,

One failure at age 55 hours,

One failure at age 93 hours, 

One failure at age 101 hours, and

One failure at age 120 hours.

Of course at the end of month 5, all items have failed.

 

Figure 7: At the End of Month 5, Beta = 2.081, Eta = 93.45, and MTTF = 82.77 hours/failure

 

The Weibull plot says the likely failure mode is wear-out (not chance failure), as beta is greater than 1, and the MTTF = Θ = 82.77 hours/failure.  The P-value estimate for goodness of fit, at 76.74%, greatly exceeds the required 10% minimum value.  The n/s notation tells us we have 5 data points with 0 suspensions.

 

For Figure 7, if we put 90% confidence limits around the trend line, we learn that the true values for such a small dataset may lie between
            60.755 ≤ η ≤ 152.7, with our best single-point estimate at η = 93.45 hours,
            0.881 ≤ β ≤ 4.116, with our best single-point estimate at β = 2.081, and
            49.506 ≤ Θ ≤ 129.028, with our best single-point estimate at Θ = 82.772 hours/failure.

Of course the small number of data points contributes to the large uncertainty.  If you want to be more sure about the test or operating results, then you need more data.  (If you are a manufacturer, you usually acquire your data from carefully controlled tests conducted at great expense.  If you are the end user of the equipment, you acquire your data from in-service operation, which is the real-world environment, but the conditions surrounding the operation are not easily controlled; the operating data is acquired as an adjunct of daily operation, and its cost is observed as a maintenance activity.)  How many more failure data points would you like to purchase if the cost per item was, say, $15,000, with the maximum consequence of a failure at, say, $20,000, including procurement/installation of the replacement item?  Hmm, just as I thought: you don’t want to spend the money out of your own pocket.  Sorry, but you can’t have it both ways!

 

What’s the bottom line? 

            1.  Arithmetic MTBF calculations require a long time to converge to the correct answer. 
            2.  Arithmetic MTTF calculations converge faster than MTBF calculations. 

            3.  Weibull analysis gets on target the fastest; we even get into the ballpark in the first
                 month, without any failures, by using a Weibayes technique.

            4.  You’ll never have enough data to be really sure about your conclusions.

 

Other facts of life: 

  1. You’ll never live long enough OR have enough money to get exact results during your industrial career; you need the collective efforts of your colleagues to build a Weibull database for survival against your competitors.  Your industrial life is going to be a result of estimates and extrapolations.  In industry, you can’t wait for perfect results.  You need to start in the right direction quickly and plan on open-field running to survive.  Perfect plans and perfect data exist only in the eyes of the perfectionist, and perfectionists usually don’t work in industry.
  2. Use Weibull analysis for quicker results.  You’ll never have enough money to afford all the data you desire.  Put your results into your company’s Weibull database so others can access and improve on your datasets.
  3. You must record your data correctly.  Include the ages-to-failure.  Include the physical failure analysis for reasons of failure from your root-cause details.  This is why it is very important for your maintenance craftsmen to “talk” directly to the computer and record what they see when they have their hands in the “blood and guts” of the repairs; after all, they will tell the computer, in an anonymous fashion, more details than they will ever tell the engineers.
  4. You must have a method for recording suspended data along with the reasons for removal.  This means you must train engineers in how reliability data is correctly recorded AND you must also train your craftsmen to consistently obtain both failure data and suspensions.  Suspensions occur because of different failure modes, parts replaced by good maintenance practices before failures occur, and parts replaced by preventive maintenance actions to avoid the high cost of expensive failures.
  5. At end of life of the components, you MUST do an autopsy on the expensive failed system to record the actual failure data along with the benign failures.  End-of-life replacements provide THE most important data you will ever record.  (Of course the ignorant and naïve will always avoid spending the pennies to include this important information, which would save future dollars, because the system is dead and “what could we possibly learn!”)
  6. Always plan to learn something from failures; after all, you paid good money to acquire the failure data!  Benefit from your bad experiences to avoid relearning the same information somewhere down the road, where the costs are higher and the consequences greater.

 

Optimum Replacement Based On Weibull Analysis

Suppose we take the Weibull results from Figure 7 with beta = 2.081, eta = 93.45, and MTTF = 82.77 hours/failure.  Assume the cost of a planned replacement is $15,000 and the cost for an unplanned replacement is $45,000 (the higher cost is due to collateral damage from the failure along with loss of gross margin from lost sales). 

 

Should we plan to do preventive maintenance activities to replace the component before it fails or should we plan on running the component to failure?  If we need to do a planned replacement as a PM activity, when should we do the replacement and what is the cost per hour of operation?

 

Hint:

  1. We have a wear-out failure mode with beta = 2.081 and we have a 3:1 cost factor for avoiding the high cost of an unplanned maintenance event.  Both suggest that perhaps we have a potential for a PM activity.  If so, when should we plan to do the replacement?
  2. How would we make the calculation?  This is an easy decision: we would employ Glasser’s optimum replacement equation, which is valid for the first round of replacements, so the calculations are truncated after reaching the characteristic life at 63.2% unreliability.  A simpler method is to use WinSMITH Weibull software, inputting the Weibull beta and eta along with the cost details for a failure, which results in a table of costs; WinSMITH Visual will then produce the cost details shown in Figure 8.  (A numerical sketch of the underlying cost-rate calculation follows below.)
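
For illustration, here is a minimal Python sketch of the standard age-replacement cost model (my formulation, not Glasser’s exact equation or WinSMITH’s code): the expected cost per operating hour for a planned replacement at age t is [Cp·R(t) + Cu·F(t)] divided by the mean operating time per replacement cycle, the integral of R(s) from 0 to t.  Minimizing it numerically with the Figure 7 Weibull parameters lands near the 67-hour optimum shown in Figure 8:

    import math

    beta, eta = 2.081, 93.45        # Weibull parameters from Figure 7
    Cp, Cu = 15_000.0, 45_000.0     # planned vs. unplanned replacement cost

    def R(t: float) -> float:
        """Weibull reliability (survival) function."""
        return math.exp(-((t / eta) ** beta))

    def cost_rate(t: float, steps: int = 1000) -> float:
        """Expected cost per operating hour with planned replacement at age t."""
        dt = t / steps
        # Trapezoidal integral of R(s) from 0 to t: mean operating hours per cycle.
        mean_cycle = dt * (0.5 * (R(0.0) + R(t)) + sum(R(i * dt) for i in range(1, steps)))
        return (Cp * R(t) + Cu * (1.0 - R(t))) / mean_cycle

    best_t = min(range(10, 150), key=cost_rate)   # scan candidate ages in hours
    print(f"optimum replacement ~ {best_t} hours at ~${cost_rate(best_t):,.0f}/hour")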

Figure 8:  Optimum Replacement Calculation

 

Figure 8 shows that the optimum replacement interval of 67 hours results in the least cost; with service of 25 hours/month, this means that roughly every three months you will need to make a planned replacement.  However, keep in mind that the penalty for letting the part run to failure is not too severe.

 

The real issue to solve for this case is not the replacement interval but why we are using a component with such a short life given the high costs of replacement.  A more durable component may be a better answer!!!  Of course, when you have the data, the issues are much clearer than without the facts.  Remember the old black-and-white TV program called Dragnet, where one of the tag lines from Sergeant Friday (the detective) was “The facts, ma’am, just the facts!”  The facts are your data, and you must give the data a voice to solve your problems.

Comments:
Refer to the caveats on the Problem of the Month page about the limitations of this solution.  Maybe you have a better idea on how to solve the problem, or maybe you will find where I’ve screwed up the solution and can point out my errors as you check my calculations.  E-mail your comments, criticism, and corrections to Paul Barringer.



Last revised 11/28/2007
© Barringer & Associates, Inc. 2007