Reliability and Data


Dictionary definitions for the common man:

From Microsoft Encarta Reference Library 2005®, ©1993-2004 Microsoft Corporation:

 

  re·li·a·ble 

adjective

1.  dependable:  able to be trusted to do what is expected or has been promised
● She is extremely reliable and a hard worker.

2.  likely to be accurate:  able to be trusted to be accurate or correct or to provide a correct result
● I don’t think that clock is very reliable.


- re·li·a·bil·i·ty, noun
- re·li·a·ble·ness, noun
- re·li·a·bly, adverb

  da·ta

noun (takes either a singular or plural verb)

1.  factual information: information, often in the form of facts or figures obtained from experiments or surveys, used as a basis for making calculations or drawing conclusions.

2.  COMPUTING  information for computer processing: information, for example, numbers, text, images, and sounds, in a form that is suitable for storage in or processing by a computer.

_________________________________________________________________

Working definitions for reliability professionals:

  reliability

an engineering definition for reliability

1.     The duration or probability of failure-free performance under stated conditions.

2.     The probability that an item can perform its intended function for a specified interval under stated conditions.  (For non-redundant items this is equivalent to definition 1.  For redundant items this is equivalent to definition of mission reliability.)  See MIL-HDBK-338, page 56.

a business definition for reliability

3.     Reliability is the probability that a device, system, or process will perform its prescribed duty without failure for a given time when operated correctly in a specified environment.  It costs money to achieve high reliability, and it costs money when things become unreliable; thus money is a big motivator for reliability.

 

  data 

an engineering definition for reliability data

1.     “Failure must be precisely defined in practice.  For dealings between producers and consumers, it is essential that the definition of a failure be agreed upon in advance to minimize disputes.  For many products, failure is catastrophic, and it is clear when failure occurs.  For some products, performance slowly degrades, and there is no clear end of life.  One can then define that a failure occurs when performance degrades below a specified value.  Of course, one can analyze data according to each of a number of definitions of failures.  One must decide whether time is calendar time or operating hours or some other measure of exposure, for example, the number of start-ups, miles traveled, energy output, cycles of operation, etc.  Also, one must decide whether to measure time of exposure starting at time of manufacture, time of installation, or whatever.  Engineers define failure and exposure.”

2.     “Most non-life data are complete; that is, the value of each sample unit is observed.  Such life data consist of the time to failure of each sample unit.  Much life data are incomplete.  That is, the exact failure times of some units are unknown, and there is only partial information on their failure times.”

3.     “Sometimes when life data are analyzed, some units are unfailed, and their failure times are known only to be beyond their present running times.  Such data are said to be censored on the right.  Unfailed units are called run-outs, survivors, removals, and suspended units.   Similarly, a failure time known only to be before a certain time is said to be censored on the left.  If all unfailed units have a common running time and all failure times are earlier, the data are said to be singly censored on the right.  Singly censored data arise when units are started on test together and the data are analyzed before all units fail.  Such data are singly time censored if the censoring time is fixed; then the number of failures in that fixed time is random.  Time censored data are also called Type I censored.  Data are singly failure censored if the test is stopped when a specified number of failures occurs, the time to that fixed number of failures being random.  Time censoring is more common in practice; failure censoring is more common in the literature, as it is mathematically more tractable.”  For the direct quote of engineering definitions 1, 2, and 3 see: Nelson, Wayne; Applied Life Data Analysis, John Wiley & Sons, New York, NY, ISBN: 0-471-09458-7, 1982, pages 6–7.

4.     “Measured data like age-to-failure data is much more precise because there is more information in each data point.  Measured data provides much better precision, so smaller sample sizes are acceptable.  Ideally, each Weibull plot depicts a single failure mode.  Data requirements are described by D.R. Cox [1984] who said to determine failure time precisely, there are three requirements:
       1) a time origin must be unambiguously defined,
       2) a scale for measuring the passage of time must be agreed to, and finally,
       3) the meaning of failure must be entirely clear. 
The age of each part is required, both failed and unfailed.  The units of age depend on the part usage and the failure mode, for example, low cycle and high cycle fatigue may produce cracks leading to rupture.  The age units would be fatigue cycles.  The age of a starter may be the number of starts.  Burner and turbine parts may fail as a function of time at high temperature or as the number of cold to hot to cold cycles.  Usually, knowledge of the physics-of-failure will provide the age scale.  When there is uncertainty, several age scales are tried to determine the best fit.  This is not difficult with good software.  The “best” aging parameter data may not exist and substitutes are tried.  For example, the only data on air conditioner compressors may be the date shipped and the date returned.  The “best” data, operating time or cycles, is unobtainable, so based on the dates above, a calendar interval is used as a substitute.  These inferior data will increase the uncertainty, but the resulting Weibull plot may still be accurate enough to provide valuable analysis.  The data fit will tell us if the Weibull is good enough.”  For direct quote of engineering definition 4 see: Abernethy, Robert B.; The New Weibull Handbook, self-published by Dr. Robert B. Abernethy, 536 Oyster Road, North Palm Beach, Florida 33408-4328, 5th edition, ISBN-13: 978-0-9653062-3-2, 2006, pages 1–4.
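To make the censoring vocabulary in definition 3 concrete, here is a minimal Python sketch.  The failure ages are invented for illustration, the function names are hypothetical, and the label "Type II" for failure censoring is the common textbook name rather than something stated in the quote above.  The sketch assumes all units start on test together and that no two units fail at exactly the same age.

```python
def censor_type_i(failure_ages, censor_time):
    """Time (Type I) censoring: the test stops at a fixed time.

    Returns (age, failed) pairs.  Units still running at censor_time are
    recorded as suspensions at that age; the failure count is whatever
    happened to occur by then.
    """
    return [(min(age, censor_time), age <= censor_time) for age in failure_ages]


def censor_type_ii(failure_ages, n_failures):
    """Failure (Type II) censoring: the test stops at the n-th failure.

    The number of failures is fixed in advance; the stopping time is
    whatever age the n-th failure happens to occur at.
    """
    if not 1 <= n_failures <= len(failure_ages):
        raise ValueError("n_failures must be between 1 and the sample size")
    stop_time = sorted(failure_ages)[n_failures - 1]
    return [(min(age, stop_time), age <= stop_time) for age in failure_ages]


# Hypothetical true failure ages (hours) for six units started together.
true_ages = [120.0, 340.0, 410.0, 560.0, 900.0, 1500.0]

type_i = censor_type_i(true_ages, censor_time=500.0)
type_ii = censor_type_ii(true_ages, n_failures=4)

for label, data in (("Type I ", type_i), ("Type II", type_ii)):
    failures = sum(1 for _, failed in data if failed)
    print(label, "failures:", failures, "suspensions:", len(data) - failures)
```

In both cases every unfailed unit carries the same running time, which is what makes the data singly censored on the right.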

a business definition for reliability data

5.     Failure data is required for reliability analysis.  Failure data is similar to tombstone data, i.e., the age “date” at failure (age-to-failure) minus the birth “date” (time zero for the time origin).  Failure data requires elapsed service time/cycles/wait-time/etc., using the units that motivate the failure.  Failure data includes actual failures (failures require practical criteria for defining what is a failure) along with censored (suspended) data for items that are aging toward failure and have not yet failed.  Censored data also comes from the age of items that have been removed from service without failure, such as occurs with timed replacement from a preventive maintenance action.  Censored data also includes items that have been removed from service without failure as a result of a good maintenance practice.  For example, pump bearings may be replaced for having reached a failure criterion such as excessive vibration or excessive heat, and while the pump is disassembled a new pump seal is installed as a good maintenance practice.  Thus the pump seal life is censored, whereas the bearing or bearings have reached a failure criterion.  So you must record:
      1)  The age-to-failure for the failed item in “time” units motivating the failures and identify the failure mode, and
      2)  The age-to-removal for unfailed items for censored or suspended items along with reasons for the removals.
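One way to hold the two kinds of records described above is sketched below in Python.  The class name, field names, and pump examples are assumptions made for illustration; they are not taken from any particular maintenance system.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LifeRecord:
    """One age record, failed or suspended (hypothetical layout)."""
    item_id: str
    age: float                            # in the units that motivate failure
    failed: bool                          # True = failure, False = suspension
    failure_mode: Optional[str] = None    # required for failures
    removal_reason: Optional[str] = None  # required for suspensions

    def __post_init__(self) -> None:
        if self.age < 0:
            raise ValueError("age cannot be negative")
        if self.failed and not self.failure_mode:
            raise ValueError("a failure needs a failure mode")
        if not self.failed and not self.removal_reason:
            raise ValueError("a suspension needs a removal reason")


def split_records(
    records: List[LifeRecord],
) -> Tuple[List[LifeRecord], List[LifeRecord]]:
    """Separate failures from suspensions; both are needed for analysis."""
    failures = [r for r in records if r.failed]
    suspensions = [r for r in records if not r.failed]
    return failures, suspensions


# Invented records (ages in operating hours) echoing the pump example.
records = [
    LifeRecord("pump-A bearing", 14200.0, True,
               failure_mode="excessive vibration"),
    LifeRecord("pump-A seal", 14200.0, False,
               removal_reason="replaced while pump was disassembled"),
    LifeRecord("pump-B seal", 8760.0, False,
               removal_reason="timed preventive replacement"),
    LifeRecord("pump-C seal", 5100.0, False,
               removal_reason="still in service"),
]

failures, suspensions = split_records(records)
print(len(failures), "failure(s),", len(suspensions), "suspension(s)")
```

Rejecting a record that lacks a failure mode or a removal reason is a design choice in this sketch: it forces the two items the text asks for to be captured when the record is created rather than reconstructed later.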

Why two names–censored/suspended–for the same data?  Because of the litigious society in the United States (for example, it is claimed that we have more lawyers in the city of Houston than the entire nation of Japan!), and the large quantity of lawsuits where lawyers use the word censored to conjure up
        secrecy,
        hiding information, and
        making some information unavailable,
which generates mistrust in the eyes of the jury.  Other lawyers use the word suspended as a more benign word, which avoids the mental taint in the eyes of the jury in trials.  See http://www.barringer1.com/reliability.htm

 

Notice that the definitions for reliability issues are far more detailed and far more complex than those of ordinary language usage!

 

We talk about reliability—the absence of failures.  However, we quantify reliability with failure data due to unreliability.  Reliability is the sweet side of the coin.  Unreliability is the sour side of the coin.

 

Unreliable equipment generates much failure data.  Highly reliable equipment generates little or no failure data.  Immediately we have a love/hate arrangement:  We want the failure data, but we hate to pay for the unreliability that goes with the data!  So, how many failures would you like to purchase out of your pocket to have all the data you need?

If you have failure data, what are you going to do with it?  You must convert the failure details into workable information for quantifying reliability and particularly the cost of unreliability, which goes into the long term cost of ownership.  Simply put:
        acquire age-to-failure data,
        give the data a voice so the facts quantify and tell about the failures, and
        find a way to decrease the costs of failures.

Many companies have much data in their maintenance systems, but because the data cannot speak, it conveys little information.  The data needs engineers who give the data a voice and make the failure data speak!  You need the data to speak for management/engineering/process metrics of key performance indicators (KPIs); you can’t operate a modern business as Mr. and Ms. Clueless for very long!  We live in a fact-based environment and need clear data for making decisions.

 

Likewise equipment suppliers have much data in their spare-parts sales systems for replacements sold (usually at very high prices as a profit stream for their businesses) and the data needs to have a voice to speak to design engineers about the unreliability of their designs.  Businesses often have a specific spare-parts sales strategy:
        some businesses expect to make their money from first sales, and the equipment is usually durable with few spare-parts sales, whereas
        other businesses expect to make their money from spare-parts sales, as the equipment is less durable with many spare-parts sales.


Of course, some spare-parts sales in any business occur because of consumption of life following the natural law of entropy.  For example, batteries always run down because of consumption of their energy; they never run up to full charge.  Similarly, buildings and bridges always fall down by consumption of their life; they never fall up.  If you acquire any physical item, the first price paid is not the last price, as you must sustain the equipment over its life span.  Generally speaking, life cycle cost studies show the sustaining cost is 2 to 20 times the acquisition cost of most equipment and processes.
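As a rough illustration of the 2-to-20 rule of thumb, the Python sketch below brackets the total cost of ownership for a given purchase price.  The acquisition cost of 100,000 is a made-up figure, the function name is hypothetical, and treating total cost as simply acquisition plus sustaining cost is a simplifying assumption.

```python
def life_cycle_cost_range(acquisition_cost, low_multiple=2.0, high_multiple=20.0):
    """Bracket total ownership cost as acquisition plus sustaining cost.

    The default multiples are the 2x and 20x bounds quoted in the text.
    """
    if acquisition_cost < 0:
        raise ValueError("acquisition_cost cannot be negative")
    low_total = acquisition_cost * (1.0 + low_multiple)
    high_total = acquisition_cost * (1.0 + high_multiple)
    return low_total, high_total


low, high = life_cycle_cost_range(100_000.0)
print(f"Total cost of ownership: {low:,.0f} to {high:,.0f}")
```

Even at the low end of the range, the purchase price is only a third of what the item eventually costs.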

 

Some general concepts for data

Niederman and Boyum in the 2004 paperback book What The Numbers Say: A Field Guide to Mastering Our Numerical World give the following chart on page 48 that they call The Ten Habits of Highly Effective Quantitative Thinkers:

 

Ten Habits of Highly Effective
Quantitative Thinkers

Attitude Is Everything

1.  Only Trust Numbers  
(Be fact driven—hold back your emotions.)

2.  Never Trust Numbers 
(There are many reasons numbers are wrong—particularly with a biased agenda.)

Navigational Tools

3.  Play Jeopardy
(Phrase all answers as questions; the numbers are useful when considered as an answer.)

4.  Live by Pareto’s Laws
(Winners always solve the vital few problems with the major financial impact taking first priority.  Ignore the trivia till later.)

Illuminating Numbers

 5.  Play 20 Questions
(Are we lying or misdirected about the numbers?  Ask the yes/no question for validity.)

 6.  Build Models 
(Build numerical models to simulate the numbers you’re trying to understand as the models help us gain insight.)

Uncertainty

 7.  Play the Odds 
(No numbers or results are certain!  Every decision has a consequence.  Judge decisions on how they were made and the numbers used for the decisions.)

 8.  Know What You Know and Don’t Know
(View the data probabilistically.  Most data is imperfect and incomplete.  Look for the signal within the noise.)

Estimation

9.  Go Figure 
(Does the outcome of the numbers look reasonable?  Use common sense [which is an uncommon ability!!].  In the world of slide rules you had to ask if the numbers [and decimals] were practical; in the world of computers, ask if the “precise” number makes sense or if it is automated nonsense.)

10. Look for the Easy Way Out
(Find the easiest approach to a difficult problem.  Stay alert and inquisitive when it matters the most.)

The short meanings are shown in the parentheses.  Read the book for more insight and practical examples.  Keep these ten habits in mind as we look at the use of numbers for reliability below.
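Habit 4 lends itself to a short calculation.  The Python sketch below ranks failure modes by annual cost and flags the vital few.  The failure modes, the dollar figures, the function name, and the 80 percent cut-off are all hypothetical choices made for illustration.

```python
def pareto_rank(costs, vital_share=0.80):
    """Rank items by cost, largest first, and flag the vital few.

    An item is flagged vital if the cumulative share of the items ranked
    ahead of it has not yet reached vital_share.
    Returns rows of (name, cost, cumulative_share, is_vital).
    """
    total = sum(costs.values())
    if total <= 0:
        raise ValueError("total cost must be positive")
    ranked = sorted(costs.items(), key=lambda kv: kv[1], reverse=True)
    rows = []
    cumulative = 0.0
    for name, cost in ranked:
        is_vital = cumulative / total < vital_share
        cumulative += cost
        rows.append((name, cost, cumulative / total, is_vital))
    return rows


# Invented annual cost of unreliability by failure mode (dollars).
annual_cost = {
    "gasket leaks": 20_000,
    "seal leaks": 420_000,
    "instrument faults": 60_000,
    "bearing failures": 310_000,
    "coupling misalignment": 90_000,
}

rows = pareto_rank(annual_cost)
for name, cost, share, is_vital in rows:
    flag = "vital few" if is_vital else "trivial many"
    print(f"{name:<24}{cost:>10,}{share:>8.1%}  {flag}")
```

With these invented figures, two of the five failure modes carry about 81 percent of the cost, so they would get first priority.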

 

Don’t Be Fooled By Randomness

Everyone wants their data to be better behaved than Mother Nature can provide.  Nassim Taleb has an interesting 2004 paperback book titled Fooled By Randomness: The Hidden Role of Chance in Life and in the Markets. His points are: 
     1) We want to live in a deterministic world, but all around us are huge amounts of randomness and uncertainty,
     2) Our brains are fooled by the uncertainty of randomness, and
     3) We want to see signals of a deterministic and certain world where none exist, which leads to both false positives and false negatives. 

Of course, randomness of chance always favors those who are prepared to seize the opportunity when a signal comes through the fog.  Taleb argues that most people underestimate the likelihood of seeing a “black swan” which is an extreme, highly disruptive event that comes out of nowhere to surprise. 

 

Many in the West believed all swans were white until European explorers discovered that black swans are native to Australia.  Taleb says that it only takes discovery of one black swan to burst the bubble that all swans are white and, thus, old rules no longer take precedence. 

 

Taleb defines a black swan event as a highly improbable event with three key issues:

1)      It is unpredictable

2)      It has a massive impact, and

3)      After the fact we concoct an explanation that makes it appear less random and more predictable than it was.

Taleb uses the term “black swan” the way people once did, when the phrase implied that black-colored swans did not exist.  In life, many people underestimate the odds that such rare events can and will occur.  As an example, consider the occasional bursting of the Wall Street bubble, where records are set for the Dow Jones stock index one week, followed by an enormous decline in the market.  Another example of a black swan event is the destruction of the twin towers in New York City on September 11, 2001. 

 

Black swan events occur in industry just as in the stock markets:  Smooth operation is suddenly ended with events such as vessels that explode or pipes that rupture from hidden effects of corrosion because no one went searching for them by means of an inspection.  The purpose of inspections is to discover what isn’t known.  We get into trouble by not learning what we didn’t know!

 

Life data for reliability has much randomness!  If you can clearly understand the message signal coming through the randomness and if you are prepared to deal with the probabilistic data, you can receive the signal that cuts through the noise of life data.  Knowledgeable practitioners of reliability are prepared to separate sense from the apparent nonsense.  When you can see the signal and you are prepared for the opportunity, then you can have success while others only see chaos and lack of success. 

 

Taleb has a table for comparing the distinctions between number complexity and simplicity, which he calls his Table of Confusion. 

Table of Confusion

General
        Luck                          Skills
        Randomness                    Determinism
        Probability                   Certainty
        Belief, conjecture            Knowledge, certitude
        Theory                        Reality
        Anecdote, coincidence         Causality, law
        Forecast                      Prophecy

Market Performance
        Lucky idiot                   Skilled investor
        Survivorship bias             Market outperformance

Finance
        Volatility                    Return (or drift)
        Stochastic variable           Deterministic variable

Physics and Engineering
        Noise                         Signal

Literary Criticism
        None*                         Symbol

Philosophy of Science
        Epistemic probability         Physical probability
        Induction                     Deduction
        Synthetic proposition         Analytic proposition

General Philosophy
        Contingent                    Certain
        Contingent                    Necessary (in the Kripke sense)**
        Contingent                    True in all possible worlds

* Literary critics do not seem to have a name for things they do not understand.
**Wikipedia says: A Kripke structure is a type of nondeterministic finite state machine used in model checking to represent the behavior of a system.


The point of the table is simple:  We wander back and forth between the two columns in search of comments that justify what we want to believe.  We want the signal but we’re overwhelmed by the noise that comes with the signal.  This means we must have the ways and the means to cut through the fog of the noise to understand what’s really going on.  The fog cutter is making the data talk to avoid the sinkhole of thinking we know more than we actually do know by the use of myths and mysticism!

 

Mean Time Between Failures (MTBF)

The MTBF metric is the bottom of the “food chain.”  MTBF is the easiest data to acquire. 
                                    MTBF = (summation of life)/(summation of failures). 
The summation of life is the total of all life hours acquired (both things that have lived and things that have died); MTBF is that total divided by all failures that have occurred.  Of course, this implies that early in the life of a component, as you accumulate life hours without any failures, you can infer the trivial situation of infinite life!  Because the data is easy to acquire, be careful not to do the wrong thing just because it is easy.  For example, if you have redundant equipment on which you alternate the service, do not count the idle life as if the equipment were operating, because that will make your MTBF look very large when, in fact, the idle equipment is not toting up life hours.  In short, do not “Enronize” your metrics just to look good and just because it’s easy!
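The MTBF arithmetic and the warning about idle hours can be sketched in a few lines of Python.  The function name and all of the numbers are illustrative assumptions: five items running 25 hours per month, and a pair of alternating pumps that each run half of an 8,760-hour year with 4 failures between them.

```python
import math


def mtbf(total_operating_hours, failures):
    """MTBF = (summation of life) / (summation of failures).

    With zero failures the ratio is unbounded, so infinity is returned
    to mirror the trivial early-life case described in the text.
    """
    if total_operating_hours < 0 or failures < 0:
        raise ValueError("hours and failures cannot be negative")
    if failures == 0:
        return math.inf
    return total_operating_hours / failures


# Five items at 25 operating hours per month.
first_month = mtbf(5 * 25 * 1, 0)     # no failures yet: infinite MTBF
eighth_month = mtbf(5 * 25 * 8, 10)   # 1000 hours, 10 failures so far

# Two redundant pumps that alternate service, 4 failures in one year.
calendar_hours = 2 * 8760.0           # wrongly counts idle standby time
operating_hours = 2 * 4380.0          # counts only hours actually run
inflated = mtbf(calendar_hours, 4)
honest = mtbf(operating_hours, 4)

print("Month 1 MTBF:", first_month)
print("Month 8 MTBF:", eighth_month, "hours")
print("Pump MTBF counting idle hours:", inflated, "hours")
print("Pump MTBF counting operating hours:", honest, "hours")
```

Counting the idle hours doubles the apparent MTBF of the pumps without any change in the equipment, which is exactly the flattering error the text warns against.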

 

Section 5 of MIL-HDBK-338 reminds us that MTBF only has meaning for repairable items and that, for that case, MTBF represents exactly the same parameter as mean life, as it is based on the assumption of a constant failure rate.  In Appendix A of The New Weibull Handbook, Abernethy reminds us that the MTBF becomes asymptotic to the MTTF with many replacements.  Consider this ultra-simple example of MTBF for 5 items (components) each working 25 hours/month.  (For simplicity, assume repair time is very small):


Table 1: Example of MTBF for 5 Items Each Working 25 Hours/month With Recorded Failures

Month--->        1     2     3     4     5     6     7     8     9
Item 1           0     0     1     0     0     0     1     1     0
Item 2           0     0     0     0     1     1     0     0     0
Item 3           0     0     0     1     0     0     0     1     0
Item 4           0     1     0     0     0     1     0     0     1
Item 5           0     0     0     0     1     0     0     0     0
Fail./mo.        0     1     1     1     2     2     1     2     1
Σ Hrs life     125   250   375   500   625   750   875  1000  1125

Σ Failures

0

1

2

3

5

7

8

10