Dictionary definitions for the common man:
From Microsoft Encarta Reference Library 2005®, ©1993-2004 Microsoft Corporation:
● re·li·a·ble
adjective
1. dependable:
able to be trusted to do what is expected or has been promised
● She is extremely reliable and a hard worker.
2. likely to be
accurate: able to be trusted to be
accurate or correct or to provide a correct result
● I don’t think that clock is
very reliable.
- re·li·a·bil·i·ty, noun
- re·li·a·ble·ness, noun
- re·li·a·bly, adverb
● da·ta
noun (takes
either a singular or plural verb)
1. factual information: information,
often in the form of facts or figures obtained from experiments or surveys,
used as a basis for making calculations or drawing conclusions.
2. COMPUTING information
for computer processing: information,
for example, numbers, text, images, and sounds, in a form that is suitable for
storage in or processing by a computer.
_________________________________________________________________
Working definitions for reliability professionals:
● reliability
an engineering definition for reliability
1. The duration or probability of failure-free
performance under stated conditions.
2. The probability that an item can perform its intended
function for a specified interval under stated conditions. (For non-redundant items this is equivalent
to definition 1. For redundant items
this is equivalent to definition of mission reliability.) See MIL-HDBK-338,
page 56.
a business definition for reliability
3. Reliability
is the probability that a device, system, or process will perform its prescribed
duty without failure for a given time when operated correctly in a specified
environment. It costs money to achieve
high reliability and it costs money when things become unreliable; thus money is
a big motivator for reliability.
● data
an engineering definition for reliability data
1. “Failure must be precisely defined in practice. For dealings between producers and consumers,
it is essential that the definition of a failure be agreed upon in advance to
minimize disputes. For many products,
failure is catastrophic, and it is clear when failure occurs. For some products, performance slowly
degrades, and there is no clear end of life.
One can then define that a failure occurs when performance degrades
below a specified value. Of course, one
can analyze data according to each of a number of definitions of failures. One must decide whether time is calendar time
or operating hours or some other measure of exposure, for example, the number
of start-ups, miles traveled, energy output, cycles of operation, etc. Also, one must decide whether to measure time
of exposure starting at time of manufacture, time of installation, or whatever. Engineers define failure and exposure.”
2. “Most non-life data are complete; that is, the value
of each sample unit is observed. Such
life data consist of the time to failure of each sample unit. Much life data are incomplete. That is, the exact failure times of some
units are unknown, and there is only partial information on their failure
times.”
3. “Sometimes when life data are analyzed, some units are
unfailed, and their failure times are known only to be beyond their present
running times. Such data are said to be
censored on the right. Unfailed units
are called run-outs, survivors, removals, and suspended units. Similarly, a failure time known only to be
before a certain time is said to be censored on the left. If all unfailed units have a common running
time and all failure times are earlier, the data are said to be singly censored
on the right. Singly censored data arise
when units are started on test together and the data are analyzed before all
units fail. Such data are singly time
censored if the censoring time is fixed; then the number of failures in that
fixed time is random. Time censored data
are also called Type I censored. Data
are singly failure censored if the test is stopped when a specified number of
failures occurs, the time to that fixed number of
failures being random. Time censoring is
more common in practice; failure censoring is more common in the literature, as
it is mathematically more tractable.”
For the direct quote of engineering definitions 1, 2, and 3 see: Nelson, Wayne; Applied
Life Data Analysis, John Wiley & Sons, New York, NY, ISBN:
0-471-09458-7, 1982, pages 6–7.
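The difference between time (Type I) and failure (Type II) censoring in definition 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lifetimes are simulated from a hypothetical Weibull population, and the function names `time_censor` and `failure_censor` are made up for this sketch, not taken from Nelson.

```python
import random

def time_censor(lifetimes, t_stop):
    """Type I (time) censoring: the test stops at a fixed time t_stop."""
    # Each record is (age, failed); unfailed units are suspended at t_stop.
    return [(t, True) if t <= t_stop else (t_stop, False) for t in lifetimes]

def failure_censor(lifetimes, r):
    """Type II (failure) censoring: the test stops at the r-th failure."""
    ordered = sorted(lifetimes)
    t_stop = ordered[r - 1]  # the stopping time is random here
    failed = [(t, True) for t in ordered[:r]]
    suspended = [(t_stop, False)] * (len(ordered) - r)
    return failed + suspended

random.seed(1)
# Hypothetical true lifetimes (hours) for 10 units started on test together.
lifetimes = [random.weibullvariate(1000.0, 2.0) for _ in range(10)]

type1 = time_censor(lifetimes, t_stop=800.0)
type2 = failure_censor(lifetimes, r=4)
print(sum(failed for _, failed in type1), "failures by 800 h (the count is random)")
print(round(type2[3][0], 1), "h to the 4th failure (the time is random)")
```

In both cases the unfailed units share one common running time, which is what makes the sample singly censored on the right.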
4. “Measured data like age-to-failure data is much more
precise because there is more information in each data point. Measured data provides much better precision,
so smaller sample sizes are acceptable.
Ideally, each Weibull plot depicts a single failure mode. Data requirements are described by D.R. Cox
[1984] who said to determine failure time precisely, there are three
requirements:
1) a time origin must be unambiguously defined,
2) a scale for measuring the passage of time must be agreed to, and
finally,
3) the meaning of failure must be entirely clear.
The age of each part is required, both failed and unfailed. The units of age depend on the part usage and
the failure mode, for example, low cycle and high cycle fatigue may produce
cracks leading to rupture. The age units
would be fatigue cycles. The age of a
starter may be the number of starts.
Burner and turbine parts may fail as a function of time at high
temperature or as the number of cold to hot to cold cycles. Usually, knowledge of the physics-of-failure
will provide the age scale. When there
is uncertainty, several age scales are tried to determine the best fit. This is not difficult with good software. The “best” aging parameter data may not exist
and substitutes are tried. For example,
the only data on air conditioner compressors may be the date shipped and the
date returned. The “best” data,
operating time or cycles, is unobtainable, so based on the dates above, a calendar interval is used as a substitute. These inferior data will increase the
uncertainty, but the resulting Weibull plot may still be accurate enough to
provide valuable analysis. The data fit
will tell us if the Weibull is good enough.”
For direct quote of engineering definition 4 see: Abernethy, Robert B.; The New Weibull Handbook,
self-published by Dr. Robert B. Abernethy, 536 Oyster Road, North Palm Beach,
Florida 33408-4328, 5th edition, ISBN-13: 978-0-9653062-3-2, 2006,
pages 1–4.
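To show how the ages of both failed and unfailed parts feed a Weibull plot, here is a minimal sketch of one common fitting approach: rank regression using Benard's median rank approximation, with Johnson's adjusted ranks to account for suspensions. These method choices, the function names, and the sample ages are assumptions made for illustration; the handbook and commercial software may use a different regression direction or maximum likelihood.

```python
import math

def adjusted_ranks(records):
    """Return [(age, adjusted_rank)] for the failures in (age, failed) records."""
    # Sort by age; at equal ages, list failures ahead of suspensions.
    ordered = sorted(records, key=lambda r: (r[0], not r[1]))
    n = len(ordered)
    prev_rank = 0.0
    ranked = []
    for i, (age, failed) in enumerate(ordered):
        reverse_rank = n - i  # units still at risk at this age
        if failed:
            prev_rank = (reverse_rank * prev_rank + n + 1) / (reverse_rank + 1)
            ranked.append((age, prev_rank))
    return ranked

def weibull_fit(records):
    """Estimate (beta, eta) by least squares on Weibull plotting positions."""
    n = len(records)
    xs, ys = [], []
    for age, rank in adjusted_ranks(records):
        f = (rank - 0.3) / (n + 0.4)  # Benard's median rank approximation
        xs.append(math.log(age))
        ys.append(math.log(-math.log(1.0 - f)))
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta = sxy / sxx                      # slope of the fitted Weibull line
    eta = math.exp(x_bar - y_bar / beta)  # age at which 63.2% have failed
    return beta, eta

# Illustrative ages in whatever unit motivates the failure; False = suspended.
sample = [(10, False), (30, True), (45, False), (49, True),
          (82, True), (90, True), (96, True), (100, False)]
beta, eta = weibull_fit(sample)
print("beta = %.2f, eta = %.1f" % (beta, eta))
```

Trying a different age scale, as the quote suggests, simply means feeding the same function the ages expressed in the other unit and comparing the quality of the fit.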
a business definition for reliability data
5. Failure data is required for reliability
analysis. Failure data is similar to
tombstone data, i.e., the age “date” at failure (age-to-failure) minus the
birth “date” (time zero for the time origin).
Failure data requires elapsed service time/cycles/wait-time/etc., using
the units that motivate the failure.
Failure data includes actual failures (failures require practical
criteria for defining what is a failure) along with censored (suspended)
data for items that are aging toward failure and have not yet failed. Censored data also comes from the age of
items that have been removed from service without failure such as occurs with
timed replacement from a preventative maintenance action. Censored data also includes items that have
been removed from service without failure as a result of a good maintenance
practice such as would occur when pump bearings are replaced for having reached
a failure criterion such as excessive vibration or excessive heat, and while the
pump is disassembled the pump seal is also replaced as a good maintenance
practice. Thus, the pump seal life is censored whereas the bearing or bearings may
have reached a failure criterion. So you
must record:
1) The age-to-failure for the
failed item in “time” units motivating the failures and identify the failure
mode, and
2) The age-to-removal for unfailed (censored or suspended) items
along with reasons for the removals.
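The two recording requirements above can be captured in a very small record layout. The sketch below is one possible shape, not a prescribed format; the class name `LifeRecord`, its field names, and the pump entries are hypothetical and chosen to mirror the bearing and seal example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LifeRecord:
    item_id: str
    age: float    # in the units that motivate the failure (hours, cycles, starts, ...)
    failed: bool  # True = failure, False = censored (suspended)
    note: str     # failure mode if failed, otherwise the reason for removal

# Hypothetical entries mirroring the pump example.
records = [
    LifeRecord("pump-A bearing", 9200.0, True, "excessive vibration"),
    LifeRecord("pump-A seal", 9200.0, False, "replaced while pump was disassembled"),
    LifeRecord("pump-B seal", 4100.0, False, "still in service"),
]

failures = [r for r in records if r.failed]
suspensions = [r for r in records if not r.failed]
print(len(failures), "failure(s),", len(suspensions), "suspension(s)")
```

Keeping the unfailed ages alongside the failures is the point: dropping the suspensions would throw away information about items that have survived so far.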
Why two names–censored/suspended–for the
same data? Because of the litigious
society in the United States (for example, it is claimed that we have more
lawyers in the city of Houston than the entire nation of Japan!), and the large
quantity of lawsuits where lawyers use the word censored to conjure up
● secrecy,
● hiding information, and
● making some information unavailable,
which generates mistrust in the eyes of the jury. Other lawyers use the word suspended
as a more benign word, which avoids the mental taint in the eyes of the jury in
trials. See http://www.barringer1.com/reliability.htm
Notice the definitions for reliability issues are far
more detailed and far more complex than for the ordinary language usage!
We talk about reliability—the
absence of failures. However, we
quantify reliability with failure data due to unreliability. Reliability is the sweet side of the
coin. Unreliability is the sour
side of the coin.
Unreliable equipment
generates much failure data. Highly
reliable equipment generates little or no failure data. Immediately we have a love/hate
arrangement: We want the failure data,
but we hate to pay for the unreliability that goes with the data! So,
how many failures would you like to purchase out of your pocket to have all the
data you need?
If you have failure data,
what are you going to do with it? You
must convert the failure details into workable information for quantifying
reliability and particularly the cost
of unreliability, which goes into the long
term cost of ownership. Simply put:
● acquire age-to-failure data,
● give the data a voice so the facts quantify
and tell about the failures, and
● find a way to decrease the costs of failures.
Many companies have much data
in their maintenance systems, but since the data cannot speak, it thus conveys
little information. The data needs
engineers who give the data a voice and make the failure data speak! You need the data to speak for
management/engineering/process metrics of key performance indicators (KPIs)—you can’t operate a modern business as Mr. and Ms.
Clueless for very long! We live in a
fact-based environment and need clear data for making decisions.
Likewise equipment suppliers
have much data in their spare-parts sales systems for replacements sold
(usually at very high prices as a profit stream for their businesses) and the
data needs to have a voice to speak to design engineers about the unreliability
of their designs. Businesses often have
a specific spare-parts sales strategy:
● Some businesses expect to make their
money from first sales and the equipment is usually durable with few
spare-parts sales, whereas
● Other businesses expect to make their money from
spare-parts sales as the equipment is less durable with many spare-parts sales
Of course, some spare-parts sales in any business occur because of consumption
of life following the natural law of entropy. For example, batteries
always run down because of consumption of their energy, and they never run up
to full charge. Similarly, buildings and
bridges always fall down by consumption of their life; they never fall up. If you acquire any physical item, the first
price paid is not the last price as you must sustain the equipment over its
life span. Generally speaking, life
cycle cost studies show the sustaining cost is 2 to 20 times the acquisition
cost of most equipment and processes.
Some general concepts for data
Niederman and Boyum, in the 2004 paperback book What The Numbers Say: A Field Guide to Mastering Our
Numerical World, give the following chart on page 48 that they call The Ten Habits of Highly Effective
Quantitative Thinkers:
Ten Habits of Highly Effective Quantitative Thinkers
Attitude Is Everything
1. Only Trust Numbers
2. Never Trust Numbers
Navigational Tools
3. Play Jeopardy (Phrase all answers as questions; the numbers are useful when considered as an answer.)
4. Live by Pareto’s Laws
Illuminating Numbers
5. Play 20 Questions
6. Build Models
Uncertainty
7. Play the Odds
8. Know What You Know and Don’t Know
Estimation
9. Go Figure
10. Look for the
The short meanings are shown in the
parentheses. Read the book for more insight and practical
examples. Keep these ten habits in
mind as we look at the use of numbers for reliability below.
Don’t Be Fooled By Randomness
Everyone wants their data to
be better behaved than Mother Nature can provide. Nassim Taleb has an interesting 2004 paperback book titled Fooled
By Randomness: The Hidden Role of Chance in Life and
in the Markets. His points are:
1) We want to live in a
deterministic world, but all around us are huge amounts of randomness and
uncertainty,
2) Our brains are fooled by the
uncertainty of randomness, and
3) We want to see signals of a
deterministic, certain world where none exists, which leads to both false
positives and false negatives.
Of course, randomness of
chance always favors those who are prepared to seize the opportunity when a
signal comes through the fog. Taleb argues that most people underestimate the likelihood
of seeing a “black swan” which is an extreme, highly disruptive event that comes
out of nowhere to surprise.
Many in the West believed all
swans were white until European explorers discovered that black swans are
native to Australia.
Taleb defines a black swan event as a highly improbable
event with three key issues:
1) It is unpredictable,
2) It has a massive impact, and
3) After the fact we concoct an explanation that makes it appear less random and more predictable
than it was.
Taleb uses the term the way people once used it: “black swan” was a figure of speech
for something that did not exist, because everyone assumed swans were never
black. In life, many people
underestimate the odds that such rare events will/can occur. As an example, consider the occasional
bursting of the Wall Street bubble where records are set for the Dow Jones
stock index one week followed by an enormous decline in the market. Another example of a black swan event is the
destruction of the twin towers in New York.
Black swan events occur in
industry just as in the stock markets:
Smooth operation is suddenly ended with events such as vessels that
explode or pipes that rupture from hidden effects of corrosion because no one
went searching for them by means of an inspection. The purpose of inspections is to discover
what isn’t known. We get into trouble by
not learning what we didn’t know!
Life data for reliability has
much randomness! If you can clearly understand
the message signal coming through the randomness and if you are prepared to
deal with the probabilistic data, you can receive the signal that cuts through
the noise of life data. Knowledgeable
practitioners of reliability are prepared to separate sense from the apparent
nonsense. When you can see the signal and
you are prepared for the opportunity, then you can have success while others
only see chaos and lack of success.
Taleb has a table for comparing the distinctions between
number complexity and simplicity, which he calls his Table of Confusion.
Table of Confusion
General
Luck | Skills
Randomness | Determinism
Probability | Certainty
Belief, conjecture | Knowledge, certitude
Theory | Reality
Anecdote, coincidence | Causality, law
Forecast | Prophecy
Market Performance
Lucky idiot | Skilled investor
Survivorship bias | Market outperformance
Finance
Volatility | Return (or drift)
Stochastic variable | Deterministic variable
Physics and Engineering
Noise | Signal
Literary Criticism
None* | Symbol
Philosophy of Science
Epistemic probability | Physical probability
Induction | Deduction
Synthetic proposition | Analytic proposition
General Philosophy
Contingent | Certain
Contingent | Necessary (in the Kripke sense)**
Contingent | True in all possible worlds
* Literary critics do not seem to have a name for things they do not understand.
The point of the table is simple: We wander
back and forth between the two columns in search of comments that justify what
we want to believe. We want the signal
but we’re overwhelmed by the noise that comes with the signal. This means we must have the ways and the
means to cut through the fog of the noise to understand what’s really going
on. The fog cutter is making the data
talk, which keeps us out of the sinkhole of relying on myths and mysticism and
thinking we know more than we actually do!
Mean Time Between Failures (MTBF)
The MTBF metric is the bottom
of the “food chain.” MTBF is the easiest
data to acquire.
MTBF = (summation of life)/(summation of failures).
The summation of life is the total of all life hours acquired (both things that
have lived and things that have died); the summation of failures is the count of all failures that have
occurred. Of course, this implies that
early in the life of a component you can infer a trivial situation of infinite
life as you accumulate life hours without any failures! Because the data is easy to acquire, be
careful not to do the wrong thing just because it is easy. For example, if you have redundant equipment
on which you alternate the service, do not count the idle life as if the
equipment was operating because that will make your MTBF look very large when,
in fact, the idle equipment is not toting up life hours. In short, do not “Enronize” your metrics just to look good and just because
it’s easy!
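The arithmetic, and the effect of counting idle life on redundant equipment, can be sketched as follows. The two alternating pumps, their hours, and the failure count are hypothetical numbers chosen only to make the inflation visible; the function name `mtbf` is likewise just for this sketch.

```python
def mtbf(life_hours, failures):
    """MTBF = (summation of life) / (summation of failures)."""
    total_life = sum(life_hours)
    if failures == 0:
        # No failures yet: report "undefined" rather than infer infinite life.
        return None
    return total_life / failures

# Hypothetical pair of redundant pumps alternated in service over
# 1000 calendar hours, so each pump actually operates for 500 hours.
operating_hours = [500.0, 500.0]
calendar_hours = [1000.0, 1000.0]  # wrongly counts the idle time as life
n_failures = 4

print(mtbf(operating_hours, n_failures))  # 250.0
print(mtbf(calendar_hours, n_failures))   # 500.0, inflated by the idle hours
print(mtbf(operating_hours, 0))           # None
```

Returning `None` when no failures have occurred is a design choice for this sketch; it avoids the trivial "infinite life" reading early in a component's life.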
Section 5 of MIL-HDBK-338
reminds us that MTBF only has meaning for repairable items and that, for that case,
MTBF represents exactly the same parameter as mean life because it is based on the
assumption of a constant failure rate.
In Appendix A of The New Weibull
Handbook, Abernethy reminds us that the MTBF becomes asymptotic to the MTTF
with many replacements. Consider this
ultra-simple example of MTBF for 5 items (components) each working 25
hours/month. (For simplicity, assume
repair time is very small):
Table 1: Example of MTBF for 5 Items Each Working 25 Hours/month With Recorded Failures
Month---> | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
Item 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
Item 2 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
Item 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
Item 4 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
Item 5 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
Fail./mo. | 0 | 1 | 1 | 1 | 2 | 2 | 1 | 2 | 1 |
Σ Hrs life | 125 | 250 | 375 | 500 | 625 | 750 | 875 | 1000 | 1125 |
Σ Failures | 0 | 1 | 2 | 3 | 5 | 7 | 8 | 10 |