Reliability Tools:
Reliability tools exist by the dozens: what are the tools, why use the tools, when
should I use the tools, and where should I use the tools? Click on the tools below for answers.
The details about these tools will be brief
as books are written about each item.
Think of the presentations below as hors d’oeuvres (a little snack food
or starters)—not the main course.
The most important reliability tool is a Pareto
distribution based on money—specifically based on the cost of unreliability which directs
attention to work on the most important money problem first. No magic bullet exists for reliability issues, so don’t waste your time looking for a single
magic tool: none exists!
Accelerated Testing-
What: A test method
that increases loads to quickly produce age-to-failure data; the few resulting data
points are then scaled to reflect normal loads.
Why: The benefit
of accelerated testing is to save time and money while quantifying the
relationships between stress and performance along with identifying design and
manufacturing deficiencies to get useful data quickly and at low cost.
When: Usually
performed during the development of devices, components, or systems. Also applies to items that have been in
service to obtain a metric needed to show how the item is performing under
heavy loads. Accelerated testing is a
useful method for solving old, nagging problems within a production process.
Where: Used for
correlating test results with real life conditions.
Availability-
What: A tool for measuring
the percent of time an item or system is in a state of readiness where it is
operable and can be committed to use when called upon. Availability ceases because of a downing
event that causes the item/system to become unavailable to initiate a mission
when called upon. In the simplest view
the metric is availability = uptime/(uptime +
downtime). For many other definitions
see MIL-HDBK-338,
section 5.
Why: The measure
is important for knowing the commitment of time for performing the mission and
it usually only involves the use of arithmetic.
When: Often the
measurement tool is based on past experiences and the complement of the
measurement tool addresses unavailability to perform the task.
Where: In design of
a system it is a calculated value and in operation of a system it is a
performance index that is often easy to use and provides an index that is
understandable to the average person.
Today there is a great tendency to “Enronize”
availability metrics by using uptime metrics that present data in the best
light (an issue of data integrity) to maximize managerial bonuses by excusing
(deducting) downtime from the calculations to put “lipstick on the pig”. Use the KISS
principle. Think of availability in
terms of the investor’s typical year of 8760 hours. The no-excuse annual metric in hours is
availability = uptime/8760. Suddenly
you’ll find a metric of great interest to investors that can be benchmarked as
a financial issue, and thus motivate the management team to solve real issues
of importance to the business. Please
note, you can have high availability but many failures
and thus low reliability as availability
≠ reliability. Likewise, you can have high availability, but
little output so team the metric with effectiveness
to get the complete story.
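The two availability formulas above can be compared side by side. The following is a minimal sketch; the function names and the plant figures are hypothetical and chosen only for illustration.

```python
HOURS_PER_YEAR = 8760  # the investor's typical year used in the text


def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Simplest view: uptime / (uptime + downtime)."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("uptime + downtime must be positive")
    return uptime_hours / total


def annual_availability(uptime_hours: float) -> float:
    """No-excuse annual metric: uptime / 8760."""
    if not 0 <= uptime_hours <= HOURS_PER_YEAR:
        raise ValueError("uptime must lie between 0 and 8760 hours")
    return uptime_hours / HOURS_PER_YEAR


# Hypothetical plant: 8000 h of uptime, 400 h of logged downtime,
# and 360 h of downtime that was "excused" from the records.
print(round(availability(8000.0, 400.0), 4))   # excused hours are hidden
print(round(annual_availability(8000.0), 4))   # excused hours still count
```

The gap between the two printed values is the downtime that was deducted from the calculation, which is why the annual form is harder to dress up.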
Bathtub Curve-
What: The concept
is derived from the human life experience involving infant mortality, chance
failures, plus a wearout period of life since data for births and deaths is
accumulated by government agencies. Most
equipment lacks the birth/death recording by government agencies and most
non-human systems can be regenerated to live/die many times before relegation
to the scrap heap.
Why: Failure rates
are different for both people and equipment at different phases of operation
and the medicine to be applied to both humans and equipment needs to be
considered for effectively treating the roots of the problem.
When: The concept
is useful during design, operation, and maintenance of equipment and systems to
understand the failure mechanisms.
Where: It draws on
human experience so the ordinary person can relate equipment/system
failures to those experienced in real life so as to coordinate the design,
operation and maintenance of equipment. For
other definitions see MIL-HDBK-338,
section 9.
Block Diagram Model (same as Reliability Block Diagram Models)-
What: Reliability
block diagram (RBD) models are graphical representations of a calculation
methodology for reliability systems.
Why: The RBD
models allow calculation of system reliability based on knowing/assuming
failure details of the components, starting with the smallest component and
growing the model to the largest system to predict performance from the
elements.
When: RBDs are used in upfront
designs as a performance parameter and after the system is constructed to
ferret out poorly performing blocks that limit the system performance.
Where: Frequently
used as a trade-off tool to search for the lowest long-term cost of ownership and to
help sell alternative courses of action for moderating the effects of
reliability issues or overcoming poor performance with alternative designs.
The results can be calculated before building the system, and the
calculations provide knowledge about availability, maintenance
interventions required for failures, and the number of spare parts required to
sustain operations. For other definitions
see MIL-HDBK-338,
sections 4 and 6.
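As an illustration of growing a model from components to a system, the sketch below combines block reliabilities using the standard series and parallel rules. It assumes the blocks fail independently, and the block names and values are hypothetical.

```python
from math import prod


def series(reliabilities):
    """All blocks must work: multiply the block reliabilities."""
    return prod(reliabilities)


def parallel(reliabilities):
    """At least one block must work: 1 minus the product of unreliabilities."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Hypothetical system: a feed block, two redundant pumps, and a control block.
pump_pair = parallel([0.90, 0.90])
system = series([0.95, pump_pair, 0.98])

print(round(pump_pair, 4))
print(round(system, 4))
```

Swapping one block value and recomputing is the kind of trade-off study the text describes, done before any hardware is built.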
Capability-
What: A measure of
how well the product performance meets objectives. In short, how well are the outputs actually
accomplished against a standard?
Capability is frequently the product of efficiency * utilization.
Why: Capability is
a component of the effectiveness equation and
usually under the control of production.
When: Data for this
metric is frequently produced by the accounting department each month as a
segment of the financial reports for the purpose of handling variances against
the standards.
Where: Frequently in
the effectiveness measure it is a weak point (as a measure of how well the
production process does the job for which it was purchased) requiring
substantial improvement that cannot be solved by the usual reliability and
maintainability (
Configuration Control-
What: Configuration
control is involved with the management of change by providing traceability of
failures back into the design standard.
If the design details are not specified, the design will not contain the
requirements and thus implementation of the project will be hit or miss for
achieving the desired end results, beginning with the conceptual design and
resulting in the operating facility.
Why: With active
configuration control you know where items are used and contained, where and
why they were installed, where signals originate, what items are used where and
in what environments, what drawing revisions have occurred and you know if the
product conforms to the drawings and specifications, what alternate
materials/components have been used, and what test reports/certifications are
available as original documents for review.
When: Configuration
control begins after the first design review to build an unbroken chain of
traceability to aid in avoiding surprises in the field which would destroy the
designed-in criteria for availability, reliability, maintainability, and cost
effectiveness established as a portion of the original design criteria.
Where: Frequently
these documentation details are assembled into a dossier with third party
witnessing for use in validating conformance to the design requirements and
provided to the owner of the equipment as witness documents.
Contracting For Reliability-
What: Tell your vendors what you want, and want
what you say. Provide explanations of the
objectives in written contracts in terms the vendors will understand.
Why: If you can’t
clearly spell out the requirements for availability, reliability, and
maintainability the contractors cannot make these issues features of the
design. Thus, it is important to be
specific in the features the design must manifest. Explanations such as: “You know what I want
and what I need, just do it quickly” are
self-defeating expressions of vague generalities that lead to inferior designs
and constant arguments. Be specific
about requirements for building reliability
block diagrams, using quality function deployment,
performing failure mode and effects analysis, conducting fault tree analysis, and finally, conducting design reviews for reliability.
When: Write the
specifications before procurement begins.
Plan to spend time with your own purchasing department to explain the
details and sell the team on the financial advantages of including reliability
requirements in the specifications.
Likewise, spend time selling your vendors on the requirements and why
they are stated.
Where: These are up
front decisions to avoid replication of previous problems that were built into
previous designs and never corrected.
Cost Of Unreliability-
What: The cost of unreliability
is a big-picture view of system failure costs, described in annual terms, for a
manufacturing plant as if the key elements were reduced to a series block
diagram for simplicity. It looks at the
production system and reduces the complexity to a simple series system where
failure of a single item/equipment/system/processing-complex causes the loss of
productive output along with the total cost incurred for the failure. If the system IS sold out, then the cost of unreliability
must include all appropriate business costs such as lost gross margin plus
repair costs, scrap incurred, etc. If
the system is NOT sold out, and make-up time is available in the financial
year, then lost gross margin for the failure cannot be counted. The cost of unreliability is a management
concern connected to management’s two favorite metrics: time and money.
Why: In private
enterprise, failures must be considered from a financial viewpoint and not a
“gear-head” approach of simply counting the number of failures; you must also
speak the language of the enterprise, which describes events by monetary
measures over a period of time. The
annual cost for failures is usually not stated in a clear-cut manner, nor are
failure costs summarized by system/sub-system to identify the weak links in
monetary terms. Building a clear Pareto
distribution of the annual cost of unreliability directs an action plan at
the vital (high cost) areas to
reduce failures (unreliability) and to reduce the cost of unreliability.
When: For a new
plant, this can be a design criterion to limit costs of
unreliability for competitive reasons in the marketplace. You must make the hidden costs of failures
obvious as a portion of the strategic plan.
For an existing plant, this can be an exercise in defining the cost of
unreliability and building a long-term plan to reduce the cost of failures as a
portion of the tactical plan.
Where: This activity
is best performed with high-level involvement of the management team to provide
fundamental understanding of the size of the icebergs about to rip out the
underbelly of the plant and to involve the organization in a plan to reduce the
costs so that profits are pushed upward because of the improvements. If the cost of unreliability cannot be
reduced, then the costs become extra weight for the saddlebags in the race for
survival.
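The sold-out rule and the money-based Pareto ranking described above can be sketched as follows. The function names, the block names, and every dollar figure are hypothetical; the cost model is deliberately reduced to lost gross margin plus repair cost.

```python
def annual_cost(failures_per_year, hours_lost_per_failure,
                repair_cost_per_failure, gross_margin_per_hour, sold_out):
    """Annual cost of unreliability for one block of a series system."""
    # Lost gross margin is counted only when the plant is sold out,
    # because otherwise make-up time is assumed to be available.
    lost_margin = hours_lost_per_failure * gross_margin_per_hour if sold_out else 0.0
    return failures_per_year * (lost_margin + repair_cost_per_failure)


def pareto(costs):
    """Rank blocks by cost, highest first, with the cumulative share of the total."""
    total = sum(costs.values())
    rows, running = [], 0.0
    for name, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
        running += cost
        rows.append((name, cost, running / total if total else 0.0))
    return rows


# Hypothetical sold-out plant earning 10,000 of gross margin per hour.
costs = {
    "compressor": annual_cost(2, 24, 50_000, 10_000, sold_out=True),
    "pump": annual_cost(6, 4, 5_000, 10_000, sold_out=True),
    "instrument": annual_cost(10, 1, 1_000, 10_000, sold_out=True),
}
for name, cost, cumulative in pareto(costs):
    print(f"{name:12s} {cost:>10,.0f} {cumulative:6.1%}")
```

Note that the instrument fails most often yet ranks last in money, which is the point of ranking by cost rather than by failure count.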
Critical Items List-
What: The critical
items list is a top-level summary of problems/cost used for discussions with
management about key reliability issues.
The summary list converts technical details to a summary of costs and
time while placing the issues into a Pareto distribution explained in terms of
money and the vital few problems to be solved for competitive reasons.
Why: The purpose
of the critical items list is to focus management’s attention on items that
need to be resolved during the design phase as a corrective action loop for
influencing the lifetime costs.
When: The list
starts with the first design review as issues are disclosed in design reviews
for reliability.
Where: The critical
items list is presented to top-level management as issues to be accepted or
resolved before paper plans become steel and concrete.
Data-
What: Data is the informational
energy that runs the reliability improvement machine. Data is acquired at great cost. Data needs to be retained and used to prevent
future failure events. Proper use of
data provides an understanding of failure mechanisms and prevents reoccurrence
of bad events that cause safety or high-cost failures to occur. Reliability data requires definition of a
failure. Failures can be catastrophic
failures or slow degradation; you decide by defining the failures. The units of the measure for the data must be
in units of the degradation (sometimes hours, sometimes miles, and
so forth), in short, whatever motivates the failure. Reliability always ceases with a failure or a
removal from service in some aged condition that then generates a category of
data called a suspension or censored data.
Data is information in the form of facts, figures, or engineering
databases that is obtained from engineering tests, experiments, or actual
operating conditions. Reliability data
is often incomplete as the exact times to failure are rarely known or recorded
with much precision so that only partial information is available for
analysis. Reliability data comes in two
forms: 1) age-to-failure data, and 2) censored/suspended data such as occurs
when unfailed items are removed from service or when they fail due to a different
failure mode than we are studying—this is useful information and part of the
data set. Some data is better than no
data for resolving reliability issues.
Why: Data is the
information that, when used in an informed manner, helps prevent repetition of
bad history and allows an enlightened approach to rationally solving a
reliability issue using facts and figures.
Intelligent use of data for reliability issues provides the objective
evidence needed for helping to solve the root cause of failures.
When: Databases of
reliability information from past experience are very helpful for predicting
future failure events. The data is
helpful when it is described as failure rates or as their reciprocal,
mean times to failure, which reduces the information to an average failure rate
or average time to failure. The
reliability data is particularly valuable if retained for components as a
Weibull database with shape factor beta and scale factor eta.
Where: The data is
useful for understanding failure modes, for predicting future failures for a population
of equipment during the design stage, and for predicting future failures with
subsequent increases in the aging of equipment.
The role of the reliability engineer is to acquire the failure data and
convert the data into useful information for both current and future use.
Decision Trees-
What: Most business
decisions have considerable uncertainty, which implies at least two outcomes if
you choose a course of action. Making
decisions in the face of uncertainty requires the cost of taking action and
its probability, along with the cost of not taking action and the probability
of that occurrence. In most cases the
probabilities are not well known (maybe to one significant digit) and the costs
are not well known (maybe to $10000).
The quantitative assessment is called risk assessment. The issue is to take these poorly
identified items and devise a strategy that can minimize exposure to risk for
the business. Decision trees are a graphical representation of a methodology to
reach the expected values for the decision so as to take or not take action.
Why: Most business
decisions have no exact answers, i.e., no black and white answers but rather
shades of gray. The use of the tool is
to help decide which course of action may be to the advantage of the business
given the best estimates that can be made.
When: Decisive
details will only be known in the future, yet decisions have to be made today,
so decision trees are tools to help wisely span from today into the
future with the wisest decisions that can be made from sketchy data.
Where: If you have
absolute data, use it. Most decisions
must be made with indecisive information that requires decisions about the odds
for a given event, usually based on estimates—the wiser the estimate the better
the decision, taking into account the probabilities of the outcomes and the
money involved in the decision. Use this
tool when few details are available and you must be the pioneer to cut through
the forest to reach the promised land of opportunity and profitable ventures.
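The expected-value arithmetic behind a two-branch decision tree can be sketched as below. The branch names, probabilities, and costs are hypothetical rough estimates of the kind the text describes, and the helper names are illustrative only.

```python
def expected_cost(fixed_cost, outcomes):
    """Expected cost of one branch.

    fixed_cost: money spent for choosing the branch (0 for doing nothing).
    outcomes:   list of (probability, cost) pairs whose probabilities sum to 1.
    """
    total_probability = sum(p for p, _ in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    return fixed_cost + sum(p * cost for p, cost in outcomes)


def choose(branches):
    """Return the branch name with the lowest expected cost."""
    return min(branches, key=branches.get)


# Hypothetical estimates: a 200,000 failure is 30% likely if nothing is done,
# and 5% likely after a 20,000 improvement project.
branches = {
    "take action": expected_cost(20_000, [(0.05, 200_000), (0.95, 0)]),
    "do nothing": expected_cost(0, [(0.30, 200_000), (0.70, 0)]),
}
print(branches)
print(choose(branches))
```

Because the inputs are only good to about one significant digit, it is worth rerunning the comparison with the probabilities nudged up and down to see whether the choice changes.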
Dependability-
What: The
International Electrotechnical Commission (IEC)
defines dependability as “Dependability describes the availability
performance and its influencing factors: reliability
performance, maintainability performance and maintenance support performance.” MIL-HDBK-338
defines dependability differently, as a measure of the degree to which an item is
operable and capable of performing its required function at any (random) time
during a specified mission profile, given that the item is available at mission
start. (Item state during a mission
includes the combined effects of the mission-related system R&M parameters
but excludes non-mission time; see availability.) Dependability is related to reliability with
the intention that dependability would be a more general concept than the
measurable issues of reliability, maintainability, and maintenance.
Why: The key
dependability issue is to make equipment and processes work as advertised,
that is, without failure. Dependability
aims at facilitating cooperation by all parties concerned (supplier,
organization, and customer) by fostering an understanding of the dependability
needs and value to achieve the overall dependability objectives, so it
involves harmonizing conflicting issues.
Dependability is better viewed from the end user of the equipment
or system than from the designer’s viewpoint or the maintainer’s
viewpoint. From a system-effectiveness
viewpoint, reliability and maintainability provide system availability and
dependability.
When: You cannot
repair yourself to happiness with a failure-prone system as the failure-prone
system will be viewed as lacking dependability to function as required when you
need it. Thus, dependability is viewed
over the longer term and not in convenient snapshots, and dependability also
involves lifecycle cost issues.
Where: Reliability
contributes directly to uptime by avoiding failures whereas maintainability
contributes directly to reducing downtime by faster repairs. Thus, reliability and maintainability jointly
affect the dependability of the system.
Dependable systems must be ready to function, in an operable state, to
produce the desired output, upon demand by the end user, at the specified
quantity and quality of output.
Design Reviews For Reliability-
What: Specific questions
to ask of design engineers during a review specifically for reliability using
failure data from operations and maintenance are: 1) Show the calculated
availability for the system based on a RAM
model, 2) Show the calculated number of failures during the specified
mission time between turnarounds based on a reliability and maintainability (
Why: Design
reviews should demonstrate by calculation or through the use of models and
reliability tools that the system is capable of achieving the design objectives rather
than making a giant leap of faith that all will be well and good. Problems
found in the design review for reliability are corrected less expensively on
paper than when corrections must be made in the field with hardware.
When: Design
reviews for reliability should be a part of the design process starting with
conceptual designs and ending when the drawings are revised for the as-built
system.
Where: This is a logical
extension of the design process to show, rather than tell, how the system will
function. This is performed as a portion
of the up-front design by the numbers process.
Effectiveness Equation-
What: The potential
or actual probability of a system to perform a mission for a given level of
performance under specified operating conditions is defined as the product of reliability*availability*maintainability*capability
(dependability is often defined as reliability*maintainability) and all values
of the product are between 0 and 1. Many
variants of the effectiveness equation exist, e.g., OEE, and others. See a parallel comparison with system
effectiveness based on the productive output results of process
reliability calculations.
Why: The
effectiveness equation defines the ability of a product, operating under
specified conditions, to meet operational demands when called upon. This is a practical measure of how well the
system is performing—not how well we want it to perform, but it is a practical
measure of how the system is doing.
Since all the elements are measured between 0 and 1, the elements of the
equation quickly draw the eye to where opportunities exist for making
improvements.
When: The
effectiveness equation is useful for trade-off boxes for various alternatives
when plotted on an X-Y scale for effectiveness vs net
present value (NPV) for showing improvement alternatives. For the elements:
reliability defines the probability
of a failure-free interval (or the complement unreliability which describes the
probability of failure);
availability defines the probability
of the system being up and alive to handle the demand (or the complement,
unavailability which describes the probability of the system being down);
maintainability defines the
probability of making repairs within the allowed repair standard;
capability defines the probability
of production achieving the desired production results (a measure of how well
the product performs compared to the standard).
Frequently it is described as the product of efficiency * utilization
where
efficiency is an output/input
relationship such as (output achieved)/(the standard
required) and
utilization is how time is used such as
(direct labor)/(direct labor + labor lost)
[In the old days, if this index decreased
to as low as 80% we went berserk—today,
you can’t get this high because of wasted
time when noses are not to the grindstone!!!].
Where: It is used to
describe the performance of both new systems and old systems. Consider this example for effectiveness: If we are comparing a heavy-duty truck versus
a sports car for transportation, the truck may be more effective for heavy
loads whereas the sports car may be more effective for acceleration and high
speeds; neither is defined by the effectiveness equation until the mission is
defined. The effectiveness index is
converted into output quantities by use of the process
reliability technique for quantifying the productive plant and the
non-productive hidden plant based on a pragmatic definition of nameplate
capacity for the plant.
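The effectiveness equation and the capability product described above can be sketched numerically. The element values below are hypothetical, and the helper names are illustrative only.

```python
def capability(efficiency, utilization):
    """Capability as the product of efficiency * utilization."""
    return efficiency * utilization


def effectiveness(reliability, availability, maintainability, capability_value):
    """Product of the four elements, each of which must lie between 0 and 1."""
    elements = (reliability, availability, maintainability, capability_value)
    if any(not 0.0 <= e <= 1.0 for e in elements):
        raise ValueError("every element must lie between 0 and 1")
    return reliability * availability * maintainability * capability_value


def weakest(elements):
    """Name of the lowest-valued element, where improvement pays most."""
    return min(elements, key=elements.get)


# Hypothetical element values for one system and one defined mission.
elements = {
    "reliability": 0.95,
    "availability": 0.98,
    "maintainability": 0.97,
    "capability": capability(efficiency=0.85, utilization=0.80),
}
print(round(effectiveness(*elements.values()), 4))
print(weakest(elements))
```

In this made-up case capability is the low element, matching the text's remark that it is frequently the weak point of the measure.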
Electronic Components-
What: Electronic
components are everywhere, and they are getting smaller and more complex by the
year! They are becoming a larger part of
modern society every day. As a class,
they are particularly susceptible to increased failures from temperature,
vibration, and shock loading which destroys reliability.
Why: Most
electronic devices are small and delicate.
Inherent failure rates are often built into the device by the
manufacturing process (similar to building in human genetic defects), and you
cannot find the inherent defects until the components are stressed. The best remedy for electronic devices to
achieve high reliability is to start with high-quality, durable devices built
on a failure-free process, load the devices only to moderate loads, and to
carefully control the environment to suit the needs of the electronic
component.
When: Burn-in
tests, of different degrees of severity, are imposed following assembly of the system
to weed out the inherent defects by adding stresses due to temperature,
vibration, and shock loadings to cause the weak units to fail. Other accelerated tests for electronic
devices include ESS, HALT, and HASS.
Where: The usual
failure rate distribution for electronic systems is considered to be the exponential distribution, although some
electronic devices such as SCRs often display a decreasing failure rate described
as infant mortality failure modes by Weibull
analysis, and some electronic devices have an increasing failure rate
described as a wearout failure mode for devices such as electrolytic capacitors
and EPROMs. Many electronic failure rates and electronic
models are available in MIL-HDBK-217
and its successor PRISM.
Environmental Stress Screening (ESS)-
What: A series of
screens are conducted under environmental stresses to disclose weak parts and
workmanship defects that require corrections, and this requires an
understanding of burn-in testing and
Why: The extremes
of operating conditions such as high power levels, high temperatures, high
vibration levels, etc. produce failures not anticipated from testing at nominal
conditions. Generally,
When: When
acquiring data, the tests are done upfront of production. When controlling early failures that would be
discovered by the end user, these tests are done as a
portion of the production process to eliminate weak units to control warranty
costs and improve customer satisfaction.
Where: Some tests
are conducted in the laboratory for quick results and then the data is used to
control product testing/release for the purpose of limiting costs and
preventing the loss of customers from unsatisfactory performance in the field.
Events/Incidents-
What: Events/incidents
are single occurrences, especially ones that are particularly
significant, that result in a failure from a
non-aging mechanism for reliability purposes.
Usually the event/incident results in a serious consequence of the loss
of functional life of a component or system.
The death of the device must be recorded as censored (suspended) data.
Why: For
reliability purposes, failure of the component, device, subassembly, or system
has been a success up to the point in life where a failure from a non-aging
event took place. This means the
event-age was a success (up to the point it was killed by an event/incident)
and inclusion of the data is required as censored/suspended data—this is
important data.
When: Include the
suspended/censored data into every analysis.
Young suspensions/censored data have little impact on the results of an
analysis but old suspensions have a major effect on the analysis.
Where: The data is
used for MTBF/MTTF analysis and particularly for Weibull analysis.
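The effect of young versus old suspensions on an MTTF estimate can be sketched as below. This uses the common estimator of total accumulated age divided by the number of failures, which assumes a constant failure rate; that assumption, the function name, and the ages are not from the text and are for illustration only.

```python
def mttf_with_suspensions(failure_ages, suspension_ages):
    """Estimate MTTF as total accumulated age / number of failures.

    Suspended (censored) units add their successful running time to the
    numerator but do not count as failures in the denominator.
    """
    if not failure_ages:
        raise ValueError("at least one failure is needed for an estimate")
    total_age = sum(failure_ages) + sum(suspension_ages)
    return total_age / len(failure_ages)


failures = [1000, 1500, 2500]  # hypothetical ages at failure, in hours

print(mttf_with_suspensions(failures, []))      # suspensions ignored
print(mttf_with_suspensions(failures, [100]))   # one young suspension
print(mttf_with_suspensions(failures, [4000]))  # one old suspension
```

The young suspension barely moves the estimate while the old one changes it substantially, which is why leaving suspensions out of the data set biases the result.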
Exponential Distribution-
What: The
probability of survival and of failure of components or equipment is considered under the
condition of chance failure, which means a constant
instantaneous failure rate where the die-off rate is the same for any surviving
(unfailed) population. An old part is as
good as a new part. For any survivors in
this memory-less system that have survived to time t, a certain percent of the
survivors will die in a specified interval of time such as 2*t. The reliability of the system is often
described by the exponential distribution because many times a system is made
up of mixed failure modes that in the aggregate will function like a constant
failure rate system. The reliability of
exponential distributions is described mathematically as R(t)
= e^(-λt) = e^(-t/Θ) where t is the mission time, λ is
the failure rate, and Θ is the mean time, given that λ=1/Θ. The exponential distribution is frequently
used as a first approximation to describe reliability based on a simple failure
rate or a simple mean time to failure—particularly if the system or component
has multiple failure modes.
Why: The constant
hazard rate, λ, is usually a result of combining many failure rates into a
single number.
When: The
exponential distribution is frequently used for reliability calculations as a
first cut based on its simplicity to generate the first estimate of
reliability when more details about failure modes are not described.
Where: In electronic
systems (which can have many different types of failure modes, especially since
any electrical/electronic system is an amalgam of many different components)
the simple assumption is that the electrical/electronic package will have a
constant failure rate system defined by the exponential distribution. When in doubt about the failure mechanisms,
it is common to assume use of the exponential distribution with its constant
failure rate for simplicity.
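The exponential reliability formula given in this section can be evaluated directly. The sketch below uses a hypothetical mean time of 10,000 hours and also checks the "old part is as good as a new part" property numerically; the function names are illustrative only.

```python
import math


def reliability(mission_time, failure_rate):
    """R(t) = e^(-failure_rate * t) for a constant failure rate."""
    return math.exp(-failure_rate * mission_time)


def reliability_from_mean(mission_time, mean_time):
    """Same formula written with the mean time: R(t) = e^(-t / mean_time)."""
    return math.exp(-mission_time / mean_time)


mean_time = 10_000.0            # hypothetical mean time to failure, hours
failure_rate = 1.0 / mean_time  # failure rate is the reciprocal of the mean

print(round(reliability(1_000.0, failure_rate), 4))
print(round(reliability_from_mean(mean_time, mean_time), 4))

# Memoryless check: surviving a further 1000 h is equally likely for a
# unit aged 500 h as for a new unit.
aged = reliability(1_500.0, failure_rate) / reliability(500.0, failure_rate)
print(math.isclose(aged, reliability(1_000.0, failure_rate)))
```

A mission as long as the mean time leaves only about 37% of units surviving, a useful reminder that the mean time is not a guaranteed life.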
Failure-
What: Failure is
the loss of function when you needed the function to occur. Failures for reliability purposes must be
precisely defined so they are recorded correctly. Much life data is incomplete because failures
are mixed up with censored/suspended data where aged items may not have failed
or they represent removals from service before failure, or they have not yet
failed for the mode of failure under study. In short, these censored/suspended
items represent successes and are a portion of the data set for study.
Why: We study
failed items for the same reason we do autopsies on humans—we want the data and
we want it categorized correctly for making important decisions. Failures require: 1) a time origin that must
be unambiguously defined, 2) a scale for measuring the passage of
time/starts/stops/etc. which motivates failure, and 3) the meaning of failure
must be entirely clear for recording the event.
When: Failure data
must be recorded as it occurs to prevent loss of information.
Failure causes involve design issues, manufacturing issues, assembly
issues, installation issues, or use issues that consume life and motivate
failures by misuse, inherent weakness, or consumption of life by means of a
wearout failure issue.
Failure modes describe the effects under which a failure is observed
including early failures where failure rates decline with usage (infant
mortality), where failure rates are constant with usage (chance failures
describe the usual mid life constant failure rate mortality), and increasing
failure rates with usage (wearout failure rates).
Failure mechanisms describe the physical, chemical, metallurgical, or
other processes which motivate the failures.
Failure criteria are the basis for registering the gravity of a failure
and sometimes temporary changes in the failure state, including duration of the
failure, have an important bearing on how a failure is recorded with the two
largest classifications of failure as complete failure (can’t complete
the intended function) or partial failure (not a complete failure but
deficient in providing all features of the intended function to a level that is
noticeable and undesirable).
Failure onset can be gradual (monitoring is intended to anticipate
detection of pending failure), intermittent (failure occurs in some magnitude
but recovers to complete the intended function), and sudden failure (surprise
events that cannot be anticipated with prior examination or monitoring).
Failure consequences can also be categorized such as critical failures
(significant damage occurs and/or injury to people occurs), major failures
(less severe than a critical failure but of such a magnitude as to
substantially reduce the required function), minor failures (reduces the performance
of the asset but only causes minor consequences for
the entire system), and benign failures (failures known and observed by an
expert but not detected by a novice).
Where: The CMMS system
is frequently where most data resides but usually in crude fashion. The failure data is often transferred into
the FRACAS system for converting the symptoms of
the failure into the root causes of failure.
The failure data must be converted into action items for making
management decisions about future failures and the corrective action needed.
Failure Forecasting-
What: Failure
forecasting is a projection of failures into the future based on assumed or
documented failure details. It is also
known as risk analysis of future failures.
For a constant failure mode system this is very straightforward. However, for complicated failure modes where
the failure rate increases with time (wearout failure modes) or where failure
rates decrease with time (infant-mortality failure modes), this becomes a more
complicated analysis, handled by the Abernethy Risk analysis described in The New Weibull Handbook and
implemented in the software package WinSMITH
Weibull for predicting future failures.
Likewise, reliability block diagrams are useful for predicting future
failures when the authentic failure details are supplied to the
Please note manufacturers follow two general strategies for their equipment:
1) build the equipment to avoid failures even though this increases the original
capital costs, or
2) build equipment and sell the original equipment at a low cost (or even a
break-even cost), expecting to make profits with the sale of
replacement parts.
Thus for end users of the procured equipment, it is important to know the
forecasted failures in the face of supplier protests that “our equipment never
fails”—in that case, ask to see the sale of spare parts for similar equipment
and an estimate of the number of units working to get a crude estimate of the
strategy employed by the equipment supplier.
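The forecasting idea in this section can be illustrated with a simplified sketch: for each unit now in service, compute the probability that it fails in the next interval given that it has survived to its current age, then add those probabilities across the fleet. This assumes a Weibull life model with known shape (beta) and scale (eta), and no replacement of failed units; it is not the procedure implemented in the software named above, and the fleet ages and parameters are hypothetical.

```python
import math


def weibull_reliability(age, beta, eta):
    """Weibull survival probability: exp(-(age / eta) ** beta)."""
    return math.exp(-((age / eta) ** beta))


def conditional_failure_probability(age, interval, beta, eta):
    """Probability of failing within the next interval, given survival to age."""
    survive_now = weibull_reliability(age, beta, eta)
    survive_later = weibull_reliability(age + interval, beta, eta)
    return 1.0 - survive_later / survive_now


def expected_failures(ages, interval, beta, eta):
    """Sum of conditional failure probabilities over all units in service."""
    return sum(conditional_failure_probability(a, interval, beta, eta) for a in ages)


fleet_ages = [200.0, 800.0, 1500.0, 2500.0]  # hypothetical unit ages, hours

# Constant failure rate (beta = 1): unit age does not matter.
print(round(expected_failures(fleet_ages, 500.0, beta=1.0, eta=3000.0), 3))
# Wearout (beta = 3): the older units dominate the forecast.
print(round(expected_failures(fleet_ages, 500.0, beta=3.0, eta=3000.0), 3))
```

With beta equal to 1 every unit contributes the same amount, which is why the constant failure rate case is the straightforward one; with wearout the forecast depends on the age mix of the fleet.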
A failure is an event that renders equipment as non-useful for the intended or
specified purpose during a designated time interval. The failure can be sudden, partial, or
one-shot, intermittent, gradual, complete, or catastrophic. The degree of failure can be degradation or
gradual, sudden, or one-shot, from weakness, from imperfection