Reliability Tools:
Reliability tools exist by the dozens: what are the tools, why use them, when
should you use them, and where should you use them? The entries below give brief answers.
The details about these tools are brief, as books are written about each item.
Think of the presentations below as hors d’oeuvres (a little snack food
or starters), not the main course.
The most important reliability tool is a Pareto distribution based on money,
specifically on the cost of unreliability, which directs attention to the most
important money problem first. No magic bullet exists for reliability issues.
Don’t waste your time looking for a single magic tool; none exists!
Accelerated Life Testing (ALT)-
What: A time-based test method that increases loads to quickly produce
age-to-failure data. Results from only a few data points are then scaled (by
acceleration factors) to reflect normal loads. The loads can be constant
or step-stress conditions.
Why: Accelerated testing saves time and money while quantifying the
relationships between stress and performance and identifying design and
manufacturing deficiencies. It provides useful data quickly and at low cost, and
it determines the product’s strength limits by applying stresses high enough to
stimulate failures.
When: Usually performed during the development of devices, components, or
systems. It also applies to items that have been in service, to obtain a metric
showing how the item performs under heavy loads. Accelerated testing is a
useful method for solving old, nagging problems within a production process.
Where: Used for correlating test results with real-life conditions via the
acceleration factors, generally with application of multiple stresses/cycles and
using 5 or more failures.
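The scaling step can be sketched in a few lines, assuming a single constant acceleration factor that multiplies every accelerated age-to-failure. The function name, the five ages, and the factor of 25 are hypothetical; a real factor would come from a stress-life model fitted to the test data.

```python
def scale_to_normal_load(accelerated_ages, acceleration_factor):
    """Scale ages-to-failure seen under high load to equivalent normal-load ages."""
    if acceleration_factor < 1:
        raise ValueError("a factor below 1 would not accelerate the test")
    return [age * acceleration_factor for age in accelerated_ages]


if __name__ == "__main__":
    # Hypothetical test: five failures (hours) under an overstress condition.
    ages_under_test = [120, 185, 240, 310, 420]
    assumed_factor = 25  # assumption for illustration only
    print(scale_to_normal_load(ages_under_test, assumed_factor))
```

The scaled ages can then be analyzed like any other age-to-failure data set.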
Availability-
What: A tool for measuring the percent of time an item or system is in a state
of readiness, where it is operable and can be committed to use when called upon.
Availability ceases because of a downing event that causes the item/system to
become unavailable to initiate a mission when called upon. In the simplest view
the metric is availability = uptime/(uptime + downtime). For many other
definitions see MIL-HDBK-338, section 5.
Why: The
measure is important for knowing the commitment of time for performing the
mission and it usually only involves the use of arithmetic.
When: Often
the measurement tool is based on past experiences and the complement of the
measurement tool addresses unavailability to perform the task.
Where: In
design of a system it is a calculated value and in operation of a system it is
a performance index that is often easy to use and provides an index that is
understandable to the average person.
Today there is a great tendency to “Enronize” availability metrics by using
uptime metrics that present data in the best light (an issue of data integrity)
to maximize managerial bonuses by excusing (deducting) downtime from the
calculations to put “lipstick on the pig”. Use the KISS principle. Think of
availability in terms of the investor’s typical year of 8760 hours. The
no-excuse annual metric is availability = (uptime in hours)/8760. Suddenly
you’ll find a metric of great interest to investors that can be benchmarked as
a financial issue and that motivates the management team to solve real issues
of importance to the business. Please note that you can have high availability
but many failures, and thus low reliability, because availability ≠ reliability.
Likewise, you can have high availability but little output, so team the metric
with effectiveness to get the complete story.
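The two availability calculations above can be compared side by side. This is a minimal sketch; the function names and the plant hours are hypothetical and only illustrate how excusing downtime flatters the reported number.

```python
HOURS_PER_YEAR = 8760  # the investor's typical year used in the text


def availability(uptime_hours, downtime_hours):
    """Simplest view: uptime / (uptime + downtime)."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("uptime + downtime must be positive")
    return uptime_hours / total


def no_excuse_availability(uptime_hours):
    """No-excuse annual metric: uptime / 8760, with no downtime deducted."""
    return uptime_hours / HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical plant: 8000 h running and 400 h of reported downtime;
    # the remaining 360 h of the year were "excused" from the report.
    print(f"reported:  {availability(8000, 400):.3f}")
    print(f"no-excuse: {no_excuse_availability(8000):.3f}")
```

The gap between the two figures is the excused downtime that the 8760-hour basis refuses to hide.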
Bathtub Curve-
What: The concept is derived from the human life experience of infant mortality,
chance failures, and a wearout period of life, for which birth and death data
are accumulated by government agencies. Most equipment lacks birth/death
recording by government agencies, and most non-human systems can be regenerated
to live and die many times before relegation to the scrap heap.
Why: Failure rates differ for both people and equipment at different phases of
life, and the medicine applied to humans or equipment must suit the phase to
treat the roots of the problem effectively.
When: The concept is useful during design, operation, and maintenance of
equipment and systems to understand the failure mechanisms.
Where: It relates equipment/system failures to the human experiences of
ordinary people in real life so as to coordinate the design, operation, and
maintenance of equipment. For other definitions see MIL-HDBK-338,
section 9.
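One common way to put numbers on the three life phases is the Weibull hazard (instantaneous failure rate) function. The sketch below assumes a Weibull model, which this entry does not itself prescribe; the shape and scale values are hypothetical.

```python
def weibull_hazard(t, beta, eta):
    """Instantaneous failure rate h(t) = (beta/eta) * (t/eta)**(beta - 1)."""
    if t <= 0 or beta <= 0 or eta <= 0:
        raise ValueError("t, beta, and eta must be positive")
    return (beta / eta) * (t / eta) ** (beta - 1)


if __name__ == "__main__":
    eta = 1000.0  # hypothetical characteristic life, hours
    phases = {
        "infant mortality (beta < 1)": 0.5,
        "chance failures (beta = 1)": 1.0,
        "wearout (beta > 1)": 3.0,
    }
    for label, beta in phases.items():
        early = weibull_hazard(100.0, beta, eta)
        late = weibull_hazard(500.0, beta, eta)
        print(f"{label}: h(100) = {early:.6f}, h(500) = {late:.6f}")
```

A shape factor below 1 gives a falling rate, exactly 1 gives a constant rate, and above 1 gives a rising rate.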
Block Diagram Model (same as Reliability Block Diagram Models)-
What: Reliability block diagram (RBD) models are graphical representations of a
calculation methodology for system reliability.
Why: RBD models allow calculation of system reliability from known or assumed
failure details of the components, starting with the smallest component and
growing the model up to the largest system to predict performance from the
elements.
When: RBDs are used in upfront designs as a performance parameter, and after the
system is constructed to ferret out poorly performing blocks that limit system
performance.
Where: Frequently used as a trade-off tool to search for the lowest long-term
cost of ownership and to help sell alternative courses of action for moderating
the effects of reliability issues or overcoming poor performance by alternative
designs. The results can be calculated before building the system, and they
provide knowledge about availability, maintenance interventions required for
failures, and the number of spare parts required to sustain operations. For
other definitions see MIL-HDBK-338, sections 4 and 6.
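The arithmetic behind a simple RBD can be sketched as follows, assuming independent blocks and only series and active-parallel arrangements. The function names and block reliabilities are hypothetical.

```python
from math import prod


def series_reliability(block_reliabilities):
    """All blocks must survive: multiply the block reliabilities."""
    return prod(block_reliabilities)


def parallel_reliability(block_reliabilities):
    """At least one block must survive: 1 minus the product of unreliabilities."""
    return 1.0 - prod(1.0 - r for r in block_reliabilities)


if __name__ == "__main__":
    # Hypothetical system: feed section -> two redundant pumps -> reactor.
    pumps = parallel_reliability([0.90, 0.90])
    system = series_reliability([0.95, pumps, 0.98])
    print(f"pump pair: {pumps:.4f}, system: {system:.5f}")
```

Growing the model means nesting these two functions, from the smallest component up to the whole system.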
Capability-
What: A measure of how well the product performance meets objectives. In short,
how well are the outputs actually accomplished against a standard? Capability
is frequently the product of efficiency * utilization.
Why: Capability
is a component of the effectiveness equation and
usually under the control of production.
When: Data
for this metric is frequently produced by the accounting department each month
as a segment of the financial reports for the purpose of handling variances
against the standards.
Where: Frequently in the effectiveness measure it is a weak point (as a measure
of how well the production process does the job for which it was purchased)
requiring substantial improvement that cannot be solved by the usual
reliability and maintainability (
Configuration Control-
What: Configuration control is the management of change by providing
traceability of failures back into the design standard. If the design details
are not specified, the design will not contain the requirements, and
implementation of the project, from conceptual design to operating facility,
will be hit or miss for achieving the desired end results.
Why: With active configuration control you know where items are used and
contained, where and why they were installed, where signals originate, what
items are used where and in what environments, what drawing revisions have
occurred, whether the product conforms to the drawings and specifications, what
alternate materials/components have been used, and what test
reports/certifications are available as original documents for review.
When: Configuration
control begins after the first design review to build an unbroken chain of
traceability to aid in avoiding surprises in the field which would destroy the
designed-in criteria for availability, reliability, maintainability, and cost
effectiveness established as a portion of the original design criteria.
Where: Frequently
these documentation details are assembled into a dossier with third party
witnessing for use in validating conformance to the design requirements and
provided to the owner of the equipment as witness documents.
Contracting For Reliability-
What: Tell your vendors what you want, and want
what you say. Provide explanations of
the objectives in written contracts in terms the vendors will understand.
Why: If you
can’t clearly spell out the requirements for availability, reliability, and
maintainability the contractors cannot make these issues features of the
design. Thus, it is important to be
specific in the features the design must manifest. Explanations such as: “You know what I want
and what I need, just do it quickly” are self-defeating expressions of vague
generalities that lead to inferior designs and constant arguments. Be specific about requirements for building reliability block diagrams, using quality function deployment, performing failure
mode and effects analysis, conducting fault tree analysis,
and finally, conducting design
reviews for reliability.
When: Write
the specifications before procurement begins.
Plan to spend time with your own purchasing department to explain the
details and sell the team on the financial advantages for including reliability
requirements into the specifications.
Likewise, spend time selling your vendors on the requirements and why
they are stated.
Where: These
are up front decisions to avoid replication of previous problems that were built
into previous designs and never corrected.
Cost Of Unreliability-
What: The
cost of unreliability is a big-picture view of system failure costs,
described in annual terms, for a manufacturing plant as if the key elements
were reduced to a series block diagram for simplicity. It looks at the production system and reduces
the complexity to a simple series system where failure of a single
item/equipment/system/processing-complex causes the loss of productive output along
with the total cost incurred for the failure.
If the system IS sold out, then the cost of unreliability must
include all appropriate business costs such as lost gross margin plus repair
costs, scrap incurred, etc. If the
system is NOT sold out, and make-up time is available in the financial year,
then lost gross margin for the failure cannot be counted. The cost of unreliability is a
management concern connected to management’s two favorite metrics: time and
money.
Why: In private enterprise, failures must be considered from a financial
viewpoint and not by a “gear-head” approach of simply counting the number of
failures; you must also speak the language of the enterprise, which describes
events by monetary measures over a period of time. The annual cost of failures
is usually not stated in a clear-cut manner, nor are failure costs summarized by
system/sub-system to identify the weak links in monetary terms. Building a clear
Pareto distribution of the annual cost of unreliability lets you attack the
vital (high-cost) areas with an action plan to reduce failures (unreliability)
and the cost of unreliability.
When: For a new plant, this can be a design criterion to limit costs of
unreliability for competitive reasons in the marketplace. You must make the
hidden costs of failures obvious as a portion of the strategic plan. For an
existing plant, this can be an exercise in defining the cost of unreliability
and building a long-term plan to reduce the cost of failures as a portion of the
tactical plan.
Where: This
activity is best performed with high-level involvement of the management team
to provide fundamental understanding of the size of the icebergs about to rip
out the underbelly of the plant and to involve the organization in a plan to
reduce the costs so that profits are pushed upward because of the
improvements. If the cost of unreliability
cannot be reduced, then the costs become extra weight for the saddlebags in the
race for survival.
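The sold-out rule and the Pareto ranking can be sketched together. This is a simplified illustration that counts only repair cost and lost gross margin (scrap and other business costs are left out); every name and number is hypothetical.

```python
def annual_cost_of_unreliability(failures_per_year, hours_lost_per_failure,
                                 repair_cost_per_failure, gross_margin_per_hour,
                                 sold_out):
    """Annual failure cost for one block of the series model."""
    repair_cost = failures_per_year * repair_cost_per_failure
    lost_hours = failures_per_year * hours_lost_per_failure
    # Lost gross margin counts only when the plant is sold out,
    # i.e. no make-up time is available in the financial year.
    lost_margin = lost_hours * gross_margin_per_hour if sold_out else 0.0
    return repair_cost + lost_margin


def pareto_ranking(cost_by_block):
    """Sort blocks from highest to lowest annual cost."""
    return sorted(cost_by_block.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    margin = 2000.0  # hypothetical gross margin, $/hour
    costs = {
        "compressor": annual_cost_of_unreliability(4, 10, 5000, margin, True),
        "pumps": annual_cost_of_unreliability(12, 2, 1500, margin, True),
        "controls": annual_cost_of_unreliability(6, 1, 800, margin, True),
    }
    for block, cost in pareto_ranking(costs):
        print(f"{block:12s} ${cost:,.0f}")
```

Running the same numbers with sold_out set to False shows how much of the total is lost margin rather than repair cost.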
Critical Items List-
What: The
critical items list is a top-level summary of problems/cost used for
discussions with management about key reliability issues. The summary list converts technical details
to a summary of costs and time while placing the issues into a Pareto
distribution explained in terms of money and the vital few problems to be
solved for competitive reasons.
Why: The
purpose of the critical items list is to focus management’s attention on items
that need to be resolved during the design phase as a corrective action loop
for influencing the lifetime costs.
When: The
list starts with the first design review as issues are disclosed in design
reviews for reliability.
Where: The
critical items list is presented to top-level management as issues to be
accepted or resolved before paper plans become steel and concrete.
Data-
What: Data is the informational
energy that runs the reliability improvement machine. Data is acquired at great cost. Data needs to be retained and used to prevent
future failure events. Proper use of
data provides an understanding of failure mechanisms and prevents reoccurrence
of bad events that cause safety or high-cost failures to occur. Reliability data requires definition of a
failure. Failures can be catastrophic
failures or slow degradation; you decide by defining the failures. The units of measure for the data must be
the units of the degradation: sometimes hours, sometimes miles, and so forth. In short, whatever motivates the
failure. Reliability always ceases with
a failure or a removal from service in some aged condition that then generates
a category of data called a suspension or censored data. Data is information in the form of facts,
figures, or engineering databases that is obtained from engineering tests,
experiments, or actual operating conditions.
Reliability data is often incomplete as the exact times to failure are
rarely known or recorded with much precision so that only partial information
is available for analysis. Reliability
data comes in two forms: 1) age-to-failure data, and 2) censored/suspended data
such as occurs when unfailed items are removed from
service or when they fail due to a different failure mode than we are
studying—this is useful information and part of the data set. Some data is better than no data for
resolving reliability issues.
Why: Data
is the information that, when used in an informed manner, helps prevent
repetition of bad history and allows an enlightened approach to rationally
solving a reliability issue using facts and figures. Intelligent use of data for reliability
issues provides the objective evidence needed for helping to solve the root
cause of failures.
When: Databases of reliability information from past experience are very
helpful for predicting future failure events. The data is helpful if described
as failure rates or as their reciprocals, mean times to failure, which reduce
the information to an average failure rate or an average time to failure. The
reliability data is particularly valuable if retained for components as a
Weibull database with shape factor beta and scale factor eta.
Where: The
data is useful for understanding failure modes, for predicting future failures
for a population of equipment during the design stage, and for predicting
future failures with subsequent increases in the aging of equipment. The role of the reliability engineer is to
acquire the failure data and convert the data into useful information for both
current and future use.
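A Weibull database of the kind mentioned above can be as small as a table of (beta, eta) pairs per component. The sketch below uses the standard two-parameter Weibull reliability and mean-life formulas; the component names and parameter values are hypothetical.

```python
import math


def weibull_reliability(t, beta, eta):
    """Probability of surviving to age t: R(t) = exp(-(t/eta)**beta)."""
    if t < 0 or beta <= 0 or eta <= 0:
        raise ValueError("t must be >= 0 and beta, eta must be positive")
    return math.exp(-((t / eta) ** beta))


def weibull_mean_life(beta, eta):
    """Mean time to failure: eta * Gamma(1 + 1/beta)."""
    return eta * math.gamma(1.0 + 1.0 / beta)


# Hypothetical database entries: component -> (beta, eta in hours).
WEIBULL_DB = {
    "bearing": (1.3, 40000.0),
    "mechanical seal": (0.8, 25000.0),
    "coupling": (2.0, 75000.0),
}

if __name__ == "__main__":
    mission_hours = 8760.0
    for name, (beta, eta) in WEIBULL_DB.items():
        r = weibull_reliability(mission_hours, beta, eta)
        print(f"{name:16s} R({mission_hours:.0f} h) = {r:.3f}, "
              f"mean life = {weibull_mean_life(beta, eta):,.0f} h")
```

Keeping beta and eta rather than a single average preserves the failure mode information that a mean time alone throws away.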
Decision Trees-
What: Most business decisions carry considerable uncertainty, which implies at
least two outcomes if you choose a course of action. Making decisions in the
face of uncertainty requires the cost of taking action and its probability,
along with the cost of not taking action and the probability of the occurrence.
In most cases the probabilities are not well known (maybe to one significant
digit) and the costs are not well known (maybe to $10000). The quantitative
assessment is called risk assessment. The issue is to take these poorly
identified quantities and devise a strategy that minimizes exposure to risk for
the business. Decision trees are graphical representations of a methodology for
reaching the expected values for the decision so as to take or not take action.
Why: Most
business decisions have no exact answers, i.e., no black and white answers but
rather shades of gray. The use of the
tool is to help decide which course of action may be to the advantage of the
business given the best estimates that can be made.
When: Decisive details will only be known in the future, but decisions have to
be made today, so decision trees are tools to help span wisely from today into
the future with the best decisions that can be made from sketchy data.
Where: If
you have absolute data, use it. Most
decisions must be made with indecisive information that requires decisions
about the odds for a given event, usually based on estimates—the wiser the
estimate the better the decision, taking into account the probabilities of the
outcomes and the money involved in the decision. Use this tool when few details are available
and you must be the pioneer to cut through the forest to reach the promised
land of opportunity and profitable ventures.
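The expected-value arithmetic of a one-level decision tree can be sketched as below. The probabilities are deliberately rough (one significant digit) and, like the costs and names, are hypothetical.

```python
def expected_cost(branches):
    """Expected value of one alternative.

    branches is a list of (probability, cost) pairs covering every outcome.
    """
    total_probability = sum(p for p, _ in branches)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("branch probabilities must sum to 1")
    return sum(p * cost for p, cost in branches)


def best_alternative(alternatives):
    """Return (name, expected cost) of the lowest expected-cost alternative."""
    scored = {name: expected_cost(b) for name, b in alternatives.items()}
    name = min(scored, key=scored.get)
    return name, scored[name]


if __name__ == "__main__":
    action_cost = 20_000    # hypothetical cost of taking action
    failure_cost = 100_000  # hypothetical cost of the failure event
    alternatives = {
        # Acting costs money up front but lowers the chance of failure.
        "take action": [(0.1, action_cost + failure_cost), (0.9, action_cost)],
        "no action": [(0.4, failure_cost), (0.6, 0)],
    }
    print(best_alternative(alternatives))
```

Because the inputs are estimates, it is worth rerunning the comparison with the probabilities nudged up and down to see whether the preferred branch changes.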
Dependability-
What: The International Electrotechnical Commission (IEC)
defines dependability as “Dependability describes the availability
performance and its influencing factors: reliability
performance, maintainability performance and maintenance support performance.” MIL-HDBK-338
defines dependability differently, as a measure of the degree to which an item
is operable and capable of performing its required function at any (random)
time during a specified mission profile, given that the item is available at mission
start. (Item state during a mission
includes the combined effects of the mission-related system R&M parameters
but excludes non-mission time; see availability.) Dependability is related to reliability with
the intention that dependability would be a more general concept than the
measurable issues of reliability, maintainability, and maintenance.
Why: The key dependability issue is to make equipment and processes work as
advertised, that is, without failure. Dependability aims at facilitating
cooperation by all parties concerned (supplier, organization, and customer) by
fostering an understanding of the dependability needs and value to achieve the
overall dependability objectives, so it involves harmonizing conflicting issues.
Dependability is better viewed from the end user’s viewpoint than from the
designer’s or the maintainer’s viewpoint. From a system-effectiveness
viewpoint, reliability and maintainability provide system availability and
dependability.
When: You cannot repair yourself to happiness with a failure-prone system, as
the failure-prone system will be viewed as lacking the dependability to function
as required when you need it. Thus, dependability is viewed over the longer term
and not in convenient snapshots, and dependability also involves lifecycle cost
issues.
Where: Reliability contributes directly to uptime by avoiding failures, whereas
maintainability contributes directly to reducing downtime by faster repairs.
Thus, reliability and maintainability jointly determine the dependability of the
system. Dependable systems must be ready to function, in an operable state, to
produce the desired output, upon demand by the end user, at the specified
quantity and quality of output.
Design Reviews For Reliability-
What: Specific
questions to ask of design engineers during a review specifically for
reliability using failure data from operations and maintenance are: 1) Show the
calculated availability for the system based on a RAM model, 2) Show the calculated number
of failures during the specified mission time between turnarounds based on a
reliability and maintainability (
Why: Design reviews should demonstrate by calculation, or through the use of
models and reliability tools, that the system is capable of achieving the design
objectives rather than making a giant leap of faith that all will be well and
good. Problems found in the design review for reliability are corrected less
expensively on paper than when corrections must be made in the field with
hardware.
When: Design
reviews for reliability should be a part of the design process starting with
conceptual designs and ending when the drawings are revised for the as-built
system.
Where: This is a logical extension of the design process to show, rather than
tell, how the system will function. It is performed as a portion of the
up-front, design-by-the-numbers process.
Effectiveness Equation-
What: The potential or actual probability of a system to perform a mission at a
given level of performance under specified operating conditions. Effectiveness
is defined as the product reliability * availability * maintainability *
capability (dependability is often defined as reliability * maintainability),
and all values in the product are between 0 and 1. Many variants of the
effectiveness equation exist, e.g., OEE and others. See a parallel comparison
with system effectiveness based on the productive output results of process
reliability calculations.
Why: The effectiveness equation defines the ability of a product, operating
under specified conditions, to meet operational demands when called upon. This
is a practical measure of how well the system is actually performing, not how
well we want it to perform. Since all the elements are measured between 0 and 1,
the elements of the equation quickly draw the eye to where opportunities exist
for making improvements.
When: The effectiveness equation is useful for trade-off boxes comparing various
alternatives when plotted on an X-Y scale of effectiveness vs net present value
(NPV) to show improvement alternatives. For the elements:
reliability defines the probability of a failure-free interval (or the
complement, unreliability, which describes the probability of failure);
availability defines the probability of the system being up and alive to handle
the demand (or the complement, unavailability, which describes the probability
of the system being down);
maintainability defines the probability of making repairs within the allowed
repair standard;
capability defines the probability of production achieving the desired
production results (a measure of how well the product performs compared to the
standard). Frequently it is described as the product of efficiency * utilization,
where
efficiency is an output/input relationship such as (output achieved)/(the
standard required) and
utilization is how time is used, such as (direct labor)/(direct labor + labor
lost).
[In the old days, if this index decreased to as low as 80% we went berserk.
Today, you can’t get this high because of wasted time when noses are not to the
grindstone!]
Where: It is used to describe the performance of both new systems and old
systems. Consider this example for effectiveness: if we are comparing a
heavy-duty truck versus a sports car for transportation, the truck may be more
effective for heavy loads whereas the sports car may be more effective for
acceleration and high speeds. Neither is defined by the effectiveness equation
until the mission is defined. The effectiveness index is converted into output
quantities by use of the process reliability technique for quantifying the
productive plant and the non-productive hidden plant based on a pragmatic
definition of nameplate capacity for the plant.
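The effectiveness product and its capability term can be sketched as follows. The function names and the element values are hypothetical; the point is only that the smallest element shows where improvement effort pays off first.

```python
def capability(efficiency, utilization):
    """Capability as the product efficiency * utilization."""
    return efficiency * utilization


def effectiveness(reliability, availability, maintainability, capability_value):
    """Product of the four elements, each required to lie between 0 and 1."""
    elements = {
        "reliability": reliability,
        "availability": availability,
        "maintainability": maintainability,
        "capability": capability_value,
    }
    result = 1.0
    for name, value in elements.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
        result *= value
    return result


def weakest_element(**elements):
    """Name of the smallest element, i.e. the first improvement opportunity."""
    return min(elements, key=elements.get)


if __name__ == "__main__":
    cap = capability(efficiency=0.85, utilization=0.80)  # hypothetical values
    print(f"effectiveness = {effectiveness(0.95, 0.98, 0.90, cap):.4f}")
    print(weakest_element(reliability=0.95, availability=0.98,
                          maintainability=0.90, capability=cap))
```

In this made-up case capability is the weak point, which matches the remark that it often cannot be fixed by reliability and maintainability work alone.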
What: Electronic
components are everywhere, and they are getting smaller and more complex by the
year! They are becoming a larger part of
modern society every day. As a class,
they are particularly susceptible to increased failures from temperature,
vibration, and shock loading which destroys reliability.
Why: Most electronic devices are small and delicate. Inherent failure rates are
often built into the device by the manufacturing process (similar to building in
human genetic defects), and you cannot find the inherent defects until the
components are stressed. The best remedy for achieving high reliability in
electronic devices is to start with high-quality, durable devices built on a
failure-free process, load the devices only moderately, and carefully control
the environment to suit the needs of the electronic component.
When: Burn-in tests, of different degrees of severity, are imposed following
assembly of the system to weed out the inherent defects by adding stresses due
to temperature, vibration, and shock loadings that cause the weak units to fail.
Other accelerated tests for electronic devices include ESS, HALT, and HASS.
Where: The usual failure distribution for electronic systems is considered to be
the exponential distribution, although some electronic devices such as SCRs
often display a decreasing failure rate, described as an infant mortality
failure mode by Weibull analysis, and some electronic devices such as
electrolytic capacitors and EPROMs have an increasing failure rate, described as
a wearout failure mode. Many electronic failure rates and electronic models are
available in MIL-HDBK-217 and its successor PRISM.
Environmental Stress Screening (ESS)-
What: A series of screens is conducted under environmental stresses to disclose
weak parts and workmanship defects that require correction, and this requires an
understanding of burn-in testing and
Why: The
extremes of operating conditions such as high power levels, high temperatures,
high vibration levels, etc. produce failures not anticipated from testing at
nominal conditions. Generally,
When: When acquiring data, the tests are done ahead of production. When
controlling early failures that would otherwise be discovered by the end user,
these tests are done as a portion of the production process to eliminate weak
units, control warranty costs, and improve customer satisfaction.
Where: Some
tests are conducted in the laboratory for quick results and then the data is
used to control product testing/release for the purpose of limiting costs and
preventing the loss of customers from unsatisfactory performance in the field.
Events/Incidents-
What: Events/incidents are single occurrences, especially particularly
significant ones, that for reliability purposes result in a failure from a
non-aging mechanism. Usually the event/incident results in the serious
consequence of loss of functional life of a component or system. The death of
the device must be recorded as censored (suspended) data.
Why: For reliability purposes, the component, device, subassembly, or system
was a success up to the point in life where a failure from a non-aging event
took place. This means the age at the event was a success (up to the point the
item was killed by an event/incident), and inclusion of the data is required as
censored/suspended data. This is important data.
When: Include the suspended/censored data in every analysis. Young suspensions
have little impact on the results of an analysis, but old suspensions have a
major effect.
Where: The
data is used for MTBF/MTTF analysis and particularly for Weibull analysis.
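The effect of suspensions on a mean-time estimate can be sketched with the usual constant-failure-rate estimator: total accumulated age of all items divided by the number of failures. The constant-rate assumption and the ages below are illustrative only; a Weibull analysis treats suspensions differently.

```python
def mttf_with_suspensions(failure_ages, suspension_ages):
    """Mean time to failure = (all accumulated age) / (number of failures).

    Suspended (censored) items add their age to the numerator but are not
    counted as failures in the denominator.
    """
    if not failure_ages:
        raise ValueError("at least one failure is needed for an estimate")
    total_age = sum(failure_ages) + sum(suspension_ages)
    return total_age / len(failure_ages)


if __name__ == "__main__":
    failures = [1000, 1500, 2000]  # hypothetical ages at failure, hours
    print(mttf_with_suspensions(failures, []))      # failures only
    print(mttf_with_suspensions(failures, [50]))    # one young suspension
    print(mttf_with_suspensions(failures, [3000]))  # one old suspension
```

The young suspension barely moves the estimate, while the old one raises it sharply, which is why leaving suspensions out understates the life of the equipment.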
Exponential Distribution-
What: The exponential distribution describes the probability of survival and of
failure of components or equipment under the condition of chance failure, which
means a constant instantaneous failure rate where the die-off rate is the same
for any surviving (unfailed) population. An old part is as good as a new part.
For any survivors in this memory-less system that have survived to time t, a
certain percent of the survivors will die in a specified interval of time such
as 2*t. The reliability of the system is often described by the exponential
distribution because many times a system is made up of mixed failure modes that
in the aggregate function like a constant failure rate system. The reliability
of the exponential distribution is described mathematically as
R(t) = e^(-λt) = e^(-t/Θ), where t is the mission time, λ is the failure rate,
and Θ is the mean time, given that λ = 1/Θ. The exponential distribution is
frequently used as a first approximation to describe reliability based on a
simple failure rate or a simple mean time to failure, particularly if the system
or component has multiple failure modes.
Why: The constant hazard rate, λ, is usually a result of combining many failure
rates into a single number.
When: The exponential distribution is frequently used for reliability
calculations as a first cut, based on its simplicity, to generate the first
estimate of reliability when more details about failure modes are not
described.
Where: In
electronic systems (which can have many different types of failure modes,
especially since any electrical/electronic system is an amalgam of many
different components) the simple assumption is that the electrical/electronic
package will have a constant failure rate system defined by the exponential
distribution. When in doubt about the
failure mechanisms, it is common to assume use of the exponential distribution
with its constant failure rate for simplicity.
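The exponential reliability formula and its memory-less property can be checked numerically. This is a minimal sketch; the function names and the failure rate of 0.001 per hour are hypothetical.

```python
import math


def exponential_reliability(t, failure_rate):
    """R(t) = exp(-failure_rate * t) for a constant failure rate."""
    if t < 0 or failure_rate < 0:
        raise ValueError("t and failure_rate must be non-negative")
    return math.exp(-failure_rate * t)


def reliability_from_mean_time(t, mean_time):
    """Same formula written with the mean time: R(t) = exp(-t / mean_time)."""
    if mean_time <= 0:
        raise ValueError("mean_time must be positive")
    return exponential_reliability(t, 1.0 / mean_time)


if __name__ == "__main__":
    rate = 0.001  # hypothetical failures per hour, i.e. mean time 1000 h
    print(f"R(1000 h) = {exponential_reliability(1000, rate):.4f}")
    # Memory-less check: surviving 300 h equals surviving 100 h and then
    # a further 200 h, because an old part is as good as a new part.
    combined = exponential_reliability(100, rate) * exponential_reliability(200, rate)
    print(math.isclose(exponential_reliability(300, rate), combined))
```

Note that a mission as long as the mean time leaves only about 37% of the population surviving, so a mean time is not a guaranteed life.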
Failure-
What: Failure is the loss of function when you needed the function to occur.
Failures for reliability purposes must be precisely defined so they are recorded
correctly. Much life data is incomplete because failures are mixed with
censored/suspended data, where aged items may not have failed, or were removed
from service before failure, or have not yet failed in the mode of failure under
study. In short, these censored/suspended items represent successes and are a
portion of the data set for study.
Why: We study
failed items for the same reason we do autopsies on humans—we want the data and
we want it categorized correctly for making important decisions. Failures require: 1) a time origin that must
be unambiguously defined, 2) a scale for measuring the passage of
time/starts/stops/etc. which motivates failure, and 3) the meaning of failure
must be entirely clear for recording the event.
When: Failure data must be recorded as it occurs to prevent loss of information.
Failure causes involve design, manufacturing, assembly, installation, or use
issues that consume life and motivate failures by misuse, inherent weakness, or
wearout.
Failure modes describe the effects by which a failure is observed: early
failures where failure rates decline with usage (infant mortality), failure
rates that are constant with usage (chance failures, the usual mid-life constant
failure rate), and failure rates that increase with usage (wearout).
Failure mechanisms describe the physical, chemical, metallurgical, or other
processes which motivate the failures.
Failure criteria are the basis for registering the gravity of a failure.
Temporary changes in the failure state, including the duration of the failure,
sometimes have an important bearing on how a failure is recorded. The two
largest classifications are complete failure (can’t complete the intended
function) and partial failure (not a complete failure, but deficient in
providing all features of the intended function to a level that is noticeable
and undesirable).
Failure onset can be gradual (monitoring is intended to anticipate pending
failure), intermittent (failure occurs in some magnitude but recovers to
complete the intended function), or sudden (surprise events that cannot be
anticipated with prior examination or monitoring).
Failure consequences can also be categorized as critical failures (significant
damage and/or injury to people occurs), major failures (less severe than a
critical failure but of such a magnitude as to substantially reduce the required
function), minor failures (reduce the performance of the asset but cause only
minor consequences for the entire system), and benign failures (failures known
and observed by an expert but not detected by a novice).
Where: The
CMMS system is frequently where most data resides but usually in crude
fashion. The failure data is often transferred
into the FRACAS system for converting the
symptoms of the failure into the root causes of failure. The failure data must be converted into
action items for making management decisions about future failures and the
corrective action needed.
What: Failure
forecasting is a projection of failures into the future based on assumed or
documented failure details. It is also
known as risk analysis of future failures.
For a constant failure mode system this is very straightforward. However, for complicated failure modes where
the failure rate increases with time (wearout failure
modes) or where failure rates decrease with time (infant-mortality failure
modes), this becomes a more complicated analysis known as the Abernethy
Risk, which is described in The
New Weibull Handbook and implemented in the software package SuperSMITH Weibull for predicting
future failures. Likewise, reliability
block diagrams are useful for predicting future failures when the authentic
failure details are supplied to the
Please note manufacturers follow two general strategies for their equipment:
1)
build the equipment to avoid failures even though this increases the original
capital costs, or
2)
build equipment and sell the original equipment at a low cost (or even a
break-even cost),
expecting to make profits with the sale of
replacement parts.
Thus for end users of the procured equipment, it is important to know the
forecasted failures in the face of supplier protests that “our equipment never
fails”—in that case, ask to see the sale of spare parts for similar equipment
and an estimate of the number of units working to get a crude estimate of the
strategy employed by the equipment supplier.
A failure is an event that renders equipment non-useful for the intended or
specified purpose during a designated time interval. The failure can be sudden, partial,
one-shot, intermittent, gradual, complete, or catastrophic. The degree of failure can range from
gradual degradation to a sudden, one-shot event, arising from weakness, from imperfections, from misuse,
and so forth.
A failure mechanism includes a variety of physical processes that result in
failure from chemical, electrical, thermal, or other insults.
Why: Future
failures cost money and frequently increase the risk for safety or
environmental problems. For manufacturers,
the forecasted failures predict impending high costs for warranty expenses
which can make/break a company. With
good failure forecasts, you can anticipate expected failures now (after
x-usage), future failures when failed units are not replaced, and future
failures when failed units are replaced either with the same failure modes or
with differently designed components with different failure details.
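The forecast arithmetic can be sketched as follows. This is a minimal sketch, assuming a two-parameter Weibull model whose shape (beta) and characteristic life (eta) are already known. The fleet ages, parameter values, and function names are hypothetical, and the summations are the commonly cited form of the risk calculation, not a reproduction of any software package.

```python
import math

def weibull_cdf(t, beta, eta):
    """Probability that a unit has failed by age t (two-parameter Weibull)."""
    return 1.0 - math.exp(-((t / eta) ** beta))

def expected_failures_now(ages, beta, eta):
    """Expected number of failures already accumulated by the fleet."""
    return sum(weibull_cdf(t, beta, eta) for t in ages)

def expected_future_failures(ages, added_usage, beta, eta):
    """Expected failures over the next `added_usage`, failed units NOT replaced.

    Each unit contributes its conditional probability of failing before
    age t + added_usage, given that it has survived to age t.
    """
    total = 0.0
    for t in ages:
        f_now = weibull_cdf(t, beta, eta)
        f_later = weibull_cdf(t + added_usage, beta, eta)
        total += (f_later - f_now) / (1.0 - f_now)
    return total

# Hypothetical fleet: ages in operating hours, wearout mode (beta > 1).
fleet_ages = [500.0, 1000.0, 1500.0, 2000.0]
now = expected_failures_now(fleet_ages, beta=3.0, eta=4000.0)
future = expected_future_failures(fleet_ages, added_usage=1000.0,
                                  beta=3.0, eta=4000.0)
print(f"expected failures now: {now:.3f}")
print(f"expected failures in the next 1000 h: {future:.3f}")
```

The case where failed units are replaced needs the renewals added back into the fleet and is not covered by this sketch.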
When: This
analysis is wisely performed during the design of the equipment; however, many
surprises arise from different failure modes built into the assembled product
or incurred by unanticipated usage in operations.
Where: Generally
this analysis is made during the up-front design effort—with much disbelief the
products could be “this bad”. Follow-up
analysis occurs when unexpected failure modes arise during operation of the
equipment, which causes loss of service of the equipment and high costs for the
end users.
What: Failure
rates, in the simplest form, are Σ(number of failures)/Σ(time in use),
which is the reciprocal of the mean time to/between failures. For more sophisticated failure databases
such as Weibull databases, the failure rates can be disclosed without giving
away proprietary data such as the shape factor, beta, which tells the failure
mode for the equipment.
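The simple arithmetic can be sketched as below. The failure rate is taken here as failures per unit of time in use, so that it is the reciprocal of the mean time between failures. The operating hours and failure counts are hypothetical.

```python
def failure_rate(total_time_in_use, total_failures):
    """Failures per unit of time in use (e.g., failures/hour)."""
    return total_failures / total_time_in_use

def mean_time_between_failures(total_time_in_use, total_failures):
    """Time in use per failure (e.g., hours/failure)."""
    return total_time_in_use / total_failures

# Hypothetical records for three similar pumps.
hours_in_use = [8760.0, 8760.0, 4380.0]
failures = [2, 1, 1]

rate = failure_rate(sum(hours_in_use), sum(failures))
mtbf = mean_time_between_failures(sum(hours_in_use), sum(failures))
print(f"failure rate = {rate:.6f} failures/hour, MTBF = {mtbf:.0f} hours/failure")
```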
Why: Simple
failure rates are a precursor of maintenance events and production
interruptions that will occur into the future, which drive up costs and cause
chaos.
When: Failure
rates derive from the history of operation or from well-known data sources such
as OREDA, IEEE 500, IEEE 493, EPRI, and other sources listed in reading lists for reliability
including Weibull databases.
Where: The
failure rates are used as an awareness criterion for the average person, just as
you use automobile fuel consumption rates for understanding the health of your
automobile as well as anticipating your weekly/monthly/annual out-of-pocket
expenditures for gasoline or diesel fuel.
The failure rates drive the maintenance interventions, spare parts, and
maintenance cost for the maintenance department. Similarly, they predict the interruptions to
the process, lead to misses on promised deliveries, and result in negative
variances for production costs. In short,
failure rates are precursors for the misery expected for the organization.
What: Fault
tree analysis (FTA) is a top-down process of defining the top-level problems
and, through a deductive approach using parallel and series combinations of
possible malfunctions, finding the root of the problem and correcting it before the
failure occurs. The reliability tool can
be used as a qualitative or a quantitative method.
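For the quantitative use, the parallel and series combinations are commonly evaluated with AND and OR gates. The sketch below assumes the basic events are statistically independent, which common-cause failures would violate. The event names and probabilities are hypothetical.

```python
from math import prod

def and_gate(probabilities):
    """Output event occurs only if ALL input events occur (independent inputs)."""
    return prod(probabilities)

def or_gate(probabilities):
    """Output event occurs if ANY input event occurs (independent inputs)."""
    return 1.0 - prod(1.0 - p for p in probabilities)

# Hypothetical top event: loss of cooling flow, which happens when
# (pump A fails AND pump B fails) OR the control valve fails closed.
p_pump_a = 0.05
p_pump_b = 0.05
p_valve = 0.01

p_both_pumps = and_gate([p_pump_a, p_pump_b])
p_top = or_gate([p_both_pumps, p_valve])
print(f"P(both pumps fail) = {p_both_pumps:.4f}")
print(f"P(top event)       = {p_top:.6f}")
```

In this hypothetical tree the single valve dominates the redundant pumps, which is the kind of weak link the tree is meant to expose.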
Why: The
tool aids the design process, shows weak links that cause failures, and in the
critical legs of the trees, helps to define maintenance strategies for which
pieces of equipment and processes should be defended with the greatest
maintenance vigor to prevent “Murphy” from shutting down the process or causing
serious safety issues. The technique provides a graphical aid for the analysis
and it allows for many failure modes, including common-cause failures. Results from an FTA are usually more
pessimistic than those from other analysis tools such as RBDs, as you can see from a study of the
Space Shuttle reliability analysis where each system is studied by multiple
reliability tools because of the high cost/profile of failures.
When: FTA
is widely used in the design phase of nuclear power plants, subsea control and
distribution systems, and for oversight studies in layers of protection studies
for process safety and loss control in chemical plants and refineries so as to
prevent accidents and control the costs of risks. The technique is helpful for identifying
critical fault paths, observing vague failure combinations before they occur in
reality, comparing alternate designs for safety, and setting a methodology to
provide management with a tool to evaluate the overall hazards in a system and
avoid single sources of critical failures.
Finally when thinking top-down about failures and where/how they can
occur, the methodology gives a diagram for setting maintenance strategies for
protecting key pieces of equipment/processes to prevent failures.
Where: FTA
is helpful for defining potential event sequences and potential incidents,
evaluating the incident consequences of outcomes, and estimating the risks of
events occurring. FTAs work in the
design room and on the operating floor where firsthand knowledge has been
gained for preventing failures.
What: Failure
mode and effect analysis (FMEA) is the study of potential failures that might
occur in any part of a system to determine the probable effect of each failure
on all other parts of the system and on probable operational success. When criticality analysis is added for
sophisticated studies, the method is known as
FMECA (failure mode, effects, and criticality analysis). In the automotive world where
FMEA is a required portion of the quality systems, it is frequently known as
PFMEA for potential failure mode and effect analysis. The basic thrust of the analysis tool is to
prevent failures using a simple and cost-effective analysis that draws on the
collective information of the team to find problems and resolve them before
they occur.
Why: The
analysis is known as a bottom-up (inductive) approach to finding each
potential mode of failure and preventing failures that might occur for every
component of a system. It is also used for
determining the probable effects on system operation of each failure mode and,
in turn, on probable operational success, the results of which are ranked in
order of seriousness. FMEA can be
performed from different viewpoints such as safety, mission success,
availability, repair costs, failure modes, reliability reputation, production
processes, follow-on service, and so forth.
When: The
FMEA is most productive when performed during the design process to eliminate
potential failures. It can also be
performed on existing systems where operations personnel and maintainers are
made team members to add real-life experiences to educate the team in a
problem-solving forum that is constructive to eliminating existing problems.
Where: The
analysis can be conducted in the design room or on the shop floor and it is an excellent
tool for sharing experiences to make the team aware of details that are known
to one person but seldom shared with the team.
It is also an extremely productive tool for educating young engineers,
young maintainers, and young operators about details that they should be aware can
kill the system.
What: A failure
reporting and corrective action system (FRACAS) is an organized database for
aiding in solving reliability problems using a common-sense approach by
systematically and permanently removing failure mechanisms. Good historical data from this system can
populate a Weibull database.
Why: Use
data to solve problems by attacking root causes to reduce failures and make
reliability grow. Fixing failures
requires data, not opinions, so use the data acquisition system in a closed loop
to record, analyze, correct, and verify that improvements have been achieved. The first data reported is usually a symptom of a
failure; with a failure investigation, the symptom can be converted into a
root cause, which requires the system to be editable to correctly report
failures.
When: The
maintenance repair order system usually generates evidence of a failure. Failures with significant costs (repair costs
+ collateral damage + lost margin from the failure + other appropriate business
costs) must be investigated and evaluated to reduce failures and to reduce
failure costs. Little is to be gained by
spending big money to investigate trivial failures.
Where: This
is an engineering tool requiring clerical effort to input the data and build
the Pareto distributions for identifying significant events requiring
corrective action and thus it also becomes a management tool for controlling
costs.
Highly Accelerated Life Test (HALT)-
What: HALT
is an offspring of older environmental stress screening (ESS) tests and is a testing process
for ruggedization of pre-production products by
heavily stressing the product to identify failure modes quickly and to verify
weak links in the system such as design, manufacturing, testing, environment,
and quality. HALT tests are stress based
and not time based tests to failure.
Acceleration factors are not the main consideration. HALT tests are step stress processes to
quickly induce failures.
Why: HALT
tests are intended to quickly find failures and accelerate the improvement
program so that when products are delivered to end users, they will be mature
products by elimination of potential failure modes that would normally generate
a reliability growth program. Usually
the HALT programs reduce time, cost, and delays experienced in new products by
recalls, warranty costs, etc. HALT is
similar to HASS but the stresses are more severe. In the HALT process, design and process flaws
are found, root causes identified, and corrective actions implemented quickly.
When: HALT
is used during the development program to get engineers to acknowledge and
correct fatal problems in designs by adding loads (generally temperature,
vibrations, pressures, physical stresses, etc.) by rapidly changing the load
conditions over and above normal operating loads.
Where: HALT
is frequently used for electronic systems but also applicable to mechanical
systems where thermal shocks are used to validate designs for extreme
conditions of loads. The tests are
performed in the laboratory for engineering evaluation.
Highly Accelerated Stress Screen
(HASS)-
What: HASS
uses stresses similar to HALT, but at a lower stress
level. Compared to HALT testing,
temperature and voltage extremes may be reduced by 10%-15%, vibration levels
reduced 50%, etc., depending upon the design although all the stresses may be
above rated product specifications with the motivation to produce test results
quickly for verifying product compliance.
Why: HASS testing is used
to verify product performance is on
target and has not shifted toward inferior performance in the manufacturing
process. Note that higher stresses often
produce accelerated failures out of proportion to the increased stress applied.
When: Products
are periodically screened by HASS to verify no shifts have occurred in the
manufacturing process.
Where: HASS
tests are performed as a quality assurance test in manufacturing facilities to
learn what you don’t know about each product as it is faster than a simple burn-in
test. If 100% of the finished goods do
not receive HASS, as when only a percentage of the product is screened by HASS,
this is called a highly accelerated stress audit (HASA).
What: Life cycle
costs (LCC) are all costs associated with the acquisition and ownership of a
system over its full life. The usual
figure of merit is net present value (NPV).
Projects are considered most favorable for large positive NPVs. However, for many individual cases involving only costs,
decisions are made for the least negative NPV.
In all cases, the default position for accounting is to know the NPV for
making no change, and this is usually the last alternative for most people associated
with change.
Why: The
first cost for capital equipment (acquisition) is between ½ and 1/20 of the
total lifetime cost! The first cost,
acquisition cost, is usually definable by a firm quotation and sustaining costs
must be estimated and put into the appropriate time slots for discounting to
obtain the NPV for the project life.
Typical values used in industry for LCC are: discount rate = 12%, tax
rate = 38%, and project life is usually between 10 and 20 years.
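The discounting step can be sketched as follows. This is a simplified sketch that ignores taxes and depreciation and uses only the 12% discount rate mentioned above. The two alternatives and their cash flows are hypothetical.

```python
def net_present_value(cash_flows, discount_rate):
    """Discount yearly cash flows to today.

    cash_flows[0] occurs now (year 0); cash_flows[t] occurs at the end of
    year t. Costs are negative numbers.
    """
    return sum(cf / (1.0 + discount_rate) ** year
               for year, cf in enumerate(cash_flows))

# Hypothetical 10-year comparison: low first cost with high sustaining cost
# versus high first cost with low sustaining cost.
cheap_to_buy = [-100_000.0] + [-40_000.0] * 10
robust_design = [-200_000.0] + [-15_000.0] * 10

npv_cheap = net_present_value(cheap_to_buy, 0.12)
npv_robust = net_present_value(robust_design, 0.12)
print(f"NPV, cheap to buy : {npv_cheap:,.0f}")
print(f"NPV, robust design: {npv_robust:,.0f}")
```

Both NPVs are negative because only costs are considered, so the choice goes to the least negative value.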
When: Life
cycle cost is usually calculated as an up-front decision-making effort either
for projects or for cost-reduction efforts.
It does not work well for doing the analysis after the project is
underway.
Where: LCC
is the business of investing money to make changes occur. The NPV values add the voice of investments
to technical decisions to work for the lowest long-term cost of ownership.
What: A
measure of use duration applicable to an item.
For example, the life units may be starts-stops, run hours, hot-cold
cycles, distances traveled, emergency starts or starts, shelf life, and other
measurements that motivate failures.
Why: Life
is consumed by usage of life units. Some
life units occur as a sum of the different cases, for example on a gas-turbine
aircraft engine, take-offs consume more life than landings or en route conditions, which requires a synthetic value for how
life is consumed on a mission. For a
land-based, heavy-duty gas turbine used in the generation of electrical power
the number of starts is not equivalent to hours of operation as other wear
mechanisms are involved; however, 1 trip cycle = 8 normal shutdown cycles and
thus decreases the time between required maintenance actions.
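The trip-cycle equivalence above can be turned into a simple life-consumption tally. In this sketch the factor of 8 comes from the text, while the maintenance interval and the operating counts are hypothetical.

```python
def equivalent_shutdown_cycles(normal_shutdowns, trips, trip_weight=8):
    """Convert a mix of normal shutdowns and trips into equivalent normal cycles."""
    return normal_shutdowns + trip_weight * trips

def cycles_until_maintenance(interval_cycles, normal_shutdowns, trips, trip_weight=8):
    """Equivalent cycles left before the next required maintenance action."""
    used = equivalent_shutdown_cycles(normal_shutdowns, trips, trip_weight)
    return max(interval_cycles - used, 0)

# Hypothetical turbine: 100 normal shutdowns and 10 trips since the last
# inspection, with a hypothetical 900-equivalent-cycle inspection interval.
used = equivalent_shutdown_cycles(normal_shutdowns=100, trips=10)
remaining = cycles_until_maintenance(900, normal_shutdowns=100, trips=10)
print(f"equivalent cycles used: {used}, remaining: {remaining}")
```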
When: Development
of a life-consuming profile may be more important than the literal measurement
of an elapsed time to adequately measure consumption of life that in the end
will result in a failure.
Where: Life
units have different measures and must be considered to obtain the proper
“common denominator” for calculations.
What: For
reliability successes, loads must always be less than strengths. When loads are greater than strengths,
failures occur. The issue is determining
the probability of load-strength interference, which is a joint probability of
when loads exceed strengths. The loads
should include expected conditions plus the foolishness of people to violate
rules and overload equipment, plus the vagaries of Mother Nature to impose
unexpected static and dynamic loads from hurricanes, tornadoes, earthquakes,
wildfires, and so forth.
Why: Neither
loads nor strengths are unmovable point estimates, although most designers use
point values. Failures occur and
reliability terminates when loads exceed strengths.
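The interference probability has a closed form when loads and strengths are treated as distributions of a convenient type. The sketch below assumes both are normally distributed and independent, which is an assumption made only for illustration. The means and standard deviations are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def interference_probability(load_mean, load_sd, strength_mean, strength_sd):
    """P(load > strength) for independent, normally distributed load and strength.

    The safety margin (strength - load) is itself normal, and failure is the
    event that the margin falls below zero.
    """
    margin = NormalDist(strength_mean - load_mean,
                        sqrt(load_sd ** 2 + strength_sd ** 2))
    return margin.cdf(0.0)

# Hypothetical values in MPa: the point estimates suggest a comfortable
# margin, yet the spread still leaves a finite chance of failure.
p_fail = interference_probability(load_mean=300.0, load_sd=30.0,
                                  strength_mean=420.0, strength_sd=40.0)
print(f"probability of load exceeding strength: {p_fail:.4f}")
```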
When: Loads
usually increase over time (e.g., airplanes, like people, gain weight over time
from accumulation of dirt and extra equipment), and strengths usually decrease over
time (small fatigue cracks appear with many cycles and load-bearing strengths
decline).
Where: Bridges
have finite lives because of load-strength interactions, wings break off of
airplanes from fatigue, etc. A few
failures are dramatic but most failures sneak up from the unknown in a variety
of ways to cause loss of reliability. To
prevent loss of the system requires many physical inspections to learn what you
don’t know!
What: Lognormal
distributions are continuous life functions that have long tails to the right
(display positive skewness) in time or usage. A lognormal distribution plotted on semi-log
paper would appear as a normal curve.
Why: The
lognormal distribution is a common competitor to the Weibull
distribution for life. However it is adequate
for 85%-95% of all repair times.
When: Lognormal
distributions are motivated by multiplicative (or proportional) events that
grow with time, like crack growth, molecular diffusion, and some wearout problems.
Where: In
the days when plots had to be made by hand, it was the first widely used
transform to convert plotted data into straight lines. Today it is simply one of an arsenal of
probability tools used to obtain good curve fits to data with multiplicative
type events.
What: The
measure of the ability of an item to be retained in or restored to a specified
condition when maintenance is performed by personnel having specified skill
levels, using prescribed procedures and resources.
Why: Maintainability
measures the percent of maintenance jobs completed to a standard time for the
repair, with repair times for the task usually plotted on a lognormal
probability plot.
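The percent of jobs completed within a standard time can be estimated as sketched below, assuming the repair times follow a lognormal distribution fitted from the mean and standard deviation of the logged times. The repair times and the 6-hour standard are hypothetical.

```python
from math import log
from statistics import NormalDist, fmean, stdev

def fit_lognormal(repair_times):
    """Fit a lognormal by fitting a normal distribution to the log repair times."""
    logs = [log(t) for t in repair_times]
    return NormalDist(fmean(logs), stdev(logs))

def fraction_within_standard(log_dist, standard_time):
    """Fraction of repairs expected to finish within the standard time."""
    return log_dist.cdf(log(standard_time))

# Hypothetical repair times (hours) for one repetitive task.
repair_hours = [2.0, 3.0, 4.0, 4.5, 6.0, 9.0]
log_dist = fit_lognormal(repair_hours)
share = fraction_within_standard(log_dist, standard_time=6.0)
print(f"jobs expected within the 6 h standard: {share:.0%}")
```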
When: First
you set a standard repair time for the task, second you set a skills level,
third you measure how you’re doing against the standard.
Where: Applies
to major tasks where many repetitions are expected and where considerable time
is required.
What: All
actions necessary, both technical and administrative, for retaining an item in
or restoring it to a specified condition so it can perform a required
function. The actions include servicing,
repair, modification, overhaul, inspection, reclamation, and restored condition
determination.
Why: Equipment
deteriorates because of entropy changes, because of errors both overt and
covert, and because of the use of incorrect procedures.
When: Maintenance
is generally routine and recurring.
Where: The
effort includes fault location, diagnosis, repair, test, adjustment,
replacement, administration, and overhauls wherever equipment is located.
What: A
tactical job for rapidly repairing equipment to operable conditions by studying
operating and repair manuals. Acquires
failure data and prepares maintenance plans for restoring equipment to operable
condition in a minimum amount of time.
Prepares general diagrams, charts, drawings, and spare parts
requirements for maintenance planners.
Makes recommendations for improving the repair cycle. Provides manning level forecast for
supervisors and estimates the duration of outages. Determines the cost advantages of
alternatives for developing action plans to comply with internal/external
customer demands for timely repairs of processes/equipment. The purpose of these activities is to restore
equipment to service in a timely manner.
Why: Facilitates speedy
repairs by providing maintenance technology above the craftsman level and up
to, but not including, reliability engineering principles.
When: Provides
expertise for more complicated maintenance tasks or when organization and
oversight is required and time is of the essence for fast repairs.
Where: Provides
on-site expertise to aid craftsmen to solve non-standard repairs without
hands-on tool contact. Maintenance
engineers serve as liaisons with reliability engineers.
Management’s
Role For Reliability-
What: Management
must display leadership for setting a course for reliability
under their watch. Too little
reliability results in many breakdowns, high maintenance costs, missed
production schedules, and unhappy customers.
Too much reliability results in high equipment cost, complicated and
expensive redundancies, excessive procedures, and excessive operating costs
along with happy customers for product delivery but unhappy customers because
of high cost products. You’ve got to get
it right for your particular situation.
No 4th quartile producer has demonstrated high reliability
production systems. Many 1st
and 2nd quartile producers have demonstrated high reliability
production systems.
Why: Management
gets what management wants. Management
must say what they want and want what they say.
Management must be consistent.
Their talk must match their walk to achieve failure free processes
which take into account the cost of unreliability
throughout the entire system. Management
usually expresses their overriding desires and philosophy with policy statements as a method of widely
communicating intent to the workforce and making the direction a part of the
organization culture. Management cannot
espouse a reliability culture while only talking about fixing things faster or
grumbling about maintenance costs; they must work to correct the root of
the failures and develop a culture of failure prevention.
When: Management
can adopt the reliability culture role at any time. The program has to be sold to the
organization—telling won’t implement an initiative for reliability. As a working example, follow the methodology used
for implementing strategies and policies for safety, quality, and environment
as role models.
Where: Management’s
role for reliability starts at the top as a strategy issue. It cannot begin at the bottom of the
organization.
What: A
density figure-of-merit metric often referred to as the average or expected
value. In the simplest form it appears
as the arithmetic Σ(time) / Σ(events),
or in complicated situations as a statistical metric. It applies to mean life (ML), mean down time (MDT),
mean maintenance time (MMT), mean
time between failures (MTBF for
repairable items), mean time to failure (MTTF
for replacement items), mean time between maintenance (MTBM), mean time between maintenance scheduled (MTBMs), mean maintenance time
unscheduled (MMTu),
mean maintenance time scheduled (MMTs),
mean time between overhauls (MTBO),
mean time between unscheduled removals (MTBRu), mean time to restore (MTR), mean time between downing events (MTBDE), and so forth. The units
will be time/metric, e.g., hours/failure.
The reciprocal of the metric provides an incident rate, e.g.,
failures/hour.
Why: The
metric provides an awareness factor for deciding central tendency numbers and
for the expected number of events that will occur into the future based on
historical situations. The arithmetic
simplicity of mean time is a reason to establish the metric and listen to the
information derived from it to gain insight.
The arithmetic provides immediate answers to categorize facts for
starting continuous improvement rather than postponing a metric while searching
for delayed perfection!
When: The
metrics are used as criteria of performance, and variations from the central
tendency numbers are expected; however, for the long term the variations are
expected to be controlled to prevent distortion of the measurement.
Where: The
metrics are used from the shop floor to the management levels as criteria for
“How are we doing?”.
Mechanical Components Interaction-
What: Mechanical
components suffer from interactions and degradations of overloads, strength
deterioration, wear, corrosion, process variations during the fabrication
process, effects of special processes where the procedures must be controlled
as discovery of the end results would result in destruction of the component,
and removal of safety factors by increasing loads.
Why: The
naïve expectation is that, individually, the impact of a single insult will not
destroy reliability of the component.
However, you frequently have multiple insults occurring, which results
in failures that are not predicted up front but can be perfectly explained
after the components have failed.
When: The
multiple destructive events are more predominant in complex devices and highly
stressed devices which too often have small safety factors that cannot cope
with the overload conditions and thus failures occur.
Where: The
foolishness of humans adds further insults to the interactions of many
different failure mechanisms which demands many more maintenance interventions
and frequent inspections. Of course the
solution to many of these cases where failures occur is to increase safety
factors by adding extra material (when possible), but this adds extra weight
and extra costs.
What:
Why: The
technique is used when: 1) many variables are present and their
interrelationships are unclear, 2) the system can’t be analyzed by direct and
formal methods; 3) building analytical models would be time consuming, complex,
and just too hard, 4) you cannot do direct experiments, 5) when the input
details such as equipment life and repair times are not discrete and they vary
over time according to a distribution, and 6) you need to do some tweaking of
the system to understand where opportunities lie for improving uptime,
reliability, and costs.
When: Build
models before you commit systems to bricks and mortar so you know their
performance on paper. Revise the models
after they are in operation to help improve the unknown weaknesses and improve
costs for future cases.
Where:
What: A
fundamental frequency distribution that produces a symmetrical bell-shaped
diagram based on the Gaussian distribution to form a normal law of errors.
Why: The
distribution is easily described with two statistics, the mean (X-bar, which is
a location parameter) and the standard deviation (sigma, which is a scale
parameter carrying the units of the location parameter), as these are parameters of
the population.
When: The
distribution is widely used for quality issues where errors are frequently
symmetrically distributed and for a few cases of reliability problems where
life data is also symmetrically distributed.
For symmetrical life data, the normal data makes a good Weibull plot, whereas Weibull data usually makes a
poor normal plot—thus, Weibull plots have almost displaced normal plots for
reliability data.
Where: The
distribution is used where the statistics simplify descriptions of the
distribution, so it is easy to describe and explain.
Overall Equipment Effectiveness (OEE)-
What: Overall
equipment effectiveness (OEE) is a manufacturing index to reduce complexity of
discrete systems for problem solving and benchmarking. In many ways, it is a subset of effectiveness.
OEE = availability*performance*quality, where availability = (operating
time)/(planned production time), performance = (ideal cycle time)/(operating
time/total pieces), and quality = (good pieces)/(total pieces). OEE is best
suited to discrete manufacturing. The
index is larger than for effectiveness and allows for acceptance of down time
without having a hard measure for utilization losses in the capability (although
it does have a performance index which takes elements from both efficiency and
utilization), and it accepts planned downtime as OK in the availability
index. The effectiveness index looks at
the system from the perspective of the investor, whereas OEE looks at the
system from the perspective of operations management, which excuses many
losses such as planned outages, and the indices have a propensity to
be “Enronized” so they look good when in fact, from
the investor’s viewpoint, the results are not good, which is a violation of the
principle of Esse Quam Videri
(To be, rather than to seem).
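The OEE arithmetic defined above can be sketched as follows. The shift figures are hypothetical and the function name is illustrative only.

```python
def overall_equipment_effectiveness(planned_time, operating_time,
                                    ideal_cycle_time, total_pieces, good_pieces):
    """Return (availability, performance, quality, OEE) per the definitions above."""
    availability = operating_time / planned_time
    performance = ideal_cycle_time / (operating_time / total_pieces)
    quality = good_pieces / total_pieces
    return availability, performance, quality, availability * performance * quality

# Hypothetical shift: 480 planned minutes, 400 operating minutes,
# ideal cycle time of 1.0 minute/piece, 360 pieces made, 342 good.
availability, performance, quality, oee = overall_equipment_effectiveness(
    planned_time=480.0, operating_time=400.0,
    ideal_cycle_time=1.0, total_pieces=360, good_pieces=342)
print(f"availability={availability:.3f} performance={performance:.3f} "
      f"quality={quality:.3f} OEE={oee:.4f}")
```

Note that planned downtime never enters the calculation, which is the leniency the text describes.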
Why: It’s
a simple and easy-to-use index for the big-picture summary of performance in
industry and it can be benchmarked against similar industries.
When: Use
for a quick assessment and approximation of the effectiveness equation.
Where: Widely
used for a first cut at improving manufacturing operations in lieu of the more
stringent and complete effectiveness equation.
What: Vilfredo Pareto was an
Italian economist in the late 1800s who described the unequal distribution of
wealth in the world. The concept was
improved and brought to the factory floor by Joseph M. Juran
(December 24, 1904-February 28, 2008) for manufacturing operations. Juran said it was a
methodology for separating the vital few problems from the trivial many
problems. The Pareto principle, as
explained by Juran, when applied to quality issues
said: It’s the 80-20 rule where 80% of
the problems come from 20% of the causes and management should concentrate on
the 20% (the vital few causes). The same
concept works for money issues—you must separate the vital few issues from the
trivial many issues.
When the Pareto distribution is listed in
order of money lost (including the risk for money lost) it becomes a work
priority for attacking business problems that have the greatest impact on the
enterprise. Winners in the organization
work on the vital few important items (the 20%), as they put their reputations
at stake, while the losers in the organization work on the trivial many
problems (the 80% of the problem list), which, if solved, would have little
financial impact on the enterprise.
The gear-head approach is to build the
Pareto list based on numbers of failures.
This is usually not too productive.
Would you really prefer to solve 90% of
1) 1000 failures that cost a total
of $1000, or
2) 1 failure that costs $1,000,000?
The gear-head approach says to go for the 1000 small problems. However the business approach says to go for
the big $ items in the list—in the end, it’s all about the money!
The business approach is to build the
Pareto list based on the total amount of money spent or at risk (maintenance
costs + gross margin lost + rework costs + scrap costs + warranty costs + …,
to include all appropriate business costs)
rather than working on the trivial money and love affairs that keep people busy
but do not generate financial returns for the business.
The most important reliability tool is a Pareto
distribution based on $’s to set work priorities for attacking the vital few
problems as a method of separating important issues from the trivial many
issues.
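The difference between the two ways of building the list can be sketched as below. The first two problems reuse the figures from the example above. The third problem and all of the problem names are hypothetical.

```python
def pareto_by(problems, key):
    """Rank problems by `key`, largest first, with cumulative share of the total."""
    ranked = sorted(problems, key=lambda item: item[key], reverse=True)
    total = sum(item[key] for item in ranked)
    rows, running = [], 0.0
    for item in ranked:
        running += item[key]
        rows.append((item["name"], item[key], running / total))
    return rows

problems = [
    {"name": "nuisance trips", "failures": 1000, "cost": 1_000.0},
    {"name": "compressor wreck", "failures": 1, "cost": 1_000_000.0},
    {"name": "seal leaks", "failures": 40, "cost": 120_000.0},  # hypothetical
]

by_count = pareto_by(problems, "failures")  # the gear-head list
by_cost = pareto_by(problems, "cost")       # the business list
for name, cost, cumulative in by_cost:
    print(f"{name:18s} ${cost:>12,.0f}  cumulative {cumulative:.1%}")
```

The same data puts a different problem at the top of each list, which is why the sort key matters.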
Why: The
Pareto distribution, based on $’s,
sets work priorities, and assuming a one-year payback period, describes how
much money can be spent to resolve the issues.
Most reliability engineers need to be working on the top 5 or 6 items, based on $’s, all the time as data and
solutions are developed slowly and the key items always need to be on the mind
for active consideration. The mentality
is to think like a bank robber—go for where the big money is located and get it
back—and get it back fast.
When: At
least quarterly reviews of the Pareto distribution are important for
accountability of who has solved what problems and to define what new targets
have come over the horizon that require immediate attention.
Where: Pareto
distributions are used throughout the organization to keep attention on the
vital few $ issues. They are highly
favored by management when engineers employ Pareto distributions based on
money. Pareto distributions help set
work priorities and avoid focusing on love affairs with equipment or process,
which often occurs to the detriment of the business. Pareto distributions explain why some work
orders always get maintenance priority while other tasks are relegated to the
category of “whenever we get time to solve the problem.”
What: Poisson
distributions are discrete distributions and the simplest statistical process,
where Poisson events are random in time with a stable average rate
of occurrence of counted events. The Poisson
is frequently used as a first approximation to describe failures expected with
time. The calculations are driven by an
average value, e.g., failures/year, defects/meter², hurricanes/year,
etc. Answers from the Poisson will come
as probabilities for 1 failure, 2 failures, etc., or the probability for 1
hurricane in a year or 2 hurricanes in a year, etc. The average value is obtained from a
constant*time-interval that is usually explained as λ*t. Frequently charts are used to obtain
solutions to the Poisson equation such as the Thorndike Chart from Bell Labs or
the Abernethy-Weber chart from The
New Weibull Handbook. The equation
is often described in two formats: 1) probability = $(np)^r e^{-np} / r!$ where n = number of trials, r = number of
occurrences, and p = probability of an occurrence, or 2) probability = $Z^C e^{-Z} / C!$ where Z = expected number
(i.e., the mean) and C = the number of events, in counting numbers, whose probability is sought. Of course, for the two different formats np = Z and r = C. When n
is large and p (or 1-p) is small, the Poisson is an excellent approximation to
the binomial distribution.
Why: Simplicity
is the major reason for use of the Poisson distribution.
When: Use
the Poisson when an answer is needed quickly and the answer deals with counting
terms.
Where: When
you know the average number of events the Poisson is easy to use to find the
probability of 1, 2, 3,…events occurring.
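As a minimal sketch of the calculation described above, the second format of the equation (mean Z, count C) can be evaluated directly. The mean of 2 failures/year is a hypothetical value chosen for illustration, and the function name is an assumption.

```python
import math


def poisson_pmf(count, mean):
    """Probability of exactly `count` events when the expected number is `mean`."""
    return mean ** count * math.exp(-mean) / math.factorial(count)


# Hypothetical average: failure rate * time interval = 2 failures in a year.
mean_failures = 2.0
probabilities = [poisson_pmf(c, mean_failures) for c in range(5)]
for c, p in enumerate(probabilities):
    print(f"P({c} failures in a year) = {p:.4f}")
```

Summing the terms for 0, 1, 2, ... events gives the cumulative probability that a chart such as the Thorndike Chart would otherwise be used to read off.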
What: Probability
plots make sense of the chaos of failure data on an X-Y plot. Each type of plot is divided differently on
the X and Y axis based on the fundamental mathematics for a given distribution. The decision on which type of graph paper to
use is based on: 1) a simple pragmatic approach (use the one that gives the
best curve fit to the data), and 2) the physics of failure or the mechanism
driving the data for non-failures. For
reliability data, 85% to 95% of the data will adequately fit a Weibull distribution. For repair data, 85% to 95% of the data will adequately
fit a lognormal distribution. Often Weibull plots or lognormal plots
compete as to which distribution best fits the failure data.
Why: The
acquired data is plotted in the units acquired on the X-axis of a probability
plot and the data is plotted in rank order.
The Y-axis in most cases is determined using Benard’s median rank approximation to provide the probability
percentage. The result is often a
straight line on the properly divided X-Y graph paper. Please note, over the years many different
plotting positions have been tried with Benard’s
plot position being the strongest survivor for tailed (i.e., non-normal) data.
When: Use
when you have failure data or repair data.
They work best when age-failure plots are made by individual failure
modes or individual repair modes. They
also will handle high-level failure data and repair times where the data
represent how the system is behaving.
Where: Use
probability plots to get complicated data summarized onto one side of one sheet
of paper. When the plots have the
cumulative distribution plotted on the Y-axis, it tells what percent of the
population will have a life (or repair time) less than the corresponding
X-value.
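The plotting positions can be sketched as follows, assuming a complete sample with no suspensions and using Benard's approximation, (i - 0.3)/(n + 0.4), for the i-th of n rank-ordered points. The ages to failure are hypothetical and the function name is an assumption.

```python
def benard_median_ranks(values):
    """Pair each value, in rank order, with Benard's median rank estimate.

    Benard's approximation is (i - 0.3) / (n + 0.4) for rank i of n.
    Assumes a complete sample (no suspensions).
    """
    ordered = sorted(values)
    n = len(ordered)
    return [(x, (i - 0.3) / (n + 0.4)) for i, x in enumerate(ordered, start=1)]


# Hypothetical ages to failure, in hours.
ages_to_failure = [1500, 400, 950, 2300, 700]
plot_points = benard_median_ranks(ages_to_failure)
for age, rank in plot_points:
    print(f"{age:6d} h -> {100 * rank:5.1f}% cumulative")
```

Each pair is one point on the plot: the age goes on the X-axis and the median rank on the Y-axis of the chosen probability paper.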
What: Reliability
of a production process is defined as the percentage of production where output
consistency is lost as determined by a Weibull plot of daily production
output.
Reliability losses are the sum of production gaps between what should
have been demonstrated (the demonstrated production line) and what was
actually achieved—these are losses due
to special causes. Special cause losses occur from things you can put your
finger on and can be solved by process engineers, maintenance engineers, and
reliability engineers.
Nameplate lines (or entitlement line) define the possible daily
output. Nameplate lines lie to the right
of the demonstrated production line on a Weibull probability plot. The gap between the nameplate line and the
demonstrated line quantifies efficiency/utilization losses; these are losses due to common causes. Common cause losses result from subtle
problems without major identifiers that are generally accepted as “that’s the way
things are” and are not singled out for elimination by six-sigma black belts and
management. In many production
facilities, this category is a major source of losses and greater than all
availability/reliability/maintainability losses.
The sum of the reliability losses plus efficiency/utilization losses
constitutes a hidden factory measured in output quantities.
Production effectiveness = (annual
output)/(annual output + hidden factory losses). These details are shown graphically on a
Weibull probability plot. Contrast the
production effectiveness calculation (obtained in minutes) to the effectiveness
equation (obtained in hours/weeks).
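The production effectiveness ratio above is simple arithmetic once the losses have been read from the plot. A small sketch follows, with hypothetical output and loss quantities in the same units (e.g., tons/year); the function and parameter names are assumptions.

```python
def production_effectiveness(annual_output, reliability_losses, efficiency_losses):
    """Annual output divided by (annual output + hidden factory losses)."""
    hidden_factory = reliability_losses + efficiency_losses
    return annual_output / (annual_output + hidden_factory)


# Hypothetical quantities, all in the same output units.
pe = production_effectiveness(
    annual_output=300_000,
    reliability_losses=20_000,   # special cause losses
    efficiency_losses=80_000,    # common cause losses
)
print(f"Production effectiveness = {100 * pe:.0f}%")
```

With these made-up numbers the hidden factory is 100,000 units, so the index works out to 75%.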
Why: You
must see the losses on a Weibull probability plot to believe they exist. Use the graphics to sell an improvement
program based on diagnosis of the problem and where to attack. The technique provides both visual and
qualitative results. The analysis goes
onto one side of one sheet of paper.
This is a simple tool used for strong results in a creative and problem
solving organization. Reliability values
and the slope of the demonstrated line (beta) are benchmarkable. Process reliability techniques measure system
performance, in production output quantities, and produce a single production effectiveness index in
percentage terms which is similar to the effectiveness
equation.
When: Works
well on daily production data accumulated over a period of time in order to see
the patterns of performance.
Where: Useful
for any production facility including electrical power generation, chemical
plants (both batch and continuous process), refineries, pharmaceuticals,
semiconductors, packaging facilities, and other complicated production
facilities where achieving a simple index of “how are we doing” is difficult to
achieve. For more details and articles,
see hyperlinks at the bottom of the page: http://www.barringer1.com/prtraining.htm.
Quality Function Deployment (QFD)-
What: QFD
is a bad translation of a good Japanese reliability technique for getting the
voice of the customer into the design process so the product delivered is the
product the customer desires. In
particular, it is applicable to soft issues that are difficult to specify.
Why: The
method helps pinpoint: 1) what to do, 2) the best ways to accomplish the
objective, 3) the best order for achieving the design objectives, and 4) the
staffing/assets required to complete the task.
When: QFD
is a major up-front effort (as is the case with most Japanese techniques) to
learn and understand the customer’s requirements and the approach that will
satisfy their objectives.
Where: The
methodology is used as a team approach to solving problems and satisfying
customers, beginning with a listing of customer requirements, converting
customer requirements into engineering characteristics (the house of quality),
converting engineering characteristics into parts characteristics (the house of
parts deployment), converting parts characteristics into process characteristics
(the house of process planning), and finally, converting the process
characteristics into production characteristics (the house of production
planning). As with all Japanese
techniques, the up-front costs are high and many clever graphical tools exist
for transferring information with the intention of decreasing costs downstream
while satisfying customers’ needs.
What: Reliability is the
probability that a device, system, or process will perform its prescribed duty without failure for a given time
when operated correctly in a specified environment. This means that reliability is concerned with
the probability of future failures based on what has occurred with past
observations so we predict the future based on past observations.
Why: Reliability
has two broad ranges of meanings:
1) qualitatively—operating without failure for long periods of time just as the
advertisements for sale suggest, and
2) quantitatively—where life is predictable, long, and measurable in tests to
assure satisfactory field conditions are achieved to meet customer
requirements.
Reliability is concerned with failure-free operation for periods of time,
whereas quality is concerned with avoiding non-conformances at a specified time
prior to shipment; thus, reliability measures a dynamic situation but quality
measures a static situation. As in
physics, statics (not time dependent as with quality issues) is easier to
understand and calculate than dynamics (time dependent as with reliability
issues), which involves higher levels of math and greater mental capabilities
for comprehension.
When: Reliability
is expected for new equipment to start, run, and continue to function for long
periods of time without failure.
Reliability is also expected when the equipment is dormant and called to
duty. Reliability is also expected upon
service or restoration and resumption of long life. Reliability is designed into the system by
up-front activities, and reliability is sustained by careful operation of the
system along with careful nurturing of the system with sustaining maintenance
activities. Reliability always
terminates in a failure and the roots of failure can be due to design,
fabrication, installation, operation, maintenance (repair and periodic
servicing), and management of the system—in short, there are many ways and
means to kill the system but few ways to keep it operating without failure.
Where: The
adage says “the proof of the pudding is in the eating,” and, for reliability,
the proof of the system is in the long failure-free interval. Reliability tools are used from stem to stern
to demonstrate high reliability (the absence of failures for long periods of
time) by use of many tools such as:
reliability acceptance test to
demonstrate long life;
reliability analysis to compute the
expected results;
reliability and maintainability are the
mathematical tasks that predict the expected results from the elements;
reliability apportionment to
allocate life issues in a top-down manner to meet an overall reliability goal;
reliability assessment determines
the achieved level of reliability of an existing system using data gathered
during test or use;
reliability assurance implements
planned management and technical measures to provide confidence that a
reliability target is obtained and maintained;
reliability
block diagrams to graphically and mathematically calculate reliability
results prior to building a system;
reliability-centered maintenance is
the systematic approach to identify preventive support and service according to
a set of procedures to reduce and avoid failures;
reliability confidence limits
demonstrate the limits for reliability within a given confidence limit;
reliability control is the
coordination and direction of system dependability
through design activities and management planning;
reliability critical item
identification whereby failure significantly affects system safety/cost or
operational success or maintenance/logistics support costs;
reliability data is the basic
age-to-failure data as life unit
information relating to the time-to-failure when organized by probability distributions;
reliability degradation which incurs
loss of the failure-free performance due to poor workmanship or bad parts or
improper operation or abuse or inadequate maintenance;
reliability design practices are a series of trade-off-tools to meet or beat the design
specification for reliability;
reliability development/growth tests
are the evaluations to disclose deficiencies and verify corrective actions to
prevent reoccurrence of the failures to achieve the design specifications and
sustain reliability growth toward
longer times between failure;
reliability estimates are life
values used prior to statistical experimentation with the end products to make
predictions, or assessments, or stress analysis evaluations;
reliability function is the
graphical representation of life characteristics plotted against operating
time;
reliability growth achievement is
the systematic improvement of an item’s/system’s
dependability by removing failure mechanisms through corrective actions to
eliminate deficiencies and flaws often achieved by means of test-analyze and
fix;
reliability growth models
(Crow-AMSAA) measure reliability
growth by means of log-log plots of cumulative failures on the Y-axis and
cumulative time on the X-axis to demonstrate with statistics that failures are
coming more slowly and reliability goals have been achieved;
reliability guarantee is the
commitment by suppliers to provide a given mean time between replacements or
given maintenance and overhaul intervals for equipment;
reliability improvement is the
identification of failure modes and effects having a critical impact on the
system failure potential of the design along with the systematic removal of the
failures to produce long life without failures;
reliability index is the ratio of
the mean reliability level achieved to the acceptable level specified in the
design as a figure of merit;
reliability measurement is
failure-free endurance assessment activity for making decisions about
reliability and demonstrating compliance;
reliability mission is the mission
time for demonstrating failure-free performance;
reliability prediction is the
process of quantitatively assessing whether a proposed or existing design meets
a specified life requirement;
reliability prediction functions
estimate the life characteristics for setting goals and evaluating the design
benchmarks and needs;
reliability prediction limitations
describe the shortcomings in life values by analytical methods;
reliability prediction requirements
describe life assumptions, environmental data, and failure rates for the
design;
reliability prediction summary is a report providing conclusions and recommendations based
upon a reliability assessment analysis;
reliability program is the set of activities
to organize and achieve a system to ensure reliability goals are achieved and
deficient areas shored up;
reliability program plan is the
formal written definition of the specific tasks to fulfill the reliability
requirements;
reliability qualification test (RQT)
is an evaluation conducted under specified conditions using items
representative of the approved product configuration;
reliability quantitative elements
are the life characteristics and factors considered in predicting and measuring
reliability performance;
reliability requirements are the
numerical values representing a specified failure-free life or dependability
performance characteristic;
reliability sequential tests are
evaluations of the number of failures and the time required to reach a decision
based on the accumulated results of the reliability tests;
reliability tasks describe the
activities required to achieve a reliability program;
reliability tests are the formal
evaluation to determine a product’s longevity for the failure-free interval or
stability relative to time/usage;
and finally,
reliability with repair is the
failure-free performance achieved by redundancy with permitted online repairs
without interrupting equipment operation.
What: Reliability
audits verify your reliability program is effective and find areas of weakness
for corrective action. They are
inquiries by factual examination of elements of the system with written
objective criteria for performance, beginning with an assessment of how
management is involved and whether it is effective in building a productive
reliability program.
Why: Most
organizations know where they are strong.
On an objective basis, few organizations know where they are weak. Reliability audits are fact-finding
exercises similar to financial and quality audits to ferret out weaknesses for
corrective action. The questions to be
answered are:
1) How well are
you doing what you promised against your reliability
policy?
2) How well is upper management doing
against company objectives for reliability?
3) How well are reliability plans,
systems, and procedures working?
4) How well are plans, systems, and
procedures being executed against the policy?
5) How well are productive efforts for
reliability working toward achieving the goals?
6) How well has the reliability system
been communicated to employees and are they committed to understanding and
implementing the improvements? and
7) Are financial objectives being met as
a result of ongoing reliability improvements? (Which is the main objective of
the audit—not just a rigid procedural/bureaucratic compliance to details).
When: Detailed audits should occur annually,
with a follow-up six months later to ensure that corrective action has
been implemented. Without a six-month
deadline, few tasks will be completed because of procrastination.
Where: Audits
are needed for 1) reliability system management, 2) new techniques, technology,
developments, and controls, 3) supplier control (internal and external), 4)
process operation and control, 5) reliability data programs, 6) problem-solving
techniques, 7) control of reliability measurements, 8) human resources
involvement, 9) customer satisfaction assessment (internal/external), and 10)
software reliability (excluding Microsoft products used in the office
environment).
What: Reliability
block diagram (RBD) models are graphical representations of a calculation
methodology for reliability systems.
Why: The
RBD models allow calculation of system reliability based on knowing/assuming
failure details of the components, starting with the least component and
growing the model to the greatest system to predict performance from the
elements.
When: RBDs
are used in upfront designs as a performance parameter and after the system is
constructed to ferret out poorly performing blocks that limit the system
performance.
Where: Frequently
used as a trade-off tool to search for the lowest long-term cost of ownership and to
help sell alternative courses of action for moderating the effects of
reliability issues or overcoming the poor performance by alternative designs
where the results can be calculated before building the system as the results
of the calculations provide knowledge about availability, maintenance
interventions required for failures, and the number of spare parts required to
sustain operations. For other
definitions see MIL-HDBK-338,
sections 4 and 6.
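The block arithmetic behind an RBD can be sketched for the two basic arrangements, assuming the blocks fail independently: a series string works only if every block works, and a parallel (redundant) group fails only if every block fails. The block reliabilities and names below are hypothetical.

```python
import math


def series(reliabilities):
    """Reliability of blocks in series: the product of block reliabilities."""
    return math.prod(reliabilities)


def parallel(reliabilities):
    """Reliability of redundant blocks: 1 minus the product of unreliabilities."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


# Hypothetical system: feed unit -> two redundant pumps -> separator.
pump_pair = parallel([0.90, 0.90])
system = series([0.95, pump_pair, 0.98])
print(f"Redundant pump pair: {pump_pair:.4f}")
print(f"System reliability:  {system:.4f}")
```

Swapping block values or adding redundancy in such a model is one way to compare alternative designs before anything is built.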
Reliability-Centered Maintenance-
What: Reliability-centered
maintenance (RCM) is a systematic
planning process used to determine the maintenance requirements for a
system. RCM expects the system has an
inherent reliability and maintenance requirements are imposed upon the baseline
of inherent safety and inherent reliability designed into the system (the
design sets the standard, it can be high, medium, or low).
Why: RCM
does what is required to make sure the systems continue to do what the users
want done. If the excellent maintenance
programs demonstrate the lack of reliability expected, then the system must be
improved by design changes to physical assets or the manner in which the assets
are used.
When: RCM
requires a cultural change in both management and employees to “do maintenance
by the numbers”. This requires
discipline in the organization to perform the FMEAs that
drive the work process for maintenance and it also requires defining functional failures.
Where: RCM
works better in top-quartile manufacturers who have a disciplined work force and
are interested in achieving excellence in 1) safety, 2) operability, 3) reduced
maintenance downtime by a disciplined approach to the maintenance activities,
4) high uptimes, and 5) a reduction in failures. Lacking one or more of the five efforts at excellence
generally results in a failed RCM program.
What: A
strategic job for preparing plans to reduce the failures and the cost of
failures as a preventative measure to reduce the cost of unreliability. Acquires failure data and analyzes the data
to quantify the financial impact and prepare long-term solutions to prevent
reoccurrences to improve reliability and uptime. Determines the cost advantages and proposes
alternatives for solving the problem and recommends the alternative with the lowest long-term cost of ownership. The purpose of these actions is to prevent
failures.
Why: Prevents
future failures by working on medium- and long-term projects using technology
to solve the problems. As required,
provides technical assistance to maintenance engineers to aid their efforts for
quickly restoring equipment to service.
When: Provides
expertise for avoiding failures by means of a technical solution to reduce the
high-cost reliability problems on the Pareto
distribution.
Where: Provides
technical support and solutions for management on longer range problems, and as
required, supplies technical assistance to maintenance engineers for immediate
and difficult restoration projects as a liaison effort. Supports task improvements to accomplish
longer term objectives (think months and quarters), which will result in
smoother operations, at lower costs, without failures.
What: Reliability
growth models are important management concepts for making reliability visual
with simple displays. The simple log-log
plots of cumulative failures on the Y-axis against cumulative time on the
X-axis often make straight lines where the slope of the trend line is highly
significant for telling if failures are coming faster (β>1),
which is undesirable, slower (β<1), which is desirable, or without
improvement/deterioration (β=1), which usually drifts toward
undesirable results. The reliability
growth models are frequently called Crow-AMSAA plots in honor of Larry Crow’s
proof of why the charts work as described in MIL-HDBK-189
when he worked with AMSAA.
Why: Both engineers
and management must see reliability problems to fix them. The simple log-log plots make the models
visible. The task of the reliability
engineer is to put favorable cusps on the Crow-AMSAA trend lines to make
failures come more slowly and thus decrease the long-term cost of
ownership. If you’re doing your
improvement job correctly, you’ll never have many failures until you have a
cusp.
When: The
plots are useful for development tasks (where they first were used) or to
long-term operations. They work for
safety programs, plant improvement programs, environmental programs, or for
cost problems. Use the plots as “show
me, don’t tell me,” how the projects are proceeding and the key metric in the
form of line slope is easy to understand and easy to communicate in less than
60 seconds.
Where: They
are used for technical development issues or for management reviews. A picture is worth a thousand words for
getting management’s attention for focusing on a problem. Likewise the charts are highly useful for
showing the reductions in failures that have occurred from making a desirable
and permanent fix.
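The trend-line slope can be estimated with a straight-line fit on log-log axes, assuming the usual Crow-AMSAA form N(t) = λ·t^β for cumulative failures N at cumulative time t. The sketch below uses simple least-squares regression, which mirrors fitting a line by eye on the plot; it is not the maximum likelihood procedure of MIL-HDBK-189. The failure times are hypothetical.

```python
import math


def crow_amsaa_fit(cum_times, cum_failures):
    """Least-squares line through log(cumulative failures) vs log(cumulative time).

    Returns (beta, lam) for the model N(t) = lam * t**beta.
    """
    xs = [math.log(t) for t in cum_times]
    ys = [math.log(n) for n in cum_failures]
    count = len(xs)
    x_bar = sum(xs) / count
    y_bar = sum(ys) / count
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta = sxy / sxx
    lam = math.exp(y_bar - beta * x_bar)
    return beta, lam


# Hypothetical cumulative operating hours at each successive failure.
failure_times = [100, 400, 900, 1600, 2500, 3600]
failure_counts = list(range(1, len(failure_times) + 1))
beta, lam = crow_amsaa_fit(failure_times, failure_counts)
print(f"beta = {beta:.2f}, lambda = {lam:.3f}")
```

In this made-up data the gaps between failures keep widening, so the fitted slope is below 1, the desirable case.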
What: Management
communicates with their staffs through important policy statements. Management policies are general and relate to
procedures and rules which are specific for implementing policies. Written statements of policy regarding
reliability are decisive documents about avoiding system failures in the same
way that safety policies address the need for absence of human injuries,
quality policies address the need for absence of product discrepancies, and
environmental policies address the need for avoiding spills and releases. Management also needs to state
a reliability policy, which may read like this:
We
will build an economical and failure-free process that will operate for 5 years
between planned outages.
This statement will clearly communicate that failures to the process (which
is the money machine) are to be abhorred and avoided!
Why: Process
failures are clearly money issues because, when the process ceases to run, the
company has no income, thus process failures are to be abhorred for killing the
money machine.
When: Implementing
a policy before construction of new facilities is important so the policy can serve
as design criteria. When implemented
with older facilities, the task is more difficult and old facilities may never
be able to comply with the objectives at a reasonable cost.
Where: Responsibility
for implementing the policy lies with:
1) the chief operating officer must authorize the policy and ensure the policy
is applied throughout the operations under the administrative directive that
sets the guidelines for financial and engineering measures,
2) the engineering/R&D executives are responsible for ensuring the policy
is implemented by systems engineering, design engineering, project engineering,
pilot plant engineering, and test engineering,
3) the manufacturing executive is responsible for ensuring that the reliability
policy is carried out by the materials and procurement functions, industrial
engineering functions, manufacturing engineering functions, operations
functions, and maintenance functions,
4) the quality assurance executive is responsible for the dissemination of the
reliability policy, its annual review and auditing for compliance to the spirit
of the policy, and for making recommendations to the chief operating officer
concerning continued relevance, applicability, and effectiveness, and
5) the human resources executive is responsible for ensuring that all new
employees are indoctrinated into the purpose and implementation of the
reliability policy as a part of the operation’s mission, goals, and priorities.
What: Suppliers
have two strategies for testing: 1) test for success and/or 2) test for
failures. Reliability testing produces failures,
particularly when the tests are accelerated with extra loads, and this may be
troublesome to have in the records for future lawsuits. Thus, it is often to everyone’s advantage to
perform reliability tests under code names to protect against the broad rules of
legal discovery.
Why: The
reliability tests will determine a product’s longevity and failure-free
performance. This requires data
recording and data integrity. Plans must
be set for how the tests are to be conducted, loads to be handled, duration of
the tests, environmental conditions, operating modes, failure definitions, and
documentation for recording/analyzing the test data.
When: Reliability
tests are usually run prior to release of the product for sale or after the
product has been released and troublesome failures appear in field applications
where no problems were expected.
Where: Laboratory
tests are conducted in many cases but in other cases the data may simply come
from field use. Note the failures
induced require extra components that must be expected and budgeted along with
the extra costs for data acquisition/analysis.
What: For
inexpensive components and inexpensive tests, simultaneous tests involve many
components under test loads/conditions at the same time for the purpose of
quickly acquiring data and producing test analysis as the failures occur. In simultaneous testing, the suspensions
(censored data) become important details for use in the statistical
analysis. Most simultaneous tests are
accelerated to generate the data in a short period of time, although this
carries the risk of introducing unexpected failure modes (but this can also be
useful information for anticipating field failures).
Why: Conducting
analysis of the early test results, when only a few failures have occurred, will
give precursors as to passing/failing the longer-term tests. If the early test results look encouraging,
the larger test may be allowed to run to conclusion. However if early test results are
disappointing, the test may be abandoned without using all of the testing
budget so that remedial action can occur prior to completing the full-scale
planned test.
When: This
testing is usually conducted prior to release of products. However, a similar watch may be set up for
warranty repairs so as to anticipate the cost and extra supplies required to
cope with an unexpected failure that was not forecasted.
Where: This
strategy is appropriate for inexpensive components in the test laboratory. However, for warranty problems, the issues
are very appropriate for expensive components or assemblies.
What: Software
does not wear out but it does fail and most failures are due to specification
errors and code errors with only a few errors in copying or use. The only software repair is by reprogramming,
and adding safety factors is almost impossible.
Software reliability improves by finding errors and fixing the errors
but estimating the number of errors that cause failures is extremely difficult
as many branches of software code may lie dormant and unused until special
events occur to make the latent failures obvious. Software failures are not often time related
but are more software code page dependent.
Software reliability is improved by extensive testing to disclose the
failures and then fixing them to repeat the test all over again to validate the
fix did not generate more failures and to continue the search of other latent
defects.
Why: More
than 50% of the software bugs (failures) occur from specifications with lesser
amounts of failures from system design and the coding process. This is due to the lack of visibility in the
software process along with problems from those specifying the requirements
with problem roots in ambiguities, inconsistencies, incomplete statements, and
lack of logical requirements. This
requires that both inputs and outputs for software must be specified in greater
detail than for mechanical, electrical, or system data to avoid the errors and
conflicts.
When: “Clean
room” software procedures are a technique for extracting details from the
customers so the programmers get the scope of the project and the input/output
correct as an up-front effort to reduce errors and wasted code. Acquiring the data is tedious, and roughly
80% of the software budget is spent getting the details “right” before programming
commences.
Where: Disciplined
software specialists carefully work the plan up-front to reduce errors and
testing time. Undisciplined, so-called
“neo-experts” want to see busyness in code writing up-front and thus their
software reliability is worse from not having a firm foundation from which to
work.
What: For expensive
components and expensive tests, sudden death tests involve a few components
that tie-up a test frame as they are heavily loaded under the same test
loads/conditions with several items being run at the same time. When one of the items fails, the entire test
frame is shut down so that you have 1 failure (this is the sudden death!) and
several suspensions because the unfailed units are
survivors as the test is halted until the test frame is loaded with new samples
for resumption of the life test. Opening
the test frame (instead of tying up the frame until all samples have failed) is
cost effective. If three units can be
tested simultaneously and the test is halted on the first failure, then perhaps
we will literally have only 4 failures and 8 suspensions for preparing the Weibull analysis.
Will the data set of 4 failures + 8 suspensions be different than if all 12
samples had been run to failure? Yes, they will be different, but will they be significantly
different? No, not significantly. So, as with simultaneous testing the suspensions (censored
data) become important details for use in the statistical analysis. Most sudden death tests are accelerated to
generate the data in a short period of time although this carries the risk of
introducing unexpected failure modes (but this can also be useful information
for anticipating field failures).
Why: Sudden
death testing is all about the economics and shorter elapsed time for results.
When: Sudden
death testing is used for product acceptance tests.
Where: It is
a quick test for many products and the ongoing test for production lots.
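The bookkeeping for a sudden death data set can be sketched as below, assuming every unfailed unit in a frame is recorded as a suspension at the age of that frame's first failure. The first-failure ages are hypothetical and the function name is an assumption; the resulting records would then feed a Weibull analysis that handles suspensions.

```python
def sudden_death_data(first_failure_times, units_per_frame):
    """Build (age, status) records: one failure plus suspensions per frame load."""
    records = []
    for t in first_failure_times:
        records.append((t, "failure"))
        # The remaining units in the frame survived to age t and are suspended.
        records.extend((t, "suspension") for _ in range(units_per_frame - 1))
    return records


# Hypothetical age (hours) of the first failure in each of four frame loads.
first_failures = [120, 85, 150, 100]
records = sudden_death_data(first_failures, units_per_frame=3)
n_failures = sum(1 for _, status in records if status == "failure")
n_suspensions = len(records) - n_failures
print(f"{n_failures} failures, {n_suspensions} suspensions")
```

Twelve samples tested three at a time give the 4 failures and 8 suspensions discussed above.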
Total Productive Maintenance (TPM)-
What: TPM
is a corporate-wide effort in which all employees work to use equipment to
its full capability. It is an equipment-oriented management concept for reducing
failures and increasing the productive utilization of equipment and
processes. TPM programs are teamwork
programs and require a corporate culture of teamwork devoid of us vs. them
issues. All employees are expected to
accept ownership of the equipment and processes to do many small things all the
time to ensure high levels of availability by eliminating failures in the early
stages with low-cost actions. The
employees approach the process equipment as owners rather than renters.
Why: TPM maximizes
equipment uptime at lower cost because all employees work to reduce the many
small incidents that lead to a failure.
When: Major
maintenance tasks are handled by the craftsmen.
Most small tasks are handled by operators in a never-ending effort of
cleaning, lubricating, and tightening to find problems early when they can be
solved simply instead of letting the problem grow to a major issue.
Where: TPM
is a system-wide effort of providing care to the equipment rather than saying
“it’s not my job,” and “We’ve got to fill out the paperwork before ‘they’ can
do anything.” The technique makes good
use of the five human senses, but the work force must be taught the technical
details needed to tell good from bad, when action must be taken, and what
must be done. This requires a sharing environment where the work team works for
the common good of higher performance.
If the culture is me, me, me, TPM will not work.
What: If
you’ve got one piece of failure data and nothing else, you’re a poor person without
much hope. If you’ve got one piece of
failure data and a Weibull
database, you’re a rich person with a map on the back of an envelope and a
compass by your side to get you out of the abysmal swamp of ignorance and
misunderstanding.
Why: The Weibayes technique combines your failure
data with past experience to produce a Weibull
forecast of what you should expect in the future. In many
cases, given a worst-case/best-case hypothesis, a failure forecast can be generated.
When: Use
the technique when you lack specific details but you know something from your
past experience—often the past experience reduces errors of Weibull
analysis. Use Weibayes
analysis to make sense out of emotional nonsense.
Where: Use
the technique to say something and point noses in the right direction rather than
playing the role of Chicken Little with the sky falling. Some data is better than no data in most
cases, and when you can keep your wits and everyone else is in panic mode, it
quiets the problem to allow reason to prevail.
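As a rough illustration of the Weibayes idea, the sketch below uses the commonly published one-parameter estimate: assume the shape factor beta from past experience, then compute the characteristic life as eta = (sum of t^beta / r)^(1/beta), where t covers the ages of all units (failed and unfailed) and r is the number of failures. The ages and the assumed beta are hypothetical, and the function name is only for this example. Check the details against a Weibull reference before relying on the result.

```python
def weibayes_eta(ages, beta, failures=1):
    """Weibayes characteristic life for an assumed shape factor beta.

    ages     -- ages of ALL units, failed and unfailed, in the same time units
    beta     -- shape factor taken from past experience (an assumption)
    failures -- number of failures among the units; with zero failures,
                using 1 is the usual conservative convention
    """
    if failures < 1:
        raise ValueError("use failures=1 when no failures have occurred")
    total = sum(age ** beta for age in ages)
    return (total / failures) ** (1.0 / beta)

# Hypothetical case: one failure at 1,500 hours, three units still running.
ages = [1500.0, 800.0, 1000.0, 1200.0]
assumed_beta = 2.5  # hypothetical value from a Weibull database
eta = weibayes_eta(ages, assumed_beta, failures=1)
print(f"Weibayes eta = {eta:.0f} hours for beta = {assumed_beta}")
```

Running the calculation with a low and a high beta from past experience gives the worst-case/best-case bracket mentioned above.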
What: Weibull
analysis is the tool of choice for most reliability engineers when they
consider what to do with age-to-failure data.
It uses the two-parameter Weibull distribution, which says mathematically
that reliability is R(t) = exp[-(t/η)^β], where t is
time, β (beta) is the shape factor, and η (eta)
is a scale factor known as the characteristic life. Most Weibull
distributions have tailed data and lack an easy way to describe central
tendency because mode ≠ median ≠ mean.
However, regardless of the β value,
every cumulative distribution function passes through 63.2% at t = η,
which entitles η to be known as the single-point characteristic
life.
Be careful in use of the three-parameter Weibull equation! It is frequently misused simply to get a good
curve fit! The three-parameter Weibull
requires compliance with these four requirements:
1) you must see curvature of data on a
two-parameter plot (concave downward curves imply a failure-free interval on
the age-to-failure axis whereas concave upward curves imply a percentage of the
population is prefailed),
2) you must have a physical reason for
why a three-parameter distribution exists (producing a better curve fit is not
a valid reason!),
3) you must have at least 21 failure data
points (if curvature is slight you may need 100+ data points), and
4) the goodness of curve fit must be
significantly better after use of the three-parameter distribution.
Why: The
Weibull distribution is so frequently used for reliability analysis because one
set of math (based on the idea that the weakest link in the chain will cause failure)
describes infant mortality, chance failures, and wear-out failures. Also the Weibull distribution has a closed
form solution:
1) for the probability density
function (PDF),
2) for the cumulative distribution
function (CDF),
3) for the reliability function
(1-CDF), and
4) for the instantaneous failure rate,
which is also known as the hazard function.
For engineers, closed-form equations are preferred over tables
because of simplicity. In a similar
manner, engineers strongly need graphics of the Weibull distribution whereas
statisticians do not find the graphics nearly as useful for comprehension.
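The four closed-form expressions for the two-parameter Weibull can be written directly as short functions. The sketch below is a minimal illustration. The function names and the example parameter values are chosen only for this example, and t is assumed to be greater than zero.

```python
import math

def weibull_reliability(t, beta, eta):
    """R(t) = exp[-(t/eta)^beta]"""
    return math.exp(-((t / eta) ** beta))

def weibull_cdf(t, beta, eta):
    """F(t) = 1 - R(t), the fraction failed by age t."""
    return 1.0 - weibull_reliability(t, beta, eta)

def weibull_hazard(t, beta, eta):
    """h(t) = (beta/eta) * (t/eta)^(beta-1), the instantaneous failure rate."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)

def weibull_pdf(t, beta, eta):
    """f(t) = h(t) * R(t)"""
    return weibull_hazard(t, beta, eta) * weibull_reliability(t, beta, eta)

# Hypothetical characteristic life; three shape factors covering
# infant mortality, chance, and wear-out behavior.
eta = 1000.0
for beta in (0.5, 1.0, 3.0):
    print(f"beta = {beta}: F(eta) = {weibull_cdf(eta, beta, eta):.3f}, "
          f"h(500) = {weibull_hazard(500.0, beta, eta):.6f}, "
          f"h(1500) = {weibull_hazard(1500.0, beta, eta):.6f}")
```

Whatever the shape factor, the fraction failed at t = eta is 1 - 1/e, while the hazard falls, stays flat, or rises with age depending on whether beta is below, equal to, or above 1.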
When: Use
Weibull analysis when you have age-to-failure data.
· When you have age-to-failure data by component, the analysis is
very helpful because the β-values will tell you the modes of failure, which no other
distribution will do [β<1 implies infant mortality with decreasing failure rates, β≈1 implies chance failures with a constant failure rate, and β>1
implies wear-out failure modes with increasing failure rates; when you know the
failure mode you know which “medicine” to apply]!
· When you have age-to-failure data for the system, the β-values
have NO physical significance and the β- and η-values only describe how the system is
functioning. This means you lose significant physical
information for problem solving.
Where: When
in doubt, use the Weibull distribution to analyze age-to-failure data. It works with test data, field data, warranty data, and accelerated testing data. The Weibull distribution is valid for ~85% to
95% of all life data, so play the odds and start with Weibull analysis. The major distribution competing with the
Weibull for reliability analysis is the lognormal
distribution, which is driven by accelerating events. For additional information read The New Weibull Handbook, 5th
edition, by Dr. Robert B. Abernethy and use the SuperSMITH Weibull and SuperSMITH Visual software for
analyzing the data (the two programs are bundled at a reduced price as SuperSMITH).
What: Starting
with Weibull analysis of component failures, the shape factor β
derived from the Weibull analysis provides an objective guide for selecting
repair strategies.
Why: Experience
has shown that when shape factor beta is:
β < 1, failure
rates are declining with time, as occurs with infant mortality failure
modes. This condition calls for a run-to-failure
strategy. Older components are
better than new components because the failure rate for the population is lower
than when new.
β ≈ 1, failure rates
are constant with time, as occurs with chance failure modes. This condition calls for a run-to-failure
strategy (or a run until the component failure mode changes to a wear-out failure mode).
An old component is as good as a new component.
β > 1, failure
rates are increasing with time, as occurs with wear-out
failure modes. If the cost of failures
in service is much greater than the cost of a replacement, the component may
have an optimum replacement interval for timed replacements. If the cost of failures in service is equal
to or only slightly larger than the cost of a replacement, run to
failure may be the better strategy.
Bottom line: You must know your Weibull failure modes and your costs to make a
good maintenance decision.
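One way to see how β and the cost ratio interact is the classic age-replacement cost model, which is an assumption brought in here for illustration rather than a method spelled out above. The cost per unit time for planned replacement at age T is taken as [Cp·R(T) + Cf·(1 - R(T))] divided by the integral of R(t) from 0 to T, and the interval with the lowest cost rate is found by a simple grid search. All parameter values, the function names, and the search limit of 3η are hypothetical.

```python
import math

def reliability(t, beta, eta):
    """Two-parameter Weibull reliability."""
    return math.exp(-((t / eta) ** beta))

def cost_rate(T, beta, eta, planned_cost, failure_cost, steps=400):
    """Expected cost per unit time when parts are replaced at age T or on failure."""
    dt = T / steps
    # Trapezoidal integral of R(t) from 0 to T: mean time in service per cycle.
    mean_service = sum(
        0.5 * (reliability(i * dt, beta, eta)
               + reliability((i + 1) * dt, beta, eta)) * dt
        for i in range(steps)
    )
    r = reliability(T, beta, eta)
    return (planned_cost * r + failure_cost * (1.0 - r)) / mean_service

def best_interval(beta, eta, planned_cost, failure_cost, grid=200):
    """Grid search over replacement ages up to 3 * eta (an arbitrary limit)."""
    candidates = [eta * 3.0 * k / grid for k in range(1, grid + 1)]
    return min(candidates,
               key=lambda T: cost_rate(T, beta, eta, planned_cost, failure_cost))

# Hypothetical wear-out component: in-service failure costs 20x a replacement.
t_wear = best_interval(beta=3.0, eta=1000.0, planned_cost=1.0, failure_cost=20.0)
# Same costs, but chance failures (beta = 1).
t_chance = best_interval(beta=1.0, eta=1000.0, planned_cost=1.0, failure_cost=20.0)
print(f"wear-out: replace near {t_wear:.0f} hours")
print(f"chance:   best grid value {t_chance:.0f} hours (the search limit)")
```

When the search returns the upper limit of the grid, as it does for β = 1, no timed replacement beats running to failure in this model. A result well below η indicates a worthwhile replacement interval.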
When: Collect
data from the FRACAS system. Perform a Weibull
analysis. Store the data in a Weibull database.
Use the Weibull facts for making fact-based technical decisions.
Where: Weibull
corrective action is used by maintenance engineers and reliability
engineers. It is a useful tool for
understanding scatter in the data and
provides guidance for taking the appropriate corrective action.
What: The
smartest way to maintain a reliability database is in Weibull format and Weibull databases are
available. Seldom do you see Weibull
databases from vendors because they jealously protect their data for
proprietary reasons—they live/die financially from the Weibull database
information.
Why: The
Weibull databases simplify the complications of failure data into two
statistical values of great importance:
β tells you HOW things
fail at the component level, and
η tells you WHEN
things fail.
The results are key benchmark data that tell you how you’re doing.
When: Gather
your failure data and create your own database.
No one is going to give you their database because they put much sweat
and tears into cleaning up the data so it is useful. The data needs to be locally generated
because it: 1) tells you the life from the grade of equipment
you purchase, 2) describes the grade of operation of the equipment (do you
operate it like 16-year-old teenagers or wise old men/women of 65?), 3)
describes the grade of maintenance you use to renew its life, and 4) tells
you management’s expectations for how to treat the system.
Where: Building
a Weibull database seems to many to start out as a silly
exercise by maintenance to accumulate data, and the
unknowledgeable ridicule it, asking why you are spending so much effort to build a Weibull
database. When adversity arises, the
Weibull database becomes everyone’s prized possession with proprietary
information. Remember the words of
Rudyard Kipling about the plight of the English soldier. To paraphrase: In peacetime it’s Tommy this
and Tommy that, and Tommy get out of the way…but you let the bullets fly in
wartime and it’s Mr. This and Mr. That and Mr., if you please! Everyone wants the baby but no one wants the
dirty diapers that go with every baby!
If you don’t have a Weibull database, you’re already too late because
your competitor has one started and is using it to your disadvantage, and he’s
not going to tell you why you’re left in the dirt!
Comments:
Refer to the caveats on the Problem
Of The Month Page about the limitations of the following solution.
Maybe you have a better idea on how to solve the problem. Maybe you find where
I've screwed up the solution and you can point out my errors as you check my
calculations. E-mail your comments, criticism, and corrections to Paul
Barringer.