Problem Of The Month

January 2001—Timing Of Maintenance Repairs


Timing of maintenance repairs depends on three factors:

1.      costs,

2.      failure modes, and

3.      failure descriptions in probabilistic terms. 

Everyone wants to carefully schedule maintenance actions to shut down the facilities for a short interval and renew equipment so ALL in-service failures are avoided.


Popular perceptions about equipment failures do not match the harsh realities of the physics of failure or the price people are willing to pay for robust equipment.  Few things have highly predictable times to failure.  Most things have multiple failure modes and varying lengths of component life.  The physics of failure dictates how things fail, and the same physics dictates how robust a component is and therefore what age it will reach before failure.


You cannot call the shots for failure in a god-like fashion; failure is governed by natural laws involving entropy.  You must adopt multiple strategies to cope with the physics of failure.  One of these methodologies is wrapped up in reliability-centered maintenance (RCM).


Reliability issues concern prevention of failures and optimization of failure costs for the long-term cost of ownership.  Maintenance issues concern rapid repair of failures with timely restoration of functional capabilities, using good practices to obtain the long life inherent in existing equipment.  As an analogy, the reliability department functions as the Fire Marshal does to prevent fires, whereas the maintenance department functions as the Fire Fighter to quickly extinguish the fires.


John Moubray (see reading list) explains the objectives of reliability-centered maintenance (RCM).  Moubray defines it as follows: “Reliability-centered Maintenance: a process used to determine the maintenance requirements of any physical asset in its operating context.”


The RCM process is highly situational depending upon:

·         What must be done for survival of assets,

·         When the maintenance actions must occur on the asset,

·         How you can predict or prevent the functional failure of the asset, and

·         Cost-effective performance of maintenance activities.

This requires some definitions:

·         Functional failures: a state of failure where the asset cannot deliver acceptable performance to the end user at a price the user is willing to pay

·         Failure modes: events which are reasonably likely to cause a failed state (the ways things can fail), often described in the broad categories of infant mortality, chance failures, or wear-out failures

·         Failure effects: descriptions of what happened when a failure occurred

·         Failure cause: often the initiating root of the problem (don’t confuse cause and effects)

·         Failure consequences: the financial impact of business losses when a functional failure occurs, i.e., the pile of money from lost business, repair costs, etc.

·         Failure risk: often the product of severity (how bad the failure is), occurrence (the frequency of failure), and detection (the probability of detecting the failure before it occurs)
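The failure risk product in the last definition can be sketched in a few lines.  This is only an illustration: the 1 to 10 ranking scales are a common FMEA convention assumed here (with a high detection rank meaning the failure is hard to detect), and the failure modes and their rankings are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # assumed scale: 1 (minor) .. 10 (catastrophic)
    occurrence: int  # assumed scale: 1 (rare) .. 10 (frequent)
    detection: int   # assumed scale: 1 (almost always caught) .. 10 (almost never caught)

    def risk_number(self) -> int:
        # Risk as the product of severity, occurrence, and detection
        return self.severity * self.occurrence * self.detection


# Hypothetical failure modes, for illustration only
modes = [
    FailureMode("seal leak", severity=4, occurrence=6, detection=3),
    FailureMode("bearing seizure", severity=8, occurrence=3, detection=5),
    FailureMode("coupling misalignment", severity=5, occurrence=4, detection=2),
]

# Work on the largest risk numbers first
ranked = sorted(modes, key=lambda m: m.risk_number(), reverse=True)
for mode in ranked:
    print(f"{mode.name}: {mode.risk_number()}")
```

The ranking is only as good as the rankings fed into it, so treat the product as a sorting aid and not as a measurement.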

Benign failures (known to the expert but not noticed by the end users) are preferred to dramatic failures, where catastrophic events happen clearly and obviously.  Behind all of the RCM activities is money (that includes safety issues, where the cost of deaths in industry carries a value of US$3 to 5 million and the cost for severely mangling an individual is US$5 to 30 million).


RCM requires prioritization of effort, or else RCM becomes known as a resource-consuming monster!  Set the priorities based on $Risk = (probability of failure)*(consequence of failure) or on a Pareto distribution of excessive failure costs.
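A minimal sketch of the $Risk ranking follows.  The asset names, probabilities, and consequence figures are hypothetical and serve only to show the arithmetic; the cumulative share column is one simple way to get a Pareto view of the same numbers.

```python
def dollar_risk(probability_of_failure: float, consequence_usd: float) -> float:
    # $Risk = (probability of failure) * (consequence of failure)
    return probability_of_failure * consequence_usd


# Hypothetical assets: (annual probability of failure, consequence in US$)
assets = {
    "pump P-101": (0.30, 50_000),
    "compressor C-201": (0.05, 2_000_000),
    "conveyor CV-7": (0.60, 10_000),
}

risks = {name: dollar_risk(p, c) for name, (p, c) in assets.items()}
priority = sorted(risks, key=risks.get, reverse=True)

# Pareto view: cumulative share of total $Risk, largest contributor first
total = sum(risks.values())
running = 0.0
for name in priority:
    running += risks[name]
    print(f"{name}: ${risks[name]:,.0f}  cumulative {running / total:.0%}")
```

Note how the item that fails least often can still top the list because of the size of its consequence.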


When components are replaced before they fail, useful life is tossed out.  This only makes sense if you have something to gain financially by throwing away unused life, or if the item costs so little that the analysis is not worth the money.  At first glance, throwing away unused life makes the cost in $/hr appear higher than if the component is run to failure.  Of course, you must always consider the cost of failure in the decision-making process.  Furthermore, you must consider the failure mode.  Optimum replacement intervals exist under some conditions and not under others.  These details are explained in The New Weibull Handbook.  Conditions that exclude optimum replacement intervals are:

·         Infant mortality: many early-age failures and few late-age failures; an old part is better than a new part because of declining hazard rates

·         Chance failures: random ages to failure without memory of previous events; an old part is as good as a new part because of constant hazard rates

Conditions permitting optimum replacement intervals are:

·         Wear-out failures: few early-age failures and many late-age failures; instantaneous failure rates increase with age

·         Large unplanned/planned cost ratios: the risk of failure is weighed as an alternative in deciding when to make replacements

SuperSMITH Weibull software can make the optimum replacement calculations and SuperSMITH Visual software can make the optimum replacement curves as explained in The New Weibull Handbook and as illustrated in PlayTIME for SuperSMITH.
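For readers without that software, the idea behind an optimum replacement curve can be sketched with the classic age-replacement cost model: cost per unit time = [planned cost * R(t) + unplanned cost * (1 - R(t))] / (integral of R from 0 to t), where R(t) = exp(-(t/eta)^beta).  This is an assumption on my part about the model; it is not taken from the SuperSMITH documentation, and the function names and cost figures below are hypothetical.

```python
import math


def reliability(t, beta, eta):
    # Weibull reliability: probability of surviving to age t
    return math.exp(-((t / eta) ** beta))


def cost_rate_curve(beta, eta, planned_cost, unplanned_cost, t_max, steps=2000):
    """Return [(replacement age, cost per unit time)] for ages up to t_max."""
    dt = t_max / steps
    curve = []
    mean_cycle = 0.0  # running trapezoid integral of R(s) from 0 to t
    r_prev = 1.0
    for i in range(1, steps + 1):
        t = i * dt
        r = reliability(t, beta, eta)
        mean_cycle += 0.5 * (r_prev + r) * dt
        # A cycle ends in a planned replacement (prob. r) or a failure (prob. 1 - r)
        expected_cost = planned_cost * r + unplanned_cost * (1.0 - r)
        curve.append((t, expected_cost / mean_cycle))
        r_prev = r
    return curve


def optimum_age(curve):
    # Point on the curve with the lowest cost per unit time
    return min(curve, key=lambda point: point[1])


# Hypothetical costs: a failure costs 20 times a planned replacement
wear_out = cost_rate_curve(3.0, 100.0, planned_cost=1.0, unplanned_cost=20.0, t_max=300.0)
chance = cost_rate_curve(1.0, 100.0, planned_cost=1.0, unplanned_cost=20.0, t_max=300.0)

print("wear-out optimum (age, cost rate):", optimum_age(wear_out))
print("chance 'optimum' (age, cost rate):", optimum_age(chance))
```

With beta = 3 the curve has a minimum well inside the range, which is the optimum replacement age.  With beta = 1 the curve just keeps falling, so the lowest point is always the end of the range you searched, which is another way of saying run it to failure.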


Analysis of aircraft data in the 1960s, for the original research activities supporting RCM concepts, showed that 11% of components on aircraft had wear-out failure modes.  The complement (89%) of components on aircraft had infant mortality or chance failure modes.  The ratios are about the same in industry, where more components are killed by overt actions than die on their own.  [This is from the Nowlan & Heap report AD-A066579.]


In Weibull statistics, the failure modes are identified by the value of the shape factor beta:

·         When beta is less than 1, expect infant mortality failure modes. 

·         When beta is approximately equal to 1, expect chance failure modes.

·         When beta is greater than 1, expect wear-out failure modes.
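The link between beta and failure mode comes from the Weibull hazard (instantaneous failure) rate, h(t) = (beta/eta) * (t/eta)^(beta - 1), which falls with age when beta < 1, stays flat when beta = 1, and rises when beta > 1.  A small sketch, in which the tolerance used to decide what counts as "approximately 1" is a hypothetical choice and not a published rule:

```python
def weibull_hazard(t, beta, eta):
    # Instantaneous failure rate at age t (t must be > 0 when beta < 1)
    return (beta / eta) * (t / eta) ** (beta - 1)


def failure_mode(beta, tolerance=0.05):
    # tolerance is an illustrative band around beta = 1, not a standard value
    if beta < 1 - tolerance:
        return "infant mortality"
    if beta > 1 + tolerance:
        return "wear-out"
    return "chance"


for beta in (0.5, 1.0, 3.0):
    early = weibull_hazard(10.0, beta, 100.0)
    late = weibull_hazard(90.0, beta, 100.0)
    print(f"beta={beta}: {failure_mode(beta)}, hazard at age 10={early:.5f}, at age 90={late:.5f}")
```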

Thus you receive returns from your failure data if you analyze the data stored in your maintenance systems; however, correct results require that you identify suspensions (censored data).  The data should be stored on a LAN accessible by your organization to make good use of the technical facts obtained by autopsy of failed components.  An example of a Weibull failure database is available; please note the absence of large beta values!  Large Weibull beta values show predictability in ages to failure.  Smaller beta values show a lack of predictable failure ages.  Sometimes mixed failure modes are inadvertently included in datasets, and the mixtures must be analyzed by WinSMITH Weibull or YBATH software for mixtures (the preferable method is to separate the mixtures physically and not to rely on software to untangle the results).
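To show why suspensions matter, here is a rough sketch of one common way to fit beta and eta when some units have not failed: median rank regression with adjusted ranks and Benard's approximation, regressing log age on the transformed median rank.  I am assuming this method for illustration; it is not a description of what the software named above does internally, and the ages in the example are illustrative and not drawn from the database mentioned.

```python
import math


def fit_weibull_with_suspensions(records):
    """records: list of (age, failed) pairs; failed=False marks a suspension."""
    ordered = sorted(records, key=lambda rec: rec[0])
    n = len(ordered)
    xs, ys, adjusted_ranks = [], [], []
    previous_rank = 0.0
    for index, (age, failed) in enumerate(ordered):
        if not failed:
            continue  # suspensions shift later ranks but are not plotted
        reverse_rank = n - index
        rank = (reverse_rank * previous_rank + n + 1) / (reverse_rank + 1)
        previous_rank = rank
        adjusted_ranks.append(rank)
        median_rank = (rank - 0.3) / (n + 0.4)  # Benard's approximation
        xs.append(math.log(age))
        ys.append(math.log(-math.log(1.0 - median_rank)))
    # Least squares of ln(age) on the transformed median rank
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((y - mean_y) * (x - mean_x) for x, y in zip(xs, ys))
             / sum((y - mean_y) ** 2 for y in ys))
    intercept = mean_x - slope * mean_y
    beta = 1.0 / slope
    eta = math.exp(intercept)
    return beta, eta, adjusted_ranks


# Illustrative ages: True = failure, False = suspension (still running or removed unfailed)
records = [(10, False), (30, True), (45, False), (49, True),
           (82, True), (90, True), (96, True), (100, False)]
beta, eta, ranks = fit_weibull_with_suspensions(records)
print(f"beta={beta:.2f}, eta={eta:.1f}, adjusted ranks={ranks}")
```

If you throw the suspensions away and fit only the failures, the ranks change and so do beta and eta, which is the reason the suspensions must be identified in your maintenance records.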


Download the Excel spreadsheet RANDOM FAILURE MODELS.XLS (33K).  Look at the Monte Carlo simulation of ages to failure and how the time periods vary between maintenance replacement strategies. 

·         For the first simulation use beta = 50 and eta = 100 to see how failures occur about every 100 units of time in a very orderly manner.  (We want to believe things fail in predictable intervals but this is not reality!)

·         For the second simulation, use beta = 1.2 and eta = 100 and notice the loss of periodic events, i.e., chaos (a beta of 1.2 is found very frequently and gives about the best predictability available in a chaotic world, whereas beta = 50 is rarely found).

·         For the third simulation, use beta = 1.0 and eta = 100 to see the effects of chance failures.  (We have many examples of constant failure rates with beta = 1.)

·         For the fourth simulation, use beta = 0.8 and eta = 100 to see the effects of infant mortality.  (We have too many of these modes of failure, which are often the result of installation errors and operator errors.)
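If you do not have the spreadsheet at hand, the four simulations above can be approximated with a short script.  This is a sketch and not a reproduction of the spreadsheet: it draws Weibull ages to failure with Python's standard library and uses the coefficient of variation (standard deviation divided by mean) as a rough measure of how orderly the failures are.  The function name, sample size, and seed are my own choices.

```python
import random
import statistics


def coefficient_of_variation(beta, eta, count=10_000, seed=0):
    rng = random.Random(seed)
    # weibullvariate(alpha, beta): alpha is the scale (eta), beta is the shape
    ages = [rng.weibullvariate(eta, beta) for _ in range(count)]
    mean = statistics.fmean(ages)
    return statistics.pstdev(ages) / mean


for beta in (50, 1.2, 1.0, 0.8):
    cv = coefficient_of_variation(beta, eta=100)
    print(f"beta={beta}: coefficient of variation of ages to failure = {cv:.2f}")
```

A small coefficient of variation means failures arrive like clockwork; a value near or above 1 means the age at the next failure tells you very little in advance.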

If the cost of an unplanned failure is very high compared to a planned replacement and beta > 1, the situation is easy to handle with a timed replacement strategy.  However, if the cost of an unplanned failure is approximately equal to that of a planned replacement, then run the component to failure.  If the failure modes are due to chance failure or infant mortality, then run the component to failure for any ratio of costs.  If the cost consequences of failure are very severe, then consider sparing alternatives.  Redundancy planning is more realistic than planning the timing of failures because it treats the issue as a money problem, as explained in the February 2001 problem of the month.


Refer to the caveats on the Problem Of The Month Page about the limitations of the following solution. Maybe you have a better idea on how to solve the problem. Maybe you find where I've screwed up the solution and you can point out my errors as you check my calculations. E-mail your comments, criticism, and corrections to Paul Barringer.

Technical tools are only interesting toys for engineers until results are converted into a business solution involving money and time. Complete your analysis with a bottom line expressed in dollars and time so you have answers that will interest your management team!

You can download a PDF file of this problem.

Last revised 03/09/2010
© Barringer & Associates, Inc. 2001
