Cost Of Unreliability

Cost improvement efforts are more productive when driven from the top down rather than the bottom up, because top management owns the effort to improve costs. Finding the cost of unreliability (COUR) starts with a big-picture view and helps direct cost improvement programs by identifying:

1.     Where is the cost problem--what sections of the plant,

2.     What magnitude is the problem--all business loss costs are included in the calculation, and

3.     What major types of problems occur--cutbacks, total outages, turnarounds, etc.

COUR programs study plants as links in a chain for a reliability system, along with the costs incurred when the plant, or a series of plants, fails to produce the desired result. As a top-down management effort, COUR achieves acceptance, support, and involvement of personnel at all levels within the organization for using a non-traditional cost improvement approach.

COUR begins with the big picture of failures to produce the desired business results, driven by failures of the process or its equipment. Elements of the process are considered as a series reliability model comprising links in a chain of events that deliver success or failure. Logical block diagrams of major steps or systems are identified. Failure costs are calculated by category, expecting that history tends to repeat in a string of chance events unless the problems have been permanently removed and success demonstrated by objective measures. One major failure of most businesses is the failure to produce the expected gross margin that meets or exceeds the capability of the facility. Each plant or system has its own unique failure pattern in producing the inherent financial results from the business system.

For each major block in the diagram (a single link in the chain of events), data is collected over a time period to identify significant cost contributors to COUR. At a minimum, consider two or more major contributor categories, such as failures at links in the business chain from:

1.     Total failures,

2.     Significant quantity losses due to partial failures,

3.     Substantial quality problems which restrict output from the system, and

4.     Turnaround costs.

The strategic thrust is protecting the business from losses by preventing failures that detract from delivery of the gross margin inherent in the physical assets and process for the investment.

Top management must set the strategy for protection against losses from the inherent capability of the investment to produce gross margin. The enemy is the loss of gross margin money. The tactics are to identify what needs protection, where protection is required (and where extra protection is unjustified), and what actions are required to deliver the best results.

Consider a simple example in Figure 1 to illustrate how to approach the problem. A continuous production plant consists of three major links in the chain. The chain must carry the production load to produce the desired financial results. When any link in the chain fails, the plant fails. Rational subgroups are clustered under the black boxes identified as A, B, and C--the summary box provides system results. Raw data is identified inside the dashed green line box. The data is used to calculate failure rates for each box. The failure rates are summed (because the chain of A, B, and C represents a series system) to find the overall failure rate for the system, as shown in the column marked summary. Notice the summary column is based on a 365 day mission or 8760 hours.

Failures over a given interval of time are the required input data. The output is how the system functions when the black boxes are treated as a reliability model. Figure 1 shows block B has the worst reliability issue (i.e., the highest failure rate). Notice that Figure 1 predicts, on the average, 1.7 failures of the system per year. Failures during the mission time are precursors for reliability: for a one year mission, the system reliability is R = exp(-1.7) = 18.3%, which says the probability of operating one year without a failure is 18.3%.
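The series-system arithmetic described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes a constant failure rate (the exponential model implied by R = exp(-1.7)), and because Figure 1 is not reproduced in this text, the split of the failure rate between blocks A and B is a hypothetical placeholder. Only the 1.7 failures per year total and block C failing about every other year come from the article. The function names are invented for the example.

```python
import math

HOURS_PER_YEAR = 8760  # 365 day mission, as in the summary column


def system_failure_rate(block_rates):
    """Sum per-block failure rates; valid for a series (chain) system."""
    return sum(block_rates.values())


def mission_reliability(failure_rate, mission_time=1.0):
    """Probability of zero failures over the mission: R = exp(-rate * time)."""
    return math.exp(-failure_rate * mission_time)


# Failures per year. Only the 1.7 total and C's 0.5 come from the text;
# the A and B values are hypothetical placeholders that sum correctly.
rates = {"A": 0.2, "B": 1.0, "C": 0.5}

lam = system_failure_rate(rates)
r = mission_reliability(lam)
mtbf_hours = HOURS_PER_YEAR / lam

print(f"system failure rate  = {lam:.1f} failures/year")
print(f"one-year reliability = {r:.1%}")
print(f"mean time between failures = {mtbf_hours:.0f} hours")
```

Summing the rates works only because any single block failure stops the whole chain; a plant with redundant blocks would need a different model.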


Figure 1 shows the major problem is unreliability in block B. The failure rate is too high.

Next, find time lost from the failures to help identify where the problem may lie. Time lost is in proportion to production units lost and thus money never obtained from sale of the production from the plant.


Figure 2 shows the maximum time lost occurs in block C, which fails every other year on the average. When block C is down, it's down a long time. Thus block C has a maintainability problem.


The average availability of the system is A = (8760-69.1)/8760 = 99.2% which is very high for most operating plants.
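As a rough check on the availability figure, the calculation can be written as a small Python sketch. The 69.1 hours of lost time per year is the value used in the text. The `expected_downtime` helper and the 80 hours per failure passed to it are hypothetical, included only to show how a block's lost time could be built up from its failure rate and repair time.

```python
HOURS_PER_YEAR = 8760


def availability(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Fraction of the period during which the system is up."""
    return (period_hours - downtime_hours) / period_hours


def expected_downtime(failures_per_year, hours_per_failure):
    """Average hours lost per year for one block."""
    return failures_per_year * hours_per_failure


a = availability(69.1)  # 69.1 hours/year lost, as quoted in the text
print(f"availability = {a:.1%}")

# Hypothetical: a block failing every other year and down 80 hours each time.
print(f"example block downtime = {expected_downtime(0.5, 80):.0f} hours/year")
```

High availability and low one-year reliability can coexist, which is why the two measures are reported separately.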

 The real issue is to quantify the cost of unreliability in monetary terms.


Figure 3 shows where the costs accumulate. From the management perspective:
The key issue is clearly to keep the plant running to generate gross margin, as ~70% of the losses occur from loss of gross margin in this sold-out plant.
The second management priority is reducing breakdown maintenance costs.
Complaining about scrap disposal losses will only bring confusion into the real issues.
The Pareto problem for the site manager is to keep the plant running to produce gross margin.
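A cost roll-up of this kind might look like the Python sketch below. Figure 3 is not reproduced in this text, so every dollar rate here is a hypothetical placeholder chosen only to land near the article's "~70% from lost gross margin" and "million-dollar a year" statements. The cost model itself (margin and maintenance proportional to hours lost, scrap charged per failure) is also an assumption, not necessarily the one behind Figure 3.

```python
def cost_of_unreliability(lost_hours, failures, margin_per_hour,
                          maintenance_per_hour, scrap_per_failure):
    """Return annual loss by category for a block or for the whole system."""
    return {
        "lost gross margin": lost_hours * margin_per_hour,
        "breakdown maintenance": lost_hours * maintenance_per_hour,
        "scrap disposal": failures * scrap_per_failure,
    }


# 69.1 hours and 1.7 failures per year come from the text.
# All dollar rates below are hypothetical placeholders.
costs = cost_of_unreliability(
    lost_hours=69.1,
    failures=1.7,
    margin_per_hour=10_000,
    maintenance_per_hour=4_000,
    scrap_per_failure=500,
)
total = sum(costs.values())

# Print categories in Pareto order (largest loss first).
for category, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{category:<22} ${cost:>10,.0f}  {cost / total:6.1%}")
print(f"{'total':<22} ${total:>10,.0f}")
```

Running the same function once per block, rather than once for the system, gives the per-block ranking used to set priorities.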

On the average, this plant has a million-dollar-a-year problem. So what's the management strategy for reducing the cost?
First, solve the maintainability problem in block C (worth ~62+% of the losses). Use your maintenance engineering skills.
Second, solve the unreliability problem in block B (worth ~36+% of the losses). Use your reliability engineering skills.
Notice the losses in block A are so small they may not be worth fixing and may never get fixed. Why would you assign an engineer to solve this problem?
If you don't know the illness and don't have a single name to explain it, you'll never fix it! Notice how the cost of unreliability analysis separates maintainability problems from unreliability problems and prioritizes the work.

From the operations point of view, the biggest problem with the cost of unreliability lies in block C (a maintainability problem), followed by block B (an unreliability problem). Notice that the cost of unreliability also clearly tells how much money can be spent for correcting the problems. For example, if the requirement is for a one year payback, then you cannot spend more than ~$600,000 to correct the problems in block C and ~$300,000 to correct the problem in block B.
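The payback reasoning can be sketched as follows. The annual losses are the approximate figures quoted in the text; the function names are invented for the example, and the payback calculation assumes the fix removes essentially all of a block's annual loss, which a real project would need to verify.

```python
def max_justified_spend(annual_loss, payback_years=1.0):
    """Largest one-time spend that still meets the payback requirement."""
    return annual_loss * payback_years


def simple_payback_years(spend, annual_saving):
    """Years needed for the annual saving to repay the spend."""
    return spend / annual_saving


# Approximate annual losses quoted in the text, in dollars.
annual_loss = {"C": 600_000, "B": 300_000}

# Work the blocks in Pareto order (largest loss first).
for block, loss in sorted(annual_loss.items(), key=lambda kv: kv[1], reverse=True):
    print(f"block {block}: spend up to ${max_justified_spend(loss):,.0f}")

# Assumes a $100,000 fix removes essentially all of block C's loss.
months = simple_payback_years(100_000, annual_loss["C"]) * 12
print(f"payback on a $100,000 fix in block C = {months:.0f} months")
```

A longer allowed payback period raises the spending ceiling in direct proportion, so the payback requirement should be agreed before budgets are set.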

This calculation is also helpful for applying for hero awards. For example, suppose you can correct the problems in C by spending $100,000; then you have the facts to support submission of details for your hero award. Figure 3 also shows that you cannot afford to dedicate any one person to work in block A, as the annual costs for a person can exceed the potential gains from reducing the costs.

Figure 3 gives the financial state of the plant on one side of one sheet of paper to help set the strategy for improvement. The strategy should drive the tactics for getting back the losses by working on the problems in Pareto order.


Barringer & Associates, Inc., 1998, 2001, 2008


Last revised 6/20/2010
Copyright 2003