Reliability Success

Minimum cost for a given level of performance and risk

The continuing popularity of Time-Based maintenance

A recent post in the Reliability Success forum asked the question: "Why is time based PM still very popular if > 80% of failures are not related to time?"

My take on this? There are four reasons. Some are easy to get around with adequate training and mentoring, others take a bit more time to sort out.

1) The message still hasn't reached everybody... Believe it or not. Every single time I give the RCM course, and I have trained nearly 4000 people now, people are always taken aback by the fact that 89% of failures in the N&H study were not related to time.

2) Fear of the new. I have trained and facilitated groups in RCM who, after working through every failure mode themselves, still want to run their old program in parallel with the new RCM failure management strategies. There is no logic to this...just a fear.

3) Safety. The catch cry of everyone who is losing an argument in maintenance and reliability. Yet time based maintenance is demonstrably **more dangerous** than other forms. 

4) Bad introductions. I have seen many (MANY) RCM practitioners who drop the Nowlan and Heap failure curves on people without explaining the back story.

What is a "constant failure mode" really, how do they come about, what is the difference between complex and simple assets...etcetera. Once the facts are known, the logic makes sense. 

This is what I have seen in any case. I would be interested in hearing what others have seen on this issue. 

(By the way) I was working with a colleague once who said that any conversation with an RCM practitioner always starts with "first accept that you are wrong". 

That might be part of the reason also... ;-)


Check out the best roles in the game for maintenance and reliability professionals!


Preventing Failures of RCM - Pt 1

After a long and winding journey RCM is finally finding its rightful place as a cornerstone of modern asset management. The benefits are well documented now and cover all aspects where physical assets have an impact on corporate performance.

Yet still many programs end with a whimper.

Lack of management support, poor asset selection, lack of momentum and taking technical shortcuts are undoubtedly killers of any RCM program. The overuse and misuse of criticality, streamlining the method instead of the implementation process, and poor program management account for a lot of these issues.

Yet all are joined in one classic error; the failure to adequately train RCM Analysts.

When reviewing RCM analyses I often come across annoyingly similar mistakes, all of which have potentially harmful impacts, and all of which are avoidable if the RCM Analysts are properly prepared in the first place.

These errors tend to fall into four categories, and aside from the impacts below they are all motivation and momentum killers.

Failure modes at the wrong level of causality - This leads to blanket strategies not connected to the failure mechanisms, overuse of the run-to-failure option, and excessive spares requirements.

The worst effect of this of course is the fact that not all of the reasonably likely failure modes have been uncovered. Yet everyone familiar with RCM expects that they are. False sense of security, unmet expectations, classic failure of the implementation.

Combined or ambiguous failure modes - "Control Failure" is a classic example of this, so too is "bearing fails due to wear or contamination". Again there is almost no way to get the right sort of failure management strategies in these situations. This is not uncommon where someone is trying to justify what already exists, instead of determining the real maintenance requirements.

Developing strategies without using the decision diagram(s) - This is a shocker and almost always happens to first time RCM Analysts. The trap is inserting the strategies that already existed instead of developing strategies to manage the failures found. The result is a zero benefit analysis, or worse - one that is incorrect. (And potentially dangerous)

It is easy to come undone here. Time based failures really seem like they should be managed using a predictive task of some sort... but how is this done? Lots of questions like this exist between completing an RCM Analysts course and becoming good at it.

Misapplication of the Detective maintenance formulas - This is far too big to discuss as part of this stream. But suffice to say that it is one of the real potentially dangerous areas of the analysis.

These are all errors related to technical soundness of the analysis, and there are of course, lots more like:

  • Defining design capacity instead of user requirements
  • Defining functions that are already covered by the primary function
  • Basing detective maintenance frequencies on evident (safe detected) instrument failures... and so on...

In the next posting in this series we will go into some of the program management failures. Those failures that almost guarantee the work will get little support, little momentum, and very little chance of accomplishing what the organization has set out to accomplish.

The way to combat these failures is to have adequately trained RCM Analysts. Analysts who have received both classroom and on the job training, as well as being coached through the program management issues. 



Re-defining the role of Electrical tradespeople

Even at the beginning of the twenty first century, almost every maintenance review we do uncovers a raft of electrical tasks and routines that serve no purpose apart from raising the likelihood of equipment failures.

When I was a baggy shorts apprentice back in the 1980s we used to do lots of tasks like opening up motor termination blocks and marshaling cubicles and checking the terminals for tightness.

And then, at least once a year, there would be a huge influx of large DC motors from the mining shovels and draglines which we were supposed to overhaul.

This ended up being a change of the bearings, re-painting the windings to restore the insulation's original resistance to failure, and running a Wheatstone bridge over the windings looking for indications of early life failures.

In the vast majority of cases this was a dramatic waste of time! And I cannot believe that this thinking still exists today.

Something I have learned - if it is not subject to ambient vibration, and the cables don't move - then nothing is going to come loose! (Period)

In fact, you run the risk of loosening the terminals by mucking around with them. As well as messing up the gaskets, introducing foreign matter and moisture, and a whole host of other issues under the heading of "messing with things that are working fine".

In the case of DC motor overhauls, most of this stuff can be done in situ. Skimming armatures, replacing and bedding in brushes, and getting rid of excess carbon build up are all small and regular tasks that need to be done in situ, not in an overhaul situation.

Bearings should NOT be replaced on a hard time basis! This is one of the greatest scams of modern asset management. Do you really need to re-paint the windings to restore the insulation? I have yet to find a case where not doing this has led to early failures. (But there are many cases where interfering has caused failure!)

So what should Sparkies do then?

There are routine tasks that electricians should be doing, and some of these are above. But principally electricians are there for the hard hitting end of the deal. The moment when it all turns to muck and we need to rapidly get to the bottom of the problem.

The job creation works as outlined above are more likely to lead to failure rather than prevent or predict it. The heart of the problem is our attitudes towards maintenance people and their employment.

If they aren't actively engaged in maintaining assets then we see them as wasting our funds. Yet with a small change of mind we could employ the electrical trades in a lot of higher end tasks such as analytical problem solving and reviewing general maintenance practices.

We don't need to force them into activities that are detrimental to our operations...surely.



Maintenance and Management - The root of the problem (Pt 2)

If we are ever going to get around The Budget Game, and build budgets that truly reflect our real spending for a desired level of performance and risk, then we need to get to the base of how it is done nowadays.

When a maintenance manager starts thinking about what she will need for maintaining the plant over the next twelve to twenty four months, her first port of call is often a combination of the anticipated routine activity for the future, and the near past.

Routine forecasting

Future proactive activity generally comes from the forecast planned maintenance tasks held in the company's CMMS / ERP system. If an organization has its act together, then this will include both the OPEX maintenance and the Capital maintenance planning. (And surprisingly few actually do have this stuff together)

But... where did this stuff come from in the first place? If your plant is anything like most plants then job-expectancy is around 5 years, meaning that few people even realize why these activities are being done and even fewer have the foggiest as to where they came from.

Once you start to dig into it a little, you find out that the routine OPEX stuff, operating maintenance let's call it, generally is a combination of manufacturers' recommendations and experience. (Experience meaning "Ouch that hurt, let's not do that again")

There have been reams of articles written on this, but there are many issues with manufacturers' guidelines. One particularly nasty issue is the nature of a manufacturer's business model. This is generally something like: "move as quickly as possible to the next model, rush through the basic engineering, make sure all maintenance recommendations are conservative, and above all DO NOT GET SUED!"

So it isn't too much of a leap in logic to work out that building and relying on a maintenance strategy that comes from this background, or waiting until things go wrong and force our hand, is both undesirable and possibly even unethical.

And then there is the real fun stuff...the capital maintenance plans. For the sake of this article we will say that these are all the major refurbishments, replacements and overhauls.

Where does this come from? Does anybody ever really know?

My experience, again after researching into the dim dark pasts of many plants, sites and companies, is that it is often provided by either accountants or the initial contractor.

And the logic is often tied to either the depreciation dates of the assets, which is a bit of a financial black art once you get into it, or it is tied to something a contractor put into his / her spreadsheet because "that's what they were asked to deliver".

That's a bit frightening isn't it? Not something you would want to bet your career on is it? (or worse, the lives of those working with or near these assets)

Forecasting Corrective Maintenance

This is where things really go haywire. Every person who has had anything to do with Life Cycle Modeling, or Whole-of-Life asset planning, has generally encountered this problem.

How do you forecast corrective maintenance? Most people generally arrive at the point where they decide to take an average of the past 2 years (pick a number) and then use this, minus 10% (because you're going to get better right?) as the means of forecasting corrective actions.

And the problem is???... It is absolute sheer and utter garbage.

What happened last year, or in the last five years, may have little or nothing to do with what will happen next year. There are failure modes that have not yet occurred, corrective actions that have been eliminated totally, and a whole range of additional considerations on the efficiency front.
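To make the rule of thumb concrete, here is a minimal Python sketch of the "average of the past, minus 10%" forecast described above. The function name and the cost figures are invented for illustration; this is the arithmetic being criticized, not a recommended method.

```python
from statistics import mean


def naive_corrective_forecast(past_costs, improvement=0.10):
    """Average of past annual corrective costs, less an assumed improvement.

    This reproduces the rule of thumb criticized in the text. It knows
    nothing about failure modes, only about what was spent before.
    """
    if not past_costs:
        raise ValueError("need at least one year of history")
    return mean(past_costs) * (1.0 - improvement)


# Hypothetical annual corrective spend, oldest first (invented numbers).
history = [650_000, 1_400_000, 900_000, 1_200_000, 800_000]

# "Pick a number": the answer depends on how many years you choose to average.
for years in (2, 3, 5):
    window = history[-years:]
    print(f"{years}-year basis: {naive_corrective_forecast(window):,.0f}")
```

Run against the same made-up history, the three averaging windows give three different budgets, and none of them says anything about failure modes that have not yet occurred or ones that have since been eliminated.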

The size of the prize

So we are now facing the point where we know:

a) The routine maintenance may, or may not, have anything to do with the levels of performance and risk we require from our assets.
b) Not only that, but it came into being under dubious circumstances to say the least.
c) Our routine capital maintenance is probably overly conservative or some other derivative of fairy land forecasting, and
d) Our corrective forecasts are, at the very least, dead wrong.

Comforting isn't it? Took me a while to get to this point in my thinking some years ago.

But if we get it right what could happen?

1) On the local front the Budget Game turns from an annual competition and negotiation to a discussion about performance and risk.

2) We get strategies designed to deliver the performance and risk we are chasing, and most importantly...

3) We start down a path that will get us a greater level of confidence, dramatically greater, about our net present costs. (All of the costs we will ever have in today's money)

The last one is a kicker, and we went into the real value of high confidence Net Present Value forecasts in the last article.
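For readers who have not worked with the term, "all of the costs we will ever have in today's money" is a discounted sum of future costs. The sketch below is a minimal illustration only: the function name, the cost stream and the discount rate are assumptions, and it assumes each cost falls at the end of its year.

```python
def net_present_cost(annual_costs, discount_rate):
    """Discount a stream of future annual costs back to today's money.

    annual_costs[0] is assumed to fall at the end of year 1,
    annual_costs[1] at the end of year 2, and so on.
    """
    if discount_rate <= -1.0:
        raise ValueError("discount rate must be greater than -100%")
    return sum(
        cost / (1.0 + discount_rate) ** year
        for year, cost in enumerate(annual_costs, start=1)
    )


# Hypothetical five-year maintenance cost forecast and discount rate.
forecast_costs = [900_000, 950_000, 2_400_000, 920_000, 940_000]
rate = 0.08

print(f"Net present cost: {net_present_cost(forecast_costs, rate):,.0f}")
```

The calculation itself is trivial. The confidence argument in the text is about the inputs: the figure is only as good as the cost forecast fed into it.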

In short, it takes maintenance from something we think (because everyone is telling us) is a strategic initiative - through to a firm board room topic and long term competitive advantage.

Not bad for a bunch of grease monkeys (like me) and technicians is it?


At the heart of reliability

"Reliability engineering is concerned with forecasting and preventing failures..."
I just read this in an article by a respected author and practitioner on reliability engineering. The sad fact is that he is wrong, and this line of thinking has been wrong for about half a century now... old habits die hard I guess.

By forecasting this guy means "forecasting" as in probabilistic modelling. Now while I agree with the use of probabilistic tools where it is warranted, blanket statements like this are what lead people into dramatically over-analyzing and over-maintaining their plant items.

Let's take the case of a bearing failure. Due to long term minor overloading, cracks have developed within the inner race, breaking the surface and rapidly contributing to the deterioration, and ultimately the failure, of the bearing.

The consequences of this failure are severe, so severe that a condition monitoring regime has been put in place and is being done at 33% of the P-F Interval. (To make sure that the onset of failure is detected)

How then are you going to prevent the failure in this case? There's no way known to man... the bearing is going to fail just as the sun will rise again tomorrow. Nothing in this world will stop it... what we can do, however, is preempt it somehow, through early interventions, changes to the production run cycle or whatever other options you may have.

Are we about predicting failure here? Yes! Not forecasting but predicting as part of our failure management strategy.

And why have we bothered? Because the consequences are severe.

Let's take another example of an over speed switch in a turbine. A plant has 6 turbines. After careful consideration of the demand rates, failure rates and acceptable / tolerable levels of risk we calculate that these need to be checked every 18 months. (say)

We perform our baseline checks and find everything to be okay, and it is not until we check for the third time that we actually find that one of the over speed switches is now in a failed state. (Meaning it will not work to protect the machine if it is needed)

Preventing, avoiding and even (in this case) predicting the failure is way out of the question. Actually it has already failed.

Again we see that the reason why we are doing this at all is not to predict or avoid failure per se, it is to manage the consequences to a tolerable / acceptable level.
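The post does not say how the 18 month figure in the over speed switch example was calculated. As a rough illustration only, the sketch below uses a linear approximation commonly quoted in RCM texts for a single protective device: interval = 2 x required unavailability x MTBF of the device, where required unavailability = mean time between demands / tolerable mean time between multiple failures. The function name, the input values (chosen so the answer lands on 18 months) and the 5% validity limit are all assumptions, and this is exactly the kind of formula that is easy to misapply.

```python
def failure_finding_interval(
    device_mtbf_years,
    mean_years_between_demands,
    tolerable_years_between_multiple_failures,
    max_unavailability=0.05,  # assumed validity limit for the approximation
):
    """Linear approximation of a failure-finding interval, in years.

    Applies to a single protective device with a hidden failure mode.
    All three inputs are mean times expressed in years.
    """
    unavailability = (
        mean_years_between_demands / tolerable_years_between_multiple_failures
    )
    if unavailability > max_unavailability:
        raise ValueError(
            "required unavailability is too high for the linear approximation"
        )
    return 2.0 * unavailability * device_mtbf_years


# Hypothetical inputs for one over speed switch (not from the original post).
ffi_years = failure_finding_interval(
    device_mtbf_years=75.0,                           # switch fails hidden once in 75 years
    mean_years_between_demands=100.0,                 # an over speed event once in 100 years
    tolerable_years_between_multiple_failures=10_000.0,
)
print(f"Check the switch every {ffi_years * 12:.0f} months")
```

Note what the calculation does and does not do. It says nothing about when any switch will fail; it only sets how often to look so that the chance of the switch being dead when it is needed stays at a tolerable level.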

So where is all this going...

Even those who are very deeply embedded in probabilistic analysis realize that the likelihood of accurately forecasting failure is very remote, because the data is never available. In fact, in my own experience with probabilistic analyses I have found that most turn into projects to try to find relevant data to use in the model.

The famous statement on the use of Weibull is that you only need 3 failure points. Fair enough... but getting even those three is often exceedingly difficult.

They need to be of the same failure mode, and if they are serious enough to warrant investigation then they carry significant safety / economic consequences. So analyzing them after the fact is almost in the realm of negligence, isn't it?
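To show how little a three point Weibull fit actually involves, here is a minimal sketch using median rank regression with Benard's approximation for the median ranks. It assumes complete failure data (no suspensions) from a single failure mode. The function name and the failure times are invented for illustration, and nothing here suggests this is the method the article's author had in mind.

```python
import math


def fit_weibull_median_rank(failure_times):
    """Estimate Weibull shape (beta) and scale (eta) by median rank regression.

    Regresses ln(-ln(1 - F)) on ln(t), where F is Benard's approximation
    of the median rank. Complete failure data only; no suspensions.
    """
    times = sorted(failure_times)
    n = len(times)
    if n < 2 or times[0] <= 0:
        raise ValueError("need at least two positive failure times")

    xs = [math.log(t) for t in times]
    ys = []
    for i in range(1, n + 1):
        median_rank = (i - 0.3) / (n + 0.4)  # Benard's approximation
        ys.append(math.log(-math.log(1.0 - median_rank)))

    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("failure times must not all be identical")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    beta = sxy / sxx                    # slope of the fitted line
    intercept = y_bar - beta * x_bar    # equals -beta * ln(eta)
    eta = math.exp(-intercept / beta)
    return beta, eta


# Three hypothetical failure ages (operating hours) for one failure mode.
beta, eta = fit_weibull_median_rank([410.0, 780.0, 1150.0])
print(f"shape beta = {beta:.2f}, characteristic life eta = {eta:.0f} hours")
```

The arithmetic is the easy part. The hard part, as the text says, is finding three genuine failures of the same failure mode to feed it.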

The whole point of modern asset management is not to predict / forecast dates of failure - it is to manage the failure process where the consequences warrant it!

The Predictive and Detective maintenance examples above are pretty clear on this. And then there are run-to-failure cases, where we have determined that the most effective means of managing the asset is to actually let it fall over.

Do probabilistic methods have their place? Of course!! I'm a big fan of most of them and I use them regularly within my team and our business - but only where they are the best option. (You know the old story, when you have a hammer everything looks like a nail)

The real danger is thinking that it is all about preventing or avoiding failure. It is not - and thinking it is will lead only to frustration, over-maintenance, and misapplied maintenance strategies.


Maintenance and Management - The Budget Game (Pt 1)

The Budget Game is played out year after year in the vast majority of asset intensive organizations all around the world. 

And what is the budget game? I'm sure you will recognize it.

1) I know that the manager is going to try to cut back my budget so I am going to pad it out a little.

2) The manager knows that you are trying to pad it out a little, so he knows he has to cut it back.

3) And when we are getting near to the end of the financial year make sure to spend every cent you have or they will take it away from you next year.

Survival of the fittest. Those who can outmaneuver the others win this round and get what they want. But generally, the entire organization pays the price.

The seriousness of this didn't occur to me until I saw the case of Severn Trent Water in the UK, where the Serious Fraud Office was called in by the industry regulator, with implications of charges being laid against individuals.

Wow...the entire industry took a deep breath and suddenly every conversation had a different shade to it. 

This was obviously a very specific case and an extreme example. But the more I thought of it the more logical it became. When you pad out a budget what you are actually doing is defrauding the shareholders, owners, and in the case of regulated industries sometimes even the general public.

Yet everyone plays the budget game every year... that's a frightening thought when you look at it in the light of this line of thinking. 

It was about this time that I, and a few select clients at the time, began to look seriously at budgeting practices and how they could be improved to eliminate The Budget Game.

The screamingly obvious issue was that without a solid and logical tie between performance and risk, and the activities required to achieve it, this issue was never going to go away.

And in the mire we uncovered a few practices, techniques and planning mechanisms that would not only eliminate the budget game but, when implemented correctly, could force an entirely new dynamic on the management of the asset base.

The first of these was Zero Based Budgeting, the second, an evolution of the first, was Risk Distributed Budgeting. 

While Zero Based Budgeting was not new, what had been missed previously was its capability to feed into the long term planning accuracy by setting the framework in place for proactive data capture. (E.g. failure data without crashing a few more assets)

Over the next couple of weeks I am going to post a series on both of these methods, trying to go into detail about what they are, why they matter, how to do it, and of course - how to implement it. (You might want to subscribe via the links at the top, or via the email subscription box on the side.)

Without even considering the financial and risk impacts of this form of work during the implementation, other impacts of this work include:

  • Tying costs directly to the required / expected performance and risk of the physical asset base
  • Eliminating costs tied to bad habits formed over years. (Remember the monkeys?)
  • Setting up a framework for proactive data capture
  • Increasing the accuracy of capital maintenance activities
  • Elimination of The Budget Game

And if your organization tracks such things, and if it does not then it should, you start to get a vastly higher level of confidence about the net present costs of the asset base. Meaning, more importantly, a far greater level of confidence over the Net Present Value or profits of an organization.

That's the sort of thing clients can take to the bond market for billions, not millions, of dollars in potential benefits.

Game changing ideas...


Part 2 - The Root of the problem


Monkey see, monkey do...

Ever hear the one about the monkeys and cultural change? I am probably not going to do it justice but it goes something like this...

There are three monkeys standing in line in a cage, and above the third monkey there is a bunch of bananas. The third monkey naturally reaches for the sweet treats, and as he takes one, the other two monkeys are drenched with water.
So they immediately start at the third monkey who is busily munching on his favorite food. But he doesn't realize what's happening, so he reaches for another banana and the other two are deluged.
By the time the third monkey has eaten the bunch of bananas, the other two are quite annoyed. So in steps the scientist, and replaces the third monkey with a new monkey. He spies the bananas and as he stretches out his arm, he is attacked by the other two monkeys.
The new monkey doesn't quite understand why, but quickly stops going after the bananas. Some time passes and the scientist comes back and takes one of the drenched monkeys and replaces him. This new monkey again goes for the bananas and the other two attack him.
Then the scientist replaces the last of the original monkeys with a new one. This new monkey is immediately attacked, and has no idea why. Even when the banana/water system is disabled, and another monkey introduced, he is attacked immediately.
And if the scientist keeps repeating the experiment, the two monkeys in the cage attack the new monkey being introduced, though nobody can remember why; it's just the way it is.
This is how maintenance regimes are formed, how streamlined and often counterproductive RCM techniques become "the way we do things around here", and repeat or chronic failures become accepted as part of the cost of doing business.

In the past twelve months we have seen the downfall of several global banking institutions, accompanied by more than a few career burn outs as well.

These people were all exceptionally smart, just like you. And they were all exceptionally motivated and hard working, just like you.

So what went wrong? Mindless copying of their peers... monkey see, monkey do...

What are you doing?
