RCM – Living RCM: Achieving reliability from data

1 The objectives of LRCM (slides)

Slide 1 LRCM objectives

Living RCM (LRCM) was created to address two vital issues left unresolved by initial RCM and the EAM. First, observational data entered on the work order is, generally speaking, of insufficient quality for Reliability Analysis (RA) and decision modeling. Efficient maintenance management requires model-based decision making. Otherwise maintenance personnel would have to decide everything from first principles, or from “rules of thumb” based on experience but never scrupulously confirmed. Maintenance decision “models” are policies, rules, or algorithms that permit quick, decisive, verifiable action.

The second issue addressed by LRCM concerns the RCM knowledge base itself. Maintenance departments generally consider RCM a one-time project, undertaken primarily to devise the EAM PM work schedules. The initial RCM analysis and the tasks derived from it tend to remain static. Given the dynamics of production, maintenance technology, and growing experience, the plan quickly grows stale. Additionally, the initial RCM analysis was imperfect, a consensus formed from the analysts’ best recollections. To solve this issue, LRCM keeps the RCM analysis in view from within the day-to-day work order process, so that the plan is continually revised with little effort and approaches ever closer to reality.

Objective 1: Work order data quality for RA – the nature of data

Slide 2 The nature of data

Although maintenance personnel, including analysts, planners, managers, and engineers, seem preoccupied with data, they seldom clearly define that beast. They should target three distinct categories of maintenance-related data. “Age data”, though vitally important, receives too little attention within the maintenance process. Age data records the age at which a failure mode ends its life and the type of ending event, either failure or preventive renewal (the life was “suspended”). During EAM configuration, maintenance engineers have typically neglected to emphasize age data as a vital process requirement.

Condition monitoring (CBM, PdM) data, on the other hand, is well structured and accessible.

Finally, cost data has a singular purpose, but its role in maintenance has gotten lost in the shuffle. Cost data provides the optimizing objective within a maintenance decision model. Analysts should use the EAM and related APM tools to nail down the business factors surrounding failure and prevention, best expressed as the ratio of the cost of a failure mode’s failure event to the cost of preventing that event through some proactive program.

Age data

Slide 3 Age data time line

Age data derives from one or many time lines marking events relating to a failure mode. A named position on the time line marks the ending of a life. It may also mark the beginning of a new life. The way in which a life ends (either by failure, potential failure, or suspension) is a vital attribute of each point in a sample. A sample is a collection of points, and a point is a lifetime defined by two positions, a beginning and an ending, on a time line.
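The vocabulary above can be sketched as a small data structure. This is only an illustration: the names `EndingEvent` and `Lifetime`, the field names, the failure mode label "FM-A", and the ages are all hypothetical and are not taken from any EAM or MESH schema.

```python
from dataclasses import dataclass
from enum import Enum


class EndingEvent(Enum):
    """The way in which a life ended (hypothetical codes)."""
    FAILURE = "FF"
    POTENTIAL_FAILURE = "PF"
    SUSPENSION = "S"


@dataclass(frozen=True)
class Lifetime:
    """One point in a sample: a life between two positions on a time line."""
    failure_mode: str     # label of the failure mode this life belongs to
    begin_age: float      # working age at which the life began
    end_age: float        # working age at which the life ended
    ending: EndingEvent   # how the life ended

    @property
    def duration(self) -> float:
        return self.end_age - self.begin_age


# A sample is a collection of points (made-up values for illustration).
sample = [
    Lifetime("FM-A", 0.0, 1200.0, EndingEvent.FAILURE),
    Lifetime("FM-A", 1200.0, 2100.0, EndingEvent.SUSPENSION),
    Lifetime("FM-A", 2100.0, 3500.0, EndingEvent.POTENTIAL_FAILURE),
]

durations = [life.duration for life in sample]
# Lives that ended by failure or potential failure, as opposed to suspension.
failures = [life for life in sample if life.ending is not EndingEvent.SUSPENSION]
```

Note that each ending position (1200, 2100) is also the beginning of the next life, which mirrors the statement that a position on the time line may mark both an ending and a new beginning.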

Ending event types: failure, potential failure, and suspension

Slide 4 Three ending event types

For reliability analysis to be an effective tool in optimizing the maintenance process, the events that occur daily must be clearly defined by the organization. These events are contextually dependent. An event that is considered a failure in one organization may be defined as a potential failure or even a suspension (a non-failure) in another. Standards for defining these events must be set and agreed upon by the manager and the team. Images, incorporated into the RCM knowledge base, can help to establish consensus on what exactly constitutes failure.

Non-rejuvenating events

Slide 5 Non-rejuvenating events

RCM tends to be pedantic in its word definitions. Besides developing good context-dependent definitions for failure, we also wish to make a clear distinction between “maintenance” and “service”. Maintenance is the restoration of, or the addition of, life units to a part or failure mode. Service, on the other hand, consists of tasks, such as lube top-ups and change-outs, that do not add any micrometers of metal back to part surfaces. Nevertheless, service tasks are required if the component is to achieve its design reliability. Usually, the manufacturer’s service recommendations are acceptable, at least initially, and should not be hotly debated within an RCM analysis session. However, it is important to note and record a service task when it affects a CBM-monitored variable, such as the metal content of an oil sample. If the service task is performed shortly before a CBM task, it could deceive the CBM decision process. The CBM procedure should account for this possibility.

Condition data

Slide 6 Two types of CBM variables

In recent decades condition data has assumed dominant attention in maintenance information procedures and tools. Two sub-categories of condition data, however, are little known and seldom discussed, yet they convey much needed insight into the role and effectiveness of CBM.

External variables detect the stresses on an asset even before the failure process has begun, while internal variables reflect the failing condition of the asset.

Both types of CBM data may be used in a predictive maintenance program as both influence the probability of failure.

Examples of external and internal condition data

Slide 7 Examples of internal and external CBM variables

Most maintenance departments focus heavily on internal variables such as vibration analysis and oil analysis. However, external condition data may often be easier to acquire and more “predictive”, providing a longer PF interval, since external data reflects the stresses applied to the asset before any discernible damage has occurred.

Cost data

Slide 8 Cost model equations
Slide 8 Cost minimization model for determining the optimal preventive renewal age.

As can be inferred from the comments of Slide 1, cost data provides the pivot point for optimal decision modeling. The three equations are simply mathematical language for the following logical statements:

  1. The expected maintenance cost Ct of an item during a single life cycle will be the cost CR of preventing a failure times the probability R(tp) that the item survives to preventive action time tp, PLUS the failure cost CF times the probability (1-R(tp)) that it fails prior to tp.
  2. An analogous statement can be made for the expected duration tt of the life cycle, which ends either at preventive action time tp or at an earlier failure.
  3. The ratio Ct:tt is simply the division of 1 by 2, which is what we wish to minimize in order to achieve maximum return for the shareholders of the enterprise.

A numerical algorithm is then applied to determine the optimal time to perform maintenance.
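The three statements can be sketched numerically as follows. The sketch rests on several assumptions that are not in the source: a Weibull reliability function R(t) with shape `beta` and scale `eta`, the expected cycle length computed as the integral of R(t) from 0 to tp (the usual age-replacement form, which may be written differently on Slide 8), a plain grid search standing in for the numerical algorithm, and made-up costs and parameters.

```python
import math


def reliability(t, beta, eta):
    """Weibull survival probability R(t); the choice of distribution is an assumption."""
    return math.exp(-((t / eta) ** beta))


def expected_cycle_length(tp, beta, eta, steps=2000):
    """Statement 2: expected life-cycle length, integral of R(t) on [0, tp] (trapezoid rule)."""
    h = tp / steps
    total = 0.5 * (reliability(0.0, beta, eta) + reliability(tp, beta, eta))
    for i in range(1, steps):
        total += reliability(i * h, beta, eta)
    return total * h


def cost_rate(tp, c_r, c_f, beta, eta):
    """Statement 3: expected cost per cycle (statement 1) divided by expected cycle length."""
    r = reliability(tp, beta, eta)
    expected_cost = c_r * r + c_f * (1.0 - r)
    return expected_cost / expected_cycle_length(tp, beta, eta)


def optimal_renewal_age(c_r, c_f, beta, eta, candidates):
    """Grid search: the candidate age tp with the lowest cost per unit of working age."""
    return min(candidates, key=lambda tp: cost_rate(tp, c_r, c_f, beta, eta))


# Illustrative values only: failure costs ten times as much as prevention.
best_age = optimal_renewal_age(c_r=1.0, c_f=10.0, beta=3.0, eta=1000.0,
                               candidates=range(50, 3001, 10))
print(best_age, cost_rate(best_age, 1.0, 10.0, 3.0, 1000.0))
```

With a wear-out failure mode (`beta` greater than 1) and a failure cost well above the prevention cost, the search returns a finite renewal age. With `beta` equal to 1 (random failures) the cost rate keeps falling as tp grows, so preventive renewal at a fixed age brings no benefit.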

The key (EAM-RCM) relationship

Slide 9 The key relationship

The main theme of this module concerns the objectives of LRCM. The first objective, to acquire analyzable data from the work order, depends almost entirely on the integrity of the relationship illustrated on Slide 9. The catalog values must faithfully represent the failure modes analyzed in the RCM knowledge base. The catalogs are an imperfect lens on the RCM knowledge base, since the EAM does not include the Effects analysis. Furthermore, should situations change, or should unanticipated failure modes arise, an appropriate catalog value will not be found when completing the data on the work order. For these reasons, the EAM work order history has proven inadequate as a data sample source for reliability analysis.

Data samples for RA

Slide 10 Samples for reliability analysis

Once a process is in place that guarantees analyzable-grade data on the work order, it remains merely to extract a sample for analysis and decision model construction. This step is straightforward. Slide 10 defines precisely what a sample is (a collection of life cycles) and how a sample is extracted from the work order database.

The left column represents 5 work orders. For simplicity we’ll assume that a work order involves a single failure mode called either RCMREF15 or RCMREF16. The technician reported that the failure mode event was a functional failure FF, a potential failure PF, or a discretionary renewal of an unfailed failure mode S (suspension). The sample generation process extracts the work order data into a virtual Events table. Note that each work order generates two event records: an ending event record and a beginning event record. Solid arcs represent life cycles consisting of a beginning and an ending event.

The reliability analysis algorithm, essentially, counts up the arcs occurring within a calendar window, keeping track of which ended by failure (or potential failure) and which ended by suspension.

The dashed arcs have their beginnings or endings outside the calendar window. These are called left and right suspensions, and the algorithm knows how to handle them so as to include their information in the probability calculation.
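The extraction described for Slide 10 can be sketched as follows. This is a simplified illustration under stated assumptions: a single asset, calendar days used as a stand-in for working age, hypothetical record types `WorkOrder` and `LifeCycle`, made-up dates, and dashed arcs simply clipped at the window edges. The source does not describe how the actual algorithm treats left and right suspensions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class WorkOrder:
    closed_on: date
    failure_mode: str   # e.g. "RCMREF15"
    event_type: str     # "FF", "PF" or "S"


@dataclass(frozen=True)
class LifeCycle:
    failure_mode: str
    begin: date
    end: date
    ending: str         # "FF", "PF" or "S"
    arc: str            # "complete" (solid), "left" or "right" (dashed)

    @property
    def days(self) -> int:
        return (self.end - self.begin).days


def extract_sample(work_orders, window_start, window_end):
    """Turn work orders inside the calendar window into life cycles per failure mode."""
    by_mode = {}
    for wo in work_orders:
        if window_start <= wo.closed_on <= window_end:
            by_mode.setdefault(wo.failure_mode, []).append(wo)

    sample = []
    for mode, orders in by_mode.items():
        orders.sort(key=lambda wo: wo.closed_on)
        first, last = orders[0], orders[-1]
        # Dashed arc: the life began before the window opened (left suspension).
        sample.append(LifeCycle(mode, window_start, first.closed_on, first.event_type, "left"))
        # Solid arcs: each work order ends one life and begins the next.
        for prev, cur in zip(orders, orders[1:]):
            sample.append(LifeCycle(mode, prev.closed_on, cur.closed_on, cur.event_type, "complete"))
        # Dashed arc: the life is still running when the window closes (right suspension).
        sample.append(LifeCycle(mode, last.closed_on, window_end, "S", "right"))
    return sample


work_orders = [
    WorkOrder(date(2023, 2, 1), "RCMREF15", "FF"),
    WorkOrder(date(2023, 3, 15), "RCMREF16", "PF"),
    WorkOrder(date(2023, 6, 1), "RCMREF15", "S"),
    WorkOrder(date(2023, 9, 1), "RCMREF16", "FF"),
    WorkOrder(date(2023, 11, 1), "RCMREF15", "PF"),
]
sample = extract_sample(work_orders, date(2023, 1, 1), date(2023, 12, 31))
complete = [lc for lc in sample if lc.arc == "complete"]
ended_by_failure = sum(lc.ending in ("FF", "PF") for lc in complete)
ended_by_suspension = sum(lc.ending == "S" for lc in complete)
```

Five work orders on two failure modes yield three solid arcs and four dashed ones. The two tallies at the end correspond to the counting step described above.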

The meaning of “optimal”

Slide 11 Meaning of “optimal”

Traditionally, when CBM was simple, an obvious potential failure declaration point P could be set. In simple situations, such as the gradually increasing differential pressure across a filter or the gradually decreasing tread depth of a tire, the decision model of Slide 11 could be considered “optimal”.

Most real data and failure probability

Slide 12 Real world data and failure probability

In most practical maintenance situations condition monitoring data does not reflect equipment health as obviously as implied by Slide 11. Data is erratic, subject to multiple stimuli, and therefore requires a probabilistic approach for optimally declaring that an asset is in a potential failure state. Slide 12 shows two graphs: condition data and failure probability versus working age. The vertical double arrow says that there is a relationship between the condition data at a given moment and the probability of failure. The maintenance engineer must find the precise relationship between relevant condition data and failure probability in order to develop and deploy an optimal decision policy.
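One commonly used way to express such a relationship is a Weibull proportional hazards model, in which the failure rate at a given working age is scaled up or down by the current condition readings. The source does not name a model, so the sketch below is offered only as an illustration: the function names, the parameters `beta`, `eta`, and `gammas`, and all numeric values are assumptions, and the condition readings are held constant over the look-ahead horizon for simplicity.

```python
import math


def hazard(age, covariates, beta, eta, gammas):
    """Weibull baseline hazard at `age`, scaled by exp(sum(gamma_i * z_i))."""
    baseline = (beta / eta) * (age / eta) ** (beta - 1.0)
    return baseline * math.exp(sum(g * z for g, z in zip(gammas, covariates)))


def failure_probability(age, horizon, covariates, beta, eta, gammas, steps=200):
    """Probability of failure within `horizon` age units, given survival to `age`.

    Simplifying assumption: the condition readings stay at their current values.
    """
    h = horizon / steps
    # Cumulative hazard over the horizon, midpoint rule.
    cumulative = sum(
        hazard(age + (i + 0.5) * h, covariates, beta, eta, gammas)
        for i in range(steps)
    ) * h
    return 1.0 - math.exp(-cumulative)


# Same working age, two different (made-up) iron readings from an oil sample.
low_iron = failure_probability(2000.0, 300.0, [5.0], beta=2.5, eta=4000.0, gammas=[0.08])
high_iron = failure_probability(2000.0, 300.0, [30.0], beta=2.5, eta=4000.0, gammas=[0.08])
```

At the same working age the higher reading yields the higher short-term failure probability, which is the kind of link suggested by the vertical double arrow on Slide 12.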

The meaning of “optimal policy”

Slide 13 “Optimal policy” meaning

Animated Slide 13 defines “optimal policy” as it relates to one of several objectives. If we seek a policy that results in the lowest combined proactive and reactive maintenance cost, for example, the black curve would apply. The horizontal axis represents a spectrum of policies from extremely conservative (low probability of failure) to extremely liberal (no attempt to predict or preempt failure from condition monitoring data).

Objective 2: Systematic RCM knowledge improvement

Slide 15 Anticipated versus discovered failure modes

Typically, an initial RCM analysis, once it is completed and the PM schedules loaded into the EAM, remains static and largely out of sight. The failure catalogs in the EAM hardly convey the depth of thought and effort that went into the analysis. Despite the intellectual energy expended in the initial RCM project, technicians frequently fail to select, from the work order drop-down lists, the appropriate catalog value representing an RCM-derived failure mode.

2 LRCM procedures and software

Procedures – Continuous improvement in the maintenance process

Slide 16 Maintenance process flow

The well-traveled maintenance process of Slide 16 consists of identifying the work to be done, planning the job, scheduling, and executing the work. In theory most of the tasks will have been programmed in the EAM as output from the initial RCM analysis. In practice, as we know, unscheduled failure events occur that disrupt the ideal work flow. Reliability engineers and analysts attempt to improve upon the maintenance plan so as to reduce the frequency and severity of failure, but they lack a systematic process for improving the RCM knowledge base. LRCM provides the missing link, inserted between the Execute and Analysis nodes. LRCM ensures that the knowledge base improves continuously and that the work order data to be analyzed is accurate and consistent.

Procedures – LRCM scheduled activities

Slide 17 Daily, weekly, and periodic LRCM activities

Three groups of activities, performed daily, weekly, and periodically, constitute an LRCM process. The work order data entry form is at the heart of daily LRCM practice. Technicians use the LRCM work order completion form to perform two essential activities:

  1. To convey the failure mode and its ending event type accurately to the EAM work order database, and
  2. To suggest improvements and feed back any discrepancies between the RCM knowledge base and the technician’s direct observations made while renewing failure modes during execution of the work order.

On a weekly schedule the reliability engineer processes the suggestions from the technicians and accepts, edits, or rejects the feedback. This is the mechanism for incremental continuous RCM knowledge improvement. A second weekly activity would be to review the performance of decision models executed during the week.

Periodically new equipment will be added to the physical asset hierarchy and an initial RCM analysis will be performed.

Software – MESH LRCM Modules

Knowledge Builder

Slide 20 Knowledge Builder

This module is the MESH RCM knowledge depository. Among its many innovations is the ability to embed images into the failure mode and effects description fields. This feature was added primarily to allow the maintenance organization to clearly define what “failure” means within the context of each failure mode. When the failure versus suspension decision is not clear, the image capability will allow organizational standards to evolve through discussion and consensus.

Knowledge Builder also includes a configurable criticality (risk) matrix associated with each failure mode as well as other features designed to support accurate work order reporting.

Work order data entry

Slide 21 Work order data entry module

The work order data entry module is the central innovation of MESH LRCM. For the first time, maintenance and reliability engineers can be confident that analyzable-grade data will be returned via the work order to the EAM. The MESH work order data entry interface provides two key functions:

  1. The ability to select the failure mode and its event type (failure, potential failure, or suspension) directly from the RCM tree view of the knowledge base.
  2. The ability to assess and feed back, dynamically, any discrepancy between the RCM knowledge and observed reality regarding the failure modes, their effects, consequences and mitigating tasks.

Knowledge feedback

Slide 22 Feeding back knowledge from the field or shop

Gaps easily develop between the initial RCM knowledge base, as reflected in the EAM failure catalog values, and reality as it is observed day-to-day by technicians troubleshooting and executing work order tasks. Technicians must have a way to report such divergences and to suggest how to correct or refine the knowledge base. Slide 22 illustrates the MESH feedback function.

Feedback Manager module

Slide 23 LRCM feedback manager module

The RCM knowledge base is a valuable intellectual asset. Its function is to represent the collective knowledge justifying the current maintenance plan. Knowledge, by definition, is imperfect, incomplete, or inaccurate and must be continuously improved in order that it may fulfill its role. The Feedback Manager interface allows the reliability engineer responsible for maintaining the RCM knowledge base to assess, acknowledge, edit, accept, or reject suggestions fed back by the technician through the LRCM work order module.

Knowledge Trail Module

Slide 25 RCM knowledge trail

When a serious failure strikes, especially one whose consequences relate to health, safety, or the environment (HSE), we need to discover what went wrong in our maintenance plan. What did we know about the failure mode and when did we know it? What compensating provisions or mitigating tasks were in place? Were they adequate? MESH can scroll backward and forward in time and display the state of knowledge at any previous point, so as to determine whether the failure mode was well understood, whether its consequences were mitigated, and what further should be done.
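The idea of looking up the state of knowledge at a past date can be sketched as an “as of” query over dated revisions. This is a hypothetical illustration, not a description of how MESH stores its knowledge trail: the `Revision` record, its fields, the function `state_as_of`, and the dates and texts are all invented for the example.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Revision:
    effective: date        # date from which this version of the record applies
    failure_mode: str
    effects: str
    mitigating_task: str


def state_as_of(revisions, failure_mode, when):
    """Latest revision of `failure_mode` effective on or before `when`, or None."""
    history = sorted(
        (r for r in revisions if r.failure_mode == failure_mode),
        key=lambda r: r.effective,
    )
    idx = bisect_right([r.effective for r in history], when)
    return history[idx - 1] if idx else None


revisions = [
    Revision(date(2021, 3, 1), "RCMREF15",
             "There is a noise coming from the valve block", "none"),
    Revision(date(2022, 6, 15), "RCMREF15",
             "There may be a noise coming from the valve block",
             "thermography inspection"),
]

before = state_as_of(revisions, "RCMREF15", date(2022, 1, 1))
after = state_as_of(revisions, "RCMREF15", date(2023, 1, 1))
unknown = state_as_of(revisions, "RCMREF15", date(2020, 1, 1))
```

Comparing `before` and `after` answers the question of what was known, and what mitigating task was in place, on either side of a given date. A result of `None` means the failure mode had not yet been documented.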

RCM EAM synchronization module

Slide 26 RCM EAM synchronization module

As new and updated knowledge elements accrue in the RCM knowledge base, it is essential that the catalog values continue to reflect reality and the latest understanding of failure behavior. Otherwise reliability analysis will fail to keep pace with the moving target that maintenance tends to be. Periodically, the reliability analyst should run the synchronization module to determine what updates should be made in the EAM failure catalogs.
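Conceptually, a synchronization run is a comparison between two sets of failure mode entries. The sketch below assumes, purely for illustration, that both the knowledge base and the EAM catalog can be exported as mappings from a code to a description. The function name `catalog_updates`, the codes, and the descriptions are hypothetical and do not reflect the actual MESH or EAM interfaces.

```python
def catalog_updates(knowledge_base, eam_catalog):
    """Compare knowledge-base failure modes with EAM catalog values.

    Both arguments map a failure mode code to its description.
    Returns (to_add, to_update, to_retire).
    """
    to_add = {code: text for code, text in knowledge_base.items()
              if code not in eam_catalog}
    to_update = {code: text for code, text in knowledge_base.items()
                 if code in eam_catalog and eam_catalog[code] != text}
    to_retire = sorted(code for code in eam_catalog if code not in knowledge_base)
    return to_add, to_update, to_retire


# Made-up entries for illustration.
knowledge_base = {
    "RCMREF15": "Valve block internal chamber worn",
    "RCMREF16": "Fuel hose worn",
    "RCMREF17": "Relief valve spring broken",
}
eam_catalog = {
    "RCMREF15": "Valve block internal chamber worn",
    "RCMREF16": "Fuel hose leaking",
    "OLD01": "Superseded catalog value",
}

to_add, to_update, to_retire = catalog_updates(knowledge_base, eam_catalog)
```

The three results give the analyst a review list: entries missing from the catalog, entries whose wording has drifted from the knowledge base, and catalog values that no longer correspond to any analyzed failure mode.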

Sample generation and reporting

The sample generation and reporting system provides analyzable data from which conclusions and continuous improvement flow. Documented and verifiable decision models connect the low and high level KPIs.

3 LRCM Workshop

Work order data quality

Exercises

There are 4 groups and 3 simple exercises to be completed in this segment. The equipment is a Hitachi hydraulic shovel. The exercises are:

Exercise 1:  report a functional failure

Scenario

Slide 33 Completing the work order - failure

Maintenance receives a call from the operator saying that:

  1. The shovel arm does not lift …

The technician arrives and observes that

  1. The frontal attachment does not respond to the command
  2. There is a communication error between the control lever and the lift control panel

Please complete the data entry for this work order.

Exercise 2:  report a potential failure

Scenario

Slide 34 Hydraulic shovel potential failure

A team of technicians, during the pre-300 h visual inspection, found that a fuel hose system, while not leaking, had significant wear. They knew that if it were not replaced it would fail imminently, so they decided to change the hose. Please complete the work order.

Exercise 3:  report a suspension

Scenario

Slide 35 Suspended failure mode instance

The same team of technicians was doing another preventive visual inspection at 300 h. This time they found the fuel hose system with little wear (they judged that the hose could last for at least another year). Nevertheless, given its relatively low cost and high vulnerability to stone impact, they decided to change the hose. Please complete the work order.

RCM Knowledge continuous improvement

Technician – Exercise 4:  knowledge feedback – suggesting an undocumented failure mode

Scenario

The operator reports that air conditioning does not cool the cabin at all. There is a noise coming from the pump compartment. The operator decides to stop the equipment. The technician arrives and finds that the AC Hydraulic motor is running but not delivering fluid to the AC motor because the spring on the relief valve is broken.

However the technician finds no RCM failure mode leaf in the knowledge tree representing this condition. Please use the knowledge feedback system to propose a new failure mode, its effects, and consequences. Use the justification area to propose a mitigating task and frequency.

Technician – Exercise 5:  knowledge feedback – suggesting a change to the effects narrative

Scenario

Slide 37 Knowledge feedback to the Effects analysis

The operator reported “Slow boom”. The technician located the failure mode “Valve block Internal chamber worn” in the RCM knowledge tree. However, he noticed some important details regarding the effects of failure that were not included in Effects. The effects narrative mentions that “there is a noise coming from the valve block when this failure mode occurs”. The technician would like to change “there is a noise” to “there may be a noise”. He would also like to add that thermography will detect a temperature difference greater than 10 °C between chambers, which would indicate that this failure mode has taken place.

Please use the feedback system to make the necessary changes in the Effects description. Use the “Justification” text area to recommend a thermography mitigating action at 3000 hour intervals, and to recommend that the main valve be changed if a difference greater than 10 °C between any of the chambers is found.

Processing knowledge feedback

See articles:

  1. MESH – RCM knowledge continuous improvement
  2. MESH Basic reliability analysis on the work order
  3. RCM – feedback suggestion mechanism
  4. RCM – feedback – suggesting a new failure mode

4 Empowerment and recognition

See articles:

  1. RCM – Dashboards
  2. RCM – LRCM dashboards