HardSoft Industrial: The 7 Questions of RCM

By Carlos Cristancho, Spec. in Industrial Instrumentation and Control

Introduction

In the oil and gas industry, the reliability of instrumentation equipment and industrial control systems is fundamental to ensuring operational continuity, process safety, and the profitability of operations. Maintaining these assets requires a strategic and structured approach, especially in a sector where the margin for error is minimal and the consequences of failure can be catastrophic. In this context, Reliability-Centered Maintenance (RCM) becomes an essential tool.

The SAE JA1011 standard establishes clear and specific criteria for evaluating RCM processes, proposing a set of “7 questions” that guide the implementation of this methodology. These questions allow maintenance and reliability managers to effectively identify and manage failure risks, establishing preventive and predictive maintenance strategies that optimize resources and maximize the lifespan of assets.

The SAE JA1012 standard provides a detailed explanation of the essential criteria guiding the answers to the “7 questions”, defining the necessary requirements for evaluating RCM processes and identifying additional elements that must be considered to ensure effective implementation. This offers a comprehensive framework for those who wish to rigorously and successfully apply the RCM methodology.

In this article, I aim to explore each of the “7 RCM questions” from the perspective of instrumentation and control systems applied to the oil and gas production industry, a field in which I have developed my professional career. Throughout this analysis, I will share examples and reflections based on my direct experience in designing and implementing maintenance plans. This practical approach will enable readers to visualize how the principles of the SAE JA1011 standard, in conjunction with the SAE JA1012 standard, can be applied in critical industrial environments, helping to prevent unplanned downtime, optimize safety, and reduce operational costs.

The intention of this article is to provide a structured and practical guide that enables professionals in the sector to develop robust maintenance plans tailored to the specific needs of the oil and gas industry. By applying RCM principles, maintenance teams can significantly improve the reliability and performance of assets, ensuring safer and more efficient operations.

What are the desired functions and associated performance standards of the asset in its current operational context (functions)?

To implement Reliability-Centered Maintenance (RCM) on instrumentation equipment and industrial control systems within the oil and gas sector, it is essential to clearly understand the desired functions of each asset and the performance standards it must meet in its operational context. In this industry, these systems play a critical role in ensuring that processes remain stable, efficient, and, above all, safe. Any deviation in their operating parameters can jeopardize operations and lead to significant financial losses.

The first step in developing any maintenance plan, which aligns with section 5.1.1 of the SAE JA1011 standard, is a demanding and highly responsible task. Defining the operational context of each asset requires searching, collecting, and connecting the documentation related to the project's design and construction, particularly for new equipment or projects. For assets or projects already in operation, this process also necessitates access to operational history.

Based on my professional experience, the documentation I have used as input for defining the operational context applicable to instrumentation and control systems is illustrated in Figure 1. This initial step also requires effective, multidisciplinary communication with the organizational departments involved in project development or operation, in conjunction with the application of international, national, regional, and company-specific standards.

This is because the functions and failure management strategies of an asset are influenced not only by the asset itself but also by the context in which it operates. Therefore, a clear definition of the operational context is essential before addressing questions related to asset performance and reliability.

Declaring an operational context requires a description of how the instrumentation and control systems are intended to be used in a specific project, the geographical and environmental conditions in which they will operate, and the applicable standards for production, performance, safety, and environmental integrity.

In Figure 1, it can also be observed that the output of the operational context declaration is a document that identifies whether the instrumentation and/or control system will operate in a batch (intermittent) or flow (continuous) process, the expectations for product quality and customer service, including metrics such as waste rates and customer satisfaction, and compliance with relevant environmental regulations at organizational, regional, national, and international levels.

Additionally, the document declaring the operational context must identify established safety expectations, including injury and fatality rates, the characteristics of the operational environment (e.g., arctic or tropical), the operational schedule (e.g., continuous or intermittent), and load conditions (maximum load or base load), as well as the availability of backup systems or capacities.

Figure 1. Scheme based on the operational context criteria and guidelines described in SAE JA1011 and SAE JA1012 standards.

It should also identify the extent to which work in progress can absorb equipment downtime without affecting total production, decisions on the storage of critical spare parts that may influence failure management strategies, and consideration of market demand fluctuations and raw material availability that could impact operations.
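
The items listed above can be tracked as a simple checklist while the operational context is being drafted. The following is a minimal sketch, not a format prescribed by SAE JA1011 or SAE JA1012: the class name, field names, and the helper `missing_items` are hypothetical and only mirror some of the items mentioned in this section.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class OperationalContext:
    """Hypothetical record of items an operational context should declare."""
    process_type: Optional[str] = None            # "batch" or "continuous"
    quality_expectations: Optional[str] = None
    environmental_regulations: Optional[str] = None
    safety_expectations: Optional[str] = None
    operating_environment: Optional[str] = None   # e.g. "arctic", "tropical"
    operating_schedule: Optional[str] = None
    load_conditions: Optional[str] = None
    backup_availability: Optional[str] = None
    spare_parts_policy: Optional[str] = None


def missing_items(context: OperationalContext) -> List[str]:
    """Return the names of the items that have not been declared yet."""
    return [f.name for f in fields(context) if getattr(context, f.name) is None]


# Illustrative draft with only two items filled in.
draft = OperationalContext(process_type="continuous", operating_environment="tropical")
print(missing_items(draft))
```

A check like this only confirms that each item has been addressed; it says nothing about whether the content of each item is adequate, which still depends on the multidisciplinary review described earlier.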

For complex systems, it may be beneficial to organize the operational context hierarchically, beginning with the general mission statement of the organization that uses the asset.

Instrumentation assets in oil and gas fulfill desired or required functions such as accurately measuring variables (pressure, temperature, flow, etc.), continuously monitoring processes, and generating control signals that enable real-time automatic adjustments. On the other hand, control systems (such as DCS or PLCs) interpret signals from instruments and adjust operations to maintain conditions within safe and operational limits.

According to section 5.1.2 of the SAE JA1011 standard, the functions to be identified are classified into primary and secondary functions. Primary functions are the fundamental reasons an organization acquires an asset or system. These functions represent the core purpose of the asset or system and are essential to achieving the organization’s objectives.

For example, consider a transmitter that measures level by differential pressure. The LIT-0715012 asset fulfills the following primary functions:

Measuring the level in the Clarifier Tank CLFT-071692.
Transmitting the control signal to the PLCINS-018270 controller located in the Control Cabinet PLCSTAP-018270.
Activating interlock I3 for high level in the Clarifier (H = 112.58 inH2O), triggering the closure of valve FCV-071910 to prevent production water from entering the Clarifier from the FWKO Tanks (TK-071011/2) and the Washing Tanks (TK-071293/4).
Activating interlock I8 for low level in the Clarifier (L = 104.59 inH2O), triggering the closure of valve XV-0716021 to prevent sludge from leaving the Clarifier toward the distribution header feeding the Sludge Pumps AP-071601A/B.

According to section 6.2.1 of the SAE JA1012 standard, this means the transmitter must:

Measure level: The primary function is to measure water level in the clarifier tank, which is the central purpose of having a level transmitter.
Measurement capacity: The transmitter must measure levels via differential pressure within a range of 104.59 inH2O to 112.58 inH2O, implying appropriate design and mounting space.
Alarm activation: The references to "104.59 inH2O" and "112.58 inH2O" establish performance standards in terms of differential pressure, which are critical for the transmitter’s functionality in level measurement.
Usage conditions: The reference to the "Clarifier Tank" indicates that the transmitter is designed to operate in a specific environment, which is also part of its primary function.
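
The two interlock setpoints in this example can be expressed as a small decision function. This is a minimal sketch for illustration only: the function name is hypothetical, the inclusive comparisons are an assumption, and real interlock logic resides in the PLC and would normally include details (such as deadbands or time delays) that are not described here.

```python
from typing import List

HIGH_LEVEL_INH2O = 112.58  # interlock I3 setpoint, taken from the example
LOW_LEVEL_INH2O = 104.59   # interlock I8 setpoint, taken from the example


def interlock_actions(level_inh2o: float) -> List[str]:
    """Return the valve actions requested for a given level reading.

    Assumption: a reading equal to a setpoint already trips the interlock.
    """
    actions = []
    if level_inh2o >= HIGH_LEVEL_INH2O:
        actions.append("I3: close FCV-071910")
    if level_inh2o <= LOW_LEVEL_INH2O:
        actions.append("I8: close XV-0716021")
    return actions


for reading in (113.0, 108.0, 104.0):
    print(reading, interlock_actions(reading))
```

Writing the function this way makes the performance standards explicit: the two thresholds are the quantified part of the function statement, and any reading between them requests no action.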

Identifying primary functions is crucial for several reasons:

Performance Focus: Enables the organization to focus on the asset's performance to meet objectives.
Maintenance Strategy Development: Understanding primary functions helps establish maintenance policies to ensure the asset continues to fulfill its purpose.
Risk Assessment: Understanding primary functions allows for identifying risks associated with their failure and developing mitigation strategies.
Investment Justification: Demonstrates how the asset contributes to organizational objectives, justifying its acquisition.

Secondary functions, according to the SAE JA1012 standard, are additional functions performed by the asset. These are generally less apparent than primary functions but, depending on the asset's operational context, the loss of one or more secondary functions can significantly impact overall performance and operability, sometimes even more so than the loss of primary functions.

For the differential pressure level transmitter, secondary functions include containing production fluids and auxiliary liquids in the process connection. Secondary functions require as much attention as primary ones since they can influence safety, efficiency, and user satisfaction. It is crucial to ensure they are clearly identified.

Another example involves a control cabinet. Its secondary functions include supplying electrical power, cooling, connectivity, structural support, and physical space for the reliable and safe operation of the equipment housed within. It also protects internal devices from mechanical risks (impacts, vibration, breakage, abrasion, etc.), physical risks (humidity, solar radiation, excessive temperature, etc.), among other risks, allowing the controller and its peripherals to operate reliably and safely under specified environmental conditions.

When identifying secondary functions, the following aspects should be considered, according to the SAE JA1012 standard:

Environmental Integrity: Ensuring the asset complies with environmental regulations and does not harm the environment.
Structural/Safety Integrity: Evaluating whether the asset maintains its structure and safety during operation.
Control/Containment/Comfort: Considering the asset's ability to regulate its performance and provide a safe and comfortable environment.
Appearance: Considering the asset's appearance and visual signaling, particularly in environments where visibility and readability are necessary.
Protection Devices and Systems: Identifying functions that protect the asset and users from failures or abnormal conditions.
Economy/Efficiency: Assessing how the asset contributes to operational efficiency and cost control.
Superfluous Components: Recognizing non-value-adding components that could be removed to improve efficiency.

It is important to emphasize that, according to section 5.1.2 of the SAE JA1011 standard, the identification of asset functions must include the functions of all protection devices.

Figure 2. Functions of protection devices in the framework of the SAE JA1011 standard.

As an example, in defining the primary and secondary functions of the pressure control valve PCV-6712919, shown in the red box in Figure 2, the functions of the pressure safety relief valves PSV-6712912A/B must be included because they are the protection devices of the three-phase separator SEP-671292T. These valves must open in scenarios of overpressure associated with failures in the pressure control valve in the gas-to-flare line.

According to section 5.1.3 of the SAE JA1011 standard, the definition of each function must include a verb, an object, and a performance standard, which should be quantified whenever possible. This ensures that expectations regarding the asset's operation are clear and measurable.

In the example in Figure 2, one of the primary functions of the pressure control valve is to open proportionally to the gas pressure in the three-phase separator. A secondary function is to contain the gas in the mechanical connection with the pipe. One of the functions of the protection devices is to open instantaneously if the separator pressure reaches 95 psig to relieve overpressure caused by the loss of the primary function of the gas pressure control valve in the separator.

In the above example, the structure for defining functions, as outlined in the SAE JA1012 standard, is applied:

Verb: Indicates the action the asset must perform. For example, in the case of the proportional pressure control valve, the verb is “open”.
Object: The element on which the action is performed. For the valve, the object is “gas”.
Performance Standard: Defines the conditions under which the function must be performed, including specific metrics. For the valve, the performance standard is “proportional to the gas pressure in the three-phase separator”.
Protection Functions: Must include the word “if” or the phrase “in case of”, followed by a brief summary of the circumstances that will trigger the protection function, as seen in the example of the safety relief valves mentioned in Figure 2.
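
The verb, object, performance standard, and "if" clause described above can be captured in a small record so that every function statement follows the same pattern. The sketch below is illustrative: the class `FunctionStatement` and its field names are hypothetical, and the example statements are paraphrased from the Figure 2 example rather than quoted from either standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FunctionStatement:
    """One asset function written as verb + object + performance standard."""
    verb: str
    obj: str
    performance_standard: str
    trigger: Optional[str] = None  # only set for protection functions

    def render(self) -> str:
        text = f"To {self.verb} {self.obj} {self.performance_standard}"
        if self.trigger is not None:
            # Protection functions carry an explicit "if ..." clause.
            text += f" if {self.trigger}"
        return text + "."


secondary = FunctionStatement(
    verb="contain",
    obj="the gas",
    performance_standard="in the mechanical connection with the pipe",
)
protection = FunctionStatement(
    verb="relieve",
    obj="overpressure in the separator SEP-671292T",
    performance_standard="by opening instantaneously",
    trigger="the separator pressure reaches 95 psig",
)
print(secondary.render())
print(protection.render())
```

Keeping the trigger as a separate optional field makes it easy to spot protection functions that were written without the circumstances that activate them.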

In my experience, accurately identifying primary, secondary, and protection device functions not only enables the design of more efficient maintenance plans but also helps prioritize resources for those assets whose failure could negatively impact process safety or lead to unplanned shutdowns.

The functions defined for the assets in the above examples also incorporate performance standards, which are crucial to represent the level of performance desired by the client (owner or operator) within the operational context of the asset, as indicated in section 5.1.4 of the SAE JA1011 standard.

This includes defining acceptable operational thresholds, measurement or response frequencies, and required precision levels. These standards are typically derived from both manufacturer specifications and internal industry regulations. In my experience, having these criteria clearly defined makes it more effective to identify when an asset begins to deviate from normal behavior, facilitating timely intervention before a failure occurs.

An asset is considered failed when its performance drops below the minimum acceptable level defined by the client. Conversely, if the asset maintains performance above this threshold, it is considered to be operating satisfactorily. Users include not only owners but also operators and society, which expects assets to operate safely and without causing environmental harm.

Performance can be evaluated from two perspectives: desired performance (what users expect) and inherent capability (what the asset can actually achieve). When commissioning an asset, its initial capacity must exceed the minimum performance standard, as shown in the right-hand graph in Figure 3. This ensures a margin for deterioration before the asset reaches functional failure.

The design of an asset must consider deterioration, allowing for a reasonable performance margin. However, this margin should not be excessive, as it could result in over-design and increased costs. If the desired performance exceeds the built-in capacity, the asset may become unmaintainable, meaning maintenance efforts will not be sufficient to meet user expectations, as illustrated in Figure 3.
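
The relationship between initial capability and desired performance can be checked with simple arithmetic. The following sketch assumes a performance measure where higher values are better and both quantities share the same units; the function names and the numbers are hypothetical and do not come from the standards.

```python
def margin_for_deterioration(initial_capability: float, desired_performance: float) -> float:
    """Margin available before functional failure, in the units of the inputs."""
    return initial_capability - desired_performance


def is_maintainable(initial_capability: float, desired_performance: float) -> bool:
    """Maintenance can restore performance only up to the initial capability,
    so the initial capability must exceed the desired performance."""
    return initial_capability > desired_performance


# Hypothetical values in arbitrary performance units.
print(margin_for_deterioration(100.0, 80.0))  # positive margin
print(is_maintainable(100.0, 80.0))           # desired performance is achievable
print(is_maintainable(70.0, 80.0))            # desired performance exceeds capability
```

A negative margin corresponds to the unmaintainable case described above: no maintenance task can raise performance beyond what the asset was built to deliver.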

For this reason, it is essential to quantify performance standards whenever possible, as this ensures accuracy and precision. In situations where qualitative standards are required (e.g., for aesthetic functions), it is crucial to ensure these standards are clearly understood by all stakeholders involved.

Figure 3. Relationship between Initial Capacity, Margin for Deterioration, and Desired Performance in Industrial Assets within the Framework of SAE JA1012.

Throughout my career, I have learned that understanding functions and performance standards is not just a theoretical exercise but involves observing how each asset responds within its operational context. Instrumentation and control systems, in particular, are exposed to harsh environments with variations in temperature, pressure, vibration, and corrosion, which affect their reliability over time. A good practice is to periodically review and adjust performance standards to align with current operating conditions, especially if these change due to facility expansion or modification.

Additionally, I have found that involving operators and technicians in the process of defining functions and standards helps capture specific details about asset behavior in operation. Operators, who are in direct contact with the systems, often provide valuable insights into the sensitivity or response of certain instruments under particular conditions. This knowledge is a significant advantage for the maintenance team, enabling them to anticipate potential deviations and establish better intervention strategies.

How can assets fail to perform their functions (functional failures)?

As we have seen, deterioration refers to the gradual decline in performance over time. On the one hand, an asset may degrade below its initial capacity while still functioning adequately for the user. On the other hand, functional failure occurs only when performance drops below the minimum acceptable level defined by the user.

Performance standards for certain functions often include upper and lower limits. Functional failure occurs if an asset operates outside these limits. For example, a Coriolis flow transmitter must measure within a range with a specific accuracy for a defined period. Deviating beyond the tolerated precision constitutes a failure, and the reasons for these failures can vary.

Therefore, functional failure occurs when an asset cannot perform its designated functions according to the required performance standards. Each asset typically has multiple functions, and any of these can fail independently. For example, a Coriolis flow transmitter may measure flow within the required range and accuracy (primary function) but still leak (secondary function), indicating a failure in one function while succeeding in another.
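
The idea that each function has its own acceptance limits, and can fail independently of the others, can be sketched as follows. The class `PerformanceBand`, the function `failed_functions`, and all limits and readings are hypothetical values chosen for illustration; they are not specifications for any real Coriolis transmitter.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PerformanceBand:
    """Lower and upper acceptance limits for one function."""
    lower: float
    upper: float

    def contains(self, value: float) -> bool:
        return self.lower <= value <= self.upper


def failed_functions(bands: Dict[str, PerformanceBand],
                     observations: Dict[str, float]) -> List[str]:
    """Return the functions whose observed value lies outside its band."""
    return [name for name, band in bands.items()
            if not band.contains(observations[name])]


# Hypothetical limits: measurement error within +/-0.5 %, and no leakage allowed.
bands = {
    "measure flow (error, % of reading)": PerformanceBand(-0.5, 0.5),
    "contain process fluid (leak rate, mL/min)": PerformanceBand(0.0, 0.0),
}
observations = {
    "measure flow (error, % of reading)": 0.2,
    "contain process fluid (leak rate, mL/min)": 3.0,
}
print(failed_functions(bands, observations))
```

In this illustrative case the primary function is within limits while the containment function is not, which matches the situation described in the paragraph above.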

In the context of Reliability-Centered Maintenance (RCM) as described in the SAE JA1012 standard, functional failures refer to the various ways an asset can fail to perform its defined functions. RCM analysis emphasizes identifying all potential failure states associated with each function of an asset, as a comprehensive understanding of these failures is critical for effective maintenance and risk management.

An asset may fail completely, as when a controller sends erratic signals to the actuators and process control is lost. Alternatively, it may fail partially, performing its function inadequately, as when a transmitter with a local indicator does not display readings but still sends the required control signal. Identifying partial failures is essential because they often arise from different causes than total failures and have distinct maintenance implications.

To apply section 5.2 of the SAE JA1011 standard in the maintenance plans I have developed for instrumentation and control systems operating in the oil and gas production industry, I have treated each function of an asset as a logical proposition. By negating each logical proposition, I can determine the failure states associated with each function.

Take, for example, a temperature switch or thermostat. The asset TSH-071601 has the following primary functions:

Detect if the temperature in the stator of Mud Pump P-071601G exceeds the high-temperature setpoint (H = 128°F).
Transmit the signal to the PLC-12341 controller, housed in cabinet PLCINS-12341.
Activate interlock I17 to shut down the pump motor in case of high temperature.

The failure states for these functions, derived by negating the logical propositions, are:

Does not detect if the temperature in the stator of Mud Pump P-071601G exceeds the high-temperature setpoint (H = 128°F).
Does not transmit the signal to the PLC-12341 controller, housed in Cabinet PLCINS-12341.
Does not activate interlock I17 to shut down the pump motor in case of high temperature.
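
The negation step used above can be sketched in a few lines, assuming each function statement is written in the imperative form and starts with its verb. The helper `negate` is hypothetical and deliberately simple: it only produces the total-loss failure state, so partial failures (for example, a function performed outside its limits) still have to be written by the analyst.

```python
def negate(function_statement: str) -> str:
    """Turn 'Verb rest...' into 'Does not verb rest...'."""
    verb, _, rest = function_statement.strip().partition(" ")
    return f"Does not {verb.lower()} {rest}".strip()


functions = [
    "Transmit the signal to the PLC-12341 controller.",
    "Activate interlock I17 to shut down the pump motor in case of high temperature.",
]
for statement in functions:
    print(negate(statement))
```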

Another example can be cited regarding the secondary functions of a PLC. Table 1 shows the secondary functions for the PLC-12341 asset and the corresponding failure states associated with each function.

Secondary Function	Functional Failure
Optimize electrical energy consumption to contribute to environmental protection.	Does not optimize electrical energy consumption to contribute to environmental protection.
Provide data to evaluate compliance with environmental regulations.	Does not provide data to evaluate compliance with environmental regulations.
Implement access control and security measures to restrict access to sensitive system areas.	Does not implement access control and security measures to restrict access to sensitive system areas.
Prevent unauthorized manipulations that could endanger people's safety.	Does not prevent unauthorized manipulations that could endanger people's safety.

Table 1. Secondary functions and functional failures of a PLC.

What causes each functional failure (failure modes)?

In Reliability-Centered Maintenance (RCM), it is essential to distinguish between a functional failure, which is the loss of the ability to perform a function, and a failure mode, which is the specific event or condition that causes the functional failure. The process of identifying failure modes is one of the central aspects of RCM analysis, as described in section 5.3.1 of the SAE JA1011 standard.

Therefore, according to section 8.1 of the SAE JA1012 standard, identifying failure modes (the causes of functional failures) first requires determining the functional failures of each asset within the scope of the maintenance plan. This, in turn, requires previously defining the functions that the assets must fulfill in their operational context.

The identification process results in a table that links asset functions, the corresponding functional failures, and the associated failure modes, as summarized in the example in Table 2 for a control system.

Asset: Centrifugal Turbocharger Control System TC-93078
Function	Functional Failure	Failure Modes
Control the start-up, operation, and safe shutdown of the TC-93078.	Does not control the oil pressure in the distribution header to the turbine T1-TU-93078 and the centrifugal compressor C1-CO-93078.	The pressure transmitter sends incorrect readings. The controller generates an erratic output. The controller's input channel fails intermittently.
	Does not control the fuel gas system valve.	The valve gets stuck when opening. The valve position sensor sends incorrect signals. The controller's output channel fails intermittently.

Table 2. Failure Modes in the context of the SAE JA1012 standard.
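
The hierarchy in Table 2 (one function, several functional failures, several failure modes per functional failure) maps naturally onto a nested structure. The sketch below is one possible representation, not a format required by the standards; the variable and function names are hypothetical, and the first functional failure is shortened for readability.

```python
from typing import Dict, List, Tuple

# function -> functional failure -> failure modes (content taken from Table 2)
table_2: Dict[str, Dict[str, List[str]]] = {
    "Control the start-up, operation, and safe shutdown of the TC-93078.": {
        "Does not control the oil pressure in the distribution header.": [
            "The pressure transmitter sends incorrect readings.",
            "The controller generates an erratic output.",
            "The controller's input channel fails intermittently.",
        ],
        "Does not control the fuel gas system valve.": [
            "The valve gets stuck when opening.",
            "The valve position sensor sends incorrect signals.",
            "The controller's output channel fails intermittently.",
        ],
    }
}


def flatten(table: Dict[str, Dict[str, List[str]]]) -> List[Tuple[str, str, str]]:
    """Produce one (function, functional failure, failure mode) row per mode."""
    rows = []
    for function, failures in table.items():
        for functional_failure, modes in failures.items():
            for mode in modes:
                rows.append((function, functional_failure, mode))
    return rows


rows = flatten(table_2)
print(len(rows), "failure modes")
```

Flattening to one row per failure mode is convenient later in the analysis, because effects and failure management policies are assigned to individual failure modes.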

As established in section 8.1 of the SAE JA1012 standard, each failure mode must be described using a noun and a verb, ensuring that the description is detailed enough to facilitate the selection of appropriate failure management policies. However, it should not be so detailed as to waste time during the analysis process.

In failure mode analysis for instrumentation and control systems, it is essential to specify the nature of each possible failure. For instance, instead of merely stating that a pressure sensor "fails," it is more helpful to specify "pressure sensor blocked by obstruction," which provides key information about the underlying cause. This precision helps direct the necessary maintenance measures. Thus, if the sensor fails due to "paraffin buildup in the process connection line," specific actions should be taken to prevent such accumulation.

It is important to identify all failure modes that are "reasonably probable" and could affect the functionality of instrumentation and control systems. Evaluating this probability depends on the judgment of trained personnel with experience in oil and gas operations and must be accepted by the client.

Identifying too few failure modes could lead to critical omissions, while including too many may overcomplicate the analysis. Finding the right balance is crucial for effective maintenance management.

Additionally, the likelihood of some failure modes depends directly on the operational context. For example, a potential failure mode like "obstruction by sediment" is more likely in heavy crude oil production wells than in gas lines. This highlights the importance of considering specific operating conditions when evaluating failure modes, as the context influences the selection of appropriate maintenance strategies.

Failure mode identification should be conducted at a level of detail sufficient to enable the selection of effective maintenance policies, without delving excessively into causality. As details are explored further, the number of potential causes can increase, but analyzing every possible root cause is not always practical. Instead, the focus should be on identifying failure modes at a practical level that adds value to risk management, concentrating on those that are manageable and controllable within the industrial context.

This is because the concept of root cause suggests that the causes of a failure mode can be explored indefinitely. However, such exploration can become unnecessary and often unmanageable by the organization. For example, a failure mode like "vibration transmitter sends inconsistent signals" could be analyzed to a detailed level, identifying causes such as "loose connection" or "installation error." Beyond a certain point, such as the personal conditions of the technician, the cause falls outside the practical scope of operational management.

Section 5.3.3 of the SAE JA1011 standard establishes that failure modes must be identified at a level of causality that allows for the definition of practical management policies. This level can vary depending on the case: some modes can be addressed at basic levels, while others require more detailed analysis.

Moreover, the operational context is crucial in determining which failure modes are reasonably probable. For instance, "erratic readings from a flow transmitter due to residue buildup" may be relevant in a heavy crude oil transfer system but not in a dry gas line where obstruction is less likely due to the fluid's characteristics.

Failure modes in the industry can be classified into various categories, such as corrosion deterioration, design defects in control systems, and human errors during equipment calibration or installation. For example, deterioration can reduce instrument accuracy over time, while human errors in configuring control systems can directly affect operations.

FAILURE MODE	Input Devices	Control Logic Units	Valves
Abnormal instrument reading			X
Delayed operation			X
External leakage - process medium	X		X
External leakage - utility medium	X		X
Erratic output	X	X
Failure to close on demand			X
Failure to function on demand	X	X
Failure to open on demand			X
High output	X	X	X
Internal leakage			X
Leakage in closed position			X
Low output	X	X	X
Noise			X
No output	X
Other	X		X
Plugged / Choked			X
Minor in-service problems	X	X	X
Spurious operation	X	X	X
Structural deficiency			X
Unknown	X	X	X
Vibration			X

Table 3. Adaptation of Table B.9 from ISO 14224:2016.

Identifying these types of failures is crucial for developing a maintenance plan that addresses all possible causes and minimizes associated risks. In this context, Table 3 presents an adaptation of the failure mode list defined in ISO 14224:2016, which should not be interpreted as an official version of the standard or as conforming to it.

Its purpose is to illustrate the analysis within the framework of the article, highlighting the importance of considering multiple categories of failure modes when planning maintenance strategies.
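
A table such as Table 3 can serve as a starting checklist when listing candidate failure modes for a given equipment class. The sketch below encodes only a few rows of Table 3 as they appear in this article; it is not a transcription of ISO 14224:2016, and the names `FAILURE_MODES` and `candidate_modes` are hypothetical.

```python
from typing import Dict, List, Set

# Subset of Table 3: failure mode -> equipment classes marked with an X.
FAILURE_MODES: Dict[str, Set[str]] = {
    "Erratic output": {"Input Devices", "Control Logic Units"},
    "Failure to close on demand": {"Valves"},
    "Failure to open on demand": {"Valves"},
    "No output": {"Input Devices"},
    "Plugged / Choked": {"Valves"},
    "Spurious operation": {"Input Devices", "Control Logic Units", "Valves"},
}


def candidate_modes(equipment_class: str) -> List[str]:
    """Return the listed failure modes that apply to an equipment class."""
    return sorted(mode for mode, classes in FAILURE_MODES.items()
                  if equipment_class in classes)


print(candidate_modes("Control Logic Units"))
print(candidate_modes("Valves"))
```

The result is only a list of candidates. Whether each one is reasonably probable for a specific asset still depends on the operational context and on the judgment of experienced personnel.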

Only failure modes directly related to a specific functional failure, such as "inability to control pressure in the compressor header," should be included, excluding those associated with other functional failures, such as loss of containment or protection.

The identification of failure modes should be based on previous failure records, failure modes currently managed through maintenance programs, and potential failures considered reasonable.

This approach enables proactive risk management, essential for safety and operational continuity in oil and gas production. The experience and judgment of field technicians play a significant role in this identification process.

What happens when each functional failure occurs (failure effects)?

To accurately identify the failure effects associated with each failure mode, it is important to assume that no proactive maintenance is being carried out. This zero-based approach allows for a clear understanding of the potential impacts of failure modes, assuming that the failure mode will effectively lead to a functional failure.

This assumption helps draft precise failure effect statements and ensures that the analysis is grounded in the reality of potential operational failures.

At this stage, the RCM professional must differentiate between failure effects and failure consequences. As stated in section 9.1 of SAE JA1012, failure effects refer to the immediate results of a failure mode—what happens when the failure occurs—while failure consequences relate to the broader implications of these effects, including their significance and potential impact on safety, operations, and costs (how and why the failure matters).

Section 5.4.1 of SAE JA1011 specifies that failure effects must be clearly described, particularly in the absence of any proactive measures to anticipate, prevent, or detect the failure. This means the statement focuses solely on the outcome of the failure itself, without considering any intervention that might mitigate its impact.

For example, a failure in a pressure transmitter in an industrial boiler might not only prevent adequate monitoring of internal pressure but also create risks associated with loss of containment or unsafe operating conditions. In this sense, failure effects should be detailed not only in operational terms but also regarding their impact on the production chain and regulatory compliance.

For instance, a Distributed Control System (DCS) that fails to detect a temperature increase in the fuel gas header supplied to a compressor due to a sensor failure could cause catastrophic equipment damage and unplanned shutdowns, resulting in significant economic losses.

Additionally, these effects could lead to violations of environmental regulations, such as uncontrolled gas emissions into the atmosphere, highlighting the importance of identifying and describing the effects of each failure mode in a detailed and specific manner.

The identification process typically involves documenting the asset's functions, the corresponding functional failures, the associated failure modes, and the failure effects. This process results in a structured format known as a Failure Modes and Effects Analysis (FMEA), as shown in Table 4.

Failure effect statements form the basis for evaluating the failure consequences associated with each failure mode, providing the information needed to determine what failure management policies should be implemented to avoid, eliminate, or minimize the negative impacts of failures. This is vital to ensuring that the needs and expectations of asset owners and users are met.

Asset: Centrifugal Turbocharger Control System TC-93078
Function	Functional Failure	Failure Modes	Failure Effects
Control the start-up, operation, and safe shutdown of the TC-93078.	Does not control the oil pressure in the distribution header to the turbine T1-TU-93078 and the centrifugal compressor C1-CO-93078.	The pressure transmitter sends incorrect readings.	The oil pressure in the distribution header does not remain within safe operational limits, resulting in insufficient or excessive lubrication, causing severe damage to the turbine and compressor components.
		The controller generates an erratic output.	The control valves receive inconsistent signals, causing abrupt variations in oil flow that can lead to accelerated wear of moving parts.
		The controller's input channel fails intermittently.	The ability to monitor and adjust system conditions is temporarily lost, potentially leading to an emergency compressor shutdown due to undetected unsafe conditions.
	Does not control the fuel gas system valve.	The valve gets stuck when opening.	The valve remains closed, interrupting the fuel gas supply, resulting in a complete loss of the compressor's operational capacity and unplanned process shutdowns.
		The valve position sensor sends incorrect signals.	The system does not detect the actual position of the valve, allowing improper opening that could lead to excess fuel in the combustion chamber, increasing the risk of explosion.
		The controller's output channel fails intermittently.	The gas valve does not respond to the controller's commands, resulting in fuel flow fluctuations that compromise the compressor's operational stability and process efficiency.

Table 4. FMEA for the control system in Table 2.
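To make an FMEA like Table 4 easier to query later, each functional failure can be stored as a record in which every failure mode carries its own effect statement. The Python sketch below is only an illustration: the class and function names (FailureMode, FmeaRow, modes_without_effect) are assumptions and do not come from SAE JA1011 or JA1012, and the effect texts are abbreviated from Table 4.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FailureMode:
    """One failure mode paired with its failure effect statement."""
    description: str
    effect: str


@dataclass
class FmeaRow:
    """One functional failure of an asset function (hypothetical layout)."""
    function: str
    functional_failure: str
    failure_modes: List[FailureMode] = field(default_factory=list)


# Abbreviated from the first functional failure in Table 4.
oil_pressure = FmeaRow(
    function="Control the start-up, operation, and safe shutdown of the TC-93078.",
    functional_failure="Does not control the oil pressure in the distribution header.",
    failure_modes=[
        FailureMode("Pressure transmitter sends incorrect readings",
                    "Oil pressure leaves safe limits; lubrication damage"),
        FailureMode("Controller generates an erratic output",
                    "Abrupt oil flow variations; accelerated wear"),
        FailureMode("Controller input channel fails intermittently",
                    "Monitoring temporarily lost; possible emergency shutdown"),
    ],
)


def modes_without_effect(row: FmeaRow) -> List[str]:
    """Return the failure modes whose effect statement is still empty."""
    return [m.description for m in row.failure_modes if not m.effect.strip()]


print(len(oil_pressure.failure_modes))     # 3
print(modes_without_effect(oil_pressure))  # []
```

Keeping one effect per failure mode, rather than one block of text per functional failure, is a design choice that makes the later consequence evaluation traceable to a single mode.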

Maintenance based on the asset's condition (condition-based tasks), regularly scheduled maintenance to restore functionality (preventive maintenance), and pre-established replacement of components after a certain period or use are the main strategies for managing failures. Each of these tasks is associated with specific frequencies dictating how often they should be performed.

Within the framework of section 5.4.2 of SAE JA1011, failure effects must provide detailed information to precisely analyze the consequences associated with instrumentation equipment and control systems operating in the oil and gas production industry.

According to section 9.2 of SAE JA1012, the main objective at this stage is to recognize any indications suggesting the occurrence of a failure. This includes visible changes in equipment behavior, such as warning lights, alarms, or alterations in operational parameters (e.g., speed or noise). In the case of hidden failures, it is crucial to consider the potential consequences of multiple failures occurring simultaneously, as this can complicate both detection and response.

In a refinery, an example could be a hidden failure in the flow control valve regulating crude oil feed to a distillation unit. If the valve positioner sends incorrect signals due to intermittent disconnection, this might initially go unnoticed.

However, if this failure combines with an unexpected increase in feedline pressure, it could lead to uncontrolled flow into the unit, generating risks of process overload and possible damage to downstream equipment. Detecting and responding to such simultaneous failures would be especially challenging due to their cumulative impact on the system.

The statement about failure effects must specify any potential risk that could affect personnel safety or compliance with environmental regulations. This involves identifying specific situations that could result in injuries or environmental damage, such as an increased risk of fires, chemical spills, or exposure to hazardous conditions. It is important to focus on describing specific events rather than generalizing about safety-related consequences.

An example in the oil and gas production industry is a transmitter measuring water level in an industrial boiler used to generate steam for services. If this transmitter fails and provides incorrect readings, it might not detect a low water level in the boiler.

This increases the risk of overheating the boiler tubes, which could lead to an explosion and pose a significant hazard to personnel in the area. Additionally, the release of high-pressure steam and potential hydrocarbon particles could harm the environment, including thermal pollution and uncontrolled emissions into the atmosphere.

This scenario underscores the importance of precisely describing specific events that could arise from a failure, highlighting both safety risks and environmental implications.

The description should also include an analysis to estimate the period during which the asset will be out of service, evaluate whether the failure requires a reduction in operational speed, determine if the failure affects production quality (including potential increases in waste rates or penalties for failing to meet performance standards), examine whether other systems or processes are impacted by the failure, and investigate whether the failure results in increased operational costs, such as higher energy consumption.

In the context of a switch detecting high vibration on the coupling side of an electric motor-generator, a failure in this device could lead to continuous equipment operation under unsafe conditions. If the switch fails to detect elevated vibration and does not activate the alarm or shutdown system, excessive vibrations could cause significant damage to the motor-generator’s bearings. This would result in the need for prolonged and costly corrective maintenance.

This would leave the motor-generator out of service for an extended period, affecting equipment availability and requiring a reduction in the operational speed of dependent systems to maintain production. Additionally, motor-generator failures could compromise production quality by causing fluctuations in the electrical supply, increasing waste rates, and generating penalties for failing to meet quality standards.

Operationally, damages and inefficiencies could raise costs due to higher energy consumption caused by operating equipment under suboptimal conditions and the need to use less efficient backup generators.

If the failure causes damage to other components or systems, this must be recorded to understand the broader repercussions of the failure. Lastly, the description of failure effects should outline the actions needed to restore system functionality after a failure occurs. This includes identifying the maintenance or repair tasks that must be carried out.
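The elements that a failure effect statement should cover, as described above, can be treated as a checklist. The following Python sketch shows one possible way to flag incomplete statements. The field names are assumptions chosen for this example, not terminology from the standards, and the sample texts are shortened from the vibration switch example.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class FailureEffectStatement:
    """Checklist fields (names are illustrative, not taken from the standards)."""
    evidence_of_failure: str = ""   # alarms, warning lights, parameter changes
    safety_environment: str = ""    # specific events: fire, spill, exposure
    operational_impact: str = ""    # downtime, reduced speed, quality, costs
    secondary_damage: str = ""      # damage to other components or systems
    restoration_actions: str = ""   # repair tasks needed after the failure

    def missing_elements(self) -> List[str]:
        """Names of the checklist fields that are still blank."""
        return [f.name for f in fields(self)
                if not getattr(self, f.name).strip()]


# Vibration switch example from the text, deliberately left incomplete.
vibration_switch = FailureEffectStatement(
    evidence_of_failure="None in normal operation: no alarm or shutdown is raised",
    operational_impact="Motor-generator out of service for an extended period",
    secondary_damage="Significant damage to the motor-generator bearings",
)

print(vibration_switch.missing_elements())
# ['safety_environment', 'restoration_actions']
```

A review team could run a check like this over every row of the FMEA before moving on to consequence evaluation.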

How does each failure affect operations (failure consequences)?

Failure consequences relate to the broader implications of failure effects and include how and why the failure matters. According to section 5.5.1 of SAE JA1011, it is essential to classify each failure mode systematically, considering both visible impacts and those not immediately evident.

This process must carefully differentiate failures that affect safety and the environment from those that generate only economic or operational consequences, allowing for a more precise approach to risk management associated with each situation.

Failure consequence categorization is a fundamental aspect of the RCM process. It allows organizations to prioritize maintenance efforts based on the potential impact of failures, ensuring resources are allocated efficiently to mitigate risks and improve operational reliability.

Therefore, after identifying probable failure modes and their effects, the next step is to evaluate and categorize the consequences of each failure mode. This stage relies on detailed descriptions of failure effects, which serve as the primary source of information. Types of consequences include:

Operational impact: Some failure modes may affect production, product quality, or customer service.
Safety and environmental risks: Others may pose threats to safety or the environment, critical considerations in any operational context.
Cost implications: Certain failures may lead to increased operational costs, such as higher energy consumption.
Cascading effects: Some failure modes may seem inconsequential on their own but could lead to more severe failures if not addressed.

If failure modes are not anticipated or prevented, the organization may incur significant repair costs, diverting resources from other critical areas. This highlights the need for proactive management. The impact of each failure mode is influenced by the asset’s operational context, established performance standards, and the physical effects of the failure. This means the same failure mode may have different consequences depending on the situation.

The severity of consequences dictates the organization’s response. Severe consequences justify significant preventive measures, while minor consequences may lead to a more reactive approach, where failures are simply corrected as they occur.

Understanding the consequences of failure modes is more critical than their technical details. The goal of failure management should be to avoid or mitigate consequences rather than solely focusing on preventing failures themselves.

Differentiating between hidden and evident failures is critical in the RCM approach. Hidden failures can lead to significant risks without any warning, requiring careful monitoring and management. Evident failures, while also requiring attention, allow for immediate corrective actions.

Hidden failures are not apparent to operators under normal circumstances. In other words, the equipment may be in a failed state without visible signals or symptoms until another failure or abnormal event occurs. For example, a vibration sensor may fail without alerting operators, and its failure only becomes evident when a related system fails.

In contrast, evident failures are noticeable to operators when they occur. The effects of these failures are apparent during normal operations, allowing the crew to respond accordingly. For example, if a compressor stops working, operators can immediately see that there is a problem.

The RCM process emphasizes the importance of distinguishing between hidden and evident failures, as each type demands different management strategies. Hidden failures can constitute up to 50% of possible failure modes in complex systems. This highlights the need to identify and manage them properly to avoid unforeseen consequences.
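The categorization described in this section can be sketched as a short sequence of yes/no questions. The Python example below is a simplified illustration and not the decision diagram of SAE JA1012. The category names, the function name, and the order of the checks are assumptions made for this sketch. It follows the requirement that consequences are assessed as if no task were in place to manage the failure.

```python
from enum import Enum


class Consequence(Enum):
    HIDDEN = "hidden"
    SAFETY_ENVIRONMENTAL = "safety or environmental"
    OPERATIONAL = "operational"
    NON_OPERATIONAL = "non-operational (repair cost only)"


def categorize(evident: bool,
               threatens_safety_or_environment: bool,
               affects_operations: bool) -> Consequence:
    """Classify one failure mode, assuming no task is in place to manage it."""
    if not evident:
        # Hidden failures are handled on their own branch because their
        # consequences only appear when a second failure or event occurs.
        return Consequence.HIDDEN
    if threatens_safety_or_environment:
        return Consequence.SAFETY_ENVIRONMENTAL
    if affects_operations:
        return Consequence.OPERATIONAL
    return Consequence.NON_OPERATIONAL


# A compressor that stops is evident and affects production.
print(categorize(True, False, True).value)    # operational
# A vibration sensor that fails silently is a hidden failure.
print(categorize(False, False, False).value)  # hidden
```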

Hidden failures present a particular risk because they can result in severe consequences without any warning signals. For example, if a protection function, such as a high condensate level switch in a drum supplying fuel gas to a motor-compressor, fails without any outward indication, the equipment or process it is meant to protect may operate unsafely, potentially causing accidents or system failures. Evident failures, while also problematic, allow for quick identification and response, helping to reduce risks and prevent additional complications.

According to section 10.1.1 of SAE JA1012, protection functions aim to reduce the consequences of failures in the functions they are designed to safeguard. These protection systems consist of at least two elements: the protection function itself and the protected function. The interaction between these components is crucial.

If the protection function fails and this failure is evident, operators can act to prevent or mitigate the repercussions of a failure in the protected function. However, if the protection function fails in a hidden manner, operators may not realize the risk until it is too late. If the protected function then fails while the protection function is in a failed state, a multiple failure occurs.
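A rough way to quantify this exposure is to multiply how often the protected function fails by the fraction of time the protection device is unavailable. The Python sketch below assumes the two events are independent. The function name and the numbers are hypothetical and serve only to illustrate the arithmetic.

```python
def multiple_failure_rate(demand_rate_per_year: float,
                          device_unavailability: float) -> float:
    """Expected multiple failures per year (assumes independent events).

    demand_rate_per_year: how often the protected function fails per year
    device_unavailability: fraction of time the protection device is failed
    """
    if demand_rate_per_year < 0:
        raise ValueError("demand rate cannot be negative")
    if not 0.0 <= device_unavailability <= 1.0:
        raise ValueError("unavailability must be between 0 and 1")
    return demand_rate_per_year * device_unavailability


# Hypothetical figures: one high-level event every two years and a level
# switch that is in a failed state 2% of the time.
rate = multiple_failure_rate(demand_rate_per_year=0.5,
                             device_unavailability=0.02)
print(f"{rate:.3f} multiple failures per year")  # 0.010 multiple failures per year
```

Under these assumptions, lowering the unavailability of the protection device is the main lever for reducing the rate of multiple failures when the demand rate itself cannot be changed.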

For the development of Figure 4, the example of a vertical two-phase separator receiving natural gas used as fuel in an internal combustion engine is considered. The separator receives gas at the bottom and, at the top, has a demister mesh that separates liquid droplets carried by the gas through condensation. After this filtration stage, the gas exits through a line connected to the top of the separator and is directed to the distribution header feeding the engine's combustion chambers.

Condensed gas or liquid droplets separated by the demister mesh fall by gravity to the bottom of the separator. At this stage of the process, controlling the condensate level in the separator is critical. On one hand, if the condensate level rises to reach the gas outlet line, it gets carried to the combustion chambers, causing severe consequences for the engine's integrity.

Figure 4. Hidden failures and multiple failures within the framework of the SAE JA1012 standard.

On the other hand, if the condensate level falls until it is completely depleted inside the separator, gas will flow to the separator intended to receive the liquid condensate at this stage of the process. This would create an overpressure scenario that should trigger the relief valves protecting the destination separator for the condensed liquid.

For the purposes of Figure 4, the case of high condensate level is considered. Scenarios involving activation of the high-level switch are associated with failures in the BPCS and SIS control loops, which are safety layers that must remain operational to avoid switch activation.

If the switch activates, the controller triggers the safe shutdown of the machine powered by the engine, which could be a turbine or a reciprocating compressor. In Scenario 1 of the left graph in Figure 4, both the protected function (safe engine operation) and the protection device (high-level switch) are operational, and no operator action is required.

In Scenario 2 of the left graph, an evident failure in the high-level switch is considered, and in this case, the operator performs the safe shutdown of the engine along with the closure of the gas supply valves to the separator. In this scenario, it is possible to keep the engine off while repairing the switch, or, if the engineering design allows, the engine can continue running using a bypass.

In Scenario 3 of the left graph, operation returns to the normal state described in Scenario 1. In Scenario 4 of the same graph, a situation is represented where the condensate level in the separator has reached the process’s maximum allowable level. However, with the high-level switch available, the corresponding alarm or the safe shutdown of the engine is triggered, depending on the controller’s programming.

In Scenario 1 of the right graph, the operational conditions of the protected function and the protection device are normal. Scenario 2 considers the emergence of a hidden failure in the high-level switch, meaning a failure in the protection device. In this case, the engine continues to operate, but without corrective actions for high condensate level protection in the separator, functional failures materialize in the engine due to condensate in the combustion chambers, leading to the multiple failure shown in Scenario 3 of the right graph.
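The scenarios described for Figure 4 can be summarized as a small state table. The Python sketch below is an interpretation of the text, not a reproduction of the figure. The function name, the parameter names, and the outcome wording are assumptions. The case of an evident switch failure is treated the same way whether or not the level is high, because the operator has already intervened.

```python
def figure4_outcome(switch_working: bool,
                    failure_evident: bool,
                    high_level: bool) -> str:
    """Outcome for the separator example, given the state of the level switch."""
    if switch_working:
        if high_level:
            return "switch trips: alarm or safe shutdown"
        return "normal operation, no operator action"
    if failure_evident:
        # Left graph: the operator knows the switch has failed.
        return "operator shuts down or bypasses while the switch is repaired"
    if high_level:
        # Right graph: protected function fails while protection is down.
        return "multiple failure: condensate reaches the combustion chambers"
    return "engine runs unprotected and nobody is aware"


print(figure4_outcome(switch_working=False, failure_evident=False, high_level=True))
# multiple failure: condensate reaches the combustion chambers
```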

It is important to highlight that, as established in section 5.5.2 of SAE JA1011, the evaluation of failure consequences must be conducted under the assumption that no specific task is being carried out to anticipate, prevent, or detect the failure.

What should be done to predict or prevent each failure (proactive tasks and task intervals)?

According to section 5.6.1 of SAE JA1011, failure management policies should be based on the behavior of the conditional probability of failure modes over time. The likelihood of a failure mode varies depending on the asset's age.

This implies that certain failure modes become more likely as the asset ages or is exposed to stress conditions, while others maintain a constant probability regardless of the asset's age, as illustrated in the graphs in Figure 5. The graph on the right is characteristic of random failures unrelated to time.

Figure 5. Failure patterns in the context of the SAE JA1012 standard.

As an example, two pressure transmitters with remote seals in the process connection, operating under different conditions, may exhibit the failure patterns shown in Figure 5. On the left side, one of the transmitters operates in a heavy crude oil line, where the accumulation of product on the sensor diaphragm gradually increases the probability of failure over time. On the right side, the other pressure transmitter operates in a condensate line, where significantly lower sediment accumulation keeps the sensor diaphragm clean, so its probability of failure remains essentially constant over time.

Figure 6. Manufacturing Defects and Wear Zone in the context of the SAE JA1012 standard.

In other cases, some failure modes may become less likely as the asset is used, because manufacturing defects tend to appear early in the asset's life. Other patterns show a wear-out zone, indicating that as the asset ages, it becomes more prone to failures due to wear caused by prolonged use or stress, as shown in Figure 6.

As an example, a control valve with a pneumatic actuator may exhibit failure modes like those described in Figure 6. On the left side, the valve trim may experience failure modes associated with manufacturing defects or assembly errors. On the right side, over time, the actuator diaphragm may enter a wear-out zone associated with aging.

Figure 7. Failure Probability in the context of the SAE JA1012 standard.

The failure pattern shown in the left graph of Figure 7, known as the bathtub curve, applies to assets that may fail early in operation because of manufacturing defects or assembly errors and that later enter a wear-out zone.

On the other hand, the failure pattern shown in the right graph of the same figure refers to those assets that may be reliable at first but, as they age, become more susceptible to failures due to age-related degradation.

As an example for the left graph of Figure 7, during the initial operation stage of a new control system, it is common to have a high probability of failure associated with configuration errors or adjustments required during the testing stage, and over time, some elements of the control loops enter a wear-out zone.

For the right graph of Figure 7, we can cite safety relays as an example, which have high reliability in their initial stage, meaning a low probability of failure. However, depending on the activation frequency, the reliability of these types of assets tends to deteriorate.
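The failure patterns in Figures 5 to 7 are often approximated with the Weibull hazard rate, used here as a stand-in for the conditional probability of failure. This model is not prescribed by SAE JA1011 or JA1012, and the shape (beta) and scale (eta) values below are hypothetical. A shape below 1 gives a decreasing rate (early-life failures), a shape of 1 gives a constant rate (random failures), and a shape above 1 gives an increasing rate (wear-out).

```python
def weibull_hazard(t: float, beta: float, eta: float) -> float:
    """Weibull hazard rate at age t: (beta / eta) * (t / eta) ** (beta - 1)."""
    if t <= 0 or beta <= 0 or eta <= 0:
        raise ValueError("t, beta and eta must be positive")
    return (beta / eta) * (t / eta) ** (beta - 1)


def bathtub_hazard(t: float, eta: float = 10.0) -> float:
    """Illustrative bathtub curve: early-life + random + wear-out terms.

    The three shape values (0.5, 1.0, 3.0) are assumptions for this sketch.
    """
    return (weibull_hazard(t, 0.5, eta)
            + weibull_hazard(t, 1.0, eta)
            + weibull_hazard(t, 3.0, eta))


# Constant rate for random failures (clean condensate service in Figure 5).
print(weibull_hazard(5, 1.0, 10))  # 0.1

# Rising rate for a wear-out pattern (fouling in heavy crude service).
print(weibull_hazard(5, 3.0, 10) < weibull_hazard(10, 3.0, 10))  # True

# Bathtub: high at the start, lower in mid-life, high again at the end.
print(bathtub_hazard(0.1) > bathtub_hazard(3.0) < bathtub_hazard(15.0))  # True
```

Fitting beta and eta to real failure records requires data that many sites do not have, so in practice the pattern is often selected from experience and the curve used only to reason about task intervals.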

Understanding how age and stress influence the probability of failures is essential for determining effective failure management policies, because it allows organizations to develop maintenance strategies that align with the anticipated behavior of assets throughout their life cycle.

According to section 5.6.2 of SAE JA1011, these strategies refer to the scheduling of effective and technically applicable maintenance tasks. A scheduled task is considered effective if it successfully reduces, avoids, eliminates, or minimizes the consequences of a potential failure mode.

According to section 11.2 of SAE JA1012, this means that the benefits derived from performing the maintenance task must outweigh the associated costs. Costs can be direct (such as labor and materials) and indirect (such as downtime or productivity loss). If the task does not provide significant benefits in terms of risk reduction or cost savings, it is not effective.

Furthermore, a scheduled maintenance task must be realistically executable, considering the available technology, resources, and capabilities. It is important for the task to be feasible without encountering insurmountable technical obstacles. If a maintenance task requires specialized tools or skills that are unavailable, it is not applicable.

If it is not possible to schedule an effective and applicable task, and the consequences of a potential failure are unacceptable to the asset owner or user, alternative strategies must be developed to manage these consequences. This could include redesigning the system, implementing different maintenance approaches, or investing in new technologies.

If it is possible to schedule more than one effective and applicable maintenance task for the same asset, the choice should be based on which failure management policy or maintenance strategy offers the best cost-effectiveness, as stated in section 5.6.3 of SAE JA1011. Opting for a policy solely for its level of technical sophistication may not be economically advantageous, as it could lead to high costs without a proportional reduction in risks.
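The selection logic described above can be sketched as a filter followed by a cost comparison. The Python example below applies only to failure modes with economic consequences, where an annual cost comparison is meaningful. The class name, the fields, and the figures are hypothetical, and a real analysis would also weigh safety and environmental risk.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CandidateTask:
    name: str
    applicable: bool              # feasible with available tools and skills
    annual_task_cost: float       # labor, materials, downtime for the task
    residual_failure_cost: float  # expected annual failure cost with the task

    @property
    def total_annual_cost(self) -> float:
        return self.annual_task_cost + self.residual_failure_cost


def select_task(candidates: List[CandidateTask],
                annual_cost_without_task: float) -> Optional[CandidateTask]:
    """Return the cheapest applicable task that beats doing nothing, else None."""
    worthwhile = [c for c in candidates
                  if c.applicable
                  and c.total_annual_cost < annual_cost_without_task]
    return min(worthwhile, key=lambda c: c.total_annual_cost, default=None)


# Hypothetical options for the motor-generator bearing failure mode.
options = [
    CandidateTask("Monthly vibration route", True, 4_000, 3_000),
    CandidateTask("Online monitoring system", True, 15_000, 1_000),
    CandidateTask("Scheduled bearing overhaul", False, 6_000, 2_000),
]

best = select_task(options, annual_cost_without_task=20_000)
print(best.name if best else "no worthwhile task")  # Monthly vibration route
```

In this example the more sophisticated online system reduces the residual failure cost further, but its total annual cost is higher, so the simpler route is selected.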

According to section 11.3 of SAE JA1012, Reliability-Centered Maintenance (RCM) calls for selecting policies that mitigate the consequences of failures effectively and at a reasonable cost. The goal is to find a balance where the selected policy not only addresses failures effectively but is also financially justifiable, considering both direct and indirect costs.

It is important to highlight that, as established in section 5.6.4 of SAE JA1011, failure management policies should be selected under the assumption that no preventive maintenance activities are being conducted. Considering that no preventive maintenance tasks are in place, the evaluation process becomes more impartial. This facilitates a more accurate understanding of the risks and potential consequences associated with each failure mode without the interference of current maintenance practices.

It also ensures that the selection of policies is based on the inherent risks of the system rather than on mitigations that may already be implemented, which could obscure the true nature of the risks. By disregarding current preventive maintenance efforts, organizations can more effectively detect gaps in their failure management strategies. This can result in identifying failure modes that are not being adequately addressed by existing policies, highlighting the need to develop new or revise current management strategies.

This approach lays a foundation for continuous improvement in maintenance practices. It enables organizations to formulate policies that are both corrective and preventive, ultimately increasing the reliability and safety of their systems. It ensures that all potential risks are considered and that the selected policies are robust enough to effectively address failures.

What should be done if a suitable proactive task cannot be found (default actions)?

Within the framework of the SAE JA1011 and JA1012 standards, the default actions available when no suitable proactive task can be found can be summarized as follows:

Make a one-time change to the design or operational context: According to section 14.1 of the SAE JA1012 standard, one-time changes include modifications to the design of the asset, changes in operating or maintenance methods, and training to improve staff capacity. These actions aim to eliminate or minimize the need for the unavailable proactive task.
Adopt a run-to-failure policy: If it is not possible to implement a proactive task that is technically or economically viable, the organization may choose to let the equipment or component fail and then repair or replace it, as mentioned in section 14.2 of the SAE JA1012 standard. This policy is only suitable if the consequences of the failure are tolerable and do not compromise safety, the environment, or critical operations.
Implement alternative failure management tasks: If the proactive task is not available, it is possible to consider condition-based maintenance tasks, scheduled restoration, or scheduled discard tasks that are technically and economically feasible. These options are described in sections 13.1 to 13.3 of the SAE JA1012 standard.
Establish temporary measures or redundancies: To avoid negative consequences while seeking a permanent solution, temporary controls, redundancies, or additional protection measures can be implemented to mitigate the impact of the lack of the proactive task.

These actions are taken based on a detailed evaluation of the consequences of failures, the cost-benefit of the available options, and the operational context of the asset. Additionally, it is essential to document the decisions made and their justification to maintain a clear and defensible record.
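The choice among these default actions can be outlined as a simple rule set. The Python sketch below is a simplification for illustration. The function name, the inputs, and the order in which the options are tested are assumptions and are not prescribed by the standards. It returns the justification together with the action, in line with the need to document each decision.

```python
from typing import Tuple


def default_action(consequences_tolerable: bool,
                   affects_safety_or_environment: bool,
                   affects_critical_operations: bool,
                   one_time_change_feasible: bool) -> Tuple[str, str]:
    """Return (action, justification) when no proactive task is suitable."""
    if (consequences_tolerable
            and not affects_safety_or_environment
            and not affects_critical_operations):
        return ("run to failure",
                "consequences are tolerable and limited to repair cost")
    if one_time_change_feasible:
        return ("one-time change",
                "consequences are not tolerable; redesign or change of "
                "procedure or training is feasible")
    return ("temporary measures",
            "consequences are not tolerable and no permanent change is "
            "available yet")


action, reason = default_action(
    consequences_tolerable=False,
    affects_safety_or_environment=True,
    affects_critical_operations=True,
    one_time_change_feasible=True,
)
print(action)  # one-time change
```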

Conclusions

The implementation of Reliability-Centered Maintenance (RCM) establishes itself as an indispensable strategic approach for effectively managing instrumentation equipment and industrial control systems in the oil and gas industry.

This methodological framework, based on the SAE JA1011 and SAE JA1012 standards, provides practical and conceptual tools to address the inherent challenges of an environment characterized by high safety standards, continuous performance demands, and economic constraints. A fundamental aspect of RCM is the precise identification of the primary and secondary functions of assets, as well as the definition of the operational context in which they operate.

This initial step not only ensures that maintenance plans are oriented toward the organization's specific objectives but also lays the foundation for establishing clear and measurable performance standards. By understanding the purpose and operating conditions of each asset, alignment is achieved between maintenance strategies and the company's operational goals.

The analysis of functional failures, failure modes, and their respective effects and consequences is another essential pillar of RCM. This comprehensive approach allows for a complete understanding of the risks associated with asset deterioration or malfunction.

This analysis is not limited to anticipating failures but also enables the development of proactive and effective maintenance policies that mitigate adverse impacts on operations, safety, and the environment. Thus, RCM becomes a key tool for reducing operating costs by minimizing unplanned downtime and preventing catastrophic events.

On the other hand, designing maintenance strategies based on failure patterns and the expected behavior of assets over time allows for resource optimization and improved maintenance effectiveness. These strategies are tailored to the specific characteristics of each asset and their operational environment, reinforcing the need for a flexible and contextualized approach.

When proactive tasks are not technically or economically feasible, RCM offers alternatives such as system redesign or adopting run-to-failure policies, always ensuring that decisions are based on a careful cost-benefit evaluation.

Finally, interdisciplinary collaboration and the integration of knowledge from operators, technicians, and specialists are critical to the success of RCM. This participatory approach not only enriches the analysis process but also ensures that maintenance strategies reflect operational realities and dynamically adapt to changes in operating conditions or organizational goals.

In conclusion, Reliability-Centered Maintenance, applied under the principles established by the SAE JA1011 and SAE JA1012 standards, represents a standard of excellence for the management of industrial assets in the oil and gas industry.

This approach, grounded in detailed analysis of functions and failures, as well as in the implementation of cost-effective maintenance policies, ensures safe, profitable, and sustainable operations. The adaptability of RCM and its capacity to integrate diverse technical and operational knowledge make it an indispensable tool for addressing the challenges of an increasingly complex and demanding industrial environment.

Bibliographic Resources

SAE JA1011 Evaluation Criteria for Reliability-Centered Maintenance Processes: This standard establishes the minimum criteria that a process must meet to be considered RCM. It provides a structured guide for evaluating and improving maintenance practices.
SAE JA1012 Guide to the Reliability-Centered Maintenance Standard: It complements SAE JA1011, offering a detailed explanation of key RCM concepts and processes. It is a valuable reference for the effective implementation of reliability-based maintenance strategies.
ISO 14224:2016 Collection and Exchange of Reliability and Maintenance Data for Equipment in Petroleum, Petrochemical, and Natural Gas Industries: This international standard provides guidelines for collecting and exchanging reliability and maintenance data, facilitating informed decision-making in asset management.
OREDA (Offshore Reliability Data): The OREDA project is a joint industry initiative that collects and analyzes reliability and maintenance data for equipment used in offshore and onshore facilities. Its publications, such as the "OREDA Handbook," are fundamental references for professionals in the reliability field.
https://www.aladon.com/wp-content/uploads/2015/12/IntroductiontoAladonRCM3Brochure.pdf
https://ww2.eagle.org/content/dam/eagle/rules-and-guides/current/design_and_analysis/132_reliabilitycenteredmaintenance/rcm-gn-aug18.pdf
https://www.dau.edu/acquipedia-article/reliability-centered-maintenance-rcm
https://www.livingreliability.com/en/posts/criticality-analysis-in-rcm/

CONTACT AND SOCIAL MEDIA

If you would like to collaborate on future projects, discuss technical topics, or follow the content I develop, I invite you to connect with me through the following platforms:

Website: https://carloscristancho.com/

Email: contact@carloscristancho.com

LinkedIn: carlos-cristancho

GitHub: carlos-cristancho

Discord: Coding Crew Server

YouTube: @carloscristancho

Telegram: Coding Crew Official


Friday, December 6, 2024
