Random and Systematic Failures

The safety lifecycle approach recognizes that there are two fundamentally different ways in which the SIS can fail to perform its intended function: random failures and systematic failures. Because this concept underpins every aspect of the safety lifecycle, a clear understanding of failure types is crucial.

Random Failures

Random failures occur unpredictably and are typically attributed to the degradation of hardware components due to physical causes such as corrosion, thermal stress, or wear-out. These failure mechanisms are generally well understood, and the failures can occur without any error in how the system was designed, installed, or operated.

Random failures are hardware failures. Every item of equipment has a finite lifetime, during which some component within the equipment may break due to natural wear-and-tear processes such as fatigue. This is true even if the equipment is installed correctly, operated within specification, and maintained properly.

A random failure is usually permanent, meaning it renders a component or module inoperable, and is often traced to specific devices or loops within a system. Since these failures occur at random times, their likelihood can be statistically analyzed to determine an average probability.
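As a minimal sketch of that statistical treatment, the Python snippet below estimates a constant failure rate from in-service data and converts it into a probability of failure over a given period. The fleet size, failure count, and function names are hypothetical, and the exponential model assumes a constant failure rate during the useful life, which is a common simplification rather than a universal rule.

```python
import math


def estimate_failure_rate(failures: int, units: int, hours_per_unit: float) -> float:
    """Point estimate of a constant failure rate, in failures per hour.

    Assumes every unit was in service for the same number of hours.
    """
    total_hours = units * hours_per_unit
    if total_hours <= 0:
        raise ValueError("total operating time must be positive")
    return failures / total_hours


def probability_of_failure(rate_per_hour: float, hours: float) -> float:
    """Probability that one item fails within `hours`, assuming a constant rate."""
    return 1.0 - math.exp(-rate_per_hour * hours)


# Hypothetical field data: 12 failures among 500 devices, each in service for one year.
HOURS_PER_YEAR = 8760.0
rate = estimate_failure_rate(failures=12, units=500, hours_per_unit=HOURS_PER_YEAR)
p_one_year = probability_of_failure(rate, HOURS_PER_YEAR)

print(f"Estimated failure rate: {rate:.3e} per hour")
print(f"Probability of failure within one year: {p_one_year:.4f}")
```

The estimate says nothing about which device will fail or when; it only describes the average behaviour of the population.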

Examples of Random Failure

For example, a lightning strike near a plant can induce an electrical surge that damages a component, such as a transistor in a controller module. This is a classic case of a random failure caused by external stress that is inherently unpredictable.

Another example involves a power supply module failing because the electrolytic capacitor inside it loses its electrolyte over time due to evaporation. This natural degradation process eventually leads to the capacitor becoming an open circuit, preventing the power supply from functioning.

Both cases illustrate how random failures are inherently linked to physical mechanisms and can be mitigated through thoughtful design, such as using higher-integrity equipment, incorporating redundancy, or employing robust materials less susceptible to wear-out.

Mitigation of Random Failures

Random failures can never be eliminated entirely, but they can be handled mathematically. Although it is impossible to predict when any individual item of equipment will fail, we can know a great deal about typical failure behaviour, given data from a large enough population of equipment in service.

For example, we can determine the item’s useful lifetime, and the probability that it will fail during a given period of time. This information is essential during risk analysis, because it allows us to calculate the extent of risk reduction that a particular design of SIS can be expected to provide and, hence, whether it is sufficient to meet the tolerable risk target.
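One way such a calculation might look is sketched below in Python. It uses the widely quoted simplified approximation for a single-channel (1oo1) function, PFDavg ≈ λDU × TI / 2, where λDU is the dangerous undetected failure rate and TI is the proof test interval. The failure rate, test interval, and required risk reduction factor are illustrative assumptions, and the function names are hypothetical; a real SIL verification would account for more factors (diagnostics, repair time, proof test coverage, common cause).

```python
def pfd_avg_1oo1(lambda_du_per_hour: float, test_interval_hours: float) -> float:
    """Simplified average probability of failure on demand for a 1oo1 function.

    Valid only as an approximation when lambda_du * test_interval is small.
    """
    return lambda_du_per_hour * test_interval_hours / 2.0


def risk_reduction_factor(pfd_avg: float) -> float:
    """Risk reduction factor is the reciprocal of PFDavg."""
    return 1.0 / pfd_avg


# Illustrative assumptions, not data from any real device.
lambda_du = 2.0e-6        # dangerous undetected failures per hour
test_interval = 8760.0    # proof test once per year, in hours
required_rrf = 100.0      # hypothetical target taken from a risk analysis

pfd = pfd_avg_1oo1(lambda_du, test_interval)
rrf = risk_reduction_factor(pfd)
meets_target = rrf >= required_rrf

print(f"PFDavg = {pfd:.5f}, RRF = {rrf:.1f}, meets target: {meets_target}")
```

If the achieved risk reduction factor falls short of the target, the design has to change, for example by testing more often, choosing equipment with a lower failure rate, or adding redundancy.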

To address random failures effectively, system designers often adopt strategies that enhance the resilience of the hardware. These include selecting components with higher durability and reliability, implementing backup systems to maintain functionality during failure, and conducting regular maintenance to identify and replace aging components before they fail.
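The effect of a backup channel can be illustrated with a short Python sketch comparing a single channel (1oo1) with a redundant pair (1oo2). The formulas are common simplified approximations, and the beta factor used to represent common cause failures, along with all numeric values and function names, is an assumption made for illustration only.

```python
def pfd_avg_1oo1(lambda_du: float, ti: float) -> float:
    """Simplified PFDavg for one channel: lambda_du * ti / 2."""
    return lambda_du * ti / 2.0


def pfd_avg_1oo2(lambda_du: float, ti: float, beta: float = 0.0) -> float:
    """Simplified PFDavg for two redundant channels (1oo2).

    The independent part is ((1 - beta) * lambda_du * ti)^2 / 3.
    The common cause part, beta * lambda_du * ti / 2, defeats both channels at once.
    """
    independent = ((1.0 - beta) * lambda_du * ti) ** 2 / 3.0
    common_cause = beta * lambda_du * ti / 2.0
    return independent + common_cause


# Illustrative assumptions, not data from any real device.
lambda_du = 2.0e-6   # dangerous undetected failures per hour
ti = 8760.0          # proof test interval in hours

single = pfd_avg_1oo1(lambda_du, ti)
redundant_ideal = pfd_avg_1oo2(lambda_du, ti)            # no common cause
redundant_beta = pfd_avg_1oo2(lambda_du, ti, beta=0.1)   # assumed 10 % common cause

print(f"1oo1:             {single:.6f}")
print(f"1oo2 (beta = 0):  {redundant_ideal:.6f}")
print(f"1oo2 (beta = 0.1): {redundant_beta:.6f}")
```

Under these assumptions redundancy lowers the probability of failure on demand substantially, but the common cause term limits the benefit, which is why redundancy alone is not a complete answer.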

Systematic Failures

Systematic failures, in contrast, result from errors during the development, design, operation, or maintenance of a system. Unlike random failures, systematic failures are not tied to physical degradation but instead arise from flaws in processes, procedures, or logic. These failures are consistent and repeatable under identical circumstances, yet they are difficult to predict and to characterize statistically.

Systematic failures are device failures ultimately caused by human errors. The lifecycle presents numerous possibilities for human errors to occur; a few examples are:

Incorrect risk analysis (failing to identify hazards, underestimating risks)
Administrative errors (working from out-of-date versions of documents, incorrect drafting of documents, miscommunication)
Incorrect design of SIS
Software bugs
Incorrect installation of SIS
Failure to maintain equipment, or errors during maintenance (such as failing to remove overrides after completing the maintenance procedure)

While some of these are under the direct control of the process plant owner or design and construction contractor, others are not. For example, a safety equipment manufacturer may make a design error, which could lie hidden for many months or years until a particular combination of circumstances brings it to light. When the error is finally revealed, severe consequences could occur without warning; for example, an emergency trip may fail to operate on demand, leading to a fire or explosion.

Systematic failures often have a broader impact than random failures, as they can affect multiple devices, loops, or even entire systems within a corporation. This is because systematic failures are tied to “the way things are done,” such as organizational practices, training, or procedural gaps.

Mitigation of Systematic Failures

Unlike random failures, systematic failures cannot currently be mathematically modelled. Since it is impossible to test every combination of circumstances and events that could ever arise, we can never know for sure whether errors exist in our SIS, how many, or how serious they are. Statistical treatment is of little value, since error rate data collected in one environment is unlikely to be applicable to another. The only practical way to address systematic failures is to minimize them.

The two main ways of doing this are:

Reduce the number of errors made in the first place, for example, by ensuring individuals are competent, providing clear requirements and procedures, and reducing the number of opportunities for error (fewer and simpler operations); and

Provide opportunities to detect errors, for example, by verification and review, and by recording and investigating every unexpected incident involving the SIS.

For this reason, IEC 61511 places great emphasis on software development techniques, management procedures, cross-checking of work completed and competency of individual safety practitioners.

To mitigate systematic failures, organizations must focus on improving administrative controls and monitoring processes. This includes enhancing staff training, refining procedural guidelines, and applying rigorous testing protocols throughout the system’s lifecycle. Qualitative measures, such as lifecycle activities, aim to minimize the likelihood of systematic failures by addressing potential weaknesses in the development, operation, and maintenance phases. By proactively identifying and rectifying these issues, organizations can reduce the occurrence and impact of systematic failures.

Top References

Safety Instrumented Systems Verification: Practical Probabilistic Calculations by William M. Goble and Harry Cheddie
IEC 61511
www.exida.com
https://www.exida.com/Blog/random-versus-systematic-faults-whats-the-difference
Guidelines for Safe Automation of Chemical Processes by Center for Chemical Process Safety
Reliability, Maintainability and Risk by Dr David J Smith
Functional Safety from Scratch by Peter Clarke, xSeriCon

Nasir Hussain

0092-3334647564 | thepetrosolutions@gmail.com

Certified Functional Safety Professional (FSP, TÜV SÜD), Certified HAZOP & PHA Leader, LOPA Practitioner, and Specialist in SIL Verification & Functional Safety Lifecycle, with 18 years of professional experience in Plant Operations and Process Safety across Petroleum Refining and Fertilizer Complexes.