Practical Methods to Reduce Systematic Failures

Systematic failures result from errors during the design, development, operation, or maintenance of a system. Unlike random failures, they are not tied to physical degradation but arise from flaws in processes, procedures, or logic. They are consistent and repeatable under identical circumstances, which makes them harder to predict and characterize statistically than random failures. For further details about systematic and random failures, please see the other blog posts.

Systematic failures are device failures ultimately caused by human error. The lifecycle of a safety instrumented system (SIS) presents numerous opportunities for human error; a few examples are:

Incorrect risk analysis (failing to identify hazards, underestimating risks)
Administrative errors (working from out-of-date versions of documents, incorrect drafting of documents, miscommunication)
Incorrect design of the SIS
Software bugs
Incorrect installation of the SIS
Failure to maintain equipment, or errors during maintenance (such as failing to remove overrides after completing the maintenance procedure)

Practical Methods to Reduce Systematic Failures

1. Ensure competency

Define the competency level required for each lifecycle task, including qualifications, experience and knowledge.
Assign individuals to tasks for which they are competent.
Encourage individuals to query any information or instructions they do not understand or agree with.

2. Information availability

Ensure resources are available, e.g. access to up-to-date versions of standards and codes of practice.
Provide and implement a document control system, to ensure everyone works from the latest version of each document. (This is often part of an ISO 9000 quality management system.)
Use the Safety Requirements Specification (SRS) and other key lifecycle documents as the sole means of transferring information between individuals.
Use adequate labelling (of equipment and wiring) and commenting (of software code).
Ensure procedures and manuals are available and fit for purpose: clear, unambiguous, complete, and provided in the local language.
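
As a rough illustration of the document control point above, the sketch below compares the revision each person is working from against a register of current revisions. The document IDs, revision labels, role names, and the find_stale_copies helper are all hypothetical; a real document control system would normally perform this check itself.

```python
# Hypothetical register of the current approved revision of each document.
CURRENT_REVISIONS = {
    "SRS-001": "C",
    "CE-MATRIX-004": "B",
}


def find_stale_copies(current_revisions, working_copies):
    """Return (person, doc_id, revision_in_use, current_revision) for every
    working copy whose revision differs from the register. If the document
    is not in the register at all, current_revision is None."""
    stale = []
    for person, doc_id, revision in working_copies:
        current = current_revisions.get(doc_id)
        if current != revision:
            stale.append((person, doc_id, revision, current))
    return stale


# Hypothetical record of who is working from which revision.
working_copies = [
    ("designer", "SRS-001", "C"),
    ("programmer", "SRS-001", "B"),  # out of date
    ("installer", "CE-MATRIX-004", "B"),
]

for person, doc_id, in_use, current in find_stale_copies(CURRENT_REVISIONS, working_copies):
    print(f"{person} is using {doc_id} rev {in_use}; current is rev {current}")
```

Revisions are compared for equality only, so the sketch makes no assumption about how revision labels are ordered.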

3. Simplification

Do not use equipment with more features than actually required.
Make unneeded features (especially software features) unavailable.
Use passwords and other means of access control to limit the number of individuals who can change things (such as documents, wiring and software settings).
Use restrictive languages for the application program.
Avoid unnecessary diversity. Use the same brand or type of equipment and software for all similar applications where practical. However, this can conflict with avoidance of common cause failures.

4. Familiarity

Avoid unnecessary novelty. Use well-established and familiar equipment, procedures and methods.
Suitability: use equipment and software only for their intended function. Pay attention to any restrictions listed in the equipment’s Safety Manual.
Use SIL-certified equipment and validated tools (software development tools, analytical software, test equipment).
Alternatively, use equipment with a good, documented track record of prior use.

5. Review

Follow a properly designated review procedure, especially for software development. Ensure an adequate degree of independence between the executing engineer and the reviewer. Record deviations and errors found, not for disciplinary purposes but to allow an assessment of whether systematic failures are properly under control.
Compare the expected and actual performance of the SIS, especially in terms of trip rate (real trips and spurious trips). If the actual trip rate is much higher than expected (based on random failure rate calculations), it indicates the presence of systematic failures in the design and/or implementation of the SIS.
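
One way to put a number on "much higher than expected" is to ask how likely the observed trip count would be if only random failures were at work. The sketch below assumes that trips caused by random failures follow a Poisson process at the predicted rate. The 5% threshold, the function names, and the example figures are illustrative assumptions, not values taken from the references.

```python
import math


def poisson_tail(k_observed, expected):
    """Probability of seeing k_observed or more events when the count
    follows a Poisson distribution with the given expected value."""
    if k_observed <= 0:
        return 1.0
    # P(X >= k) = 1 - P(X <= k - 1)
    cdf = sum(
        math.exp(-expected) * expected ** i / math.factorial(i)
        for i in range(k_observed)
    )
    return max(0.0, 1.0 - cdf)


def trip_rate_check(observed_trips, predicted_rate_per_year, years_in_service, alpha=0.05):
    """Compare the observed trip count with the count predicted from
    random failure rate calculations. alpha is an assumed threshold."""
    expected = predicted_rate_per_year * years_in_service
    p_value = poisson_tail(observed_trips, expected)
    return {
        "expected_trips": expected,
        "p_value": p_value,
        # A small p-value means random failures alone are an unlikely
        # explanation, which points towards systematic causes.
        "suspect_systematic": p_value < alpha,
    }


# Hypothetical example: 0.1 trips/year predicted, 4 trips seen in 5 years.
result = trip_rate_check(observed_trips=4, predicted_rate_per_year=0.1, years_in_service=5.0)
print(result)
```

A flagged result does not prove a systematic failure; it only indicates that the trip history deserves investigation.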

6. Investigation

Record and investigate all incidents of unexpected SIS behavior, especially unwanted (spurious) trips, diagnostic alarms, test failures, issues found during maintenance, and events when the SIS is found to be in an abnormal state (e.g. unauthorized bypasses, parameters changed). Most of these will indicate the presence of some kind of systematic failure.

7. Maintenance

When maintaining the SIS, always inspect and test it before carrying out any maintenance work such as cleaning and repair. Record the ‘as-found’ condition of the SIS, since this more accurately represents the ‘real’ state of the SIS during the majority of its working lifetime.
Investigate the root cause of issues such as loose connections, corrosion or other physical damage, unauthorised or unexpected alterations from design (compare back with the SRS), and any other finding that could compromise the functioning of the SIS.
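
The comparison back to the SRS can be sketched as a simple difference between the specified settings and the as-found record. The tag names, values, and the find_deviations helper below are hypothetical, and the sketch only lists deviations; finding the root cause of each one remains an engineering task.

```python
def find_deviations(srs_settings, as_found):
    """List every setting whose as-found value differs from the SRS.
    Each finding is (name, description, srs_value, as_found_value)."""
    findings = []
    for name in sorted(set(srs_settings) | set(as_found)):
        if name not in as_found:
            findings.append((name, "not recorded as-found", srs_settings[name], None))
        elif name not in srs_settings:
            findings.append((name, "not in SRS", None, as_found[name]))
        elif srs_settings[name] != as_found[name]:
            findings.append((name, "differs from SRS", srs_settings[name], as_found[name]))
    return findings


# Hypothetical values specified in the SRS.
srs_settings = {
    "PT-101.trip_setpoint_barg": 12.0,
    "PT-101.bypass_active": False,
    "XV-101.stroke_time_s": 5.0,
}

# Hypothetical as-found record taken before any maintenance work.
as_found = {
    "PT-101.trip_setpoint_barg": 12.5,  # altered from design
    "PT-101.bypass_active": True,       # override left in place
    "XV-101.stroke_time_s": 5.0,
}

for name, description, specified, found in find_deviations(srs_settings, as_found):
    print(f"{name}: {description} (SRS: {specified!r}, as-found: {found!r})")
```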

References:

Functional Safety from Scratch by Peter Clarke, xSeriCon
Safety Instrumented Systems Verification: Practical Probabilistic Calculations by William M. Goble and Harry Cheddie