Author: Kaivan Karimi
Date: July 5, 2018
Werner Heisenberg, who was born on December 5, 1901, is known as the father of the Uncertainty Principle. He also was the first person to discuss the “observer effect” of quantum mechanics, which states that the act of observing a system inevitably alters its state.
Functional safety is part of the overall safety of a system or piece of equipment and generally focuses on electronics and related software. It covers the aspects of safety that relate to the function of a device or system, ensuring that it responds correctly to the commands it receives. Taking a systematic approach, functional safety identifies potentially dangerous conditions, situations, or events that could result in an accident that harms somebody or destroys something. ISO 26262, an adaptation of IEC 61508 for automotive electric/electronic systems, defines state-of-the-art practices for addressing two types of failures that could lead to malfunctions of systems or subsystems in a vehicle: systematic failures and random hardware failures.
Security failures are the third type of failure that should be considered, but they are not directly covered by ISO 26262. In this blog, I am focused on safety alone, although nothing can be safe if it isn’t secure.
Systematic failures are the mistakes or oversights in design resulting from human error somewhere along the development process.
For random failures, let’s consider autonomous cars, which contain the most complex hardware and software ever deployed by automakers. These self-driving marvels of engineering are fueling demand for extremely powerful CPUs, GPUs, and multiprocessor compute engines in the automotive industry. Semiconductor manufacturers are busy building the next generation of these powerful compute engines, pushing technology to the point where, for the first time in history, hardware is becoming less reliable. This problem arises from two major factors that can cause random failures: physics and complexity.
In terms of hardware physics, CPUs run at ever-faster clock speeds, producing more heat, and use ever-shrinking transistors, whose dimensions can now be measured in numbers of atoms. Heat accelerates wear-out: the hotter a part operates, the sooner it fails. Smaller transistors are more susceptible to faults caused by electromagnetic interference, strikes by alpha particles and neutrons, and cross-talk between neighboring cells.
On the complexity front, manufacturers have been adding more and more interrelated functionality to each CPU. Unfortunately, CPUs ship with bugs, many of which are found only after the chip goes into production; known bugs are documented in the manufacturer’s errata sheets. These bugs can affect computations and produce erroneous results, thereby creating safety vulnerabilities. The probability of such errors directly impacts a system’s rating under ISO 26262. ASIL stands for “Automotive Safety Integrity Level” and combines the severity of possible injury with the probability of its occurrence. There are four ASIL levels, labeled A, B, C, and D, with ASIL D being the most stringent.
Adding to the complexity, the software must process a flood of data from sensors such as cameras, LiDAR, and radar in real time to model the car’s surroundings and make safe decisions to control the vehicle. This requires highly efficient, safety-certified, and secure software that can use special-purpose hardware (accelerators) for vision processing and deep-neural-net-based machine learning algorithms. Software of this scale and concurrency is also prone to an especially elusive class of defects. A “Heisenbug” (a pun on the name of Werner Heisenberg) is a bug that disappears or alters its behavior when one attempts to probe or isolate it. For those familiar with software bug terminology, it is the antonym of the Bohrbug, a repeatable bug that manifests reliably under a possibly unknown but well-defined set of conditions. Because of a Heisenbug’s unpredictable nature, the error may change or even disappear when you try to reproduce it or attach a debugger.
The occurrence of such hardware and software errors in an autonomous driving system can compromise its safety. To achieve ISO 26262 safety certification, these errors must be detected and handled efficiently, so system designers must implement mechanisms that compensate for them. To help address these functional safety challenges, BlackBerry QNX has developed QNX Loosely Coupled Lock Step (LCLS) to detect, and recover from, hardware and software errors in autonomous driving systems. In previous-generation systems, hardware lock step has been used to detect faulty CPU operation by having duplicate CPUs execute the same code; if one of the CPUs misbehaved, the divergence revealed that something had gone wrong. However, since both CPUs will “correctly” execute the same code, hardware lock step does not compensate for random bit flips in memory or for Heisenbugs. One could also use a hardware analyzer to check internal states and determine whether something has gone wrong, but this technique is not practical for today’s high-performance hardware, where there are far too many internal states for a hardware checker to analyze in real time.
Clearly, hardware diagnostics on its own is not enough to detect all these errors. When paired with real-time software checking, however, system operation can be verified efficiently and completely. Such a system runs redundant copies of the software, each of which performs the safety-critical calculations, and compares their outputs to verify correct operation. This is the concept behind QNX Loosely Coupled Lock Step.
For more information on this topic, please download the Functional Safety/LCLS whitepaper: https://blackberry.qnx.com/en/forms/whitepaper-automotive-functional-safety