# **Testing and Built-In Fault-Tolerance Against Silent Data Corruptions on Computing Chips**

#### Huawei Li

**Professor at ICT, Chinese Academy of Sciences** 

President and Co-founder of CASTEST Inc.



Shanghai - October 31, 2024





TestConX China Workshop

#### Outline

1. SDCs on Computing Chips
2. Origins of SDCs
3. Testing against SDCs
4. Built-in fault-tolerance (BIFT) against SDCs
   - a. BIFT for Multi-Core Systems
   - b. BIFT for Deep Learning Systems
5. Concluding Thoughts

## **Complexity of typical computing chips**





- Intel Haswell-EP Xeon E5: ~7B transistors
- Intel/Altera Stratix 10: ~30B transistors
- NVIDIA GH100: ~80B transistors
- IBM Power9: ~8B transistors
- Xilinx VU9P: ~35B transistors
- Cerebras WSE-3: ~4T transistors

#### Trillion-transistor computing chips are coming



- Chips with 200 billion transistors on a single piece of silicon at 1nm-class fabrication processes
- Advancements in packaging: massive multi-chiplet solutions packing more than a trillion transistors

#### The hidden killer in data centers

- Amazon Web Services experienced a substantial service outage (July 2008).
- Facebook lost more than 10% of photos in a hard drive failure (May 2009).
- Google: ephemeral computational errors correlated to components in processors
  - application data corruption and crashes; data corruptions exhibited by various load, store, and vector operations; a deterministic AES mis-computation; database index corruption leading to queries being non-deterministically corrupted; ...
- Meta: hundreds of instances of computing errors from processors
  - Spark workloads: core 59 on one processor consistently returned 0 when calculating Int(1.1<sup>53</sup>), rather than 156, yet the same core returned the correct value of 142 for Int(1.1<sup>52</sup>).

Google: "Cores that don't count," HotOS '21, USA. <u>https://doi.org/10.1145/3458336.3465297</u>

Meta: Silent Data Corruptions at Scale. arXiv:2102.11245 (2021). <u>https://arxiv.org/abs/2102.11245</u>

#### The hidden killer in data centers: example

A case in which a computing error causes data loss:

- The files are compressed and stored within a data store.
- Before decompression is performed, the file size is checked to see whether it is greater than 0.
- A valid compressed file with contents would have a non-zero size.
- When the file size is mistakenly computed as 0, the file is not written into the decompressed output database.
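The failure mode above can be reproduced in a few lines. This is a minimal sketch, not the actual data-center pipeline: `restore_files` and `make_faulty_size` are hypothetical names, and the defective core is simulated by a size function that returns 0 for one particular input.

```python
import zlib

def restore_files(store, size_of):
    """Decompress every entry whose reported size is greater than 0."""
    output = {}
    for name, blob in store.items():
        if size_of(blob) > 0:  # guard from the example: "empty" files are skipped
            output[name] = zlib.decompress(blob)
    return output

def make_faulty_size(bad_blob):
    """Hypothetical defective core: miscomputes the size of one input as 0."""
    def size_of(blob):
        return 0 if blob == bad_blob else len(blob)
    return size_of

store = {
    "a.txt": zlib.compress(b"alpha"),
    "b.txt": zlib.compress(b"bravo"),
}

healthy = restore_files(store, len)
corrupted = restore_files(store, make_faulty_size(store["b.txt"]))

# No exception is raised on the faulty run; one file is simply missing.
print(sorted(healthy))    # ['a.txt', 'b.txt']
print(sorted(corrupted))  # ['a.txt']
```

The point of the sketch is that the faulty run completes without any error signal, which is what makes the corruption silent.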



#### Silent Data Corruptions: the hidden killer in data centers



- SDCs are usually rare events, but in a datacenter running millions of computing chips, 24 hours a day, the rare event becomes an expected occurrence.
- SDCs can have a serious impact on large-scale infrastructure services when the data corruptions propagate across the stack and manifest as application-level problems such as service outages or data loss.
- SDCs can also cause serious impact on safety-critical applications such as autonomous driving.




#### **Origins of SDCs**

Soft errors: transient, difficult to reproduce



#### **Origins of SDCs**

- Origins of SDCs
  - operating conditions, design errors, manufacturing defects & variations, aging
  - **subtle defects** which create **circuit marginalities** that fail only under the specific combination of temperature, voltage, frequency, and instruction sequence or data set.
- Fundamental causes of the increasing SDC rate:
  - ever-smaller feature sizes that push closer to the limits of CMOS scaling,
  - ever-increasing complexity in architectural design (e.g., DVFS),
  - steady increases in the number of CPUs installed in a system
- New challenges in detecting diverse manufacturing defects, especially those that manifest in **corner cases** or only after **post-deployment aging**.

#### **Evaluation of SDCs**

- SDC evaluation: *FIT (Failures in Time)*
  - 1 FIT: one error in a billion (10<sup>9</sup>) hours
- The SDC FIT rate of a chip or system: the sum of the SDC FIT rates of all its components
- MTTF is inversely related to FIT.
  - A FIT rate of 1000 is equivalent to an MTTF of $\sim$114 years.

- CPU SDCs
  - evaluated within fault injection studies: one in a million (i.e., 1000 FIT)
  - Observed: one in a thousand in datacenters (i.e., MTTF of  $\sim$  41 days)
- Soft-error-rate budgets of IBM Power4 system
  - Chip level: 114 SDC FIT (1000 yr MTTF)
  - System-kill: 4566 DUE FIT (25 yr MTTF)
  - Process-kill: 11415 DUE FIT (10 yr MTTF)
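The FIT and MTTF figures on this slide follow from the definition MTTF = 10<sup>9</sup> hours / FIT. A small sketch of the conversion, assuming a 365-day year (the function names are illustrative, not from the talk):

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years

def mttf_hours(fit):
    """MTTF in hours for a given FIT rate (failures per 10^9 hours)."""
    return 1e9 / fit

def mttf_years(fit):
    """MTTF in years for a given FIT rate."""
    return mttf_hours(fit) / HOURS_PER_YEAR

def system_fit(component_fits):
    """The FIT rate of a system is the sum of the FIT rates of its components."""
    return sum(component_fits)

for label, fit in [("1000 FIT", 1000), ("chip level", 114),
                   ("system-kill", 4566), ("process-kill", 11415)]:
    print(f"{label}: {mttf_years(fit):.1f} years")

# An MTTF of 1000 hours (about 41.7 days) corresponds to 10^6 FIT.
print(mttf_hours(1e6) / 24)
```

Running it gives roughly 114, 1001, 25, and 10 years for the four FIT values, matching the rounded numbers quoted above.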


#### **Challenge of Detecting SDCs**

- Circuit marginalities can result in random, or unrepeatable, SDCs.
- Detecting these marginal failures during testing can be extremely difficult because it is impossible to **check every combination of conditions** and **potential workloads**.
- Defects can also be **latent**, meaning they do not show up until after the processors have been running for a long time.

Intel: Data Center Silent Data Errors (2024).




#### **Testing against SDCs: design verification**



#### **Testing against SDCs: Manufacturing Testing**



#### Manufacturing testing considering delay variations

#### A cost-effective at-speed test flow



- With nanometer technologies, conventional transition and path delay fault models and at-speed test methodologies are severely challenged!
- The number of critical paths increases due to speed and power-saving techniques.
- Delay variability increases due to process, defect, temperature, power, and noise factors: **small delay faults**


#### SDF testing using SSTA for statistical long paths

- Statistical static timing analysis (SSTA)
  - Circuit delays can be modeled as correlated random variables to take various local & global factors into account
- Delay testing should target statistical long paths whose tests maximize the detection of small delay faults



#### SDF testing considering path correlation

- Considering the path correlation for path selection
  - Achieves higher test quality with the same number of selected paths
  - Selects fewer paths to achieve the same level of test quality
  - Monte Carlo simulation can be used, but it is time-consuming





After selecting path A, should path B or C be selected?

L.-C. Wang, et al., "Critical path selection for delay fault testing based upon a statistical timing model," TCAD 2004.

#### SDF testing with SSTA and path correlation

- Statistical static timing model for a gate/wire:

$$d_a = \mu_a + \sum_{i=1}^{n} a_i z_i + a_{n+1} R$$

  $z_i$, $R$: random variables (RVs) modeling spatially correlated variation

- Circuit delay: the $(n+1)$-dimensional space $S$ is the Cartesian product of all RVs
  - $S_P$: the region where path $P$ meets the delay constraint ($d_P < clk$)
  - $S'$: the region where the circuit meets the delay constraint ($d_{circuit} < clk$)

$$S' = \bigcap_{\text{for each path } P} S_P \qquad S_H = \bigcap_{P_i \in H} S_{P_i}$$

- Path selection: given the number of paths to be selected, find a path set $H$ with the minimal $S_H$, instead of the paths with the several smallest $S_{P_i}$ (the longest paths).

Zijian He, Tao Lv, Huawei Li, Xiaowei Li: Test Path Selection for Capturing Delay Failures Under Statistical Timing Model. IEEE Trans. VLSI Systems (TVLSI), 2013
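The effect of path correlation on selection can be illustrated with a small Monte Carlo experiment. This is a sketch under stated assumptions, not the algorithm of the cited paper: the three paths, their coefficients, the clock period, and the greedy strategy are all hypothetical. Each path delay is a mean plus a weighted sum of shared Gaussian variables plus an independent term, and the size of the region where all selected paths meet timing is estimated as a pass probability.

```python
import random

def sample_delays(paths, n_globals, n_samples, seed=0):
    """Draw Monte Carlo samples of every path delay.

    paths maps a name to (mu, coeffs, a_indep), giving
    d = mu + sum(coeffs[i] * z[i]) + a_indep * R,
    where z is shared by all paths and R is drawn per path.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(n_globals)]
        row = {}
        for name, (mu, coeffs, a_indep) in paths.items():
            shared = sum(a * zi for a, zi in zip(coeffs, z))
            row[name] = mu + shared + a_indep * rng.gauss(0.0, 1.0)
        samples.append(row)
    return samples

def pass_probability(samples, selected, clk):
    """Fraction of samples in which every selected path meets the clock."""
    ok = sum(all(row[p] < clk for p in selected) for row in samples)
    return ok / len(samples)

def greedy_select(samples, candidates, k, clk):
    """Repeatedly add the path that shrinks the joint pass probability most."""
    selected = []
    for _ in range(k):
        remaining = [p for p in candidates if p not in selected]
        best = min(remaining,
                   key=lambda p: pass_probability(samples, selected + [p], clk))
        selected.append(best)
    return selected

# Hypothetical paths: B is nearly as long as A but almost fully correlated
# with it; C is shorter on average but varies independently of A.
PATHS = {
    "A": (9.0, [1.0, 0.0], 0.1),
    "B": (8.9, [1.0, 0.0], 0.1),
    "C": (8.7, [0.0, 1.0], 0.1),
}
samples = sample_delays(PATHS, n_globals=2, n_samples=5000)
chosen = greedy_select(samples, ["A", "B", "C"], k=2, clk=10.0)
print(chosen)  # ['A', 'C']
```

With these numbers, ranking by mean delay alone would pick A and B, while the correlation-aware choice after A is C, because failures of B are almost always already covered by the test for A.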


#### SDF testing considering crosstalk/power noises

- Identifying a test that corresponds to the longest delay along the target path
  - Path delay is highly pattern dependent
- Activating the worst-case crosstalk/power noises during test generation
  - Precise crosstalk-induced path delay fault (PCPDF)
  - Fault collection considering coupling capacitance and timing after place and route



#### Manufacturing testing against SDCs

- Other suggestions for improving test quality
  - Bridging tests: enumerate likely bridging fault sites (interconnects) by layout simulation
  - Cell-aware tests: effective for FinFET technology
  - IDDQ tests: test by measuring current flow
  - TARO (transition fault propagation to all reachable outputs): for each given transition fault, generate tests for each reachable output
  - N-detect stuck-at / transition tests: detect every stuck-at / transition fault *n* times by targeting different sensitized paths
    - The pattern count increases approximately linearly with *n*
    - Effective test selection to choose a small test set with high test quality

Dawen Xu, et al.: Test-Quality Optimization for Variable n-Detections of Transition Faults Prediction (TVLSI 2014)
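The idea of choosing a small test set that still detects every fault *n* times can be sketched with a simple greedy selection. This is an illustrative simplification, not the method of the cited paper: `select_n_detect` and the fault dictionary are hypothetical, and a detection is counted per distinct pattern rather than per distinct sensitized path.

```python
def select_n_detect(pattern_faults, n):
    """Greedily choose patterns until each fault is detected n times.

    pattern_faults maps a pattern name to the set of faults it detects.
    A fault detectable by fewer than n patterns is covered as often as possible.
    """
    all_faults = set().union(*pattern_faults.values())
    # Detections still needed per fault, capped by what the patterns can offer.
    need = {
        f: min(n, sum(f in faults for faults in pattern_faults.values()))
        for f in all_faults
    }
    remaining = dict(pattern_faults)
    chosen = []
    while any(need.values()):
        # Pick the pattern that satisfies the most outstanding detections.
        best = max(remaining,
                   key=lambda p: sum(1 for f in remaining[p] if need[f] > 0))
        for f in remaining.pop(best):
            if need[f] > 0:
                need[f] -= 1
        chosen.append(best)
    return chosen

# Hypothetical fault dictionary from fault simulation.
PATTERN_FAULTS = {
    "t1": {"f1", "f2"},
    "t2": {"f2", "f3"},
    "t3": {"f1", "f3"},
    "t4": {"f1"},
}
print(select_n_detect(PATTERN_FAULTS, 1))  # ['t1', 't2']
print(select_n_detect(PATTERN_FAULTS, 2))  # ['t1', 't2', 't3']
```

In this toy case the selected set grows from two to three patterns when *n* goes from 1 to 2, and pattern `t4` is never needed.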



#### System-level tests in DCDiag (Intel)



- Golden value tests: e.g. the square root of 2 or the SHA-1 checksum of a fixed input, with expected value
- **Cross-thread comparisons**: all cores run the same sequence of instructions using the same dataset, while the output generated by each core is compared against that produced by other cores.
  - Variability: randomly generated datasets and/or randomizing the order or selection of processor instructions
- Inverse transformation tests: execute two operations back-to-back to arrive at the original input, e.g.
  - compression and decompression, encryption and decryption, on randomly generated dataset
- Others: non-compute functions, e.g., core-to-core / socket-to-socket communications, caches, interrupts, ...
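The three test styles can be sketched in plain Python. This is only an illustration of the ideas, not DCDiag itself: the function names are made up, the golden values are the standard SHA-1 test vector for `"abc"` and the two powers of 1.1 quoted earlier in the talk, and the thread pool is a stand-in because standard Python does not pin workers to specific cores.

```python
import hashlib
import os
import random
import zlib
from concurrent.futures import ThreadPoolExecutor

def golden_value_test():
    """Compare fixed computations against results known in advance."""
    sha_ok = (hashlib.sha1(b"abc").hexdigest()
              == "a9993e364706816aba3e25717850c26c9cd0d89d")
    pow_ok = int(1.1 ** 53) == 156 and int(1.1 ** 52) == 142
    return sha_ok and pow_ok

def inverse_transformation_test(n_bytes=4096):
    """Compress then decompress random data and expect the original back."""
    data = os.urandom(n_bytes)
    return zlib.decompress(zlib.compress(data)) == data

def workload(seed):
    """Deterministic integer workload driven by a seeded random dataset."""
    rng = random.Random(seed)
    acc = 0
    for _ in range(10_000):
        acc = (acc * 31 + rng.getrandbits(32)) & 0xFFFFFFFF
    return acc

def cross_worker_test(n_workers=4, seed=1234):
    """Run the same workload on several workers and compare the outputs."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(workload, [seed] * n_workers))
    return len(set(results)) == 1

print(golden_value_test(), inverse_transformation_test(), cross_worker_test())
```

On healthy hardware all three checks return `True`; a real diagnostic would additionally bind each worker to a chosen core and vary the seed between runs.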

https://www.intel.com/

#### **Testing against SDCs: Summary of system-level tests**

- Tests conducted **periodically in the system environment**
- Test programs of SLT
  - check every instruction on each core, all the caches, core-to-core communications, memory interfaces, uncore functions, with the purpose of exercising a high percentage of transistors.
  - rely on pseudo-random data and combinations of instructions
  - repeated looping for **millions of clock cycles** to span the vast data, address, and instruction space
- At a rate of 10 failures in time (FIT) for each chip, a data center of modest size (100,000 chips) is likely to experience at least one SDC event every month.
- To minimize the rate of SDCs, periodic testing of data center infrastructure to identify defective components is a critical aspect of maintenance.
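The fleet-level figure can be checked with a short calculation, assuming independent chips, a constant failure rate (a Poisson model), and a 30-day month of 720 hours:

$$
\begin{aligned}
\lambda_{\text{fleet}} &= 10\ \text{FIT} \times 10^{5}\ \text{chips} = 10^{6}\ \text{FIT} = 10^{-3}\ \text{events per hour} \\
\text{MTTF}_{\text{fleet}} &= \frac{10^{9}\ \text{h}}{10^{6}} = 1000\ \text{h} \approx 41.7\ \text{days} \\
P(\text{at least one event in 30 days}) &= 1 - e^{-10^{-3} \times 720} \approx 0.51
\end{aligned}
$$

Under these assumptions the fleet sees an SDC event about every six weeks on average, so an event in any given month is more likely than not.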

Intel: Optimization of Tests for Managing Silicon Defects in Data Centers <u>https://ieeexplore.ieee.org/document/9983919</u>



#### **Classical fault tolerance: error detection & recovery**

- Information redundancy: parity, checksum, SECDED ECC (a single error is detected and can be corrected; a double error is detected but cannot be corrected)
- Space redundancy: TMR with one voter, TMR with three voters, DMR, standby
- Time redundancy: recompute & compare, checkpoint & rollback




#### a. BIFT for Multi-core Automotive MCUs

- Lockstep offers a high error coverage at the cost of >100% area & power consumption overhead.
- Parallel Error Detection (PED) Using Heterogeneous Cores [DSN-18]
  - several lower-performance cores to run the program segments of the main core



#### **Problems:**

- Reduces the main core's performance
  - The beginning and end of each program segment are used as checkpoints.
  - At each checkpoint, the main core must suspend committing instructions for several clock cycles and copy the register states to the check core.
- Causes high error detection latency
  - The check cores are much slower than the main core, so they take much longer to run each segment.
  - The result is a high error detection latency (15,000 cycles).
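The checkpoint-and-replay idea behind PED can be modeled in software. This is a toy sketch, not the hardware design from the cited work: the register names, the instruction format, and the single-bit fault model are all assumptions made for illustration.

```python
import operator

def run_segment(state, segment):
    """Execute (dest, op, src_a, src_b) instructions on a copy of the registers."""
    regs = dict(state)
    for dest, op, src_a, src_b in segment:
        regs[dest] = op(regs[src_a], regs[src_b])
    return regs

def main_core(initial, segments, fault_at=None):
    """Run all segments and record a (start, end) checkpoint pair for each.

    fault_at=(segment_index, register) flips bit 0 of that register at the
    end of the segment, as a hypothetical model of a computing error.
    """
    checkpoints = []
    state = dict(initial)
    for index, segment in enumerate(segments):
        start = dict(state)
        state = run_segment(state, segment)
        if fault_at is not None and fault_at[0] == index:
            state[fault_at[1]] ^= 1
        checkpoints.append((start, dict(state)))
    return checkpoints

def check_cores(segments, checkpoints):
    """Replay each segment from its start checkpoint and compare end states.

    Segments are checked independently, so slower cores can work in parallel.
    """
    return [
        index
        for index, (segment, (start, end)) in enumerate(zip(segments, checkpoints))
        if run_segment(start, segment) != end
    ]

SEGMENTS = [
    [("r1", operator.add, "r0", "r0")],
    [("r2", operator.mul, "r1", "r0")],
    [("r0", operator.sub, "r2", "r1")],
]
INITIAL = {"r0": 3, "r1": 0, "r2": 0}

print(check_cores(SEGMENTS, main_core(INITIAL, SEGMENTS)))                      # []
print(check_cores(SEGMENTS, main_core(INITIAL, SEGMENTS, fault_at=(1, "r2"))))  # [1]
```

Only the segment in which the error occurred is flagged: the next segment starts from the already corrupted checkpoint, so its replay agrees with the main core. The sketch also shows where the costs listed above come from: every checkpoint requires a copy of the register state, and detection waits until a check core finishes its replay.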



*Figure: dual-core lockstep compared with PED (a main core checked by several detection cores).*

#### **PED with Low Performance Impact on the Main Core**

- Lockstep offers a high error coverage at the cost of >100% area & power consumption overhead.
- PED using heterogeneous cores (several lower-performance cores run the segments of the main core) reduces the main core's performance and causes high error detection latency (15,000 cycles).



#### checkpoint state copy

- Stalls the release of physical registers corresponding to checkpoint states
- No need to stop committing instructions while copying register states, which reduces the impact of error detection on the main core's performance
- The performance impact on the main core: under 1%





#### **PED with Low error detection Latency**




- The main core's control flow is used to guide the check core's instruction fetch, ensuring correct fetching each time and eliminating the overhead of branch mispredictions on the check core.
- The performance of the check core improves by an average of 15%.
- The error detection latency is controlled within 2,000 cycles, far less than the previous method's 15,000 cycles.



#### Implementation of Low Latency PED on RISC-V processors



- Baseline: Lockstep with more than 100% area and power overhead.
- When using 12 check cores, the performance overhead is 1% on average, the logic area overhead is 38%, the memory area overhead is 17%, while the power overhead is 16.4%.
- When using 16 check cores, the performance overhead is 0.1% on average, the logic area overhead is 50%, the memory area overhead is 22%, while the power overhead is 21.8%.
- The error detection latency is controlled within 2,000 cycles.

Zhefan Lv, et al.: Heterogeneous Architecturally Parallel Error Detection with Low Error Detection Latency for Highly Reliable Automotive Electronic Systems (JCAD 2023)




#### **Recomputing based BIFT for Deep Learning Systems (HyCA)**





#### **Concluding Thoughts**

- For large-scale data center applications, **SDCs** caused by **circuit marginalities** that fail only under specific conditions can no longer be ignored.
- Improving the test coverage of hardware is the fundamental reliability pursuit, but the cost of testing is high.
- Manufacturing testing against SDCs
  - targeting **statistical long paths** with the consideration of **path correlation** during delay testing to improve capture probabilities on **small delay defects**.
  - activating the **worst-case crosstalk/power noises** during test generation with the consideration of **layout** information
- System-level testing against SDCs
  - Test programs can be used to check every instruction, with **pseudo-random data** and **combinations of instructions**, and **repeated looping for millions of clock cycles**, to span the vast data, address, and instruction space

- It is important to have **light-weight error detection** mechanisms with **low error detection latency** to guarantee real-time recovery in critical applications like in the automotive scenario.
- For the large-scale PEs in deep learning accelerators, it is better to have **built-in architecture-level fault tolerance** to repair faults of arbitrary distributions, such as the presented HyCA.
- More architecture-level fault tolerance solutions can be designed for different computing architectures.
- Silicon lifecycle management with a sensor-rich architecture will become a promising system-level solution.
  - PVT monitors, path margin monitors, functional monitors
  - DFT & BIST resources reused at system level
  - Data-driven learning-based approaches using chip telemetry infrastructure

## Thanks & Q&A

Thanks to my colleagues and students:

- Minjin Zhang, Xiang Fu, Zijian He, Dawen Xu, etc., for work on delay testing of SDFs
- Tiancheng Wang, Yonghao Wang, etc., for work on instruction-level testing
- Cheng Liu, Zhefan Lv, etc., for work on built-in fault tolerance



CASTEST Inc.: <u>http://www.castest.com.cn/</u>

**Providing Silicon Life-Cycle Solutions for Test & Reliability** 




# **COPYRIGHT NOTICE**

The presentations in this publication comprise the pre-workshop Proceedings of the TestConX China workshop. They reflect the authors' opinions and are reproduced here as they are planned to be presented at TestConX China. Updates from this version of the papers may occur in the version that is actually presented at TestConX China. The inclusion of the papers in this publication does not constitute an endorsement by TestConX or the sponsors.

There is NO copyright protection claimed by this publication. However, each presentation is the work of the authors and their respective companies: as such, it is strongly encouraged that any use reflect proper acknowledgement to the appropriate source. Any questions regarding the use of any materials presented should be directed to the author/s or their companies.

The TestConX logo, 'TestConX', and 'TestConX China' are trademarks of TestConX.



