Error Handling and Custom Fault Injection Instructions
Feature Description
The CtrlUnit controls ECC error injection in the DCache. Each core's L1 DCache has one such controller, which is configured through memory-mapped registers. Each hardware unit that supports ECC has its own Control Bank. MMIO memory access instructions read and write the configuration registers in the CtrlUnit. Once the registers are configured, the L1 DCache triggers an ECC error on the first subsequent read access to the DCache (e.g., from a load instruction or MainPipe).
Feature 1: Address Space
- The address space is 0x38022000-0x3802207F, 128 bytes in total. This space is local to each hart.
Feature 2: DCache Control Bank
- As shown in Figure \ref{fig:CtrlBank}, each Control Bank contains the registers ECCCTL, ECCEID, and ECCMASK. Each register is 8 bytes wide.
- ECCCTL (ECC Control): ECC injection control register.
  - ese (error signaling enable): Enables injection. Initialized to 0. Hardware clears ese after a successful injection.
  - pst: Persistent injection control. When pst==1, after the ECCEID counter counts down to 0 and the injection succeeds, the injection timer is reloaded with the previously configured ECCEID value so that injection repeats; when pst==0, injection occurs only once.
  - ede (error delay enable): Enables the delay counter. Initialized to 0.
    - If ese==1 and ede==0, error injection is active immediately.
    - If ese==1 and ede==1, injection becomes active only after ECCEID counts down to 0.
  - cmp (component): Selects the injection target. Initialized to 0.
    - 1'b0: The injection target is the tag.
    - 1'b1: The injection target is the data.
  - bank: Bank enable bits. Initialized to 0. Setting a bit in bank activates the corresponding mask.
- ECCEID (ECC Error Inject Delay): ECC injection delay counter.
  - When ese==1 and ede==1, the counter counts down until it reaches 0. It currently runs on the core clock; a divided clock could be used instead. Because ECC injection depends on a DCache access, the time at which the counter expires may not coincide with the time at which the ECC error is triggered.
- ECCMASK (ECC Mask): ECC injection mask register.
  - A mask bit of 0 means no inversion; 1 means the corresponding bit is inverted. Tag injection uses only the bits of ECCMASK0 that fall within the tag length; bits beyond this length have no effect.
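The field and mask semantics above can be sketched in a few lines of Python. This is only an illustration, not the hardware definition: the bit positions (ese at bit 0, pst at bit 1, ede at bit 2, cmp at bit 3, bank starting at bit 4) are assumptions that are not stated in this section. They were chosen so that ese=1, ede=1, cmp=0 encodes to 0x5, the value written to EccCtl in the Configuration Register Timing example. The authoritative layout is the one in Figure \ref{fig:CtrlBank}. The names `pack_eccctl` and `apply_mask` are hypothetical.

```python
# ASSUMED bit positions, for illustration only; see the Control Bank figure
# for the authoritative layout.
ESE_BIT, PST_BIT, EDE_BIT, CMP_BIT, BANK_SHIFT = 0, 1, 2, 3, 4


def pack_eccctl(ese=0, pst=0, ede=0, cmp=0, bank=0):
    """Pack the ECCCTL fields into one integer under the assumed layout."""
    return (
        (ese << ESE_BIT)
        | (pst << PST_BIT)
        | (ede << EDE_BIT)
        | (cmp << CMP_BIT)
        | (bank << BANK_SHIFT)
    )


def apply_mask(stored_bits, mask, width):
    """Invert each stored bit whose mask bit is 1.

    Mask bits at or above `width` are ignored, mirroring the rule that
    ECCMASK0 bits beyond the tag length have no effect.
    """
    return stored_bits ^ (mask & ((1 << width) - 1))
```

For example, `pack_eccctl(ese=1, ede=1)` yields 0x5 under these assumptions, and `apply_mask(0b1010, 0xFF, 4)` flips all four stored bits.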
Feature 3: Bus Error Unit Controller
- DCache ECC errors are all sent to the Bus Error Unit (BEU) controller for processing. The information stored by the Bus Error Unit controller includes:

  Table: Information Stored by Bus Error Unit

  | Field | Description | Initial Value | Address |
  | --- | --- | --- | --- |
  | cause | Reason for the error event | 0 | 0x38010000 |
  | value | Physical address of the error event | Undefined | 0x38010008 |
  | enable | Event enable mask | 1 | 0x38010010 |
  | global_interrupt | Global interrupt enable mask | 0 | 0x38010018 |
  | accrued | Accrued event mask | 0 | 0x38010020 |
  | local_interrupt | Hart local interrupt enable mask | 0 | 0x38010028 |

- Address Space
  - The Bus Error Unit physical address space is 0x38010000 - 0x38010fff.
- Supported Error Types
  - ICache ECC Error
  - DCache ECC Error
  - L2Cache ECC Error
- Controlled Interrupts
  - Local Interrupt: Reported only to the hart in which the Bus Error Unit resides. It is delivered to the backend, which is responsible for interrupt handling. The NMI_31 interrupt is currently used.
  - Global Interrupt: When a global interrupt occurs, the Bus Error Unit sends the interrupt information to the PLIC, and the PLIC is responsible for reporting the interrupt.
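For software that needs to touch these registers, the table can be captured as a small register map. The sketch below is a minimal illustration: the offsets and address range come from the table and address space above, while the names `BeuReg` and `beu_addr` are hypothetical and do not come from any existing driver.

```python
from enum import IntEnum

BEU_BASE = 0x38010000  # start of the Bus Error Unit address space
BEU_END = 0x38010FFF   # last byte of the Bus Error Unit address space


class BeuReg(IntEnum):
    """Register offsets from BEU_BASE, taken from the table above."""
    CAUSE = 0x00
    VALUE = 0x08
    ENABLE = 0x10
    GLOBAL_INTERRUPT = 0x18
    ACCRUED = 0x20
    LOCAL_INTERRUPT = 0x28


def beu_addr(reg):
    """Return the physical address of a BEU register, checking the range."""
    addr = BEU_BASE + int(reg)
    if not BEU_BASE <= addr <= BEU_END:
        raise ValueError(f"address {addr:#x} is outside the BEU space")
    return addr
```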
Feature 4: L1 DCache ECC Error Handling Flow
- Reporting Errors
  - Tag ECC Error: If an ECC error is detected in any way, a tag ECC error is considered to have occurred. The table below gives the relationship between tag hit, error detection, and the final determination.

    Table: Tag ECC Error and Tag Hit Relationship

    | Hit | Error Detected | Tag Error Result |
    | --- | --- | --- |
    | N | N | N |
    | N | Y | Y (probably hit) |
    | Y | N | N |
    | Y | Y (hit with error) | Y |
    | Y | Y (hit with no error) | N |

  - Data ECC Error: If the hit line has an ECC error, a data ECC error is considered to have occurred. If the access does not hit, the error is not processed.
  - If an instruction's access triggers an ECC error, it is treated as a hardware error and an exception is reported.
  - Whenever an error is triggered, the error information must be sent to the BEU. When hardware detects an error, it reports it to the BEU, which triggers an NMI external interrupt.
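The tag hit table can be read as a small decision function. The sketch below is one possible reading, not a statement of the RTL: it assumes that "hit with error" means the ECC error is in the hit way itself, and that "hit with no error" means an error was detected only in some other way. The function name `tag_error_result` and its arguments are hypothetical.

```python
from typing import Optional, Sequence


def tag_error_result(way_errors: Sequence[bool], hit_way: Optional[int]) -> bool:
    """Decide whether a tag ECC error is reported for one access.

    way_errors[i] is True if way i's tag failed its ECC check.
    hit_way is the index of the hitting way, or None on a miss.
    """
    if hit_way is None:
        # Miss: an error in any way counts, since the corrupted tag
        # might have been the one that should have hit.
        return any(way_errors)
    # Hit: only an error in the hit way itself counts (assumed reading).
    return way_errors[hit_way]
```

Under this reading, the five table rows correspond to: a clean miss, a miss with an error in some way, a clean hit, a hit whose own way is in error, and a hit where only another way is in error.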
- Ordinary Memory Access Instructions
  - For ordinary memory access instructions, such as loads, only tag or data ECC errors are triggered during execution. The error is reported to the BEU, and a Hardware Error exception (19) is reported.
- Probe/Snoop
  - If a tag ECC error occurs, the cache state does not need to change, and a ProbeAck with corrupt=1 must be returned to L2.
  - If a data ECC error occurs, the cache state changes according to the normal rules. If data needs to be returned, a ProbeAckData with corrupt=1 must be returned to L2.
- Replace/Evict
  - If a tag ECC error occurs, a Release with corrupt=1 must be sent to L2.
  - If a data ECC error occurs, a ReleaseData with corrupt=1 must be sent to L2.
- Store to DCache (Sbuffer writing data to the DCache)
  - If a tag ECC error occurs, release the cache line following the Replace/Evict flow, then write the data into the DCache. The error is not reported to L2.
  - If a data ECC error occurs, write the data directly. The error is not reported to L2.
- Atomics
  - For atomic operations, an exception is reported, but the error is not reported to L2.
- Multiple Error Selection
  - If multiple errors occur simultaneously, the priority is ldu0 > ldu1 > ldu2 > MainPipe.
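The per-operation rules and the priority order above can be summarized as a lookup table and a fixed-priority selector. This is a minimal sketch for reasoning about the flow, not the hardware arbitration logic. The operation labels (`"probe"`, `"replace"`, `"store"`, `"atomic"`) and the names `L2_REPORT` and `select_error` are hypothetical; the message names and the priority order are taken from the text.

```python
# (operation, error kind) -> message sent to L2 with corrupt=1,
# or None when the error is not reported to L2.
L2_REPORT = {
    ("probe", "tag"): "ProbeAck",
    ("probe", "data"): "ProbeAckData",  # only if the probe must return data
    ("replace", "tag"): "Release",
    ("replace", "data"): "ReleaseData",
    ("store", "tag"): None,
    ("store", "data"): None,
    ("atomic", "tag"): None,
    ("atomic", "data"): None,
}

# Highest priority first, as listed under Multiple Error Selection.
PRIORITY = ("ldu0", "ldu1", "ldu2", "MainPipe")


def select_error(sources_with_error):
    """Return the highest-priority source reporting an error, or None."""
    for source in PRIORITY:
        if source in sources_with_error:
            return source
    return None
```

For instance, if ldu1 and MainPipe report errors in the same cycle, `select_error({"MainPipe", "ldu1"})` picks ldu1.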
\newpage
Overall Block Diagram
Interface Timing
Configuration Register Timing
- Configuration registers are read and written through the TileLink interface. As shown in Figure \ref{fig:DCache-Error-Config-Timing}, the A channel carries the write address and data:
  - Write 0xff to the EccMask0 register at address 0x38022010;
  - Write 0x4 to the EccEid register at address 0x38022008;
  - Write 0x5 to the EccCtl register at address 0x38022000.
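The three writes above can be expressed as a short configuration routine. The sketch below only records the sequence; it does not perform real MMIO. The addresses and values are those listed above, while `configure_injection` and the caller-supplied `write64(addr, value)` callback are hypothetical stand-ins for whatever 64-bit store mechanism the test environment provides.

```python
CTRL_BASE = 0x38022000          # start of the per-hart CtrlUnit space
ECCCTL = CTRL_BASE + 0x00       # control register
ECCEID = CTRL_BASE + 0x08       # injection delay register
ECCMASK0 = CTRL_BASE + 0x10     # first mask register


def configure_injection(write64):
    """Issue the example configuration through a caller-supplied store.

    write64(addr, value) is a hypothetical 64-bit MMIO write callback.
    EccCtl is written last, following the order of the example above.
    """
    write64(ECCMASK0, 0xFF)
    write64(ECCEID, 0x4)
    write64(ECCCTL, 0x5)


# Usage: record the writes instead of touching hardware.
log = []
configure_injection(lambda addr, value: log.append((addr, value)))
```

Writing EccCtl last matches the listed order; a plausible reason is that the mask and delay are then already in place when injection is enabled, though the source does not state this explicitly.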
Tag Injection Timing
- As shown in Figure \ref{fig:DCache-Error-TagInj-Timing}, after the registers (EccCtl, EccEid, and EccMask0) are configured, injection starts when the timer counts down to 0:
  - The tag injection interface io_pseudoError_0_valid is asserted.
  - After a successful injection (i.e., io_pseudoError_0_valid && io_pseudoError_0_ready == 1), the ese bit of EccCtl is cleared to zero, ending the injection.
  - Taking MainPipe as an example, s1_tag_error, s2_tag_error, and s3_tag_error are asserted stage by stage, and the error information is finally reported to the BEU via the io_error port.
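The countdown and valid/ready handshake can be modeled in software to check one's understanding of the timing. The following is a toy cycle-level sketch, not the RTL. It makes several assumptions that the text does not confirm: the counter decrements once per modeled cycle, ese stays set when pst==1 so that injection can repeat, and `ready` is supplied by the caller. The class name `InjectModel` and its methods are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InjectModel:
    """Toy model of one injection channel (assumed behavior, not the RTL)."""
    ese: bool = False   # injection enable
    pst: bool = False   # repeat injection after each success
    ede: bool = False   # delay counter enable
    eid: int = 0        # value programmed into ECCEID
    counter: int = 0    # live countdown value

    def configure(self, ese, pst, ede, eid):
        self.ese, self.pst, self.ede, self.eid = ese, pst, ede, eid
        self.counter = eid

    @property
    def valid(self):
        """Models io_pseudoError_*_valid: immediate without ede, else at 0."""
        return self.ese and (not self.ede or self.counter == 0)

    def step(self, ready):
        """Advance one cycle; return True if injection fired (valid && ready)."""
        fired = self.valid and ready
        if fired:
            if self.pst:
                self.counter = self.eid   # reload the timer for re-injection
            else:
                self.ese = False          # one-shot: ese is cleared
        elif self.ese and self.ede and self.counter > 0:
            self.counter -= 1             # still counting down
        return fired
```

With `configure(ese=True, pst=False, ede=True, eid=4)` and `ready` held high, the model counts down for four cycles, fires on the fifth, and then stays idle because ese has been cleared.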
\newpage
Data Injection Timing
- As shown in Figure \ref{fig:DCache-Error-DataInj-Timing}, after the registers (EccCtl, EccEid, and EccMask2) are configured, injection starts when the timer counts down to 0:
  - The data injection interface io_pseudoError_1_valid is asserted.
  - After a successful injection (i.e., io_pseudoError_1_valid && io_pseudoError_1_ready == 1), the ese bit of EccCtl is cleared to zero, ending the injection.
  - Taking MainPipe as an example, s2_data_error and s3_data_error are asserted stage by stage, and the error information is finally reported to the BEU via the io_error port.