Error Handling
- Version: V2R2
- Status: OK
- Date: 2025/04/24
Terminology Explanation
Abbreviation | Full Name | Description |
---|---|---|
ICache/I$ | Instruction Cache | L1 Instruction Cache |
DCache/D$ | Data Cache | L1 Data Cache |
L1 Cache/L1$ | Level One Cache | L1 Cache |
L2 Cache/L2$ | Level Two Cache | L2 Cache |
L3 Cache/L3$ | Level Three Cache | L3 Cache |
BEU | Bus Error Unit | Bus Error Unit |
MMIOBridge | Memory-Mapped I/O Bridge | Memory-Mapped I/O Bridge |
ECC | Error Correction Code | Error Correction Code |
SECDED | Single Error Correct Double Error Detect | Single Error Correct Double Error Detect |
TL | TileLink | TileLink Bus Protocol |
CHI | Coherent Hub Interface | AMBA CHI Bus Protocol |
Design Specifications
- Support ECC verification
- Support CHI DataCheck
- Support CHI Poison
Cached Memory Access Request Error Handling
Basic error handling logic: the cache level that detects an error reports it, and the error status associated with the address is saved and propagated.
- L2 Cache reports ECC/DataCheck errors detected in the L2 Cache to the BEU, which raises an interrupt to report the error to software.
- For requests from L1/L3 Cache, L2 Cache signals the detected error type to L1/L3 Cache in its response on the bus.
- For erroneous data received from L1/L3 Cache, L2 Cache records the error type in the meta.
ECC
ECC Check Code
L2 Cache currently uses SECDED as its default ECC check code. It also supports parity, SEC, and other check codes; the code is selected in Configs and fixed at compile time.
SECDED requires that, for n-bit data, the number of check bits r satisfies: \(2^r \geq n + r + 1\)
ECC Processing Flow
L2 Cache supports ECC. When the MainPipe refills data into the Directory and DataStorage in s3, it calculates a checksum for the tag and for the data. The tag checksum is stored together with the tag in the tagArray (SRAM) in the Directory, and the data checksum is stored together with the data in the array (SRAM) in DataStorage.
- For tags, ECC encoding/decoding is performed on the whole tag as a single unit.
- For data, because of the physical design and the need for better error detection, data is currently split into units of dataBankBits (128 bits) for ECC encoding/decoding. Under the SECDED requirement above, one 512-bit cache line therefore carries \(4 \times 8 = 32\) check bits.
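The check-bit count above can be reproduced with a few lines of arithmetic. The following is a minimal Python sketch, not part of the L2 Cache code base; the function name `min_check_bits` and the constants are illustrative assumptions, and it applies the inequality exactly as quoted in this section. Some SECDED constructions add one further overall-parity bit per unit on top of this bound; that variant is not modeled here.

```python
def min_check_bits(n: int) -> int:
    """Smallest r such that 2**r >= n + r + 1 (the inequality quoted in the text)."""
    r = 0
    while (1 << r) < n + r + 1:
        r += 1
    return r


# Values taken from the text; the constant names are hypothetical.
DATA_BANK_BITS = 128   # size of one ECC unit for data
CACHE_LINE_BITS = 512  # size of one cache line

units_per_line = CACHE_LINE_BITS // DATA_BANK_BITS
bits_per_unit = min_check_bits(DATA_BANK_BITS)
line_check_bits = units_per_line * bits_per_unit

print(units_per_line, bits_per_unit, line_check_bits)  # prints: 4 8 32
```

Splitting the line into four 128-bit units costs more check bits than one code over all 512 bits would, but each unit can detect and correct errors independently.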
When a memory access request reads the SRAM, the corresponding checksum is read out at the same time. MainPipe obtains the tag check result in s2 and the data check result in s5. When MainPipe detects an error, it collects the error information in s5. CoupledL2 then arbitrates among the error signals from all Slices and reports them to the BEU.
Bus Ports
TL Bus
When L2 Cache receives data from L1/L3 Cache, if it detects an error (denied/corrupt = 1), the MainPipe will set tagErr/dataErr in the corresponding meta to 1 when writing to the Directory in s3.
When L2 Cache transfers data to L1/L3, if L2 Cache detects an ECC error or tagErr/dataErr = 1 in the corresponding meta, it will set denied/corrupt to 1 in the signal of the corresponding channel (e.g., D channel GrantBuffer); otherwise, all are set to 0.
- In particular, for data returned on the TL D channel, if denied = 1, the corresponding corrupt must also be set to 1. Under the current design, L2 Cache should not assume that L1 Cache holds a copy of the corresponding data (L1 Cache will directly discard the corresponding copy during a subsequent Release).
- In particular, since the TL C channel has only the corrupt field and no denied field, the opcode field is used to distinguish denied from corrupt. For example, in SinkC:

      task.corrupt := c.corrupt && (c.opcode === ProbeAckData || c.opcode === ReleaseData)
      task.denied := c.corrupt && (c.opcode === ProbeAck || c.opcode === Release)
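The two TileLink special cases above can be checked with a small behavioral model. This is a hedged Python sketch, not the Chisel RTL: the names `decode_c_channel` and `normalize_d_channel` are hypothetical, and opcodes are represented as plain strings rather than their bus encodings.

```python
# Behavioral model only; opcode names are strings, not TileLink encodings.
DATA_OPCODES = {"ProbeAckData", "ReleaseData"}
NO_DATA_OPCODES = {"ProbeAck", "Release"}


def decode_c_channel(opcode: str, corrupt: bool):
    """Split the single C-channel corrupt bit into (corrupt, denied),
    mirroring the SinkC expressions quoted above."""
    task_corrupt = corrupt and opcode in DATA_OPCODES
    task_denied = corrupt and opcode in NO_DATA_OPCODES
    return task_corrupt, task_denied


def normalize_d_channel(denied: bool, corrupt: bool):
    """On the D channel, denied = 1 forces corrupt = 1 for returned data."""
    return denied, corrupt or denied


print(decode_c_channel("ReleaseData", True))  # (True, False)
print(decode_c_channel("ProbeAck", True))     # (False, True)
print(normalize_d_channel(True, False))       # (True, True)
```

A message with corrupt = 0 decodes to no error regardless of opcode, so the opcode only matters once an error is flagged.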
CHI Bus
L2 Cache supports configurable Poison/DataCheck:
- Poison field:
  - One Poison bit is set for every 8 bytes in DAT.
  - L2 Cache uses an over-poison strategy for Poison.
  - Poison errors are not reported by L2 Cache.
- DataCheck field:
  - One DataCheck bit is set for every 8 bits in DAT.
  - L2 Cache defaults to odd parity for DataCheck.
  - DataCheck only verifies data in L2 Cache, not the entire packet.
  - DataCheck verification errors are reported by L2 Cache.
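The granularity of the two fields can be illustrated with a short Python sketch. It is a behavioral model under stated assumptions, not the L2 Cache implementation: the helper names are hypothetical, and `over_poison` encodes one plausible reading of the over-poison strategy (any set Poison bit marks every granule as poisoned), which this section does not spell out.

```python
from typing import List


def datacheck_bits(data: bytes) -> List[int]:
    """One odd-parity bit per byte: each byte plus its check bit
    contains an odd number of 1s."""
    return [1 - (bin(b).count("1") & 1) for b in data]


def datacheck_ok(data: bytes, check: List[int]) -> bool:
    """True if every byte/check-bit pair still has odd parity."""
    return all((bin(b).count("1") + c) & 1 == 1 for b, c in zip(data, check))


def poison_width(data_bytes: int) -> int:
    """One Poison bit per 8 bytes of data."""
    return data_bytes // 8


def over_poison(poison: List[int]) -> List[int]:
    """Assumed over-poison policy: one poisoned granule poisons them all."""
    return [1 if any(poison) else 0] * len(poison)


beat = bytes([0x00, 0x01, 0x03, 0xFF])
check = datacheck_bits(beat)
print(check)                      # [1, 0, 1, 1]
print(datacheck_ok(beat, check))  # True
print(over_poison([0, 1, 0, 0]))  # [1, 1, 1, 1]
```

For a 32-byte data beat this gives 32 DataCheck bits and 4 Poison bits.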
When L2 Cache receives data from L3 Cache, if an error is detected:
- If respErr = NDERR, the corresponding data will not be written to the L2 Cache, but the remaining pipeline processing will be completed (e.g., for an Acquire request from L1 Cache, L2 Cache will return data and set denied and corrupt to 1).
- If respErr = NDERR/DERR or any bit in the poison field is 1 or DataCheck odd parity verification detects an error, the MainPipe will set dataErr in the corresponding meta to 1 when writing to the Directory in s3.
- If DataCheck verification detects an error, the ECC error reporting process is reused. After the MainPipe collects error information in s5, it reports to the BEU.
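The three receive-side rules above can be summarized as a decision function. This Python sketch is a behavioral model that follows the rules literally; `RxDecision` and `handle_rx_data` are hypothetical names, and the `RespErr` values follow the standard CHI encoding.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Sequence


class RespErr(IntEnum):
    OK = 0b00
    EXOK = 0b01
    DERR = 0b10
    NDERR = 0b11


@dataclass
class RxDecision:
    write_data: bool     # data is written into the L2 Cache
    data_err: bool       # dataErr recorded in the meta
    report_to_beu: bool  # error is reported through the ECC path


def handle_rx_data(resp_err: RespErr, poison: Sequence[int],
                   datacheck_error: bool) -> RxDecision:
    # Rule 1: NDERR data is not written, but the pipeline still completes.
    write_data = resp_err != RespErr.NDERR
    # Rule 2: any of these conditions marks dataErr in the meta.
    data_err = (resp_err in (RespErr.NDERR, RespErr.DERR)
                or any(poison) or datacheck_error)
    # Rule 3: only DataCheck failures are reported to the BEU.
    return RxDecision(write_data, data_err, datacheck_error)


print(handle_rx_data(RespErr.OK, [0, 1, 0, 0], False))
```

In this model a poisoned beat is recorded in the meta but not reported, which matches the statement that Poison errors are not reported by L2 Cache.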
When L2 Cache transfers data to L3 Cache:
- If L2 Cache detects only a tag error (a tag ECC error, or tagErr = 1 in the corresponding meta), it sets respErr to NDERR and the poison field to all 0.
- If L2 Cache detects only a data error (a data ECC error, or dataErr = 1 in the corresponding meta), it sets respErr to DERR and the poison field to all 1.
- If L2 Cache detects both a tag error and a data error, it sets respErr to NDERR and the poison field to all 1.
- If L2 Cache does not detect any error, it sets respErr to OK and the poison field to all 0.
- The dataCheck field is filled with the checksum calculated by odd parity on the data.
- In the current version, the Write/Snoop transactions supported by L2 do not allow respErr to be NDERR in the related data packets (so respErr in TXDAT will in practice only be DERR or OK).
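The transmit-side rules form a small truth table over "tag error" and "data error". The Python sketch below is a behavioral model, not the RTL. It assumes that the third rule covers the case where both error kinds are present, that `tag_error` and `data_error` already combine the ECC result with the meta bits, and that the poison field is 4 bits wide (a hypothetical default). It models the rule table only and does not apply the current-version restriction on NDERR in TXDAT.

```python
from enum import IntEnum


class RespErr(IntEnum):
    OK = 0b00
    EXOK = 0b01
    DERR = 0b10
    NDERR = 0b11


def encode_tx_resp(tag_error: bool, data_error: bool, poison_width: int = 4):
    """Return (respErr, poison mask) for data sent from L2 to L3.

    tag_error  = tag ECC error or tagErr = 1 in the meta
    data_error = data ECC error or dataErr = 1 in the meta
    poison_width is an assumed parameter (one bit per 8 bytes of DAT).
    """
    all_ones = (1 << poison_width) - 1
    if tag_error and data_error:
        return RespErr.NDERR, all_ones
    if tag_error:
        return RespErr.NDERR, 0
    if data_error:
        return RespErr.DERR, all_ones
    return RespErr.OK, 0


for tag in (False, True):
    for data in (False, True):
        resp, poison = encode_tx_resp(tag, data)
        print(tag, data, resp.name, format(poison, "04b"))
```

The poison field depends only on whether a data error is present, while respErr is dominated by the tag error.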
Coherency state handling (RN receives a request containing NDERR):
- For allocation transactions, L2 will process the pipeline normally, but will not write back the relevant data from requests containing NDERR to the Directory or DataStorage. The cache state remains unchanged (specific related transaction types are ReadClean, ReadNotSharedDirty, ReadShared, ReadUnique, CleanUnique, MakeUnique).
- For release transactions, L2 processes normally (specific related transaction types are WriteBack, WriteEvictFull, Evict, WriteEvictOrEvict).
- For Snoop, L2 probes L1 (ToN), replies with SnpResp_I and NDERR, never forwards (does not reply with CompData), and temporarily does not set the corresponding L2 cache line to Invalid.
- For other transactions, L2 guarantees that the cache state of the corresponding data is not upgraded (in the current version, this is guaranteed by the first rule above, for allocation transactions).
Uncached Memory Access Request Error Handling
The MMIOBridge in CoupledL2 converts error handling related fields between TL and CHI but does not perform any error reporting.
CHI to TL (RXDAT/RXRSP)
- If respErr = NDERR, set denied to 1.
- If respErr = NDERR/DERR or any bit in the poison field is 1 or DataCheck odd parity verification detects an error, set corrupt to 1.
- Otherwise, both denied and corrupt are set to 0.
- In particular, for RXRSP (e.g., Comp), because the TileLink specification requires corrupt to be 0 in some response types (e.g., AccessAck), respErr = NDERR and respErr = DERR are both mapped to denied = 1.
- When an error occurs, a subsequent Hardware Error is triggered by ICache or DCache and reported to software for processing.
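The CHI to TL conversion above can be sketched as a single function. This is a hedged Python model of the field mapping only, not the MMIOBridge code: `chi_to_tl` and its `is_rsp` flag are hypothetical, and returning corrupt = 0 on the RXRSP path is an inference from the TileLink constraint mentioned in the text.

```python
from enum import IntEnum
from typing import Sequence


class RespErr(IntEnum):
    OK = 0b00
    EXOK = 0b01
    DERR = 0b10
    NDERR = 0b11


def chi_to_tl(resp_err: RespErr, poison: Sequence[int] = (),
              datacheck_error: bool = False, is_rsp: bool = False):
    """Return (denied, corrupt) for the TileLink response."""
    is_err = resp_err in (RespErr.NDERR, RespErr.DERR)
    if is_rsp:
        # RXRSP path: responses without data keep corrupt = 0 (assumed),
        # so both error kinds collapse into denied.
        return is_err, False
    denied = resp_err == RespErr.NDERR
    corrupt = is_err or any(poison) or datacheck_error
    return denied, corrupt


print(chi_to_tl(RespErr.NDERR))               # (True, True)
print(chi_to_tl(RespErr.DERR))                # (False, True)
print(chi_to_tl(RespErr.DERR, is_rsp=True))   # (True, False)
```

On the RXDAT path the result always satisfies the TileLink rule that denied = 1 implies corrupt = 1, because NDERR sets both.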
TL to CHI (TXDAT)
- When corrupt = 1, set respErr to DERR and set poison to all 1.
- When corrupt = 0, set respErr to OK and set poison to all 0.
- The dataCheck field is filled with the checksum calculated by odd parity on data.
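The TL to CHI direction is simpler, since only the corrupt bit is translated. The Python sketch below is a behavioral model with hypothetical names; it assumes one Poison bit per 8 bytes and one odd-parity DataCheck bit per byte, as described for the CHI bus earlier in this document.

```python
from enum import IntEnum
from typing import List, Tuple


class RespErr(IntEnum):
    OK = 0b00
    EXOK = 0b01
    DERR = 0b10
    NDERR = 0b11


def tl_to_chi(data: bytes, corrupt: bool) -> Tuple[RespErr, List[int], List[int]]:
    """Return (respErr, poison bits, dataCheck bits) for a TXDAT beat."""
    resp_err = RespErr.DERR if corrupt else RespErr.OK
    # One Poison bit per 8 bytes: all 1 when corrupt, all 0 otherwise.
    poison = [1 if corrupt else 0] * (len(data) // 8)
    # One odd-parity bit per byte, computed on the data as sent.
    datacheck = [1 - (bin(b).count("1") & 1) for b in data]
    return resp_err, poison, datacheck


resp, poison, check = tl_to_chi(bytes(16), corrupt=True)
print(resp.name, poison, check)
```

The DataCheck bits are computed from the data regardless of corrupt, so a corrupt beat still carries consistent parity and is flagged only through respErr and poison.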