
XiangShan ICache Design Document

Terminology

| Abbreviation | Full Name | Description |
| --- | --- | --- |
| ICache/I$ | Instruction Cache | L1 instruction cache |
| DCache/D$ | Data Cache | L1 data cache |
| L2 Cache/L2$ | Level Two Cache | L2 cache |
| IFU | Instruction Fetch Unit | Instruction fetch unit |
| ITLB | Instruction Translation Lookaside Buffer | Instruction address translation buffer |
| PMP | Physical Memory Protection | Physical memory protection module |
| PMA | Physical Memory Attribute | Physical memory attribute module (part of PMP) |
| BEU | Bus Error Unit | Bus error unit |
| FDIP | Fetch-Directed Instruction Prefetch | Fetch-directed instruction prefetch algorithm |
| MSHR | Miss Status Holding Register | Miss status holding register |
| a/(g)pf | Access / (Guest) Page Fault | Access fault / (guest) page fault |
| v/(g)paddr | Virtual / (Guest) Physical Address | Virtual address / (guest) physical address |
| PBMT | Page-Based Memory Types | Page-based memory types, see the Svpbmt extension of the privileged architecture |

Submodule List

| Submodule | Description |
| --- | --- |
| MainPipe | Main pipeline |
| IPrefetchPipe | Prefetch pipeline |
| WayLookup | Metadata buffer queue |
| MetaArray | Metadata SRAM |
| DataArray | Data SRAM |
| MissUnit | Miss handling unit |
| Replacer | Replacement policy unit |
| CtrlUnit | Control unit; currently used only for controlling error checking/error injection |

Design Specifications

  • Cache instruction data
  • Request data from L2 via the TileLink bus on a miss
  • Software maintains L1 I/D Cache coherence (fence.i)
  • Supports fetch requests spanning cachelines
  • Supports flushing (BPU redirect, backend redirect, fence.i)
  • Supports prefetch requests
    • Hardware prefetch uses the FDIP algorithm
    • Software prefetch uses the prefetch.i instruction of the Zicbop extension
  • Supports configurable replacement algorithms
  • Supports a configurable number of Miss Status Holding Registers
  • Supports checking for address translation errors and physical memory protection errors
  • Supports error checking, error recovery, and error injection¹
    • Uses parity code by default
    • Achieves error recovery by re-fetching from L2
    • Software can access the error-injection control registers via MMIO space
  • DataArray supports banked storage for fine-grained access and lower power consumption

Parameter List

| Parameter | Default Value | Description | Requirement |
| --- | --- | --- | --- |
| nSets | 256 | Number of SRAM sets | Power of 2 |
| nWays | 4 | Number of SRAM ways | |
| nFetchMshr | 4 | Number of fetch MSHRs | |
| nPrefetchMshr | 10 | Number of prefetch MSHRs | |
| nWayLookupSize | 32 | WayLookup depth; also limits the maximum prefetch distance via backpressure | |
| DataCodeUnit | 64 | Checksum unit size in bits; each 64 bits of data has 1 checksum bit | |
| ICacheDataBanks | 8 | Number of banks a cacheline is divided into | |
| ICacheDataSRAMWidth | 66 | Width of the basic DataArray SRAM unit | Greater than the sum of the per-bank data and code widths |

Functional Overview

The FTQ stores the prediction blocks generated by the BPU. fetchPtr points to the prediction block to be fetched, and prefetchPtr points to the prediction block to be prefetched; on reset, prefetchPtr equals fetchPtr. fetchPtr increments each time a fetch request is sent successfully, and prefetchPtr increments each time a prefetch request is sent successfully. For details, refer to the FTQ Design Document.

FTQ Pointer Diagram

The structure of the ICache is shown in the figure below. There are two pipelines: MainPipe and IPrefetchPipe. MainPipe receives fetch requests from the FTQ, while IPrefetchPipe receives hardware/software prefetch requests from the FTQ/MemBlock. For prefetch requests, IPrefetchPipe queries the MetaArray and, in its s1 stage, stores the metadata (which way hit, ECC checksum, whether an exception occurred, etc.) into the WayLookup; if the request misses, it is sent to the MissUnit for prefetching. For fetch requests, MainPipe first reads hit information from the WayLookup. If no valid information is available in the WayLookup, MainPipe stalls until IPrefetchPipe writes the information into it. This approach separates MetaArray and DataArray accesses so that only one way of the DataArray is read at a time, reducing power consumption at the cost of one extra cycle of redirect latency.

ICache Structure

The MissUnit handles fetch requests from the MainPipe and prefetch requests from the IPrefetchPipe, managing them via MSHRs. All MSHRs share a set of data registers to reduce area.

The Replacer is the replacement unit, using a PLRU replacement policy by default. It receives hit updates from the MainPipe and provides the waymask for replacement to the MissUnit.

The MetaArray is divided into two banks, odd and even, to support reading two consecutive cachelines for fetches that cross a cacheline boundary.

Each cacheline in the DataArray is stored across 8 banks by default, with each bank holding 64 bits of valid data. In addition, each 64 bits of data needs 1 checksum bit. Since 65-bit-wide SRAM performs poorly, a 256×66-bit SRAM is chosen as the basic unit, for a total of 32 such basic units. Each access needs 34 bytes of instruction data and therefore touches 5 consecutive banks (\(8 \times 4 = 32 < 34 \le 40 = 8 \times 5\)), selected according to the starting address.

Detailed Functionality

(Pre)fetch Requests

The FTQ sends (pre)fetch requests to the corresponding (pre)fetch pipeline for processing. As mentioned above, IPrefetchPipe queries the MetaArray and the ITLB and, in its s1 stage, stores the metadata (which way hit, ECC checksum, whether an exception occurred, etc.) into the WayLookup for the MainPipe s0 stage to read.

During power-on reset or a redirect, the WayLookup is empty and the FTQ's prefetchPtr and fetchPtr are reset to the same position, so the MainPipe s0 stage is forced to stall and wait for the IPrefetchPipe s1 stage to write into the WayLookup. This introduces an extra cycle of redirect latency. However, as the BPU fills prediction blocks into the FTQ and the MainPipe/IFU stalls for various reasons (e.g., a miss, or the IBuffer being full), the IPrefetchPipe runs ahead of the MainPipe (prefetchPtr > fetchPtr) and the WayLookup holds sufficient metadata; the MainPipe s0 stage and the IPrefetchPipe s0 stage then work in parallel.

Relationship between ICache's Two Pipelines

For detailed fetch processes, refer to the MainPipe Submodule Document, IPrefetchPipe Submodule Document, and WayLookup Submodule Document.

Hardware Prefetch and Software Prefetch

After V2R2, the ICache may accept prefetch requests from two sources:

  1. Hardware prefetch requests from the FTQ, based on the FDIP algorithm.
  2. Software prefetch requests from the LoadUnits in MemBlock, which are essentially prefetch.i instructions from the Zicbop extension; refer to the RISC-V CMO manual.

However, PrefetchPipe can only handle one prefetch request per cycle, so arbitration is required. The ICache top level is responsible for buffering software prefetch requests and for selecting between a buffered software prefetch request and a hardware prefetch request from the FTQ to send to PrefetchPipe. Software prefetch requests have higher priority than hardware prefetch requests.

Logically, each LoadUnit may issue a software prefetch request, so up to LduCnt (default 3) software prefetch requests may arrive per cycle. However, for reasons of implementation cost and performance benefit, the ICache receives and processes at most one per cycle, with the lowest port index taking priority; the remaining requests are discarded. Furthermore, if PrefetchPipe is stalled and a software prefetch request is already buffered in the ICache, a newly arriving software prefetch request overwrites the buffered one.
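
A minimal Chisel sketch of this arbitration follows; the module, bundle, and signal names (PrefetchArbiterSketch, PrefetchReq, softReq, hwReq, toPipe) are illustrative assumptions, not the actual XiangShan code.

import chisel3._
import chisel3.util._

class PrefetchReq extends Bundle {
  val vaddr          = UInt(50.W)
  val isSoftPrefetch = Bool()
}

// Illustrative sketch: buffer at most one software prefetch request and
// give it priority over the hardware prefetch request from the FTQ.
class PrefetchArbiterSketch(lduCnt: Int = 3) extends Module {
  val io = IO(new Bundle {
    val softReq = Vec(lduCnt, Flipped(ValidIO(new PrefetchReq))) // from LoadUnits
    val hwReq   = Flipped(DecoupledIO(new PrefetchReq))          // from FTQ (FDIP)
    val toPipe  = DecoupledIO(new PrefetchReq)                   // to IPrefetchPipe
  })

  // At most one software prefetch is accepted per cycle; lowest port index wins.
  val softValid = io.softReq.map(_.valid)
  val softSel   = PriorityMux(softValid, io.softReq.map(_.bits))

  // One-entry buffer; a newly arriving software prefetch overwrites the old one
  // if IPrefetchPipe is stalled and the buffer is still occupied.
  val bufValid = RegInit(false.B)
  val bufReq   = Reg(new PrefetchReq)
  when(softValid.reduce(_ || _)) {
    bufValid := true.B
    bufReq   := softSel
  }.elsewhen(io.toPipe.fire && bufValid) {
    bufValid := false.B
  }

  // Software prefetch has priority over hardware prefetch.
  io.toPipe.valid := bufValid || io.hwReq.valid
  io.toPipe.bits  := Mux(bufValid, bufReq, io.hwReq.bits)
  io.hwReq.ready  := io.toPipe.ready && !bufValid
}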

ICache Prefetch Request Reception and Arbitration

After being sent to PrefetchPipe, software prefetch requests are handled almost identically to hardware prefetch requests, except that:

  • Software prefetch requests do not affect control flow, i.e., they are not sent to the MainPipe (or the subsequent IFU, IBuffer, etc.). They only: 1) check whether a miss or an exception occurs; 2) if there is a miss and no exception, send the request to the MissUnit for prefetching and SRAM refill.

For details on PrefetchPipe, please refer to the submodule documentation.

Exception Handling/Special Case Processing

The ICache is responsible for checking permissions for fetch request addresses (via ITLB and PMP), and receiving responses from L2. Exceptions that may occur during this process include:

| Source | Exception | Description | Handling |
| --- | --- | --- | --- |
| ITLB | af | Access fault during virtual address translation | Forbid fetching; mark the fetch block as af and send it via the IFU and IBuffer to the backend for handling |
| ITLB | gpf | Guest page fault | Forbid fetching; mark the fetch block as gpf and send it via the IFU and IBuffer to the backend for handling; send the valid gpaddr and isForNonLeafPTE to the backend's GPAMem for later use |
| ITLB | pf | Page fault | Forbid fetching; mark the fetch block as pf and send it via the IFU and IBuffer to the backend for handling |
| backend | af/gpf/pf | Same as ITLB af/gpf/pf | Same as ITLB af/gpf/pf |
| PMP | af | No permission to access the physical address | Same as ITLB af |
| MissUnit | L2 corrupt | The L2 cache response is corrupt | Mark the fetch block as af and send it via the IFU and IBuffer to the backend for handling |

It should be noted that a normal fetch flow has no "backend exception" entry. However, to save hardware resources, XiangShan only carries 41/50 bits of the PC in the frontend (corresponding to Sv39x4 / Sv48x4). For instructions such as jr and jalr, the jump target comes from a 64-bit register. According to the RISC-V specification, an address whose high bits are neither all 0s nor all 1s is illegal and should raise an exception. This check can only be performed by the backend; the result is sent to the FTQ along with the backend redirect signal and then to the ICache along with the fetch request. It is essentially a kind of ITLB exception, so its description and handling are the same as for the ITLB.
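
As an illustration of the check the backend performs, here is a hedged Chisel sketch; the helper name and the vaddrBits parameter are assumptions, not the actual backend code.

import chisel3._
import chisel3.util._

object JumpTargetCheck {
  // A 64-bit jump target is legal only if the bits above the effective
  // virtual-address width are all zeros or all ones (i.e., a proper
  // sign-extension of the effective address). Illustrative sketch only.
  def isLegal(target: UInt, vaddrBits: Int): Bool = {
    val hi = target(63, vaddrBits - 1) // top bits plus the highest effective bit
    hi.andR || !hi.orR                 // all ones or all zeros
  }
}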

Additionally, an L2 cache response marked corrupt on the TileLink bus can result from an L2 ECC error (d.corrupt) or from a denied access to the bus address space (d.denied). The TileLink manual specifies that when d.denied is asserted, d.corrupt must also be asserted. Since both cases require marking the fetch block as an access fault, the ICache currently does not need to distinguish between them (i.e., it does not specifically check d.denied, which may therefore be optimized away by Chisel and not appear in the generated Verilog).

There is a priority among these exceptions: backend exception > ITLB exception > PMP exception > MissUnit exception. This ordering is natural:

  1. When a backend exception occurs, the vaddr sent to the frontend is incomplete and illegal, so the ITLB address translation is meaningless and any exception it detects is invalid.
  2. When an ITLB exception occurs, the translated paddr is invalid, so the PMP check is meaningless and any exception it detects is invalid.
  3. When a PMP exception occurs, the paddr has no access permission and no (pre)fetch request is sent, so no response will be received from the MissUnit.
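
As an illustration, a minimal Chisel sketch of such a prioritized selection; the helper and its exception encoding are assumptions, not the actual implementation.

import chisel3._
import chisel3.util._

object ExceptionPriority {
  // Pick the highest-priority exception:
  // backend > ITLB > PMP > MissUnit (L2 corrupt).
  // Each input is an optional exception cause; names are illustrative.
  def select(backendEx: Valid[UInt], itlbEx: Valid[UInt],
             pmpEx: Valid[UInt], l2Corrupt: Valid[UInt]): Valid[UInt] = {
    val out = Wire(Valid(UInt(backendEx.bits.getWidth.W)))
    out.valid := backendEx.valid || itlbEx.valid || pmpEx.valid || l2Corrupt.valid
    out.bits := MuxCase(0.U, Seq(
      backendEx.valid -> backendEx.bits,
      itlbEx.valid    -> itlbEx.bits,
      pmpEx.valid     -> pmpEx.bits,
      l2Corrupt.valid -> l2Corrupt.bits
    ))
    out
  }
}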

For the three kinds of backend exceptions and the three kinds of ITLB exceptions, the backend and the ITLB each perform a prioritized selection internally, ensuring that at most one is asserted at a time.

Furthermore, some mechanisms can cause special situations, also referred to as exceptions in older documentation/code, but which do not actually trigger an exception as defined in the RISC-V manual. To avoid confusion, these will be referred to as special cases hereafter:

| Source | Special Case | Description | Handling |
| --- | --- | --- | --- |
| PMP | mmio | The physical address is in MMIO space | Forbid fetching; mark the fetch block as mmio; the IFU performs non-speculative fetching |
| ITLB | pbmt.NC | The page attribute is non-cacheable, idempotent | Forbid fetching; the IFU performs speculative fetching |
| ITLB | pbmt.IO | The page attribute is non-cacheable, non-idempotent | Same as PMP mmio |
| MainPipe | ECC error | The main pipeline detects a MetaArray/DataArray ECC error | See the ECC section; handled like an ITLB af in older versions, automatic re-fetch in newer versions |

DataArray Per-Bank Low-Power Design

Currently, each cacheline in the ICache is divided into 8 banks, bank0-7. A fetch block needs 34 B of instruction data, so each access touches 5 consecutive banks. There are two cases:

  1. These 5 banks are located within a single cacheline (starting address is in bank0-3). Assuming the starting address is in bank2, the required data is in bank2-6. As shown in figure a below.
  2. Spanning across cachelines (starting address is in bank4-7). Assuming the starting address is in bank6, the data is in cacheline0's bank6-7 and cacheline1's bank0-2. This is somewhat similar to a circular buffer. As shown in figure b below.

DataArray Bank Division Diagram
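
As an illustration of the bank selection in the two cases above, here is a hedged Chisel sketch that computes which banks of the two cachelines are touched for a given starting bank; the helper names are assumptions, while the parameter values follow the table above.

import chisel3._
import chisel3.util._

object BankSelectSketch {
  val ICacheDataBanks = 8 // banks per cacheline
  val BanksPerFetch   = 5 // a 34 B fetch block needs 5 consecutive 8 B banks

  // One-hot masks of the banks touched in cacheline0 and cacheline1,
  // given the bank index of the fetch block's starting address (0..7).
  def bankMasks(startBank: UInt): (UInt, UInt) = {
    val start   = startBank(log2Ceil(ICacheDataBanks) - 1, 0) // 3-bit bank index
    val touched = Wire(Vec(ICacheDataBanks, Bool()))
    for (i <- 0 until ICacheDataBanks) {
      // Bank i is needed if it lies within [start, start + 4] (mod 8).
      val dist = i.U(3.W) - start // wrap-around distance, modulo 8
      touched(i) := dist < BanksPerFetch.U
    }
    // Banks below the starting bank belong to the next cacheline (wrap-around case).
    val line0 = VecInit((0 until ICacheDataBanks).map(i => touched(i) && (i.U >= start))).asUInt
    val line1 = VecInit((0 until ICacheDataBanks).map(i => touched(i) && (i.U <  start))).asUInt
    (line0, line1)
  }
}

For a starting bank of 2 this yields banks 2-6 of cacheline0 (figure a); for a starting bank of 6 it yields banks 6-7 of cacheline0 and banks 0-2 of cacheline1 (figure b).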

When fetching a cacheline from SRAM or MSHR, the data is placed into the corresponding banks based on the address.

Since each access needs only 5 banks of data, the port from the ICache to the IFU actually only needs to be 64 B wide. The selected banks from the two cachelines are extracted and concatenated before being returned to the IFU (this is done inside the DataArray module); the IFU copies and concatenates this 64 B of data, and can then select the fetch-block data directly based on the fetch block's starting address. The non-spanning and spanning cases are illustrated below:

DataArray Data Return Diagram (Single Line)

DataArray Data Return Diagram (Multiple Lines)

You can also refer to the comments in IFU.scala.

Flushing

During backend/IFU redirection, BPU redirection, or execution of the fence.i instruction, the storage structures and pipeline stages within the ICache need to be flushed as necessary. Possible flush targets/actions are:

  1. All pipeline stages of MainPipe and IPrefetchPipe
    • Simply set s0/1/2_valid to false.B when flushing
  2. valid bits in MetaArray
    • Simply set valid to false.B when flushing
    • tag and code do not need to be flushed because their validity is controlled by valid
    • Data in DataArray does not need to be flushed because its validity is controlled by valid in MetaArray
  3. WayLookup
    • Reset read/write pointers
    • Set gpf_entry.valid to false.B
  4. All MSHRs in MissUnit
    • If the MSHR has not yet sent a request to the bus, invalidate it directly (set valid to false.B)
    • If the MSHR has already sent a request to the bus, record that it needs to be flushed (set flush or fencei to true.B). Invalidate it only when a grant response is received on the d channel, and do not return the grant data to MainPipe/PrefetchPipe or write it to SRAM.
    • Note that when a grant response is received on the d channel simultaneously with a flush request (io.flush === true.B or io.fencei === true.B), the MissUnit will still not write to SRAM but will return the data to MainPipe/PrefetchPipe to avoid introducing port latency into the response logic. In this case, MainPipe/PrefetchPipe also simultaneously receives the flush request and will therefore discard the data.
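
A hedged Chisel sketch of the per-MSHR flush bookkeeping described in item 4 above; the register and signal names are illustrative assumptions, not the actual MissUnit code.

import chisel3._
import chisel3.util._

// Illustrative per-MSHR flush handling: an MSHR that has already issued its
// bus request cannot be dropped immediately; it is marked and retired when
// the TileLink grant arrives, and the refill is then discarded.
class MshrFlushSketch extends Module {
  val io = IO(new Bundle {
    val flush     = Input(Bool())  // backend/IFU redirect
    val fencei    = Input(Bool())  // fence.i
    val acquireOk = Input(Bool())  // this MSHR's request accepted on channel A
    val grantDone = Input(Bool())  // last grant beat received on channel D
    val dataValid = Output(Bool()) // forward refill data to MainPipe/PrefetchPipe
    val sramWrite = Output(Bool()) // write refill data into Meta/DataArray
  })

  val valid   = RegInit(false.B) // MSHR holds an outstanding request
  val issued  = RegInit(false.B) // request already sent on the bus
  val flushed = RegInit(false.B) // flush arrived after the request was issued

  when(io.acquireOk) { issued := true.B }
  when((io.flush || io.fencei) && valid) {
    when(issued) { flushed := true.B }   // must wait for the grant
      .otherwise { valid   := false.B }  // safe to drop immediately
  }
  when(io.grantDone) { valid := false.B; issued := false.B; flushed := false.B }

  // A flush arriving in the same cycle as the grant: still forward the data
  // (the pipelines see the same flush and drop it), but never write the SRAM.
  val flushNow = io.flush || io.fencei
  io.dataValid := io.grantDone && valid && !flushed
  io.sramWrite := io.grantDone && valid && !flushed && !flushNow
}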

Flush targets for each flush reason:

| Flush Reason | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- |
| Backend/IFU redirect | Y | | Y | Y |
| BPU redirect | Y² | | | |
| fence.i | Y³ | Y | Y³ | Y |

The ICache does not accept fetch/prefetch requests (io.req.ready === false.B) when performing a flush.

ITLB Flushing

ITLB flushing is special: its cached page table entries only need to be flushed when an sfence.vma instruction is executed, and that flush path is handled by the backend, so the frontend/ICache generally does not need to manage ITLB flushing. There is one exception: currently, to save resources, the ITLB does not store the gpaddr; instead, when a gpf occurs, it re-fetches the gpaddr from the L2TLB, and the re-fetch state is held in a gpf cache. This requires the ICache, upon receiving ITLB.resp.excp.gpf_instr, to satisfy one of the following two conditions:

  1. Resend the same ITLB.req.vaddr until ITLB.resp.miss is deasserted (at which point both gpf and gpaddr are valid and can be sent to the backend for normal processing). The ITLB will then flush its gpf cache.
  2. Assert ITLB.flushPipe. The ITLB will flush its gpf cache upon receiving this signal.

If the ITLB's gpf cache is not flushed and the ITLB then receives a request with a different ITLB.req.vaddr that also raises a gpf, the core will hang.

Therefore, whenever IPrefetchPipe's s1 pipeline stage is flushed, regardless of the reason, the ITLB's gpf cache must also be flushed simultaneously (i.e., assert ITLB.flushPipe).

ECC

First, it should be pointed out that under the default parameters the ICache uses a parity code, which can only detect single-bit errors and cannot correct them; strictly speaking, this is not ECC (Error Correction Code). However, the ICache can be configured to use a SECDED code, and the code base widely uses "ECC" to name the error detection and recovery functions (ecc_error, ecc_inject, etc.). To stay consistent with the code, this document therefore keeps the term ECC for the error detection, error recovery, and error injection functionality.

The ICache supports error detection, error recovery, and error injection, which are part of the RAS⁴ capability; refer to the RISC-V RERI⁵ manual. These functions are controlled by the CtrlUnit.

Error Detection

When the MissUnit refills data into MetaArray and DataArray, it calculates checksums for both meta and data. The former is stored in Meta SRAM along with the meta, and the latter is stored in a separate Data Code SRAM.

When a fetch request reads the SRAMs, the checksums are read out at the same time, and the meta and data are checked in the MainPipe s1 and s2 stages, respectively. Software can enable/disable this function by writing specific values to the corresponding bits of a CSR; in versions from June to December this was the custom CSR sfetchctl, and it was subsequently changed to an MMIO-mapped CSR. For details, see the CtrlUnit document.

Regarding checksum design, the checksum used by the ICache is parameter-controlled. By default, parity code is used, meaning the checksum is the reduction XOR of the data: \(code = \oplus data\). When checking, the checksum and data are simply reduction XORed together: \(error = (\oplus data) \oplus code\). If the result is 1, an error has occurred; otherwise, it is assumed there is no error (an even number of errors might occur but won't be detected here).

In versions after #4044, the ICache supports error injection, which requires the ICache to support writing incorrect checksums to MetaArray/DataArray. Therefore, a poison bit was implemented. When it is asserted, it flips the written code, i.e., \(code = (\oplus data) \oplus poison\).

To reduce the chance of undetected errors, the data is divided into DataCodeUnit-sized (default 64-bit) units and parity is computed separately for each unit. Thus, for each 64 B cacheline, a total of \(8\ (\text{data}) + 1\ (\text{meta}) = 9\) checksums are calculated.
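
As an illustration, a hedged Chisel sketch of per-unit parity generation and checking, including the poison bit used for error injection; the helper names are assumptions.

import chisel3._
import chisel3.util._

object ParitySketch {
  val DataCodeUnit = 64 // one parity bit per 64 data bits

  // Encode: one parity (reduction XOR) bit per 64-bit unit; poison flips the
  // stored code so that a later check reports an error (used for injection).
  def encode(data: UInt, poison: Bool): UInt = {
    val units = (data.getWidth + DataCodeUnit - 1) / DataCodeUnit
    VecInit((0 until units).map { i =>
      val hi = math.min((i + 1) * DataCodeUnit, data.getWidth) - 1
      data(hi, i * DataCodeUnit).xorR ^ poison
    }).asUInt
  }

  // Check: recompute the reduction XOR of each unit and compare with the
  // stored code; any set bit means a detected error in that unit.
  def check(data: UInt, code: UInt): Bool = {
    val recomputed = encode(data, poison = false.B)
    (recomputed ^ code).orR
  }
}

For a 512-bit cacheline this yields the 8 data checksums mentioned above; the meta checksum is computed in the same way over the meta bits.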

When MainPipe's s1/s2 pipeline stage detects an error, the following processing occurs:

In versions from June to November:

  1. Error Handling: Triggers an access fault exception, handled by software.
  2. Error Reporting: Reports the error to the BEU, which triggers an interrupt to report the error to software.
  3. Request Cancellation: When an error is detected in MetaArray, the read ptag is unreliable, making the hit/miss judgment unreliable. Therefore, regardless of whether it hits or misses, no request is sent to L2 Cache. Instead, the exception is directly passed to IFU and then to the backend for processing.

In subsequent versions (after #3899), automatic error recovery is implemented, so only the following processing is performed:

  1. Error Handling: Re-fetch instructions from L2 Cache, see next section.
  2. Error Reporting: Reports the error to the BEU as above.

Automatic Error Recovery

Note that, unlike the DCache, the ICache is read-only, so its data is never dirty: correct data can always be retrieved from the lower levels of the memory hierarchy (L2/L3 cache, memory). The ICache can therefore recover from errors automatically by re-issuing a miss request to the L2 Cache.

Implementing the re-fetch itself only requires reusing the existing miss path: MainPipe → MissUnit → MSHR → (TileLink) → L2 Cache. When the MissUnit refills the data into the SRAMs, it naturally computes and stores the new checksums, so after the re-fetch the line is error-free again without any extra handling.

The difference in pseudo-code behavior between versions from June-November and subsequent versions is as follows:

- exception = itlb_exception || pmp_exception || ecc_error
+ exception = itlb_exception || pmp_exception

- should_fetch = !hit && !exception
+ should_fetch = (!hit || ecc_error) && !exception

It is important to note that, to avoid a multi-hit after re-fetching (i.e., multiple ways in the same set holding the same ptag), the corresponding valid bits in the MetaArray need to be cleared before the re-fetch:

  • If MetaArray error: the ptag stored in meta may itself be incorrect, making the hit result (one-hot waymask) unreliable. "Corresponding position" refers to all ways in that set.
  • If DataArray error: the hit result is reliable. "Corresponding position" refers to the way where the waymask is asserted in that set.
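
A hedged Chisel sketch of how the clear mask could be formed from the two cases above; the names are illustrative assumptions.

import chisel3._
import chisel3.util._

object EccEvictSketch {
  // Which ways of the errored set must have their valid bit cleared before
  // the re-fetch is issued. Illustrative only.
  def clearWaymask(nWays: Int, metaError: Bool, dataError: Bool, hitWaymask: UInt): UInt = {
    Mux(metaError,
      Fill(nWays, 1.U(1.W)),                    // meta error: the ptag itself is suspect, clear all ways
      Mux(dataError, hitWaymask, 0.U(nWays.W))) // data error: clear only the hit way
  }
}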

Error Injection

According to the RERI manual⁵, to allow software to test the ECC functionality and thus better determine whether the hardware is working correctly, an error injection facility needs to be provided, i.e., a way to actively trigger ECC errors.

The ICache's error injection functionality is controlled by the CtrlUnit and triggered by writing specific values to the corresponding bits in an MMIO-mapped CSR. See the CtrlUnit document for details.

Currently, the ICache supports:

  • Injecting at a specific paddr. If the requested paddr misses, the injection fails.
  • Injecting into MetaArray or DataArray.
  • Injection fails if ECC checking functionality itself is not enabled.

Pseudo-code demonstrating the software injection process:

inject_target:
  # maybe do something
  ret

test:
  la   t0, $BASE_ADDR     # Load MMIO-mapped CSR base address
  la   t1, inject_target  # Load injection target address
  jalr ra, 0(t1)          # Jump to the injection target to ensure it is loaded into the ICache
  sd   t1, 8(t0)          # Write the injection target address to the CSR
  li   t2, ($TARGET << 2 | 1 << 1 | 1 << 0)  # Set injection target, injection enable, check enable
  sd   t2, 0(t0)          # Write the injection request to the CSR
loop:
  ld   t1, 0(t0)          # Read the CSR
  andi t1, t1, (0b11 << (4+1)) # Extract the injection status field
  beqz t1, loop           # If injection is not complete, continue waiting

  srli t1, t1, (4+1)      # Shift the status field down to the low bits
  addi t1, t1, -1
  bnez t1, error          # If injection failed, jump to error handling

  la   t1, inject_target  # Reload the injection target address (t1 was clobbered above)
  jalr ra, 0(t1)          # Injection succeeded; jump to the injection target to trigger the error
  j    finish             # Finish

error:
  # handle error
finish:
  # finish

We have written a test case, see this repository, which tests the following scenarios:

  1. Normal MetaArray injection
  2. Normal DataArray injection
  3. Injecting an invalid target
  4. Injecting when ECC checking is not enabled
  5. Injecting a missed address
  6. Attempting to write to a read-only CSR field

References

  1. Glenn Reinman, Brad Calder, and Todd Austin. "Fetch directed instruction prefetching." 32nd Annual ACM/IEEE International Symposium on Microarchitecture (MICRO). 1999.

  1. This document also refers to the error checking, error recovery, and error injection functionality as ECC, starting from the ECC section.

  2. The BPU's more accurate predictors (results produced in BPU s2/s3) may override the simple predictor (result produced in BPU s0). Such a redirect request reaches the ICache at most 1-2 cycles after the corresponding prefetch request, so only the following is needed:

    BPU s2 redirect: Flush IPrefetchPipe s0

    BPU s3 redirect: Flush IPrefetchPipe s0/1

    When the request in the corresponding IPrefetchPipe stage comes from a software prefetch (isSoftPrefetch === true.B), no flushing is required.

    When the request in the corresponding IPrefetchPipe stage comes from a hardware prefetch but the ftqIdx does not match the flush request, no flushing is required. 

  3. Logically, fence.i should flush MainPipe and IPrefetchPipe (because the data in these stages might be invalid), but in practice, io.fencei being high is always accompanied by a backend redirect. Therefore, in the current implementation, there is no need to flush MainPipe and IPrefetchPipe. 

  4. This RAS (Reliability, Availability, and Serviceability) is not the same as RAS (Return Address Stack). 

  5. RERI (RAS Error-record Register Interface). Refer to the RISC-V RERI manual