
XiangShan ICache Design Document

Terminology

| Abbreviation | Full Name | Description |
| --- | --- | --- |
| ICache/I$ | Instruction Cache | L1 instruction cache |
| DCache/D$ | Data Cache | L1 data cache |
| L2 Cache/L2$ | Level Two Cache | L2 cache |
| IFU | Instruction Fetch Unit | Instruction fetch unit |
| ITLB | Instruction Translation Lookaside Buffer | Instruction address translation buffer |
| PMP | Physical Memory Protection | Physical memory protection module |
| PMA | Physical Memory Attribute | Physical memory attribute module (part of PMP) |
| BEU | Bus Error Unit | Bus error unit |
| FDIP | Fetch-Directed Instruction Prefetch | Fetch-directed instruction prefetch algorithm |
| MSHR | Miss Status Holding Register | Miss status holding register |
| a/(g)pf | Access / (Guest) Page Fault | Access fault / (guest) page fault |
| v/(g)paddr | Virtual / (Guest) Physical Address | Virtual address / (guest) physical address |
| PBMT | Page-Based Memory Types | Page-based memory types, see the Svpbmt extension of the privileged architecture |

Submodule List

| Submodule | Description |
| --- | --- |
| MainPipe | Main pipeline |
| IPrefetchPipe | Prefetch pipeline |
| WayLookup | Metadata buffer queue |
| MetaArray | Metadata SRAM |
| DataArray | Data SRAM |
| MissUnit | Miss handling unit |
| Replacer | Replacement policy unit |
| CtrlUnit | Control unit; currently used only for controlling error checking/error injection |

Design Specifications

  • Cache instruction data
  • Request data from L2 via the TileLink bus on a miss
  • Software maintains L1 I/D Cache coherence (fence.i)
  • Supports fetch requests spanning cachelines
  • Supports flushing (BPU redirect, backend redirect, fence.i)
  • Supports prefetch requests
    • Hardware prefetch uses the FDIP algorithm
    • Software prefetch uses the prefetch.i instruction of the Zicbop extension
  • Supports configurable replacement algorithms
  • Supports a configurable number of Miss Status Holding Registers
  • Supports checking for address translation errors and physical memory protection errors
  • Supports error checking, error recovery, and error injection¹
    • Uses parity code by default
    • Achieves error recovery by re-fetching from L2
    • Software can access the error-injection control registers via MMIO space
  • DataArray supports banked storage for fine-grained access and lower power consumption

Parameter List

| Parameter | Default Value | Description | Requirement |
| --- | --- | --- | --- |
| nSets | 256 | Number of SRAM sets | Power of 2 |
| nWays | 4 | Number of SRAM ways | |
| nFetchMshr | 4 | Number of fetch MSHRs | |
| nPrefetchMshr | 10 | Number of prefetch MSHRs | |
| nWayLookupSize | 32 | WayLookup depth; also limits the maximum prefetch distance via backpressure | |
| DataCodeUnit | 64 | Checksum unit size in bits; each 64 bits of data has 1 checksum bit | |
| ICacheDataBanks | 8 | Number of banks a cacheline is divided into | |
| ICacheDataSRAMWidth | 66 | Width of the basic DataArray SRAM unit | Greater than the sum of the per-bank data and code widths |

Functional Overview

The FTQ stores the prediction blocks generated by the BPU. fetchPtr points to the prediction block to be fetched, and prefetchPtr points to the prediction block to be prefetched; on reset, prefetchPtr equals fetchPtr. fetchPtr increments each time a fetch request is sent successfully, and prefetchPtr increments each time a prefetch request is sent successfully. For details, refer to the FTQ Design Document.

FTQ Pointer Diagram

The structure of the ICache is shown in the figure below. There are two pipelines: MainPipe and IPrefetchPipe. MainPipe receives fetch requests from the FTQ, while IPrefetchPipe receives hardware/software prefetch requests from the FTQ/MemBlock. For prefetch requests, IPrefetchPipe queries the MetaArray and, in its s1 stage, stores the metadata (which way hit, ECC checksum, whether an exception occurred, etc.) into the WayLookup; if the request misses, it is sent to the MissUnit for prefetching. For fetch requests, MainPipe first reads hit information from the WayLookup. If no valid information is available in the WayLookup, MainPipe stalls until IPrefetchPipe writes the information into it. This approach separates MetaArray and DataArray accesses so that only one way of the DataArray is read at a time, reducing power consumption at the cost of one extra cycle of redirect latency.

ICache Structure

The MissUnit handles fetch requests from the MainPipe and prefetch requests from the IPrefetchPipe, managing them via MSHRs. All MSHRs share a set of data registers to reduce area.

The Replacer is the replacement unit, using a PLRU replacement policy by default. It receives hit updates from the MainPipe and provides the waymask for replacement to the MissUnit.

The MetaArray is divided into two banks, odd and even, to support reading two consecutive cachelines for fetches that cross a cacheline boundary.

Each cacheline in the DataArray is stored across 8 banks by default, with each bank holding 64 bits of valid data. In addition, each 64 bits of data needs 1 checksum bit. Since 65-bit-wide SRAM performs poorly, a 256×66-bit SRAM is chosen as the basic unit, for a total of 32 such basic units. Each access needs 34 bytes of instruction data and therefore touches 5 consecutive banks (\(8 \times 4 = 32 < 34 \le 40 = 8 \times 5\)), selected according to the starting address.

Detailed Functionality

(Pre)fetch Requests

The FTQ sends (pre)fetch requests to the corresponding (pre)fetch pipeline for processing. As mentioned above, IPrefetchPipe queries the MetaArray and the ITLB and, in its s1 stage, stores the metadata (which way hit, ECC checksum, whether an exception occurred, etc.) into the WayLookup for the MainPipe s0 stage to read.

During power-on reset or a redirect, the WayLookup is empty and the FTQ's prefetchPtr and fetchPtr are reset to the same position, so the MainPipe s0 stage is forced to stall and wait for the IPrefetchPipe s1 stage to write into the WayLookup. This introduces an extra cycle of redirect latency. However, as the BPU fills prediction blocks into the FTQ and the MainPipe/IFU stalls for various reasons (e.g., a miss, or the IBuffer being full), the IPrefetchPipe runs ahead of the MainPipe (prefetchPtr > fetchPtr) and the WayLookup holds sufficient metadata; the MainPipe s0 stage and the IPrefetchPipe s0 stage then work in parallel.

Relationship between ICache's Two Pipelines

For detailed fetch processes, refer to the MainPipe Submodule Document, IPrefetchPipe Submodule Document, and WayLookup Submodule Document.

Hardware Prefetch and Software Prefetch

After V2R2, the ICache may accept prefetch requests from two sources:

  1. Hardware prefetch requests from the FTQ, based on the FDIP algorithm.
  2. Software prefetch requests from the LoadUnits in MemBlock, which are essentially prefetch.i instructions from the Zicbop extension; refer to the RISC-V CMO manual.

However, PrefetchPipe can only handle one prefetch request per cycle, so arbitration is required. The ICache top level is responsible for buffering software prefetch requests and for selecting between a buffered software prefetch request and a hardware prefetch request from the FTQ to send to PrefetchPipe. Software prefetch requests have higher priority than hardware prefetch requests.

Logically, each LoadUnit may issue a software prefetch request, so up to LduCnt (default 3) software prefetch requests may arrive per cycle. However, for reasons of implementation cost and performance benefit, the ICache receives and processes at most one per cycle, with the lowest port index taking priority; the remaining requests are discarded. Furthermore, if PrefetchPipe is stalled and a software prefetch request is already buffered in the ICache, a newly arriving software prefetch request overwrites the buffered one.
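
A minimal Chisel sketch of this arbitration follows; the module, bundle, and signal names (PrefetchArbiterSketch, PrefetchReq, softReq, hwReq, toPipe) are illustrative assumptions, not the actual XiangShan code.

import chisel3._
import chisel3.util._

class PrefetchReq extends Bundle {
  val vaddr          = UInt(50.W)
  val isSoftPrefetch = Bool()
}

// Illustrative sketch: buffer at most one software prefetch request and
// give it priority over the hardware prefetch request from the FTQ.
class PrefetchArbiterSketch(lduCnt: Int = 3) extends Module {
  val io = IO(new Bundle {
    val softReq = Vec(lduCnt, Flipped(ValidIO(new PrefetchReq))) // from LoadUnits
    val hwReq   = Flipped(DecoupledIO(new PrefetchReq))          // from FTQ (FDIP)
    val toPipe  = DecoupledIO(new PrefetchReq)                   // to IPrefetchPipe
  })

  // At most one software prefetch is accepted per cycle; lowest port index wins.
  val softValid = io.softReq.map(_.valid)
  val softSel   = PriorityMux(softValid, io.softReq.map(_.bits))

  // One-entry buffer; a newly arriving software prefetch overwrites the old one
  // if IPrefetchPipe is stalled and the buffer is still occupied.
  val bufValid = RegInit(false.B)
  val bufReq   = Reg(new PrefetchReq)
  when(softValid.reduce(_ || _)) {
    bufValid := true.B
    bufReq   := softSel
  }.elsewhen(io.toPipe.fire && bufValid) {
    bufValid := false.B
  }

  // Software prefetch has priority over hardware prefetch.
  io.toPipe.valid := bufValid || io.hwReq.valid
  io.toPipe.bits  := Mux(bufValid, bufReq, io.hwReq.bits)
  io.hwReq.ready  := io.toPipe.ready && !bufValid
}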

ICache Prefetch Request Reception and Arbitration

After being sent to PrefetchPipe, software prefetch requests are handled almost identically to hardware prefetch requests, except that:

  • Software prefetch requests do not affect control flow, i.e., they are not sent to the MainPipe (or the subsequent IFU, IBuffer, etc.). They only: 1) check whether a miss or an exception occurs; 2) if there is a miss and no exception, send the request to the MissUnit for prefetching and SRAM refill.

For details on PrefetchPipe, please refer to the submodule documentation.

Exception Handling/Special Case Processing

The ICache is responsible for checking permissions for fetch request addresses (via ITLB and PMP), and receiving responses from L2. Exceptions that may occur during this process include:

| Source | Exception | Description | Handling |
| --- | --- | --- | --- |
| ITLB | af | Access fault during virtual address translation | Forbid fetching; mark the fetch block as af and send it via the IFU and IBuffer to the backend for handling |
| ITLB | gpf | Guest page fault | Forbid fetching; mark the fetch block as gpf and send it via the IFU and IBuffer to the backend for handling; send the valid gpaddr and isForNonLeafPTE to the backend's GPAMem for later use |
| ITLB | pf | Page fault | Forbid fetching; mark the fetch block as pf and send it via the IFU and IBuffer to the backend for handling |
| backend | af/gpf/pf | Same as ITLB af/gpf/pf | Same as ITLB af/gpf/pf |
| PMP | af | No permission to access the physical address | Same as ITLB af |
| MissUnit | L2 corrupt | The L2 cache response is corrupt | Mark the fetch block as af and send it via the IFU and IBuffer to the backend for handling |

It should be noted that a normal fetch flow has no "backend exception" entry. However, to save hardware resources, XiangShan only carries 41/50 bits of the PC in the frontend (corresponding to Sv39x4 / Sv48x4). For instructions such as jr and jalr, the jump target comes from a 64-bit register. According to the RISC-V specification, an address whose high bits are neither all 0s nor all 1s is illegal and should raise an exception. This check can only be performed by the backend; the result is sent to the FTQ along with the backend redirect signal and then to the ICache along with the fetch request. It is essentially a kind of ITLB exception, so its description and handling are the same as for the ITLB.
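
As an illustration of the check the backend performs, here is a hedged Chisel sketch; the helper name and the vaddrBits parameter are assumptions, not the actual backend code.

import chisel3._
import chisel3.util._

object JumpTargetCheck {
  // A 64-bit jump target is legal only if the bits above the effective
  // virtual-address width are all zeros or all ones (i.e., a proper
  // sign-extension of the effective address). Illustrative sketch only.
  def isLegal(target: UInt, vaddrBits: Int): Bool = {
    val hi = target(63, vaddrBits - 1) // top bits plus the highest effective bit
    hi.andR || !hi.orR                 // all ones or all zeros
  }
}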

Additionally, an L2 cache response marked corrupt on the TileLink bus can result from an L2 ECC error (d.corrupt) or from a denied access to the bus address space (d.denied). The TileLink manual specifies that when d.denied is asserted, d.corrupt must also be asserted. Since both cases require marking the fetch block as an access fault, the ICache currently does not need to distinguish between them (i.e., it does not specifically check d.denied, which may therefore be optimized away by Chisel and not appear in the generated Verilog).

There is a priority among these exceptions: backend exception > ITLB exception > PMP exception > MissUnit exception. This ordering is natural:

  1. When a backend exception occurs, the vaddr sent to the frontend is incomplete and illegal, so the ITLB address translation is meaningless and any exception it detects is invalid.
  2. When an ITLB exception occurs, the translated paddr is invalid, so the PMP check is meaningless and any exception it detects is invalid.
  3. When a PMP exception occurs, the paddr has no access permission and no (pre)fetch request is sent, so no response will be received from the MissUnit.
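
As an illustration, a minimal Chisel sketch of such a prioritized selection; the helper and its exception encoding are assumptions, not the actual implementation.

import chisel3._
import chisel3.util._

object ExceptionPriority {
  // Pick the highest-priority exception:
  // backend > ITLB > PMP > MissUnit (L2 corrupt).
  // Each input is an optional exception cause; names are illustrative.
  def select(backendEx: Valid[UInt], itlbEx: Valid[UInt],
             pmpEx: Valid[UInt], l2Corrupt: Valid[UInt]): Valid[UInt] = {
    val out = Wire(Valid(UInt(backendEx.bits.getWidth.W)))
    out.valid := backendEx.valid || itlbEx.valid || pmpEx.valid || l2Corrupt.valid
    out.bits := MuxCase(0.U, Seq(
      backendEx.valid -> backendEx.bits,
      itlbEx.valid    -> itlbEx.bits,
      pmpEx.valid     -> pmpEx.bits,
      l2Corrupt.valid -> l2Corrupt.bits
    ))
    out
  }
}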

For the three kinds of backend exceptions and the three kinds of ITLB exceptions, the backend and the ITLB each perform a prioritized selection internally, ensuring that at most one is asserted at a time.

Furthermore, some mechanisms can cause special situations, also referred to as exceptions in older documentation/code, but which do not actually trigger an exception as defined in the RISC-V manual. To avoid confusion, these will be referred to as special cases hereafter:

| Source | Special Case | Description | Handling |
| --- | --- | --- | --- |
| PMP | mmio | The physical address is in MMIO space | Forbid fetching; mark the fetch block as mmio; the IFU performs non-speculative fetching |
| ITLB | pbmt.NC | The page attribute is non-cacheable, idempotent | Forbid fetching; the IFU performs speculative fetching |
| ITLB | pbmt.IO | The page attribute is non-cacheable, non-idempotent | Same as PMP mmio |
| MainPipe | ECC error | The main pipeline detects a MetaArray/DataArray ECC error | See the ECC section; handled like an ITLB af in older versions, automatic re-fetch in newer versions |

DataArray Per-Bank Low-Power Design

Currently, each cacheline in the ICache is divided into 8 banks, bank0-7. A fetch block needs 34 B of instruction data, so each access touches 5 consecutive banks. There are two cases:

  1. These 5 banks are located within a single cacheline (starting address is in bank0-3). Assuming the starting address is in bank2, the required data is in bank2-6. As shown in figure a below.
  2. Spanning across cachelines (starting address is in bank4-7). Assuming the starting address is in bank6, the data is in cacheline0's bank6-7 and cacheline1's bank0-2. This is somewhat similar to a circular buffer. As shown in figure b below.

DataArray Bank Division Diagram
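
As an illustration of the bank selection in the two cases above, here is a hedged Chisel sketch that computes which banks of the two cachelines are touched for a given starting bank; the helper names are assumptions, while the parameter values follow the table above.

import chisel3._
import chisel3.util._

object BankSelectSketch {
  val ICacheDataBanks = 8 // banks per cacheline
  val BanksPerFetch   = 5 // a 34 B fetch block needs 5 consecutive 8 B banks

  // One-hot masks of the banks touched in cacheline0 and cacheline1,
  // given the bank index of the fetch block's starting address (0..7).
  def bankMasks(startBank: UInt): (UInt, UInt) = {
    val start   = startBank(log2Ceil(ICacheDataBanks) - 1, 0) // 3-bit bank index
    val touched = Wire(Vec(ICacheDataBanks, Bool()))
    for (i <- 0 until ICacheDataBanks) {
      // Bank i is needed if it lies within [start, start + 4] (mod 8).
      val dist = i.U(3.W) - start // wrap-around distance, modulo 8
      touched(i) := dist < BanksPerFetch.U
    }
    // Banks below the starting bank belong to the next cacheline (wrap-around case).
    val line0 = VecInit((0 until ICacheDataBanks).map(i => touched(i) && (i.U >= start))).asUInt
    val line1 = VecInit((0 until ICacheDataBanks).map(i => touched(i) && (i.U <  start))).asUInt
    (line0, line1)
  }
}

For a starting bank of 2 this yields banks 2-6 of cacheline0 (figure a); for a starting bank of 6 it yields banks 6-7 of cacheline0 and banks 0-2 of cacheline1 (figure b).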

When fetching a cacheline from SRAM or MSHR, the data is placed into the corresponding banks based on the address.

Since each access needs only 5 banks of data, the port from the ICache to the IFU actually only needs to be 64 B wide. The selected banks from the two cachelines are extracted and concatenated before being returned to the IFU (this is done inside the DataArray module); the IFU copies and concatenates this 64 B of data, and can then select the fetch-block data directly based on the fetch block's starting address. The non-spanning and spanning cases are illustrated below:

DataArray Data Return Diagram (Single Line)

DataArray Data Return Diagram (Multiple Lines)

You can also refer to the comments in IFU.scala.

Flushing

During backend/IFU redirection, BPU redirection, or execution of the fence.i instruction, the storage structures and pipeline stages within the ICache need to be flushed as necessary. Possible flush targets/actions are:

  1. All pipeline stages of MainPipe and IPrefetchPipe
    • Simply set s0/1/2_valid to false.B when flushing
  2. valid bits in MetaArray
    • Simply set valid to false.B when flushing
    • tag and code do not need to be flushed because their validity is controlled by valid
    • Data in DataArray does not need to be flushed because its validity is controlled by valid in MetaArray
  3. WayLookup
    • Reset read/write pointers
    • Set gpf_entry.valid to false.B
  4. All MSHRs in MissUnit
    • If the MSHR has not yet sent a request to the bus, invalidate it directly (set valid to false.B)
    • If the MSHR has already sent a request to the bus, record that it needs to be flushed (set flush or fencei to true.B). Invalidate it only when a grant response is received on the d channel, and do not return the grant data to MainPipe/PrefetchPipe or write it to SRAM.
    • Note that when a grant response is received on the d channel simultaneously with a flush request (io.flush === true.B or io.fencei === true.B), the MissUnit will still not write to SRAM but will return the data to MainPipe/PrefetchPipe to avoid introducing port latency into the response logic. In this case, MainPipe/PrefetchPipe also simultaneously receives the flush request and will therefore discard the data.
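
A hedged Chisel sketch of the per-MSHR flush bookkeeping described in item 4 above; the register and signal names are illustrative assumptions, not the actual MissUnit code.

import chisel3._
import chisel3.util._

// Illustrative per-MSHR flush handling: an MSHR that has already issued its
// bus request cannot be dropped immediately; it is marked and retired when
// the TileLink grant arrives, and the refill is then discarded.
class MshrFlushSketch extends Module {
  val io = IO(new Bundle {
    val flush     = Input(Bool())  // backend/IFU redirect
    val fencei    = Input(Bool())  // fence.i
    val acquireOk = Input(Bool())  // this MSHR's request accepted on channel A
    val grantDone = Input(Bool())  // last grant beat received on channel D
    val dataValid = Output(Bool()) // forward refill data to MainPipe/PrefetchPipe
    val sramWrite = Output(Bool()) // write refill data into Meta/DataArray
  })

  val valid   = RegInit(false.B) // MSHR holds an outstanding request
  val issued  = RegInit(false.B) // request already sent on the bus
  val flushed = RegInit(false.B) // flush arrived after the request was issued

  when(io.acquireOk) { issued := true.B }
  when((io.flush || io.fencei) && valid) {
    when(issued) { flushed := true.B }   // must wait for the grant
      .otherwise { valid   := false.B }  // safe to drop immediately
  }
  when(io.grantDone) { valid := false.B; issued := false.B; flushed := false.B }

  // A flush arriving in the same cycle as the grant: still forward the data
  // (the pipelines see the same flush and drop it), but never write the SRAM.
  val flushNow = io.flush || io.fencei
  io.dataValid := io.grantDone && valid && !flushed
  io.sramWrite := io.grantDone && valid && !flushed && !flushNow
}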

Flush targets for each flush reason:

| Flush Reason | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- |
| Backend/IFU redirect | Y | | Y | Y |
| BPU redirect | Y² | | | |
| fence.i | Y³ | Y | Y³ | Y |

The ICache does not accept fetch/prefetch requests (io.req.ready === false.B) when performing a flush.

ITLB Flushing

ITLB flushing is special: its cached page table entries only need to be flushed when an sfence.vma instruction is executed, and that flush path is handled by the backend, so the frontend/ICache generally does not need to manage ITLB flushing. There is one exception: currently, to save resources, the ITLB does not store the gpaddr; instead, when a gpf occurs, it re-fetches the gpaddr from the L2TLB, and the re-fetch state is held in a gpf cache. This requires the ICache, upon receiving ITLB.resp.excp.gpf_instr, to satisfy one of the following two conditions:

  1. Resend the same ITLB.req.vaddr until ITLB.resp.miss is deasserted (at which point both gpf and gpaddr are valid and can be sent to the backend for normal processing). The ITLB will then flush its gpf cache.
  2. Assert ITLB.flushPipe. The ITLB will flush its gpf cache upon receiving this signal.

If the ITLB's gpf cache is not flushed and the ITLB then receives a request with a different ITLB.req.vaddr that also raises a gpf, the core will hang.

Therefore, whenever IPrefetchPipe's s1 pipeline stage is flushed, regardless of the reason, the ITLB's gpf cache must also be flushed simultaneously (i.e., assert ITLB.flushPipe).

ECC

First, it should be pointed out that under the default parameters the ICache uses a parity code, which can only detect single-bit errors and cannot correct them; strictly speaking, this is not ECC (Error Correction Code). However, the ICache can be configured to use a SECDED code, and the code base widely uses "ECC" to name the error detection and recovery functions (ecc_error, ecc_inject, etc.). To stay consistent with the code, this document therefore keeps the term ECC for the error detection, error recovery, and error injection functionality.

The ICache supports error detection, error recovery, and error injection, which are part of the RAS⁴ capability; refer to the RISC-V RERI⁵ manual. These functions are controlled by the CtrlUnit.

Error Detection

When the MissUnit refills data into MetaArray and DataArray, it calculates checksums for both meta and data. The former is stored in Meta SRAM along with the meta, and the latter is stored in a separate Data Code SRAM.

When a fetch request reads the SRAMs, the checksums are read out at the same time, and the meta and data are checked in the MainPipe s1 and s2 stages, respectively. Software can enable/disable this function by writing specific values to the corresponding bits of a CSR; in versions from June to December this was the custom CSR sfetchctl, and it was subsequently changed to an MMIO-mapped CSR. For details, see the CtrlUnit document.

Regarding checksum design, the checksum used by the ICache is parameter-controlled. By default, parity code is used, meaning the checksum is the reduction XOR of the data: \(code = \oplus data\). When checking, the checksum and data are simply reduction XORed together: \(error = (\oplus data) \oplus code\). If the result is 1, an error has occurred; otherwise, it is assumed there is no error (an even number of errors might occur but won't be detected here).

In versions after #4044, the ICache supports error injection, which requires the ICache to support writing incorrect checksums to MetaArray/DataArray. Therefore, a poison bit was implemented. When it is asserted, it flips the written code, i.e., \(code = (\oplus data) \oplus poison\).

To reduce the chance of undetected errors, the data is divided into DataCodeUnit-sized (default 64-bit) units and parity is computed separately for each unit. Thus, for each 64 B cacheline, a total of \(8\ (\text{data}) + 1\ (\text{meta}) = 9\) checksums are calculated.
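
As an illustration, a hedged Chisel sketch of per-unit parity generation and checking, including the poison bit used for error injection; the helper names are assumptions.

import chisel3._
import chisel3.util._

object ParitySketch {
  val DataCodeUnit = 64 // one parity bit per 64 data bits

  // Encode: one parity (reduction XOR) bit per 64-bit unit; poison flips the
  // stored code so that a later check reports an error (used for injection).
  def encode(data: UInt, poison: Bool): UInt = {
    val units = (data.getWidth + DataCodeUnit - 1) / DataCodeUnit
    VecInit((0 until units).map { i =>
      val hi = math.min((i + 1) * DataCodeUnit, data.getWidth) - 1
      data(hi, i * DataCodeUnit).xorR ^ poison
    }).asUInt
  }

  // Check: recompute the reduction XOR of each unit and compare with the
  // stored code; any set bit means a detected error in that unit.
  def check(data: UInt, code: UInt): Bool = {
    val recomputed = encode(data, poison = false.B)
    (recomputed ^ code).orR
  }
}

For a 512-bit cacheline this yields the 8 data checksums mentioned above; the meta checksum is computed in the same way over the meta bits.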

When MainPipe's s1/s2 pipeline stage detects an error, the following processing occurs:

In versions from June to November:

  1. Error Handling: Triggers an access fault exception, handled by software.
  2. Error Reporting: Reports the error to the BEU, which triggers an interrupt to report the error to software.
  3. Request Cancellation: When an error is detected in MetaArray, the read ptag is unreliable, making the hit/miss judgment unreliable. Therefore, regardless of whether it hits or misses, no request is sent to L2 Cache. Instead, the exception is directly passed to IFU and then to the backend for processing.

In subsequent versions (after #3899), automatic error recovery is implemented, so only the following processing is performed:

  1. Error Handling: Re-fetch instructions from L2 Cache, see next section.
  2. Error Reporting: Reports the error to the BEU as above.

Automatic Error Recovery

Note that, unlike the DCache, the ICache is read-only, so its data is never dirty: correct data can always be retrieved from the lower levels of the memory hierarchy (L2/L3 cache, memory). The ICache can therefore recover from errors automatically by re-issuing a miss request to the L2 Cache.

Implementing the re-fetch itself only requires reusing the existing miss path: MainPipe → MissUnit → MSHR → (TileLink) → L2 Cache. When the MissUnit refills the data into the SRAMs, it naturally computes and stores the new checksums, so after the re-fetch the line is error-free again without any extra handling.

The difference in pseudo-code behavior between versions from June-November and subsequent versions is as follows:

- exception = itlb_exception || pmp_exception || ecc_error
+ exception = itlb_exception || pmp_exception

- should_fetch = !hit && !exception
+ should_fetch = (!hit || ecc_error) && !exception

It is important to note that, to avoid a multi-hit after re-fetching (i.e., multiple ways in the same set holding the same ptag), the corresponding valid bits in the MetaArray need to be cleared before the re-fetch:

  • If MetaArray error: the ptag stored in meta may itself be incorrect, making the hit result (one-hot waymask) unreliable. "Corresponding position" refers to all ways in that set.
  • If DataArray error: the hit result is reliable. "Corresponding position" refers to the way where the waymask is asserted in that set.
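
A hedged Chisel sketch of how the clear mask could be formed from the two cases above; the names are illustrative assumptions.

import chisel3._
import chisel3.util._

object EccEvictSketch {
  // Which ways of the errored set must have their valid bit cleared before
  // the re-fetch is issued. Illustrative only.
  def clearWaymask(nWays: Int, metaError: Bool, dataError: Bool, hitWaymask: UInt): UInt = {
    Mux(metaError,
      Fill(nWays, 1.U(1.W)),                    // meta error: the ptag itself is suspect, clear all ways
      Mux(dataError, hitWaymask, 0.U(nWays.W))) // data error: clear only the hit way
  }
}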

Error Injection

According to the RERI manual⁵, to allow software to test the ECC functionality and thus better determine whether the hardware is working correctly, an error injection facility needs to be provided, i.e., a way to actively trigger ECC errors.

The ICache's error injection functionality is controlled by the CtrlUnit and triggered by writing specific values to the corresponding bits in an MMIO-mapped CSR. See the CtrlUnit document for details.

Currently, the ICache supports:

  • Injecting at a specific paddr. If the requested paddr misses, the injection fails.
  • Injecting into MetaArray or DataArray.
  • Injection fails if ECC checking functionality itself is not enabled.

Pseudo-code demonstrating the software injection process:

inject_target:
  # maybe do something
  ret

test:
  la   t0, $BASE_ADDR     # Load MMIO-mapped CSR base address
  la   t1, inject_target  # Load injection target address
  jalr ra, 0(t1)          # Jump to the injection target to ensure it is loaded into the ICache
  sd   t1, 8(t0)          # Write the injection target address to the CSR
  li   t2, ($TARGET << 2 | 1 << 1 | 1 << 0)  # Set injection target, injection enable, check enable
  sd   t2, 0(t0)          # Write the injection request to the CSR
loop:
  ld   t1, 0(t0)          # Read the CSR
  andi t1, t1, (0b11 << (4+1)) # Extract the injection status field
  beqz t1, loop           # If injection is not complete, continue waiting

  srli t1, t1, (4+1)      # Shift the status field down to the low bits
  addi t1, t1, -1
  bnez t1, error          # If injection failed, jump to error handling

  la   t1, inject_target  # Reload the injection target address (t1 was clobbered above)
  jalr ra, 0(t1)          # Injection succeeded; jump to the injection target to trigger the error
  j    finish             # Finish

error:
  # handle error
finish:
  # finish

We have written a test case, see this repository, which tests the following scenarios:

  1. Normal MetaArray injection
  2. Normal DataArray injection
  3. Injecting an invalid target
  4. Injecting when ECC checking is not enabled
  5. Injecting a missed address
  6. Attempting to write to a read-only CSR field

References

  1. Glenn Reinman, Brad Calder, and Todd Austin. "Fetch directed instruction prefetching." 32nd Annual ACM/IEEE International Symposium on Microarchitecture (MICRO). 1999.

  1. This document also refers to the error checking, error recovery, and error injection functionality as ECC, starting from the ECC section.

  2. The BPU's more accurate predictors (results produced in BPU s2/s3) may override the simple predictor (result produced in BPU s0). Such a redirect request reaches the ICache at most 1-2 cycles after the corresponding prefetch request, so only the following is needed:

    BPU s2 redirect: Flush IPrefetchPipe s0

    BPU s3 redirect: Flush IPrefetchPipe s0/1

    When the request in the corresponding IPrefetchPipe stage comes from a software prefetch (isSoftPrefetch === true.B), no flushing is required.

    When the request in the corresponding IPrefetchPipe stage comes from a hardware prefetch but the ftqIdx does not match the flush request, no flushing is required. 

  3. Logically, fence.i should flush MainPipe and IPrefetchPipe (because the data in these stages might be invalid), but in practice, io.fencei being high is always accompanied by a backend redirect. Therefore, in the current implementation, there is no need to flush MainPipe and IPrefetchPipe. 

  4. This RAS (Reliability, Availability, and Serviceability) is not the same as RAS (Return Address Stack). 

  5. RERI (RAS Error-record Register Interface). Refer to the RISC-V RERI manual