CoupledL2
List of Submodules
The CoupledL2 top level is divided into 4 Slices (by default), an MMIOBridge, and a Prefetcher.
The Prefetcher includes the local L2 prefetcher, Best-Offset Prefetch (BOP), and the L1 DCache prefetch receiver, which receives prefetch requests that are trained in the DCache but need to be fetched into L2.
MMIOBridge is an MMIO request forwarding bridge used to convert MMIO requests on the TileLink bus to CHI requests, arbitrate them with CHI requests for the cacheable address space, and access the interconnect bus through a unified CHI interface.
The 4 Slices of CoupledL2 are partitioned based on the lower bits of the address: requests and prefetches to different addresses are distributed to different Slices.
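As a rough illustration (not the actual RTL), the address-to-Slice mapping can be thought of as taking the bits just above the 64B block offset; the exact bit positions and the `SliceSelect` helper below are assumptions for illustration only.

```scala
// Minimal sketch of mapping a physical address to one of the 4 Slices.
// Assumes the bank-select bits sit directly above the 64B block offset;
// the actual bit positions are configuration-dependent.
object SliceSelect {
  val offsetBits = 6                       // log2(64B cache line)
  val numSlices  = 4

  def sliceId(paddr: BigInt): Int =
    ((paddr >> offsetBits) & (numSlices - 1)).toInt

  def main(args: Array[String]): Unit =
    // Consecutive cache lines 0x0, 0x40, 0x80, 0xC0 land in Slices 0..3.
    Seq(0x0, 0x40, 0x80, 0xC0).foreach { a =>
      println(f"paddr=0x$a%x -> slice ${sliceId(BigInt(a))}")
    }
}
```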
The list of submodules within each Slice is as follows:
Submodule | Description |
---|---|
SinkA | Upstream TileLink bus A channel controller |
SinkC | Upstream TileLink bus C channel controller |
GrantBuffer | Upstream TileLink bus D/E channel controller |
TXREQ | Downstream CHI bus TXREQ channel controller |
TXDAT | Downstream CHI bus TXDAT channel controller |
TXRSP | Downstream CHI bus TXRSP channel controller |
RXSNP | Downstream CHI bus RXSNP channel controller |
RXDAT | Downstream CHI bus RXDAT channel controller |
RXRSP | Downstream CHI bus RXRSP channel controller |
Directory | Directory, SRAM storing metadata information |
DataStorage | Data SRAM |
RefillBuffer | Refill data register file |
ReleaseBuffer | Release data register file |
RequestBuffer | A channel request buffer |
RequestArb | Request arbitrator, main pipeline stages s0~s2 |
MainPipe | Main pipeline stages s3~s5 |
MSHRCtl | MSHR (Miss Status Holding Registers) control module, includes 16 MSHR entries by default |
Design Specifications
- Interconnect with upstream L1Cache / PTW uses TileLink bus protocol
- Interconnect with downstream HN-F uses CHI bus protocol, supports CHI bus versions B/C/E.b (default E.b)
- Supports the following CHI Read transactions:
  - ReadNoSnp (B/C/E.b) (only for MMIO and Uncache requests)
  - ReadNotSharedDirty (B/C/E.b)
  - ReadUnique (B/C/E.b)
- Supports the following CHI Dataless transactions:
  - MakeUnique (B/C/E.b)
  - Evict (B/C/E.b)
  - CleanShared (B/C/E.b)
  - CleanInvalid (B/C/E.b)
  - MakeInvalid (B/C/E.b)
- Supports the following CHI Write transactions:
  - WriteNoSnpPtl (B/C/E.b) (only for MMIO and Uncache requests)
  - WriteBackFull (B/C/E.b)
  - WriteCleanFull (B/C/E.b)
  - WriteEvictOrEvict (E.b)
- Supports the following CHI Snoop transactions:
  - SnpOnceFwd (B/C/E.b)
  - SnpOnce (B/C/E.b)
  - SnpStashUnique (B/C/E.b)
  - SnpStashShared (B/C/E.b)
  - SnpCleanFwd (B/C/E.b)
  - SnpClean (B/C/E.b)
  - SnpNotSharedDirtyFwd (B/C/E.b)
  - SnpNotSharedDirty (B/C/E.b)
  - SnpSharedFwd (B/C/E.b)
  - SnpShared (B/C/E.b)
  - SnpUniqueFwd (B/C/E.b)
  - SnpUnique (B/C/E.b)
  - SnpUniqueStash (B/C/E.b)
  - SnpCleanShared (B/C/E.b)
  - SnpCleanInvalid (B/C/E.b)
  - SnpMakeInvalid (B/C/E.b)
  - SnpMakeInvalidStash (B/C/E.b)
  - SnpQuery (E.b)
- 1MB capacity, 8-way set-associative structure, partitioned into 4 Slices based on the lower bits of the address
- Cache line size 64B, bus data width 32B, a complete cache line transfer requires 2 beats of data transfer
- Uses a MESI-like cache coherence protocol
- Uses a strict inclusion policy with DCache, and a non-strict inclusion policy with ICache / PTW
- Uses a non-blocking main pipeline structure
- Maximum access parallelism is 4 × 16 (each Slice contains 16 MSHR entries, 4 Slices in total); up to 15 MSHR entries per Slice can be used for L1Cache / PTW accesses
- Supports parallel access for requests to the same set
- Supports selecting the replacement way and performing replacement only after receiving refill data for an L2 cache miss
- Supports merging of memory access requests and prefetch requests
- Supports generating an L2 Refill Hint signal for early wake-up of Load instructions
- Supports BOP prefetcher
- Supports handling prefetch requests trained in L1 and written back to L2
- Supports replacement algorithms like DRRIP / PLRU, default is DRRIP
- Supports hardware handling of Cache Aliasing
- Supports handling of MMIO requests. MMIO requests are converted from TileLink bus to CHI bus in CoupledL2 and arbitrated with cacheable requests initiated by the 4 Slices.
Functional Description
CoupledL2 receives TileLink write-back and replacement requests from the XiangShan core's DCache / ICache / PTW, completes the transfer of the corresponding data blocks and the associated coherence state transitions, and acts as an RN-F in the on-chip network, maintaining the XiangShan core's cache coherence within the on-chip interconnect system.
The CoupledL2 module receives requests through the upstream TileLink channel controllers (SinkA / SinkC) and converts them into internal requests. Requests enter the main pipeline through the request arbitrator, read the directory to obtain the cache block status, and determine whether they can be processed based on the cache block status and request information:
- If the request can be processed directly by this level of cache, it continues in the main pipeline for data read, directory update, etc., then enters the GrantBuffer to be converted into a TileLink bus response.
- If interaction with other caches is required to process the request, an MSHR is allocated for it. The MSHR sends sub-requests to upstream/downstream caches as needed. After receiving responses and meeting the release conditions, the task is released and re-enters the main pipeline for reading buffers, reading/writing data, updating the directory, etc., then enters the channel controller modules to be converted into a TileLink bus response.
When all operations required for a request are completed in the MSHR, the MSHR is released, waiting to receive new requests.
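The overall decision can be summarized with the following behavioral sketch (plain Scala, not the CoupledL2 RTL); the names `DirResult`, `HandleInPipe`, and `AllocateMshr` are illustrative.

```scala
// Behavioral sketch of the main-pipeline decision after the directory is read
// (illustrative names, not the actual CoupledL2 code).
object PipeDecisionSketch {
  sealed trait PipeOutcome
  case object HandleInPipe extends PipeOutcome // read data, update directory, respond via GrantBuffer
  case object AllocateMshr extends PipeOutcome // MSHR issues sub-requests up/downstream, task re-enters pipeline later

  final case class DirResult(hit: Boolean, permissionSufficient: Boolean)

  def decide(dir: DirResult): PipeOutcome =
    if (dir.hit && dir.permissionSufficient) HandleInPipe else AllocateMshr
}
```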
Uses a MESI-like Cache Coherence Protocol
The XiangShan core's cache subsystem follows the rules of the TileLink coherence tree. CoupledL2 uses 4 cache line states: N (Nothing), B (Branch), T (Trunk), TT (Tip):
- N: Invalid
- B: Read-only permission
- T: The current core holds write permission, but the write permission resides in an upstream cache; this cache level itself is neither readable nor writable.
- TT: Readable and writable.
The coherence tree grows from bottom to top in the order Memory, L3, L2, L1. Memory, as the root node, has read and write permission, and a child node's permission cannot exceed its parent's. TT denotes the uppermost node holding T permission (i.e., the leaf of the T-permission tree), meaning that only N or B permissions exist above it. Conversely, a node holding T but not TT permission implies that a node with T/TT permission must exist above it. For the detailed rules, please refer to the TileLink manual.
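The permissions implied by the four states can be summarized as follows (a minimal sketch; the encoding and helper names are illustrative, not the actual RTL encoding).

```scala
// Sketch of the four CoupledL2 coherence states and their effective permissions at
// this cache level (encoding and helper names are illustrative).
object CoherenceStateSketch {
  object State extends Enumeration {
    val N, B, T, TT = Value          // Nothing, Branch, Trunk, Tip (topmost Trunk)
  }
  import State._

  // B is read-only; TT is readable and writable at this level.
  // T means write permission exists on this core but is held upstream,
  // so this level itself is neither readable nor writable.
  def readable(s: State.Value): Boolean = s == B || s == TT
  def writable(s: State.Value): Boolean = s == TT
}
```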
Uses a Directory to Record Cache Line Information
CoupledL2 is an inclusive cache based on a directory structure (the "directory" here is used in a broad sense, covering both metadata and Tags). The metadata includes: the state bits, the dirty bit, which upstream caches hold the line (clients), the alias bits of the upstream copy (alias), whether the line was prefetched (prefetch), which prefetcher issued it (prefetchSrc), and whether the line has been accessed (accessed).
In pipeline stage s1, RequestArb initiates a read request to the directory, which reads the Tag Array and determines whether the access hits. On a hit, the hit way is selected; on a miss, a replacement way is selected according to the replacement algorithm. The metadata of the selected way is returned to the s3 stage of MainPipe.
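For illustration, a directory metadata entry could be sketched as the following Chisel bundle; the field widths and the name `MetaEntrySketch` are assumptions and do not reproduce the actual CoupledL2 definitions.

```scala
import chisel3._

// Illustrative sketch of one directory metadata entry, mirroring the fields
// described above; widths and field names do not reproduce the actual RTL.
class MetaEntrySketch extends Bundle {
  val state       = UInt(2.W)   // N / B / T / TT
  val dirty       = Bool()      // line holds dirty data
  val clients     = UInt(2.W)   // which upstream caches (e.g. DCache / ICache) hold the line
  val alias       = UInt(2.W)   // alias bits of the copy held in the upstream VIPT cache
  val prefetch    = Bool()      // line was brought in by a prefetch
  val prefetchSrc = UInt(2.W)   // which prefetcher issued it (BOP, L1 receiver, ...)
  val accessed    = Bool()      // line has been touched by a demand access
}
```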
Uses a Non-blocking Pipeline Structure
CoupledL2 uses a main pipeline architecture. Requests from various channels pass through arbitration to enter the main pipeline, perform directory operations, and then arrange the corresponding operations based on the request information and directory results:
Acquire Request Processing Flow
As shown in the figure below.
Snoop Request Processing Flow
As shown in the figure below.
Release Request Processing Flow
The Release request processing flow is as follows:
- SinkC receives a Release request from the L1 DCache and converts it into an internal request.
- At s1, the Release request enters the pipeline and queries the directory.
- At s3, the directory query result is obtained (since L1 DCache and L2 have a strict inclusion relationship, a Release is guaranteed to hit) and the directory is written. If the Release carries dirty data, it is written to DataStorage at s3.
- s3 generates a ReleaseAck response, which leaves the pipeline at one of stages s3~s5 and enters the GrantBuffer to be returned to L1 (see the sketch after this list).
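A minimal Chisel-style sketch of the s3 actions for a Release task, assuming illustrative signal names (`wenDataStorage`, `wenDirectory`, `sendReleaseAck`) rather than the actual MainPipe interface:

```scala
import chisel3._

// Chisel-style sketch of the s3 actions for a Release task (illustrative signal names,
// not the actual MainPipe interface). A Release always hits because L2 is strictly
// inclusive of the DCache.
class ReleaseS3Sketch extends Module {
  val io = IO(new Bundle {
    val releaseHasDirtyData = Input(Bool())  // ReleaseData carrying dirty data
    val wenDataStorage      = Output(Bool()) // write the dirty data into DataStorage at s3
    val wenDirectory        = Output(Bool()) // update metadata (state / clients / dirty)
    val sendReleaseAck      = Output(Bool()) // ReleaseAck returned to L1 via GrantBuffer
  })
  io.wenDataStorage := io.releaseHasDirtyData
  io.wenDirectory   := true.B
  io.sendReleaseAck := true.B
}
```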
Delays Replacement Way Selection and Replacement until Receiving Refill Data
When a cache receives a new request but the corresponding set is full, conventional logic first selects a replacement way and writes it back to the lower-level cache to free space for the missing data about to be refilled, then waits for the new data block to be refilled from the lower-level cache and writes it. However, this approach has some issues:
- On the one hand, refilling from a lower-level cache often takes a long time (tens to thousands of cycles). During this window the old data block has already been released while the new one has not yet arrived, so the slot holds no valid data, which idles and wastes cache resources and reduces the effective capacity of the cache.
- On the other hand, if the upper-level cache needs to access the replaced data block again during this window, the block has already been released and can only be fetched from the lower-level cache again, which greatly increases access latency.
CoupledL2 delays the selection of the replacement way and the release of the replaced data until the refill data is received. Specifically, when a request enters the cache, it reads the directory to determine whether it hits. If it hits, the data is read and returned (the standard flow). If it misses, CoupledL2 does not immediately select a replacement block based on the directory result or schedule its release; it only allocates an MSHR entry for the request and sends a request to the lower-level cache to fetch the data. After the lower level returns the refill data, the MSHR task reads the directory again, selects the replacement block at that point, reads the replacement block's data from the data storage unit, and releases it to the lower-level cache. Finally, the new data block is written into the storage unit.
Since the interaction with DataStorage occurs only at the s3 stage of MainPipe, and the DataStorage SRAM is single-ported, one MSHR task cannot simultaneously (1) read out the data block being replaced and release it to the lower-level cache, and (2) write the new data block. These two steps are therefore split into an MSHR Refill task and an MSHR Release task, with Refill issued before Release. Based on two additional considerations, namely that (1) the old data must be read before the new data is written and (2) data must be returned to L1 as soon as possible, the two MSHR tasks are arranged as follows (see the sketch after the list):
- MSHR Refill: read RefillBuffer to obtain the refill data and return it to L1; read DataStorage to obtain the old data and store it in ReleaseBuf; update the directory with the metadata of the new data.
- MSHR Release: read ReleaseBuf and release the old data to L3; read RefillBuffer and write the refill data to DataStorage.
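A minimal sketch of this task split, with illustrative names (`MshrRefill`, `MshrRelease`) rather than the actual task encodings:

```scala
// Sketch of the two MSHR tasks used for a miss that triggers a replacement
// (illustrative names, not the actual CoupledL2 task fields).
object MshrReplaceFlow {
  sealed trait MshrTask
  case object MshrRefill  extends MshrTask // read RefillBuffer -> data to L1; read old line from DataStorage -> ReleaseBuf; update directory
  case object MshrRelease extends MshrTask // read ReleaseBuf -> release old line to L3; read RefillBuffer -> write new line to DataStorage

  // Refill is issued before Release: the single-ported DataStorage is only touched at s3,
  // the old data must be read before the new line overwrites it, and L1 should get data ASAP.
  val issueOrder: Seq[MshrTask] = Seq(MshrRefill, MshrRelease)
}
```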
Supports Parallel Access for Requests to the Same Set
CoupledL2 supports parallel access for multiple requests to the same set. Since these requests do not need to select a replacement way before their refill data arrives, they can proceed in parallel up to that point. After the refill data is received, the MSHR starts selecting a replacement way and writes the victim block back to the lower-level cache. When selecting a replacement way, the directory must not choose a way that is currently being replaced, which guarantees that concurrent requests to the same set select different replacement ways.
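A Chisel-style sketch of the masking involved in victim selection, assuming an illustrative `occupiedWays` vector of ways already claimed by other in-flight replacements (the real directory additionally consults the DRRIP / PLRU state):

```scala
import chisel3._
import chisel3.util._

// Sketch of replacement-way selection that excludes ways already chosen as victims by
// other in-flight requests to the same set (illustrative, not the actual directory logic).
class VictimSelectSketch(nWays: Int = 8) extends Module {
  val io = IO(new Bundle {
    val occupiedWays = Input(UInt(nWays.W))   // ways currently being replaced by other MSHRs
    val victimWay    = Output(UInt(log2Ceil(nWays).W))
    val noFreeWay    = Output(Bool())
  })
  val candidates = ~io.occupiedWays           // ways still eligible as victims
  io.victimWay := PriorityEncoder(candidates)
  io.noFreeWay := !candidates.orR
}
```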
Load Instruction Early Wake-up
Each time CoupledL2 refills data to the L1 DCache, it sends a Refill Hint signal to the LoadQueue inside the core 3 cycles before the GrantData is issued. Upon receiving the wake-up signal, LoadQueueReplay immediately wakes up the Loads that need to be reissued and sends them to the LoadUnit. These Load instructions receive the refill data at pipeline stage s2/s3 of the LoadUnit, which reduces the access latency of Loads that miss in L1.
Supports Hardware Prefetching
CoupledL2's hardware prefetcher simultaneously receives BOP prefetch requests and prefetch requests sent by L1 DCache and sends these requests into the prefetch queue (PrefetchQueue). When the prefetch queue is full, it automatically discards the oldest prefetch request at the head of the queue, allowing newer prefetch requests to enter the queue, thus ensuring the timeliness of prefetching.
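The queue policy can be sketched behaviorally as below; `PrefetchQueueSketch` is an illustrative model, not the actual PrefetchQueue implementation.

```scala
// Behavioral sketch of the prefetch queue policy: when full, the oldest entry at the
// head is dropped so the newest prefetch can still enter (names are illustrative).
class PrefetchQueueSketch[T](depth: Int) {
  private val q = scala.collection.mutable.Queue.empty[T]

  def enqueue(req: T): Unit = {
    if (q.size == depth) q.dequeue()   // discard the stalest prefetch
    q.enqueue(req)
  }

  def dequeueOption(): Option[T] =
    if (q.nonEmpty) Some(q.dequeue()) else None
}
```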
Supports Request Fusion
Experimental observations show that a significant proportion of L2 Cache prefetches are untimely: the prefetcher correctly predicts data that will be needed, but the prefetch is issued too late. While a prefetch-induced miss is still waiting in an MSHR for data from the lower-level cache, an Acquire request for the same address may already arrive at L2. To prevent such Acquire requests from being blocked at the RequestBuffer entrance, filling it up and preventing subsequent requests from entering, L2 implements a request fusion mechanism that merges untimely prefetches with subsequent Acquire requests to the same address. The request fusion function is implemented as follows:
- At the RequestBuffer entrance of the SinkA channel, check whether an A-channel request from L1 meets the merging conditions: the new request is an Acquire, and there is an outstanding miss in the MSHRs that is a Prefetch to the same address (see the sketch after this list).
- If the merging conditions are met, the new request does not need to enter the queue and be blocked, but instead directly enters the MSHR entry corresponding to the Prefetch with the same address, marks this entry with mergeA, and adds a series of request status information to make it contain the content of both requests.
- When the target data returns from L3, the MSHR entry is woken up and sends a task to the main pipeline for processing. In the main pipeline, a replacement way is selected and the new data is refilled, while the metadata of the data block is updated to the state after the Acquire request is processed. At the same time, this request will also pass information to the prefetcher for training.
- When processing the request response, this merged request will enter the GrantBuffer from the main pipeline. For Prefetch requests, L2 returns a prefetch response; for Acquire requests, L2 returns data and a response to the upstream node that issued the Acquire through the grantQueue.
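As referenced in the first step above, the merge check can be sketched as follows; `MshrStatus`, `NewReq`, and `mergeAMatch` are illustrative names, not the actual CoupledL2 structures.

```scala
// Behavioral sketch of the mergeA check performed at the RequestBuffer entrance
// (illustrative types and names, not the actual CoupledL2 code).
object MergeASketch {
  final case class MshrStatus(valid: Boolean, isPrefetch: Boolean, set: Int, tag: Int)
  final case class NewReq(isAcquire: Boolean, set: Int, tag: Int)

  // Returns the index of an MSHR the new Acquire can merge into, if any:
  // the MSHR must hold an outstanding Prefetch miss to the same address.
  def mergeAMatch(req: NewReq, mshrs: Seq[MshrStatus]): Option[Int] =
    if (!req.isAcquire) None
    else mshrs.zipWithIndex.collectFirst {
      case (m, idx) if m.valid && m.isPrefetch && m.set == req.set && m.tag == req.tag => idx
    }
}
```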
Supports Hardware Handling of Cache Aliasing
All L1 Caches in the XiangShan core use VIPT indexing. The DCache is a 64KB 4-way set-associative structure, so the index and block offset used to access the DCache exceed the offset of a 4KB page, which introduces the cache alias problem. As shown in the figure, when two virtual pages map to the same physical page, the alias bits of the two virtual pages (the portion of the index beyond the 4KB page offset) are very likely to differ. Without additional handling, cache blocks from the two virtual pages would be placed in different sets of the DCache after VIPT indexing, so the same physical address would be cached twice in the DCache, introducing cache coherence errors.
The XiangShan core solves the cache alias problem in hardware through CoupledL2. Specifically, CoupledL2 records the alias bits of the data held upstream, ensuring that a physical cache block has at most one alias value in the L1 DCache. When the upstream cache sends an Acquire request, it carries its alias bits; L2 Cache checks the directory, and if the access hits but the alias differs, it probes the upstream cache using the previously recorded alias bits and writes the Acquire's alias bits into the directory.
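A small sketch of where the alias bits come from for the 64KB / 4-way VIPT DCache (helper names are illustrative; the 2-bit alias width follows from the sizes given above):

```scala
// Sketch of how the alias bits arise: each DCache way covers 16KB, so the untranslated
// index+offset spans vaddr[13:0], while a 4KB page only fixes vaddr[11:0].
// Bits [13:12] are the alias bits that L2 records per cache line.
object AliasBitsSketch {
  val pageOffsetBits  = 12                                        // 4KB page
  val wayBytes        = 64 * 1024 / 4                             // 16KB per DCache way
  val indexOffsetBits = Integer.numberOfTrailingZeros(wayBytes)   // 14 untranslated vaddr bits
  val aliasWidth      = indexOffsetBits - pageOffsetBits          // 2 alias bits

  def aliasBits(vaddr: Long): Int =
    ((vaddr >> pageOffsetBits) & ((1 << aliasWidth) - 1)).toInt

  // Two virtual pages mapping to the same physical page may differ in aliasBits;
  // L2 keeps one alias value per line and probes L1 when an Acquire carries a new one.
}
```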
Overall Design
Overall Block Diagram
The XSTile (comprising the XiangShan core and CoupledL2) structure block diagram is shown in the figure below.
The CoupledL2 microarchitecture block diagram is shown in the figure below.