Memory Access Pipeline (LSU)
Sub-module List
| Sub-module | Description |
| --- | --- |
| LoadUnit | Load Instruction Execution Unit |
| StoreUnit | Store Address Execution Unit |
| StdExeUnit | Store Data Execution Unit |
| AtomicsUnit | Atomic Instruction Execution Unit |
| VLSU | Vector Memory Access Unit |
| LSQ | Load-Store Queue |
| Uncache | Uncache Processing Unit |
| SBuffer | Store Commit Buffer |
| LoadMisalignBuffer | Load Misalignment Buffer |
| StoreMisalignBuffer | Store Misalignment Buffer |
Design Specifications
Instruction Set Specifications
- Supports execution and writeback of Load / Store instructions in the RVI instruction set
- Supports RVA atomic instruction extension
- Supports RVH virtualization extension
- Supports RVV vector extension
- Supports Load / Store / Atomic access to Cacheable address spaces
- Supports Load / Store access to MMIO and Uncache address spaces (excluding vector memory access and misaligned memory access instructions)
- Supports the Zicbom and Zicboz cache-block operation instructions, and the Zicbop software prefetch instruction
- Supports misaligned memory access (Zicclsm), and guarantees atomicity of misaligned memory access within a 16B alignment range (Zama16b)
- Supports Sv39 and Sv48 paging mechanisms
- Supports contiguous page address translation (Svnapot)
- Supports page-based memory attributes (Svpbmt)
- Supports Pointer masking (Supm, Ssnpm, Sspm)
- Supports Compare-and-Swap atomic instruction (Zacas)
- Supports RVWMO memory consistency model
- Supports custom fault injection instructions
Microarchitectural Features
- Supports out-of-order scheduling of Load / Store instructions, including accesses to Cacheable and Uncache (non-MMIO) address spaces
- Supports out-of-order scheduling of vector memory access based on the scalar pipeline
- Supports element merging for unit-stride vector memory access
- Supports address and data separation for Store instruction issue and execution
- Supports Load instruction replay mechanism based on LoadQueue
- Supports non-speculative execution of atomic instructions
- Supports SBuffer to optimize Store instruction performance
- Supports data forwarding mechanism based on StoreQueue and SBuffer
- Supports detection and recovery of RAR / RAW memory access violations
- Supports MESI cache coherence protocol
- Supports multi-level cache access based on the TileLink bus
- Supports DCache SECDED check
- Supports software-configurable hardware prefetchers such as Stream, Stride, SMS
Parameter Configuration
| Parameter | Configuration |
| --- | --- |
| VAddr Bits | (Sv39) 39, (Sv48) 48 |
| GPAddr Bits | (Sv39x4) 41, (Sv48x4) 50 |
| LoadUnit | 3 x 8B/16B |
| StoreUnit | 2 x 8B/16B |
| StdExeUnit | 2 |
| LoadQueue | 72 |
| LoadQueueRAR | 72 |
| LoadQueueRAW | 32 |
| LoadQueueReplay | 72 |
| LoadUncacheBuffer | 4 |
| StoreQueue | 56 |
| StoreBuffer | 16 x 64B |
| VLMergeBuffer | 16 |
| VSMergeBuffer | 16 |
| VSegmentBuffer | 8 |
| VFOFBuffer | 1 |
| Load TLB | 48-entry fully associative |
| Store TLB | 48-entry fully associative |
| L1 Prefetch TLB | 48-entry fully associative |
| L2 Prefetch TLB | 48-entry fully associative |
| DCache | 64KB, 4-way set associative |
| DCache MSHR | 16 |
| DCache Probe Queue | 8 |
| DCache Way Predictor | Off |
Functional Description
The memory access pipeline is responsible for receiving memory access instructions (including Load / Store instructions for memory, MMIO, and Uncache address spaces, and atomic instructions for memory address space) from the issue queue, performing memory access operations based on the instruction type, obtaining the instruction execution result, and writing the result back to the register file. Simultaneously, it notifies the forwarding bypass network to wake up subsequent instructions and perform data forwarding.
Dispatch of Memory Access Instructions
Load and Store instructions involve complex control (ordering, forwarding, violation detection), so dedicated queues, the LoadQueue and StoreQueue, hold them in program order and perform this control. After decode and rename, Load / Store instructions are dispatched to the ROB and LSQ, allocated a robIdx, lqIdx, and sqIdx, and then enter the corresponding issue queue, where they wait for all source operands to be ready before being issued to the MemBlock pipeline. Load / Store instructions carry their lqIdx and sqIdx throughout their lifetime in MemBlock; these indices establish instruction order for memory access violation detection and data forwarding.
For scalar memory access instructions, one instruction is allocated one LoadQueue or StoreQueue entry.
For vector memory access instructions, one instruction is broken into several uops during the decode stage, and each uop contains several elements, where each element corresponds to one memory access operation. During the dispatch stage, a uop is therefore allocated one LSQ entry per element it contains, as sketched below.
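As a minimal behavioral sketch (not the RTL), entry allocation at dispatch can be modeled as follows; the name `alloc_lsq_entries` and the flat index arithmetic are illustrative only, and queue wraparound is ignored:

```python
def alloc_lsq_entries(next_free: int, is_vector: bool, elems_per_uop: int = 1):
    """Return the LSQ indices assigned to one dispatched uop."""
    count = elems_per_uop if is_vector else 1
    return list(range(next_free, next_free + count)), next_free + count

# A scalar load takes one LoadQueue entry ...
scalar_idx, nxt = alloc_lsq_entries(0, is_vector=False)
# ... while a vector uop with 4 elements takes 4 entries.
vector_idx, nxt = alloc_lsq_entries(nxt, is_vector=True, elems_per_uop=4)
print(scalar_idx, vector_idx)  # [0] [1, 2, 3, 4]
```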
Execution of Memory Access Instructions
The memory access unit includes 3 Load pipelines, 2 Store address pipelines, and 2 Store data pipelines. Each pipeline independently receives instructions issued from the corresponding issue queue and executes them.
The Load pipeline has a 4-stage structure (a behavioral sketch follows the list):
- s0: Calculates the memory access address, completes request arbitration from different sources (misaligned Load, Load replay, MMIO, prefetch, scalar Load, vector Load, etc.), accesses TLB, accesses DCache directory, sends writeback wake-up signals.
- s1: Receives the response from TLB address translation, receives the result of DCache directory read and performs way selection, accesses DCache data SRAM; performs RAW violation detection with Store instructions in StoreUnit s1; queries StoreQueue / LoadQueueUncache / SBuffer / DCache MSHR for data forwarding.
- s2: Queries LoadQueueRAR and LoadQueueRAW for violation checks by subsequent Load / Store instructions; if DCache miss, needs to allocate MSHR in s2; performs RAW violation detection with Store instructions in StoreUnit s1.
- s3: Writes back the result; if the Load cannot write back, cancels the early wake-up; if a memory access violation is detected, flushes the pipeline; if a replay is needed, the Load enters LoadQueueReplay.
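The stage responsibilities above can be read as the following behavioral sketch; it is purely illustrative (the real LoadUnit overlaps these steps across clock cycles in Chisel RTL), and all names here are hypothetical:

```python
def load_pipeline(req: dict, tlb: dict, dcache: dict):
    # s0: address generation; source arbitration and the early
    #     wake-up signal are not modeled
    vaddr = req["base"] + req["offset"]
    # s1: TLB response; a miss sends the Load to LoadQueueReplay (C_TM)
    paddr = tlb.get(vaddr)
    if paddr is None:
        return ("replay", "C_TM")
    # s2: RAR/RAW enqueue not modeled; a DCache miss allocates an MSHR
    #     and the Load replays once the refill arrives (C_DM)
    if paddr not in dcache:
        return ("replay", "C_DM")
    # s3: writeback of the loaded data
    return ("writeback", dcache[paddr])
```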
The Store address pipeline has a 4-stage structure:
- s0: Calculates the memory access address, completes request arbitration from different sources (misaligned Store, scalar Store, vector Store, etc.), accesses TLB.
- s1: Receives the response from TLB address translation; performs RAW violation detection with Load instructions in LoadUnit s1 and s2; queries LoadQueueRAW for violation checks.
- s2: Marks as address ready in StoreQueue.
- s3: Writeback.
The Store data pipeline receives data from the issue queue and writes the data to StoreQueue, marking it as data ready.
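A sketch of how the split issue might be tracked, assuming illustrative StoreQueue fields `addr_ready` / `data_ready` (an entry can commit or forward only after both are set):

```python
store_queue = [{"addr": None, "data": None,
                "addr_ready": False, "data_ready": False}
               for _ in range(56)]

def sta_writeback(sq_idx: int, paddr: int):
    """Store address pipeline (s2): mark the entry address-ready."""
    store_queue[sq_idx].update(addr=paddr, addr_ready=True)

def std_writeback(sq_idx: int, data: bytes):
    """Store data pipeline: mark the entry data-ready."""
    store_queue[sq_idx].update(data=data, data_ready=True)

def can_commit(sq_idx: int) -> bool:
    e = store_queue[sq_idx]
    return e["addr_ready"] and e["data_ready"]
```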
Execution of Vector Memory Access Instructions
For vector memory access instructions except Segment, VLSplit and VSSplit receive uops issued from the vector memory access issue queue and split the uops into several elements. VLSplit and VSSplit issue these elements to LoadUnit / StoreUnit for execution, and the execution process is the same as for scalar memory access. After element execution is complete, they are written back to VLMerge / VSMerge. The Merge modules are responsible for collecting elements, combining them into a uop, and writing back to the vector register file.
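As an illustration only (all names are hypothetical), the split/merge flow for a unit-stride uop can be modeled as generating one access per element and collecting all results before a single writeback:

```python
def split_unit_stride(base: int, eew_bytes: int, n_elems: int):
    """Yield (element index, element address) for one unit-stride uop."""
    for i in range(n_elems):
        yield i, base + i * eew_bytes

def execute_uop(base, eew_bytes, n_elems, load_elem):
    merge_buf = {}                        # stand-in for VLMergeBuffer
    for idx, addr in split_unit_stride(base, eew_bytes, n_elems):
        merge_buf[idx] = load_elem(addr)  # each element is one pipeline access
    # all elements collected: the uop writes back to the vector RF at once
    return [merge_buf[i] for i in range(n_elems)]

print(execute_uop(0x1000, 4, 4, load_elem=lambda a: a))
# [4096, 4100, 4104, 4108]
```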
Segment instructions are handled by the independent VSegmentUnit module.
Load Instruction Replay
Load instructions do not support replay from the issue queue. Therefore, when a Load instruction encounters any of the following situations, it enters LoadQueueReplay to wait for re-execution:
- C_MA: The memory dependence predictor (MDP) indicates that the Load has an address dependency on an older Store whose address is not yet ready.
- C_TM: TLB miss.
- C_FF: Load has an address dependency with an older Store, but the Store's data is not yet ready.
- C_DR: DCache miss and MSHR is full, or an MSHR for the same address cannot currently accept a new Load.
- C_DM: DCache miss; the current Load was successfully accepted by an MSHR.
- C_WF: Way predictor prediction failure (way predictor is off by default).
- C_BC: Accessing DCache causes a bank conflict.
- C_RAR: LoadQueueRAR is full.
- C_RAW: LoadQueueRAW is full.
- C_NK: Memory access violation occurred with a Store instruction from StoreUnit.
- C_MF: LoadMisalignBuffer is full.
LoadQueueReplay schedules re-execution according to the recorded replay cause, with priority decreasing in the order listed above.
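A minimal sketch of this fixed-priority selection, assuming the cause flags are kept in a vector ordered as in the list above (index 0, C_MA, is the highest priority):

```python
REPLAY_CAUSES = ["C_MA", "C_TM", "C_FF", "C_DR", "C_DM", "C_WF",
                 "C_BC", "C_RAR", "C_RAW", "C_NK", "C_MF"]

def select_replay_cause(cause_bits):
    """Return the highest-priority asserted cause, or None."""
    for name, asserted in zip(REPLAY_CAUSES, cause_bits):
        if asserted:
            return name
    return None

print(select_replay_cause([False, True, False, False, True, False,
                           False, False, False, False, False]))  # C_TM
```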
Store Instruction Replay
Store instruction replay is handled by the issue queue. After a Store instruction is issued from the issue queue, the issue queue does not immediately clear this Store instruction but waits for feedback from StoreUnit. StoreUnit sends corresponding feedback to the issue queue based on whether the TLB hits. If a TLB miss occurs, the issue queue is responsible for instruction replay.
Detection and Recovery of RAR Memory Access Violations
RAR Memory Access Violation: According to the RVWMO model, when (1) a write operation to the same (or an overlapping) address is inserted between two read operations to the same address, and (2) the results returned by the two reads come from different write operations, the two reads must preserve program order. In a single-core scenario, although the memory access unit executes Load instructions out of order, the data forwarding mechanism guarantees that two Loads to the same address return results consistent with program order. In a multi-core scenario, however, when a write operation from another core (note: a write operation, not necessarily a write instruction) is inserted between two reordered same-address Loads, the older Load reads the new value after the write while the younger Load reads the old value before it; this is an RAR memory access violation.
Detection of RAR Memory Access Violation: The LoadQueueRAR module in LoadQueue uses a FreeList structure to record Load instructions that have already executed while older Loads, possibly to the same address, have not yet executed. When a Load instruction reaches stage s2 of the LoadUnit (at this point address translation and the PMA / PMP checks are complete), it allocates a LoadQueueRAR entry. Once all program-order older Loads have been written back, the Load can be released from LoadQueueRAR. When a Load queries LoadQueueRAR and finds a younger Load to the same address whose cache block may have been observed by another core (the block has been replaced or probed), an RAR memory access violation has occurred and a rollback is required.
Recovery from RAR Memory Access Violation: When an RAR violation is detected, LoadUnit initiates a rollback, flushing the pipeline starting from the instruction after the older Load that caused the violation.
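A simplified model of the detection step, ignoring queue-index wraparound; entry fields such as `released` (the block has been replaced or probed since the younger Load executed) are illustrative:

```python
def rar_query(rar_entries, lq_idx: int, paddr_block: int) -> bool:
    """Run by an older Load: True if an RAR violation needs a rollback."""
    for e in rar_entries:
        is_younger = e["lq_idx"] > lq_idx     # executed early, younger in program order
        same_block = e["paddr_block"] == paddr_block
        if is_younger and same_block and e["released"]:
            return True
    return False

entries = [{"lq_idx": 7, "paddr_block": 0x80000040, "released": True}]
print(rar_query(entries, lq_idx=3, paddr_block=0x80000040))  # True -> rollback
```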
Detection and Recovery of RAW Memory Access Violations
RAW Memory Access Violation: The result of a Load instruction executed by a processor core should come from the most recent write operation visible to the current processor core in the global memory order. Specifically, if the most recent write operation comes from a Store instruction of the current core, the Load should receive the data written by this Store. Out-of-order superscalar processors speculatively execute Loads to optimize performance. Therefore, a Load instruction may execute before an older Store instruction to the same address, reading the old value before the Store, which is a RAW memory access violation.
Detection of RAW Memory Access Violation: The LoadQueueRAW module in LoadQueue uses a FreeList structure to record Load instructions that have already executed while the addresses of some older Stores are not yet ready. When a Load instruction reaches stage s2 of the LoadUnit (at this point address translation and the PMA / PMP checks are complete), it allocates a LoadQueueRAW entry. When all Store addresses in StoreQueue are ready, all Loads in LoadQueueRAW can be released; alternatively, a Load can be released once all of its program-order older Stores have their addresses ready. When a Store instruction queries LoadQueueRAW and finds a younger, already-executed Load with the same address, a RAW memory access violation has occurred and a rollback is required.
Recovery from RAW Memory Access Violation: When a RAW violation is detected, LoadQueueRAW initiates a rollback, flushing the pipeline starting from the instruction after the Store that caused the violation.
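A matching sketch for the store-side query, again ignoring index wraparound; a byte mask stands in for the real address-overlap check, and all field names are illustrative:

```python
def raw_query(raw_entries, store_sq_idx: int,
              store_block: int, store_mask: int) -> bool:
    """Run by a Store: True if a younger executed Load overlapped it."""
    for e in raw_entries:
        is_younger = e["sq_idx"] > store_sq_idx  # the Load is younger
        overlap = (e["paddr_block"] == store_block
                   and (e["byte_mask"] & store_mask) != 0)
        if is_younger and overlap:
            return True
    return False

entries = [{"sq_idx": 9, "paddr_block": 0x80000040, "byte_mask": 0x0F}]
print(raw_query(entries, store_sq_idx=5, store_block=0x80000040,
                store_mask=0x03))  # True -> rollback
```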
SBuffer Optimizing Store Instruction Performance
According to the RVWMO model, in a multi-core scenario, (without FENCE or other barrier-semantic instructions) a Store instruction from one core can become visible to other cores later than a younger Load instruction to a different address. This memory model rule is mainly to optimize the performance of Store instructions. Weak consistency models, including RVWMO, allow the inclusion of an SBuffer in the processor core to temporarily buffer write operations of committed Store instructions, merge these write operations, and then write them to DCache. This reduces contention for the DCache SRAM ports by Store instructions, thereby increasing the execution bandwidth of Load instructions.
SBuffer is a 16-entry x 64B fully associative structure. When multiple Store addresses fall within the same cache block, SBuffer merges these Stores into one entry.
SBuffer can accept up to 2 Store instructions per cycle, each writing up to 16B of data; the exception is the cbo.zero instruction, which writes a whole cache block at once.
SBuffer Eviction (a toy model follows this list):
- When SBuffer occupancy exceeds a threshold, an eviction is performed: a victim block is selected by the PLRU replacement algorithm and written back to DCache.
- SBuffer supports a passive flush mechanism; when FENCE / Atomic / Vector Segment instructions are executed, SBuffer is cleared.
- SBuffer supports a timeout flush mechanism; data blocks that have not been replaced for more than \(2^{20}\) cycles are evicted.
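A toy model of the merge-and-evict behavior, using FIFO victim selection for brevity where the real design uses PLRU; the threshold value and all names here are illustrative:

```python
BLOCK = 64          # cache-block size in bytes
THRESHOLD = 12      # illustrative occupancy threshold (16 entries total)
sbuffer: dict[int, bytearray] = {}   # block address -> block contents

def sbuffer_write(addr: int, data: bytes):
    blk, off = addr - addr % BLOCK, addr % BLOCK
    line = sbuffer.setdefault(blk, bytearray(BLOCK))  # merge on same block
    line[off:off + len(data)] = data
    if len(sbuffer) > THRESHOLD:                      # evict above threshold
        victim = next(iter(sbuffer))                  # FIFO stand-in for PLRU
        write_to_dcache(victim, sbuffer.pop(victim))

def write_to_dcache(blk: int, line: bytearray):
    pass   # stand-in for the DCache write port
```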
Store-to-Load Data Forwarding
The existence of SBuffer and speculative execution of Load instructions necessitate that Load instructions not only access DCache but also access SBuffer and StoreQueue. Therefore, SBuffer and StoreQueue are required to provide Store-to-Load data forwarding capability. When multiple sources hit simultaneously, LoadUnit needs to merge data from multiple sources. The merging priority is StoreQueue > SBuffer > DCache.
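A byte-granular sketch of this merge, applying the lowest-priority source first so higher-priority bytes overwrite it (DCache, then SBuffer, then StoreQueue); the masks and names are illustrative:

```python
def merge_load_data(dcache_bytes: bytes,
                    sbuf_bytes: bytes, sbuf_mask: list,
                    sq_bytes: bytes, sq_mask: list) -> bytes:
    out = bytearray(dcache_bytes)        # lowest priority: DCache
    for i in range(len(out)):
        if sbuf_mask[i]:
            out[i] = sbuf_bytes[i]       # SBuffer overrides DCache
        if sq_mask[i]:
            out[i] = sq_bytes[i]         # StoreQueue overrides both
    return bytes(out)

print(merge_load_data(b"\x00" * 4,
                      b"\x11" * 4, [1, 1, 0, 0],
                      b"\x22" * 4, [0, 1, 0, 0]).hex())  # 11220000
```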
Execution of MMIO Instructions
The XiangShan core only allows scalar memory access instructions to access the MMIO address space. MMIO accesses are strongly ordered with respect to all other memory access operations. Therefore, an MMIO instruction must wait until it reaches the head of the ROB before executing, i.e., until all older instructions have completed. For an MMIO Load, virtual-to-physical address translation must be complete and the PMA / PMP physical address checks must pass; for an MMIO Store, address translation must be complete, the physical address checks must pass, and the write data must be ready. LSQ then sends the memory access request to the Uncache module, which accesses the peripheral via the bus and returns the result for LSQ to write back to the ROB.
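The launch condition can be summarized in a small predicate; the entry fields and `rob_head_idx` are hypothetical names for the state described above:

```python
def mmio_can_launch(entry: dict, rob_head_idx: int) -> bool:
    """An MMIO access may go to Uncache only when non-speculative."""
    at_rob_head = entry["rob_idx"] == rob_head_idx     # all older insns done
    addr_ok = entry["translated"] and entry["pma_pmp_ok"]
    data_ok = entry["is_load"] or entry["data_ready"]  # stores need data
    return at_rob_head and addr_ok and data_ok
```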
Atomic instructions and vector instructions do not support MMIO access. If such instructions attempt to access the MMIO address space, a corresponding AccessFault exception will be reported.
Execution of Uncache Instructions
In addition to supporting access to non-idempotent, strongly-ordered MMIO address spaces, the XiangShan core also supports access to idempotent, weak-consistency (RVWMO) Non-cacheable address spaces, abbreviated as NC. Software configures the PBMT bit field of the page table to NC to override the original PMA attributes. Unlike MMIO access, NC access allows out-of-order memory access. NC Load execution has no side effects, so it can be speculatively executed.
Memory access instructions identified as NC addresses (PBMT = NC) in the LoadUnit / StoreUnit pipeline are marked in LSQ. LSQ is responsible for sending NC accesses to the Uncache module. Uncache supports processing multiple NC requests simultaneously, supports request merging, and is responsible for forwarding Stores to NC Loads currently executing in LoadUnit.
Atomic instructions and vector instructions do not support NC access. If such instructions attempt to access the NC address space, a corresponding AccessFault exception will be reported.
Misaligned Memory Access
The XiangShan core supports misaligned access for scalar and vector memory access instructions to the Memory space.
- Scalar misaligned access that does not cross a 16B boundary can be accessed normally without extra processing.
- Scalar misaligned access that crosses a 16B boundary is split into 2 aligned memory access operations in the MisalignBuffer. After both complete, the MisalignBuffer concatenates the results and writes back (see the sketch at the end of this subsection).
- Vector non-Segment Unit-stride instructions access a contiguous address space. After element merging, they access a contiguous 16B at once, so no extra processing is needed.
- For vector non-Segment instructions other than unit-stride, element splitting and address calculation are completed in the VSplit module and the elements are sent to the pipeline. If an element is misaligned, it is sent to the MisalignBuffer. The remaining flow is the same as for misaligned scalar accesses, except that the MisalignBuffer finally writes back to VMerge instead of directly to the backend.
- Misaligned processing for Vector Segment instructions is completed independently by VSegmentUnit and does not reuse the scalar memory access path, but is done through an independent state machine.
Atomic instructions do not support misaligned access. Neither MMIO nor NC address spaces support misaligned access. These cases will report an AccessFault exception.
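The 16B-boundary split described in the list above can be sketched as follows; `split_misaligned` is an illustrative name, and per-half exception handling is omitted:

```python
ALIGN = 16

def split_misaligned(addr: int, size: int):
    """Return the (addr, size) pairs of the accesses actually issued."""
    end = addr + size
    boundary = (addr // ALIGN + 1) * ALIGN
    if end <= boundary:
        return [(addr, size)]        # within 16B: a single normal access
    # crosses 16B: two aligned halves, concatenated by MisalignBuffer
    return [(addr, boundary - addr), (boundary, end - boundary)]

assert split_misaligned(0x100E, 4) == [(0x100E, 2), (0x1010, 2)]
assert split_misaligned(0x1002, 4) == [(0x1002, 4)]
```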
Execution of Atomic Instructions
The XiangShan core supports the RVA and Zacas instruction sets. The current XiangShan design fetches the cache block accessed by an atomic instruction into DCache before executing the atomic operation.
The memory access unit snoops the addresses and data issued by the Store issue queues. If an instruction is atomic, it enters the AtomicsUnit, which completes TLB address translation, flushes SBuffer, accesses DCache, and performs the remaining steps of the operation.
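The sequence can be listed as ordered steps in a sketch (the real AtomicsUnit is a state machine; the callables here are stand-ins, not real interfaces):

```python
def execute_amo(tlb, flush_sbuffer, dcache_amo, vaddr: int, op, src):
    paddr = tlb(vaddr)            # address translation + PMA/PMP checks
    flush_sbuffer()               # drain committed stores before the AMO
    # the cache block is fetched into DCache, then the atomic operation
    # is performed on it and the old value is returned for writeback
    return dcache_amo(paddr, op, src)
```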
Overall Design
Overall Block Diagram and Pipeline Stages