Store Queue (StoreQueue)

Function Description

StoreQueue is a queue used to hold all store instructions. Its functions are as follows:

  • Tracks the execution status of store instructions.

  • Stores store data and tracks data status (whether it has arrived).

  • Provides a query interface for loads, so that a load can forward data from an older store to the same address.

  • Responsible for the execution of MMIO store and NonCacheable store instructions.

  • Writes stores committed by ROB to the sbuffer.

  • Maintains address and data ready pointers, used for releasing LoadQueueRAW and waking up LoadQueueReplay.

Store instructions are optimized by decoupling address and data dispatch: the StoreUnit is the pipeline for the store address, and the StdExeUnit is the pipeline for the store data. The two pipelines are fed by two different reservation stations. As soon as the store data is ready, it can be dispatched to the StdExeUnit; as soon as the store address is ready, it can be dispatched to the StoreUnit.

  • Each entry in the StoreQueue saves the basic information of a store instruction:

Basic information stored in StoreQueue

Field        Description
uop          Store instruction uop
dataModule   128-bit data and valid mask
paddrModule  Physical address
vaddrModule  Virtual address

  • Each entry in the StoreQueue has several status bits to indicate the state of the store:

Status information stored in StoreQueue

Field          Description
allocated      Set when this entry is allocated, starting the lifecycle tracking for this store. Cleared when this store instruction is committed to the Sbuffer.
addrvalid      Indicates whether address translation to a physical address has been completed, used for CAM comparison during load forward checks.
datavalid      Indicates whether the store data has been dispatched and is available.
committed      Whether the store has been committed by the ROB.
unaligned      Unaligned store
cross16Byte    Crosses a 16-byte boundary
pending        Whether this store is in MMIO space, primarily used to control the MMIO state machine.
nc             NonCacheable store
mmio           MMIO store
atomic         Atomic store
memBackTypeMM  Whether the PMA is of main memory type
prefetch       Whether prefetching is needed when committed to Sbuffer.
isVec          Vector store
vecLastFlow    Last uop of a vector store flow
vecMbCommit    Vector store committed from merge buffer to ROB
hasException   Store instruction has an exception
waitStoreS2    Waiting for MMIO and exception results from Store Unit s2

Feature 1: Data Forwarding

  • Loads need to query the StoreQueue to find the data from the most recent dependent store with the same address that is older than the load.

    • Compare the sqIdx on the query bus (io.forward.sqIdx) with the StoreQueue's enqPtr pointer to find all entries in the StoreQueue that are older than the load instruction. There are two cases, depending on whether the flag bits of the two pointers are the same or different.

    • If the flag is the same, the range of older stores is [tail, sqIdx - 1], as shown in Figure \ref{fig:LSQ-StoreQueue-Forward-Mask} a). Otherwise, the queue has wrapped and the range of older stores is [tail, StoreQueueSize - 1] together with [0, sqIdx - 1], as shown in Figure \ref{fig:LSQ-StoreQueue-Forward-Mask} b).

    Figure: StoreQueue Forward Range Generation

    • The query bus looks up the virtual and physical addresses at the same time. If the physical address matches but the virtual address does not, or the virtual address matches but the physical address does not, the load is marked as a replayInst; once the load reaches the ROB head, the instruction is refetched and re-executed.

    • If only one matching entry is found and the data is ready, forward directly.

    • If only one matching entry is found and the data is not ready, the reservation station is responsible for resending the request.

    • If multiple matches are found, forward from the most recent of the matching older stores, that is, the one closest before the load in program order.

    • The StoreQueue operates on a 1-byte unit and uses tree-based data selection logic, as shown in Figure \ref{fig:LSQ-StoreQueue-Forward}.

Figure: StoreQueue Forward Data Selection
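The range generation described above can be sketched in Python. This is a minimal behavioral model under stated assumptions, not the RTL: the function name `older_store_mask`, the constant `QUEUE_SIZE`, and the `(flag, value)` tuple encoding of circular pointers are all illustrative, `tail` is taken to be the oldest valid entry, and sqIdx is treated as exclusive in both cases.

```python
QUEUE_SIZE = 8  # hypothetical queue depth, chosen only for this example


def older_store_mask(tail, sq_idx, size=QUEUE_SIZE):
    """Return a list of booleans marking the entries older than the load.

    tail and sq_idx are (flag, value) circular pointers; the flag toggles
    every time the pointer wraps around the end of the queue.
    """
    tail_flag, tail_val = tail
    load_flag, load_val = sq_idx
    if tail_flag == load_flag:
        # Same flag: one contiguous range [tail, sqIdx - 1].
        older = list(range(tail_val, load_val))
    else:
        # Different flag: [tail, size - 1] plus the wrapped part [0, sqIdx - 1].
        older = list(range(tail_val, size)) + list(range(0, load_val))
    mask = [False] * size
    for i in older:
        mask[i] = True
    return mask


def indices(mask):
    """Helper for readability: positions whose mask bit is set."""
    return [i for i, bit in enumerate(mask) if bit]
```

For example, `indices(older_store_mask((0, 2), (0, 5)))` selects entries 2 to 4, while `indices(older_store_mask((0, 6), (1, 2)))` selects entries 0, 1, 6, and 7.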

  • Stores participating in data forwarding need to satisfy:

    • allocated: The store is still in the store queue and has not been written to the sbuffer.

    • datavalid: The data for this store is ready.

    • addrvalid: Address translation for this store has completed, and the physical address has been obtained.

    • If the memory dependency predictor is enabled, the SSID (Store-Set-ID) carries history about earlier load prediction failures. If the current load hits an SSID in that history, it waits for all older stores to complete execution. If it does not hit, it waits only for older stores with the same physical address to complete execution.
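As a rough illustration of byte-granular forwarding with the eligibility bits listed above, the following Python sketch selects, for each requested byte, the most recent older store that writes it. The `StoreEntry` fields `addr`, `mask`, and `data`, the function `forward_bytes`, and the `age_order` argument are hypothetical simplifications; the real design uses CAM matching and a selection tree rather than a loop, and the memory dependency predictor is not modeled.

```python
from dataclasses import dataclass


@dataclass
class StoreEntry:
    allocated: bool
    addrvalid: bool
    datavalid: bool
    addr: int    # address of the 16-byte block (simplifying assumption)
    mask: int    # bit i set means byte i of the block is written
    data: bytes  # 16 bytes; only the masked bytes are meaningful


def forward_bytes(entries, age_order, load_addr, load_mask):
    """Forward data to a load, one byte at a time.

    age_order lists the indices of the stores older than the load,
    oldest first. Returns (forwarded, data_invalid), where forwarded
    maps a byte index to its value and data_invalid reports that a
    matching store did not have its data ready yet.
    """
    forwarded = {}
    data_invalid = False
    for byte in range(16):
        if not (load_mask >> byte) & 1:
            continue
        # Walk from youngest to oldest so the most recent store wins.
        for idx in reversed(age_order):
            e = entries[idx]
            hit = (e.allocated and e.addrvalid
                   and e.addr == load_addr and (e.mask >> byte) & 1)
            if not hit:
                continue
            if e.datavalid:
                forwarded[byte] = e.data[byte]
            else:
                data_invalid = True  # the request has to be resent later
            break
    return forwarded, data_invalid
```

With two older stores to the same block, where the older one writes bytes 0 to 3 and the younger one writes bytes 2 and 3, a 4-byte load receives bytes 0 and 1 from the older store and bytes 2 and 3 from the younger one.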

Feature 2: Unaligned Store Instructions

StoreQueue supports processing unaligned Store instructions. Each unaligned Store instruction occupies one entry and is written to the dataBuffer after address and data alignment.

Feature 3: Vector Store Instructions

As shown in Figure \ref{fig:LSQ-StoreQueue-Vector}, StoreQueue pre-allocates some entries for vector store instructions. StoreQueue controls the commitment of vector stores through vecMbCommit:

  • For each store, obtain corresponding information from the feedback vector fbk.

    Check whether the feedback meets the commitment conditions (it is valid and marked as commit or flush) and whether it matches the instruction held in uop(i) (compared via robIdx and uopIdx). The store is marked as committed only when all of these conditions hold. If any feedback within VecStorePipelineWidth satisfies the conditions, the vector store is considered committed; otherwise it is not.

  • Special case handling (Store crossing pages):

    In special cases (when a store crosses a page and there is the same uop in storeMisalignBuffer), if the store meets the condition io.maControl.toStoreQueue.withSameUop, vecMbCommit will be forced to true, indicating that this store is committed regardless.

Figure: Vector Store Instructions
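The vecMbCommit decision described above can be modeled in a few lines of Python. This is only a sketch: the `Feedback` record and its field names are assumptions standing in for the feedback vector fbk, and `with_same_uop` stands in for io.maControl.toStoreQueue.withSameUop.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    valid: bool
    is_commit: bool
    is_flush: bool
    rob_idx: int
    uop_idx: int


def vec_mb_commit(entry_rob_idx, entry_uop_idx, fbk, with_same_uop=False):
    """Decide whether one StoreQueue entry counts as committed.

    fbk holds one Feedback per vector store pipeline lane
    (VecStorePipelineWidth entries in the text above).
    """
    if with_same_uop:
        # Page-crossing special case: the commit is forced.
        return True
    return any(
        f.valid
        and (f.is_commit or f.is_flush)
        and f.rob_idx == entry_rob_idx
        and f.uop_idx == entry_uop_idx
        for f in fbk
    )
```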

Feature 4: CMO

StoreQueue supports CMO instructions. CMO instructions share the MMIO state machine control:

  • s_idle: Idle state. Enters s_req state upon receiving a CMO store request.

  • s_req: Flushes the Sbuffer. After flushing is complete, sends a CMO operation request via CMOReq and enters the s_resp state.

  • s_resp: Receives the response returned by CMOResp and enters the s_wb state.

  • s_wb: Waits for the ROB to commit the CMO instruction and enters the s_idle state.
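The four CMO states above form a simple cycle, which can be sketched as a next-state function. The event names (`cmo_req`, `sbuffer_flushed`, `req_fire`, `resp_fire`, `rob_commit`) are hypothetical labels for the conditions in the list, not signal names from the design.

```python
def cmo_next_state(state, *, cmo_req=False, sbuffer_flushed=False,
                   req_fire=False, resp_fire=False, rob_commit=False):
    """Return the next state of the CMO state machine sketched above."""
    if state == "s_idle" and cmo_req:
        return "s_req"
    if state == "s_req" and sbuffer_flushed and req_fire:
        # The request is only sent once the Sbuffer flush has finished.
        return "s_resp"
    if state == "s_resp" and resp_fire:
        return "s_wb"
    if state == "s_wb" and rob_commit:
        return "s_idle"
    return state  # no qualifying event: hold the current state
```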

Feature 5: CBO

StoreQueue supports CBO.zero instructions:

  • For CBO.zero instructions, the data part writes 0 to dataModule.

  • When CBO.zero writes to Sbuffer: Flush the Sbuffer. After flushing is complete, write back via cboZeroStout.

Feature 6: MMIO and NonCacheable Store Instructions

  • MMIO Store instruction execution

  • MMIO space stores can only be executed when they reach the head of the ROB. However, unlike loads, when a store reaches the head of the ROB, it may not necessarily be at the tail of the store queue. Some stores might have been committed but are still in the store queue and have not been written to the sbuffer. This MMIO store needs to wait for these stores to be written to the sbuffer before it can execute.

  • Use a state machine to control the execution of MMIO stores.

    • s_idle: Idle state. Enters s_req state upon receiving an MMIO store request.

    • s_req: Sends a request to the MMIO channel. After the request is accepted by the MMIO channel, enters the s_resp state.

    • s_resp: Receives the response from the MMIO channel, records whether an exception occurred, and enters the s_wb state.

    • s_wb: Converts the result into internal signals and writes back to the ROB. Once the write-back succeeds, returns to s_idle if the store has an exception; otherwise enters the s_wait state.

    • s_wait: Waits for the ROB to commit this store instruction. After commitment, returns to the s_idle state.

  • NonCacheable Store instruction execution

  • NonCacheable space store instructions need to wait until committed before requests can be sent from the StoreQueue in order.

  • Use a state machine to control the execution of NonCacheable stores.

    • nc_idle: Idle state. Enters nc_req state upon receiving a NonCacheable store request.

    • nc_req: Sends a request to the NonCacheable channel. After the request is accepted by the NonCacheable channel, enters nc_idle if the uncacheOutstanding feature is enabled; otherwise enters the nc_resp state.

    • nc_resp: Receives the response from the NonCacheable channel and enters nc_idle state.
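The MMIO and NonCacheable state machines above can be written side by side to highlight their differences: the MMIO machine has a write-back step and waits for the ROB, while the NonCacheable machine can skip the response state when outstanding requests are allowed. This is a behavioral sketch only; the keyword arguments are hypothetical event names, and `outstanding` stands in for the uncacheOutstanding feature.

```python
def mmio_next_state(state, *, req_valid=False, req_fire=False,
                    resp_fire=False, wb_fire=False,
                    has_exception=False, rob_commit=False):
    """Next state of the MMIO store state machine."""
    if state == "s_idle" and req_valid:
        return "s_req"
    if state == "s_req" and req_fire:
        return "s_resp"
    if state == "s_resp" and resp_fire:
        return "s_wb"
    if state == "s_wb" and wb_fire:
        # With an exception there is no commit to wait for.
        return "s_idle" if has_exception else "s_wait"
    if state == "s_wait" and rob_commit:
        return "s_idle"
    return state


def nc_next_state(state, *, req_valid=False, req_fire=False,
                  resp_fire=False, outstanding=False):
    """Next state of the NonCacheable store state machine."""
    if state == "nc_idle" and req_valid:
        return "nc_req"
    if state == "nc_req" and req_fire:
        # With outstanding requests enabled, do not wait for the response.
        return "nc_idle" if outstanding else "nc_resp"
    if state == "nc_resp" and resp_fire:
        return "nc_idle"
    return state
```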

Feature 7: Store Instruction Commitment and Writing to SBuffer

StoreQueue uses early commitment. The early commitment rules are:

  • Check conditions for entering the commitment stage.

    • Instruction is valid.

    • The instruction's ROB pointer does not exceed the ROB's pointer of instructions to be committed.

    • Instruction does not need to be canceled.

    • Instruction is not waiting for a Store operation to complete, or is a vector instruction.

  • If it is the first instruction in the CommitGroup:

    • Check MMIO status: No MMIO operation or there is an MMIO operation and the MMIO store has been committed.

    • If it is a vector instruction, it must satisfy the vecMbCommit condition.

  • If it is not the first instruction in the CommitGroup:

    • Commitment status depends on the commitment status of the previous instruction.

    • If it is a vector instruction, must satisfy the vecMbCommit condition.
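The early commitment rules above chain from the first instruction of the CommitGroup to the later ones. A simplified Python model of that chaining is shown below; the `CommitCandidate` record, its field names, and the `mmio_blocking` flag (meaning an MMIO operation exists whose store has not been committed) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CommitCandidate:
    valid: bool
    within_rob_pointer: bool  # passes the ROB pointer check
    need_cancel: bool
    wait_store: bool          # still waiting for a store operation
    is_vec: bool
    vec_mb_commit: bool


def commit_vector(group, mmio_blocking):
    """Return one commit decision per instruction in the CommitGroup."""
    result = []
    for i, c in enumerate(group):
        # Conditions for entering the commitment stage.
        base = (c.valid and c.within_rob_pointer and not c.need_cancel
                and (not c.wait_store or c.is_vec))
        # Vector stores additionally need vecMbCommit.
        vec_ok = (not c.is_vec) or c.vec_mb_commit
        if i == 0:
            ok = base and not mmio_blocking and vec_ok
        else:
            # Later instructions depend on the previous one committing.
            ok = base and result[i - 1] and vec_ok
        result.append(ok)
    return result
```

Because each decision depends on the previous one, a single blocked instruction stops commitment for every younger instruction in the same group.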

After commitment, stores can be written to the sbuffer in order. They are first written to the dataBuffer, a two-entry buffer (channels 0 and 1) used to hide the read latency of the large StoreQueue. Only channel 0 can write unaligned instructions. To simplify the design, even if both ports encounter an exception, only one unaligned instruction is dequeued.

  • Write enable signal generation:

  • When channel 0 instruction is unaligned and crosses a 16-byte boundary:

    • Channel 0 instruction is allocated and committed.

    • dataBuffer channels 0 and 1 can simultaneously accept instructions.

    • Channel 0 instruction is not a vector instruction, and address and data are valid; or it is a vector and vsMergeBuffer has been committed.

    • Does not cross a 4K page boundary; or crosses a 4K page boundary but can be dequeued, AND 1) if it is channel 0: allows data with exception to be written; 2) if it is channel 1: does not allow data with exception to be written.

    • Previous instructions do not include NonCacheable instructions. If it is the first instruction, it itself cannot be a NonCacheable instruction.

  • Otherwise, need to satisfy:

    • Instruction is allocated and committed.

    • Not a vector, and address and data are valid; or is a vector and vsMergeBuffer has been committed.

    • Previous instructions do not include NonCacheable and MMIO instructions. If it is the first instruction, it itself cannot be a NonCacheable and MMIO instruction.

    • If it is an unaligned store, it must not cross a 16-byte boundary, and address and data are valid or have an exception.

  • Address and Data Generation:

  • Address is split into high and low parts:

    • Low address: 8-byte aligned address.

    • High address: Low address plus an 8-byte offset.

  • Data is split into high and low parts:

    • Data crossing 16-byte boundary: The original data left-shifted by the byte offset given by the lower 4 bits of the address.

    • Low data: Lower 128 bits of the data crossing the 16-byte boundary.

    • High data: Upper 128 bits of the data crossing the 16-byte boundary.

  • Write selection logic:

    • If dataBuffer can accept unaligned instruction writes, and channel 0's instruction is unaligned and crosses a 16-byte boundary:

    • If it crosses a 4K page boundary and can be dequeued: Channel 0 uses the low address and low data to write to dataBuffer; Channel 1 uses the StoreMisalignBuffer's physical address and high data to write to dataBuffer.

    • Otherwise: Channel 0 uses the low address and low data to write to dataBuffer; Channel 1 uses the high address and high data to write to dataBuffer.

    • If the channel instruction does not cross a 16-byte boundary and is unaligned, use the 16-byte aligned address and aligned data to write to dataBuffer.

    • Otherwise, write the original data and address to dataBuffer.
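The address and data split for a store that crosses a 16-byte boundary can be checked with a short Python calculation. This follows the description above literally (8-byte aligned low address, high address 8 bytes later, data shifted by the low 4 address bits); the function name `split_cross16` is hypothetical, and byte masks and the 4K page-crossing case are not modeled.

```python
MASK128 = (1 << 128) - 1


def split_cross16(paddr, data):
    """Split a store that crosses a 16-byte boundary into two halves.

    Returns ((low_addr, low_data), (high_addr, high_data)).
    """
    offset = paddr & 0xF              # byte offset inside the 16-byte block
    low_addr = paddr & ~0x7           # 8-byte aligned address
    high_addr = low_addr + 8          # low address plus an 8-byte offset
    shifted = data << (8 * offset)    # data placed at its byte position
    low_data = shifted & MASK128      # lower 128 bits
    high_data = shifted >> 128        # upper 128 bits
    return (low_addr, low_data), (high_addr, high_data)
```

For a 4-byte store of 0xAABBCCDD to address 0x100E, the two low-order bytes land in the top two bytes of the first block (low address 0x1008, low data 0xCCDD shifted left by 112 bits), and the two high-order bytes spill into the next block (high address 0x1010, high data 0xAABB).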

Feature 8: Forced Sbuffer Flush

StoreQueue uses a dual-threshold method to control forced Sbuffer flushing: an upper threshold and a lower threshold. When the number of valid entries in the StoreQueue exceeds the upper threshold, the StoreQueue forces an Sbuffer flush until the number of valid entries in the StoreQueue falls below the lower threshold, at which point the Sbuffer flush stops.
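The dual-threshold scheme is a form of hysteresis, which the following Python sketch illustrates. The class name `ForceFlushControl` and the threshold values in the usage note are made up for the example and do not come from the design.

```python
class ForceFlushControl:
    """Hysteresis control for forcing an Sbuffer flush."""

    def __init__(self, upper, lower):
        if lower >= upper:
            raise ValueError("lower threshold must be below upper threshold")
        self.upper = upper
        self.lower = lower
        self.forcing = False

    def update(self, valid_count):
        """Update with the current number of valid StoreQueue entries."""
        if valid_count > self.upper:
            self.forcing = True      # too full: start forcing the flush
        elif valid_count < self.lower:
            self.forcing = False     # drained enough: stop forcing
        # Between the thresholds the previous decision is kept.
        return self.forcing
```

With `ForceFlushControl(upper=12, lower=4)`, an occupancy sequence of 13, 8, 3, 8 yields forcing decisions of True, True, False, False: the same occupancy of 8 gives different results depending on which threshold was crossed last, which avoids toggling the flush on and off around a single threshold.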

Overall Block Diagram

Figure: StoreQueue Overall Framework

Interface Timing

Enqueue Interface Timing Example

Figure: Enqueue Interface Timing Example

Data Update Interface Timing

Figure: Data Update Interface Timing

Address Update Interface Timing

StoreQueue address updates are similar to data updates. The StoreUnit updates the address via io_lsq in the s1 stage and updates the exception via io_lsq_replenish in the s2 stage. Unlike data updates, updating the address only takes one cycle, not two.

MMIO Interface Timing Example

Figure: MMIO Interface Timing Example

NonCacheable Interface Timing Example

Figure: NonCacheable Interface Timing Example

CBO Interface Timing Example

Figure: CBO Interface Timing Example

CMO Interface Timing Example

Figure: CMO Interface Timing Example