
Load Replay Queue LoadQueueReplay

Function Description

LoadQueueReplay is used to store Load instructions that need to be replayed. It wakes up instructions based on different wake-up conditions and schedules them to enter the LoadUnit for execution. It mainly includes the following states and stored information:

LoadQueueReplay Stored Information
Field Description
allocated Whether it has been allocated, also representing whether this item is valid.
scheduled Whether it has been scheduled, representing that this item has been selected and has been or will be sent to the LoadUnit for replay.
uop Information about the uops included in the load instruction execution.
vecReplay Information related to vector load instructions.
vaddrModule Virtual address of the Load instruction.
cause The reason for replaying the load instruction corresponding to a load replay queue item, including:
C_MA(bit 0): store-load prediction violation
C_TM(bit 1): tlb miss
C_FF(bit 2): store-to-load-forwarding failed because store data is not ready
C_DR(bit 3): DCache miss occurred, but MSHR could not be allocated
C_DM(bit 4): DCache miss occurred
C_WF(bit 5): Way predictor prediction error
C_BC(bit 6): Bank conflict
C_RAR(bit 7): LoadQueueRAR has no space to accept the instruction
C_RAW(bit 8): LoadQueueRAW has no space to accept the instruction
C_NK(bit 9): LoadUnit detected store-to-load-forwarding violation
C_MF(bit 10): LoadMisalignBuffer has no space to accept the instruction
blocking Load instruction is currently blocked
strict Whether the memory dependency predictor requires the instruction to wait for all preceding store instructions to complete execution before entering the scheduling stage.
blockSqIdx StoreQueue Index of the store instruction related to the load instruction.
missMSHRId ID of the MSHR that accepted the dcache miss request for the load instruction.
tlbHintId ID under which the tlb miss request for the load instruction was accepted.
replacementUpdated Whether the DCache replacement algorithm has been updated.
replayCarry DCache way predictor prediction information.
missDbUpdated Update of Miss-related conditions in ChiselDB.
dataInLastBeatReg Data required by the Load instruction is in the last beat of the two refill requests.
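
The cause bits listed above can be modelled as a bit-flag type. The following is a minimal Python sketch for illustration only, not the hardware implementation: the names `ReplayCause` and `decode_causes` are hypothetical, and bit 8 is named C_RAW here because its description refers to LoadQueueRAW and the wake-up section uses that name.

```python
from enum import IntFlag
from typing import List


class ReplayCause(IntFlag):
    """Replay causes, one bit each (hypothetical software model)."""
    C_MA = 1 << 0    # store-load prediction violation
    C_TM = 1 << 1    # tlb miss
    C_FF = 1 << 2    # store-to-load forwarding failed, store data not ready
    C_DR = 1 << 3    # DCache miss, MSHR could not be allocated
    C_DM = 1 << 4    # DCache miss
    C_WF = 1 << 5    # way predictor prediction error
    C_BC = 1 << 6    # bank conflict
    C_RAR = 1 << 7   # LoadQueueRAR has no space
    C_RAW = 1 << 8   # LoadQueueRAW has no space
    C_NK = 1 << 9    # LoadUnit detected store-to-load-forwarding violation
    C_MF = 1 << 10   # LoadMisalignBuffer has no space


def decode_causes(bits: int) -> List[str]:
    """Return the names of all causes set in a raw cause vector."""
    return [c.name for c in ReplayCause if bits & c.value]


# Bits 1 and 4 set: a tlb miss together with a DCache miss.
print(decode_causes(0b10010))
```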


Feature 1: Out-of-Order Allocation

  • After a load request reaches LoadUnit stage S3, it is first determined whether the request needs to be enqueued. A request is not enqueued if it does not need to be replayed, raises an exception, or is flushed by a redirect. LoadQueueReplay manages free queue space with a Freelist. The size of the Freelist equals the number of LoadQueueReplay entries, the allocation width equals the load width (the number of LoadUnits), and the release width is 4. The Freelist also reports the free entries in the load replay queue and whether the queue is full.

    • Allocation

      • LoadQueueReplay selects one entry index for each LoadUnit from the free entries in the Freelist (i.e., the Valid entries in Figure \ref{fig:LSQ-LoadQueueReplay-Freelist}). The selection is best-effort: for example, if the valid entries are 5 and 10, and LoadUnit0 and LoadUnit2 are valid, then LoadUnit0 is allocated entry 5 and LoadUnit2 is allocated entry 10. The instruction information is then written into the entry at the allocated index.

      Freelist

    • Recycling

      • Entries occupied by load instructions that are successfully replayed or flushed need to be recycled. LoadQueueReplay uses a bitmap FreeMask to store the entries being released. The Freelist can recycle at most 4 entries per cycle.

      Freelist Recycling
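
The enqueue check, allocation, and recycling described above can be sketched in software. This is a rough behavioural model under stated assumptions, not the actual RTL: `FreeList`, `needs_enqueue`, and every method and field name are hypothetical, free indices are handed out in list order, and pending releases are recycled lowest index first.

```python
from typing import List, Optional


def needs_enqueue(need_replay: bool, has_exception: bool, flushed: bool) -> bool:
    """A load is enqueued only if it must replay and is neither faulting nor flushed."""
    return need_replay and not has_exception and not flushed


class FreeList:
    """Toy free list with a bounded allocation width and release width."""

    def __init__(self, size: int, alloc_width: int, free_width: int = 4):
        self.alloc_width = alloc_width
        self.free_width = free_width
        self.free = list(range(size))  # indices that can be allocated
        self.free_mask = set()         # released entries waiting to be recycled

    @property
    def queue_full(self) -> bool:
        """True when no free entry is left to allocate."""
        return not self.free

    def allocate(self, requests: List[bool]) -> List[Optional[int]]:
        """Give each requesting LoadUnit port the next free index, in port order."""
        assert len(requests) <= self.alloc_width
        grants: List[Optional[int]] = []
        for req in requests:
            grants.append(self.free.pop(0) if req and self.free else None)
        return grants

    def release(self, index: int) -> None:
        """Mark an entry as released (successfully replayed or flushed)."""
        self.free_mask.add(index)

    def recycle(self) -> List[int]:
        """Return at most free_width released entries to the free list per cycle."""
        batch = sorted(self.free_mask)[: self.free_width]
        for idx in batch:
            self.free_mask.remove(idx)
            self.free.append(idx)
        return batch


# The example from the text: free entries 5 and 10, LoadUnit0 and LoadUnit2 valid.
fl = FreeList(size=16, alloc_width=3)
fl.free = [5, 10]
print(fl.allocate([True, False, True]))
```

In this model, releasing six entries takes two calls to `recycle()`, which mirrors the release width of 4 entries per cycle.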

Feature 2: Wake-up

  • Different blocking conditions have different wake-up conditions:

    • C_MA: If strict == 1, it needs to wait for the addresses of all preceding store instructions of the load instruction to be calculated before waking up. Otherwise, it only needs to wait for the address of the Store instruction corresponding to blockSqIdx to be calculated before waking up.

    • C_TM: If TLB has no extra space to handle the miss request, it can be marked as replayable and wait for scheduling; otherwise, it needs to wait for the TLB to return a hint signal matching tlbHintId to wake it up.

    • C_FF: It needs to wait for the data of the Store instruction corresponding to blockSqIdx to be ready before waking up.

    • C_DR: It can be marked as replayable and wait for scheduling.

    • C_DM: Wait for the L2 Hint signal matching missMSHRId to wake it up.

    • C_WF: It can be marked as replayable and wait for scheduling.

    • C_BC: It can be marked as replayable and wait for scheduling.

    • C_RAR: It can be woken up when LoadQueueRAR has free space or when this instruction is the oldest load instruction.

    • C_RAW: It can be woken up when LoadQueueRAW has free space or when the addresses of all preceding store instructions for this load instruction have been calculated.

    • C_MF: It can be woken up when LoadMisalignBuffer has free space.
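
The wake-up conditions listed above can be summarised as a single predicate. The sketch below is a simplified software model, assuming one cause per entry; `Entry`, `Signals`, `can_wake`, and all field names are hypothetical stand-ins for the corresponding hardware signals. C_NK is not covered because no wake-up condition is listed for it.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Entry:
    cause: str                           # one of the C_xx names
    strict: bool = False
    block_sq_idx: Optional[int] = None
    miss_mshr_id: Optional[int] = None
    tlb_hint_id: Optional[int] = None


@dataclass
class Signals:
    """Per-cycle status seen by the replay queue (hypothetical names)."""
    store_addr_ready: Set[int] = field(default_factory=set)  # sqIdx with address computed
    store_data_ready: Set[int] = field(default_factory=set)  # sqIdx with data ready
    all_older_store_addr_ready: bool = False
    tlb_full: bool = False               # TLB has no room to handle the miss
    tlb_hint_id: Optional[int] = None    # id carried by the TLB hint this cycle
    l2_hint_mshr_id: Optional[int] = None
    rar_has_space: bool = False
    raw_has_space: bool = False
    is_oldest_load: bool = False
    misalign_has_space: bool = False


ALWAYS_REPLAYABLE = {"C_DR", "C_WF", "C_BC"}


def can_wake(e: Entry, s: Signals) -> bool:
    """Return True when the entry's blocking condition has cleared."""
    if e.cause in ALWAYS_REPLAYABLE:
        return True
    if e.cause == "C_MA":
        if e.strict:
            return s.all_older_store_addr_ready
        return e.block_sq_idx in s.store_addr_ready
    if e.cause == "C_TM":
        hint_match = s.tlb_hint_id is not None and s.tlb_hint_id == e.tlb_hint_id
        return s.tlb_full or hint_match
    if e.cause == "C_FF":
        return e.block_sq_idx in s.store_data_ready
    if e.cause == "C_DM":
        return s.l2_hint_mshr_id is not None and s.l2_hint_mshr_id == e.miss_mshr_id
    if e.cause == "C_RAR":
        return s.rar_has_space or s.is_oldest_load
    if e.cause == "C_RAW":
        return s.raw_has_space or s.all_older_store_addr_ready
    if e.cause == "C_MF":
        return s.misalign_has_space
    raise ValueError(f"no wake-up condition described for {e.cause}")
```

For example, `can_wake(Entry("C_DM", miss_mshr_id=2), Signals(l2_hint_mshr_id=2))` is true in this model, while the same entry stays blocked when the hint carries a different MSHR id.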

Feature 3: Selection Scheduling

  • LoadQueueReplay has 3 types of selection scheduling:

    • Based on Enqueue Age

      • LoadQueueReplay uses 3 age matrices (one for each Bank) to record the enqueue time. Among the instructions that are ready for replay, the age matrix selects the one that has been in the queue the longest and schedules it for replay.
    • Based on Load Instruction Age

      • LoadQueueReplay can prioritize replaying instructions that are closer to the oldest load instruction, judged by LqPtr, with a determination width of OldestSelectStride=4.
    • Prioritize Scheduling of Load Instructions Related to DCache Data

      • LoadQueueReplay first schedules replays triggered by an L2 Hint. After a dcache miss, the lower-level L2 Cache must be queried; 2 or 3 cycles before it refills, the L2 Cache proactively sends a wake-up signal, called the L2 Hint, to LoadQueueReplay. Upon receiving the L2 Hint, LoadQueueReplay can wake up the Load instruction blocked by the dcache miss earlier and replay it.

      • When no L2 Hint is present, the remaining replay causes are divided into high priority and low priority. High priority covers replays caused by a dcache miss or by store-to-load forwarding; all other causes are low priority. If a Load instruction that satisfies the replay conditions (valid, not scheduled, and not blocked waiting for wake-up) can be found in LoadQueueReplay, that Load instruction is selected for replay. Otherwise, the AgeDetector module picks the earliest-enqueued item among the candidate load replay queue entries, following enqueue order.
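
One way to picture the priority order above is a three-tier pick with an age matrix as the tie-breaker. The sketch below is an interpretation for illustration only: it uses a single age matrix rather than one per Bank, ignores the LqPtr-based selection, and `AgeMatrix`, `select_replay`, and the dictionary keys are hypothetical names.

```python
from typing import Dict, List, Optional

HIGH_PRIORITY = {"C_DM", "C_FF"}  # dcache miss, store-to-load forwarding


class AgeMatrix:
    """older[i][j] is True when entry i was enqueued before entry j."""

    def __init__(self, size: int):
        self.size = size
        self.older = [[False] * size for _ in range(size)]
        self.valid = [False] * size

    def enqueue(self, i: int) -> None:
        for j in range(self.size):
            self.older[i][j] = False          # the new entry is older than nothing
            if j != i and self.valid[j]:
                self.older[j][i] = True       # every valid entry is older than it
        self.valid[i] = True

    def dequeue(self, i: int) -> None:
        self.valid[i] = False

    def oldest(self, candidates: List[int]) -> Optional[int]:
        """Return the candidate that is older than all other candidates."""
        for i in candidates:
            if all(self.older[i][j] for j in candidates if j != i):
                return i
        return None


def select_replay(entries: List[Dict], ages: AgeMatrix) -> Optional[int]:
    """Pick one entry: L2 Hint first, then high priority, then the rest."""
    ready = [i for i, e in enumerate(entries)
             if e["allocated"] and not e["scheduled"] and not e["blocking"]]
    tiers = (
        [i for i in ready if entries[i]["l2_hint"]],
        [i for i in ready if entries[i]["cause"] in HIGH_PRIORITY],
        ready,
    )
    for tier in tiers:
        if tier:
            return ages.oldest(tier)
    return None


def make_entry(cause: str, l2_hint: bool = False) -> Dict:
    return {"allocated": True, "scheduled": False, "blocking": False,
            "cause": cause, "l2_hint": l2_hint}


entries = [make_entry("C_BC"), make_entry("C_DM", l2_hint=True),
           make_entry("C_TM"), make_entry("C_FF")]
ages = AgeMatrix(4)
for idx in (2, 0, 3, 1):   # enqueue order: entry 2 is the oldest
    ages.enqueue(idx)
print(select_replay(entries, ages))
```

In this example the entry woken by the L2 Hint wins even though it was enqueued last; once it is marked as scheduled, the C_FF entry is chosen next, and the remaining low-priority entries follow in enqueue order.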


Overall Block Diagram

LoadQueueReplay Overall Block Diagram

Interface Timing

Enqueue Timing

  • Replay Enqueue

LoadQueueReplay Replay Enqueue Timing Diagram


  • Non-Replay Enqueue

LoadQueueReplay Non-Replay Enqueue Timing Diagram

Replay Timing

LoadQueueReplay Replay Queue Timing Diagram


Freelist Timing

  • Allocation Timing

Freelist Allocation Timing Diagram

  • Recycling Timing

Freelist Recycling Timing Diagram