Load Instruction Execution Unit LoadUnit

Functional Description

The load instruction pipeline receives load instructions from the load dispatch queue, processes them, and writes the results back to the LoadQueue and ROB so that the instruction can be committed and the subsequent instructions that depend on it can be woken up. At the same time, the LoadUnit provides the necessary feedback to the dispatch queue and the Load/StoreQueue. The LoadUnit supports a 128-bit data width.

Feature 1: LoadUnit Pipeline Stage Functions

stage 0

- Receives requests from different sources and performs arbitration.
- The instruction that wins arbitration sends query requests to the TLB and DCache.
- The pipeline flows to stage 1.

The arbitration priority from high to low is listed in the table below.

Table: LoadUnit Request Priority

Stage 0 Request Source	Priority
Load requests from MisalignBuffer	1 (highest)
LoadQueueReplay resend due to DCache miss	2
LoadUnit fast resend	3
Uncache requests	4
NC requests	5
Other LoadQueueReplay resends	6
High confidence hardware prefetch requests	7
Vector load requests	8
Scalar load/software prefetch requests	9
Load pointer chasing requests	10
Low confidence hardware prefetch requests	11 (lowest)

Currently, the Kunminghu architecture does not support load pointer chasing.
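The fixed-priority arbitration in stage 0 can be sketched in software as follows. This is a behavioral model only, not the RTL: the enum member names are hypothetical labels for the rows of the priority table, and the model assumes that exactly one request wins per cycle.

```python
from enum import IntEnum
from typing import Iterable, Optional


class S0Source(IntEnum):
    """Stage 0 request sources (hypothetical names); smaller value = higher priority."""
    MISALIGN_BUFFER = 0
    REPLAY_DCACHE_MISS = 1
    FAST_REPLAY = 2
    UNCACHE = 3
    NC = 4
    REPLAY_OTHER = 5
    HW_PREFETCH_HIGH_CONF = 6
    VECTOR_LOAD = 7
    SCALAR_LOAD = 8           # also covers software prefetch
    POINTER_CHASING = 9       # listed in the table, currently unsupported
    HW_PREFETCH_LOW_CONF = 10


def arbitrate(valid: Iterable[S0Source]) -> Optional[S0Source]:
    """Return the highest-priority valid source, or None when nothing is valid."""
    return min(valid, default=None)
```

For example, `arbitrate([S0Source.SCALAR_LOAD, S0Source.REPLAY_OTHER])` selects `S0Source.REPLAY_OTHER`, because replay requests rank above scalar loads in the table.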

stage 1
- Receives requests from stage 0.
- s1_kill: Asserted when the virtual/physical address match of a fast replay fails, when L2L forwarding fails, or when the redirect signal is active.
- May send kill signals to the TLB or DCache.
- Receives response from the TLB, queries the DCache based on the physical address; for hint cases, sends them to the DCache simultaneously.
- Queries the StoreQueue and SBuffer for store-to-load (ST-LD) forwarding.
- Receives StoreUnit requests, determines if an ST-LD violation exists.
- Checks if an exception occurred.
- If it is an NC instruction, performs PBMT check.
- If it is a prf_i instruction, sends a request to the frontend.
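The s1_kill condition listed for stage 1 is a plain OR of three causes. A minimal sketch, assuming each cause is available as a single boolean (the field names are hypothetical and do not correspond to RTL signal names):

```python
from dataclasses import dataclass


@dataclass
class S1KillInputs:
    """Hypothetical per-cycle inputs to the stage 1 kill decision."""
    fast_replay_addr_mismatch: bool = False  # fast replay vaddr/paddr match failed
    l2l_forward_fail: bool = False           # L2L forwarding failed
    redirect: bool = False                   # redirect signal is active


def s1_kill(inputs: S1KillInputs) -> bool:
    """Any single cause is enough to kill the request in stage 1."""
    return (
        inputs.fast_replay_addr_mismatch
        or inputs.l2l_forward_fail
        or inputs.redirect
    )
```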
stage 2
- Receives requests from stage 1.
- Receives response from PMP check, determines if an exception occurred; integrates exception sources.
- Receives response information from the DCache, determines if resend is needed, etc.
- Queries LoadQueue and StoreQueue for LD-LD or ST-LD violations.
- Sends fast wake-up signal to the backend.
- Integrates resend reasons.
- If it is an NC instruction, performs PMA & PMP check.
stage 3
- Receives requests from stage 2.
- Sends prefetch requests to the SMS prefetcher and L1 prefetcher.
- Receives data returned by DCache or forwarded data, performs concatenation and selection.
- Receives uncache load request writeback.
- Writes completed load requests back to the backend.
- Updates the execution status of load instructions in the LoadQueue.
- Sends redirect request to the backend.
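The concatenation and selection of DCache data and forwarded data in stage 3 can be illustrated with a byte-wise merge. This is a sketch under stated assumptions: forwarding is modeled with one mask bit per byte of the 128-bit (16-byte) data path, and the function names are hypothetical. Little-endian byte order follows the RISC-V memory model.

```python
LINE_BYTES = 16  # 128-bit data width


def merge_forward(dcache_data: bytes, fwd_data: bytes, fwd_mask: int) -> bytes:
    """Per byte, take forwarded store data where the mask bit is set, else DCache data."""
    assert len(dcache_data) == LINE_BYTES and len(fwd_data) == LINE_BYTES
    return bytes(
        fwd_data[i] if (fwd_mask >> i) & 1 else dcache_data[i]
        for i in range(LINE_BYTES)
    )


def select_load_data(line: bytes, offset: int, size: int) -> int:
    """Extract `size` bytes at `offset` from the merged line as a little-endian integer."""
    assert 0 <= offset and offset + size <= LINE_BYTES
    return int.from_bytes(line[offset:offset + size], "little")


# Usage: bytes 1 and 2 come from an older store, the rest from the DCache.
dcache = bytes(range(16))
forwarded = bytes([0xAA] * 16)
merged = merge_forward(dcache, forwarded, fwd_mask=0b0110)
value = select_load_data(merged, offset=0, size=4)  # 0x03AAAA00
```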

Feature 2: Support for Vector Load Instructions

LoadUnit processes vector load instructions similarly to scalar ones, with higher priority than scalar loads. Specifically:
- stage 0:
  - Accepts execution requests from vlSplit, with higher priority than scalar requests, and does not need to compute virtual address.
- stage 1:
  - Computes vecVaddrOffset and vecTriggerMask.
- stage 3:
  - Does not need to send feedback_slow response to the backend.
  - Vector loads write back to the backend via vecldout.

Feature 3: Support for MMIO Load Instructions

For MMIO load instructions, the LoadUnit pipeline serves only to wake up the consumer instructions that depend on them.
- An MMIO load instruction sends a wake-up request to the backend in stage 0.
- An MMIO load instruction writes back its data in stage 3.

Feature 4: Support for Noncacheable Load Instructions

LoadUnit processes Noncacheable (NC) load instructions similarly to scalar ones, with higher priority than scalar requests. Specifically, an NC load instruction enters the pipeline twice:
- First pass: determines the instruction's NC attribute.
- Second pass:
  - stage 0: Identifies the request as an NC instruction; no TLB translation is needed.
  - stage 1: Sends a forwarding request to the StoreQueue.
  - stage 2: Determines the store data forwarding status (data not ready: resend; virtual/physical address mismatch: redirect). Sends RAR/RAW violation check requests.
  - stage 3: Determines the violation status (LD-LD violation: redirect; ST-LD violation: redirect). If RAR or RAW is full or not ready, the instruction must be resent from LoadQueueUncache. If no resend is needed, writes back via ldout.

Misaligned Noncacheable load instructions are not supported.
Supports obtaining forwarded data from LoadQueueUncache.
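The outcome of the second pass of an NC load can be summarized as a small decision function. This is a sketch only: the parameter and enum names are hypothetical, and the precedence of redirect over resend is an assumption made for the example, since the text lists the conditions without ranking them.

```python
from enum import Enum


class NcOutcome(Enum):
    REDIRECT = "redirect"
    RESEND = "resend"
    WRITEBACK = "write back via ldout"


def nc_second_pass(
    fwd_data_ready: bool = True,       # stage 2: store data available for forwarding
    fwd_addr_mismatch: bool = False,   # stage 2: virtual/physical address mismatch
    ld_ld_violation: bool = False,     # stage 3: LD-LD violation
    st_ld_violation: bool = False,     # stage 3: ST-LD violation
    rar_raw_accepted: bool = True,     # stage 3: RAR/RAW neither full nor not ready
) -> NcOutcome:
    """Combine the stage 2 and stage 3 checks into one outcome (assumed precedence)."""
    if fwd_addr_mismatch or ld_ld_violation or st_ld_violation:
        return NcOutcome.REDIRECT
    if not fwd_data_ready or not rar_raw_accepted:
        return NcOutcome.RESEND
    return NcOutcome.WRITEBACK
```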

Feature 5: Support for Misaligned Load Instructions

A misaligned load instruction enters the pipeline four times:
- First pass: determines whether the instruction is misaligned; if it is, sends a misaligned request to LoadMisalignBuffer.
- Second pass: executes the first aligned load of the split; on success, sends a response to LoadMisalignBuffer, otherwise the load is resent from LoadMisalignBuffer.
- Third pass: executes the second aligned load of the split; on success, sends a response to LoadMisalignBuffer, otherwise the load is resent from LoadMisalignBuffer.
- Fourth pass: in stage 0, wakes up the consumers that depend on the load instruction; at the same time, the load instruction writes back from LoadMisalignBuffer.

The LoadUnit processes misaligned load instructions similarly to scalar ones. Specifically:
- stage 0:
  - Accepts requests from LoadMisalignBuffer, with higher priority than vector and scalar requests, and does not need to compute the virtual address.
- stage 3:
  - If the request is not from LoadMisalignBuffer and is a misaligned request that does not cross a 16-byte boundary, it must enter LoadMisalignBuffer for processing; an enqueue request is sent to LoadMisalignBuffer via the io_misalign_buf interface.
  - If the request is from LoadMisalignBuffer and does not cross a 16-byte boundary, a resend or writeback response must be sent to LoadMisalignBuffer via the io_misalign_ldout interface.
  - If misalignNeedWakeUp == true, the request writes back directly; otherwise it must enter LoadMisalignBuffer for resend.
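The idea of executing a misaligned load as two aligned loads and then combining the results can be sketched as follows. The split policy shown here (two naturally aligned accesses of the same size as the original load) is an assumption chosen for illustration; the actual policy of LoadMisalignBuffer is not described in this section, and the function names are hypothetical.

```python
def split_misaligned(addr: int, size: int) -> tuple[int, int]:
    """Return the base addresses of the two naturally aligned accesses covering the load."""
    assert size in (2, 4, 8) and addr % size != 0, "expects a misaligned access"
    low_base = addr - (addr % size)
    return low_base, low_base + size


def merge_split(addr: int, size: int, low_data: bytes, high_data: bytes) -> int:
    """Combine the two aligned results and extract the originally requested bytes."""
    assert len(low_data) == size and len(high_data) == size
    offset = addr % size
    combined = low_data + high_data  # 2 * size bytes starting at the low base address
    return int.from_bytes(combined[offset:offset + size], "little")


# Usage: an 8-byte load at address 11 becomes aligned loads at 8 and 16.
memory = bytes(range(32))
low_base, high_base = split_misaligned(11, 8)
result = merge_split(11, 8, memory[low_base:low_base + 8], memory[high_base:high_base + 8])
```

In this model the second and third passes correspond to the two aligned accesses, and the merge corresponds to the data that is written back on the final pass.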

Feature 6: Support for Prefetch Requests

LoadUnit accepts two types of prefetch requests:
- High confidence prefetch (confidence > 0)
- Low confidence prefetch (confidence == 0)

Supports prefetch training:
- stage 2:
  - Trains the L1 prefetcher via io_prefetch_train_l1.
  - Trains the SMS prefetcher via io_prefetch_train.
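The confidence threshold above, combined with the stage 0 priority table, determines whether a hardware prefetch is arbitrated ahead of or behind demand loads. A minimal sketch with hypothetical function names:

```python
def prefetch_class(confidence: int) -> str:
    """Classify a hardware prefetch request by its confidence value."""
    if confidence < 0:
        raise ValueError("confidence must be non-negative")
    return "high" if confidence > 0 else "low"


def prefetch_outranks_scalar_load(confidence: int) -> bool:
    """High confidence prefetches rank above scalar loads in stage 0; low confidence ones rank below."""
    return prefetch_class(confidence) == "high"
```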

\newpage

Overall Block Diagram

\newpage

Interface Timing

LoadUnit Interface Timing Example

After a load instruction enters the LoadUnit, it requests the TLB and DCache in stage 0, gets the paddr returned by the TLB in stage 1, and determines if it hits the DCache in stage 2. In stage 2, RAW and RAR violation checks are performed, and in stage 3, LoadQueue is updated via io_lsq_ldin. Writeback occurs in stage 3 via ldout.

\newpage

Stage 0 Different Source Arbitration Timing Example

The figure illustrates the arbitration of load instructions from different sources in stage 0. In the third clock cycle, only io_ldin_valid is active, and the handshake is successful, entering stage 1 in the next cycle. In the fifth clock cycle, both io_ldin_valid and io_replay_valid are active simultaneously. Since the replay request has higher priority than a scalar load, the replay request wins arbitration and enters stage 1.
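The two cycles described above can be replayed with a small behavioral model. It considers only the two sources named in the example; the function name and the string labels are hypothetical, and the valid values for cycles 3 and 5 are taken from the description of the figure.

```python
from typing import Optional


def s0_winner(ldin_valid: bool, replay_valid: bool) -> Optional[str]:
    """Replay requests take priority over scalar loads arriving on io_ldin."""
    if replay_valid:
        return "replay"
    if ldin_valid:
        return "ldin"
    return None


# cycle -> (io_ldin_valid, io_replay_valid)
trace = {3: (True, False), 5: (True, True)}
winners = {cycle: s0_winner(*valids) for cycle, valids in trace.items()}
# winners == {3: "ldin", 5: "replay"}
```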