Vector Load Split Unit VLSplit

Function Description

Accepts and processes the uops of Vector Load instructions. It splits each Uop, calculates the Uop offset relative to the base address, and generates control signals for the scalar memory access pipeline. VLSplit is implemented as two modules: VLSplitPipeline and VLSplitBuffer.

Feature 1: VLSplitPipeline performs secondary decoding for uops

The splitting pipeline for Vector Load instructions. It accepts the Vector Load instruction Uops issued by the Vector Load Issue Queue, performs finer-grained decoding, and calculates the Mask and address offset in the pipeline before sending them to VLSplitBuffer. At the same time, VLSplitPipeline requests entries in VLMergeBuffer based on the decoding and calculation results.

VLSplitPipeline is divided into two pipeline stages:

S0:

Performs more fine-grained decoding based on the input Uop information.
Generates alignedType based on the instruction type. alignedType indicates the memory access width used by the Load Pipeline.
Generates the preIsSplit signal based on the instruction type. preIsSplit is set high if the instruction is not a Unit-Stride instruction.
Generates the Mask for this Uop based on instruction type and information like vm, emul, lmul, eew, sew, etc.
Calculates the VdIdx for this Uop to be used for subsequent backend data merging and write-back. Due to out-of-order execution, Uops of the same instruction may not execute back-to-back. Therefore, VdIdx needs to be calculated in this stage based on instruction type, emul, lmul, and uopidx.

S1:

Calculates UopOffset and Stride.
Calculates the FlowNum required by this Uop. The FlowNum sent to VMergeBuffer differs from the FlowNum used by VSplitBuffer: the FlowNum in MergeBuffer is used to determine whether this Uop has completed all valid memory accesses, while the FlowNum used in VSplitBuffer is the number of flows needed for splitting.
Requests a VLMergeBuffer entry. One entry is requested for each Uop.
Sends information to VLSplitBuffer.

Mask Calculation:

First, we calculate and generate the SrcMask representing this Vector Load instruction based on vm, v0, vstart, and evl. Here, evl is the effective vector length. For different types of Vector Load instructions, there are different methods for calculating evl:
- For Load Whole instructions, evl = NFIELDS*VLEN/EEW.
- For Load Unit-Stride Mask instructions, evl = ceil(vl/8).
- For Vector Load instructions other than the above two types, evl = vl.
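The three evl cases above can be written down as a small Python sketch. This is only an illustration of the formulas in the list, not the module's code: the function name `effective_vl`, the `kind` strings, and the default VLEN of 128 are assumptions made for the example.

```python
def effective_vl(kind: str, vl: int, nfields: int = 1,
                 vlen: int = 128, eew: int = 8) -> int:
    """Return evl for one Vector Load, following the three listed cases.

    kind is a hypothetical tag: "whole", "unit_stride_mask", or anything else.
    """
    if kind == "whole":
        # Load Whole: evl = NFIELDS * VLEN / EEW
        return nfields * vlen // eew
    if kind == "unit_stride_mask":
        # Load Unit-Stride Mask: evl = ceil(vl / 8), done in integer arithmetic
        return (vl + 7) // 8
    # All other Vector Load instructions: evl = vl
    return vl


print(effective_vl("whole", vl=0, nfields=2, vlen=128, eew=8))  # 32
print(effective_vl("unit_stride_mask", vl=9))                   # 2
print(effective_vl("other", vl=9))                              # 9
```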
Then, we use the [FlowNum of all Uops before the current Uop] and the [FlowNum of all Uops up to and including the current Uop] of this instruction, together with the [FlowNum of all Vds before the current Uop], to calculate the FlowMask that is actually used. Load Indexed is a special case: when signed(emul) > signed(lmul) for an Indexed instruction (emul and lmul compared as signed values), we need to ensure that the FlowNum of Uops with the same VdIdx is offset within that VdIdx. A specific example is as follows:
- First, assume the following configuration for the vector vluxei instruction:
  - vsetvli t1,t0,e8,m1,ta,ma (lmul = 1)
  - vluxei16.v v2,(a0),v8 (emul = 2)
  - vl = 9, v0 = 0x1FF
- Under this configuration, because signed(emul) > signed(lmul), two Uops are generated, one for each of the two vector registers that hold the index. Both Uops correspond to the same destination register Vd, so their VdIdx is the same and they are written to the same destination register. This produces the following results:
  - uopIdxInField = 0, vdIdxInField = 0, flowMask = 0x00FF, toMergeBuffMask = 0x01FF
  - uopIdxInField = 1, vdIdxInField = 0, flowMask = 0x0001, toMergeBuffMask = 0x01FF
  - uopIdxInField = 0, vdIdxInField = 0, flowMask = 0x0000, toMergeBuffMask = 0x0000
  - uopIdxInField = 0, vdIdxInField = 0, flowMask = 0x0000, toMergeBuffMask = 0x0000
- The FlowNum calculated for each Uop is 8. More specific explanations can be found in VSplit.scala.
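The first two rows of the example can be reproduced with a minimal Python sketch. It rests on assumptions that are not stated in the text: SrcMask is modelled with the usual RVV rule (elements in [vstart, evl), ANDed with v0 when vm = 0), flowMask is taken as SrcMask shifted by the FlowNum of the Uops before the current one, toMergeBuffMask is taken as SrcMask shifted by the FlowNum of the Vds before the current one, and one Vd holds 16 elements (VLEN = 128, SEW = 8). All function and parameter names are hypothetical.

```python
def src_mask(vm: int, v0: int, vstart: int, evl: int) -> int:
    """Assumed SrcMask model: active elements are vstart <= i < evl."""
    body = ((1 << evl) - 1) & ~((1 << vstart) - 1)
    # vm = 1 means unmasked; vm = 0 applies the v0 mask register
    return body if vm else body & v0


def uop_masks(mask: int, flows_prev_this_uop: int, flows_prev_this_vd: int,
              flow_num: int, vd_elems: int):
    """Return (flowMask, toMergeBuffMask) for one Uop under the assumed model."""
    flow_mask = (mask >> flows_prev_this_uop) & ((1 << flow_num) - 1)
    to_merge_buff_mask = (mask >> flows_prev_this_vd) & ((1 << vd_elems) - 1)
    return flow_mask, to_merge_buff_mask


FLOW_NUM = 8    # FlowNum per Uop, from the example
VD_ELEMS = 16   # assumed: VLEN = 128 bits, SEW = 8 bits

mask = src_mask(vm=0, v0=0x1FF, vstart=0, evl=9)  # vl = 9, so evl = 9
for uop_idx in range(2):
    # Both Uops target Vd 0, so no flows belong to earlier Vds
    fm, mm = uop_masks(mask, uop_idx * FLOW_NUM, 0, FLOW_NUM, VD_ELEMS)
    print(f"uopIdxInField = {uop_idx}, flowMask = 0x{fm:04X}, "
          f"toMergeBuffMask = 0x{mm:04X}")
# uopIdxInField = 0, flowMask = 0x00FF, toMergeBuffMask = 0x01FF
# uopIdxInField = 1, flowMask = 0x0001, toMergeBuffMask = 0x01FF
```

The sketch covers only the two Uops that carry active elements; the authoritative calculation is the one in VSplit.scala.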

Feature 2: VLSplitBuffer performs splitting based on secondary decoding information generated by VLSplitPipeline

VLSplitBuffer is a Buffer with only one entry, which accepts related information from VLSplitPipeline and buffers the Vector Load Uop that needs to be split.

Based on the Uop's information, VLSplitBuffer splits a Uop into multiple requests that the scalar Load Pipeline can handle, and sends them to the scalar Load Pipeline for actual memory access.

Enqueue Logic:

VLSplitBuffer accepts entry requests and related information from VLSplitPipeline. When a VLSplitBuffer entry is free, it allocates a VLSplitBuffer entry for each request and sets the Valid signal of the corresponding entry high.

Dequeue Logic:

When the number of splits already performed for the buffered Uop (splitIdx) reaches the number of splits required, splitting ends and the entry is dequeued: the entry is freed for the next request and the splitIdx counter is reset to zero.

Splitting:

VLSplitBuffer performs splitting based on the instruction type.
For Unit-Stride instructions:
- When the base address is aligned (does not cross CacheLine), a single 128-bit access is issued.
- When the base address is unaligned (crosses CacheLine), the access is split and two 128-bit memory accesses are issued.
For other Vector Load instructions, we perform splitting element by element according to the requirements of instruction semantics and access memory element by element.
Each split sends the information generated by that split to the scalar Load Pipeline for actual memory access.
Splitting is controlled by the splitIdx counter, which holds the number of splits already performed for the current entry. When splitIdx is less than the number of splits required and the scalar Load Pipeline can accept a request, one split is performed and splitIdx is incremented. When splitIdx is greater than or equal to the number of splits required, splitting ends, the entry is dequeued, and the splitIdx counter is reset to zero.
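The splitIdx behaviour described above can be sketched as a cycle-level Python model of a single-entry buffer. It is a simplified illustration under assumptions: the class and method names are hypothetical, and dequeue is modelled as taking its own cycle, which the text does not specify.

```python
class SplitBufferModel:
    """Hypothetical single-entry model of the splitIdx counter."""

    def __init__(self):
        self.valid = False
        self.split_idx = 0   # number of splits already performed
        self.split_num = 0   # number of splits required

    def enqueue(self, split_num: int) -> bool:
        """Accept a Uop only when the single entry is free."""
        if self.valid:
            return False
        self.valid = True
        self.split_num = split_num
        self.split_idx = 0
        return True

    def step(self, pipeline_ready: bool):
        """Advance one cycle; return the split index issued, or None."""
        if not self.valid:
            return None
        if self.split_idx >= self.split_num:
            # All splits done: dequeue the entry and reset the counter
            self.valid = False
            self.split_idx = 0
            return None
        if pipeline_ready:
            issued = self.split_idx
            self.split_idx += 1
            return issued
        return None  # stalled: the Load Pipeline cannot accept a request


buf = SplitBufferModel()
buf.enqueue(3)
trace = [buf.step(ready) for ready in (True, False, True, True, True)]
print(trace)      # [0, None, 1, 2, None]
print(buf.valid)  # False
```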

Address Calculation:

During splitting, the information to be sent to the scalar Load Pipeline also needs to be calculated, chiefly the virtual address of the memory access produced by each split.
The calculation method for the virtual address varies depending on the instruction type and splitting method.
For Unit-Stride instructions:
- When the base address is aligned (does not cross CacheLine), a single 128-bit aligned access is sufficient.
- When the base address is unaligned (crosses CacheLine), the access is split and performed using two consecutive 128-bit aligned addresses.
For other Vector Load instructions, we perform splitting element by element according to the requirements of instruction semantics, and the virtual address is calculated based on the element and semantics.
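The address rules above can be illustrated with a short Python sketch. Several points are assumptions rather than facts about the module: "aligned" is modelled as the base address being a multiple of 16 bytes (128 bits), the strided and indexed formulas follow the general RVV element semantics (base + i * stride and base + offset[i]), and all function names are hypothetical.

```python
ACCESS_BYTES = 16  # one 128-bit access


def unit_stride_addrs(base: int) -> list:
    """One aligned address if base is aligned, else two consecutive ones."""
    aligned = base & ~(ACCESS_BYTES - 1)
    if base == aligned:
        return [aligned]
    return [aligned, aligned + ACCESS_BYTES]


def strided_addrs(base: int, stride: int, num_elems: int) -> list:
    """Element-by-element addresses for a strided load (RVV semantics)."""
    return [base + i * stride for i in range(num_elems)]


def indexed_addrs(base: int, offsets: list) -> list:
    """Element-by-element addresses for an indexed load (RVV semantics)."""
    return [base + off for off in offsets]


print([hex(a) for a in unit_stride_addrs(0x1000)])       # ['0x1000']
print([hex(a) for a in unit_stride_addrs(0x1008)])       # ['0x1000', '0x1010']
print([hex(a) for a in strided_addrs(0x2000, 4, 3)])     # ['0x2000', '0x2004', '0x2008']
print([hex(a) for a in indexed_addrs(0x3000, [0, 8, 2])])  # ['0x3000', '0x3008', '0x3002']
```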

Redirect and Exception Handling:

When a redirect signal arrives, the relevant entries in VLSplitBuffer are flushed based on the redirect information.

Feature 3: Backpressure based on the Threshold signal from VLMergeBuffer

See Threshold Backpressure. When it receives the Threshold signal from VLMergeBuffer, VLSplitPipeline backpressures enqueue requests, preventing the backend from sending new uops until VLMergeBuffer removes the threshold backpressure.
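As a rough illustration of this backpressure, the Ready seen by the Issue Queue can be modelled as gated by the Threshold signal. This is a sketch under the assumption that the threshold simply masks an otherwise ready input; the function name and the `can_accept` term are hypothetical.

```python
def in_ready(can_accept: bool, threshold: bool) -> bool:
    """Hypothetical Ready term: accept a uop only when no threshold is asserted."""
    return can_accept and not threshold


# Threshold asserted for two cycles, then released by VLMergeBuffer
threshold_trace = [False, True, True, False]
accepted = [in_ready(True, t) for t in threshold_trace]
print(accepted)  # [True, False, False, True]
```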

Overall Block Diagram

No block diagram for a single module.

Main Ports

Only the external interfaces of VLSplit are listed. The internal interfaces of VLSplitPipeline and VLSplitBuffer are not included.

Port	Direction	Description
redirect	In	Redirect port
in	In	Receives uops issued by the Issue Queue
toMergeBuffer.req	Out	Requests MergeBuffer entry
toMergeBuffer.resp	In	MergeBuffer response
out	Out	Sends memory access requests to Load Unit
threshold	In	Receives threshold signal from VLMergeBuffer

Interface Timing

The interface timing is relatively simple, so only text descriptions are provided.

Port	Description
redirect	Has Valid. Data is valid when Valid is high.
in	Has Valid and Ready. Data is valid when both Valid and Ready are high.
toMergeBuffer.req	Has Valid and Ready. Data is valid when both Valid and Ready are high.
toMergeBuffer.resp	Has Valid. Data is valid when Valid is high.
out	Has Valid and Ready. Data is valid when both Valid and Ready are high.
threshold	Does not have Valid. The signal is always considered valid and takes effect as soon as it is generated.