Vector Store Split Unit VSSplit

Function Description

Accepts and processes uops for Vector Store instructions. Splits the Uop, calculates the offset of the Uop relative to the base address, and generates control signals for the Scalar Store Pipeline. VSSplit is broadly divided into two implementation modules: VSSplitPipeline and VSSplitBuffer.

Feature 1: VSSplitPipeline performs secondary decoding for uops

VSSplitPipeline is the splitting pipeline for Vector Store instructions. It accepts Uops issued from the Vector Store issue queue, performs finer-grained decoding, and calculates the Mask and address offset before sending them to VSSplitBuffer. At the same time, VSSplitPipeline requests table entries in VSMergeBuffer based on the decoding results. VSSplitPipeline is divided into two pipeline stages:

S0:

Performs finer-grained decoding based on the incoming Uop information.
Generates alignedType based on the instruction type, using alignedType to indicate the access width of the Store Pipeline.
Generates the preIsSplit signal based on the instruction type. preIsSplit being high indicates it is not a Unit-Stride instruction.
Generates the Mask for this Uop based on the instruction type and information such as vm, emul, lmul, eew, sew, etc.
Calculates the VdIdx for this Uop for subsequent backend data merging and writeback. Due to out-of-order execution, Uops of the same instruction are not necessarily executed back-to-back, so VdIdx needs to be calculated at this stage based on the instruction type, emul, lmul, and uopidx.

Mask Calculation:

First, we calculate and generate the SrcMask representing this Vector Store instruction based on vm, v0, vstart, and evl. Here, evl is the effective vector length. For different types of Vector Store instructions, there are different methods for calculating evl:
- For Store Whole instructions, its evl = NFIELDS * VLEN / EEW.
- For Store Unit-Stride Mask instructions, its evl = ceil(vl / 8).
- For Vector Store instructions other than the two types mentioned above, its evl = vl.
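The three evl cases above can be sketched as a small Python function. This is only an illustrative model, not the hardware implementation: the function name `effective_vl`, the `kind` strings, and the default values for `nfields`, `vlen`, and `eew` are assumptions made for the example.

```python
def effective_vl(kind, vl, nfields=1, vlen=128, eew=8):
    """Return evl for a Vector Store instruction (illustrative model only).

    kind is a hypothetical tag: "whole" for Store Whole,
    "mask" for Store Unit-Stride Mask, anything else for the remaining types.
    """
    if kind == "whole":
        # Store Whole: evl = NFIELDS * VLEN / EEW
        return nfields * vlen // eew
    if kind == "mask":
        # Store Unit-Stride Mask: evl = ceil(vl / 8), done in integer arithmetic
        return (vl + 7) // 8
    # All other Vector Store instructions: evl = vl
    return vl
```

For example, with vl = 9 a mask store yields evl = 2, because nine mask bits occupy two bytes.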
Then, we use the [FlowNum of all Uops before the current Uop] and [FlowNum of all Uops including the current Uop] of this instruction, along with the [FlowNum of all Vd before the current Uop] to calculate the actually used FlowMask. Here, due to the special nature of Store Indexed, when signed(emul) > signed(lmul), we need to ensure that the FlowNum of Uops with the same VdIdx is offset within the VdIdx. A specific example is as follows:
- First, we assume the following configuration for the vector vsuxei instruction:
  - vsetvli t1,t0,e8,m1,ta,ma lmul = 1
  - vsuxei16.v v2,(a0),v8 emul = 2
  - vl = 9, v0 = 0x1FF
- Under this configuration, because signed(emul) > signed(lmul), two Uops will actually be generated, indicating that the index needs to be taken from two vector registers separately. The destination register corresponding to the two Uops is the same Vd. That is, the VdIdx of the two Uops should be the same, and they are to be written to the same destination register. Therefore, the following results will be produced here:
  - uopIdxInField = 0, vdIdxInField = 0, flowMask = 0x00FF, toMergeBuffMask = 0x01FF
  - uopIdxInField = 1, vdIdxInField = 0, flowMask = 0x0001, toMergeBuffMask = 0x01FF
  - uopIdxInField = 0, vdIdxInField = 0, flowMask = 0x0000, toMergeBuffMask = 0x0000
  - uopIdxInField = 0, vdIdxInField = 0, flowMask = 0x0000, toMergeBuffMask = 0x0000
- The FlowNum calculated for each Uop is 8. More specific explanations can be found in VSplit.scala.
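The mask arithmetic in this example can be approximated with the Python sketch below. It is a simplified model under stated assumptions: `src_mask` and `flow_mask` are hypothetical helper names, vm = 1 is taken to mean "unmasked" as in the RISC-V Vector specification, and each Uop is assumed to own a contiguous window of FlowNum bits. The real logic in VSplit.scala also handles the per-Vd offset, which this sketch does not model.

```python
def src_mask(vm, v0, vstart, evl):
    """Bit i is set when element i is active:
    vstart <= i < evl, and either the instruction is unmasked (vm == 1) or v0[i] is set."""
    if evl <= vstart:
        return 0
    body = ((1 << evl) - 1) & ~((1 << vstart) - 1)
    return body if vm == 1 else body & v0


def flow_mask(mask, uop_idx, flow_num):
    """Slice out the flow_num bits belonging to the Uop at position uop_idx
    (assumption: Uops own consecutive, equally sized windows of the mask)."""
    return (mask >> (uop_idx * flow_num)) & ((1 << flow_num) - 1)


# Configuration from the example: vl = 9, v0 = 0x1FF, FlowNum per Uop = 8.
mask = src_mask(vm=1, v0=0x1FF, vstart=0, evl=9)
per_uop = [flow_mask(mask, i, 8) for i in range(4)]
```

Under these assumptions `mask` is 0x1FF and `per_uop` is 0xFF, 0x01, 0x00, 0x00, which agrees with the flowMask column listed in the example.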

S1:

Calculates UopOffset and Stride.
Calculates the FlowNum required for this Uop. The FlowNum sent to VSMergeBuffer differs from the FlowNum sent to VSSplitBuffer: the FlowNum in VSMergeBuffer is used to determine whether this Uop has completed all effective accesses, while the FlowNum in VSSplitBuffer is the number of flows to split.
Applies for a VSMergeBuffer table entry. Each Uop applies for one entry.
Sends information to VSSplitBuffer.

Feature 2: VSSplitBuffer performs splitting based on the secondary decoding information generated by VSSplitPipeline

VSSplitBuffer is a single-entry Buffer that accepts relevant information from VSSplitPipeline and buffers the Vector Store Uop that needs to be split.

VSSplitBuffer will split a Uop into multiple pieces of information that can be sent to the Scalar Store Pipeline based on the Uop's information, and send them to the Scalar Store Pipeline for actual memory access.

Enqueue Logic:

VSSplitBuffer accepts table entry applications and relevant information from VSSplitPipeline. When VSSplitBuffer has free entries, it allocates a VSSplitBuffer entry for each application and sets the Valid bit of the corresponding entry to high.

Dequeue Logic:

An entry is dequeued once splitting for it has finished, that is, when splitIdx is greater than or equal to the required number of splits (see Splitting below). The splitIdx counter is reset to zero at the same time.

Splitting:

VSSplitBuffer performs splitting based on the instruction type.
For Unit-Stride instructions:
When the base address is aligned (not crossing CacheLine), it accesses 128 bits at a time.
When the base address is not aligned (crossing CacheLine), the Uop is split and two 128-bit accesses are initiated.
For other Vector Store instructions, the Uop is split element by element, as the instruction semantics require, and memory is accessed one element at a time.
Each split will send the relevant information generated after splitting to the Scalar Store Pipeline for actual memory access.
Splitting is determined by the splitIdx counter. splitIdx indicates the number of splits performed for the current entry. When splitIdx is less than the required number of splits and it can be sent to the Scalar Store Pipeline, a split will be performed, and each split will increment the value of the splitIdx counter. When splitIdx is greater than or equal to the required number of splits, splitting ends, the entry is dequeued, and the splitIdx counter is reset to zero.
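The splitIdx behaviour described above can be modelled cycle by cycle with the Python sketch below. It is an illustration only: the class `SplitEntry`, its fields, and the `pipeline_ready` argument are hypothetical names, and the model spends one extra cycle on the dequeue, which the hardware may not do.

```python
class SplitEntry:
    """Behavioural model of one buffered Uop and its splitIdx counter."""

    def __init__(self, required_splits):
        self.valid = True
        self.required = required_splits
        self.split_idx = 0

    def step(self, pipeline_ready):
        """Advance one cycle. Returns the split index issued this cycle, or None."""
        if not self.valid:
            return None
        if self.split_idx >= self.required:
            # Splitting has ended: dequeue the entry and reset the counter.
            self.valid = False
            self.split_idx = 0
            return None
        if pipeline_ready:
            # The Scalar Store Pipeline can accept a request: perform one split.
            issued = self.split_idx
            self.split_idx += 1
            return issued
        return None  # stalled, counter unchanged


entry = SplitEntry(required_splits=3)
issued = [entry.step(ready) for ready in (True, False, True, True, True)]
```

In this run the second cycle stalls because the pipeline is not ready, so the three splits are issued in cycles 1, 3, and 4, and the entry is released in cycle 5.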

Address Calculation:

During splitting, relevant information to be sent to the Scalar Store Pipeline also needs to be calculated, mainly calculating the virtual address for each memory access after splitting.
The virtual address calculation varies depending on the instruction type and splitting method.
For Unit-Stride instructions:
- When the base address is aligned (not crossing CacheLine), a single 128-bit aligned access is sufficient.
- When the base address is not aligned (crossing CacheLine), the Uop is split and two consecutive 128-bit aligned addresses are used for the access.
For other Vector Store instructions, the Uop is split element by element, as the instruction semantics require, and the virtual address of each element is calculated according to those semantics.
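The address rules above can be sketched in Python as follows. This is a hedged model, not the actual implementation: the function names are hypothetical, "aligned" is assumed to mean aligned to 16 bytes (128 bits), and the strided and indexed formulas follow the element address definitions of the RISC-V Vector specification.

```python
def unit_stride_addrs(base):
    """128-bit aligned virtual addresses used for one Unit-Stride access.

    Assumption: an aligned base needs one access, an unaligned base needs
    the two consecutive 16-byte blocks that the data spans.
    """
    aligned = base & ~0xF
    if base == aligned:
        return [aligned]
    return [aligned, aligned + 16]


def element_addrs(base, count, stride=None, indices=None):
    """Per-element virtual addresses for strided or indexed stores.

    Strided: base + i * stride. Indexed: base + indices[i] (byte offsets).
    """
    if indices is not None:
        return [base + indices[i] for i in range(count)]
    return [base + i * stride for i in range(count)]
```

For instance, `unit_stride_addrs(0x1008)` returns the two blocks 0x1000 and 0x1010, while `element_addrs(0x2000, 3, stride=8)` returns one address per element.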

Data Calculation:

During splitting, relevant information to be sent to the Store Queue also needs to be calculated, mainly calculating the data to be stored after each split.
The calculation of the data to be stored varies depending on the instruction type and splitting method. Please refer to the requirements in the address calculation section above; it only needs to be aligned with the granularity of the address.

Redirection and Exception Handling:

When a redirect signal arrives, the relevant entries in VSSplitBuffer will be flushed based on the redirect information.

Overall Block Diagram

No block diagram for a single module.

Main Ports

Only VSSplit's external interfaces are listed; the internal interfaces of VSSplitPipeline and VSSplitBuffer are excluded.

Port	Direction	Description
redirect	In	Redirection port
in	In	Receives uop issuance from Issue Queue
toMergeBuffer.req	Out	Requests MergeBuffer entry
toMergeBuffer.resp	In	MergeBuffer response
out	Out	Sends memory access request to Store Unit
vstd	Out	Updates the status of the Store queue entry when the finished uop writes back to the backend
vstdMisalign	In	Receives misalign related signals from Store Unit and Store Misalign Buffer

Interface Timing

Interface timing is relatively simple, only providing textual descriptions.

Port	Description
redirect	Has Valid. Data is valid with Valid
in	Has Valid, Ready. Data is valid with Valid && Ready
toMergeBuffer.req	Has Valid, Ready. Data is valid with Valid && Ready
toMergeBuffer.resp	Has Valid. Data is valid with Valid
out	Has Valid, Ready. Data is valid with Valid && Ready
vstd	Has Valid, Ready. Data is valid with Valid && Ready
vstdMisalign	Has no Valid. Data is always considered valid, and the response occurs when the signal is generated
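
The "Valid && Ready" rule in the table above means that a data beat is transferred only in cycles where both signals are high. A minimal Python sketch of that rule follows; the function name `transferred` and the tuple layout are assumptions made for illustration.

```python
def transferred(cycles):
    """Return the data beats accepted on a Valid/Ready port.

    cycles is a sequence of (valid, ready, data) tuples, one per clock cycle.
    A beat is accepted only when valid and ready are both high in that cycle.
    """
    return [data for valid, ready, data in cycles if valid and ready]


trace = [
    (True, False, "a"),   # sender waits: receiver not ready
    (True, True, "a"),    # handshake fires, "a" is accepted
    (False, True, "x"),   # no valid data this cycle
    (True, True, "b"),    # handshake fires, "b" is accepted
]
accepted = transferred(trace)
```

Ports that only have Valid (redirect, toMergeBuffer.resp) behave as if Ready were always high.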