CtrlBlock

Version: V2R2
Status: OK
Date: 2025/01/15
commit: xxx

Glossary

Abbreviation	Full Name	Description
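-	Decode Unit	Decode unit
-	Fusion Decoder	Instruction fusion
ROB	Reorder Buffer	Reorder buffer
RAT	Register Alias Table	Rename mapping table
-	Rename	Register renaming
LSQ	Load Store Queue	Load/store instruction queue
-	Dispatch	Dispatch
IntDq	Int Dispatch Queue	Integer dispatch queue
fpDq	Float Point Dispatch Queue	Floating-point dispatch queue
lsDq	Load Store Dispatch Queue	Load/store dispatch queue
-	Redirect	Instruction redirect
pcMem	PC MEM	Instruction address cache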

Submodule List

Submodule	Description
dispatch	Instruction Dispatch Module
decode	Instruction Decode Module
fusionDecoder	Instruction Fusion Module
rat	Rename Table
rename	Rename Module
redirectGen	Redirect Generation Module
pcMem	Instruction Address Cache
rob	Reorder Buffer
trace	Instruction Trace Module
snpt	Snapshot Module

Design Specifications

Decode Width: 6

Rename Width: 6

Dispatch Width: 6

ROB Commit Width: 8

RAB Commit Width: 6

ROB Size: 160

Snapshot Size: 4 entries

Number of Integer Physical Registers: 224

Number of Floating-Point Physical Registers: 192

Number of Vector Physical Registers: 128

Number of Vector v0 Physical Registers: 22

Number of Vector vl Physical Registers: 32

Supports Rename Snapshot

Supports Trace Extension

Functionality

The CtrlBlock module includes Instruction Decode (Decode), Instruction Fusion (FusionDecoder), Register Rename (Rename, RenameTable), Instruction Dispatch (Dispatch), Commit Unit (ROB), Redirect Handling (RedirectGenerator), and Snapshot Rename Recovery (SnapshotGenerator).

Each clock cycle, the decode unit takes 6 instructions from the head of the instruction queue and decodes them. Decoding translates each instruction encoding into internal control codes convenient for the functional units, identifying the instruction type, the register numbers it uses, and any immediate contained in the encoding, for use by the subsequent register renaming stage. Complex instructions are selected and then split one instruction at a time by the complex decoder DecodeCompunit. For vset instructions, information is stored in Vtype to guide instruction splitting. Finally, 6 uops are selected each cycle, with complex instructions preceding simple instructions, and passed to the renaming stage. The decode stage also issues the read requests to the RenameTable.

Instruction fusion pairs the 6 uops obtained from instruction decoding into (uop0, uop1), (uop1, uop2), (uop2, uop3), (uop3, uop4), (uop4, uop5) forms, generating up to 5 pairs of instructions to be fused. Then it determines whether each pair of instructions can undergo instruction fusion. Currently, we support two types of instruction fusion: fusion into a single instruction with new control signals, and replacing the operation code of the first instruction with another. After determining that instruction fusion is possible, we reassign operands for the uop, such as logical register numbers, and select new operands. Additionally, HINT type instructions do not support instruction fusion; for example, the fence instruction cannot be fused.

The renaming stage manages and maintains the mapping between logical registers and physical registers. By renaming logical registers, it eliminates false dependencies between instructions and enables out-of-order instruction scheduling. The renaming logic consists primarily of the Rename and RenameTable modules, which respectively control the Rename pipeline stage and maintain the (architectural/speculative) rename table. Rename contains the FreeList and CompressUnit modules, which maintain the free registers and perform ROB compression.

The dispatch stage distributes the renamed instructions to 4 schedulers based on instruction type: integer, floating-point, vector, and load/store. Each scheduler is further divided into several issue queues based on operation type, and each issue queue has an enqueue width of 2.

The instruction flow within CtrlBlock is as follows. CtrlBlock reads the ctrlflow for the 6 instructions passed from the Frontend. Decode adds the decoded logical registers, operation types, and other information; complex instructions additionally pass through DecodeComp, which adds instruction splitting information. 6 uops are selected and output per cycle, and RAT read requests are issued. Uops that can be fused are fused and cleared on entering rename. Rename adds physical register information and ROB compression information and passes the uops to dispatch. Finally, dispatch requests entries in ROB / RAB / Vtype and outputs the uops to the issue queues according to instruction type. Among these modules, only the issue queue is in-order entry, out-of-order exit; all other modules are in-order entry, in-order exit.

Decode

The decode process for scalar instructions is the same as in Nanhu.

For vector instructions, decoding is first performed using a decode table with the same structure as for scalar instructions. While decoding, the instruction splitting type is obtained, and splitting is then performed based on the instruction splitting type. The splitting process is equivalent to modifying the source register number, source register type, destination register number, destination register type, and updating the number of uops, used to control the number of write-backs required for a single instruction during ROB write-back. Only after all split uops complete the rename process can the decode ready signal be set to 1.

Since scalar floating-point instructions other than i2f now run on the vector floating-point module, only 4 of the decode signals in fpdecoder (typeTagOut, wflags, typ, rm) are used, in a way similar to Nanhu. For floating-point instructions running on the vector floating-point module, the futype and fuoptype must be obtained in the vector decode unit. A 1-bit isFpToVecInst signal indicates whether the instruction is a scalar floating-point instruction or a vector floating-point instruction, so that the two can be told apart when they share the vector execution unit.

Decode Stage Inputs

Besides receiving the instruction stream from the frontend, the decode stage also needs to receive Vtype-related information from the ROB: walk, commit, vsetvl information, to guide the decoding of vector complex instructions.

Decode Outputs

To FusionDecoder: Outputs the instruction stream and controls whether instruction fusion is enabled.

To Rename: Pipelines out 6 uops; if a redirect occurs, it blocks until the redirect in CtrlBlock is sent to the frontend, allowing the frontend to issue the correct instruction stream.

To RAT: Decode issues speculative rename read requests.

FusionDecoder

The instruction fusion module checks whether the uops produced by the decode module are related in a way that allows the work of multiple uops to be done by a single uop. Currently only the fusion of two instructions is supported.

Instruction fusion pairs the 6 uops obtained from instruction decoding into (uop0, uop1), (uop1, uop2), (uop2, uop3), (uop3, uop4), (uop4, uop5) forms, generating up to 5 pairs of instructions to be fused. Then it determines whether each pair of instructions can undergo instruction fusion. Currently, we support two types of instruction fusion: fusion into a single instruction with new control signals, and replacing the operation code of the first instruction with another. After determining that instruction fusion is possible, we reassign the operands of the uop, such as logical register numbers, and select new operands. Additionally, HINT type instructions do not support instruction fusion; for example, the fence instruction cannot be fused.

For example, slli r1, r0, 32 followed by srli r1, r1, 32 shifts the value in r0 left by 32 bits into r1 and then shifts r1 right by 32 bits. This is equivalent to add.uw r1, r0, zero (pseudo-instruction zext.w r1, r0), which zero-extends the low 32 bits of r0 and writes the result to r1.
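The equivalence in this example can be checked with a few lines of Python. This is only an arithmetic sketch on 64-bit values; the helper names slli, srli, and zext_w are chosen here for illustration and model the instruction semantics, not the hardware.

```python
MASK64 = (1 << 64) - 1  # model a 64-bit register


def slli(value, shamt):
    """Logical left shift, truncated to 64 bits."""
    return (value << shamt) & MASK64


def srli(value, shamt):
    """Logical right shift of a 64-bit value."""
    return (value & MASK64) >> shamt


def zext_w(value):
    """Zero-extend the low 32 bits, as add.uw rd, rs, zero does."""
    return value & 0xFFFFFFFF


r0 = 0xDEADBEEF12345678
unfused = srli(slli(r0, 32), 32)  # slli r1, r0, 32 ; srli r1, r1, 32
fused = zext_w(r0)                # zext.w r1, r0
```

Both paths keep only the low 32 bits of r0, which is why the pair can be replaced by one uop.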

The inputs are up to 6 decoded uops, their original instruction encodings, and the corresponding valid signals. The input inReady has only 5 bits (decode width minus one) because the uops are paired with an offset of one into at most 5 candidate pairs. inReady[i] indicates that in(i+1) is ready to be accepted.

The output width is decode width minus one. Each output carries the instruction fusion replacement, which replaces fuType, fuOpType, lsrc2 (logical register number of the second operand, if any), src2Type (type of the second operand), and selImm (immediate type), together with instruction fusion information, such as whether rs2 comes from rs1, rs2, or zero. Additionally, a boolean vector 'clear' of decode width is output, indicating whether each uop must be cleared due to instruction fusion. Currently, the design assumes that the second instruction of each fused pair is cleared. The 'clear' bit for uop 0 is never true: later instructions are always fused onto earlier ones, so uop 0 never disappears due to instruction fusion, whether or not it is fused.
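The pairing and the 'clear' vector can be sketched in Python as follows. This is a behavioral illustration only: fuse_window and the can_fuse predicate are hypothetical names, invalid slots are modeled as None, and the Hint check and operand replacement are left out.

```python
def fuse_window(uops, can_fuse):
    """Return (fused_pairs, clear) for one decode group.

    uops: list of decoded uops, None for an invalid slot.
    can_fuse: hypothetical predicate telling whether (first, second) is fusible.
    """
    width = len(uops)
    clear = [False] * width      # clear[0] can never become True
    fused_pairs = []             # indices i such that (uop i, uop i+1) fuse
    for i in range(width - 1):   # decode width minus one candidate pairs
        first, second = uops[i], uops[i + 1]
        pair_valid = first is not None and second is not None
        # A uop already absorbed by the previous pair cannot start a fusion.
        if pair_valid and not clear[i] and can_fuse(first, second):
            fused_pairs.append(i)
            clear[i + 1] = True  # the second uop of a fused pair is dropped
    return fused_pairs, clear


def example_can_fuse(first, second):
    """Toy rule for the example: only slli followed by srli fuses."""
    return (first, second) == ("slli", "srli")


pairs, clear = fuse_window(
    ["slli", "srli", "slli", "srli", "add", None], example_can_fuse
)
```

With this input, pairs 0 and 2 fuse and uops 1 and 3 are marked for clearing, while uop 0 stays in place.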

Output validity requirements: The instruction pair is valid (the uop pair passed from the decode module is valid), it must not be cleared by instruction fusion, there is a feasible instruction fusion result, and it must not be a Hint type instruction. Simultaneously, assign information such as fuType, src2Type, rs2FromZero, etc.

Redirect

For redirects, CtrlBlock is mainly responsible for generating them and sending them to the various modules.

RedirectGenerator

The RedirectGenerator module manages redirect signals from different sources (such as execution units and load), and determines whether a redirect occurs and how to refresh related information. It ensures the correctness of data flow through multi-level registers and synchronization mechanisms, and guarantees the correctness of instruction execution through address translation and error detection.

The fullTarget field is obtained by concatenating the fullTarget of the current oldest execution redirect with cfiUpdate.target. Additionally, if the current oldest execution redirect does not originate from the CSR, the target address must also be checked for IAF (instruction access fault), IPF (instruction page fault), and IGPF (instruction guest page fault), based on the translation type of the instruction address.

Then select the oldest redirect from the oldest execution redirect and load redirect, while also ensuring that this oldest redirect will not be flushed by robFlush or previous redirects.

Redirect Generation

Redirects generated in CtrlBlock primarily include two sources:

Errors occurring during processor execution, summarized by redirectgen (including branch mispredictions and memory access violations). These redirects are subsequently referred to as exuredirects.
The robflush generated by the ROB exceptiongen: interrupt (csr) / exception / pipeline flush (csr+fence+load+store+varith+vload+vstore) + frontend exception. Exception, interrupt, and pipeline flush redirects from the ROB are all handled similarly.

For redirects summarized by redirectgen:

Redirects written back by functional units (jump, brh) are input to the redirectgen module after a one-cycle delay and provided they are not cancelled by an older, already processed redirect.
Violations written back by the Memblock (memory access violations) are input to redirectgen after a one-cycle delay and provided they are not cancelled by an older, already processed redirect.

Redirectgen selects the oldest redirect and waits for one cycle after input, then outputs it after adding the data read back from pcMem.

For robflush signals, after reception, they also need to wait for one cycle before adding the data read back from pcMem.

When generating Redirects, CtrlBlock prioritizes redirecting robflush signals; only when there is no robflush will it process exuredirects.
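The selection rules above can be summarized in a short Python sketch. It is a behavioral model under stated assumptions, not the RTL: the RobPtr encoding (a wrap flag plus an index), the is_after comparison, and the names Redirect and pick_redirect are illustrative, and the one-cycle delays and the pcMem read are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RobPtr:
    flag: bool   # assumed to toggle on every wrap of the circular ROB
    value: int   # index inside the ROB


def is_after(left, right):
    """True if `left` is younger than `right` (assumed wrap-flag encoding)."""
    return (left.flag != right.flag) != (left.value > right.value)


@dataclass(frozen=True)
class Redirect:
    source: str      # "exu", "load" or "robflush" (labels for this sketch)
    rob_idx: RobPtr


def pick_redirect(rob_flush, exu, load):
    """robflush wins; otherwise the older of the exu and load redirects."""
    if rob_flush is not None:
        return rob_flush
    candidates = [r for r in (exu, load) if r is not None]
    if not candidates:
        return None
    oldest = candidates[0]
    for r in candidates[1:]:
        if is_after(oldest.rob_idx, r.rob_idx):
            oldest = r
    return oldest


exu = Redirect("exu", RobPtr(False, 150))
load = Redirect("load", RobPtr(True, 3))      # wrapped, so younger than exu
flush = Redirect("robflush", RobPtr(False, 140))
```

Here pick_redirect(None, exu, load) returns the exu redirect, and pick_redirect(flush, exu, load) returns the robflush regardless of the other two.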

The overall block diagram for the above part is as follows:

Redirect Distribution

After generating the Redirect signal, CtrlBlock distributes the redirect signal to various modules in the pipeline stages.

To decode: sends the current redirect or redirectpending (i.e., decode waits until the redirect sent from CtrlBlock to the frontend is ready, allowing the frontend to receive the correct instruction stream before the pipeline can continue).
To rename, rat, rob, dispatch, snpt, mem: sends the current redirect.
To issueblock, datapath, exublock: sends the redirect after a one-cycle delay.

Especially notable is the redirect sent to the frontend. The redirect sent to the frontend and its effects include three parts in total: rob_commit, redirect, ftqIdx (readAhead, seloh).

For rob commit

The flush signal transmitted to the frontend may be delayed by several cycles, and continuing to commit before the flush arrives could cause the frontend to see a commit followed by a flush, leading to errors. Therefore, all flushes are treated as exceptions to ensure consistent frontend behavior. When the ROB commits an instruction carrying a flush signal, CtrlBlock uses robflush to suppress that commit directly, informing the frontend to flush but not to commit.

As for exuredirects, their corresponding instructions can only be committed after writing back to the ROB and waiting for the walk to complete. Therefore, these redirects do not require special handling; their commit is guaranteed to come after their write-back.

For redirect:

The redirect signal sent to the frontend also includes additional CFIupdate, while ftq information is updated via additional readAhead and seloh.

For exuredirects, their CFIupdate and ftqidx information are already included when passed back from the functional unit, so no special handling is required.

For flushes and exceptions issued by the ROB, the target address for the CFI update must be obtained from the CSR. The ROB first issues a flush signal and raises the exception; a redirect is sent to the CSR indicating that an exception occurred; the CSR returns the Trap Target to CtrlBlock; finally, CtrlBlock issues the redirect to the frontend.

For target address updates caused by other pipeline flushes, the base pc is obtained through previous interaction with pcmem, and the target address is generated in CtrlBlock by adding an offset based on whether it is flushing itself.
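A minimal sketch of this target calculation is shown below. The function name flush_target and the is_rvc parameter are hypothetical; the choice of 2 or 4 bytes as the offset assumes that a flush which does not flush itself resumes at the next sequential instruction, which the text above does not spell out.

```python
def flush_target(pc, flush_itself, is_rvc):
    """Target PC for a pipeline flush that does not go through the CSR.

    pc: full PC of the flushing instruction (base PC from pcMem plus offset).
    flush_itself: True if the flushing instruction is itself discarded.
    is_rvc: True for a 2-byte compressed instruction (assumption of this sketch).
    """
    if flush_itself:
        return pc                     # re-fetch the flushing instruction
    return pc + (2 if is_rvc else 4)  # resume at the next instruction
```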

A special case is XRet during pipeline flush issued by the CSR. In this case, the target address update also needs to be obtained from the CSR. However, the path for CSR generating Xret does not need to rely on exceptions sent back from the ROB; it can directly interact with CtrlBlock via csrio.

For ftqIdx:

CtrlBlock mainly sends two sets of data: ftqIdxAhead and ftqIdxSelOH.

ftqIdxAhead is used by the frontend to read redirect-related ftqidx one cycle in advance. ftqIdxAhead is a vector of FtqPtr with a size of 3, where the first is the execution redirect (jmp/brh), the second is the load redirect, and the third is robflush.

ftqIdxSelOH is used to select the valid ftqidx: the first two are selected by the one-hot code output by redirectgen, and the third is selected by whether the redirect sent to the frontend is valid.

Ensuring Redirect Issue Order

To ensure correct execution, a newer redirect must not be issued before an older redirect. The following describes four cases:

(1) New exuredirect issued after an old robflush:

At write-back, an exuredirect checks whether an older redirect already exists. Once a robflush arrives, exuredirects generated later are flushed directly in the exublock; exuredirects generated earlier that the robflush has not yet flushed are checked against older redirects and, if one exists, are flushed as well.

(2) New exuredirect after an old exuredirect:

At write-back, an exuredirect checks whether an older redirect already exists. Once a redirect occurs, exuredirects generated later are likewise flushed directly in the exublock; exuredirects generated earlier that the current redirect has not yet flushed are checked against older redirects and, if one exists, are cancelled as well.

(3) New robflush after an old redirect:

In this case, the ROB guarantees that this will not happen. The robflush output indicates the instruction at the current robdeq has an exception/interrupt flag, and robdeq, being the current oldest robidx, is definitely older than any existing redirect.

(4) New robflush after an old robflush:

This part is mainly guaranteed in the ROB. exceptionGen obtains the oldest robflush, and simultaneously, when robflush is issued, it checks the previous flushout; newer robflushes will be cancelled.

Snapshot Recovery

For rename recovery, Kunminghu currently employs a snapshot recovery scheme: upon redirection, it does not necessarily recover to the architectural state, but may recover to a certain snapshot state. A snapshot is the speculative state saved during the renaming stage according to certain rules. It includes the ROB enqptr, the Vtypebuffer enqptr, the RAT spec table, the freelist headptr (dequeue pointer), and the robidx kept in CtrlBlock for overall control. Currently, each of these modules maintains four snapshots.

SnapshotGenerator

The SnapshotGenerator module is mainly used for generating and storing/maintaining snapshots. Its essence is a circular queue, maintaining up to four snapshots.

Enqueue: If the circular queue is not full and the enqueue signal is not cancelled by a redirect, enqueue at enqptr in the next cycle and update enqptr.

Dequeue: If the dequeue signal is not cancelled by a redirect, dequeue at deqptr in the next cycle and update deqptr.

Flush: Flush the corresponding snapshots in the next cycle based on the flush vector.

Update enqptr: If there is an empty snapshot, select the one closest to deqptr as the new enq pointer.

Snapshots: the snapshot queue registers are output directly.
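The enqueue, dequeue, flush, and enqptr-update rules above can be illustrated with a small behavioral model. This is a sketch, not the RTL: the class name SnapshotQueue, the step method, and the use of free-running integer pointers are assumptions made for illustration, and same-cycle corner cases are resolved in program order.

```python
class SnapshotQueue:
    """Behavioral model of a small circular snapshot queue."""

    def __init__(self, size=4):
        self.size = size
        self.data = [None] * size
        self.valid = [False] * size
        self.enq_ptr = 0   # free-running counters; slot = ptr % size
        self.deq_ptr = 0

    def is_full(self):
        return self.enq_ptr - self.deq_ptr == self.size

    def step(self, enq=None, deq=False, flush_vec=None, redirect=False):
        """Apply one cycle of enqueue / dequeue / flush requests."""
        flush_vec = flush_vec or [False] * self.size
        # Enqueue: only if not full and not cancelled by a redirect.
        if enq is not None and not redirect and not self.is_full():
            slot = self.enq_ptr % self.size
            self.data[slot], self.valid[slot] = enq, True
            self.enq_ptr += 1
        # Dequeue: only if not cancelled by a redirect.
        if deq and not redirect:
            self.valid[self.deq_ptr % self.size] = False
            self.deq_ptr += 1
        # Flush: invalidate the selected slots, then move enq_ptr to the
        # empty slot closest to deq_ptr.
        if any(f and v for f, v in zip(flush_vec, self.valid)):
            for i, f in enumerate(flush_vec):
                if f:
                    self.valid[i] = False
            for k in range(self.size):
                if not self.valid[(self.deq_ptr + k) % self.size]:
                    self.enq_ptr = self.deq_ptr + k
                    break
```

For example, after enqueuing three snapshots and flushing slots 1 and 2, the next enqueue reuses slot 1.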

Snapshot Creation

Snapshot creation is currently managed in the rename stage. Since the main source of performance-critical redirects is still branch misprediction, snapshots are created at branch and jump instructions. In addition, so that other redirects can use snapshot recovery even when no branch or jump is present, a snapshot is taken every commitwidth*4=32 uops.

The Rename module marks all six output uops with a snapshot flag, indicating whether the uop needs a snapshot. In CtrlBlock, the snapshot flags on the six uops are summarized onto the first uop. This operation is to address the correctness of the snapshot mechanism under blockBackward: if a blockBackward occurs among the six uops, and a snapshot needs to be taken after the blockBackward, this snapshot would fail to be taken in the ROB due to the blockBackward. Placing all snapshots on the first uop resolves this issue.
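The marking and the summarizing step can be sketched as below. This is an illustration under assumptions: the function names are hypothetical, the per-uop counter is a simplified way to model the 32-uop interval, and any additional conditions the rename stage applies before allowing a snapshot are omitted.

```python
SNAPSHOT_INTERVAL = 8 * 4   # commit width * 4 = 32 uops


def mark_snapshots(is_branch, uops_since_snapshot):
    """Per-uop snapshot flags for one rename group.

    is_branch: list of booleans, one per renamed uop.
    uops_since_snapshot: uops renamed since the last snapshot (sketch state).
    Returns (flags, updated counter).
    """
    flags = []
    for branch in is_branch:
        take = branch or uops_since_snapshot >= SNAPSHOT_INTERVAL
        flags.append(take)
        uops_since_snapshot = 0 if take else uops_since_snapshot + 1
    return flags, uops_since_snapshot


def collapse_to_first(flags):
    """Move any requested snapshot onto the first uop of the group."""
    return [any(flags)] + [False] * (len(flags) - 1)


flags, counter = mark_snapshots([False, False, True, False, False, False], 0)
summarized = collapse_to_first(flags)
```

Because the flag ends up on the first uop, a blockBackward later in the same group can no longer prevent the ROB from seeing the snapshot request.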

Snapshot creation for Rat, freelist, and ctrlblock is controlled by the snapshot flag output by the rename module. Stored data is managed by each module itself.

Snapshot creation for Rob and vtype depends on the snapshot flag flowing from the rename output to rob, and additionally requires that no blockBackward is in effect and that rab, rob, and vtypebuffer are not full. Snapshot creation for rob and vtype and the snapshot write for the aforementioned modules may not occur in the same cycle, but because the snapshot flag flows with the rename output to rob, the same robidx is written and the snapshots stay synchronized.

Snapshot Deletion

Snapshot deletion mainly includes two cases: one is deleting expired snapshots during commit; the other is deleting snapshots on the incorrect path during redirect.

For deleting snapshots during commit: Ctrlblock controls the deq signal to delete snapshots. If one of the eight robcommit uops matches the first uop in the snapshot pointed to by the current deqptr, the expired snapshot is deleted. Ctrlblock passes the deq signal to the aforementioned modules to synchronize the deletion of committed expired snapshots.

For deletion during redirect: Ctrlblock deletes snapshots on the incorrect path by providing the flushvec signal. It checks whether the first uop of each snapshot is newer than the current redirect (wrap-around of the robidx needs attention here). If it is newer, the snapshot is flushed, meaning the corresponding position in flushvec is set to 1. Ctrlblock passes the flushvec to the aforementioned modules to synchronize the flushing of snapshots on the incorrect path.
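A snapshot lies on the incorrect path when its first uop is younger than the redirecting instruction. The sketch below computes a flush vector on that basis. The (flag, value) pointer encoding and the is_after comparison are assumptions used to show the wrap-around handling, and snapshot_flush_vec is a hypothetical name.

```python
def is_after(left, right):
    """left/right are (flag, value) ROB pointers; True if left is younger."""
    (lflag, lval), (rflag, rval) = left, right
    return (lflag != rflag) != (lval > rval)


def snapshot_flush_vec(snapshot_first_uops, valids, redirect_idx):
    """One entry per snapshot slot: True means the snapshot is flushed."""
    return [
        valid and is_after(first, redirect_idx)
        for first, valid in zip(snapshot_first_uops, valids)
    ]


first_uops = [(False, 100), (False, 150), (True, 5), (False, 0)]
valids = [True, True, True, False]
flush_vec = snapshot_flush_vec(first_uops, valids, (False, 120))
```

With the redirect at index 120, the snapshots at 150 and at the wrapped index 5 are flushed, while the snapshot at 100 survives.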

Snapshot Management

CtrlBlock maintains a snapshot copy storing robidx within itself, which can conveniently inform various modules whether a snapshot is hit and the hit snapshot number when a redirect arrives. CtrlBlock iterates through the snapshots. If a snapshot older than the current redirect exists (or equal if it's a self-flushing uop), snapshot recovery is allowed, and the hit snapshot number is recorded and passed to the aforementioned modules.
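The hit check can be sketched as follows. The pointer encoding and is_after are assumptions, find_hit_snapshot is a hypothetical name, and preferring the youngest qualifying snapshot when several qualify is an assumption of this sketch that the text does not state.

```python
def is_after(left, right):
    """left/right are (flag, value) ROB pointers; True if left is younger."""
    (lflag, lval), (rflag, rval) = left, right
    return (lflag != rflag) != (lval > rval)


def find_hit_snapshot(snapshot_first_uops, valids, redirect_idx, flush_itself):
    """Return the slot number of the snapshot to recover from, or None."""
    best = None
    for slot, (first, valid) in enumerate(zip(snapshot_first_uops, valids)):
        if not valid:
            continue
        older = is_after(redirect_idx, first)
        same = first == redirect_idx and flush_itself
        if not (older or same):
            continue
        # Assumption: the youngest qualifying snapshot needs the least replay.
        if best is None or is_after(first, snapshot_first_uops[best]):
            best = slot
    return best


first_uops = [(False, 100), (False, 150), (True, 5), (False, 0)]
valids = [True, True, True, False]
hit = find_hit_snapshot(first_uops, valids, (False, 160), False)
```

If no valid snapshot qualifies, the function returns None, which corresponds to recovery without a snapshot.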

Spec state recovery via snapshot is controlled by each module itself.

Overall block diagram for the above part:

Snapshot Generation, Deletion, and Management

pcMem

pcMem essentially instantiates a SyncDataModuleTemplate and needs to provide multiple read ports and 1 write port. Size is 64 entries, each entry only includes startAddr.

What is read out from pcMem is the base PC, and the complete PC needs to be obtained by adding the Ftq Offset.

Under the current configuration, 14 read ports are required: 1 read port for redirect, 1 read port for robFlush, 3 read ports each for bjuPC and bjuTarget, 3 read ports for load, and 3 read ports for trace.
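The port count and the PC reconstruction can be checked with a small model. This is a sketch: the PcMem class and its method names are hypothetical, the synchronous read latency is not modeled, and the inst_offset_bits parameter (the offset counted in 2-byte units by default) is an assumption.

```python
PC_MEM_SIZE = 64

# Read ports per consumer, as listed above.
READ_PORTS = {
    "redirect": 1,
    "robFlush": 1,
    "bjuPC": 3,
    "bjuTarget": 3,
    "load": 3,
    "trace": 3,
}


class PcMem:
    """Stores one startAddr per Ftq entry."""

    def __init__(self):
        self.start_addr = [0] * PC_MEM_SIZE

    def write(self, ftq_idx, start_addr):
        """Single write port, driven by the frontend Ftq."""
        self.start_addr[ftq_idx] = start_addr

    def read_pc(self, ftq_idx, ftq_offset, inst_offset_bits=1):
        """Base PC plus the scaled Ftq offset gives the complete PC."""
        base_pc = self.start_addr[ftq_idx]
        return base_pc + (ftq_offset << inst_offset_bits)


total_read_ports = sum(READ_PORTS.values())
mem = PcMem()
mem.write(5, 0x80000000)
pc = mem.read_pc(5, 3)
```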

Inputs include write enable, write address, and write data from the frontend Ftq, as well as read requests and read addresses from different sources, outputting read results respectively.

GPAMem

The GPAMem module is similar to pcMem, instantiating a SyncDataModuleTemplate, but only needs to provide 1 read port and 1 write port, with a size of 64 entries. Each entry mainly includes a gpaddr, storing the gpaddr information corresponding to the frontend's ftq.

The ROB issues a gpaddr read request one cycle before the exception output to read the ftq information for the address, and receives the returned gpaddr information in the second cycle. Finally, it interacts directly with the CSR via robio.

Inputs include write enable, write address, and write data from the frontend IFU, as well as read requests and read addresses from the ROB, outputting read results to the ROB.

Trace

The trace submodule of CtrlBlock is used to collect instruction trace information. It receives information from ROB instruction commits and performs secondary compression on top of ROB compression (compressing instructions that don't need a PC with instructions that do need a PC into one entry and storing them in the trace buffer) to reduce pressure on pcMem reads.

Feature Support

The current implementation of in-core trace in KMH (Kunminghu) only supports instruction trace. Instruction trace information collected in-core includes: priv, cause, tval, itype, iretire, ilastsize, iaddr; the itype field supports all types.

Trace Pipeline Stage Functionality:

There are three cycles in CtrlBlock:

Stage 0: Delay rob commitInfo by one cycle;
Stage 1: commitInfo compression, blocking commit signal generation;
Stage 2: Read basePc from pcmem based on the compressed ftqptr, obtain the priv, xcause, xtval corresponding to the currently committed instruction from the csr;

There is one cycle in MemBlock:

Stage 3: Calculate the final iaddr using ftqOffset and the basePc read from pcmem;

Trace Buffer Compression Mechanism

Before each set of commitInfo enters the trace buffer, compression needs to be performed, i.e., each commitinfo entry that requires a PC is compressed with the entry before it into a single entry and sent to the trace buffer. Before entering the trace buffer, it calculates whether all entries entering the trace buffer in the current cycle can be fully dequeued in the next cycle. If not, it blocks ROB commit. This block will continue until the commitInfo that generated the block signal is fully dequeued from the trace buffer. The commitInfo that generates the blockCommit signal will enter the trace buffer unconditionally, but the commitinfo in the next cycle will definitely be blocked.
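The compression and the blocking decision can be sketched as below. This is an illustration under assumptions: the dictionary fields needs_pc and ftq_idx, the function names, and the dequeue_width parameter are hypothetical, entries without a PC are merged into the next entry that needs one by summing iretire, and trailing entries without a PC are simply reported back as a leftover count.

```python
def compress_commit_info(entries):
    """Merge entries that need no PC into the next entry that needs one.

    entries: list of dicts with 'iretire', 'needs_pc' and 'ftq_idx'.
    Returns (compressed entries, iretire left over without a PC entry).
    """
    compressed = []
    pending_retire = 0
    for entry in entries:
        pending_retire += entry["iretire"]
        if entry["needs_pc"]:
            compressed.append(
                {"ftq_idx": entry["ftq_idx"], "iretire": pending_retire}
            )
            pending_retire = 0
    return compressed, pending_retire


def must_block_commit(num_new_entries, dequeue_width):
    """Block ROB commit if the new entries cannot all leave next cycle."""
    return num_new_entries > dequeue_width


commit_info = [
    {"iretire": 2, "needs_pc": False, "ftq_idx": 7},
    {"iretire": 1, "needs_pc": True, "ftq_idx": 7},
    {"iretire": 3, "needs_pc": True, "ftq_idx": 8},
    {"iretire": 1, "needs_pc": False, "ftq_idx": 9},
]
compressed, leftover = compress_commit_info(commit_info)
```

In this example four commitInfo entries shrink to two trace buffer entries, so only two PC reads are needed instead of four.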