ExuUnit

  • Version: V2R2
  • Status: OK
  • Date: 2025/01/20
  • commit: xxx

Glossary

FU Glossary
FU Description
alu Arithmetic Logic Unit
mul Multiplication Unit
bku B Extension Bit Manipulation and Crypto Unit
brh Branch Unit
jmp Jump Unit
i2f Integer to Float Unit
i2v Integer to Vector Move Unit
VSetRiWi vset unit for Reading Integer Writing Integer
VSetRiWvf vset unit for Reading Integer Writing Vector Float
csr Control and Status Register Unit
fence Memory Synchronization Instruction Unit
div Division Unit
falu Floating-point Arithmetic Logic Unit
fcvt Floating-point Conversion Unit
f2v Float to Vector Move Unit
fmac Floating-point Fused Multiply-Add Unit
fdiv Floating-point Division Unit
vfma Vector Floating-point Fused Multiply-Add Unit
vialu Vector Integer Arithmetic Logic Unit
vimac Vector Integer Multiply-Add Unit
vppu Vector Permutation Processing Unit
vfalu Vector Floating-point Arithmetic Logic Unit
vfcvt Vector Floating-point Conversion Unit
vipu Vector Integer Processing Unit
VSetRvfWvf vset unit for Reading Vector Writing Vector Float
vfdiv Vector Floating-point Division Unit
vidiv Vector Integer Division Unit

Inputs and Outputs

flush is a Redirect input with a valid signal.

in is an ExuInput bundle generated from this ExeUnit's parameter configuration.

out is an ExuOutput bundle generated from this ExeUnit's parameter configuration.

csrio, csrin, and csrToDecode exist only if a CSR is present in this ExeUnit.

Similarly, fenceio exists only if a fence unit is present in this ExeUnit. frm exists only if this ExeUnit needs frm as a source, and vxrm exists only if it needs vxrm as a source.

vtype, vlIsZero, and vlIsVlmax exist only if this ExeUnit needs to write Vconfig.

If the ExeUnit contains a JmpFu or a BrhFu, it also takes the instruction address translation type, instrAddrTransType, as an input.

Functionality

Each ExuUnit instantiates the FU modules specified by its configuration parameters.

busy indicates whether the ExeUnit is currently busy. For ExeUnits with determined (fixed) latency, busy is tied to false: every uop finishes after a known number of cycles and all uops complete in sequence, so the unit can always accept a new uop. For ExeUnits with non-determined latency, busy is set when an input fires and cleared when the output fires. It is also cleared if the uop currently being input or the uop currently being computed is flushed by a redirect.
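A minimal Python behavioral sketch of the busy update described above (a model, not the RTL). The function name next_busy and its flag arguments are hypothetical, and the order in which flush, output fire, and input fire are checked when several occur in the same cycle is an assumption of this sketch.

```python
def next_busy(latency_certain: bool, busy: bool, in_fire: bool, out_fire: bool,
              flush_incoming: bool, flush_inflight: bool) -> bool:
    """Return busy for the next cycle (hypothetical behavioral model)."""
    if latency_certain:
        # Fixed-latency ExeUnit: uops complete in sequence after a known delay,
        # so busy is tied to False.
        return False
    # Assumed check order: flush, then output fire, then input fire.
    if flush_incoming or flush_inflight:
        # A redirect kills the incoming or in-flight uop, freeing the unit.
        return False
    if out_fire:
        # The result has been handed off, so the unit is free again.
        return False
    if in_fire:
        # A new uop was accepted and now occupies the unit.
        return True
    return busy  # no event this cycle: hold the current state


# Usage: accept a uop, wait two cycles, then write it back.
busy = False
trace = []
for in_fire, out_fire in [(True, False), (False, False), (False, False), (False, True)]:
    busy = next_busy(False, busy, in_fire, out_fire, False, False)
    trace.append(busy)
print(trace)  # [True, True, True, False]
```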

The ExeUnit also checks for mixed latency types, that is, whether the FUs on the same port include both determined-latency and non-determined-latency units. If they do, the priority of the non-determined-latency FUs is required to be the maximum value. This keeps the write-back port's priorities consistent when FUs with different latency types share the port, avoiding priority conflicts.
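The mixed-latency check can be sketched in Python as an elaboration-time test. Here None marks a non-determined-latency FU and an int a fixed latency in cycles. MAX_PRIORITY, has_mixed_latency, and check_port_priority are hypothetical names; the document gives neither the actual constant nor whether a larger value means higher or lower precedence.

```python
from typing import Optional, Sequence

# Hypothetical placeholder for "the maximum priority value".
MAX_PRIORITY = 2**31 - 1


def has_mixed_latency(fu_latencies: Sequence[Optional[int]]) -> bool:
    """True if the port hosts both fixed-latency (int) and non-determined (None) FUs."""
    certain = any(lat is not None for lat in fu_latencies)
    uncertain = any(lat is None for lat in fu_latencies)
    return certain and uncertain


def check_port_priority(fu_latencies: Sequence[Optional[int]],
                        uncertain_priority: int) -> None:
    """Raise if a mixed-latency port gives its non-determined FUs a non-maximal priority."""
    if has_mixed_latency(fu_latencies) and uncertain_priority != MAX_PRIORITY:
        raise ValueError("mixed-latency port: non-determined FUs need the maximum priority")


# Usage: a port with a 1-cycle FU and a non-determined FU passes only with MAX_PRIORITY.
check_port_priority([1, None], MAX_PRIORITY)
```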

Besides its FUs, each ExuUnit has a Dispatcher submodule, in1ToN, which forwards the single ExuInput entering the ExuUnit to one of the FUs. Each ExuInput must be sent to exactly one FU, never to more than one.
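A small Python sketch of the exactly-one-FU rule enforced around in1ToN. Choosing the target by comparing a fuType string against FU names is an assumption made for illustration; dispatch and its arguments are hypothetical.

```python
from typing import List, Sequence


def dispatch(in_valid: bool, fu_type: str, fu_names: Sequence[str]) -> List[bool]:
    """Return per-FU valid bits for one ExuInput (hypothetical in1ToN model).

    Matching fu_type against FU names is an illustrative assumption.
    """
    hits = [in_valid and fu_type == name for name in fu_names]
    if in_valid:
        # The same ExuInput must reach exactly one FU, never zero or several.
        assert sum(hits) == 1, f"{fu_type!r} matches {sum(hits)} FUs"
    return hits


# intExuBlock exus0 hosts alu, mul and bku.
print(dispatch(True, "mul", ["alu", "mul", "bku"]))  # [False, True, False]
```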

The ExuUnit also keeps inPipe, a set of registers holding a vector of latencyMax + 1 (valid, input) pairs. It records each input together with the cycle of computation that input has reached. FUs that control their own pipeline can read the original input data from inPipe.
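A Python sketch of inPipe, assuming index 0 holds the live input and index i holds the input that entered i cycles earlier. The class InPipe and its view/clock methods are hypothetical names.

```python
from typing import Any, List, Optional, Tuple

Entry = Tuple[bool, Optional[Any]]  # (valid, input)


class InPipe:
    """(valid, input) pairs of length latency_max + 1 (hypothetical model)."""

    def __init__(self, latency_max: int) -> None:
        # Registered entries 1..latency_max; entry 0 is the live input (assumption).
        self.regs: List[Entry] = [(False, None)] * latency_max

    def view(self, in_valid: bool, in_bits: Any) -> List[Entry]:
        """Whole vector for the current cycle; index i = input from i cycles ago."""
        return [(in_valid, in_bits)] + self.regs

    def clock(self, in_valid: bool, in_bits: Any) -> None:
        """Clock edge: every entry moves one position along, the oldest drops out."""
        self.regs = self.view(in_valid, in_bits)[:-1]


pipe = InPipe(latency_max=2)
pipe.clock(True, "a")
pipe.clock(False, None)
print(pipe.view(True, "b"))  # [(True, 'b'), (False, None), (True, 'a')]
```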

Finally, the results of the different FUs are collected, and one FU's result is selected as the output of the ExeUnit.
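Result selection sketched in Python. The fixed priority used here (lowest FU index wins) is an assumption; the document only states that one FU's result is chosen. select_output is a hypothetical name.

```python
from typing import Any, Optional, Sequence, Tuple


def select_output(fu_outs: Sequence[Tuple[bool, Any]]) -> Tuple[bool, Optional[Any]]:
    """Pick one FU result as the ExeUnit output.

    fu_outs holds one (valid, result) pair per FU. The lowest-index-wins
    priority is an assumption of this sketch.
    """
    for valid, result in fu_outs:
        if valid:
            return True, result
    return False, None  # no FU produced a result this cycle


print(select_output([(False, None), (True, 42), (False, None)]))  # (True, 42)
```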

ExuUnit Overview

Design Specification

In the Backend, there are a total of 3 ExuBlocks: intExuBlock, fpExuBlock, and vfExuBlock, which are the execution blocks for integer, floating-point, and vector operations, respectively. Each ExuBlock contains several ExeUnit units.

intExuBlock contains 8 ExeUnits. The FUs in each ExeUnit are as follows:

FUs Included in Each ExeUnit of intExuBlock
ExeUnit FUs
exus0 alu, mul, bku
exus1 brh, jmp
exus2 alu, mul, bku
exus3 brh, jmp
exus4 alu
exus5 brh, jmp, i2f, i2v, VSetRiWi, VSetRiWvf
exus6 alu
exus7 csr, fence, div

fpExuBlock contains 5 ExeUnits. The FUs in each ExeUnit are as follows:

FUs Included in Each ExeUnit of fpExuBlock
ExeUnit FUs
exus0 falu, fcvt, f2v, fmac
exus1 fdiv
exus2 falu, fmac
exus3 fdiv
exus4 falu, fmac

vfExuBlock contains 5 ExeUnits. The FUs in each ExeUnit are as follows:

FUs Included in Each ExeUnit of vfExuBlock
ExeUnit FUs
exus0 vfma, vialu, vimac, vppu
exus1 vfalu, vfcvt, vipu, VSetRvfWvf
exus2 vfma, vialu
exus3 vfalu
exus4 vfdiv, vidiv

Clock Gating

ExuUnit also supports clock gating for its functional units (FUs). Each FU has a clock enable signal, clk_en, and its clock runs only while the FU is needed, which reduces power consumption. clk_en is computed dynamically from the FU's latency setting and from whether non-determined latency is enabled for it.

For FUs with a fixed latency greater than 0 cycles, two vectors of length latReal + 1, fuVldVec and fuRdyVec, are used. When the FU input is valid, fuVldVec(0) is 1, and this 1 moves one position further along the vector each cycle. Each fuRdyVec(i) is derived from fuRdyVec(i+1) and fuVldVec(i+1). A 1 anywhere in fuVldVec therefore indicates that a computation is currently in progress in the FU.
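A cycle-level Python sketch of the fuVldVec/fuRdyVec bookkeeping. The exact update rule is an assumption: fuRdyVec(i) is taken as !fuVldVec(i+1) || fuRdyVec(i+1), the usual back-pressure rule for a pipeline, with the FU's output ready at the last position. FixedLatencyGate and out_ready are hypothetical names.

```python
from typing import List


class FixedLatencyGate:
    """fuVldVec/fuRdyVec model for a fixed-latency FU with latReal > 0 (hypothetical)."""

    def __init__(self, lat_real: int) -> None:
        assert lat_real > 0, "zero-latency FUs do not use these vectors"
        self.lat_real = lat_real
        # Registered part of fuVldVec: entries 1..lat_real (entry 0 is the live input).
        self.vld_regs: List[bool] = [False] * lat_real

    def step(self, in_fire: bool, out_ready: bool = True) -> bool:
        """Simulate one cycle and return clk_en for that cycle."""
        vld = [in_fire] + self.vld_regs          # fuVldVec, length lat_real + 1
        rdy = [False] * (self.lat_real + 1)      # fuRdyVec
        rdy[-1] = out_ready
        for i in range(self.lat_real - 1, -1, -1):
            # Assumed rule: entry i may advance if entry i+1 is empty or can advance.
            rdy[i] = (not vld[i + 1]) or rdy[i + 1]
        clk_en = in_fire or any(vld[1:])
        nxt = []
        for i in range(1, self.lat_real + 1):
            if rdy[i - 1] and vld[i - 1]:
                nxt.append(True)     # the valid bit moves forward into entry i
            elif rdy[i]:
                nxt.append(False)    # entry i drained and nothing replaced it
            else:
                nxt.append(vld[i])   # stalled by back-pressure: hold
        self.vld_regs = nxt
        return clk_en


gate = FixedLatencyGate(lat_real=2)
print([gate.step(fire) for fire in [True, False, False, False]])  # [True, True, True, False]
```

With lat_real = 2, a single input keeps the clock enabled for three cycles and then lets it gate off.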

For FUs with non-determined latency, the register uncer_en_reg is set when the FU input fires and cleared when the FU output fires.

Therefore, for FUs that support clock gating, clk_en is high under the following conditions:

  • Zero-latency FUs: the FU input fires.
  • Multi-cycle fixed-latency FUs: the FU input fires, or a computation is currently in progress in the FU.
  • Non-determined-latency FUs: the FU input fires, or a computation is currently in progress in the FU.
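The three clk_en conditions, together with uncer_en_reg for the non-determined case, sketched in Python. LatencyKind, clk_en, and UncertainEnable are hypothetical names, and keeping uncer_en_reg set when an input and an output fire in the same cycle is an assumption of this sketch.

```python
from enum import Enum


class LatencyKind(Enum):
    ZERO = 0        # fixed latency of 0 cycles
    FIXED = 1       # fixed latency greater than 0 cycles
    UNCERTAIN = 2   # non-determined latency


def clk_en(kind: LatencyKind, in_fire: bool, in_progress: bool) -> bool:
    """clk_en for one cycle; in_progress means a computation is in the FU
    (fuVldVec for fixed latency, uncer_en_reg for non-determined latency)."""
    if kind is LatencyKind.ZERO:
        return in_fire
    return in_fire or in_progress


class UncertainEnable:
    """uncer_en_reg: set when the FU input fires, cleared when its output fires."""

    def __init__(self) -> None:
        self.uncer_en_reg = False

    def step(self, in_fire: bool, out_fire: bool) -> bool:
        en = clk_en(LatencyKind.UNCERTAIN, in_fire, self.uncer_en_reg)
        if out_fire:
            self.uncer_en_reg = False
        if in_fire:
            # Assumption: a new input in the same cycle as an output keeps it set.
            self.uncer_en_reg = True
        return en


gate = UncertainEnable()
events = [(True, False), (False, False), (False, True), (False, False)]
print([gate.step(i, o) for i, o in events])  # [True, True, True, False]
```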