ExuUnit
- Version: V2R2
- Status: OK
- Date: 2025/01/20
- commit: xxx
Glossary
fu | Description |
---|---|
alu | Arithmetic Logic Unit |
mul | Multiplication Unit |
bku | B Extension Bit Manipulation and Crypto Unit |
brh | Branch Unit |
jmp | Jump Unit |
i2f | Integer to Float Unit |
i2v | Integer to Vector Move Unit |
VSetRiWi | vset unit for Reading Integer Writing Integer |
VSetRiWvf | vset unit for Reading Integer Writing Vector Float |
csr | Control and Status Register Unit |
fence | Memory Synchronization Instruction Unit |
div | Division Unit |
falu | Floating-point Arithmetic Logic Unit |
fcvt | Floating-point Conversion Unit |
f2v | Float to Vector Move Unit |
fmac | Floating-point Fused Multiply-Add |
fdiv | Floating-point Division Unit |
vfma | Vector Floating-point Fused Multiply-Add Unit |
vialu | Vector Integer Arithmetic Logic Unit |
vimac | Vector Integer Multiply-Add Unit |
vppu | Vector Permutation Processing Unit |
vfalu | Vector Floating-point Arithmetic Logic Unit |
vfcvt | Vector Floating-point Conversion Unit |
vipu | Vector Integer Processing Unit |
VSetRvfWvf | vset unit for Reading Vector Writing Vector Float |
vfdiv | Vector Floating-point Division Unit |
vidiv | Vector Integer Division Unit |
Inputs and Outputs
`flush` is a Redirect input with a valid signal.
`in` is the ExuInput generated from the specific ExeUnit parameter configuration.
`out` is the ExuOutput generated from the specific ExeUnit parameter configuration.
`csrio`, `csrin`, and `csrToDecode` exist only if a CSR is present in this ExeUnit. Similarly, `fenceio` exists only if a fence is present in this ExeUnit. `frm` exists only if this ExeUnit needs `frm` as a source. `vxrm` exists only if this ExeUnit needs `vxrm` as a source.
`vtype`, `vlIsZero`, and `vlIsVlmax` exist only if this ExeUnit needs to write Vconfig.
Additionally, if the ExeUnit contains JmpFu or BrhFu, the instruction address translation type `instrAddrTransType` is required as an input.
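The conditional ports above can be summarised as a small lookup. The following is only a behavioural sketch in Python, not the actual hardware description: the function name `optional_ports`, its boolean parameters, and the use of the lowercase glossary names (`csr`, `fence`, `jmp`, `brh`) to identify FUs are assumptions made for illustration.

```python
def optional_ports(fus, needs_frm=False, needs_vxrm=False, writes_vconfig=False):
    """Return the optional port names an ExeUnit with these FUs would expose.

    `fus` is a list of FU names; the three flags are hypothetical inputs that
    stand in for the corresponding ExeUnit parameters.
    """
    ports = []
    if "csr" in fus:
        ports += ["csrio", "csrin", "csrToDecode"]
    if "fence" in fus:
        ports.append("fenceio")
    if needs_frm:
        ports.append("frm")
    if needs_vxrm:
        ports.append("vxrm")
    if writes_vconfig:
        ports += ["vtype", "vlIsZero", "vlIsVlmax"]
    if "jmp" in fus or "brh" in fus:
        ports.append("instrAddrTransType")
    return ports
```

For example, a unit holding only `alu` gets no optional ports, while a unit holding `brh` and `jmp` gets `instrAddrTransType`.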
Functionality
Each ExuUnit instantiates the FU modules specified by its configuration parameters.
`busy` indicates whether the current ExeUnit is busy. For ExeUnits with determined latency, `busy` is tied to false: because the latency is fixed and all tasks complete in sequence, the unit is never reported as busy. For ExeUnits with non-determined latency, `busy` is asserted when an input fires and deasserted when the output fires. `busy` is also deasserted if the uop currently entering the unit or the uop currently being computed is flushed by a redirect.
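The `busy` behaviour for a non-determined-latency ExeUnit can be sketched as a one-bit state machine. This is a minimal Python model, not the hardware source: the class name `BusyTracker` and the single `flush` flag (meaning "the entering or in-flight uop is flushed") are illustrative, and the priority order when several events coincide in one cycle is an assumption.

```python
class BusyTracker:
    """Behavioural model of `busy` for a non-determined-latency ExeUnit."""

    def __init__(self):
        self.busy = False

    def step(self, in_fire=False, out_fire=False, flush=False):
        """Advance one cycle and return the new value of busy."""
        if flush or out_fire:
            # Assumed priority: clearing wins over setting in the same cycle.
            self.busy = False
        elif in_fire:
            self.busy = True
        return self.busy
```

A determined-latency ExeUnit needs no such state, since its `busy` is constantly false.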
Furthermore, the ExeUnit checks for mixed latency types, i.e., whether the functional units on the same port include both determined-latency and non-determined-latency units. In a mixed configuration, the priority of the non-determined-latency functional units must be the maximum value. This ensures that the write-back port priority is configured consistently when functional units with different latency types share a port, avoiding priority conflicts.
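The mixed-latency check can be illustrated with a short Python sketch. Everything named here is hypothetical: the tuple layout, the function `mixed_latency_ok`, and the constant `MAX_PRIORITY`, which merely stands in for "the maximum value" mentioned above (the actual value is not stated in this document).

```python
MAX_PRIORITY = 2**31 - 1  # hypothetical stand-in for "the maximum value"


def mixed_latency_ok(fus):
    """Check the mixed-latency rule for one port.

    `fus` is a list of (name, latency_certain, wb_priority) tuples.
    Returns True when the configuration is not mixed, or when it is mixed
    and every non-determined-latency FU has the maximum priority value.
    """
    has_certain = any(certain for _, certain, _ in fus)
    has_uncertain = any(not certain for _, certain, _ in fus)
    if not (has_certain and has_uncertain):
        return True  # not a mixed configuration, nothing to check
    return all(pri == MAX_PRIORITY for _, certain, pri in fus if not certain)
```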
Besides its FUs, each ExuUnit also has a submodule `in1ToN`, which is a Dispatcher. It dispatches the single ExuInput entering the ExuUnit to one of the FUs. Each ExuInput must enter exactly one FU.
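The exactly-one-FU requirement can be expressed as a simple check. The sketch below is a Python illustration only: the function `dispatch` and the idea of matching an input's FU type against FU names by string equality are assumptions, not the Dispatcher's actual selection logic.

```python
def dispatch(fu_type, fu_names):
    """Return the index of the single FU that accepts `fu_type`.

    Raises ValueError when zero or more than one FU matches, mirroring the
    requirement that each ExuInput enters exactly one FU.
    """
    hits = [i for i, name in enumerate(fu_names) if name == fu_type]
    if len(hits) != 1:
        raise ValueError(f"{fu_type!r} matched {len(hits)} FUs, expected exactly 1")
    return hits[0]
```

With the FU list `["alu", "mul", "bku"]`, a `mul` input is routed to index 1, and an input of an unlisted type is rejected.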
Additionally, there is a set of registers `inPipe`, a vector of (valid, input) pairs of size `latencyMax + 1`. It records each input and which cycle of computation that input is currently in. FUs that need to control the pipeline can obtain the original input data through `inPipe`.
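A rough way to picture `inPipe` is as a shift register of (valid, input) pairs. The Python model below is a simplification under stated assumptions: the class name `InPipe` is illustrative, entry `i` holds the input accepted `i` cycles ago, and flushes and stalls are ignored.

```python
class InPipe:
    """Simplified model of inPipe: latency_max + 1 (valid, input) pairs."""

    def __init__(self, latency_max):
        self.stages = [(False, None)] * (latency_max + 1)

    def step(self, valid, uop=None):
        """Stage 0 takes this cycle's input; older entries move one stage deeper."""
        self.stages = [(valid, uop if valid else None)] + self.stages[:-1]
        return self.stages
```

After an input `"a"`, an idle cycle, and an input `"b"`, a three-entry pipe holds `"b"` at stage 0 and `"a"` at stage 2, so an FU at its third cycle of computation can still read the original `"a"`.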
Finally, the results from the different FUs are collected, and one FU's output is selected as the output of the ExeUnit.
Design Specification
In the Backend, there are a total of 3 ExuBlocks: `intExuBlock`, `fpExuBlock`, and `vfExuBlock`, which are the execution blocks for integer, floating-point, and vector operations, respectively. Each ExuBlock contains several ExeUnits.
`intExuBlock` contains 8 ExeUnits. The function of each ExeUnit is as follows:
ExeUnit | Function |
---|---|
exus0 | alu, mul, bku |
exus1 | brh, jmp |
exus2 | alu, mul, bku |
exus3 | brh, jmp |
exus4 | alu |
exus5 | brh, jmp, i2f, i2v, VSetRiWi, VSetRiWvf |
exus6 | alu |
exus7 | csr, fence, div |
`fpExuBlock` contains 5 ExeUnits. The function of each ExeUnit is as follows:
ExeUnit | Function |
---|---|
exus0 | falu, fcvt, f2v, fmac |
exus1 | fdiv |
exus2 | falu, fmac |
exus3 | fdiv |
exus4 | falu, fmac |
`vfExuBlock` contains 5 ExeUnits. The function of each ExeUnit is as follows:
ExeUnit | Function |
---|---|
exus0 | vfma, vialu, vimac, vppu |
exus1 | vfalu, vfcvt, vipu, VSetRvfWvf |
exus2 | vfma, vialu |
exus3 | vfalu |
exus4 | vfdiv, vidiv |
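The three tables above can also be written as data, which makes it easy to ask which ExeUnits host a given FU. This Python sketch only transcribes the tables; the names `EXU_BLOCKS` and `units_with` are illustrative and do not come from the design.

```python
# FU lists per ExeUnit, transcribed from the tables above (index i is exus<i>).
EXU_BLOCKS = {
    "intExuBlock": [
        ["alu", "mul", "bku"],
        ["brh", "jmp"],
        ["alu", "mul", "bku"],
        ["brh", "jmp"],
        ["alu"],
        ["brh", "jmp", "i2f", "i2v", "VSetRiWi", "VSetRiWvf"],
        ["alu"],
        ["csr", "fence", "div"],
    ],
    "fpExuBlock": [
        ["falu", "fcvt", "f2v", "fmac"],
        ["fdiv"],
        ["falu", "fmac"],
        ["fdiv"],
        ["falu", "fmac"],
    ],
    "vfExuBlock": [
        ["vfma", "vialu", "vimac", "vppu"],
        ["vfalu", "vfcvt", "vipu", "VSetRvfWvf"],
        ["vfma", "vialu"],
        ["vfalu"],
        ["vfdiv", "vidiv"],
    ],
}


def units_with(fu):
    """Return (block, exu) pairs for every ExeUnit that contains `fu`."""
    return [
        (block, f"exus{i}")
        for block, units in EXU_BLOCKS.items()
        for i, fus in enumerate(units)
        if fu in fus
    ]
```

For instance, `units_with("fdiv")` lists `exus1` and `exus3` of `fpExuBlock`.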
Clock Gating
ExuUnit also supports clock gating for functional units (FUs). Power consumption is reduced by controlling the clock enable signal `clk_en` of each FU so that its clock is enabled only when the FU is needed. `clk_en` is computed dynamically from the FU's latency setting and from whether the FU has non-determined latency.
Simply put, FUs with a fixed latency greater than 0 cycles use two vectors, `fuVldVec` and `fuRdyVec`, of length `latReal + 1`. When the FU input is valid, `fuVldVec(0)` is 1, and this 1 shifts by one position toward the output each cycle. The value of `fuRdyVec(i)` depends on `fuRdyVec(i+1)` and `fuVldVec(i+1)`. Thus, any 1 in `fuVldVec` indicates that a valid computation is currently in the FU.
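The shifting valid bit can be simulated cycle by cycle. The Python sketch below is a behavioural model only. The text states that `fuRdyVec(i)` depends on `fuRdyVec(i+1)` and `fuVldVec(i+1)` but not how; the model assumes the common pipeline rule "stage `i` is ready if stage `i+1` is empty or itself ready", and the class name `FixedLatencyTracker` and the `out_ready` parameter are likewise illustrative.

```python
class FixedLatencyTracker:
    """Tracks in-flight work of a fixed-latency FU (lat_real > 0)."""

    def __init__(self, lat_real):
        self.lat = lat_real
        self.regs = [False] * lat_real  # registered stages 1..lat_real

    def step(self, in_valid, out_ready=True):
        """Simulate one cycle; return True if any fuVldVec bit is set."""
        vld = [in_valid] + self.regs            # fuVldVec as seen this cycle
        rdy = [False] * self.lat + [out_ready]  # fuRdyVec, last entry = output
        for i in reversed(range(self.lat)):
            # Assumed rule: stage i may advance if stage i+1 is empty or ready.
            rdy[i] = (not vld[i + 1]) or rdy[i + 1]
        active = any(vld)
        nxt = list(self.regs)
        for i in range(1, self.lat + 1):
            if rdy[i - 1] and vld[i - 1]:
                nxt[i - 1] = True    # valid bit moves from stage i-1 to stage i
            elif rdy[i]:
                nxt[i - 1] = False   # stage i drains
        self.regs = nxt
        return active
```

With `lat_real = 2` and an always-ready output, one valid input keeps the tracker active for three consecutive cycles and idle afterwards.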
For FUs with non-determined latency, the register `uncer_en_reg` is set when the FU input fires and cleared when the FU output fires.
Therefore, for FUs that can use clock gating, `clk_en` is high under the following conditions: for zero-latency FUs, when the FU input fires; for multi-cycle latency FUs, when the input fires or a valid computation is currently in the FU; for non-determined latency FUs, likewise when the FU input fires or a valid computation is currently in the FU. Clock gating is achieved using these conditions.
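The three cases can be condensed into one function. This is a hedged Python sketch of the stated conditions only: the function name `clk_en`, the string labels for the latency kinds, and the `in_flight` flag (standing for "a 1 exists in `fuVldVec`" or "`uncer_en_reg` is set") are assumptions for illustration.

```python
def clk_en(kind, in_fire, in_flight=False):
    """Clock-enable condition per FU latency kind.

    kind: "zero" (zero latency), "multi" (fixed latency > 0),
          or "uncertain" (non-determined latency).
    in_flight: True when a valid computation is currently in the FU.
    """
    if kind == "zero":
        return in_fire
    if kind in ("multi", "uncertain"):
        return in_fire or in_flight
    raise ValueError(f"unknown latency kind: {kind!r}")
```

A zero-latency FU is therefore clocked only in the cycle its input fires, whereas the other two kinds stay clocked until their in-flight work has drained.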