VFPU
- Version: V2R2
- Status: OK
- Date: 2025/01/20
- commit: xxx
Terminology
Abbreviation | Full Name | Description |
---|---|---|
VFPU | Vector Floating-Point Unit | Vector Floating-Point Unit |
IQ | Issue Queue | Issue Queue |
Design Specifications
- Supports vector floating-point Mul computation
- Supports vector floating-point FMA computation
- Supports vector floating-point Div computation
- Supports vector floating-point Sqrt computation
- Supports fp32, fp64, fp16 computation
- Supports computation of the RVV 1.0 vector floating-point instructions
Functionality
The VFPU module receives uop information issued from the Issue Queue. Based on information such as fuType and fuOpType, it completes the computation of vector floating-point instructions. It mainly includes four modules: VFAlu, VFMA, VFDivSqrt, and VFCvt.
VFAlu is primarily responsible for fadd-related instructions and some other simple instructions, such as comparison instructions and sign-injection instructions. Notably, reduction sum instructions are also computed in this module by decomposing uops.
VFMA is primarily responsible for multiplication and fused multiply-add (FMA) related instructions.
VFDivSqrt is primarily responsible for division and square root related instructions.
VFCvt is primarily responsible for format conversion and reciprocal estimation related instructions.
Algorithm Design
The difficulty in designing a vector floating-point unit lies in supporting computations for multiple single-precision formats (where operands and results have the same floating-point format) and mixed-precision formats (where operands and results have different floating-point formats). Taking the common half-precision (\(f16\)), single-precision (\(f32\)), and double-precision (\(f64\)) as examples, we compare the differences between scalar floating-point units and vector floating-point units.
Taking typical floating-point addition as an example, a scalar floating-point unit only needs to support computations for three single-precision formats. The input operands and output result of this unit should all be \(64\)-bit. Thus, it needs to support three computation formats:
(1) \(1 \ f64 = f64 + f64\);
(2) \(1 \ f32 = f32 + f32\);
(3) \(1 \ f16 = f16 + f16\).
It may seem that three separate modules are needed to perform these three types of computations. However, because floating-point numbers are composed of a sign bit, an exponent, and a significand, and because higher-precision floating-point numbers have wider exponents and significands than lower-precision ones, the hardware design for higher-precision floating-point numbers can fully satisfy the hardware requirements for lower-precision ones. With minor adjustments, multiple single-precision formats can be supported by adding \(Mux\) (multiplexer) in the hardware, with only a slight increase in area.
A vector floating-point unit must support vector operations. A characteristic of vector operations is their high data bandwidth utilization. For example, although a scalar unit has a \(64\)-bit interface, when computing \(f32/f16\), its effective data is only \(32/16\) bits, reducing bandwidth utilization to \(50\%/25\%\). A vector unit also has a \(64\)-bit interface, but when computing single-precision formats \(f32/f16\), it can perform \(2/4\) operations simultaneously, achieving \(100\%\) bandwidth utilization. The supported single-precision computation formats are as follows:
(1) \(1 \ f64 = f64 + f64\);
(2) \(2 \ f32 = f32 + f32\);
(3) \(4 \ f16 = f16 + f16\).
The need to perform multiple additions of floating-point numbers of the same format simultaneously makes hardware design more difficult than for scalar units, but hardware for higher-precision formats can also be reused for lower-precision formats. Additionally, another feature that vector floating-point units must support is mixed-precision computation. The \(RISC-V\) vector instruction set extension defines a series of \(widening\) instructions that require mixed-precision computation, requiring the floating-point addition unit to also support the following four computation formats:
(1) \(1 \ f64 = f64 + f32\);
(2) \(1 \ f64 = f32 + f32\);
(3) \(2 \ f32 = f32 + f16\);
(4) \(2 \ f32 = f16 + f16\).
The design difficulty of mixed-precision computation is much greater than that of multiple single-precision formats. On one hand, operands of different data formats need to be converted to the same format as the result before computation, increasing logical complexity. On the other hand, format conversion puts significant pressure on circuit timing, especially when converting low-precision denormalized numbers to high-precision floating-point numbers. Therefore, this document specifically designs a fast data format conversion algorithm to address timing issues.
In summary, the design challenges for a vector floating-point unit lie in the implementation of multiple single-precision formats and mixed-precision formats. This section will introduce the vector floating-point addition algorithm, floating-point sequential accumulation algorithm, vector floating-point fused multiply-add algorithm, and vector floating-point division algorithm one by one to address these design challenges and achieve a high-performance vector floating-point unit capable of reaching a frequency of \(3GHz\).
Vector Floating-Point Addition
Floating-point addition is one of the most commonly used arithmetic operations in scientific computing. Although conceptually simple, traditional single-path floating-point addition algorithms require two to three signed addition steps, which is a relatively time-consuming operation. Dual-path floating-point addition algorithms have only one signed addition operation on the critical path in the worst case, thus offering a significant speed advantage over single-path algorithms. Based on the dual-path floating-point addition algorithm, this document designs a faster, improved dual-path floating-point addition algorithm. This section first introduces single-precision single-path floating-point addition, dual-path floating-point addition, the improved dual-path floating-point addition algorithm, and finally, the vector floating-point addition algorithm.
The floating-point addition formula is expressed as \(fp\_result = fp\_a + fp\_b\). When the signs of \(fp\_a\) and \(fp\_b\) are the same, an addition is performed after aligning the significands; this is called effective addition. When the signs of \(fp\_a\) and \(fp\_b\) differ, a subtraction is performed after aligning the significands; this is called effective subtraction. For a denormalized number the exponent field is \(0\), but its effective exponent is the same as that of a normalized number whose exponent field is \(1\). Therefore, when calculating the exponent difference, an exponent field of \(0\) is treated as \(1\) (this adjusted value is called the normalized exponent), and the absolute difference of the normalized exponents is the normalized exponent difference.
Single-Path Floating-Point Addition Algorithm
The traditional single-path floating-point addition operation is shown in the figure and consists of the following steps:
(1) Normalized exponent subtraction (\(ES\)): Calculate the normalized exponent difference \(d = |Ea - Eb|\), where \(Ea\) and \(Eb\) are normalized exponents.
(2) Alignment (\(Align\)): Right-shift the significand of the smaller operand by \(d\) bits. The larger exponent is denoted as \(Ef\).
(3) Significand addition (\(SA\)): Perform addition or subtraction based on the effective operation \(Eo\). \(Eo\) is the actual arithmetic operation performed by the adder in the floating-point addition unit, determined by the sign bits of the two floating-point operands.
(4) Conversion (\(Conv\)): If the significand addition result is negative, convert the result to sign-magnitude representation. The conversion is done in one addition step, and the converted result is denoted as \(Sf\).
(5) Leading zero detection (\(LZD\)): Calculate the required left or right shift amount, denoted as \(En\). A right shift is positive; otherwise, it is negative.
(6) Normalization (\(Norm\)): Normalize the significand by shifting it \(En\) bits, and add \(En\) to \(Ef\).
(7) Rounding (\(Round\)): Round according to the \(IEEE\)-\(754\) standard. If necessary, add \(1\) to the \(LSB\) of \(Sf\). This step may cause an overflow, requiring the mantissa result to be right-shifted by one bit, and the exponent \(Ef\) to be incremented by \(1\).
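As a concrete illustration of these seven steps (the numbers are chosen here for exposition and are not from the source), consider the effective subtraction of \(fp\_a = 1.000_2 \times 2^{1}\) and \(fp\_b = -1.111_2 \times 2^{0}\):
\[ES:\ d = |1 - 0| = 1; \qquad Align:\ 1.111_2 \gg 1 = 0.1111_2,\ Ef = 1; \qquad SA:\ 1.0000_2 - 0.1111_2 = 0.0001_2\]
The result is positive, so \(Conv\) changes nothing; \(LZD\) determines that a left shift of \(4\) bits is needed (\(En = -4\)); \(Norm\) yields the significand \(1.000_2\) and the exponent \(Ef + En = -3\); the result is exact, so \(Round\) adds nothing. The final result is \(1.000_2 \times 2^{-3} = 0.125\), which matches \(2 - 1.875\).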
Dual-Path Floating-Point Addition Algorithm
The aforementioned single-path floating-point algorithm is slow because the steps in the addition operation are mostly executed serially. This algorithm can be improved in the following ways:
(1) In the single-path floating-point addition algorithm, the \(Conv\) step is only needed when the result is negative, and it can be avoided by swapping the significands of the two operands. By checking the sign of the \(ES\) step result and swapping (\(Swap\)) the significands accordingly, we always compute the larger significand minus the smaller significand. If the exponents are equal, the result might still be negative and require conversion, but rounding is not needed in this case. Therefore, the swap step makes rounding and conversion mutually exclusive, allowing them to be parallelized. Note that swapping also has the advantage of requiring only one shifter.
(2) The leading zero detection step can be performed in parallel with the significand addition step, removing it from the critical path. This optimization is very important in the case of subtraction when a large left shift is needed.
(3) So far, we have been able to reduce the critical path steps to: normalized exponent subtraction, swap, alignment, significand addition \(||\) leading zero detection, conversion \(||\) rounding, normalization (where \(||\) indicates steps that can be executed in parallel). The alignment and normalization steps are mutually exclusive and can be further optimized. Normalization requires a large left shift only when \(d \leq 1\) or in effective subtraction. Conversely, the alignment step requires a large right shift only when \(d > 1\). By distinguishing these two cases, there is only one large shift in the critical path, either alignment or normalization.
The steps in the single-path and dual-path floating-point addition algorithms are shown in the table. In the dual-path floating-point addition algorithm, the preprocessing step (\(Pred\)) in the \(d \leq 1\) path determines whether a one-bit right shift is needed to align the significands based on the value of \(d\). The dual-path floating-point addition algorithm improves speed by executing more steps in parallel, thus requiring more hardware.
Single-Path FP Addition | Dual-Path FP Addition (\(d \leq 1\) and Effective Sub) | Dual-Path FP Addition (\(d>1\) or Effective Add) |
---|---|---|
Normalized Exponent Sub | Preprocessing + Swap | Normalized Exponent Sub |
Alignment | -- | Alignment |
Significand Addition | Significand Addition \(\|\) LZD | Significand Addition |
Conversion | Conversion \(\|\) Rounding | Rounding |
Leading Zero Detection | -- | -- |
Normalization | Normalization | -- |
Rounding | Select Path | Select Path |
In the dual-path floating-point addition algorithm, during the \(SA\) step in effective subtraction, one of the significands is in 2's complement form. This complementation step and the rounding step are mutually exclusive and can therefore be parallelized. The optimized dual-path floating-point addition algorithm is shown in the table.
\(d \leq 1\) and Effective Subtraction | \(d>1\) or Effective Addition |
---|---|
Preprocessing + Swap | Normalized Exponent Subtraction + Swap |
Significand Addition + Conversion \(\|\) Rounding \(\|\) LZD | Alignment |
Normalization | Significand Addition \(\|\) Rounding |
Select Path | Select Path |
In \(IEEE\) round-to-nearest (RTN) mode, computing \(A+B\) and \(A+B+1\) is sufficient to handle all normalization possibilities (when rounding towards positive/negative infinity, \(A+B+2\) also needs to be computed). By using \(Cin\) to select the final rounded mantissa result from multiple significand adder results, complementation and rounding can be done simultaneously, saving an addition step. Since normalization in floating-point addition may require a right shift of one bit, no shift, or a left shift (the number of left shift bits can be as large as the significand length), \(Cin\) needs to consider all these normalization possibilities so that the finally selected result is the rounded result.
Improved Dual-Path Floating-Point Addition Algorithm
This section details the improved dual-path floating-point addition algorithm presented in this document. The path for effective addition or effective subtraction with \(d > 1\) is called the \(far\) path. The path for effective subtraction with \(d \leq 1\) is called the \(close\) path. Cases involving infinity or \(NaN\) operands are handled separately and do not belong to the \(far\) or \(close\) paths.
\(Far\) Path
The \(far\) path algorithm is shown in the figure, with the main steps as follows:
Step 1: In the \(far\) path, the exponent difference \(d\) is greater than \(1\). The significand of the smaller operand needs to be right-shifted by \(d\) bits to align with the significand of the larger operand. First, calculate the normalized exponent difference. To speed up the calculation, two adders are used to compute the normalized exponent difference, while simultaneously comparing the magnitudes of \(Efp\_a\) and \(Efp\_b\). The correct normalized exponent difference is selected based on the comparison result of the exponent magnitudes.
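A minimal \(Chisel\) sketch of this step is given below (the helper name and signals are illustrative, not taken from the RTL; \(chisel3.\_\) and \(chisel3.util.\_\) are assumed to be in scope): two adders produce both candidate differences in parallel, and the sign of one of them selects the non-negative normalized exponent difference.
```scala
// Far-path Step 1 sketch: compute Ea - Eb and Eb - Ea in parallel (one guard bit each),
// then use the MSB (sign) of Ea - Eb to pick the non-negative exponent difference.
def expDiffParallel(efpA: UInt, efpB: UInt): (Bool, UInt) = {
  val w      = efpA.getWidth
  val aSubB  = Cat(0.U(1.W), efpA) - Cat(0.U(1.W), efpB) // Ea - Eb
  val bSubA  = Cat(0.U(1.W), efpB) - Cat(0.U(1.W), efpA) // Eb - Ea, computed in parallel
  val aLarge = !aSubB(w)                                  // MSB clear means Ea >= Eb
  (aLarge, Mux(aLarge, aSubB(w - 1, 0), bSubA(w - 1, 0)))
}
```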
Step 2: Based on the exponent magnitude comparison from Step 1, and in parallel with calculating the normalized exponent difference in Step 1, select the significand of the operand with the larger exponent and the significand of the operand with the smaller exponent. Also, select the larger exponent \(EA\). In an effective subtraction, \(EA\) is decremented by \(1\) (at this point, \(EA\) cannot be \(0\), because if it were, it would be the \(close\) path). The purpose of this is to adjust the value range of the significand after subtraction, unifying it with the effective addition value range to facilitate the selection of the final result later. The value range of the result of significand addition or subtraction after adjustment is between \([1,4)\), divided into two cases: between \([1,2)\) and between \([2,4)\).
Step 3: Right-shift the smaller significand. This right shift is handled in two ways: in effective subtraction, the smaller significand is first inverted and then arithmetically right-shifted, which saves some time compared to right-shifting then inverting; in effective addition, it is logically right-shifted. To save shifter stages, when the high bits of the normalized exponent difference are all \(0\), the low bits of the normalized exponent difference (the specific number of bits depends on the significand width) are used for the right shift. When the high bits of the normalized exponent difference are not all \(0\), the right-shift result is \(0\). Here, the results from the two normalized exponent difference adders in Step 1 are used, and the lowest bit is used first for the right shift (because the lowest bit of the addition result is obtained first). Specifically: if \(fp\_a\) has the larger exponent, only the significand of \(fp\_b\) is right-shifted, and the shift amount is the normalized exponent of \(fp\_a\) minus the normalized exponent of \(fp\_b\); if \(fp\_b\) has the larger exponent, only the significand of \(fp\_a\) is right-shifted, and the shift amount is the normalized exponent of \(fp\_b\) minus the normalized exponent of \(fp\_a\). Then, based on the exponent magnitude relationship and the value of the normalized exponent difference, the final right-shifted significand is selected, and the \(grs\) (\(guard\), \(round\), \(sticky\)) bits after the right shift are calculated. Here, to correctly round for the two cases in Step 2, two sets of \(grs\) bits need to be calculated for when the significand addition/subtraction result is between \([1,2)\) and between \([2,4)\).
Step 4: Perform significand addition. Since the smaller significand was already inverted before the right shift in effective subtraction, let the larger significand be \(A\) and the smaller right-shifted significand be \(B\). Two significand adders compute \(A+B\) and \(A+B+2\), respectively. The final rounded result is selected from the results of these two adders.
Step 5: Generate the final result. Based on whether the result of \(A+B\) is between \([1,2)\) (Case 1) or between \([2,4)\) (Case 2), and using the two sets of \(grs\) calculated during the previous right shift together with the rounding mode, determine the conditions for selecting between the two significand adders for Case 1 and Case 2 separately. Finally, use a one-hot 4-to-1 multiplexer to select the mantissa result. The exponent result is \(EA\) (Case 1, when rounding does not carry the significand up to \(2\)) or \(EA+1\) (Case 2, or Case 1 when rounding carries the significand up to \(2\)). Note whether the exponent overflows after rounding; the final result is selected between the overflow result and the normal computation result based on this overflow. The exception flags generated in the \(far\) path can only be overflow and inexact.
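The one-hot selection in Step 5 can be expressed with \(Chisel\)'s \(Mux1H\); the sketch below uses hypothetical names for the four candidates (from the \(A+B\) and \(A+B+2\) adders, each interpreted for the \([1,2)\) and \([2,4)\) cases) and assumes exactly one select signal is asserted.
```scala
// One-hot 4-to-1 mantissa selection for the far path (illustrative names only).
def selectFarSignificand(selCase1AB: Bool, sigCase1AB: UInt,
                         selCase1AB2: Bool, sigCase1AB2: UInt,
                         selCase2AB: Bool, sigCase2AB: UInt,
                         selCase2AB2: Bool, sigCase2AB2: UInt): UInt =
  Mux1H(Seq(
    selCase1AB  -> sigCase1AB,   // result in [1,2), A+B already correctly rounded
    selCase1AB2 -> sigCase1AB2,  // result in [1,2), round-up taken from A+B+2
    selCase2AB  -> sigCase2AB,   // result in [2,4), A+B already correctly rounded
    selCase2AB2 -> sigCase2AB2   // result in [2,4), round-up taken from A+B+2
  ))
```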
\(Close\) Path
The \(close\) path is always for effective subtraction and \(d \leq 1\), further divided into \(d=0\) or \(d=1\). The algorithm is shown in the figure, with specific steps as follows:
Step 1: Perform four sets of significand subtractions in parallel. Combining \(d=0\) (either the \(fp\_a\) or the \(fp\_b\) mantissa may be larger) with \(d=1\) (either the \(fp\_a\) or the \(fp\_b\) normalized exponent may be larger) gives four cases of effective subtraction. First subtractor: \(fp\_a\) significand \(-\) \(fp\_b\) significand; Second subtractor: \(fp\_b\) significand \(-\) \(fp\_a\) significand; Third subtractor: \(fp\_a\) significand \(\times 2\) \(-\) \(fp\_b\) significand; Fourth subtractor: \(fp\_b\) significand \(\times 2\) \(-\) \(fp\_a\) significand. Simultaneously, calculate the \(grs\) bits based on the exponent magnitude relationship: when \(d=0\), all \(grs\) bits are \(0\); when \(d=1\), only \(g\) might be non-zero. These four adders alone cannot produce all rounding results, so a fifth, slower adder is added: significand of the operand with the larger exponent \(-\) (significand of the operand with the smaller exponent right-shifted by one bit).
Step 2: Determine the four conditions for selecting from the four sets of significand subtractions. This is determined based on the value of \(d\), the most significant bit of the adder result, \(grs\), and the rounding mode. After selecting a subtraction result from the four adders, the result needs \(LZD\) \(+\) left shift. Here, pay attention to the value of the larger exponent \(EA\). The left shift is controlled by both \(LZD\) and \(EA\). A \(mask\) value is generated based on \(EA\) (same width as the subtraction result, but at most one bit is \(1\)). This mask is ORed with the subtraction result before performing \(LZD +\) left shift.
Step 3: Determine the condition for selecting the fifth subtractor. When the result from the fifth subtractor is selected, no left shift is needed, so a slower adder is used. The final mantissa result can then be selected.
Step 4: Exponent result and sign bit result. The exponent result needs to be \(EA\) minus the \(LZD\) value from Step 2. If the fifth subtractor's result is chosen as the mantissa result, the exponent remains unchanged. When \(d=1\), the sign bit is the sign of the operand with the larger exponent. When \(d=0\), the sign bit is selected based on the mantissa magnitudes. Note that if the result is \(0\) and rounded downwards, the sign bit is \(1\).
Vector Floating-Point Addition Algorithm
The vector floating-point adder has an output signal width of \(64\) bits, supports mixed precision, and supports \(widening\) instructions. It must support the following data format computations:
(1) \(1 \ f64 = f64 + f64\);
(2) \(1 \ f64 = f64 + f32\);
(3) \(1 \ f64 = f32 + f32\);
(4) \(2 \ f32 = f32 + f32\);
(5) \(2 \ f32 = f32 + f16\);
(6) \(2 \ f32 = f16 + f16\);
(7) \(4 \ f16 = f16 + f16\).
Module Partitioning
The computation approach is to use one module to compute the first three formats, as their output results are all \(64\)-bit. The single-precision floating-point adder for \(f64 = f64 + f64\) is reused to enable computation of \(f64 = f64 + f32\) and \(f64 = f32 + f32\). This document proposes a fast data format conversion algorithm. After converting \(f32\) operands to \(f64\), the \(f64 = f64 + f64\) computation can be performed, yielding an \(f64\) result format.
The same approach is taken for computation formats where the output result is \(f32\). Since the timing pressure for \(f32\) is not as high, one \(f16 = f16 + f16\) computation is also integrated into the module that produces \(f32\) results to save area, enabling it to support:
(1) \(1 \ f32 = f32 + f32\);
(2) \(1 \ f32 = f32 + f16\);
(3) \(1 \ f32 = f16 + f16\);
(4) \(1 \ f16 = f16 + f16\).
Obviously, this module needs to be instantiated twice. Finally, \(2 \ f16 = f16 + f16\) computations are still needed. Two single-precision floating-point adders that only compute \(f16 = f16 + f16\) are instantiated separately. In total, four modules implement all vector addition computation formats.
Fast Format Conversion Algorithm
Let's use the conversion from \(f16\) to \(f32\) as an example to introduce the fast format conversion algorithm.
When an \(f16\) is a normalized number, its conversion to \(f32\) will also be a normalized number. The \(f16\) exponent needs to be biased to the \(f32\) exponent. Since the \(f32\) exponent range is larger, there's no concern about out-of-range issues after exponent conversion. Additionally, the \(f16\) significand is \(10\) bits, and the \(f32\) significand is \(23\) bits. Simply appending \(13\) zeros to the \(f16\) significand yields the \(f32\) significand. This is a low-precision to high-precision conversion, so the result is definitely exact.
For the exponent of a normalized \(f16\) (width \(5\) bits), the actual exponent is \(Ereal = Ef16 - 15\). For the exponent of a normalized \(f32\) (width \(8\) bits), \(Ereal = Ef32 - 127\). So, using \(Ereal\) to convert \(Ef16\) to \(Ef32\): \(Ef16 - 15 = Ef32 - 127\), which gives \(Ef32 = Ef16 - 15 + 127\), so \(Ef32 = Ef16 + 112\). The \(8\)-bit binary representation of \(112\) is \(01110000\). Calculating \(Ef16 + 112\) requires an adder for a variable plus a constant. This adder can be avoided by observing a pattern:
When the MSB of \(Ef16\) is \(0\), \(Ef16 + 112 = (0111, Ef16(3,0))\)
When the MSB of \(Ef16\) is \(1\), \(Ef16 + 112 = (1000, Ef16(3,0))\)
Using this pattern, a \(Mux\) can quickly convert \(Ef16\) to \(Ef32\). So, converting a normalized \(f16\) to \(f32\) is fast: one \(Mux\) for the exponent, append \(0\)s for the significand, and the sign bit remains unchanged. The difficulty lies when \(f16\) is a denormalized number. In this case, the \(f16\) exponent is all \(0\)s, and the number of leading zeros in the significand determines the \(f32\) exponent. When the \(f16\) exponent bits are all zero and only the LSB of the significand is \(1\), the converted \(f32\) exponent is minimized to \(-15-9=-24\), which is still within the \(f32\) normalized range. Therefore, for \(f16\) denormalized numbers, leading zero detection (\(lzd\)) and left shift of the significand are required.
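For the normalized case described above, a minimal sketch of the \(Mux\)-based exponent conversion is as follows (hypothetical helper name; \(chisel3.\_\) and \(chisel3.util.\_\) assumed in scope):
```scala
// Converts the biased 5-bit exponent of a normalized f16 into the biased 8-bit f32
// exponent, i.e. Ef16 + 112, using a single Mux instead of an adder.
def f16ExpToF32Exp(ef16: UInt): UInt =
  Mux(ef16(4),
    Cat("b1000".U(4.W), ef16(3, 0)),  // MSB of Ef16 is 1: result is 1000 ## Ef16(3,0)
    Cat("b0111".U(4.W), ef16(3, 0)))  // MSB of Ef16 is 0: result is 0111 ## Ef16(3,0)
```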
\(Chisel\)'s built-in priority encoder can implement the \(lzd\) function. Our tests show it synthesizes better than traditional \(lzd\) written using a binary search method. The syntax is: \(PriorityEncoder(Reverse(Cat(in,1.U)))\). For an \(in\) of width \(5\), the generated \(Verilog\) code is as follows:
module LZDPriorityEncoder(
  input        clock,
  input        reset,
  input  [4:0] in,
  output [2:0] out
);
  wire [5:0] _out_T = {in,1'h1};
  wire [5:0] _out_T_15 = {_out_T[0],_out_T[1],_out_T[2],_out_T[3],_out_T[4],_out_T[5]};
  wire [2:0] _out_T_22 = _out_T_15[4] ? 3'h4 : 3'h5;
  wire [2:0] _out_T_23 = _out_T_15[3] ? 3'h3 : _out_T_22;
  wire [2:0] _out_T_24 = _out_T_15[2] ? 3'h2 : _out_T_23;
  wire [2:0] _out_T_25 = _out_T_15[1] ? 3'h1 : _out_T_24;
  assign out = _out_T_15[0] ? 3'h0 : _out_T_25;
endmodule
Although this code appears to use many cascaded \(Mux\)es, synthesizers handle this type of code well for timing. Inspired by this, this document designs a new priority-based left shift algorithm to accelerate \(lzd +\) left shift. Its \(Chisel\) code is as follows:
def shiftLeftPriorityWithF32EXPResult(srcValue: UInt, priorityShiftValue: UInt): UInt = {
  val width = srcValue.getWidth
  val lzdWidth = srcValue.getWidth.U.getWidth
  def do_shiftLeftPriority(srcValue: UInt, priorityShiftValue: UInt, i: Int): UInt = {
    if (i == 0) Cat(
      Mux(
        priorityShiftValue(i),
        Cat(srcValue(0), 0.U((width - 1).W)),
        0.U(width.W)
      ),
      Mux(
        priorityShiftValue(i),
        "b01110000".U - (width - i - 1).U(8.W),
        "b01110000".U - (width - i).U(8.W)
      )
    )
    else Mux(
      priorityShiftValue(i),
      if (i == width - 1) Cat(srcValue(i, 0), "b01110000".U - (width - i - 1).U(8.W))
      else Cat(Cat(srcValue(i, 0), 0.U((width - 1 - i).W)), "b01110000".U - (width - i - 1).U(8.W)),
      do_shiftLeftPriority(srcValue = srcValue, priorityShiftValue = priorityShiftValue, i = i - 1)
    )
  }
  do_shiftLeftPriority(srcValue = srcValue, priorityShiftValue = priorityShiftValue, i = width - 1)
}
Both \(srcValue\) and \(priorityShiftValue\) are passed the \(f16\) significand. Judgment starts from the MSB of the significand. If the MSB is \(1\), the original \(srcValue\) is returned along with the corresponding exponent (the exponent is selected from several constants and depends on the position of the first \(1\) in the significand). If the MSB is \(0\), the next bit is checked. If it is \(1\), \(srcValue\) is left-shifted by one bit and then returned (a true left shift is not needed here because the high bits after the shift do not need to be preserved, so a truncation operation followed by zero-padding is sufficient), along with the corresponding exponent, and so on. In this way, a priority left shifter performs both \(lzd +\) left shift operations simultaneously and also generates the corresponding \(Ef32\), eliminating the need to calculate the \(Ef32\) exponent based on \(lzd\). This achieves a fast algorithm for converting \(f16\) denormalized numbers to \(f32\). The algorithm for converting \(f32\) to \(f64\) is similar and will not be detailed further.
Vector Floating-Point Fused Multiply-Add Algorithm
Floating-point fused multiply-add (FMA) computes \(fp\_a \times fp\_b + fp\_c\). The intermediate product \(fp\_a \times fp\_b\) is computed as if with unlimited range and precision, without rounding; the final result is rounded only once to the target format. FMA is typically implemented as a pipeline, with main steps including multiplication, addition, normalization shift, and rounding. This section introduces the vector floating-point fused multiply-add algorithm, whose functionalities include:
(1) \(1 \ fp64 = fp64 \times fp64 + fp64\);
(2) \(2 \ fp32 = fp32 \times fp32 + fp32\);
(3) \(4 \ fp16 = fp16 \times fp16 + fp16\);
(4) \(2 \ fp32 = fp16 \times fp16 + fp32\);
(5) \(1 \ fp64 = fp32 \times fp32 + fp64\).
In (1), (2), and (3), the source operands and destination operand are of the same floating-point format. In (4) and (5), the two multiplicands have the same width, and the addend and the result have the same width, which is twice the width of the multiplicands.
Scalar Single-Precision Format Algorithm
The computation process first calculates the unrounded product of two floating-point numbers, and then adds this unrounded product to a third number. The algorithm flowchart is shown in the figure. The computation process is expressed by the formula \(fp\_result = fp\_a \times fp\_b + fp\_c\). In the figure, \(Sa, Sb, Sc\) are the significands of \(fp\_a, fp\_b, fp\_c\) respectively, and \(Ea, Eb, Ec\) are their exponents.
For ease of description below, some parameters are defined. Their meanings and values are shown in the table:
Parameter | \(f16\) | \(f32\) | \(f64\) | Meaning |
---|---|---|---|---|
\(significandWidth\) | \(11\) | \(24\) | \(53\) | Significand width |
\(exponentWidth\) | \(5\) | \(8\) | \(11\) | Exponent width |
\(rshiftBasic\) | \(14\) | \(27\) | \(56\) | Number of bits \(fp\_c\) significand needs to be right-shifted for alignment with product significand |
\(rshiftMax\) | \(37\) | \(76\) | \(163\) | Maximum right-shift bits for \(fp\_c\) significand (if exceeded, \(g\) and \(r\) are \(0\)) |
Unsigned Integer Multiplication
The rules for multiplying two floating-point numbers are: multiply the sign bits, add the exponents (not simple addition, bias needs to be considered), and multiply the significands (including the implicit bit and mantissa bits). Significand multiplication is actually fixed-point multiplication, which has the same principle as unsigned integer multiplication.
Binary long multiplication is the original multiplication algorithm. The long method for \(n\)-bit \(C=A \times B\) is shown in the figure. This process generates \(n\) partial products, which are then added with appropriate shifts.
The long multiplication algorithm has a large delay. Optimizations for multiplication mainly focus on two aspects: first, reducing the number of partial products (e.g., \(Booth\) encoding), and second, reducing the delay caused by adders (e.g., \(CSA\) compression).
When multiplying two floating-point numbers, their significands are multiplied. Significands are unsigned, so unsigned integer multiplication can be used to multiply two significands. There are many algorithms for implementing unsigned integer multiplication. Below, three algorithms are compared.
Method 1: Directly use the multiplication symbol \(\times\), letting the synthesis tool decide.
Method 2: Use a long method similar to manual decimal multiplication. Multiplying two \(n\)-bit numbers generates \(n\) partial products. These \(n\) partial products are then compressed using \(CSA\) (discussed later) into the sum of two numbers.
Method 3: Use \(Booth\) encoding to generate \(\lceil(n+1)/2\rceil\) partial products, then compress them using \(CSA\) into the sum of two numbers.
The data in the table shows the results of multiplying two \(53\)-bit unsigned integers (for \(f64\)) using the \(TSMC7nm\) process library. The target frequency is \(3GHz\), with a theoretical cycle time of \(333.33ps\). However, after accounting for clock uncertainty and process corner variations, design margin must be left for the backend, making the available time per cycle approximately \(280ps\). It is therefore evident that the multiplication cannot be completed within one cycle. In practice, calculating the implicit bit also takes some time, making single-cycle \(53\)-bit multiplication even less feasible. Thus, although Method 1 has smaller area and shorter delay, it cannot be pipelined; we can only choose Method 2 or Method 3. Method 3 has shorter delay and smaller area than Method 2, so Method 3 is chosen as the implementation for unsigned integer multiplication.
Algorithm | Delay (\(ps\)) | Area (\(um²\)) | Pipelined? |
---|---|---|---|
Method 1 | \(285.15\) | \(1458.95\) | No |
Method 2 | \(320.41\) | \(2426.34\) | Yes |
Method 3 | \(302.19\) | \(2042.46\) | Yes |
\(Booth\) Encoding
The purpose of \(Booth\) encoding is to reduce the number of partial products in a multiplier. Taking binary unsigned integer multiplication \(C=A*B\) as an example, we derive the \(Booth\) encoding algorithm.
The following equation is the general expression for an \(n\)-bit binary unsigned integer. To facilitate subsequent transformations, a \(0\) is added at both the head and the tail, without changing its value:
\[B = \sum_{i=0}^{n-1} B_i 2^{i}\]
with \(B_n = 0\) and \(B_{-1} = 0\) appended.
An equivalent transformation rewrites this as
\[B = \sum_{i=0}^{n} \left(B_{i-1} - B_i\right) 2^{i},\]
and in this form adjacent \(1\)s cancel out to \(0\). If there is a run of consecutive \(1\)s, the lowest \(1\) becomes \(-1\), the bit above the highest \(1\) changes from \(0\) to \(1\), and all \(1\)s in between become \(0\). This is the Booth transformation. It simplifies runs of three or more consecutive \(1\)s; the longer the run of \(1\)s, the greater the simplification. However, this transformation alone does not optimize the hardware circuit, because it cannot guarantee that any particular partial product is always \(0\). Therefore, in circuit design, an improved Booth encoding method is generally used, which can genuinely reduce the number of partial products.
Another equivalent transformation is performed, but this time with some constraints on \(n\). Assume \(n\) is odd. A zero is still appended at the LSB, making the length even. Then, a zero is prepended before the MSB, making the total length \(n+2\). This is done to facilitate later derivation.
After the equivalent transformation, it can be found that the number of terms in the polynomial expression becomes \((n+1)/2\) (when \(n\) is odd). If \(n\) is even, a zero needs to be appended at the end, and two zeros prepended before the MSB, making the number of terms in the polynomial expression \(n/2+1\) (when \(n\) is even). Combining the odd and even cases, the number of terms in the polynomial expression is \(\lceil (n+1)/2 \rceil\). Starting from the LSB of the original binary number, groups of three bits are formed (the LSB of the first group needs an appended \(0\); for the MSB, if \(n\) is odd, one \(0\) is prepended, if \(n\) is even, two \(0\)s are prepended, the purpose being to make the length after padding odd). Adjacent groups overlap by one bit (the MSB of the lower group overlaps with the LSB of the higher group), forming new polynomial factors. This is the improved Booth encoding method.
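Written out, this grouping is the standard radix-4 Booth recoding identity (with \(B_{-1} = 0\) and the zero padding described above), which is consistent with the \(PP_i\) truth table that follows:
\[B = \sum_{i=0}^{\lceil (n+1)/2 \rceil - 1} \left(B_{2i-1} + B_{2i} - 2B_{2i+1}\right) 4^{i}\]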
When multiplying two binary numbers, if the multiplier is encoded using the improved Booth encoding, the number of partial products can be reduced by roughly half. Let the multiplicand be \(A\) and the multiplier be \(B\). Let \(B_{2i+1}\), \(B_{2i}\), \(B_{2i-1}\) be three consecutive bits of \(B\), where \(i\) is a natural number. \(PP_i\) is the partial product corresponding to each value of \(i\). After applying the improved Booth transformation to \(B\) and then multiplying by \(A\), the \(Booth\) encoding and \(PP\) truth table are as shown in the table.
\(B_{2i+1}\) | \(B_{2i}\) | \(B_{2i-1}\) | \(PP_i\) |
---|---|---|---|
\(0\) | \(0\) | \(0\) | \(0\) |
\(0\) | \(0\) | \(1\) | \(A\) |
\(0\) | \(1\) | \(0\) | \(A\) |
\(0\) | \(1\) | \(1\) | \(2A\) |
\(1\) | \(0\) | \(0\) | \(-2A\) |
\(1\) | \(0\) | \(1\) | \(-A\) |
\(1\) | \(1\) | \(0\) | \(-A\) |
\(1\) | \(1\) | \(1\) | \(0\) |
Based on the value of every three consecutive bits of the multiplier, the corresponding partial product is determined. This reduces the number of partial products by half, as if the multiplier is treated as a radix-4 number for multiplication, hence it is called radix-4 Booth encoding. Multiplication using radix-4 Booth encoding offers significant optimization compared to traditional multiplication and is easy to implement, meeting most application requirements.
During \(Booth\) encoding, five types of partial products need to be calculated: \(0\), \(A\), \(2A\), \(-A\), \(-2A\). \(0\) and \(A\) require no calculation. \(2A\) is a left shift by one bit. \(-A\) and \(-2A\) require a complement and add one operation. This document introduces a fast algorithm for handling complement and add one.
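A minimal \(Chisel\) sketch of one partial-product selector is shown below (hypothetical function name; \(chisel3.\_\) and \(chisel3.util.\_\) assumed in scope). For the negative encodings it produces only the bitwise complement \(\sim A\) or \(\sim 2A\); the pending \(+1\) is deferred to the next partial product, as described below.
```scala
// Radix-4 Booth partial-product selection for one group of three multiplier bits
// b3 = Cat(B(2i+1), B(2i), B(2i-1)); the result is one bit wider than the multiplicand.
def boothPP(a: UInt, b3: UInt): UInt = {
  val w     = a.getWidth
  val posA  = Cat(0.U(1.W), a)   // +A,  zero-extended to w+1 bits
  val pos2A = Cat(a, 0.U(1.W))   // +2A, i.e. A shifted left by one bit
  MuxCase(0.U((w + 1).W), Seq(   // encodings 000 and 111 select 0
    (b3 === "b001".U || b3 === "b010".U) -> posA,
    (b3 === "b011".U)                    -> pos2A,
    (b3 === "b100".U)                    -> (~pos2A).asUInt,  // ~2A, +1 deferred
    (b3 === "b101".U || b3 === "b110".U) -> (~posA).asUInt    // ~A,  +1 deferred
  ))
}
```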
To explain the principle simply, let's assume we are calculating for \(f16\). The significand is \(11\) bits, which will generate \(6\) partial products. The width of the partial products is \(22\) bits, as shown in the figure. The colored positions in the figure have a width of \(12\) bits, indicating that \(A\) might be multiplied by \(0\), \(1\), or \(2\). Because the three-bit encoding of the last partial product is \(0xx\), its value cannot be negative. Let's assume all partial products except the last one are negative. Then, each of these partial products needs to be complemented and incremented by one. The colored part represents the result after only complementing. We place the 'add one' for the current partial product at the corresponding position of the next partial product. This ensures the sum of partial products remains unchanged and avoids a carry chain problem when adding one to the current partial product. The last partial product is non-negative, so its 'add one' does not need processing.
The \(1\)s in the figure above can be summed up and simplified to get the result in the figure below.
If the actual value of a partial product is positive, the above result needs to be corrected, which means adding one at the bit position to the left of the colored part, and changing the 'add one' at the tail of the next partial product to zero. As shown in the figure, \(Si\) (\(i\) starts from \(0\)) represents the sign bit of the \(i\)-th partial product. This becomes a general form, where the colored positions only calculate \(0, A, 2A, \sim A, \sim 2A\), speeding up partial product generation.
There is one more point to note here. The sum of partial products is the result of the multiplication, but carries may also occur when summing partial products. This carry is meaningless for the multiplication itself, but when the product is added to a wider number, it will cause an incorrect carry. The correction method is to add an extra bit to the MSB of the partial product, as shown in the figure.
This ensures that the carry is correct after all partial products are added together. This concludes the introduction to \(Booth\) encoding. Note that this was explained using an \(11\)-bit multiplication as an example. The significand bit-widths for \(f16\) and \(f64\) are odd, but for \(f32\) it is even. Zero-padding at the MSB needs slight differentiation. Other processes are similar and will not be elaborated further.
\(CSA\) Compression
\(Carry\)-\(Save\)-\(Adder\) is an adder that retains carries. Its function is to compress \(n\) addends into \(m\) addends (\(m<n\)) through some logic, typically \(3\_2\) compression and \(4\_2\) compression.
Suppose we are calculating the sum of two binary numbers \(A+B\). Their sum and carry truth table is shown below. In the table, \(A[i]+B[i]\) is the decimal result, which is also the number of \(1\)s in \(A[i], B[i]\):
\(A[i]\) | \(B[i]\) | \(A[i] + B[i]\) | \(Sum[i]\) | \(Car[i]\) |
---|---|---|---|---|
\(0\) | \(0\) | \(0\) | \(0\) | \(0\) |
\(0\) | \(1\) | \(1\) | \(1\) | \(0\) |
\(1\) | \(0\) | \(1\) | \(1\) | \(0\) |
\(1\) | \(1\) | \(2\) | \(0\) | \(1\) |
Simplified into logical expressions:
\(Sum = A \oplus B\)
\(Car = A \cdot B\)
\(Result = A+B = Sum + (Car \ll 1)\)
When adding three numbers, the sum is the XOR of two numbers, and the carry is when both numbers are \(1\). \((Car \ll 1)\) is because the carry from the current bit goes to the next bit. This is just a derivation for later understanding; in practice, generating sum and carry for two addends does not accelerate addition.
Suppose we want to calculate the sum of three numbers \(A+B+C\). The key to \(CSA\) is to generate sum and carry. The truth table is as follows:
\(A[i]\) | \(B[i]\) | \(C[i]\) | \(A[i] + B[i] + C[i]\) | \(Sum[i]\) | \(Car[i]\) |
---|---|---|---|---|---|
\(0\) | \(0\) | \(0\) | \(0\) | \(0\) | \(0\) |
\(0\) | \(0\) | \(1\) | \(1\) | \(1\) | \(0\) |
\(0\) | \(1\) | \(0\) | \(1\) | \(1\) | \(0\) |
\(0\) | \(1\) | \(1\) | \(2\) | \(0\) | \(1\) |
\(1\) | \(0\) | \(0\) | \(1\) | \(1\) | \(0\) |
\(1\) | \(0\) | \(1\) | \(2\) | \(0\) | \(1\) |
\(1\) | \(1\) | \(0\) | \(2\) | \(0\) | \(1\) |
\(1\) | \(1\) | \(1\) | \(3\) | \(1\) | \(1\) |
Some patterns can be observed from the table above. The generation of \(Sum[i]\) and \(Car[i]\) actually only depends on the sum of \(A[i]+B[i]+C[i]\), i.e., the number of \(1\)s in \(A[i], B[i], C[i]\). The simplified expressions are as follows:
\(Sum = A \oplus B \oplus C\)
\(Car = (A \cdot B) \lor (A \cdot C) \lor (B \cdot C)\)
\(Result = A+B+C = Sum + (Car \ll 1)\)
When adding three numbers, the sum is the XOR of the three numbers, and the carry occurs if at least two numbers are \(1\). \((Car \ll 1)\) is because the carry from the current bit goes to the next bit. This method saves a lot of time, especially for long bit widths, by reducing the addition of three numbers to the addition of two numbers with the delay of two XOR gates.
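These expressions translate directly into a \(CSA3\_2\) stage. The sketch below (hypothetical helper name, \(chisel3.\_\) assumed in scope) reduces three addends to a sum word and a carry word in two XOR-gate delays; the true sum \(Sum + (Car \ll 1)\) is produced by a normal adder only at the very end of the compression tree.
```scala
// 3:2 carry-save compression of three equal-width addends.
def csa3to2(a: UInt, b: UInt, c: UInt): (UInt, UInt) = {
  val sum = a ^ b ^ c                    // per-bit sum
  val car = (a & b) | (a & c) | (b & c)  // per-bit carry, to be shifted left by one
  (sum, car)
}
```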
Adding four numbers is more complex because if all four numbers are \(1\), the sum is \(4\), requiring a carry of \(2\). We call one carry \(Cout\) and the other \(Car\). Then, the \(Cout\) generated from the current four-bit addition is passed to the next stage as \(Cin\). \(Cin\) and the four numbers become five numbers to be added. After this operation, each bit input becomes five numbers: \(A[i], B[i], C[i], D[i], Cin[i]\). The output is three numbers: \(Sum[i], Cout[i], Car[i]\). The LSB \(Cin[0]\) is \(0\), and \(Cin[i]\) for other bits is \(Cout[i-1]\) from the lower bit, as shown in the table.
\(A[i]+B[i]+C[i]+D[i]+Cin[i]\) | \(Sum[i]\) | \(Cout[i]\) | \(Car[i]\) |
---|---|---|---|
\(0\) | \(0\) | \(0\) | \(0\) |
\(1\) | \(1\) | \(0\) | \(0\) |
\(2\) | \(0\) | \(1/0\) | \(0/1\) |
\(3\) | \(1\) | \(1/0\) | \(0/1\) |
\(4\) | \(0\) | \(1\) | \(1\) |
\(5\) | \(1\) | \(1\) | \(1\) |
There are many ways to simplify this truth table. One feasible method is described below. The value of \(Sum[i]\) is easily derived as the XOR of the five inputs: \(Sum[i] = A[i] \oplus B[i] \oplus C[i] \oplus D[i] \oplus Cin[i]\). \(Car[i]\) and \(Cout[i]\) are more complex. We stipulate that \(Cout[i]\) is generated only by the first three numbers, i.e., if the sum of the first three numbers is greater than \(1\), \(Cout[i] = 1\). The truth table for \(Cout[i]\) is:
\(A[i]+B[i]+C[i]\) | \(Cout[i]\) |
---|---|
\(0\) | \(0\) |
\(1\) | \(0\) |
\(2\) | \(1\) |
\(3\) | \(1\) |
\(Cout[i]\) can be expressed as: \(Cout[i] = (A[i] \oplus B[i]) ? C[i] : A[i]\). \(Car[i]\) is then generated by \(D[i]\) and \(Cin[i]\). The truth table for \(Car[i]\) is:
\(A[i]+B[i]+C[i]+D[i]\) | \(Car[i]\) |
---|---|
\(0\) | \(D[i]\) |
\(1\) | \(Cin[i]\) |
\(2\) | \(D[i]\) |
\(3\) | \(Cin[i]\) |
\(Car[i]\) can be expressed as: \(Car[i] = (A[i] \oplus B[i] \oplus C[i] \oplus D[i]) ? Cin[i] : D[i]\). Specifically, when \((A[i] \oplus B[i] \oplus C[i] \oplus D[i]) = 1\), then \(A[i]+B[i]+C[i]+D[i] = 1 \text{ or } 3\). In this case, if \(Cin[i] = 1\), a carry is generated. When \((A[i] \oplus B[i] \oplus C[i] \oplus D[i]) = 0\), then \(A[i]+B[i]+C[i]+D[i] = 0 \text{ or } 4\). In this case, if \(D[i] = 0\), it means \(A[i]+B[i]+C[i]+D[i] = 0\), and adding \(Cin\) does not generate a carry. If \(D[i] = 1\), it means \(A[i]+B[i]+C[i]+D[i] = 4\), and adding \(Cin\) will also generate a carry. Therefore, combining the above derivation, the expressions for \(CSA4\_2\) are as follows:
\(Sum[i] = A[i] \oplus B[i] \oplus C[i] \oplus D[i] \oplus Cin[i]\), where \(Cin[i] = Cout[i-1]\) and \(Cin[0] = 0\).
\(Cout[i] = (A[i] \oplus B[i]) ? C[i] : A[i]\)
\(Car[i] = (A[i] \oplus B[i] \oplus C[i] \oplus D[i]) ? Cin[i] : D[i]\)
\(Result = A+B+C+D = Sum + (Car \ll 1)\)
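Following the derivation above, one bit slice of a \(CSA4\_2\) can be sketched as follows (hypothetical helper name; \(cin\) is the \(cout\) of the neighboring lower bit, with \(cin(0) = 0\)):
```scala
// One CSA4_2 bit slice: five inputs (a, b, c, d, cin) reduce to sum, cout and car.
def csa4to2Bit(a: Bool, b: Bool, c: Bool, d: Bool, cin: Bool): (Bool, Bool, Bool) = {
  val sum  = a ^ b ^ c ^ d ^ cin
  val cout = Mux(a ^ b, c, a)            // generated from the first three inputs only
  val car  = Mux(a ^ b ^ c ^ d, cin, d)  // covers the carry contributed by d and cin
  (sum, cout, car)
}
```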
Using the \(TSMC7nm\) process library, XOR gates with different inputs, \(CSA3\_2\), and \(CSA4\_2\) were synthesized and their delay and area compared. The synthesis results for XOR gates with different inputs are shown in the table.
\(106bit\) | Delay (\(ps\)) | Area (\(um²\)) |
---|---|---|
\(A \oplus B\) | \(13.74\) | \(38.66880\) |
\(A \oplus B \oplus C\) | \(23.01\) | \(63.09120\) |
\(A \oplus B \oplus C \oplus D\) | \(24.69\) | \(87.51360\) |
\(A \oplus B \oplus C \oplus D \oplus E\) | \(37.21\) | \(99.72480\) |
The synthesis results for \(CSA3\_2\) and \(CSA4\_2\) are shown in the table.
\(106bit\) | Delay (\(ps\)) | Area (\(um²\)) |
---|---|---|
\(CSA3\_2\) | \(23.23\) | \(104.42880\) |
\(CSA4\_2\) | \(40.63\) | \(237.86881\) |
It can be seen that although \(CSA4\_2\) theoretically has a delay of three XOR gates and \(CSA3\_2\) theoretically has a delay of two XOR gates, after actual physical implementation, \(CSA4\_2\) is only slightly faster than two levels of \(CSA3\_2\). Therefore, \(CSA3\_2\) should be used as much as possible, unless one level of \(CSA4\_2\) can replace two levels of \(CSA3\_2\), such as in \(4\to2\) compression or \(8\to2\) compression.
\(CSAn\_2\)
The number of partial products generated by \(Booth\) encoding for the multiplication of two \(n\)-bit unsigned integers is \(\lceil \frac{n+1}{2} \rceil\). Because the compressed partial products need to be added to a number with a wider bit width, the partial product bit width is extended by one bit to ensure a correct carry. The number and bit width of partial products for different data formats are shown in the table.
Data Format | Significand Bits | No. of Partial Products | Partial Product Bit Width |
---|---|---|---|
\(fp16\) | \(11\) | \(6\) | \(12\) |
\(fp32\) | \(24\) | \(13\) | \(25\) |
\(fp64\) | \(53\) | \(27\) | \(54\) |
Following the principle of using \(CSA3\_2\) as much as possible, unless one level of \(CSA4\_2\) can replace two levels of \(CSA3\_2\), the number of \(CSA3\_2\) and \(CSA4\_2\) stages used for different data formats is shown in the table.
Data Format | No. of \(CSA3\_2\) Stages | No. of \(CSA4\_2\) Stages | Process (\(->\) denotes \(CSA3\_2\), \(-->\) denotes \(CSA4\_2\)) |
---|---|---|---|
\(fp16\) | \(1\) | \(1\) | \(6->4-->2\) |
\(fp32\) | \(3\) | \(1\) | \(13->9->6->4-->2\) |
\(fp64\) | \(3\) | \(2\) | \(27->18->12->8-->4-->2\) |
Exponent Processing and Right Shift
If following the conventional approach, the magnitude relationship between the exponent of the product of \(fp\_a\) and \(fp\_b\) and the exponent of \(fp\_c\) is unknown. Then, similar to floating-point addition, the one with the smaller exponent needs to be right-shifted. This means both the significand of the product of \(fp\_a\) and \(fp\_b\) and the significand of \(fp\_c\) might be right-shifted. Such a design would use two shifters, increasing area, and would require waiting for the product of \(fp\_a\) and \(fp\_b\) to be calculated before right-shifting its significand, increasing circuit delay. An algorithm is now introduced to avoid using two shifters and to allow parallel execution with the multiplication of \(fp\_a\) and \(fp\_b\), reducing circuit delay.
First, the exponent bits are treated as unsigned numbers, but there is an exponent bias between them and the true exponent. The \(denormal\) case must also be considered. Let \(E\_fix\) be the exponent bits after handling the \(denormal\) case, and let \(E\_bit\) be the original exponent bits. When all bits of \(E\_bit\) are \(0\), \(E\_fix = 1\); otherwise, \(E\_fix = E\_bit\).
\[E\_real = E\_fix - bias, \qquad bias = 2^{exponentWidth-1} - 1\]
In the above equation, the true exponent \(E\_real\) equals \(E\_fix\) minus a bias value \(bias\), where \(exponentWidth\) is the width of \(E\_bit\) and \(bias\) has the same width as \(E\_bit\), with the MSB \(0\) and all other bits \(1\). Without considering carries or borrows from significand multiplication, the true exponent result \(Eab\_real\) of the product \(fp\_a \times fp\_b\) is as follows:
\[Eab\_real = Ea\_real + Eb\_real = Ea\_fix + Eb\_fix - 2 \times bias\]
The formula for calculating the binary exponent result \(Eab\_bit\) of the product \(fp\_a \times fp\_b\) is shown below (in \(Chisel\)-style notation):
\[Eab\_bit = Cat(0,\ Ea\_fix \ +\& \ Eb\_fix) - bias\]
Here, the \(+\&\) operation extends the result of \(Ea\_fix+Eb\_fix\) by one bit to preserve the carry. The carry is preserved because a bias value will be subtracted later, and the result would be incorrect without preserving the carry. We also need to consider that subtracting the bias value might result in a negative number, so the result is extended by another bit, i.e., a \(0\) bit is concatenated at the MSB. Finally, the bias value \(bias\) is subtracted. This gives the binary exponent result \(Eab\_bit\) of the product \(fp\_a \times fp\_b\), without considering carries or borrows from significand multiplication. Then we construct an exponent \(Eab\) with the value shown below:
\[Eab = Eab\_bit + rshiftBasic\]
And we consider the binary exponent result of the product \(fp\_a \times fp\_b\) to be \(Eab\). To ensure that the significand of the lossless precision product \(fp\_a \times fp\_b\) is added to the significand of \(fp\_c\), we extend the bit widths of these two addends separately. For the significand of \(fp\_c\), it is extended to \(3 \times significandWidth+4\). The bit distribution is shown in the figure, where \(g0, r0, g1, r1\) are used to store the \(guard, round\) bits during the right shift:
As shown in the figure above, the \(fp\_c\) significand is \(significandWidth+2\) bits higher than the product of the \(fp\_a\) and \(fp\_b\) significands. Because the product result has two bits before the decimal point, aligning to a product result of \(1.xxx\) makes it \(significandWidth+3\) bits higher. This is why \(rshiftBasic = significandWidth+3\).
Let \(fp\_c\_significand\_cat0 = Cat(fp\_c\_significand,0.U(2 \times significandWidth+4))\), where \(fp\_c\_significand\) is the significand of \(fp\_c\). If \(Ec\_fix = Eab = Eab\_bit + rshiftBasic.S\), then \(fp\_c\_significand\_cat0\) is exactly \(significandWidth+3\) larger than \(Eab\_bit\), and \(fp\_c\_significand\_cat0\) does not need to be right-shifted for alignment. If \(Ec\_fix > Eab\), theoretically \(fp\_c\_significand\_cat0\) needs to be left-shifted, but because \(g0, g1\) serve as an interval and the low bits do not generate carries (low bits only affect the rounding result), \(fp\_c\_significand\_cat0\) does not actually need to be left-shifted. If \(Ec\_fix < Eab\), then \(fp\_c\_significand\_cat0\) needs to be right-shifted. The number of shift bits is \(rshift\_value = Eab - Cat(0.U,Ec\_fix).asSInt\). \(rshift\_value\) comes from the addition of multiple numbers, so the LSB is calculated first. Therefore, during the right shift, the LSB of \(rshift\_value\) should first be used as the select signal for a \(Mux\), and then the higher bits should be used sequentially as \(Mux\) select signals. The right shift process needs to calculate \(guard, round, sticky\) (abbreviated as \(grs\)). For \(guard\) and \(round\), these positions have already been reserved during bit width extension and do not require additional calculation. For \(sticky\), there are two methods: Method 1 is to extend the bit width further to store the bits shifted out, and then calculate \(sticky\) after the entire right shift is completed. Method 2 is to calculate \(sticky\) simultaneously during the right shift based on the value of the \(Mux\) select signal. Method 2 has less delay than Method 1. Below is the design code for Method 2:
/**
 * Shift right using Mux, starting with the LSB. Output width is srcValue + 1 (Sticky)
 */
def shiftRightWithMuxSticky(srcValue: UInt, shiftValue: UInt): UInt = {
  val vecLength = shiftValue.getWidth + 1
  val res_vec = Wire(Vec(vecLength, UInt(srcValue.getWidth.W)))
  val sticky_vec = Wire(Vec(vecLength, UInt(1.W)))
  res_vec(0) := srcValue
  sticky_vec(0) := 0.U
  for (i <- 0 until shiftValue.getWidth) {
    res_vec(i+1) := Mux(shiftValue(i), res_vec(i) >> (1<<i), res_vec(i))
    sticky_vec(i+1) := Mux(shiftValue(i), sticky_vec(i) | res_vec(i)((1<<i)-1, 0).orR, sticky_vec(i))
  }
  Cat(res_vec(vecLength-1), sticky_vec(vecLength-1))
}
There is another way to speed up the right shift. The bit width of \(rshift\_value\) is \(exponentWidth+1\), while the width of \(fp\_c\_significand\_cat0\) is \(3 \times significandWidth+4\), so \(rshift\_value\) may contain high bits that exceed the useful shift range. For example, if a \(5\)-bit number is used to right-shift a \(7\)-bit number, \(a(6,0) \gg b(4,0)\), the low three bits of \(b\) can already represent a shift of up to \(7\), which covers the full width of \(a\). So, if either of the two MSBs of \(b\) is non-zero, the right-shift result of \(a\) is zero, and the shift can be simplified to \(Mux(b(4,3).orR,\ 0.U,\ a(6,0) \gg b(2,0))\). The table below lists the \(rshift\_value\) bit widths used for the three floating-point data formats.
Data Format | \(fp\_c\_significand\_cat0\) Width | \(rshift\_value\) Width | Used Width |
---|---|---|---|
\(f16\) | \(37\) | \(6\) | \(6\) |
\(f32\) | \(76\) | \(9\) | \(7\) |
\(f64\) | \(163\) | \(12\) | \(8\) |
Based on the value of \(rshift\_value\), there are three cases: \(rshift\_value \le 0\), no right shift is needed, and the \(sticky\) result is \(0\); \(rshift\_value > rshiftMax\), the right-shift result is \(0\), and the \(sticky\) result is \(fp\_c\_significand\_cat0\) OR-reduced; \(0 < rshift\_value \le rshiftMax\), the right-shift result and \(sticky\) are calculated by \(shiftRightWithMuxSticky\).
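A sketch of this three-way case split, reusing \(shiftRightWithMuxSticky\) from above, is shown below; parameter names such as \(rshiftValueUsed\) (the truncated low shift bits) are illustrative, and the exact types are assumptions.
```scala
// Select between "no shift", "shifted out completely", and the Mux-based shift;
// the top bits of the result are the aligned significand, the LSB is the sticky bit.
def alignFpC(fpCSigCat0: UInt, rshiftValue: SInt, rshiftValueUsed: UInt, rshiftMax: Int): UInt = {
  val srcW    = fpCSigCat0.getWidth
  val shifted = shiftRightWithMuxSticky(fpCSigCat0, rshiftValueUsed)
  Mux(rshiftValue <= 0.S,        Cat(fpCSigCat0, 0.U(1.W)),
  Mux(rshiftValue > rshiftMax.S, Cat(0.U(srcW.W), fpCSigCat0.orR),
                                 shifted))
}
```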
This subsection has introduced exponent processing methods, shifter design, and \(grs\) handling during right shifts.
Significand Addition
The result of right-shifting the \(fp\_c\) significand, \(rshift\_result\), needs to be added to the two results from \(CSAn\_2\) compression. Since the signs of \(fp\_c\) and \(fp\_a \times fp\_b\) may be opposite, subtraction must be performed when they are opposite, and the subtraction result may be negative. To determine the sign, an additional sign bit is added. \(fp\_c\_rshiftValue\_inv\) selects either \(rshift\_result\) (sign bit padded with \(0\)) or the complement of \(rshift\_result\) (sign bit padded with \(1\)) based on whether the signs are opposite. Therefore, \(fp\_c\_rshiftValue\_inv\) needs to be added to the two results from \(CSAn\_2\) compression. However, during subtraction, \(fp\_c\_rshiftValue\_inv\) only complements \(rshift\_result\); it also needs to add \(1\) to the LSB if all right-shifted \(grs\) bits are \(0\). This \(+1\) is placed in the \(carry\) bit of the two \(CSAn\_2\) compression results, because the \(carry\) bit is always \(0\). Placing it here saves an adder and reduces area. These three numbers have different bit widths: the result of right-shifting the \(fp\_c\) significand has a width of \(3 \times significandWidth+4\); the two results from \(CSAn\_2\) compression have a width of \(2 \times significandWidth+1\) (the \(+1\) is because the partial products mentioned earlier need to be extended by one bit to ensure correct carry). The strategy for summing three numbers of different widths is: first, perform a \(CSA3\_2\) compression on the low \(2 \times significandWidth+1\) bits of the two \(CSAn\_2\) compression results and the low \(2 \times significandWidth\) bits of \(rshift\_result\) (with the MSB padded with a \(0\) to make it \(2 \times significandWidth+1\) bits). Then, add the two results of the \(CSA3\_2\) compression, denoted as \(adder\_low\_bits\). Simultaneously, calculate the high \(significandWidth+4\) bits of \(rshift\_result +1\). Then, based on whether the MSB of the sum of the low \(2 \times significandWidth+1\) bits is \(1\), select either the high \(significandWidth+4\) bits of \(fp\_c\_rshiftValue\_inv\) or the high \(significandWidth+4\) bits of \(fp\_c\_rshiftValue\_inv +1\), denoted as \(adder\_high\_bits\).
We also need to consider complementing and adding one to the right-shifted \(grs\) bits during subtraction. The final significand addition result \(adder\) (including right-shifted \(grs\)) is composed of: \(adder\_high\_bits\), \(adder\_low\_bits\), and right-shifted \(grs\) (complemented and incremented during subtraction). Because \(adder\) might be negative, an extra bit is extended solely for determining the sign of \(adder\), and is not used afterwards. \(adder\_inv\) complements \(adder\) if it is negative and removes this sign determination bit.
\(LZD\), Left Shift, Rounding, and Unrounded Mantissa Result
After calculating \(adder\_inv\), leading zero detection needs to be performed on \(adder\_inv\) to determine the number of bits to left-shift, thereby normalizing and rounding the mantissa result.
When performing \(LZD\) on \(adder\_inv\), there is an exponent limiting issue. Let \(E\_greater\) be \(Eab\) (the exponent result of \(fp\_a \times fp\_b\)). The left shift amount cannot be greater than \(E\_greater\), because if it equals \(E\_greater\), the exponent result is already all \(0\)s. To solve this problem, a \(mask\) is used for limiting during the left shift, similar to the floating-point adder.
For the case where \(adder\) is negative, \(-adder\) should be the complement of \(adder +1\). Since \(+1\) creates a long carry chain, only the complement operation is performed. Then, \(LZD\) of \(adder\_inv\) is calculated. This might cause a one-bit deviation. When the complement of \(adder\) has all trailing \(1\)s, adding \(1\) will cause the MSB \(1\) to generate a carry. The method to resolve this one-bit deviation is to perform trailing zero detection (\(TZD\)) on \(adder\). If \(LZD + TZD = \text{width of } adder\), then the complement of \(adder\) has all trailing \(1\)s. In this case, the left-shift result needs to be corrected. After left-shift correction, the unrounded result is obtained. Adding \(1\) to the unrounded result gives the rounded result.
Final Result
The sign bit result is obtained from the sign of \(adder\). The calculation of \(grs\) needs to combine the right shift in Step 5 and the left shift in Step 7. Underflow detection uses the \(after \quad rounding\) method, with a separate set of \(grs\) dedicated to \(underflow\) detection. Based on the rounding mode and \(grs\), it is determined whether rounding is needed; the mantissa result is then selected, and the exponent result is obtained according to the rounding outcome.
If input operands contain special values like \(NaN\), infinity, or zero, the result is calculated separately and selected from the special result and the normal result based on the actual input numbers. The flag bits can generate four types of flags: overflow, underflow, invalid operation, and inexact, excluding the divide-by-zero flag.
Vector Single-Precision Format Algorithm
The main design idea for vector operations is to share hardware while meeting timing requirements.
During \(Booth\) encoding, \(f16\) generates \(6\) \(pp\)s (partial products), \(f32\) generates \(13\) \(pp\)s, and \(f64\) generates \(27\) \(pp\)s. Therefore, during \(Booth\) encoding, the space for \(27\) \(pp\)s from \(f64\) can accommodate two sets of \(13\) \(pp\)s from \(f32\). Similarly, the space for \(13\) \(pp\)s from \(f32\) can accommodate two sets of \(6\) \(pp\)s from \(f16\). This allows a single \(CSA\_27to2\) compression unit to be shared. Vector shared \(Booth\) encoding is shown in the figure.
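The partial-product counts above are consistent with radix-4 \(Booth\) encoding of a \(w\)-bit significand (including the hidden bit, \(w = 11, 24, 53\)); under that assumption the count is

\[
N_{pp} = \left\lceil \frac{w+1}{2} \right\rceil,
\]

which gives \(6\), \(13\), and \(27\) for \(f16\), \(f32\), and \(f64\) respectively.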
When right-shifting the \(fp\_c\) mantissa, the right shift for \(f64\) and one of the \(f32\) mantissas can share one shifter. Other shifters are independent.
\(CSA\_3to2\) is also shared. The third operand comes from the result of right-shifting the \(fp\_c\) mantissa: the right-shift results of the \(2\) \(f32\) or \(4\) \(f16\) mantissas are concatenated and undergo \(CSA\_3to2\) compression together with the two operands produced by the shared \(Booth\) encoding compression.
The adder after compression is also shared. Different formats are allocated different bits, and they are separated by bits to prevent carries from lower bits from affecting higher bit results.
The sharing approach for \(LZD, TZD\), and left shifters is similar to that of right shifters: \(f64\) and one of the \(f32\)s are shared, while others are independent.
Vector Mixed-Precision Format Algorithm
There are two types of vector mixed-precision format computations:
(1) \(2 \ fp32 = fp16 \times fp16 + fp32\);
(2) \(1 \ fp64 = fp32 \times fp32 + fp64\).
For two multiplicands of the same width, it is essentially exponent addition and significand multiplication. There is no need to first convert their formats to be the same as the result format, as in floating-point addition. Simply extending the bit width is sufficient: pad the high bits of the exponent with zeros and the low bits of the significand with zeros to align with the high-precision floating-point operand. After alignment, proceed with the computation as for a single-precision format.
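As a concrete illustration of this bit-width extension (a sketch under the assumption that only field alignment is needed at this point, not a numeric format conversion; field offsets follow the standard \(f16\)/\(f32\) layouts):

```scala
// Sketch: align an f16 multiplicand to f32-shaped fields for the internal datapath.
// The 5-bit exponent is zero-extended at its high end and the 10-bit fraction is
// zero-padded at its low end; no exponent-bias adjustment is done here.
object WidenOperand {
  def f16FieldsToF32Shape(f16: Int): Long = {
    val sign = (f16 >> 15) & 0x1
    val exp  = (f16 >> 10) & 0x1F   // 5-bit exponent field
    val frac =  f16        & 0x3FF  // 10-bit fraction field
    (sign.toLong << 31) | (exp.toLong << 23) | (frac.toLong << 13)
  }
}
```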
Vector Floating-Point Division Algorithm
Division is one of the most representative floating-point functions in modern processors. There are two main algorithms for calculating division in hardware: digit-recurrence algorithms, which have linear convergence and are based on subtraction, and multiplicative algorithms, which are based on multiplication and have quadratic convergence. Subtraction-based digit-recurrence algorithms are more energy-efficient and require less area. Subsequent references to digit-recurrence in this document refer to subtraction-based digit-recurrence. For commonly used floating-point precisions (double, single, and half), digit-recurrence methods are much faster. In digit-recurrence division, the most critical point is the selection of quotient digits. One digit of the quotient is obtained per iteration. To implement a simple \(Radix-4\) selection function independent of the divisor, the divisor needs to be scaled to be sufficiently close to 1. This scaling is performed before the digit-recurrence.
Digit-recurrence algorithms are widely used in high-performance processors due to their good trade-off in performance, area, and power consumption. This document is based on \(SRT\) division (\(Sweeney-Robertson-Tocher Division\)) and employs a \(Radix-64\) floating-point division algorithm, calculating \(6\) bits of the quotient per cycle. To reduce overhead, each \(Radix-64\) iteration consists of three \(Radix-4\) iterations; a speculative algorithm is used between consecutive \(Radix-4\) iterations to reduce latency.
Scalar Floating-Point Division Algorithm
The \(Radix-64\) scalar floating-point division algorithm implemented in this document has low latency for double-precision, single-precision, and half-precision floating-point division when both input operands and the result are normalized numbers: \(11, 6, 4\) cycles respectively. These latencies include cycles for scaling and rounding. In cases where input operands or the result include denormalized numbers, one or two additional normalization cycles are required.
The exponent result is easily determined. The focus is on significand division. The significand divider performs floating-point division of the dividend significand \(x\) and the divisor significand \(d\) to obtain the significand quotient \(q = x/d\). Both operands need to be normalized numbers, \(x, d \in [1, 2)\). Denormalized operands are also allowed; denormalized operands are normalized before digit iteration. If both operands are normalized in \([1, 2)\), the result is in \([0.5, 2)\). Thus, two bits to the right of the quotient's least significant bit (LSB) are needed for rounding: the guard bit and the round bit.
The guard bit is used for rounding when the result is normalized, \(q \in [1, 2)\). The round bit is used for rounding when the result is not normalized, \(q \in [0.5, 1)\). In the latter case, the result is left-shifted by \(1\) bit, and the guard and round bits become the LSB and guard bit of the normalized result, respectively. To simplify rounding, the result is forced to be \(q \in [1, 2)\). Note that \(q < 1\) only when \(x < d\). This situation is detected early, and the dividend is left-shifted by \(1\) bit, so that \(q = 2 \times x/d\) and \(q \in [1, 2)\). Note that the exponent result must be adjusted accordingly at this time.
The algorithm used for division is a \(Radix-4\) digit-recurrence algorithm. Each cycle consists of three iterations. The quotient's signed-digit representation uses the digit set {\(−2, −1, 0, +1, +2\)}; i.e., radix \(r=4\), digit set \(a=2\). In each iteration, one quotient digit is obtained through a selection function. To have a quotient digit selection function that is independent of the divisor, the divisor must be scaled to be close to \(1\). Of course, to maintain the correctness of the result, the dividend needs to be scaled by the same factor as the divisor.
Using the radix-4 algorithm, \(2\) bits of the quotient are obtained per iteration. Since three radix-4 iterations are performed per clock cycle, \(6\) bits of the quotient are obtained per cycle, which is equivalent to a \(Radix-64\) divider. Additionally, note that the first quotient digit, which is the integer bit of the result, can only take values {\(+1, +2\)}, and its calculation is much simpler than that of the remaining digits. Calculating it in parallel with operand prescaling saves one iteration for single-precision floating-point numbers. On the other hand, there is an early termination mode for exceptional operands. Early termination is triggered when any operand is \(NaN\), infinity, or zero, or in the case of division by a power of \(2\) where both operands are normalized. In the latter case, the result is obtained simply by reducing the exponent of the dividend. The main features of the \(Radix-64\) divider are as follows:
(1) Prescaling of divisor and dividend.
(2) The first quotient digit is executed in parallel with prescaling.
(3) Comparison of the scaled dividend and divisor, and left-shifting the dividend to obtain a result in \([1, 2)\).
(4) Three \(Radix-4\) iterations per cycle, yielding \(6\) bits per cycle.
(5) Support for half-precision, single-precision, and double-precision.
(6) Denormalized number support; an additional cycle is required before iteration to normalize denormalized numbers.
(7) Early termination for exceptional operands.
Digit-Recurrence Division Algorithm
Digit-recurrence division is an iterative algorithm that calculates one \(radix-r\) quotient digit \(q_{i+1}\) and a remainder in each iteration. The remainder \(rem[i]\) is used to obtain the next \(radix-r\) digit. For fast iteration, the remainder is kept in a carry-save adder using a signed-digit redundant representation. This document chooses a \(radix-2\) signed-digit representation for the remainder, which includes a positive number and a negative number. For radix \(r=4\), the expression for the partial quotient before the \(i\)-th iteration is:
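In a standard formulation (the digit indexing convention here is an assumption):

\[
Q[i] = \sum_{j=0}^{i} q_j \cdot 4^{-j}.
\]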
For the \(radix-4\) algorithm after the divisor is scaled close to \(1\), the quotient and remainder are described by the following equations:
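In the usual digit-recurrence notation, with \(d\) denoting the scaled divisor (the exact symbols are an assumption):

\[
q_{i+1} = \mathrm{SEL}\left(4 \cdot \widehat{rem}[i]\right), \qquad rem[i+1] = 4 \cdot rem[i] - q_{i+1} \cdot d,
\]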
where \(\widehat{rem}[i]\) is an estimate of the remainder \(rem[i]\) using only a few bits. For this algorithm, it has been determined that only the \(6\) most significant bits (MSBs) of the remainder are needed, i.e., \(3\) integer bits and \(3\) fractional bits. Then, each iteration obtains the quotient digit from the current remainder and calculates a new remainder for the next iteration. The number of iterations \(it\) is calculated by the formula:
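Assuming \(2\) quotient bits are retired per \(radix\)-\(4\) iteration, a consistent closed form is

\[
it = \left\lceil \frac{n}{2} \right\rceil,
\]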
where \(n\) is the number of bits in the result, including bits required for rounding. The latency of division, i.e., the number of cycles, is directly related to the number of iterations. It also depends on the number of iterations performed per cycle. Three iterations have been implemented per cycle to obtain \(6\) bits per cycle, which is equivalent to \(Radix-64\) division. Then, the number of cycles \(cycles\) required for normalized floating-point numbers is determined by the following formula. In addition to the (\(it/3\)) cycles required for iteration, there are two additional cycles for operand prescaling and rounding.
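A form consistent with that description is

\[
cycles = \left\lceil \frac{it}{3} \right\rceil + 2.
\]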
Some examples of digit-recurrence division, including the \(Radix-4\) algorithm, can be found in [\(38\)]. A simple implementation is shown in the figure. Note that only the MSBs of the remainder are used to select the quotient digit. The remainder is updated using a carry-save adder (\(CSA\)) and stored in a redundant representation. Then, quotient digit selection requires the \(t\) MSBs of the remainder to be added in a carry-propagate adder (\(CPA\)) to obtain its non-redundant representation. However, this implementation is too slow. To accelerate the iteration loop, speculative algorithms must be used in the remainder calculation and quotient digit selection between iterations.
Operand Prescaling
During prescaling, the divisor is scaled to a value close to \(1\), so that the selection of quotient digits is independent of the divisor. For the \(radix-4\) algorithm, scaling the divisor into the range \([1 − 1/64, 1+1/8]\) is sufficient. As shown in the prescaling factor truth table, only three bits determine the scaling factor. Note that during prescaling, the divisor is scaled up by a factor between \(1\) and \(2\); the dividend must be scaled by the same factor.
\(0.1\)xxx | Prescaling Factor |
---|---|
\(000\) | \(1+1/2+1/2\) |
\(001\) | \(1+1/4+1/2\) |
\(010\) | \(1+1/2+1/8\) |
\(011\) | \(1+1/2+0\) |
\(100\) | \(1+1/4+1/8\) |
\(101\) | \(1+1/4+0\) |
\(110\) | \(1+0+1/8\) |
\(111\) | \(1+0+1/8\) |
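A small software sketch of applying the factor from the table above with shift-and-add terms follows (the names, the encoding of the shift amounts, and the handling of truncation are illustrative assumptions, not the RTL):

```scala
// Sketch: the prescaling factor is 1 + 2^-s1 + 2^-s2, so scaling amounts to adding
// two shifted copies of the fixed-point significand to itself. A shift amount of 0
// encodes "no term"; truncation of the shifted-out bits is ignored in this model.
object Prescale {
  def factorShifts(xxx: Int): (Int, Int) = xxx match {
    case 0 => (1, 1) // 1 + 1/2 + 1/2
    case 1 => (2, 1) // 1 + 1/4 + 1/2
    case 2 => (1, 3) // 1 + 1/2 + 1/8
    case 3 => (1, 0) // 1 + 1/2
    case 4 => (2, 3) // 1 + 1/4 + 1/8
    case 5 => (2, 0) // 1 + 1/4
    case _ => (3, 0) // 110 and 111: 1 + 1/8
  }

  def scale(frac: BigInt, xxx: Int): BigInt = {
    val (s1, s2) = factorShifts(xxx)
    val t1 = if (s1 != 0) frac >> s1 else BigInt(0)
    val t2 = if (s2 != 0) frac >> s2 else BigInt(0)
    frac + t1 + t2
  }
}
```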
Integer Bit Quotient Calculation
Simultaneously with the integer bit quotient calculation, the following data is provided for the digit iteration steps (each digit iteration performs three \(radix-4\) iterations, corresponding to \(s0, s1, s2\) stages):
(1) Carry-save representation of the redundant remainder: \(f\_r\_s, f\_r\_c\).
(2) Divisor after prescaling: \(divisor\).
(3) \(6\)-bit remainder result for quotient selection in the \(s0\) stage of the first digit iteration.
(4) \(7\)-bit remainder result for quotient selection in the \(s1\) stage of the first digit iteration.
Digit Iteration
The actual implementation of a floating-point divider requires performing three \(radix-4\) iterations per cycle. Conventional sequential iteration three times is too slow to meet timing requirements, so the logic has been optimized. The figure shows a block diagram of the digit iteration loop.
(1) Process the divisor to obtain the five possible divisor multiples needed for quotient selection results (only complemented when the quotient is negative).
(2) In the \(s0\) stage, use four \(CSA\) modules (not needed when the quotient is \(0\)) to predictively calculate in parallel the \(5\) redundant remainder representations needed for the \(s1\) stage.
(3) In the \(s0\) stage, use the \(5\) redundant remainder representations calculated in the second step to predictively calculate \(5\) \(7\)-bit remainder results for the \(s2\) stage.
(4) In the \(s0\) stage, based on the \(6\)-bit remainder result from the input signal, select the quotient for the \(s0\) stage. The quotient is represented by a \(5\)-bit one-hot code.
(5) Based on the quotient from the \(s0\) stage, select the redundant remainder representation to be used for the \(s1\) stage. Simultaneously, select one from the \(5\) predictively calculated \(7\)-bit remainder results for the \(s2\) stage from the third step.
(6) In the \(s1\) stage, use four \(CSA\) modules (not needed when the quotient is \(0\)) to predictively calculate in parallel the \(5\) redundant remainder representations needed for the \(s2\) stage.
(7) In the \(s1\) stage, based on the \(7\)-bit remainder result from the input signal, the \(5\) divisor multiples used for quotient selection, and the quotient from the \(s0\) stage, predictively perform quotient selection for the \(s1\) stage.
(8) Based on the quotient from the \(s1\) stage, select the redundant remainder representation to be used for the \(s2\) stage.
(9) In the \(s2\) stage, use four \(CSA\) modules (not needed when the quotient is \(0\)) to predictively calculate in parallel the \(5\) redundant remainder representations needed for the \(s0\) stage of the next digit iteration.
(10) In the \(s2\) stage, predictively calculate \(5\) possible results for both the \(6\)-bit remainder result needed for the \(s0\) stage and the \(7\)-bit remainder result needed for the \(s1\) stage of the next digit iteration.
(11) In the \(s2\) stage, based on the \(7\)-bit remainder result selected for the \(s2\) stage in the fifth step, the \(5\) divisor multiples used for quotient selection, and the quotient from the \(s1\) stage, predictively perform quotient selection for the \(s2\) stage.
(12) Based on the quotient selection result from the \(s2\) stage, select for the next digit iteration: the carry-save representation of the redundant remainder, the \(6\)-bit remainder result needed for the \(s0\) stage, and the \(7\)-bit remainder result needed for the \(s1\) stage.
Since the divisor multiples in the first step are only complemented and not incremented by \(1\), there will be deviations in the remainder calculation. Correction logic is added to the quotient selection process to fix this. The table below is the standard quotient selection function, and the subsequent table is the quotient selection function after logical correction.
\(4 \times rem[i]\) | \(q_{i+1}\) |
---|---|
\([13/8,31/8]\) | \(+2\) |
\([4/8,12/8]\) | \(+1\) |
\([-3/8,3/8]\) | \(0\) |
\([-12/8,-4/8]\) | \(-1\) |
\([-32/8,-13/8]\) | \(-2\) |
\(4 \times rem[i]\) | \(carry\) | \(q_{i+1}\) |
---|---|---|
\(31/8\) | \(1\) | \(+2\) |
\([13/8,30/8]\) | - | \(+2\) |
\(12/8\) | \(0\) | \(+2\) |
\(12/8\) | \(1\) | \(+1\) |
\([4/8,11/8]\) | - | \(+1\) |
\(3/8\) | \(0\) | \(+1\) |
\(3/8\) | \(1\) | \(0\) |
\([-3/8,2/8]\) | - | \(0\) |
\(-4/8\) | \(0\) | \(0\) |
\(-4/8\) | \(1\) | \(-1\) |
\([-12/8,-5/8]\) | - | \(-1\) |
\(-13/8\) | \(0\) | \(-1\) |
\(-13/8\) | \(1\) | \(-2\) |
\([-32/8,-14/8]\) | - | \(-2\) |
The redundant remainder representation output by the digit iteration is restored to a standard remainder. The quotient and quotient-minus-one are calculated using \(On\)-\(the\)-\(Fly\) \(Conversion\). Two sets of \(grs\) and whether an upward round by one bit is needed are calculated for the quotient and quotient-minus-one. The selection signal for choosing the quotient or quotient-minus-one is calculated, and the correct quotient result is selected. Finally, rounding is performed using the correct quotient result and the corresponding signal indicating whether an upward round by one bit is needed.
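A minimal sketch of radix-4 \(On\)-\(the\)-\(Fly\) \(Conversion\) is given below (a plain-Scala software model; digit and register names are illustrative). Keeping both \(Q\) and \(QM = Q - ulp\) lets negative digits be absorbed without a final subtraction, which is also why the quotient-minus-one value used above is available directly.

```scala
// On-the-fly conversion for radix-4 signed digits in {-2,-1,0,+1,+2}:
// Q holds the quotient converted so far, QM holds Q minus one unit in the last place.
object OnTheFlyConversion {
  def step(q: Int, Q: BigInt, QM: BigInt): (BigInt, BigInt) = {
    val nextQ  = if (q >= 0) (Q << 2) | q       else (QM << 2) | (4 + q)
    val nextQM = if (q >= 1) (Q << 2) | (q - 1) else (QM << 2) | (3 + q)
    (nextQ, nextQM)
  }

  def main(args: Array[String]): Unit = {
    // Digits 1, -2, 2 encode 1*4^2 + (-2)*4 + 2 = 10; the first digit is positive,
    // so the initial QM value is never used.
    val digits = Seq(1, -2, 2)
    val (q, qm) = digits.foldLeft((BigInt(0), BigInt(0))) { case ((a, b), d) => step(d, a, b) }
    assert(q == 10 && qm == 9)
  }
}
```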
Denormalized Numbers and Early Termination
(1) Input contains denormalized numbers: The significand of a denormalized number is less than \(1\) and cannot be prescaled together with normalized numbers. For this, an additional cycle is added to normalize the significand of the denormalized number and adjust its exponent accordingly.
(2) Result is a denormalized number: The quotient result after digit iteration is greater than \(1\), which does not fit the significand range of a denormalized number. An additional cycle is needed to right-shift the quotient result for normalization.
(3) Early termination: There are two cases: when the result is \(NaN\), infinity, or exact \(0\), the calculation is terminated early and the result is output. Since this information is available in the first cycle, the division result can be output in the second cycle. When the divisor is a power of \(2\), its significand is \(=1\). Division only requires processing the dividend's exponent and can skip the digit iteration stage. The result can be output as early as the second cycle. However, if the dividend or result is a denormalized number, additional cycles are also needed for processing.
Vector Floating-Point Division Algorithm
For vector floating-point division, the \(RISC-V\) vector instruction set extension does not have mixed-precision floating-point division. Thus, it only needs to support:
(1) \(1 \ f64 = f64 / f64\);
(2) \(2 \ f32 = f32 / f32\);
(3) \(4 \ f16 = f16 / f16\).
Considering that multiple division calculations occur simultaneously in vector division, and the early termination feature would cause asynchronous output of results for multiple data unless all terminate under the same conditions, the early termination mechanism is canceled for vector division. If an early termination condition occurs, the result is temporarily stored internally and output simultaneously with other division results.
To unify timing, the number of cycles for the divider is fixed to the worst-case scenario, i.e., when the input contains denormalized numbers and the output also contains denormalized numbers. Cases that could originally output results faster will have their data temporarily stored internally and output after the specified number of cycles.
The main design adopts resource reuse. In non-digit-iteration modules, the following data reuse is performed:
(1) \(1 \ f64 = f64 / f64\), or \(1 \ f32 = f32 / f32\), or \(1 \ f16 = f16 / f16\);
(2) \(1 \ f32 = f32 / f32\), or \(1 \ f16 = f16 / f16\);
(3) \(2 \ f16 = f16 / f16\).
A total of \(4\) sets of signals are used to complete the functionality of \(7\) division groups.
Since the digit iteration module is on the critical path and has high timing pressure, it cannot achieve the same high degree of reuse as the non-digit-iteration module parts, otherwise timing requirements would not be met. Therefore, the digit iteration module is designed with partial reuse:
(1) The interface consists of \(4\) sets of quotients and redundant remainders.
(2) The \(s0\) stage uses \(7\) sets of \(CSA\) and \(7\) sets of prediction, with \(4\) sets of quotient selection.
(3) The \(s1, s2\) stages use \(4\) sets of \(CSA\), \(4\) sets of prediction, and \(4\) sets of quotient selection.
For registers, resource reuse is also adopted. For registers storing the divisor, redundant remainder, quotient, etc., their size is allocated according to \(max\)(\(4 \times f16\), \(2 \times f32\), \(1 \times f64\)) required bits.
Hardware Design
Vector Floating-Point Adder
Scalar Single-Precision Floating-Point Adder
A scalar single-precision floating-point adder is designed based on the improved dual-path floating-point addition algorithm. Its hardware implementation architecture is shown in the figure.
On the left are two input operands \(fp\_a\) and \(fp\_b\). On the right, \(fp\_c\) represents the addition result, \(fflags\) is a \(5\)-bit exception flag, \(rm\) is the rounding mode (five modes, represented by \(3\) bits), and \(is\_sub\) determines the operation: if \(0\), \(fp\_c = fp\_a + fp\_b\); if \(1\), \(fp\_c = fp\_a - fp\_b\). The only difference between floating-point addition and subtraction is the sign bit of \(fp\_b\), so with a slight modification to \(fp\_b\)'s sign bit, the adder can support both addition and subtraction. It consists of three main parts: \(far\) path, \(close\) path, and exception path.
The \(far\) path first performs two normalized exponent subtractions and significand right shifts in parallel, handling the cases \(Efp\_a \ge Efp\_b\) and \(Efp\_b \ge Efp\_a\) separately. Based on the magnitude relationship between \(Efp\_a\) and \(Efp\_b\), the correct right-shift result is selected and sent to two significand adders, \(FS0\) and \(FS1\). In the \(far\) path, for subtraction, \(EA\) is the larger exponent minus one; for addition, \(EA\) is the larger exponent. This keeps the significand addition result in the range \([1,4)\). Two sets of \(grs\) are calculated during the right shift: \(grs\_normal\) for rounding when the result is in \([1,2)\), and \(grs\_overflow\) for rounding when the result is in \([2,4)\). Finally, based on the result of \(FS0\) and the rounding mode, either \(FS0\) or \(FS1\) is selected as the mantissa result. Simultaneously, \(EA\) or \(EA+1\) is selected as the exponent result based on rounding. The sign bit result is obtained from the exponent magnitude. The flag result indicates overflow if \(EA+1\) is all \(1\)s, and inexact based on \(grs\). The \(far\) path does not generate divide-by-zero, invalid operation, or underflow flags.
The \(close\) path uses four significand adders \(CS0, CS1, CS2, CS3\) to handle significand subtraction for three cases: \(Efp\_a = Efp\_b\), \(Efp\_a = Efp\_b + 1\), and \(Efp\_a = Efp\_b – 1\). Then, based on the \(CS0\) result and \(grs\), four one-hot selection signals \(sel\_CS0, sel\_CS1, sel\_CS2, sel\_CS3\) are generated. A one-hot 4-input multiplexer \(Mux1H\) selects one result, which is ORed with a left-shift \(mask\). Then, a priority left shifter normalizes the mantissa. The left shift simultaneously outputs the \(lzd\) value. The exponent result is \(EA – lzd\). The mantissa result is selected between the normalized mantissa and \(CS4\) (a supplementary rounding result that does not require left-shift normalization). The sign result is calculated from the exponent difference and the \(CS0\) result. The flag result only generates inexact; other exception flags are not generated.
The exception path is used to determine if there is an invalid operation, if the result is \(NaN\), or if the result is infinity. If none of these three conditions are met, normal computation proceeds, generating selection signals to choose the result and flags from either the \(far\) path or the \(close\) path as output.
Scalar Mixed-Precision Floating-Point Adder
Mixed-precision hardware design is based on the scalar single-precision floating-point adder. The main difference is the support for mixed-precision computations. Taking an \(f32\) result as an example, the truth table below shows operations corresponding to \(res\_widen\) and \(opb\_widen\).
\(res\_widen\) | \(opb\_widen\) | \(f32\) Operation |
---|---|---|
\(0\) | \(0\) | \(f32 = f32 + f32\) |
\(1\) | \(0\) | \(f32 = f16 + f16\) |
\(1\) | \(1\) | \(f32 = f16 + f32\) |
\(0\) | \(1\) | Not Allowed |
The figure below shows the architecture of the scalar mixed-precision floating-point adder. The main difference is the addition of a fast format conversion module at the data input. Based on the operation type, operands are uniformly converted to the result's data format, after which the computation flow is the same as for single-precision floating-point addition.
Vector Floating-Point Adder
The figure below shows the architecture of the vector floating-point adder. To meet timing requirements, it is composed of four modules. \(FloatAdderF64Widen\) computes all operations with \(64\)-bit results. \(FloatAdderF32WidenF16\) computes all operations with \(16\)-bit or \(32\)-bit results. \(FloatAdderF16\) only computes operations with \(16\)-bit results.
Here, \(fp\_format\) is a \(2\)-bit result format control signal: \(00\) indicates an \(f16\) result format, \(01\) indicates \(f32\), and \(10\) indicates \(f64\). The output flags are \(20\) bits, packed from the least-significant bits upward (element \(0\)'s \(5\)-bit flag group occupies the lowest bits). When the result format is \(f16\), all \(20\) bits are valid; when \(f32\), the low \(10\) bits are valid; when \(f64\), the low \(5\) bits are valid.
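A small sketch of this packing convention (the per-element flag order \(NV, DZ, OF, UF, NX\) and the zeroing of unused lanes are assumptions for illustration):

```scala
// Pack per-element 5-bit IEEE flag groups into the 20-bit fflags output, element 0
// in the least-significant bits. For f32 only the first two groups are meaningful,
// for f64 only the first; unused groups are simply left at zero here.
object PackFflags {
  def pack(perElement: Seq[Int]): Int =
    perElement.zipWithIndex
      .map { case (flags, i) => (flags & 0x1F) << (5 * i) }
      .foldLeft(0)(_ | _)
}
```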
The vector floating-point adder uses a two-cycle pipeline design. To achieve fast wakeup, the addition result is computed in about \(1.5\) cycles. Pipelining is done within each submodule by inserting one level of registers. The pipeline partitioning for the three types of modules in the figure is described below.
The figure below shows the pipeline partitioning for the \(FloatAdderF64Widen\) module. In the \(far\) path, registers are inserted after significand right shift. In the \(close\) path, registers are inserted after \(Mux1H\).
The figure below shows the pipeline partitioning for the \(FloatAdderF32WidenF16\) module. This module includes computations for two different output formats. The selection logic in the second cycle is quite complex, so in the \(far\) path, registers are inserted within the adder. The first cycle performs addition of the low \(18\) bits and the high part. The second cycle adds the carry from the first cycle's low \(18\)-bit addition to the high part to get the final result. In the \(close\) path, registers are also inserted after \(Mux1H\).
The figure below shows the pipeline partitioning for the \(FloatAdderF16\) module. This module has the least timing pressure. Registers are inserted after significand right shift in the \(far\) path, and after \(Mux1H\) in the \(close\) path.
Interface Description
The previously described vector floating-point adder has a width of \(64\) bits, and its input data requires both operands to be in vector form. However, \(RVV\) specifies not only vector-vector (\(vv\)) operations but also vector-scalar (\(vf\)) operations, where one operand is a vector and the other is a scalar. Additionally, for \(widening\) instructions, the source operand does not always occupy the lower half of its register: when the source register width is half the destination register width, the data may come from either the lower or the upper half.
To implement all floating-point instruction computation operations in \(RVV\) and allow for \(VLEN\) extension, other simple instruction computations are added to the vector floating-point adder, making it a vector floating-point "ALU", called \(VFALU\).
Therefore, the vector floating-point adder needs to be modified to adapt to \(RVV\) features. The modification is in two parts: functional modification and interface modification.
The table below lists the opcodes supported by \(VFALU\), totaling \(16\) operations. (\(w\)) indicates operations that include \(widen\). The operand forms for \(vfmerge, vfmove, vfclass\) are special: \(vfmerge.vfm\) has three source operands: a vector register, a floating-point register, and a \(mask\) register; \(vfmove.v.f\) has only one source operand, a floating-point register; \(vfclass\) has only one source operand, a vector register.
\(op\_code\) | Corresponding Instruction | Operand Form | Meaning |
---|---|---|---|
\(0\) | \(vf(w)add\) | \(vv,vf\) | Addition |
\(1\) | \(vf(w)sub\) | \(vv,vf\) | Subtraction |
\(2\) | \(vfmin\) | \(vv,vf\) | Find Minimum |
\(3\) | \(vfmax\) | \(vv,vf\) | Find Maximum |
\(4\) | \(vfmerge\) | \(vfm\) | Data Merge |
\(5\) | \(vfmove\) | \(v.f\) | Data Move |
\(6\) | \(vfsgnj\) | \(vv,vf\) | Sign Inject |
\(7\) | \(vfsgnjn\) | \(vv,vf\) | Invert Sign Inject |
\(8\) | \(vfsgnjx\) | \(vv,vf\) | XOR Sign Inject |
\(9\) | \(vmfeq\) | \(vv,vf\) | Is Equal |
\(10\) | \(vmfne\) | \(vv,vf\) | Is Not Equal |
\(11\) | \(vmflt\) | \(vv,vf\) | Is Less Than |
\(12\) | \(vmfle\) | \(vv,vf\) | Is Less Than or Equal |
\(13\) | \(vmfgt\) | \(vf\) | Is Greater Than |
\(14\) | \(vmfge\) | \(vf\) | Is Greater Than or Equal |
\(15\) | \(vfclass\) | \(v\) | Classify |
The table below defines the \(VFALU\) interface. Compared to the vector floating-point adder, it adds \(widen\_a, widen\_b\) for mixed-precision data sources. When source and destination operand formats are the same, data comes from \(fp\_a, fp\_b\); otherwise, from \(widen\_a, widen\_b\). And if \(uop\_idx=0\), the lower half is taken; if \(uop\_idx=1\), the upper half is taken. When \(is\_frs1=1\), source operand \(vs1\) comes from floating-point register \(frs1\) and needs to be replicated into a vector register before computation. \(mask\) participates in \(merge\) instruction computation. \(op\_code\) is the operation code, indicating the operation to be performed.
Interface | Direction | Width | Meaning |
---|---|---|---|
\(fp\_a\) | \(input\) | \(64\) | Source operand \(vs2\) |
\(fp\_b\) | \(input\) | \(64\) | Source operand \(vs1\) |
\(widen\_a\) | \(input\) | \(64\) | \(widen\_vs2\) |
\(widen\_b\) | \(input\) | \(64\) | \(widen\_vs1\) |
\(frs1\) | \(input\) | \(64\) | Floating-point register data |
\(is\_frs1\) | \(input\) | \(1\) | Addend from floating-point register data |
\(mask\) | \(input\) | \(4\) | Participates in \(merge\) instruction computation |
\(uop\_idx\) | \(input\) | \(1\) | Selects upper/lower half for \(widen\) |
\(round\_mode\) | \(input\) | \(3\) | Rounding mode |
\(fp\_format\) | \(input\) | \(2\) | Floating-point format |
\(res\_widening\) | \(input\) | \(1\) | \(widen\) instruction |
\(opb\_widening\) | \(input\) | \(1\) | Whether source operand \(vs1\) has same format as result |
\(op\_code\) | \(input\) | \(5\) | Opcode |
\(fp\_result\) | \(output\) | \(64\) | Computation result |
\(fflags\) | \(output\) | \(20\) | Flags |
Vector Floating-Point Fused Multiply-Add Unit
Pipeline Partitioning
The vector floating-point fused multiply-add unit uses a four-cycle pipeline design. To achieve fast wakeup, the multiply-add result is computed in about \(3.5\) cycles. The figure below shows the architecture of the vector floating-point fused multiply-add unit. In the figure, \(reg\_0\) represents the first-stage register, \(reg\_1\) the second-stage register, and \(reg\_2\) the third-stage register. The vector floating-point fused multiply-add unit also supports \(widen\) functionality, but its \(widen\) function only has two cases: \(f32 = f16 \times f16 + f32\) and \(f64 = f32 \times f32 + f64\). So, once the output format is determined, only a one-bit \(widen\) signal is needed for control. The output \(fflags\) is also \(20\) bits, with the same representation as in the vector floating-point adder.
To save area while meeting timing constraints, a resource reuse implementation is adopted. Computations for all data formats pass through the same vector \(Booth\) encoder and \(CSA\) compression. Through interleaved placement, the \(107\)-bit adder also achieves resource reuse.
The first cycle performs exponent processing for seven groups, obtaining seven right-shift values. The corresponding right-shift value is selected based on the computation format. For shifters, the \(f64\) shifter is shared with one \(f32\). Additionally, \(1\) \(f32\) and \(4\) \(f16\) shifters are set up separately. If subtraction is performed, the \(fp\_c\) mantissa right-shift result is inverted before being input to the first-stage register. The first cycle also performs vector \(Booth\) encoding, generating \(27\) partial products, which are compressed into \(4\) partial products using \(CSA\) and then registered.
The second cycle performs \(CSA4\_2\) compression on the remaining \(4\) partial products and then \(CSA3\_2\) compression with the significand right-shift result calculated in the first cycle. Then, a \(107\)-bit addition is performed, and the result is registered into the second-stage register.
The third cycle performs \(lzd\) and \(tzd\) on the sum result from the second cycle, followed by a left shift with \(mask\) limiting. The left-shift result is stored in the third-stage register.
The fourth cycle performs rounding to get the mantissa result. The exponent result is calculated based on the left shift situation in the third cycle. The sign bit can be obtained from the \(107\)-bit adder in the second cycle. The flag result can generate four types of flags: overflow, underflow, invalid operation, and inexact. Note the underflow detection method: \(IEEE-754\) specifies two ways to detect underflow, \(before \quad rounding\) and \(after \quad rounding\). This design uses the \(RISC-V\) selected \(after \quad rounding\) method to detect underflow.
Interface Description
According to \(RVV\) instruction definitions, the vector floating-point fused multiply-add unit can be reused to compute multiplication, controlled by \(op\_code\). When computing multiplication, the addend is internally set to zero. Also, \(RVV\) defines a series of floating-point fused multiply-add instructions, mainly differing in sign bits and operand order. The vector floating-point fused multiply-add unit is modified to support all related instructions as \(VFMA\), by adding \(op\_code\) and interfaces. The table below lists the \(VFMA\) opcodes, totaling \(9\) operations, all supporting \(vv\) and \(vf\) operand forms. For \(vf\), \(vs1[i]\) is replaced with floating-point register \(frs1\).
\(op\_code\) | Corresponding Instruction | Operand Form | Meaning |
---|---|---|---|
\(0\) | \(vf(w)mul\) | \(vv,vf\) | \(vd[i] = vs2[i] \times vs1[i]\) |
\(1\) | \(vf(w)macc\) | \(vv,vf\) | \(vd[i] = +(vs1[i] \times vs2[i]) + vd[i]\) |
\(2\) | \(vf(w)nmacc\) | \(vv,vf\) | \(vd[i] = -(vs1[i] \times vs2[i]) - vd[i]\) |
\(3\) | \(vf(w)msac\) | \(vv,vf\) | \(vd[i] = +(vs1[i] \times vs2[i]) - vd[i]\) |
\(4\) | \(vf(w)nmsac\) | \(vv,vf\) | \(vd[i] = -(vs1[i] \times vs2[i]) + vd[i]\) |
\(5\) | \(vfmadd\) | \(vv,vf\) | \(vd[i] = +(vs1[i] \times vd[i]) + vs2[i]\) |
\(6\) | \(vfnmadd\) | \(vv,vf\) | \(vd[i] = -(vs1[i] \times vd[i]) - vs2[i]\) |
\(7\) | \(vfmsub\) | \(vv,vf\) | \(vd[i] = +(vs1[i] \times vd[i]) - vs2[i]\) |
\(8\) | \(vfnmsub\) | \(vv,vf\) | \(vd[i] = -(vs1[i] \times vd[i]) + vs2[i]\) |
The table below shows the \(VFMA\) interface. To simplify control logic complexity, the three operands sent to \(VFMA\) are fixed in the order \(vs2, vs1, vd\). The functional unit internally swaps the order based on \(op\_code\). Because the addend in \(fma\) instructions is fixed to the target format during \(widen\), only \(widen\_a\) and \(widen\_b\) need to be added. \(uop\_idx\) is similarly used to control selection of the upper or lower half of \(widen\_a\) and \(widen\_b\). \(frs1\) and \(is\_frs1\) are used to support \(vf\) instructions.
Interface | Direction | Width | Meaning |
---|---|---|---|
\(fp\_a\) | \(input\) | \(64\) | Source operand \(vs2\) |
\(fp\_b\) | \(input\) | \(64\) | Source operand \(vs1\) |
\(fp\_c\) | \(input\) | \(64\) | Source operand \(vd\) |
\(widen\_a\) | \(input\) | \(64\) | \(widen\_vs2\) |
\(widen\_b\) | \(input\) | \(64\) | \(widen\_vs1\) |
\(frs1\) | \(input\) | \(64\) | Floating-point register data |
\(is\_frs1\) | \(input\) | \(1\) | Operand from floating-point register data |
\(uop\_idx\) | \(input\) | \(1\) | Selects upper/lower half for \(widen\) |
\(round\_mode\) | \(input\) | \(3\) | Rounding mode |
\(fp\_format\) | \(input\) | \(2\) | Floating-point format |
\(res\_widening\) | \(input\) | \(1\) | \(widen\) instruction |
\(op\_code\) | \(input\) | \(5\) | Opcode |
\(fp\_result\) | \(output\) | \(64\) | Computation result |
\(fflags\) | \(output\) | \(20\) | Flags |
Vector Floating-Point Divider
Scalar Floating-Point Divider
The scalar floating-point divider supports three computation formats: \(1 \ f16 = f16 / f16\), \(1 \ f32 = f32 / f32\), \(1 \ f64 = f64 / f64\). The divider uses a \(Radix-64\) algorithm; the iteration module performs three \(Radix-4\) iterations per cycle to implement \(Radix-64\). The figure below shows the architecture of the scalar floating-point divider. The divider executes blockingly; it cannot accept the next division operation while a computation is in progress, so handshake signals are needed for control. This design uses \(valid\)-\(ready\) handshakes on the \(start\) and \(finish\) interfaces. Since CPUs may encounter branch mispredictions requiring a pipeline flush, a \(flush\) signal is specifically provided to clear the internal state of the divider, allowing it to immediately start a new division computation in the next cycle.
Input data is categorized into three cases: both are normalized numbers (excluding divisors that are powers of \(2\)), at least one is a denormalized number, and early termination (input contains \(NaN\), infinity, zero, or divisor is a power of \(2\)). Results are categorized into two cases: result is a normalized number, result is a denormalized number.
When both inputs are normalized numbers (excluding divisors that are powers of \(2\)), their significands are normalized, and the process directly enters the prescaling stage. When at least one input is a denormalized number, compared to the case where both inputs are normalized, an additional cycle is added before prescaling for significand normalization.
The prescaling stage takes one cycle, followed by integer quotient selection, which selects the two-bit integer quotient result and provides the prescaled divisor, dividend, and the carry-save redundant representation of the remainder for \(Radix-4\) iteration. The \(Radix-4\) iteration module computes \(6\) bits of the quotient per cycle. \(f16\) division requires \(2\) cycles of \(Radix-4\) iteration, \(f32\) division requires \(6\) cycles, and \(f64\) division requires \(9\) cycles. After \(Radix-4\) iteration is complete, the significand quotient result is in the range \((1, 2)\). When the result is a normalized number, only one cycle is needed for rounding and exponent result calculation to get the final division result. When the result is a denormalized number, an additional cycle is needed before rounding to denormalize the quotient result.
Early termination has two cases: when an input operand contains \(NaN\), infinity, or zero, no division computation is needed, and the result can be output in the second cycle. When the divisor is a power of \(2\), the exponent result can be obtained in the first cycle. If the result does not require a denormalization step, it can be output in the second cycle; if it does, an additional cycle is added, and the result is output in the third cycle.
The table below shows the computation cycles required by the scalar divider for different data formats. \(+1\) indicates an additional cycle for post-processing if the division result is a denormalized number. In early termination cases, all data format divisions are completed in \(1\) to \(2\) cycles. In non-early termination cases, \(f16\) division takes \(5\) to \(7\) cycles, \(f32\) division takes \(7\) to \(9\) cycles, and \(f64\) division takes \(12\) to \(14\) cycles.
Data Format | Normalized Number | Denormalized Number | Early Termination |
---|---|---|---|
\(f16\) | \(5+1\) | \(6+1\) | \(1+1\) |
\(f32\) | \(7+1\) | \(8+1\) | \(1+1\) |
\(f64\) | \(12+1\) | \(13+1\) | \(1+1\) |
Vector Floating-Point Divider
The figure below shows the architecture of the vector floating-point divider. Compared to the scalar floating-point divider, considering that vector division computes multiple divisions simultaneously and all results need to be written back to the register file uniformly after completion, early termination of a single division offers little acceleration benefit for vector division. Therefore, the characteristic of uncertain output result latency is removed. In any situation, based on the input data format, the latency of the vector floating-point divider is fixed, as shown in the table below.
Data Format | Computation Cycles |
---|---|
\(f16\) | \(7\) |
\(f32\) | \(11\) |
\(f64\) | \(14\) |
In terms of hardware design, except for the \(Radix-64\) iteration module, the vector floating-point divider uses logic reuse. Four sets of signals are used for computation and control: the first set for computing \(f64\_0\) or \(f32\_0\) or \(f16\_0\), the second set for computing \(f32\_1, f16\_1\), the third set for computing \(f16\_2\), and the fourth set for computing \(f16\_3\). Registers also use reuse to store intermediate results, with register bit widths allocated according to \(max\)(\(1 \times f64, 2 \times f32, 4 \times f16\)) to meet maximum requirements. The \(Radix-64\) iteration module is on the critical path and also aims to save area while meeting timing requirements. The first \(Radix-4\) iteration uses \(7\) independent sets of \(CSA\) and quotient selection. The second and third \(Radix-4\) iterations use \(4\) reused sets of \(CSA\) and quotient selection.
Interface Description
\(RVV\) defines three vector floating-point division instructions:
① \(vfdiv.vv \quad vd[i] = vs2[i]/vs1[i]\)
② \(vfdiv.vf \quad vd[i] = vs2[i]/f[rs1]\)
③ \(vfrdiv.vf \quad vd[i] = f[rs1]/vs2[i]\)
Among these, ③ is special: its operand order differs from ① and ②. For the vector division unit, the control logic passes \(vs2[i]\) or \(f[rs1]\) as the first operand (the dividend), and \(vs1[i]\), \(f[rs1]\), or \(vs2[i]\) as the second operand (the divisor). Thus, the functional unit sees the dividend in either vector or scalar form, and likewise the divisor. Therefore, two scalar data interfaces need to be added. After adding the interfaces, the module is named \(VFDIV\). The interface is shown in the table below.
Interface | Direction | Width | Meaning |
---|---|---|---|
\(start\_valid\_i\) | \(input\) | \(1\) | Handshake signal |
\(finish\_ready\_i\) | \(input\) | \(1\) | Handshake signal |
\(flush\_i\) | \(input\) | \(1\) | Flush signal |
\(fp\_format\_i\) | \(input\) | \(2\) | Floating-point format |
\(opa\_i\) | \(input\) | \(64\) | Dividend |
\(opb\_i\) | \(input\) | \(64\) | Divisor |
\(frs2\_i\) | \(input\) | \(64\) | Dividend from floating-point register data |
\(frs1\_i\) | \(input\) | \(64\) | Divisor from floating-point register data |
\(is\_frs2\_i\) | \(input\) | \(1\) | Dividend from floating-point register |
\(is\_frs1\_i\) | \(input\) | \(1\) | Divisor from floating-point register |
\(rm\_i\) | \(input\) | \(3\) | Rounding mode |
\(start\_ready\_o\) | \(output\) | \(1\) | Handshake signal |
\(finish\_valid\_o\) | \(output\) | \(1\) | Handshake signal |
\(fpdiv\_res\_o\) | \(output\) | \(64\) | Computation result |
\(fflags\_o\) | \(output\) | \(20\) | Flags |
Vector Format Conversion Module \(VCVT\)
The \(VCVT\) module is a three-stage pipelined vector floating-point format conversion module. It instantiates \(2\) submodules \(VectorCvt\), each capable of processing \(64bits\) of data. Each \(VectorCvt\) in turn contains one \(cvt64\), one \(cvt32\), and two \(cvt16\) modules. Among these, \(cvt64\) supports processing of \(64, 32, 16, 8bits\) floating-point/integer formats; \(cvt32\) supports processing of \(32, 16, 8bits\) floating-point/integer formats; and \(cvt16\) supports processing of \(16, 8bits\) floating-point/integer formats. That is, \(vectorCvt\) can simultaneously process the conversion of \(1 \times 64bits\) (or \(2 \times 32bits\), or \(4 \times 16bits\), or \(4 \times 8bits\)) floating-point/integer format input data.
Overall Design
Module Design
The \(CVT\) module includes single-width floating-point/integer type conversion instructions, widening floating-point/integer type conversion instructions, narrowing floating-point/integer type conversion instructions, vector floating-point reciprocal square root estimate instructions, and vector floating-point reciprocal estimate instructions.
Different \(cvt\) modules are called based on \(width\). The design philosophy for \(cvt\) modules is to categorize them into \(4\) types based on instruction type: \(fp2int, int2fp, fp2fp, vfr\), each implemented separately. The overall design idea for \(fcvt64\) is to unify the format of the input \(64bit\) data:
unsigned/signed \(int\) of any supported width \(\rightarrow\) \(65\)-bit signed \(int\);
\(f16/f32/f64\) \(\rightarrow\) \(65\) bits (\(f64 \ \#\# \ false.B\)).
After format unification, the conversion process, to a certain extent, does not need to distinguish which data type it is, its bit width, or the position of its fields.
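A minimal behavioral sketch of this unification (the helper names and the exact placement of narrower floating-point inputs inside the \(64\)-bit container are assumptions):

```scala
// Sketch of the 65-bit input unification in cvt64: integers of any supported width
// become a 65-bit signed value (a 64-bit unsigned input therefore still fits as a
// non-negative number), and a floating-point input has one zero bit appended at the
// LSB, matching the "f64 ## false.B" form above.
object UnifyCvtInput {
  def intTo65(value: BigInt, width: Int, signed: Boolean): BigInt = {
    val mask = (BigInt(1) << width) - 1
    val raw  = value & mask
    if (signed && raw.testBit(width - 1)) raw - (BigInt(1) << width) else raw
  }

  def fpTo65(bits64: BigInt): BigInt = bits64 << 1 // append a single 0 bit at the LSB
}
```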
Based on this, \(VFCVT64\) is divided into \(5\) categories: \(int -> fp, fp -> fp widen, fp -> fp narrow, estimate7(rsqrt7 \& rec7), fp -> int\).
\(FuopType\) Decoding Logic
For \(cvt\) instructions: its \(fuopType\) is \(9\) bits in total, with each bit representing the following information:
Bits \([5:0]\) are obtained from the manual. Bits \([8:6]\) are added for convenience in generating control signals during design.
\([8]: 1\) indicates it is a \(move\) instruction, \(0\) indicates a \(cvt\) instruction or the two estimation instructions \(vfrsqrt7\) and \(vfrec7\).
\([7]: 1\) indicates its input is \(fp\), \(0\) indicates input is \(int\).
\([6]: 1\) indicates its output is \(fp\), \(0\) indicates output is \(int\).
\([5]: 1\) indicates it is one of the two estimation instructions \(vfrsqrt7\) and \(vfrec7\); otherwise, it is a \(cvt\) instruction. When it is \(1\), \([0]\) is used to distinguish between \(vfrsqrt7\) and \(vfrec7\).
\([4:3]: 00\) indicates \(single\) type, \(01\) indicates \(widen\), \(10\) indicates \(narrow\).
\([2:0]\): Has different functions for different instructions: for conversions between floating-point and integer, \([0]\) is used to distinguish whether the integer is signed or unsigned. In other cases, \([2:1]=11\) indicates it is an \(rtz\) type instruction, and \([2:0]=101\) indicates it is \(rod (vfncvt\_rod\_ffw)\).
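As a summary of the decode described above, here is a minimal plain-Scala sketch (field polarities that the text leaves open, such as which value of bit \([0]\) selects \(vfrec7\) or a signed integer, are marked as assumptions in the comments; signal names are illustrative, not the RTL names):

```scala
// Decode the 9-bit fuopType into the control signals implied by the field list above.
case class CvtCtrl(
  isMove:     Boolean, // [8]   move instruction vs. cvt/estimate
  inIsFp:     Boolean, // [7]   input is floating point
  outIsFp:    Boolean, // [6]   output is floating point
  isEstimate: Boolean, // [5]   vfrsqrt7 / vfrec7
  isRec7:     Boolean, // [0]   when isEstimate; polarity (1 = vfrec7) is an assumption
  isWiden:    Boolean, // [4:3] == 01
  isNarrow:   Boolean, // [4:3] == 10
  isSigned:   Boolean, // [0]   for fp<->int conversions; polarity is an assumption
  isRtz:      Boolean, // [2:1] == 11
  isRod:      Boolean  // [2:0] == 101 (vfncvt_rod_ffw)
)

object CvtDecode {
  def apply(fuopType: Int): CvtCtrl = {
    def bit(i: Int): Boolean         = ((fuopType >> i) & 1) == 1
    def field(hi: Int, lo: Int): Int = (fuopType >> lo) & ((1 << (hi - lo + 1)) - 1)
    CvtCtrl(
      isMove     = bit(8),
      inIsFp     = bit(7),
      outIsFp    = bit(6),
      isEstimate = bit(5),
      isRec7     = bit(5) && bit(0),
      isWiden    = field(4, 3) == 1,
      isNarrow   = field(4, 3) == 2,
      isSigned   = bit(0),
      isRtz      = field(2, 1) == 3,
      isRod      = field(2, 0) == 5
    )
  }
}
```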