BPU Submodule FTB

Functional Overview

The FTB stages FTB entries, providing more precise branch instruction location, type, and other information for subsequent advanced predictors. The FTB module contains an FTBBank module responsible for the actual storage of FTB entries. The module uses a multi-way SRAM as the storage device.

Request Reception

In stage 0, the FTB module sends a read request to the internal FTBBank. The requested PC value is the PC passed in at s0.

Data Reading and Return

In the cycle following the request, which is stage 1 of the predictor, the multi-way signal read from the FTB SRAM is staged.

In the subsequent cycle, which is stage 2 of the predictor, a hit signal is generated from the staged data based on the tag of each way and the matching status with the tag during the actual request. If a hit occurs, the hit FTB data is selected. If a hit request exists, the return value is the selected FTB entry and the hit way information. If not hit, the output data is meaningless. The tag is bits 29 to 10 of the PC.

The data read out from the FTBBank module is passed to subsequent predictors in the current cycle as the stage 2 prediction result via combinational logic connections. In addition, this read-out result is also staged within the FTB module and passed to subsequent predictors again as the prediction result via combinational logic connections in stage 3. If the FTB hits, the read-out hit way number is also passed as meta information in s3, along with hit information and cycle count, to the subsequent FTQ module.

Furthermore, if an always taken flag exists within an FTB entry, the corresponding br_taken_mask in the stage 2 prediction result is asserted within this module.

Data Update

Upon receiving an update request, the FTB module decides the update timing based on whether there is a hit in the meta information. If the meta indicates a hit, the update occurs immediately in the current cycle. Otherwise, it needs to be delayed by 2 cycles to wait for the existing result within the FTB to be read out before updating.

Inside the FTBBank, when an update request exists, the module's behavior also differs depending on whether it's an immediate update or a delayed update. For immediate updates, the SRAM write channel within the FTBBank is asserted, and the write is completed according to the provided information. For delayed updates, the FTBBank first receives an update read request, which has higher priority than normal prediction read requests. Then, data is read out in the next cycle, and the way encoding hit by the given address is selected and passed to the external FTB module. If there is no hit in this cycle, the data needs to be written to an allocated way in the next cycle. The way selection rule is: if all ways are full, use a replacement algorithm (pseudo LRU here, see ICache documentation for details) to select the way to replace; otherwise, select an empty way.

SRAM Specifications

Single bank, 512 sets, 4 ways, uses single-port SRAM, no read hold, with power-on reset.

20-bit tag, 60-bit FTB entry.

Where FTB entry is:

1 bit valid 20 bits br slot (4 bit offset, 12 bit lower, 2 bit tarStat, 1 bit sharing, 1 bit valid) 28 bits tail slot (4 bit offset, 20 bit lower, 2 bit tarStat, 1 bit sharing, 1 bit valid) 4 bits pftAddr 1 bit carry 1 bit isCall 1 bit isRet 1 bit isJalr 1 bit potentially last is rvi call 2 bits always taken

Overall Block Diagram

Interface Timing

Result Output Interface

The figure above shows the interface where the FTB module in the branch predictor outputs prediction results for a request with fallThrough address 0x2000001062 for three consecutive cycles at different stages of the branch predictor.

Update Interface

The figure above shows an update operation of the FTB module targeting address 0x2000000E00. All update data is passed within one cycle.

FTBBank

Interface Timing

Read Data Interface

The figure above shows the FTBBank read data interface. FTBBank replies with data one cycle after receiving the request. That is, the data replied at 16303ps is for the 0x2000001060 address request at 16301ps.

Update Read Data Interface

The figure above shows the FTBBank update read data interface. FTBBank replies with data one cycle after receiving an update read request. The replied data is used by the external unit for update write data one cycle later. It can be noted that the pftAddr from one cycle after the request is used for the data write one cycle after the result read out.

Update Write Data Interface

The figure above shows the FTBBank update write data interface. Data write is completed one cycle after receiving the write request.

Functional Overview

As mentioned above, the FTBBank primarily stores FTB entries and is a simple encapsulation of an SRAM module.

Brief Description of FTB Entry Generation Conditions

The FTB is the core of the BPU. All predictions made by other prediction components of the BPU rely on the information provided by the FTB. In addition to providing information about branch instructions within the predicted block, the FTB also provides the end address of the predicted block. For the FTB, the strategy for generating FTB entries is crucial. The Nanhu architecture, based on the original paper [1], combined with the ideas from paper [2], has formed the current strategy. Let the start address of an FTB entry be start and the end address be end. The specific strategy is as follows:

The FTB entry is indexed by start. start is generated in the prediction pipeline. In practice, start basically follows one of the following principles:
- start is the end of the previous predicted block.
- start is the target address of a re-direction from outside the BPU.
At most two branch instructions are recorded within an FTB entry, where the first one must be a conditional branch.
end must satisfy one of three conditions:
- end - start = prediction width
- end is the PC of the third branch instruction within the prediction width starting from start
- end is the PC of the instruction after an unconditional jump branch, and it is within the prediction width starting from start

Under this training strategy, the same branch instruction may exist in multiple FTB entries.

Similar to the implementation in paper 1, we only store the lower bits of the end address, and the higher bits are obtained by concatenating with the higher bits of the start address. Similar to AMD's 3 approach, we also record an "always taken" bit for conditional branch instructions in the FTB entry. This bit is set to 1 the first time this conditional branch is encountered and taken. When its value is 1, the direction of this conditional branch is always predicted as taken, and its result is not used to train the conditional branch direction predictor. When this conditional branch encounters an execution result of not taken once, this bit is set to 0, and afterwards its direction is predicted by the conditional branch direction predictor.

FTB Storage Structure

The FTB entry structure is as follows:

total	valid	brSlot	tailSlot	pftAddr	carry	isCall, isRet, isJalr	last_may_be_rvi_call	strong_bias
	Valid bit	First branch info	Second branch info	Predicted block end addr	Carry bit for end addr high bits	tailSlot branch type	Special bit for RAS identification	Strong bias
62	1	21	29	4	1	3	1	2

Composition of FTB slot: each slot corresponds to a branch instruction.

total	valid	offset	lower	tarStat	sharing	isRVC
	Valid bit	Offset relative to start PC	Lower bits of target addr	Whether high bits of target addr carry	(For tailSlot) Whether a conditional branch is stored	Whether it is a compressed instruction
21/29	1	4	12/20	2	1	1

FTB has 2048 entries, 4-way set associative. Each entry records at most 2 branches, where the first one must be a conditional branch, and the second one can be any type of branch instruction.

Target Address Generation Logic

For each slot, based on three possible high bit carry conditions (carry/borrow/no change), choose one from three cases (PC high bits +1, PC high bits -1, PC high bits), and concatenate with the stored lower bits of the target address information.

Update Flow

Entry Generation

1.1 Read necessary information from FTQ: - Start address startAddr - Old FTB entry old_entry read during prediction - Pre-decode information pd including all branch instructions within 32Bytes of the FTQ entry - Real jump result cfiIndex of valid instructions within this FTQ entry, including whether it jumped, and the offset of the jump instruction relative to startAddr - Jump address (execution result) of branch instructions (like jumps) within this FTQ entry - Whether FTB actually hit during prediction (whether the old FTB entry was valid) - Misprediction mask for all possible instructions within the corresponding FTQ entry

1.2 FTB Entry Generation Logic: - Case 1: FTB miss or error exists 1) Unconditional jump instruction processing: - Regardless of whether it was executed, it will definitely be written to the tailSlot of the new FTB entry. - If the finally jumped instruction within the FTQ entry is a conditional branch instruction, write it to the first brSlot of the new FTB entry, and set the corresponding always_taken bit to 1. 2) pftAddr setting: - When an unconditional jump instruction exists: Set with the end address of the first unconditional jump instruction. - When no unconditional jump instruction exists: Set with startAddr + fetch width (32B). - Special case: When the start address of the first 4-Byte wide unconditional jump instruction is at startAddr+30, although the end address is outside the fetch width range, still set as startAddr+32. 3) carry bit is set simultaneously based on pftAddr conditions. 4) Set branch type flags: - isJalr, isCall, isRet are set according to the type of the first unconditional jump instruction. - Special flag: If and only if the start address of the first 4-Byte wide unconditional jump instruction is at startAddr+30 and the instruction is of type call, set the last_may_be_rvi_call bit.
- Case 2: FTB hit and no error 1) Insert new conditional branch:
  - When there is an empty position: a) tailSlot has an unconditional jump: The new conditional branch must be instruction sequence earlier than this unconditional jump, directly insert into brSlot. b) brSlot has a conditional branch: Arrange by instruction sequence, keep the branch in brSlot within the FTB entry instruction sequence earlier than tailSlot. c) pftAddr does not need to be modified in the above cases.
  - When there is no empty position: a) tailSlot has an unconditional jump:
    - The new conditional branch instruction sequence is definitely earlier than this unconditional jump.
    - The new conditional branch replaces the position of the unconditional jump.
    - pftAddr is set according to the PC of the unconditional jump. b) tailSlot has a conditional branch: i) The new conditional branch instruction sequence is earlier than the existing conditional branch instruction sequence within tailSlot:
      - Arrange the conditional branch in brSlot and the new conditional branch by instruction sequence.
      - pftAddr is set according to the PC of the original branch within tailSlot. ii) The new conditional branch instruction sequence is after all existing branch instructions:
      - The slot does not change.
      - Only set pftAddr according to the PC of the new conditional branch. 2) Update jalr jump address information:
  - When tailSlot originally recorded jalr (RISC-V unconditional indirect jump instruction)
  - When the jump address changes, modify the corresponding target address lower bits and high bit carry information recorded within tailSlot. 3) Update always_taken bit:
  - If always_taken bit is 1, and the execution result of the corresponding conditional branch this time is not taken, deassert the always_taken bit.
Write to SRAM

2.1 Write Conditions:
- The new FTB entry has no change at all, or although FTB missed but uFTB hit: No write needed.
- The new FTB entry has changed and it is not the case where uFTB hit and FTB missed: Write is needed.
2.2 Write Flow: - Case 1: Prediction miss 1) First use one cycle to perform an FTB read to determine the hit situation at this time. 2) If hit, write to the corresponding way. 3) If still not hit, select a way to write according to the replacement algorithm. 4) The entire flow requires 3 clock cycles. 5) During the process, due to FTB read/write port occupancy, requires FTQ cooperation to not issue new update requests. - Note: Banking may increase update bandwidth. - Case 2: Hit during prediction (including cases where the hit entry contains error information) 1) Directly write to the corresponding way, no need to read out again.
Pipeline diagram for writing to SRAM is as follows: