Kunming Lake FTQ Module Documentation
Terminology Explanation
Table 1.1 Terminology Explanation
| Abbreviation | Full Name | Description |
| --- | --- | --- |
| CRU | Clock Reset Unit | Clock Reset Unit |
| FTQ | Fetch Target Queue | Fetch Target Queue |
| FTB | Fetch Target Buffer | Fetch Target Buffer |
Functional Description
Functional Overview
FTQ is a buffer queue between the branch prediction unit and the instruction fetch unit. Its main function is to temporarily store the fetch targets predicted by the BPU and send fetch requests to the IFU based on these fetch targets. Another important function is to temporarily store the prediction information from various BPU predictors and send this information back to the BPU for training after instruction commit. Therefore, it needs to maintain the complete lifecycle of instructions from prediction to commit.
- Supports temporary storage of BPU-predicted fetch targets and sending fetch requests to the IFU.
- Supports temporary storage of BPU prediction information and sending it back to the BPU for training.
- Supports redirection recovery.
- Supports sending prefetch requests to the ICache.
Temporarily Storing BPU-Predicted Fetch Targets and Sending Fetch Requests to the IFU
Temporarily Storing BPU-Predicted Fetch Targets
Structure for Storing PC
A BPU prediction goes through three pipeline stages, and each stage generates new prediction content. FTQ receives the prediction results from each BPU pipeline stage, and the results from later stages overwrite those from earlier stages.
Instructions are issued from the BPU in prediction blocks and enter the FTQ, and the `bpuPtr` increments. The corresponding FTQ entry's status is initialized, and various prediction information is written into the storage structures. If the prediction block comes from the BPU's overwrite prediction logic, the `bpuPtr` and `ifuPtr` are restored.
The fetch targets predicted by the BPU are temporarily stored in `ftq_pc_mem`:

- `ftq_pc_mem`: Implemented as a register file, used to store information related to instruction addresses, including the following fields:
  - `startAddr`: Start address of the prediction block.
  - `nextLineAddr`: Start address of the next cache line for the prediction block.
  - `isNextMask`: Indicates, for each possible instruction start position within the prediction block, whether it falls in the next region aligned to the prediction width. `isNextMask` is 16 bits; bit n indicates whether the position at `2byte * n` relative to the start address crosses into the next cache line.
  - `fallThruError`: Indicates whether there is an error in the predicted next sequential fetch address.

Each field is kept in its own register (e.g., `data_0_startAddr`); the fields are not concatenated into a single register.
Method for Calculating PC
Each fetch from the ICache retrieves one or two CacheLineSize (64 Bytes) cache lines of instruction data; whether two are fetched depends on whether the prediction block crosses a cache line.
Each prediction block has a length of PredictWidth (16) compressed instructions (32 Bytes); each cache line is twice the length of a prediction block. Therefore, the `startAddr` of each prediction block is either in the first half of the current cache line (`startAddr[5]=0`) or in the second half (`startAddr[5]=1`).

If `startAddr[5]=0`, the current prediction block definitely does not cross a cache line. In this case, the predicted instruction PC is `{startAddr[38:6], startAddr[5:1] + offset, 1'b0}`.

If `startAddr[5]=1`, the current prediction block might cross a cache line. In this case:

- If `isNextMask(offset)=0`, the current predicted instruction PC does not cross the cache line. The predicted instruction PC is `{startAddr[38:6], startAddr[5:1] + offset, 1'b0}`.
- If `isNextMask(offset)=1`, the current predicted instruction PC crosses the cache line. The predicted instruction PC is `{nextLineAddr[38:6], startAddr[5:1] + offset, 1'b0}`.
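The address arithmetic above can be sketched as follows. This is a hypothetical Python model, not the RTL: it assumes a 39-bit virtual address (matching `startAddr[38:6]`), a 64-byte cache line, PredictWidth = 16 two-byte slots, and that the 5-bit sum `startAddr[5:1] + offset` wraps naturally inside the concatenation. `is_next_mask` recomputes the stored `isNextMask` bits on the fly.

```python
PREDICT_WIDTH = 16  # 16 possible 2-byte instruction start positions per block

def is_next_mask(start_addr: int) -> list:
    """Bit n: does the slot at start_addr + 2*n fall past the 64B cache line?"""
    return [((start_addr + 2 * n) >> 6) != (start_addr >> 6)
            for n in range(PREDICT_WIDTH)]

def inst_pc(start_addr: int, next_line_addr: int, offset: int) -> int:
    """PC of instruction slot `offset` (0..15) of a block, per the rules above."""
    low = ((start_addr >> 1) + offset) & 0x1F        # startAddr[5:1] + offset, 5 bits
    if (start_addr >> 5) & 1 and is_next_mask(start_addr)[offset]:
        high = next_line_addr >> 6                   # nextLineAddr[38:6]
    else:
        high = start_addr >> 6                       # startAddr[38:6]
    return (high << 6) | (low << 1)                  # {high, low, 1'b0}
```

For example, with `startAddr = 0x8000_007C` (second half of its cache line) and `nextLineAddr = 0x8000_0080`, slot 2 wraps into the next cache line and resolves to `0x8000_0080`.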
Sending Fetch Requests to the IFU
FTQ sends a fetch request to the IFU, the `ifuPtr` increments, and the FTQ waits for the pre-decode information to be written back.
The pre-decode information written back by the IFU is temporarily stored in `ftq_pd_mem`:

- `ftq_pd_mem`: Implemented as a register file, storing the decode information for each instruction within the prediction block returned by the fetch unit, including the following fields:
  - `brMask`: Whether each instruction is a conditional branch instruction.
  - `jmpInfo`: Information about the unconditional jump instruction at the end of the prediction block: whether it exists, whether it is a `jal` or `jalr`, and whether it is a `call` or `ret` instruction.
  - `jmpOffset`: Position of the unconditional jump instruction at the end of the prediction block.
  - `jalTarget`: Jump target of the `jal` at the end of the prediction block.
  - `rvcMask`: Whether each instruction is a compressed instruction.
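As an illustration of how these per-block fields relate to the instructions in a block, the sketch below derives `brMask`, `rvcMask`, and `jmpOffset` from a decoded instruction list. The `Inst` tuple and function names are illustrative, not the RTL interface.

```python
from typing import NamedTuple, Optional

class Inst(NamedTuple):
    is_br: bool    # conditional branch
    is_rvc: bool   # 16-bit compressed encoding
    is_jmp: bool   # unconditional jump (jal/jalr)

def predecode_fields(insts):
    """Derive block-level pre-decode fields from per-instruction decode info."""
    br_mask = [i.is_br for i in insts]     # brMask: one bit per instruction
    rvc_mask = [i.is_rvc for i in insts]   # rvcMask: one bit per instruction
    # jmpOffset: position of the unconditional jump, None if the block has none
    jmp_offset: Optional[int] = next(
        (n for n, i in enumerate(insts) if i.is_jmp), None)
    return br_mask, rvc_mask, jmp_offset
```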
Temporarily Storing BPU Prediction Information and Sending it Back to the BPU for Training
Temporarily Storing BPU Prediction Information
Besides `ftq_pc_mem` described above, the prediction information passed from the BPU to the FTQ is also stored in `ftq_redirect_sram`, `ftq_meta_1r_sram`, and `ftb_entry_mem`.

- `ftq_redirect_sram`: Implemented as SRAM, storing prediction information that needs to be restored during redirection, mainly information related to the RAS and branch history. It is divided into 3 banks, each with a depth × width of 64 × 236.
- `ftq_meta_1r_sram`: Implemented as SRAM, storing other BPU prediction information. The SRAM depth × width is 64 × 256.
- `ftb_entry_mem`: Implemented as a register file, storing the FTB entry information needed during prediction, used for training new FTB entries after commit. Why store `ftb_entry`? Because an update must modify the `ftb_entry` based on its original contents; to avoid re-reading the FTB, the entry is kept in `ftb_entry_mem`.
The specific implementation mechanisms for each SRAM/MEM in FTQ are shown in the table below:
| | Write Timing (Forward Write) | Update Timing (Backward Update, e.g., Redirection) | Read Timing | Written Data Content | Updated Data Content |
| --- | --- | --- | --- | --- | --- |
| ftq_pc_mem | BPU pipeline S1 stage, when creating a new prediction entry | Does not exist (in the current design, FTQ aggregates redirections and sends them to the BPU and IFU; when the BPU re-enqueues the prediction block redirected to the new address, the new block is written to `ftq_pc_mem`. Entries in `ftq_pc_mem` hold the addresses of the current prediction block and do not include the target, so a mispredicted block does not need to be updated) | Data is read into a Reg every clock cycle; if the IFU does not need to read from the bypass, the Reg data is connected directly to the ICache and IFU | `startAddr`: start address of the prediction block; `nextLineAddr`: start address of the next cache line; `isNextMask`: for each possible instruction start position, whether it falls in the next Predict-Width-aligned region (① if `isNextMask(offset) = 0`, the predicted instruction PC does not cross the cache line and is `{startAddr[38:6], startAddr[5:1] + offset, 1'b0}`; ② if `isNextMask(offset) = 1`, it crosses the cache line and is `{nextLineAddr[38:6], startAddr[5:1] + offset, 1'b0}`); `fallThruError`: whether there is an error in the predicted next sequential fetch address | None |
| ftq_meta_1r_sram | BPU pipeline S3 stage | | When the instructions in the FTQ entry can commit, the meta data is read out and sent to the BPU for training | The written data packet includes prediction information from 4 predictors | |
| ftb_entry_mem | BPU pipeline S3 stage | | 1. Backend redirection 2. IFU writes back pre-decode information 3. IFU pre-decode detects an error and sends a redirection | `BrSlot`: brSlot_offset/lower/tarStat/sharing/valid; `TailSlot`: tailSlot_offset/lower/tarStat/sharing/valid; `pftAddr`, `carry`, `isCall`, `isRet`, `isJalr`... | |
| ftq_pd_mem | The cycle after the IFU F3 pipeline stage | | Always reads the data at `commPtr` as the address, assigned to `ftbEntryGen` | `rvcMask`, `brMask`, `jmpInfo`, `jmpOffset`, `jalTarget` | |
Sending Back to BPU for Training
When instructions commit in the backend, they notify the FTQ. When all valid instructions in an FTQ entry have committed in the backend, the `commPtr` increments, and the corresponding information is read from the storage structures and sent to the BPU for training.
In Kunming Lake V2, a `commitStateQueue` was used to record the commit status of the instructions within an FTQ entry; each bit recorded whether the corresponding instruction had committed. Note that this design was incomplete and violated the original intent of BPU updates; in V3 the mechanism has been removed entirely, in conjunction with the backend.
Because the V2 backend re-compressed FTQ entries in the ROB, it was not guaranteed that every instruction in an entry would be committed, nor that every entry would have any instruction committed. An entry was considered committed in either of the following cases:

- `robCommPtr` is ahead of `commPtr`. This means the backend has already started committing instructions from subsequent entries, so entries before the one pointed to by `robCommPtr` must have completed commit.
- The last instruction in the `commitStateQueue` is committed. Committing the last instruction of an entry meant the whole entry was committed.
Furthermore, it was necessary to consider that the backend might issue self-flushing redirect requests, meaning the instruction itself had to be re-executed (exceptions, load replay, and similar cases). Such entries should not be committed to update the BPU, as doing so would significantly decrease BPU accuracy.
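The two V2 commit conditions above can be sketched as a simple predicate. This is an illustrative model only, with hypothetical names; pointer wrap-around and the self-flush filter are deliberately simplified away.

```python
def entry_fully_committed(comm_ptr: int, rob_comm_ptr: int,
                          commit_state: list, last_valid_idx: int) -> bool:
    """V2-style check of whether the FTQ entry at comm_ptr is fully committed."""
    # Case 1: the backend has moved on to committing later entries, so the
    # entry at comm_ptr must already be fully committed.
    if rob_comm_ptr > comm_ptr:
        return True
    # Case 2: the last valid instruction of this entry has committed.
    return bool(commit_state[last_valid_idx])
```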
Redirection Recovery
After each prediction, the top entry of the RAS stack and the stack pointer are stored in the FTQ's `ftq_redirect_sram`, and the BPU global history used is stored in the FTQ, for misprediction recovery.
Pre-decode Detects Prediction Error
After FTQ sends a fetch request to the IFU, the IFU writes back pre-decode information to the FTQ, and the `ifuWbPtr` increments. If pre-decode detects a prediction error, a corresponding redirection request is sent to the BPU, and FTQ restores the `bpuPtr` and `ifuPtr` based on the `ftqIdx` in the redirection signal.
Backend Detects Misprediction
If the backend detects a misprediction during instruction execution, it notifies the FTQ. FTQ sends the corresponding redirection requests to the IFU and BPU. At the same time, FTQ restores the `bpuPtr`, `ifuPtr`, and `ifuWbPtr` based on the `ftqIdx` in the redirection signal.
To read the redirection data stored in the FTQ one cycle earlier and reduce the redirection penalty, the backend transmits the `ftqIdxAhead` and `ftqIdxSelOH` signals to the FTQ one cycle before the formal backend redirect signal. However, the backend cannot obtain the accurate `ftqIdx` a cycle early: it requires arbitration among the 4 ALU paths, and the arbitration result is only available when the formal backend redirect signal is valid. Therefore, the FTQ must read data for all four of the early redirect's `ftqIdx` values.
- `io.fromBackend.ftqIdxAhead`: 7 × `FtqIdx`. Indicates the FTQ index where the prediction block needing redirection is stored. There are 7 because the backend has 7 paths that can generate a redirect signal before final arbitration: Jump * 1, Alu * 4, LdReplay * 1, Exception * 1. However, only the redirect signals generated by the Alu * 4 paths are read early, so only 4 of the `FtqIdx` entries in `ftqIdxAhead` are actually used.
- `io.fromBackend.ftqIdxSelOH`: 4-bit one-hot + valid, indicating which of the 4 `ftqIdxAhead` paths is valid, active high.
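The early-read-then-select scheme above can be sketched as follows. This is an illustrative Python model with hypothetical names, not the RTL: the FTQ reads the stored data for all four ALU-path indices a cycle early, then picks one with the one-hot select once arbitration resolves.

```python
def early_redirect_read(mem, ftq_idx_ahead, ftq_idx_sel_oh):
    """Speculatively read all 4 ALU-path entries, then one-hot select."""
    # Read for every ALU path (done one cycle before arbitration resolves).
    early_data = [mem[idx] for idx in ftq_idx_ahead]
    # One-hot select once the formal redirect (and its arbitration) is valid.
    for i, sel in enumerate(ftq_idx_sel_oh):
        if sel:
            return early_data[i]
    return None  # no valid redirect this cycle
```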
Sending Prefetch Requests to the ICache
Since the BPU is largely non-blocking, it can often get ahead of the IFU. Therefore, the FTQ implements the logic to use the fetch requests provided by the BPU that have not yet been sent to the IFU for instruction prefetching, sending prefetch requests directly to the instruction cache.
Overall Block Diagram
Interface Timing
- BPU to FTQ Interface Timing
The figure above illustrates the interface timing for prediction results from the BPU to the FTQ. When the corresponding handshake signals `io_fromBpu_resp_valid` and `io_fromBpu_resp_ready` are both high, the prediction results from the three BPU pipeline stages are input to the FTQ at pipeline stages 1, 2, and 3, respectively.
If the prediction result from a later BPU pipeline stage is inconsistent with a previous stage, the corresponding redirect signal `io_fromBpu_resp_bits_s2_hasRedirect_4` or `io_fromBpu_resp_bits_s3_hasRedirect_4` will be asserted, indicating that the prediction pipeline needs to be flushed.
Functional Description
As described in the functional overview, FTQ is a buffer queue between the branch prediction unit and the instruction fetch unit: it temporarily stores the fetch targets predicted by the BPU, sends fetch requests to the IFU based on them, stores the prediction information from the various BPU predictors, and sends that information back to the BPU for training after instruction commit, maintaining the complete lifecycle of instructions from prediction to commit. In addition, since storing PCs in the backend has high overhead, the backend reads instruction PCs from the FTQ when needed.
Internal Structure
FTQ has 64 entries and is a queue structure. However, the content of each entry is stored in different storage structures according to its characteristics. These storage structures primarily include the following:

- `ftq_pc_mem`: Implemented as a register file, storing information related to instruction addresses, including the following fields:
  - `startAddr`: Start address of the prediction block.
  - `nextLineAddr`: Start address of the next cache line for the prediction block.
  - `isNextMask`: Indicates, for each possible instruction start position within the prediction block, whether it falls in the next region aligned to the prediction width.
  - `fallThruError`: Indicates whether there is an error in the predicted next sequential fetch address.
- `ftq_pd_mem`: Implemented as a register file, storing the decode information for each instruction within the prediction block returned by the fetch unit, including the following fields:
  - `brMask`: Whether each instruction is a conditional branch instruction.
  - `jmpInfo`: Information about the unconditional jump instruction at the end of the prediction block: whether it exists, whether it is a `jal` or `jalr`, and whether it is a `call` or `ret` instruction.
  - `jmpOffset`: Position of the unconditional jump instruction at the end of the prediction block.
  - `jalTarget`: Jump target of the `jal` at the end of the prediction block.
  - `rvcMask`: Whether each instruction is a compressed instruction.
- `ftq_redirect_sram`: Implemented as SRAM, storing prediction information that needs to be restored during redirection, mainly information related to the RAS and branch history.
- `ftq_meta_1r_sram`: Implemented as SRAM, storing other BPU prediction information.
- `ftb_entry_mem`: Implemented as a register file, storing the FTB entry information needed during prediction, used for training new FTB entries after commit.
Additionally, information such as queue pointers, the status of each item in the queue, etc., are implemented using registers.
Instruction Lifecycle in FTQ
Instructions are issued from the BPU in prediction blocks and enter the FTQ. They remain in the FTQ until all instructions within the prediction block have committed in the backend; only then does the FTQ fully release the entry corresponding to that prediction block in the storage structures. The events during this process are as follows:

- A prediction block is issued from the BPU and enters the FTQ. The `bpuPtr` increments, the status of the corresponding FTQ entry is initialized, and various prediction information is written into the storage structures. If the prediction block comes from the BPU's overwrite prediction logic, the `bpuPtr` and `ifuPtr` are restored.
- FTQ sends a fetch request to the IFU. The `ifuPtr` increments, and the FTQ waits for the pre-decode information to be written back.
- The IFU writes back pre-decode information. The `ifuWbPtr` increments. If pre-decode detects a prediction error, a corresponding redirection request is sent to the BPU, and the `bpuPtr` and `ifuPtr` are restored.
- Instructions enter backend execution. If the backend detects a misprediction, it notifies the FTQ, which sends redirection requests to the IFU and BPU and restores the `bpuPtr`, `ifuPtr`, and `ifuWbPtr`.
- Instructions commit in the backend and notify the FTQ. When all valid instructions in the FTQ entry have committed, the `commPtr` increments, and the corresponding information is read from the storage structures and sent to the BPU for training.
The lifecycle of the instructions in prediction block n involves the four FTQ pointers `bpuPtr`, `ifuPtr`, `ifuWbPtr`, and `commPtr`. When `bpuPtr` starts pointing to n+1, the instructions within the prediction block enter their lifecycle; when `commPtr` points to n+1, they complete it.
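The pointer events above can be summarized in a toy model. This is illustrative only: it ignores the wrap-around flag bits, redirection recovery, and full/empty checks that the real queue needs.

```python
class FtqPointers:
    """Toy model of the four FTQ queue pointers; each event advances one."""

    def __init__(self, size: int = 64):
        self.size = size
        self.bpuPtr = self.ifuPtr = self.ifuWbPtr = self.commPtr = 0

    def enqueue(self):      # prediction block issued from the BPU
        self.bpuPtr = (self.bpuPtr + 1) % self.size

    def send_fetch(self):   # fetch request sent to the IFU
        self.ifuPtr = (self.ifuPtr + 1) % self.size

    def writeback(self):    # IFU writes back pre-decode information
        self.ifuWbPtr = (self.ifuWbPtr + 1) % self.size

    def commit(self):       # all valid instructions of the entry committed
        self.commPtr = (self.commPtr + 1) % self.size
```

An entry n is live from the moment `bpuPtr` advances past it until `commPtr` advances past it, matching the lifecycle boundaries described above.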
Other Functions of FTQ
Since the BPU is largely non-blocking, it can often get ahead of the IFU. Therefore, the fetch requests provided by the BPU that have not yet been sent to the IFU can be used for instruction prefetching. FTQ implements this logic, sending prefetch requests directly to the instruction cache.