Kunming Lake FTQ Module Documentation
Terminology Explanation
Table 1.1 Terminology Explanation
| Abbreviation | Full Name | Description |
| --- | --- | --- |
| CRU | Clock Reset Unit | Clock Reset Unit |
| FTQ | Fetch Target Queue | Fetch Target Queue |
| FTB | Fetch Target Buffer | Fetch Target Buffer |
Functional Description
Functional Overview
FTQ is a buffer queue between the branch prediction unit and the instruction fetch unit. Its main function is to temporarily store the fetch targets predicted by the BPU and send fetch requests to the IFU based on these fetch targets. Another important function is to temporarily store the prediction information from various BPU predictors and send this information back to the BPU for training after instruction commit. Therefore, it needs to maintain the complete lifecycle of instructions from prediction to commit.
- Supports temporary storage of BPU-predicted fetch targets and sending fetch requests to the IFU.
- Supports temporary storage of BPU prediction information and sending it back to the BPU for training.
- Supports redirection recovery.
- Supports sending prefetch requests to the ICache.
Temporarily Storing BPU-Predicted Fetch Targets and Sending Fetch Requests to the IFU
Temporarily Storing BPU-Predicted Fetch Targets
Structure for Storing PC
A BPU prediction goes through three pipeline stages, and each stage generates new prediction content. FTQ receives the prediction results from each BPU pipeline stage, and the results from later stages overwrite those from earlier stages.
Instructions are issued from the BPU in prediction blocks and enter the FTQ, and the `bpuPtr` increments. The corresponding FTQ entry's status is initialized, and various prediction information is written into the storage structures. If the prediction block comes from the BPU's overwrite prediction logic, the `bpuPtr` and `ifuPtr` are restored.
The fetch targets predicted by the BPU are temporarily stored in `ftq_pc_mem`:

- `ftq_pc_mem`: Implemented as a register file, used to store information related to instruction addresses, including the following fields:
  - `startAddr`: Start address of the prediction block.
  - `nextLineAddr`: Start address of the next cache line for the prediction block.
  - `isNextMask`: Indicates, for each possible instruction start position within the prediction block, whether it falls in the next region aligned to the prediction width. `isNextMask` is 16 bits; bit n indicates whether the position at `2byte * n` relative to the start address crosses into the next cache line.
  - `fallThruError`: Indicates whether there is an error in the predicted next sequential fetch address.

Each field is kept in its own register (e.g., `data_0_startAddr`); the fields are not concatenated into a single register.
Method for Calculating PC
Each fetch from the ICache retrieves one or two CacheLineSize (64 Bytes) cache lines of instruction data; whether two are fetched depends on whether the prediction block crosses a cache line.
Each prediction block has a length of PredictWidth (16) compressed instructions (32 Bytes); each cache line is twice the length of a prediction block. Therefore, the `startAddr` of each prediction block is either in the first half of the current cache line (`startAddr[5]=0`) or in the second half (`startAddr[5]=1`).

If `startAddr[5]=0`, the current prediction block definitely does not cross a cache line. In this case, the predicted instruction PC is `{startAddr[38:6], startAddr[5:1] + offset, 1'b0}`.

If `startAddr[5]=1`, the current prediction block might cross a cache line. In this case:

- If `isNextMask(offset)=0`, the current predicted instruction PC does not cross the cache line. The predicted instruction PC is `{startAddr[38:6], startAddr[5:1] + offset, 1'b0}`.
- If `isNextMask(offset)=1`, the current predicted instruction PC crosses the cache line. The predicted instruction PC is `{nextLineAddr[38:6], startAddr[5:1] + offset, 1'b0}`.
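The address arithmetic above can be sketched as follows. This is a hypothetical Python model, not the RTL: it assumes a 39-bit virtual address (matching `startAddr[38:6]`), a 64-byte cache line, PredictWidth = 16 two-byte slots, and that the 5-bit sum `startAddr[5:1] + offset` wraps naturally inside the concatenation. `is_next_mask` recomputes the stored `isNextMask` bits on the fly.

```python
PREDICT_WIDTH = 16  # 16 possible 2-byte instruction start positions per block

def is_next_mask(start_addr: int) -> list:
    """Bit n: does the slot at start_addr + 2*n fall past the 64B cache line?"""
    return [((start_addr + 2 * n) >> 6) != (start_addr >> 6)
            for n in range(PREDICT_WIDTH)]

def inst_pc(start_addr: int, next_line_addr: int, offset: int) -> int:
    """PC of instruction slot `offset` (0..15) of a block, per the rules above."""
    low = ((start_addr >> 1) + offset) & 0x1F        # startAddr[5:1] + offset, 5 bits
    if (start_addr >> 5) & 1 and is_next_mask(start_addr)[offset]:
        high = next_line_addr >> 6                   # nextLineAddr[38:6]
    else:
        high = start_addr >> 6                       # startAddr[38:6]
    return (high << 6) | (low << 1)                  # {high, low, 1'b0}
```

For example, with `startAddr = 0x8000_007C` (second half of its cache line) and `nextLineAddr = 0x8000_0080`, slot 2 wraps into the next cache line and resolves to `0x8000_0080`.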
Sending Fetch Requests to the IFU
FTQ sends a fetch request to the IFU, the `ifuPtr` increments, and the FTQ waits for the pre-decode information to be written back.
The pre-decode information written back by the IFU is temporarily stored in `ftq_pd_mem`:

- `ftq_pd_mem`: Implemented as a register file, storing the decode information for each instruction within the prediction block returned by the fetch unit, including the following fields:
  - `brMask`: Whether each instruction is a conditional branch instruction.
  - `jmpInfo`: Information about the unconditional jump instruction at the end of the prediction block: whether it exists, whether it is a `jal` or `jalr`, and whether it is a `call` or `ret` instruction.
  - `jmpOffset`: Position of the unconditional jump instruction at the end of the prediction block.
  - `jalTarget`: Jump target of the `jal` at the end of the prediction block.
  - `rvcMask`: Whether each instruction is a compressed instruction.
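As an illustration of how these per-block fields relate to the instructions in a block, the sketch below derives `brMask`, `rvcMask`, and `jmpOffset` from a decoded instruction list. The `Inst` tuple and function names are illustrative, not the RTL interface.

```python
from typing import NamedTuple, Optional

class Inst(NamedTuple):
    is_br: bool    # conditional branch
    is_rvc: bool   # 16-bit compressed encoding
    is_jmp: bool   # unconditional jump (jal/jalr)

def predecode_fields(insts):
    """Derive block-level pre-decode fields from per-instruction decode info."""
    br_mask = [i.is_br for i in insts]     # brMask: one bit per instruction
    rvc_mask = [i.is_rvc for i in insts]   # rvcMask: one bit per instruction
    # jmpOffset: position of the unconditional jump, None if the block has none
    jmp_offset: Optional[int] = next(
        (n for n, i in enumerate(insts) if i.is_jmp), None)
    return br_mask, rvc_mask, jmp_offset
```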
Temporarily Storing BPU Prediction Information and Sending it Back to the BPU for Training
Temporarily Storing BPU Prediction Information
Besides `ftq_pc_mem` described above, the prediction information passed from the BPU to the FTQ is also stored in `ftq_redirect_sram`, `ftq_meta_1r_sram`, and `ftb_entry_mem`.

- `ftq_redirect_sram`: Implemented as SRAM, storing prediction information that needs to be restored during redirection, mainly information related to the RAS and branch history. It is divided into 3 banks, each with a depth × width of 64 × 236.
- `ftq_meta_1r_sram`: Implemented as SRAM, storing other BPU prediction information. The SRAM depth × width is 64 × 256.
- `ftb_entry_mem`: Implemented as a register file, storing the FTB entry information needed during prediction, used for training new FTB entries after commit. Why store `ftb_entry`? Because an update must modify the `ftb_entry` based on its original contents; to avoid re-reading the FTB, the entry is kept in `ftb_entry_mem`.
The specific implementation mechanisms for each SRAM/MEM in FTQ are shown in the table below:
| | Write Timing (Forward Write) | Update Timing (Backward Update, e.g., Redirection) | Read Timing | Written Data Content | Updated Data Content |
| --- | --- | --- | --- | --- | --- |
| ftq_pc_mem | BPU pipeline S1 stage, when creating a new prediction entry | Does not exist (in the current design, FTQ aggregates redirections and sends them to the BPU and IFU; when the BPU re-enqueues the prediction block redirected to the new address, the new block is written to `ftq_pc_mem`. Entries in `ftq_pc_mem` hold the addresses of the current prediction block and do not include the target, so a mispredicted block does not need to be updated) | Data is read into a Reg every clock cycle; if the IFU does not need to read from the bypass, the Reg data is connected directly to the ICache and IFU | `startAddr`: start address of the prediction block; `nextLineAddr`: start address of the next cache line; `isNextMask`: for each possible instruction start position, whether it falls in the next Predict-Width-aligned region (① if `isNextMask(offset) = 0`, the predicted instruction PC does not cross the cache line and is `{startAddr[38:6], startAddr[5:1] + offset, 1'b0}`; ② if `isNextMask(offset) = 1`, it crosses the cache line and is `{nextLineAddr[38:6], startAddr[5:1] + offset, 1'b0}`); `fallThruError`: whether there is an error in the predicted next sequential fetch address | None |
| ftq_meta_1r_sram | BPU pipeline S3 stage | | When the instructions in the FTQ entry can commit, the meta data is read out and sent to the BPU for training | The written data packet includes prediction information from 4 predictors | |
| ftb_entry_mem | BPU pipeline S3 stage | | 1. Backend redirection 2. IFU writes back pre-decode information 3. IFU pre-decode detects an error and sends a redirection | `BrSlot`: brSlot_offset/lower/tarStat/sharing/valid; `TailSlot`: tailSlot_offset/lower/tarStat/sharing/valid; `pftAddr`, `carry`, `isCall`, `isRet`, `isJalr`... | |
| ftq_pd_mem | The cycle after the IFU F3 pipeline stage | | Always reads the data at `commPtr` as the address, assigned to `ftbEntryGen` | `rvcMask`, `brMask`, `jmpInfo`, `jmpOffset`, `jalTarget` | |
Sending Back to BPU for Training
When instructions commit in the backend, they notify the FTQ. When all valid instructions in an FTQ entry have committed in the backend, the `commPtr` increments, and the corresponding information is read from the storage structures and sent to the BPU for training.
In Kunming Lake V2, a `commitStateQueue` was used to record the commit status of the instructions within an FTQ entry; each bit recorded whether the corresponding instruction had committed. Note that this design was incomplete and violated the original intent of BPU updates; in V3 the mechanism has been removed entirely, in conjunction with the backend.
Because the V2 backend re-compressed FTQ entries in the ROB, it was not guaranteed that every instruction in an entry would be committed, nor that every entry would have any instruction committed. An entry was considered committed in either of the following cases:

- `robCommPtr` is ahead of `commPtr`. This means the backend has already started committing instructions from subsequent entries, so entries before the one pointed to by `robCommPtr` must have completed commit.
- The last instruction in the `commitStateQueue` is committed. Committing the last instruction of an entry meant the whole entry was committed.
Furthermore, it was necessary to consider that the backend might issue self-flushing redirect requests, meaning the instruction itself had to be re-executed (exceptions, load replay, and similar cases). Such entries should not be committed to update the BPU, as doing so would significantly decrease BPU accuracy.
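The two V2 commit conditions above can be sketched as a simple predicate. This is an illustrative model only, with hypothetical names; pointer wrap-around and the self-flush filter are deliberately simplified away.

```python
def entry_fully_committed(comm_ptr: int, rob_comm_ptr: int,
                          commit_state: list, last_valid_idx: int) -> bool:
    """V2-style check of whether the FTQ entry at comm_ptr is fully committed."""
    # Case 1: the backend has moved on to committing later entries, so the
    # entry at comm_ptr must already be fully committed.
    if rob_comm_ptr > comm_ptr:
        return True
    # Case 2: the last valid instruction of this entry has committed.
    return bool(commit_state[last_valid_idx])
```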
Redirection Recovery
After each prediction, the top entry of the RAS stack and the stack pointer are stored in the FTQ's `ftq_redirect_sram`, and the BPU global history used is stored in the FTQ, for misprediction recovery.
Pre-decode Detects Prediction Error
After FTQ sends a fetch request to the IFU, the IFU writes back pre-decode information to the FTQ, and the `ifuWbPtr` increments. If pre-decode detects a prediction error, a corresponding redirection request is sent to the BPU, and FTQ restores the `bpuPtr` and `ifuPtr` based on the `ftqIdx` in the redirection signal.
Backend Detects Misprediction
If the backend detects a misprediction during instruction execution, it notifies the FTQ. FTQ sends the corresponding redirection requests to the IFU and BPU. At the same time, FTQ restores the `bpuPtr`, `ifuPtr`, and `ifuWbPtr` based on the `ftqIdx` in the redirection signal.
To read the redirection data stored in the FTQ one cycle earlier and reduce the redirection penalty, the backend transmits the `ftqIdxAhead` and `ftqIdxSelOH` signals to the FTQ one cycle before the formal backend redirect signal. However, the backend cannot obtain the accurate `ftqIdx` a cycle early: it requires arbitration among the 4 ALU paths, and the arbitration result is only available when the formal backend redirect signal is valid. Therefore, the FTQ must read data for all four of the early redirect's `ftqIdx` values.
- `io.fromBackend.ftqIdxAhead`: 7 × `FtqIdx`. Indicates the FTQ index where the prediction block needing redirection is stored. There are 7 because the backend has 7 paths that can generate a redirect signal before final arbitration: Jump * 1, Alu * 4, LdReplay * 1, Exception * 1. However, only the redirect signals generated by the Alu * 4 paths are read early, so only 4 of the `FtqIdx` entries in `ftqIdxAhead` are actually used.
- `io.fromBackend.ftqIdxSelOH`: 4-bit one-hot + valid, indicating which of the 4 `ftqIdxAhead` paths is valid, active high.
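The early-read-then-select scheme above can be sketched as follows. This is an illustrative Python model with hypothetical names, not the RTL: the FTQ reads the stored data for all four ALU-path indices a cycle early, then picks one with the one-hot select once arbitration resolves.

```python
def early_redirect_read(mem, ftq_idx_ahead, ftq_idx_sel_oh):
    """Speculatively read all 4 ALU-path entries, then one-hot select."""
    # Read for every ALU path (done one cycle before arbitration resolves).
    early_data = [mem[idx] for idx in ftq_idx_ahead]
    # One-hot select once the formal redirect (and its arbitration) is valid.
    for i, sel in enumerate(ftq_idx_sel_oh):
        if sel:
            return early_data[i]
    return None  # no valid redirect this cycle
```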
Sending Prefetch Requests to the ICache
Since the BPU is largely non-blocking, it can often get ahead of the IFU. Therefore, the FTQ implements the logic to use the fetch requests provided by the BPU that have not yet been sent to the IFU for instruction prefetching, sending prefetch requests directly to the instruction cache.
Overall Block Diagram
Interface Timing
- BPU to FTQ Interface Timing
The figure above illustrates the interface timing for prediction results from the BPU to the FTQ. When the corresponding handshake signals `io_fromBpu_resp_valid` and `io_fromBpu_resp_ready` are both high, the prediction results from the three BPU pipeline stages are input to the FTQ at pipeline stages 1, 2, and 3, respectively.
If the prediction result from a later BPU pipeline stage is inconsistent with a previous stage, the corresponding redirect signal `io_fromBpu_resp_bits_s2_hasRedirect_4` or `io_fromBpu_resp_bits_s3_hasRedirect_4` will be asserted, indicating that the prediction pipeline needs to be flushed.
Functional Description
As described in the functional overview, FTQ is a buffer queue between the branch prediction unit and the instruction fetch unit: it temporarily stores the fetch targets predicted by the BPU, sends fetch requests to the IFU based on them, stores the prediction information from the various BPU predictors, and sends that information back to the BPU for training after instruction commit, maintaining the complete lifecycle of instructions from prediction to commit. In addition, since storing PCs in the backend has high overhead, the backend reads instruction PCs from the FTQ when needed.
Internal Structure
FTQ has 64 entries and is a queue structure. However, the content of each entry is stored in different storage structures according to its characteristics. These storage structures primarily include the following:

- `ftq_pc_mem`: Implemented as a register file, storing information related to instruction addresses, including the following fields:
  - `startAddr`: Start address of the prediction block.
  - `nextLineAddr`: Start address of the next cache line for the prediction block.
  - `isNextMask`: Indicates, for each possible instruction start position within the prediction block, whether it falls in the next region aligned to the prediction width.
  - `fallThruError`: Indicates whether there is an error in the predicted next sequential fetch address.
- `ftq_pd_mem`: Implemented as a register file, storing the decode information for each instruction within the prediction block returned by the fetch unit, including the following fields:
  - `brMask`: Whether each instruction is a conditional branch instruction.
  - `jmpInfo`: Information about the unconditional jump instruction at the end of the prediction block: whether it exists, whether it is a `jal` or `jalr`, and whether it is a `call` or `ret` instruction.
  - `jmpOffset`: Position of the unconditional jump instruction at the end of the prediction block.
  - `jalTarget`: Jump target of the `jal` at the end of the prediction block.
  - `rvcMask`: Whether each instruction is a compressed instruction.
- `ftq_redirect_sram`: Implemented as SRAM, storing prediction information that needs to be restored during redirection, mainly information related to the RAS and branch history.
- `ftq_meta_1r_sram`: Implemented as SRAM, storing other BPU prediction information.
- `ftb_entry_mem`: Implemented as a register file, storing the FTB entry information needed during prediction, used for training new FTB entries after commit.
Additionally, information such as queue pointers, the status of each item in the queue, etc., are implemented using registers.
Instruction Lifecycle in FTQ
Instructions are issued from the BPU in prediction blocks and enter the FTQ. They remain in the FTQ until all instructions within the prediction block have committed in the backend; only then does the FTQ fully release the entry corresponding to that prediction block in the storage structures. The events during this process are as follows:

- A prediction block is issued from the BPU and enters the FTQ. The `bpuPtr` increments, the status of the corresponding FTQ entry is initialized, and various prediction information is written into the storage structures. If the prediction block comes from the BPU's overwrite prediction logic, the `bpuPtr` and `ifuPtr` are restored.
- FTQ sends a fetch request to the IFU. The `ifuPtr` increments, and the FTQ waits for the pre-decode information to be written back.
- The IFU writes back pre-decode information. The `ifuWbPtr` increments. If pre-decode detects a prediction error, a corresponding redirection request is sent to the BPU, and the `bpuPtr` and `ifuPtr` are restored.
- Instructions enter backend execution. If the backend detects a misprediction, it notifies the FTQ, which sends redirection requests to the IFU and BPU and restores the `bpuPtr`, `ifuPtr`, and `ifuWbPtr`.
- Instructions commit in the backend and notify the FTQ. When all valid instructions in the FTQ entry have committed, the `commPtr` increments, and the corresponding information is read from the storage structures and sent to the BPU for training.
The lifecycle of the instructions in prediction block n involves the four FTQ pointers `bpuPtr`, `ifuPtr`, `ifuWbPtr`, and `commPtr`. When `bpuPtr` starts pointing to n+1, the instructions within the prediction block enter their lifecycle; when `commPtr` points to n+1, they complete it.
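The pointer events above can be summarized in a toy model. This is illustrative only: it ignores the wrap-around flag bits, redirection recovery, and full/empty checks that the real queue needs.

```python
class FtqPointers:
    """Toy model of the four FTQ queue pointers; each event advances one."""

    def __init__(self, size: int = 64):
        self.size = size
        self.bpuPtr = self.ifuPtr = self.ifuWbPtr = self.commPtr = 0

    def enqueue(self):      # prediction block issued from the BPU
        self.bpuPtr = (self.bpuPtr + 1) % self.size

    def send_fetch(self):   # fetch request sent to the IFU
        self.ifuPtr = (self.ifuPtr + 1) % self.size

    def writeback(self):    # IFU writes back pre-decode information
        self.ifuWbPtr = (self.ifuWbPtr + 1) % self.size

    def commit(self):       # all valid instructions of the entry committed
        self.commPtr = (self.commPtr + 1) % self.size
```

An entry n is live from the moment `bpuPtr` advances past it until `commPtr` advances past it, matching the lifecycle boundaries described above.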
Other Functions of FTQ
Since the BPU is largely non-blocking, it can often get ahead of the IFU. Therefore, the fetch requests provided by the BPU that have not yet been sent to the IFU can be used for instruction prefetching. FTQ implements this logic, sending prefetch requests directly to the instruction cache.