
Kunming Lake BPU Module Document

Terminology Explanation

Table 1.1 Terminology Explanation

Abbreviation | Full Name | Description
BPU | Branch Prediction Unit | Branch Prediction Unit
IFU | Instruction Fetch Unit | Instruction Fetch Unit
FTQ | Fetch Target Queue | Fetch Target Queue
uFTB | Micro Fetch Target Buffer | Branch Target Buffer
FTB | Fetch Target Buffer | Fetch Target Buffer
TAGE | TAgged GEometric length predictor | A conditional branch predictor
SC | Statistical Corrector predictor | A conditional branch predictor used to correct TAGE predictions in cases of statistical bias
ITTAGE | Indirect Target TAgged GEometric length predictor | A branch predictor that predicts the target address of indirect jump instructions
RAS | Return Address Stack | A branch predictor that predicts the target address of the return instruction paired with a call instruction

Design Specification

  1. Supports generating one branch prediction bundle and its corresponding predictor auxiliary information at a time.
  2. Supports simple prediction without bubbles.
  3. Supports multiple accurate predictors and override mechanisms.
  4. Supports training predictors.
  5. Supports branch history information maintenance and misprediction recovery.
  6. Supports top-down performance event statistics.

Function Description

Function Overview

The BPU module receives redirection signals from the backend execution unit and from subsequent pipeline stages outside the module. Based on these addresses, it uses multiple predictors to generate a prediction bundle starting from the current PC value, together with each predictor's internal meta information at the time the bundle was generated. This information is passed to the subsequent Fetch Target Queue (FTQ) for storage: the prediction bundle is consumed by the Instruction Fetch Unit (IFU), while the meta information is used later for training and restoring the predictors.

The BPU uses a fully associative uFTB as a next-line predictor, generating a simple bubble-free prediction result with a theoretical latency of only 1 cycle. This result is passed directly to the FTQ as output. At the same time, this basic prediction result flows down the BPU's subsequent pipeline stages, where the advanced prediction components use it to produce more accurate predictions. Whenever an advanced predictor in a later pipeline stage produces a result that differs from the existing one, the advanced predictor's result becomes the new output: it updates the prediction bundle already stored in the FTQ, redirects the s0-stage PC, and flushes the wrong-path results in the pipeline stages before the stage that produced the new result.

The information predicted also varies with the instruction type. The target address of a conditional branch instruction is provided by the uFTB, and its direction must be predicted. The target address of an unconditional direct jump instruction is provided by the uFTB, and no further prediction is required. The direction of an indirect jump instruction need not be predicted, but the target address provided by the uFTB is not necessarily correct, so the target must be predicted.

Advanced predictors within the BPU include FTB, TAGE-SC, ITTAGE, and RAS. Among these, the FTB is responsible for maintaining the start address, end address, PC addresses of contained branch instructions, types (whether branch, whether jalr, whether jal, whether call, whether return), and basic direction results of the prediction bundle. TAGE-SC is the main predictor for conditional branch instructions. ITTAGE is used to predict indirect jump instructions. RAS is responsible for predicting the jump address of return type indirect jump instructions.

Multiple predictors within the prediction unit use branch history as a prediction input. To improve the match between the history and the true execution trace, branch history is speculatively updated along with the prediction results. Global branch history is maintained at the top level of the BPU using multiple update sources, according to the history lengths required by TAGE, SC, and ITTAGE. The maintenance algorithm is the same for every history length. Specifically, the branch history lengths used by TAGE are 8, 13, 32, and 119; the history lengths used by ITTAGE are 4, 8, 13, 16, and 32; the history lengths used by SC are 0, 4, 10, and 16.

Branch prediction history is maintained uniformly at the top level of the BPU module. The update sources, in order of priority from low to high, are:

  • branch history temporarily stored on an s0 stall;
  • branch history updated using the s1 prediction result;
  • branch history updated using the s2 prediction result;
  • branch history updated using the s3 prediction result;
  • branch history redirected from outside the BPU.

The specific maintenance strategy for each branch history is:

Branch history temporarily stored by s0 stall: No active update is performed; it always remains consistent with the latest global folded history.

Branch history updated using the s1 prediction result: The global folded history passed along the pipeline is updated with the s1-stage branch prediction result. Specifically, the prediction result is shifted into the global branch history according to the slot the predicted branch occupies. If the branch is in slot 0, its direction bit is shifted in directly. If it is in slot 1, two bits are shifted in: a 0 for the not-taken branch in slot 0, followed by the slot-1 prediction result.
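The shift-in rule can be sketched behaviorally in Python. The integer encoding, the 64-bit width, and the newest-bit-in-the-LSB convention are modeling assumptions for illustration, not the RTL's actual representation:

```python
def update_ghist(hist, slot, taken, width=64):
    """Speculatively shift a prediction result into the global history.

    hist  : history as an integer, newest bit in the LSB (assumption)
    slot  : which branch slot (0 or 1) holds the branch being recorded
    taken : predicted direction of that branch
    """
    mask = (1 << width) - 1
    if slot == 0:
        # Slot 0: shift the single direction bit in directly.
        return ((hist << 1) | int(taken)) & mask
    # Slot 1: shift in a 0 for the not-taken slot-0 branch,
    # then the slot-1 prediction result.
    return ((hist << 2) | int(taken)) & mask
```

For example, a taken branch in slot 1 appends the two bits `0` then `1` to the history.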

Branch history updated using the s2 prediction result: The global folded history passed along the pipeline is updated with the s2-stage branch prediction result. The update algorithm is the same as for s1. The s2 update takes effect only when the s2 prediction result differs from the earlier s1 result.

Branch history updated using the s3 prediction result: The update strategy is the same as for s1 and s2. The update takes effect only when the s3 result differs from the earlier s1 or s2 result.

Branch history redirected from outside the BPU: This update occurs only on a redirect. Depending on the addIntoHist signal in the redirect information, the returned branch history is used either directly or after appending the resolved direction of the branch instruction that caused the redirect, and the result updates the BPU's global branch history.
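The two recovery cases selected by addIntoHist can be sketched as follows (the integer history encoding and 64-bit width are assumptions carried over from the earlier sketch):

```python
def recover_hist(redirect_hist, add_into_hist, br_taken, width=64):
    """On a redirect, either restore the history snapshot directly, or
    first append the resolved direction of the redirecting branch
    (the addIntoHist case)."""
    if add_into_hist:
        return ((redirect_hist << 1) | int(br_taken)) & ((1 << width) - 1)
    return redirect_hist
```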

To ensure relatively accurate prediction results, each branch predictor needs to be continuously trained using the latest execution results. Specifically, updated prediction bundles and the meta information of the internal state of each predictor when making that prediction will be generated within the FTQ module and passed back to the BPU unit for each predictor to update its internal state.

Branch prediction does not guarantee correctness. When the prediction result does not match the true state, the state needs to be recovered to before the state was updated using the erroneous prediction. This mainly involves the recovery of branch history and the redirection of the prediction bundle's start address.

Branch Prediction Bundle and Meta Information Generation

Branch Prediction Bundle Concept

The purpose of branch prediction is to predict the jump direction and target of branch instructions in the execution stream to speculatively generate subsequent fetch PC range information before actually executing the current instruction, thereby ensuring a continuous supply of instructions.

A branch prediction bundle contains information such as whether the bundle is valid (BranchPredictionBundle.valid), the starting address, the complete prediction result, FTB entry, folded branch history, RAS predictor stack top, etc. Among these, the complete prediction result comes from uFTB in the first pipeline stage, and subsequently from advanced predictors like FTB. The FTB entry comes from the uFTB and FTB read results.

The complete prediction result records, among other things:

  • the predicted jump direction of each branch instruction;
  • whether each recorded branch slot is valid (slot_valids, i.e., whether such a branch instruction exists; each valid bit corresponds to one slot, each slot to one branch instruction in the bundle, for a total of 2 slots, where the last slot may hold either the second conditional branch in the bundle or an unconditional jump/indirect jump instruction);
  • the branch instruction target and the jalr instruction target;
  • the offset of each branch instruction within the bundle;
  • the end address of the instruction bundle when no jump is taken, and whether that end address is incorrect (bundle start address greater than end address, indicating a false hit);
  • the type of the last branch instruction, whether the last instruction is an RVI call instruction, and whether the second branch slot holds a conditional branch rather than an unconditional/indirect jump instruction;
  • whether the lookup hit.

As mentioned before, a prediction bundle can contain at most 2 conditional branch instructions, or 1 conditional branch plus 1 unconditional jump instruction. If the number of branch instructions actually recorded in a prediction bundle exceeds this limit, the subsequent FTQ module will split it. In addition, a bundle is truncated when it reaches the maximum bundle width (32B), even if it contains no branch instructions or does not exceed the branch count limit.

The FTB entry records information such as whether the entry is valid (FTB uses direct mapping; if the read address has not been written, this flag is invalid), the first branch instruction slot, the branch/jump shared slot at the end, the end address, the instruction type, whether the last instruction is an RVI call instruction, whether it always jumps, and other information. The FTB entry can be used to generate the complete prediction result.

Branch Prediction Bundle Generation

The basic PC information in the branch prediction bundle is initially specified by the reset address passed from outside the module. Subsequently, during processor operation, it is continuously speculatively updated according to the prediction bundle's jump address in normal cases. When a misprediction is encountered, the PC value will be updated according to the value provided by the redirect channel. The next line's complete prediction result is read from the uFTB module. In the complete prediction result, the FTB entry is generated by the FTQ module based on past training results. The jump direction of conditional branch instructions is generated by the uFTB and subsequently updated by the TAGE-SC predictor. The jump address of indirect jump instructions is generated by the uFTB and subsequently updated by the ITTAGE predictor. The RAS predictor will override the ITTAGE prediction result for return type indirect jump instructions. Overridden results are only reflected in the predictor output and will not immediately feedback to the overridden predictor to update its internal state.

Meta Information Generation

To ease its own later updates, each predictor passes its internal state at prediction time (for example, which prediction table provided the result and the index that hit) down the pipeline as meta information alongside the prediction result.

Simple Bubble-Free Prediction

The uFTB serves as the BPU's next line predictor, making bubble-free basic predictions for the processor to continuously generate the next speculative PC value.

uFTB Request Reception

Whenever the stage 1 request is valid, bits 16 to 1 of the incoming prediction bundle's starting PC are extracted as a tag and sent to the fully associative uFTB storage inside this module to read an FTB entry. The contents of the FTB entry are as described earlier. The uFTB has 32 entries, built from registers in a fully associative structure. Because it is register-based, each entry can generate its own hit signal within the current cycle, based on whether its stored data is valid and its stored tag matches the incoming tag, and return its FTB entry data to the uFTB level.
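The tag extraction and fully associative match can be sketched in Python. The entry tuple layout is an illustrative assumption; only the PC bit slice (bits 16..1, dropping bit 0 since compressed instructions are 2-byte aligned) comes from the text:

```python
def uftb_tag(pc):
    """Derive the uFTB lookup tag from bits 16..1 of the start PC."""
    return (pc >> 1) & 0xFFFF   # 16 bits: pc[16:1]

def uftb_lookup(entries, pc):
    """Fully associative lookup.

    entries : list of (valid, tag, data) tuples (modeling assumption)
    returns : (hit, data) - every entry compares its tag in parallel
              in hardware; this loop models that comparison.
    """
    tag = uftb_tag(pc)
    for valid, etag, data in entries:
        if valid and etag == tag:
            return True, data
    return False, None
```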

uFTB Data Reading and Return

In the current cycle, the uFTB storage array has returned the hit signals and read-out data. In this stage, at most one hit entry is selected from the returned hit signals, and the prediction result is generated from that entry. The algorithm for generating the complete prediction result is described in detail in the later FTB module section. The uFTB adds a supplementary counter mechanism: each of the at most 2 branch instructions in a uFTB entry gets a 2-bit counter. If the counter is greater than 1, or the always_taken flag in the FTB entry is valid (this mechanism also exists in the FTB module), the prediction is taken. In addition, the hit signal of this stage and the selected hit way number are sent out as this predictor's meta information when entering the s3 stage, together with the other predictors' meta information, and stored in the FTQ alongside the final prediction result. The cycle count in the meta information is used only for performance statistics in simulation environments; it does not exist in FPGA environments. This predictor takes no other actions in stages 2 and 3.

uFTB Data Update

When all instructions corresponding to a prediction bundle have committed, the FTQ drives an update channel connected to this module, carrying the FTB entry that the FTQ module rebuilt from instruction commit information. Since the fully associative uFTB storage array is built entirely from registers, write operations do not disturb concurrent reads, so incoming update information is always applied. When the update channel is valid, the incoming update PC is used in the current cycle to generate a tag, which is matched against the existing uFTB entries to produce a match signal and the matching way. In the next cycle, if a match exists, the write signal of the matching way is asserted; otherwise a victim way is selected by the pseudo-LRU replacement algorithm and that way's write signal is asserted. The data written is the updated FTB entry.

The per-branch counters are also maintained when the update channel is asserted. In the cycle after the update channel is asserted, the counters of the taken branch instruction in the updated FTB entry and of any branch instruction before it are updated: a taken branch's counter is incremented by 1, a not-taken branch's counter is decremented by 1, and a saturated counter (0 or all 1s) keeps its current value.
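The 2-bit saturating counter and the taken decision described above (counter greater than 1, or always_taken set) can be sketched as:

```python
def update_ctr(ctr, taken, bits=2):
    """Saturating counter: increment on taken, decrement on not taken,
    hold at the saturation points (0 and all 1s)."""
    if taken:
        return min(ctr + 1, (1 << bits) - 1)
    return max(ctr - 1, 0)

def uftb_taken(ctr, always_taken):
    """Predict taken when the counter exceeds 1 or always_taken is set."""
    return always_taken or ctr > 1
```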

The pseudo-LRU algorithm also requires data updates. It has two data sources: the encoded way number hit during prediction, and the encoded way number written during a uFTB update. If either is valid, the pseudo-LRU state is updated with its information. When both are valid, both pieces of information are applied sequentially within one cycle using combinational logic.
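A behavioral sketch of a tree pseudo-LRU replacement policy follows; the tree encoding is a common textbook formulation and an assumption here, not the RTL's exact state layout. The two-source rule from the text maps to calling `touch` once per valid source, in order, within the same cycle:

```python
class PLRU:
    """Tree pseudo-LRU over n ways (n a power of two)."""
    def __init__(self, n):
        self.n = n
        self.bits = [0] * (n - 1)   # internal tree nodes, root at index 0

    def touch(self, way):
        """Mark `way` most recently used: walk root to leaf and point
        every node on the path away from the direction just taken."""
        idx, lo, hi = 0, 0, self.n
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if way < mid:
                self.bits[idx] = 1               # send victim search right
                idx, hi = 2 * idx + 1, mid
            else:
                self.bits[idx] = 0               # send victim search left
                idx, lo = 2 * idx + 2, mid

    def victim(self):
        """Follow the tree bits to the approximately least-recently-used way."""
        idx, lo, hi = 0, 0, self.n
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if self.bits[idx] == 0:
                idx, hi = 2 * idx + 1, mid
            else:
                idx, lo = 2 * idx + 2, mid
        return lo
```

When both a prediction hit and an update write are valid in one cycle, the model applies `plru.touch(hit_way)` followed by `plru.touch(write_way)`, mirroring the sequential combinational update in the text.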

Multiple Advanced Predictors and Override Mechanisms

The results output by each advanced predictor will be compared with the results generated in the previous pipeline stage and passed along the pipeline (the uFTB's minimal result or the complete result provided by the FTB in a later pipeline stage). If there is a difference, the newer result will be used to flush the pipeline.

Composer

Composer is a module used to combine multiple predictors. In this project, it combines five predictors: uFTB, FTB, TAGE-SC, ITTAGE, and RAS, and abstractly presents them as a three-stage pipeline override predictor to the outside. The predictors in Composer can be turned on or off by writing the custom register sbpctl, allowing predictors to be used on demand. Upon detecting an external redirection, Composer sends the redirection request to each predictor to restore the speculatively updated elements. After all instructions in a prediction bundle are committed, each predictor in Composer is trained. Finally, Composer outputs the three-stage prediction results to the Predictor. The meta information of each predictor is concatenated in this module and passed to the FTQ. The training meta information returned by the FTQ is also split in this module and sent to each module.

Start PC Configuration

The Composer's IO interface io_reset_vector can configure the starting PC. Simply pass the desired starting PC to this IO.

Connection with Predictors

Composer connects the five predictors: uFTB, FTB, TAGE-SC, ITTAGE, and RAS. There are three pipeline stages for branch predictors. The same pipeline stage of each predictor is connected sequentially from front to back with combinational logic. Each predictor has a fixed delay and will definitely complete prediction by that pipeline stage. Therefore, Composer only needs to output the prediction result of the corresponding predictor at the corresponding pipeline stage.

Predictor Switching

Using Zicsr instructions, software can read and write the custom CSR sbpctl to control the enable of each predictor in Composer. sbpctl[6:0] holds the enables of seven predictors: {LOOP, RAS, SC, TAGE, BIM, BTB, uFTB}. A set bit enables the corresponding predictor; a cleared bit disables it. Specifically, the value of the sbpctl CSR is passed to each predictor via the Composer's io_ctrl_* IO interface, and each predictor is responsible for implementing its own enable. The Loop and BIM predictors are not currently part of the architecture, so their bits have no effect.

Redirection Recovery

Composer receives redirection requests via the IO ports io_s2_redirect, io_s3_redirect, and io_redirect_*, etc. These requests are sent to its individual predictors to restore speculatively updated elements, such as the RAS stack top entry, etc.

FTB

FTB temporarily stores FTB entries, providing more accurate branch instruction location, type, target address, and other critical information for prediction bundles to subsequent advanced predictors like TAGE, ITTAGE, SC, and RAS. It also provides a basic direction prediction for branch instructions that always jump. The FTB module contains an FTBBank module responsible for the actual storage of FTB entries. The module uses a multi-way SRAM as the storage device. The SRAM specifications and format are detailed later.

Request Reception

In stage 0, the FTB module sends a read request to the internal FTBBank. The requested PC value is the PC passed in s0.

Data Reading and Return

In the cycle after sending the request, which is the predictor's stage 1, the multi-way signals read from the FTB SRAM will be temporarily stored.

In the next cycle, which is the predictor's stage 2, a per-way hit signal is generated from the temporarily stored data by comparing each way's stored tag against the tag that was generated from the high bits of the PC when the request was issued. If a way hits, its FTB data is selected. For a hit, the returned value is the selected FTB entry together with the hit way information; on a miss, the output data is meaningless.

The data read out by the FTBBank module is passed within the FTB module as the stage 2 prediction result to the s2 stage of the subsequent BPU predictors to obtain branch instruction types, PC information, etc. Furthermore, this read-out result is also temporarily stored within the FTB module. In stage 3, it is passed as a prediction result using combinational logic to the subsequent predictors. If the FTB hits, the read-out hit way number, hit information, cycle count, etc., are also passed along the pipeline. Finally, if this prediction bundle is not flushed midway through the pipeline, it is passed as meta information to the subsequent FTQ module in s3. The cycle count is only used for performance statistics in simulation environments; it does not exist in FPGA environments.

In addition, if the valid branch instruction recorded in the FTB entry has an always taken flag, indicating that this branch instruction has never had a non-taken case in history, the corresponding br_taken_mask in the stage 2 prediction result is directly asserted in this module, directly predicting the branch instruction as taken, and no longer using the prediction results from other advanced predictors.

TAGE-SC

TAGE-SC is the main predictor for conditional branches in the Nanhu architecture and belongs to Accurate Predictors (APD).

TAGE uses multiple prediction tables with different history lengths to exploit very long branch history information; SC is a statistical corrector.

TAGE consists of a base prediction table and multiple history tables. The base prediction table is indexed by PC, while the history tables are indexed by XORing PC with the folded results of a certain length of branch history. Different history tables use different branch history lengths. During prediction, the PC is also XORed with another folded result of the branch history corresponding to each history table to calculate a tag, which is matched with the tag read from the table. If the match is successful, that table hits. The final result depends on the prediction result of the hit table with the longest history length.
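The indexing, tagging, and longest-history selection described above can be sketched in Python. The exact hash composition and bit widths here are illustrative assumptions, not the actual RTL hash:

```python
def tage_index(pc, folded_hist, idx_bits):
    """Index a history table: XOR the PC with one folding of the branch
    history, truncated to the table depth."""
    return (pc ^ folded_hist) & ((1 << idx_bits) - 1)

def tage_tag(pc, folded_a, folded_b, tag_bits):
    """Tag computation: XOR the PC with a second, differently folded copy
    of the same history (the shift of folded_b is an assumption)."""
    return (pc ^ folded_a ^ (folded_b << 1)) & ((1 << tag_bits) - 1)

def select_provider(hits):
    """hits: list of (table_number, prediction), ordered by increasing
    history length. The hit from the table with the longest history wins;
    returns None when no table hits."""
    provider = None
    for table, pred in hits:
        provider = (table, pred)   # later entries use longer history
    return provider
```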

When SC believes TAGE has a significant probability of misprediction, it reverses the final prediction result.

In the Nanhu architecture, at most 2 conditional branch instructions can be predicted simultaneously in each cycle. When accessing the various history tables of TAGE, the starting address of the prediction bundle is used as the PC, and two prediction results are fetched simultaneously. The branch history used for them is also the same.

TAGE Prediction Timing

TAGE is a high-accuracy conditional branch direction predictor. It addresses multiple SRAM tables using different lengths of branch history together with the current PC. When multiple tables hit, the prediction of the hitting entry whose table uses the longest history is selected as the final result.

TAGE requires 2 cycles of latency:

  • Cycle 0: Generate the index for SRAM addressing. The index generation process is XORing the folded history and PC. The management of folded history is not inside ITTAGE and TAGE but in BPU.
  • Cycle 1: Read out the result.
  • Cycle 2: Output the prediction result.

TAGE: Folded History

Each history table of TAGE-like predictors has a specific history length. To XOR with the PC for indexing the history tables, a very long branch history sequence needs to be divided into many segments, and then all segments are XORed together. The length of each segment is usually equal to the logarithm of the history table depth. Since the number of XOR operations is generally large, to avoid the delay of multi-level XORs in the prediction path, we directly store the folded history. Since different history lengths require different folding methods, the number of required folded history copies is equal to the number of unique (history length, folded length) pairs. When updating a single bit of history, it is only necessary to XOR the oldest bit and the newest bit before folding into the corresponding position, and then perform a shift operation.
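The two equivalent views of folding, direct segment-XOR versus the incremental single-bit update, can be checked against each other in a small Python model. The newest-bit-at-index-0 ordering is an assumption of this sketch:

```python
def fold(hist_bits, folded_len):
    """Reference folding: XOR consecutive folded_len-bit segments of the
    full history (hist_bits[0] is the newest bit)."""
    out = [0] * folded_len
    for i, b in enumerate(hist_bits):
        out[i % folded_len] ^= b
    return out

def folded_update(folded, folded_len, hist_len, new_bit, oldest_bit):
    """Incremental update used on the prediction path: XOR out the bit
    falling out of the history, rotate every surviving bit's position by
    one, then XOR in the newest bit - no long XOR chain required."""
    f = folded[:]
    f[(hist_len - 1) % folded_len] ^= oldest_bit  # remove evicted bit
    f = [f[-1]] + f[:-1]                          # each position grows by 1
    f[0] ^= new_bit                               # insert the newest bit
    return f
```

The test below shifts several bits through both paths and asserts that the incrementally maintained folding always matches a fresh direct fold.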

TAGE: Alternate Prediction Logic

The USE_ALT_ON_NA register is implemented to dynamically decide whether to use the alternate prediction when the longest-history match lacks confidence. In this implementation, for timing reasons, the base prediction table's result is always used as the alternate prediction, which costs only a very small amount of accuracy.
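The decision can be sketched as follows; the 3-bit provider counter and the definition of "weak" as the two middle counter values are assumptions of this model:

```python
def tage_final(provider_ctr, base_pred, use_alt_on_na):
    """Final TAGE direction (sketch): if the provider's 3-bit counter is
    weak (3 or 4, an assumption) and USE_ALT_ON_NA says so, fall back to
    the base table's prediction; otherwise take the counter's sign bit."""
    weak = provider_ctr in (3, 4)      # weakly not-taken / weakly taken
    if weak and use_alt_on_na:
        return base_pred               # alternate = base table result
    return provider_ctr >= 4           # MSB of the 3-bit counter
```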

SC: Timing

TAGE is very effective at predicting branches that correlate strongly with branch history. However, in some applications certain branch behaviors correlate only weakly with branch history or path and instead exhibit a statistical bias, for example a branch that leans slightly toward one direction without any strong correlation with the historical path. TAGE fails to predict such statistically biased branches; for them, a simple counter that captures the bias is more effective.

The purpose of statistical correction is to detect less reliable predictions and reverse them. The prediction from TAGE and branch information (address, global history, global path, local history) are presented to the statistical corrector predictor, which decides whether to reverse the prediction. SC is responsible for predicting conditional branch instructions with statistical bias and reversing the TAGE predictor's result in such cases.

SC's prediction algorithm depends on TAGE's history-table hit signal (provided) and on the provider's predicted direction (taken). provided is a necessary condition for using the SC prediction. The provider's taken bit serves as a choose bit that selects SC's final prediction, because SC may predict differently depending on which direction TAGE predicts.
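The sum-and-threshold reversal idea can be sketched behaviorally; the counter weighting and the fixed TAGE contribution of plus or minus 8 are illustrative assumptions, not the RTL's trained values:

```python
def sc_decide(tage_taken, sc_ctrs, threshold):
    """SC correction sketch: sum the signed SC counters plus a centered
    TAGE contribution; if the magnitude clears the (trained) threshold,
    the sum's sign gives the final direction, possibly reversing TAGE."""
    total = sum(sc_ctrs) + (8 if tage_taken else -8)  # +/-8: assumed weight
    if abs(total) > threshold:
        return total >= 0      # confident SC verdict may reverse TAGE
    return tage_taken          # not confident: keep TAGE's prediction
```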

SC requires 3 cycles of latency:

  • Cycle 0: Generate the addressing index s0_idx. The index generation process is XORing the folded history and PC. The management of folded history is not inside ITTAGE and TAGE but in the BPU.
  • Cycle 1: Read out the counter data s1_scResps corresponding to s0_idx from the SCTable.
  • Cycle 2: Decide whether to reverse the prediction result based on s1_scResps.
  • Cycle 3: Output the complete prediction result.

ITTAGE

ITTAGE receives prediction requests from within the BPU. Internally, it consists of a base prediction table and multiple history tables. Each table entry has a field for storing the target address of an indirect jump instruction. The base prediction table is indexed by PC, while the history tables are indexed by XORing PC with the folded results of a certain length of branch history. Different history tables use different branch history lengths. During prediction, the PC is also XORed with another folded result of the branch history corresponding to each history table to calculate a tag, which is matched with the tag read from the table. If the match is successful, that table hits. The final result depends on the prediction result of the hit table with the longest history length. Finally, ITTAGE outputs the prediction result to the composer.

Indirect Jump Instruction Prediction

ITTAGE is used to predict indirect jump instructions. The jump targets of ordinary branch instructions and unconditional jump instructions are directly encoded in the instructions, making them easy to predict. However, the jump addresses of indirect jump instructions come from runtime-variable registers, thus having multiple possible choices, which need to be predicted based on branch history. For this purpose, each entry in ITTAGE adds a field for the predicted jump address on top of the TAGE entry. The final output is the selected hit predicted jump address, rather than the selected jump direction. Since each FTB entry stores at most one indirect jump instruction information, the ITTAGE predictor also predicts the target address of at most one indirect jump instruction per cycle.
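The target-selection rule, longest-history hit first, with a fallback when that hit's confidence counter is weak, can be sketched as follows; the confidence threshold and the fallback-to-next-longest rule are simplified assumptions of this model:

```python
def ittage_target(hits, base_target):
    """hits: list of (table_number, target, confidence_ctr), ordered by
    increasing history length. Use the longest-history hit's target
    unless its counter is weak (0, an assumed threshold), in which case
    fall back to the next-longest hit or the base table's target."""
    if not hits:
        return base_target
    table, target, ctr = hits[-1]     # longest-history hit
    if ctr > 0:
        return target                 # confident: use the provider
    if len(hits) >= 2:
        return hits[-2][1]            # alternate: next-longest hit
    return base_target
```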

ITTAGE requires 3 cycles of latency:

  • Cycle 0: Generate the addressing index.
  • Cycle 1: Read out the data.
  • Cycle 2: Select the hit result.
  • Cycle 3: Output the result.

Folded Branch History

History tables have specific history lengths. To XOR with the PC for indexing history tables, a very long branch history sequence needs to be divided into many segments, and then all segments are XORed together. The length of each segment is generally equal to the logarithm of the history table depth. Due to the generally large number of XOR operations, to avoid the delay of multi-level XORs in the prediction path, we directly store the folded history. Since different history lengths have different folding methods, the required number of folded history copies is equal to the number of unique (history length, folded length) pairs. When updating a single bit of history, it is only necessary to XOR the oldest bit and the newest bit before folding into the corresponding position, and then perform a shift operation.

RAS

RAS uses a stack structure to predict the return addresses of function call and return execution flows, which occur in matched pairs. A call (push) instruction is a jal/jalr whose destination register is x1 or x5. A return (ret) instruction is a jalr whose source register is x1 or x5. These are unconditional jump instructions whose type and offset within the block have already been read out of the FTB.

In the implementation, the RAS predictor provides prediction results in stages s2 and s3.

Stage 2 Results

In stage 2, since the stage 3 prediction result may still change, the current FTB entry is not necessarily on the final execution path. A prediction made at this point speculates that the stage 3 result (i.e., that of the previous prediction bundle) will not redirect the predicted start address in the current pipeline stage. If the FTB entry passed from the FTB in stage 2 is valid and contains a push-type (call) instruction, the PC of the instruction immediately following the call is pushed onto the RAS stack; if the FTB entry passed in stage 2 is valid and contains a pop-type (return) instruction, the address at the current stack top is returned as the result and popped from the stack.

Within the RAS stack module, these behaviors are implemented as follows. On a push, if the incoming address differs from the stack-top address, a new entry is pushed with its counter set to 0; otherwise the stack-top entry's counter is incremented by 1. Either way, the resulting top information is installed as a write-bypass item for the current read operation to use. On a pop, if the stack-top counter is 0, the stack pointer is decremented by 1; if the counter is greater than 0, the counter is decremented by 1.

For timing optimization, data written into the RAS stack is committed after a one-cycle delay. Because data written in the current cycle may need to be read in the next cycle, a write-bypass mechanism is provided: data about to be written first updates the bypass state (the write pointer and write data) in the current cycle. If the pointer read in the next cycle matches the pointer recorded by the write bypass, the bypassed value is used; otherwise the actual stack-top value is used.

Stage 3 Results

In stage 3, the speculative push/pop operations recorded in stage 2 are passed along the pipeline. Stage 3 generates push/pop control signals from the stage 3 FTB entry using the same logic as stage 2 (at this point there are no later pipeline stages, so an FTB entry that reaches stage 3 cannot be flushed by a subsequent stage). If the stage 2 speculative result differs from the stage 3 determined result, that is, while RAS was in stage 2 the stage 3 prediction flushed the BPU pipeline, then the assumption underlying the stage 2 prediction no longer holds and the operation on the RAS stack was incorrect, requiring state recovery similar to a misprediction. Details are described later.

Predictor Training

For every prediction made by a predictor, after all instructions within it are successfully committed, the FTQ will generate prediction bundle update information. This information, along with the meta information of each predictor passed to the FTQ, is sent back to the predictor for training.

Predictor Training Data Reception

Predictor training data passed from the FTQ is sent to each predictor along with the BPU module's FTB entry, update PC, and other signals. Each predictor, depending on its timing pressure, will either temporarily store the update signal or proceed immediately.

FTB

The specific logic for updating FTB entries is detailed in the FTQ module.

Upon receiving an update request, the FTB module decides the update timing based on whether the original read recorded in the meta information hit when the prediction was made. If the meta indicates a hit, the new FTB data is written to the SRAM immediately in the current cycle. Otherwise, a 2-cycle delay is needed: the existing entries in the FTB are first read out, the write way is decided, and only then is the update performed.

Inside the FTBBank, the behavior for an update request likewise differs between an immediate update and a delayed update. For an immediate update, the SRAM write channel of the FTBBank is asserted and the write completes with the given information. For an update delayed by 2 cycles, the FTBBank first receives a read request for the update, which has higher priority than a normal prediction read request. In the next cycle the data is read out, and the way hit by the given address is selected and passed to the outer FTB module (a hit can occur when two requests target the same FTB entry: the first request missed, the second access happened before the first request's update completed, and this update belongs to the second request). If no way hits in that cycle, the write in the following cycle goes to the way allocated by the way selection algorithm one cycle after the FTB entry was read out. The way selection rule is: if all ways are occupied, a victim way is chosen by the replacement algorithm (pseudo-LRU here, detailed in the ICache document); otherwise, an empty way is chosen.
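The way-selection rule for a delayed update can be sketched as follows. Names are illustrative; the pseudo-LRU victim computation itself is described in the ICache document and is taken here as an input.

```python
def select_write_way(valid_bits, hit_way, plru_victim):
    """Sketch of the update-way choice described above (illustrative,
    not RTL). valid_bits: per-way valid flags read out for the update
    set; hit_way: index of the way the update address hit, or None;
    plru_victim: way chosen by the replacement algorithm when the set
    is full."""
    if hit_way is not None:
        return hit_way               # delayed update hit: overwrite in place
    for way, valid in enumerate(valid_bits):
        if not valid:
            return way               # an empty way exists: allocate it
    return plru_victim               # set full: replace the pseudo-LRU victim
```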

Composer

The Composer sends training signals to its constituent predictors through the io_update_* IO ports. In general, to prevent incorrect execution paths from polluting predictor contents, each predictor is trained only after all instructions in the prediction bundle have committed. The training content comes from each predictor's own prediction information and from the decode and execution results of the instructions in the bundle; both are read out of the FTQ and sent back to the BPU. The predictors' own prediction information is packaged and stored into the FTQ after prediction; instruction decode results come from the IFU's pre-decode module and are written back to the FTQ after the instructions are fetched; execution results come from the various execution units.

TAGE-SC & ITTAGE

Each table entry contains a useful field. A non-zero value indicates that the entry is useful, and the allocation algorithm will not treat it as an empty entry during training. A saturating counter dynamically monitors the numbers of successful and failed allocations during training; when failed allocations accumulate enough for the counter to saturate, all useful fields are cleared to zero.
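The useful-field reset policy can be sketched as follows. The counter width and table size are illustrative, and resetting the monitor counter itself after a global clear is an assumption, not stated in the text.

```python
class UsefulReset:
    """Sketch of the dynamic useful-bit reset described above: a
    saturating counter tracks failed vs. successful allocations, and
    when it saturates, all useful fields are cleared (illustrative
    model; widths/threshold are assumptions)."""

    def __init__(self, entries=16, ctr_bits=4):
        self.useful = [0] * entries
        self.ctr = 0
        self.ctr_max = (1 << ctr_bits) - 1

    def on_allocation(self, success):
        if success:
            self.ctr = max(self.ctr - 1, 0)     # successful allocation
        else:
            self.ctr = min(self.ctr + 1, self.ctr_max)
            if self.ctr == self.ctr_max:
                # Too many failed allocations: clear all useful fields
                # so stale entries become allocatable again.
                self.useful = [0] * len(self.useful)
                self.ctr = 0                    # assumed: restart monitoring
```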

RAS

When the stage 2 speculative result of the RAS differs from the stage 3 result, or a previous prediction encounters a redirect, the RAS state must be restored. Redirect information is first buffered in registers inside the RAS and applied with a one-cycle delay; the update completes in the cycle after stage 3 detects the difference. If the redirect is a call misprediction (the mispredicted instruction is pre-decoded as a call-type instruction), or the stage 3 push operation does not match stage 2, a recover_push operation is performed, pushing back the RAS stack top that was incorrectly popped. If the redirect is a ret misprediction (the mispredicted instruction is pre-decoded as a ret-type instruction), or the stage 3 pop operation does not match stage 2, a recover_pop operation is performed, popping the RAS stack top that should have been popped. During a redirect, the stack pointer is restored to the stack pointer carried by the redirect; otherwise it keeps its current value. Likewise, the stack top is restored to the stack-top entry carried by the redirect; otherwise it keeps its current value. The new address used for recovery is the address of the instruction immediately following the redirected instruction during a redirect; otherwise it is the stage 2 speculative value.

Inside the RAS stack, push, pop, and other operations are also generated during state recovery. These operations are handled in the same way as described for stage 2; only the differences are listed here. For a push operation that does not allocate a new entry, if in the recover state, the sp, stack-top pointer, and top return address are set to the recovered values. For a pop operation where the current stack-top counter is non-zero, the sp, stack-top pointer, and top return address are set to the recovered values. When the operation is neither a push nor a pop, the sp, stack-top pointer, and top return address must be restored while the write bypass is also handled.
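One plausible reading of the recover_push/recover_pop conditions above can be sketched as follows; flag and field names are illustrative, not the RTL signal names.

```python
def recovery_ops(redirect, s2_op, s3_op):
    """Sketch of the recovery decision described above (illustrative).

    redirect: None, or a dict with 'is_call'/'is_ret' flags taken from
    the pre-decode of the mispredicted instruction.
    s2_op / s3_op: 'push', 'pop', or None -- the stage 2 speculative
    operation and the stage 3 determined operation.

    Returns (recover_push, recover_pop).
    """
    if redirect is not None:
        # Redirect recovery: redo the operation the mispredicted
        # instruction should have performed.
        return redirect["is_call"], redirect["is_ret"]
    # Stage 3 check: the stage 2 speculation did not match.
    recover_push = (s3_op == "push") and (s2_op != "push")
    recover_pop = (s3_op == "pop") and (s2_op != "pop")
    return recover_push, recover_pop
```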

Branch History Information Maintenance

Speculative Update

After the predictor generates a speculative result in the pipeline stage, the branch history used for subsequent requests will also include this speculative value to improve prediction accuracy.

Redirect Recovery

When a branch prediction error is encountered, the branch history is also restored to the state before the error occurred, thus ensuring the accuracy of the branch history.
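The speculative update and redirect recovery of the branch history can be sketched together in a minimal model. This is illustrative only: names and the history width are invented, and the checkpoint is assumed to be saved alongside each prediction (e.g. in the FTQ) so it can be supplied with the redirect.

```python
class GlobalHistory:
    """Sketch of global branch history maintenance: outcomes are
    shifted in speculatively at predict time, and a per-prediction
    checkpoint restores the pre-error state on a redirect."""

    def __init__(self, bits=64):
        self.bits = bits
        self.hist = 0

    def speculative_update(self, taken):
        # Shift in the speculative outcome so younger requests see it.
        self.hist = ((self.hist << 1) | int(taken)) & ((1 << self.bits) - 1)

    def checkpoint(self):
        # Saved with the prediction bundle for possible recovery.
        return self.hist

    def recover(self, snapshot, actual_taken):
        # Misprediction: rewind to the state before the wrong branch,
        # then shift in its actual direction.
        self.hist = snapshot
        self.speculative_update(actual_taken)
```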

Top-Down Performance Analysis Event Statistics

Speculative results generated by predictors in the BPU pipeline stages may be stalled for various reasons, which can ultimately produce pipeline bubbles for the processor as a whole. To accurately analyze and locate performance bottlenecks, the Kunming Lake architecture adds top-down performance counters that collect bubble/stall information for each pipeline stage and attribute instruction-commit bubbles (the difference between the actual number of committed instructions and the ideal number of issued instructions) to specific modules, thereby localizing the bottleneck components. Details of the performance analysis methodology can be found in the Intel paper "A Top-Down Method for Performance Analysis and Counters Architecture". The stall events that the BPU can count include: pipeline bubbles introduced by misprediction recovery of each predictor; bubbles introduced by backend memory access violation recovery; bubbles introduced by BPU-internal override predictions flushing older prediction results; bubbles introduced by BPU stalls while branch instructions train the predictors; and bubbles introduced by BPU stalls when the FTQ is full and cannot accept new prediction bundles. The BPU does not prioritize among these bubble causes; it only asserts the corresponding bubble control signal when a statistical condition is met and passes it along the processor pipeline.

Currently, there are 5 types of predictors within the BPU: FTB (including uFTB and main FTB), TAGE, SC, ITTAGE, and RAS. Top-down analysis categorizes branch misprediction causes into these 5 predictors. Specifically, the condition for decomposing each misprediction bubble to each predictor is:

  1. FTB: A branch instruction related redirection occurred, and the mispredicted instruction was not recorded in the FTB of the corresponding prediction bundle.
  2. TAGE: A branch instruction related redirection occurred, and the mispredicted instruction was recorded in the FTB of the corresponding prediction bundle, but the SC predictor did not provide a corresponding prediction.
  3. SC: A branch instruction related redirection occurred, and the mispredicted instruction was recorded in the FTB of the corresponding prediction bundle, and the SC predictor provided a corresponding prediction result.
  4. ITTAGE: A branch instruction related redirection occurred, the mispredicted instruction was a jalr instruction but not a return instruction, and it hit in the FTB entry.
  5. RAS: A branch instruction related redirection occurred. The mispredicted instruction was a return instruction (a special type of jalr instruction, details later in the RAS module description).
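The 5-way decomposition above can be sketched as a priority classification. The flags are illustrative, and the function assumes the redirect is already known to be branch-related.

```python
def attribute_mispredict(in_ftb, is_jalr, is_ret, sc_provided):
    """Sketch of the top-down misprediction attribution described
    above (illustrative flags, not RTL signals), for a branch-related
    redirect. Returns the predictor the bubble is attributed to."""
    if not in_ftb:
        return "FTB"      # mispredicted instruction missing from the FTB entry
    if is_ret:
        return "RAS"      # return instructions are predicted by the RAS
    if is_jalr:
        return "ITTAGE"   # other indirect jumps are predicted by ITTAGE
    # Remaining case: a conditional branch recorded in the FTB entry.
    return "SC" if sc_provided else "TAGE"
```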

The condition for pipeline bubbles introduced by backend memory access violation recovery is that the redirection signal from the backend indicates that the redirection is due to a memory access violation.

Pipeline bubbles introduced by BPU internal override have two possible sources: redirection signals from BPU pipeline stages 2 and 3.

Pipeline bubbles introduced by branch predictor training have 3 possible sources: ready signals from BPU pipeline stages 1, 2, and 3. The ready signals mentioned above are the result of the logical OR operation of the ready signals of each predictor within the BPU.

The condition for pipeline bubbles introduced by a full FTQ is when the ready signal of the handshake interface from the BPU to the FTQ module is low.

Bubbles caused by branch misprediction and memory access violations are marked in the top-down signals of the current BPU pipeline stages. Bubbles introduced by a BPU-internal override are marked in the pipeline stage where the override occurs and in earlier stages, without affecting stages that hold earlier prediction bundles. Bubbles caused by branch predictor training are handled similarly to BPU overrides.
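The marking rules above can be sketched per pipeline stage as follows; stage indexing (1-based) and cause names are illustrative.

```python
def topdown_bubble_marks(num_stages, cause, override_stage=None):
    """Sketch of the bubble-marking rules described above
    (illustrative, not RTL). Returns one mark flag per BPU stage.

    - misprediction / memory-violation redirects mark every current
      BPU pipeline stage;
    - an override (or a training stall, handled the same way) marks
      only the override's stage and earlier ones, leaving stages that
      hold earlier prediction bundles unmarked."""
    if cause in ("mispredict", "mem_violation"):
        return [True] * num_stages
    if cause in ("override", "training"):
        return [s <= override_stage for s in range(1, num_stages + 1)]
    return [False] * num_stages
```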

Overall Design

Overall Block Diagram

(Figure: Overall Block Diagram)

Interface Timing

BPU to FTQ Interface Timing

(Figure: BPU to FTQ Interface Timing)

The figure above shows the BPU to FTQ prediction result interface timing. In the figure, the prediction results for the starting address 0x1FFFF80020 are output in pipeline stages 1, 2, and 3 respectively. If the result is inconsistent with previous pipeline stages, the redirect signal is asserted, indicating the need to refresh the prediction pipeline.

FTQ to BPU Redirect Interface Timing

(Figure: FTQ to BPU Redirect Interface Timing)

The figure above shows the FTQ to BPU redirect interface timing. The core redirect signal is cfiUpdate_target, which specifies the redirect target address as 0x200000109a.

FTQ to BPU Update Interface Timing

(Figure: FTQ to BPU Update Interface Timing)

The figure above shows the FTQ to BPU update interface timing. This update is prepared for the prediction bundle starting at 0x2000000e00, and its target jump address is 0x200000e0e.

Register Configuration

| Register | Address | Reset Value | Attribute | Description |
| --- | --- | --- | --- | --- |
| sbpctl | 0x5C0 | 64'd0 | RW | bit0: uFTB enable signal<br>bit1: FTB enable signal<br>bit2: BIM enable signal (reserved)<br>bit3: TAGE enable signal<br>bit4: SC enable signal<br>bit5: RAS enable signal<br>bit6: loop predictor enable signal (reserved) |

Note: RO - Read-Only register; RW - Read/Write register.

References

  1. Reinman G, Austin T, Calder B. A scalable front-end architecture for fast instruction delivery[J]. ACM SIGARCH Computer Architecture News, 1999, 27(2): 234-245.
  2. Perais A, Sheikh R, Yen L, et al. Elastic instruction fetching[C]//2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2019: 478-490.
  3. Software Optimization Guide for AMD Family 19h Processors (PUB), Chap. 2.8.1.5, https://www.amd.com/system/files/TechDocs/56665.zip
  4. Seznec A, Michaud P. A case for (partially) TAgged GEometric history length branch prediction[J]. The Journal of Instruction-Level Parallelism, 2006, 8: 23.
  5. Seznec A. A 256 Kbits L-TAGE branch predictor[J]. Journal of Instruction-Level Parallelism (JILP) Special Issue: The Second Championship Branch Prediction Competition (CBP-2), 2007, 9: 1-6.
  6. Seznec A. A new case for the tage branch predictor[C]//Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture. 2011: 117-127.
  7. Seznec A. The O-GEHL branch predictor[J]. The 1st JILP Championship Branch Prediction Competition (CBP-1), 2004.
  8. Jiménez D A, Lin C. Dynamic branch prediction with perceptrons[C]//Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture. IEEE, 2001: 197-206.
  9. Seznec A. A 64-Kbytes ITTAGE indirect branch predictor[C]//JWAC-2: Championship Branch Prediction. 2011.