BPU Submodule ITTAGE
Functionality
ITTAGE receives prediction requests from within the BPU. Internally, it consists of a base prediction table and multiple history tables. Each table entry contains a field for storing the target address of an indirect jump instruction. The base prediction table is indexed by PC, while history tables are indexed by the XOR of PC and the folded result of branch history of a certain length. Different history tables use different lengths of branch history. During prediction, the tag is also calculated by XORing PC with another folded result of the branch history corresponding to each history table. This is matched with the tag read from the table. If the match is successful, the table hits. The final result depends on the result of the prediction table with the longest history length that hits. Finally, ITTAGE outputs the prediction result to the composer.
Prediction of Indirect Jump Instructions
ITTAGE is used to predict indirect jump instructions. The jump targets of normal branch instructions and unconditional jump instructions are directly encoded in the instructions, making prediction easy. However, the jump addresses of indirect jump instructions come from run-time variable registers, thus offering multiple possible choices, and prediction needs to be made based on branch history.
For this purpose, each entry in ITTAGE, building upon the structure of TAGE entries, includes a predicted jump address field. The final output result is the selected hitting predicted jump address rather than the selected branch direction.
Since each FTB entry stores information for at most one indirect jump instruction, the ITTAGE predictor also predicts the target address for at most one indirect jump instruction per cycle.
The ITTAGE in the XiangShan Nanhu architecture provides 5 tagged prediction tables T1-T5. Basic information about the base predictor and tagged prediction tables is shown in the table below.
Predictor | Is Tagged | Function | Entry Structure | Number of Entries |
---|---|---|---|---|
Base Predictor T0 | No | Provides prediction result when tags in all tagged prediction tables do not match | T0 is not implemented in ITTAGE; instead, the prediction result from FTB is directly used as the base prediction result | |
Prediction Tables T1-T5 | Yes | When tag matches exist, the one with the longest history length is selected to provide the prediction result | valid: 1bit, tag: 9bits, ctr: 2bits (highest bit indicates whether to output this prediction result), us: 1bit (usefulness counter), target: 39 bits | T1-T2: 256 entries each, T3-T5: 512 entries each |
Within the BPU module, a 256-bit global branch history ghv is maintained, and separate folded branch histories are maintained for each of ITTAGE's 5 tagged prediction tables. The folding algorithm is the same as in TAGE. The specific configurations of folded histories are shown in the table below. ghv is a cyclic queue, and "low" n bits refer to the low bits counted from the position indicated by ptr:
History | Index Folded Branch History Length | Tag Folded Branch History 1 Length | Tag Folded Branch History 2 Length | Design Principle |
---|---|---|---|---|
Global Branch History ghv | 256 bits | None | None | Each bit represents whether the corresponding branch is taken or not |
T1 Corresponding Folded Branch History | 4 bits | 4 bits | 4 bits | XOR folding of low 4 bits of ghv starting from ptr |
T2 Corresponding Folded Branch History | 8 bits | 8 bits | 8 bits | XOR folding of low 8 bits of ghv starting from ptr |
T3 Corresponding Folded Branch History | 9 bits | 9 bits | 8 bits | XOR folding of low 13 bits of ghv starting from ptr |
T4 Corresponding Folded Branch History | 9 bits | 9 bits | 8 bits | XOR folding of low 16 bits of ghv starting from ptr |
T5 Corresponding Folded Branch History | 9 bits | 9 bits | 8 bits | XOR folding of low 32 bits of ghv starting from ptr |
ITTAGE requires 3 cycles of latency:
- Cycle 0: Generate addressing index
- Cycle 1: Read out data
- Cycle 2: Select hitting result
- Cycle 3: Output
Wrbypass
Wrbypass contains Mem and Cam, used for ordering updates. Each time ITTAGE updates, it writes into this wrbypass, and simultaneously writes into the SRAM of the corresponding prediction table. Each time an update occurs, this wrbypass is checked. If it hits, the CTR value read from ITTAGE is used as the old value, and the old CTR value previously carried with the branch instruction to the backend and then sent back to the frontend is discarded. Thus, if a branch is updated repeatedly, wrbypass can guarantee that a specific update will retrieve the final value of the immediately preceding update.
Each of ITTAGE's prediction tables T1~T5 has a corresponding wrbypass. In each prediction table's wrbypass, Mem has 4 entries, each storing 1 CTR; Cam has 4 entries. By inputting the update's index (idx) and tag, the position of the corresponding data in Cam can be read. Cam and Mem are written simultaneously, so the position of data in Cam is the same as its position in Mem. Therefore, by utilizing this Cam, we can check during an update if the data corresponding to the index (idx) is present in the wrbypass.
Predictor Training
First, define the predictor with the longest required history length among all prediction tables producing a tag match as the provider, while the remaining matching prediction tables (if any) are called altpred. When the provider's CTR is 0, the altpred's result will be chosen as the prediction result.
ITTAGE entries contain a usefulness field. When the provider predicts correctly and the altpred predicts incorrectly, the provider's usefulness is set to 1, indicating that this entry is useful and will not be allocated as an empty entry by the allocation algorithm during training. When the prediction produced by the provider is confirmed to be correct, and the prediction result of the provider is different from that of the altpred at this time, the provider's usefulness field is set to 1.
If the predicted address matches the actual address, the CTR counter of the corresponding provider entry is incremented by 1; if the predicted address does not match the actual address, the CTR counter of the corresponding provider entry is decremented by 1. In ITTAGE, it is determined based on the CTR whether to take the predicted jump target result. When the CTR is 0, the result from altpred will be chosen.
Next, if the prediction table from which the provider originates is not the one with the highest required history length, the following entry allocation operation is performed at this time. The entry allocation operation first reads the usefulness fields of all prediction tables with a history length longer than that of the provider. If the usefulness field of a certain table is 0 at this time, a corresponding entry is allocated in that table; if no table satisfying the usefulness field value of 0 is found, the allocation fails. When the usefulness fields of multiple prediction tables (e.g., Tj, Tk) are all 0, the probability of entry allocation is random. During allocation, some tables are randomly masked off so that allocation doesn't always happen in the same one. The randomness of entry allocation here is achieved by generating pseudorandom numbers using the 64-bit Linear Feedback Shift Register primitive LFSR64 from Chisel's util package. In the Verilog code, this corresponds to the allocLFSR_lfsr register. During training, an 8-bit saturating counter tickCtr is used to count (failed allocations - successful allocations). When the number of failed allocations is sufficiently large, and the tickCtr counter counts to saturation, a global useful bit reset is triggered, clearing all usefulness fields to zero.
Note: The saturating counter for clearing the usefulness fields in ITTAGE is named tickCtr, its length is 8 bits, and both the name and length are different from those in TAGE.
Finally, during initialization / when new entries are allocated in TAGE tables, the CTR counters in all table entries are set to 0, and all usefulness fields are set to 0.
Storage Structure
- 5 history tables, with entry counts of 256, 256, 512, 512, 512 respectively. Each table is divided into a total of 2 banks based on the low bits of the PC. Each bank has 128 sets, and each set corresponds to at most 1 indirect jump instruction for one FTB entry.
- Each entry contains 1bit valid, 9bit tag, 2bit ctr, 39bit target, 1bit useful; the useful bit is stored separately, and valid uses a register file for independent storage.
- Uses the FTB result as the base table, equivalent to 2K entries (but the FTB target bit width is insufficient and cannot effectively store long jump addresses).
- Each bank of the history tables has a 4-entry write buffer (wrbypass).
Indexing Method
- index = pc[8:1] ^ folded_hist(8bit) or pc and folded_hist each 9bit
- tag = pc[17:9] (or pc[19:10]) ^ folded_hist(9bit) ^ (folded_hist(8bit) << 1)
- It is probably better to use the low bits of PC here, rather than another segment of PC not used by the index, or
- History uses basic segmented XOR folding.
Prediction Flow
- s0: Perform index calculation, address sent to SRAM
- s1: Read out table entries, perform bank selection, and determine hit, result registered to s2
- s2: Calculate longest history match and second-longest history match:
- When a history table hits and ctr != 0, attempt to use the target address of the longest history result.
- When the provider lacks confidence (ctr == 0), if the second-longest history hits, attempt to use the target address of the second-longest history result.
- When no history table hits, use the FTB result.
- When attempting to use the result from a history table, it is truly used only if ctr > 1; if ctr <= 1, the result is not used.
- s3: Use the target address, compare it with the s2 result within the BPU, and determine if the pipeline needs to be flushed.
Training Flow
Basically the same as TAGE. For the target field, a new value is replaced only when a new entry is allocated or the original CTR is the minimum value 0; otherwise, it remains unchanged.