
BPU Sub-module TAGE-SC

Functionality

TAGE-SC is the main predictor for conditional branches in the Nanhu architecture, belonging to the Accurate Predictor (APD) category. TAGE leverages multiple prediction tables with different history lengths, allowing it to capture extremely long branch history information. SC is a Statistical Corrector.

TAGE consists of a base prediction table and multiple history tables. The base prediction table is indexed by PC, while history tables are indexed by the result of XORing PC with a folded branch history of a certain length. Different history tables use different lengths of branch history. During prediction, the PC is also XORed with another folded result of the corresponding branch history for each history table to calculate a tag. This tag is matched against the tag read from the table. If the match is successful, the table is considered a hit. The final result is determined by the result from the prediction table with the longest history length that hit.

When the SC believes there is a high probability that TAGE will mispredict, it reverses the final prediction result.

In the Nanhu architecture, up to 2 conditional branch instructions can be predicted simultaneously per cycle. When accessing the various history tables of TAGE, the starting address of the prediction block is used as the PC, and two prediction results are fetched simultaneously. The branch history used for these two predictions is the same.

TAGE: Design Concept

As a hybrid predictor, TAGE's advantage lies in its ability to perform branch prediction for a given branch instruction based on branch history sequences of different lengths simultaneously. It evaluates the accuracy of the branch prediction for that instruction under each history sequence and selects the one with the highest historical accuracy as the final standard for branch prediction.

Compared to traditional hybrid predictors, TAGE possesses the following two new design characteristics, which significantly improve its prediction accuracy:

  • Adding tag data to each entry in every prediction table. Traditional priority-based branch predictors often rely only on branch history and the current branch instruction's PC value to access prediction tables. This situation can lead to multiple different branch instructions pointing to the same prediction table entry (aliasing). This issue is particularly significant for the accuracy of branch predictions provided by prediction tables that use shorter branch history lengths in hybrid predictors. Therefore, in the TAGE design, by using a partial tagging method, the current instruction can be better matched with the entry in the prediction table, which greatly avoids the occurrence of the aforementioned situation.
  • Using branch history lengths that change geometrically to index different prediction tables. This approach greatly increases the granularity of the entry selection process in each prediction table during branch prediction. Additionally, a usefulness counter is added to each entry to record how useful that entry is for branch prediction. Through this design, various types of branch instructions have a higher chance of being indexed by the branch history length with the highest prediction accuracy to make the final branch prediction.

These two design characteristics ensure that the TAGE predictor does not degrade the information effectiveness of a table entry by having it mapped to multiple different instructions due to a history length that is too short, nor does it require a very long update time to become effective because of a history length that is too long. Therefore, the range of branch history lengths that TAGE can employ is very large, allowing it to judge branch instructions in various code contexts.

TAGE: Hardware Implementation

TAGE is a high-accuracy conditional branch direction predictor. It uses different lengths of branch history and the current PC value to address multiple SRAM tables. When there is a hit in multiple tables, it prioritizes the prediction result from the corresponding entry with the longest history length that hit.

TAGE Principle

The structural schematic of TAGE is shown in the figure above. It provides a base predictor T0 and four tagged prediction tables T1-T4. The basic information for the base predictor and tagged prediction tables is shown in the table below.

| Predictor | Tagged? | Function | Entry Composition | Number of Entries |
| --- | --- | --- | --- | --- |
| Base predictor T0 | No | Provides the prediction result when no tag matches in T1-T4 | ctr: 2 bits (MSB gives the prediction, 1 for taken, 0 for not taken) | 2048 sets, 2 ways |
| Prediction tables T1-T4 | Yes | Provide the prediction result when a tag matches; the longest matching history wins | valid: 1 bit; tag: 8 bits; ctr: 3 bits (MSB gives the prediction, 1 for taken, 0 for not taken); us: 1 bit (usefulness counter) | 4096 entries each |

Note: The "way" in the table above is a parameter name in the Chisel SRAMTemplate class. The value of this parameter indicates how many copies of the same type of data are stored in the SRAM, and it doesn't necessarily represent the conventional meaning of "way" requiring tag matching. T0 has two ways because a prediction block can have up to two branch instructions, and the different ways are used to predict the two different branches within the prediction block separately.

The prediction content for the two jump instructions within each prediction block is stored and updated separately. The 4096 entries of the tagged prediction tables T1-T4 are divided equally into two banks for the maximum of two branches within a prediction block. Each bank has 2048 entries, so the index for prediction tables T1-T4 is only 11 bits, i.e., log2(number of entries per table / 2). The two branch instructions within the same prediction block use the same index but index two different banks, so the results read out for the two branches may hit in one bank and miss in the other.

Each prediction table in TAGE-like predictors has a specific history length. To allow the originally very long global branch history table to be XORed with the PC for prediction table indexing or tag matching, the long branch history sequence needs to be divided into many segments, which are then all XORed together. The length of each segment is generally equal to the logarithm of the history table depth. Since the number of XOR operations is usually large, to avoid latency from multi-level XORs in the prediction path, we directly store the folded history.
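To make the folding concrete, here is a minimal plain-Scala sketch (not the RTL) of the segment-XOR folding just described. The 13-bit window and 5-bit folded width are illustrative values only.

```scala
// Illustrative sketch: fold a long history window down to `foldLen` bits by
// cutting it into foldLen-bit segments and XORing all segments together.
object FoldSketch {
  /** Fold `hist` (newest bit first) down to `foldLen` bits by XORing segments. */
  def fold(hist: Seq[Boolean], foldLen: Int): Seq[Boolean] =
    hist.grouped(foldLen)                        // cut the history into foldLen-bit segments
        .map(_.padTo(foldLen, false))            // the last segment may be shorter
        .reduce((a, b) => a.zip(b).map { case (x, y) => x ^ y })

  def main(args: Array[String]): Unit = {
    val hist = Seq.tabulate(13)(i => i % 3 == 0) // a toy 13-bit history window
    println(fold(hist, 5).map(b => if (b) 1 else 0).mkString) // the 5-bit folded history
  }
}
```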

There are three copies of the folded branch history corresponding to each prediction table: one for indexing the prediction table, and two for tag matching. The BPU module maintains a 256-bit global branch history ghv and maintains their respective folded branch histories for the 4 tagged prediction tables of TAGE. Specific configurations are shown in the table below (see TageTableInfos), where ghv is a circular queue, and "low" n bits refer to the low bits starting from the position indicated by ptr:

| History | Index Folded Branch History Length | Tag Folded Branch History 1 Length | Tag Folded Branch History 2 Length | Design Principle |
| --- | --- | --- | --- | --- |
| Global branch history ghv | 256 bits (unfolded) | N/A | N/A | Each bit records whether the corresponding branch was taken |
| T1 corresponding folded branch history | 8 bits | 8 bits | 7 bits | ghv takes the low 8 bits from ptr, folds and XORs |
| T2 corresponding folded branch history | 11 bits | 8 bits | 7 bits | ghv takes the low 13 bits from ptr, folds and XORs |
| T3 corresponding folded branch history | 11 bits | 8 bits | 7 bits | ghv takes the low 32 bits from ptr, folds and XORs |
| T4 corresponding folded branch history | 11 bits | 8 bits | 7 bits | ghv takes the low 119 bits from ptr, folds and XORs |

Actual Folded History Implementation

As shown in the figure above, for timing reasons, the folded branch history is not implemented as the full re-folding described in the design principle. Instead, the oldest bit (h[12] in the figure) and the newest bit (h[0] in the figure) of the pre-folded branch history are XORed into the corresponding positions (c[1] and c[4] in the figure, respectively), and then a shift operation moves them (to c[2] and c[0], respectively). The pseudocode is as follows:

```
c[0] <= c[4] ^ h[0];
c[1] <= c[0];
c[2] <= c[1] ^ h[12];
c[3] <= c[2];
c[4] <= c[3];
```

The branch history is not updated only after the backend commits; it can also be updated speculatively at any of stages s1~s3. An update writes both the pointer and the value. If a later stage among s1~s3 produces a new result, the pointer is restored and the new value is written; if the earlier prediction turns out to be correct, nothing is modified, because the history has already been updated.

Note: From the specific configuration table of TAGE's folded branch history, it can be seen that there are 2 tag folded branch histories. Both use the same folding algorithm, the only difference being the folding length. The tag folded branch history 1 for T1-T4 is always 8 bits, while tag folded branch history 2 is always 7 bits.

TAGE: Prediction Timing

During each prediction cycle, two different hash functions (see table below) are first computed for each tagged prediction table using the pc value and their respective branch histories. The results of these computations are used to calculate the final tag for the operation and to index the prediction table.

|       | Computation Method |
| ----- | ------------------ |
| Index | (index folded branch history) ^ (lower bits of (pc >> 1)) |
| Tag   | (tag folded branch history 1) ^ (tag folded branch history 2 << 1) ^ (lower bits of (pc >> 1)) |
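As a reference, the two hash functions can be sketched in plain Scala as below, assuming the folded histories are already available as integers; the 11-bit index and 8-bit tag widths follow the storage tables above.

```scala
// Minimal sketch of the index/tag hash functions (plain Scala, not the Chisel source).
object TageHashSketch {
  def index(pc: Long, idxFoldedHist: Int): Int =
    (((pc >> 1) ^ idxFoldedHist) & 0x7FF).toInt                           // 11-bit index

  def tag(pc: Long, tagFoldedHist1: Int, tagFoldedHist2: Int): Int =
    (((pc >> 1) ^ tagFoldedHist1 ^ (tagFoldedHist2 << 1)) & 0xFF).toInt   // 8-bit tag
}
```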

If the valid field obtained by indexing T1-T4 is 1, and the tag matches the result obtained from calculating the tag hash function, then the MSB of the pred provided by that prediction table is added to the final branch prediction sequence. Finally, through a multi-level MUX, the prediction with the longest history length among all tag-matching branch predictions is selected as the final prediction result.

If there are no matches in T1-T4, the result from T0 is used as the final prediction result. T0 is indexed directly using the lower 11 bits of the pc.

TAGE requires 2 cycles of latency:

  • Cycle 0: Generate the index for SRAM addressing. The index generation process is XORing the folded history with the pc. The management of folded history is not inside ITTAGE or TAGE, but within the BPU.
  • Cycle 1: Read out the results.
  • Cycle 2: Output the prediction result.

TAGE: Predictor Training

First, define the provider as the prediction table with the longest required history length among all prediction tables that produced a tag match. Any other prediction tables that produced a tag match (if they exist) are called altpred.

The TAGE entry contains a usefulness field. When the provider's prediction is confirmed correct and it differs from the altpred's (i.e., the altpred would have been wrong), the provider's usefulness is set to 1, marking the entry as useful so that the allocation algorithm will not treat it as an empty entry during training. The provider entry's ctr counter is incremented by 1 if the branch actually jumps and decremented by 1 if it does not. If a TAGE entry needs updating because of a misprediction, and the misprediction was not caused by choosing the altpred over a correct provider, then a new entry should be allocated. Allocation is not guaranteed to succeed: it additionally requires that the provider did not come from the prediction table with the longest history, i.e., there is at least one table with a longer history than the provider. If this condition is met, the entry addition operation is performed.
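A minimal sketch of this provider-update rule, assuming a 3-bit unsigned saturating ctr and a single usefulness bit (names are illustrative, not the Chisel source):

```scala
// Hedged sketch of the provider update described above.
object ProviderUpdateSketch {
  case class Entry(ctr: Int, us: Boolean) // ctr is a 3-bit unsigned value in [0, 7]

  def update(e: Entry, taken: Boolean,
             providerCorrect: Boolean, differsFromAltpred: Boolean): Entry = {
    // ctr follows the actual direction, saturating at the 3-bit bounds
    val newCtr = if (taken) math.min(e.ctr + 1, 7) else math.max(e.ctr - 1, 0)
    // usefulness is set when the provider is right and the altpred would have been wrong
    val newUs  = if (providerCorrect && differsFromAltpred) true else e.us
    Entry(newCtr, newUs)
  }
}
```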

Here is a logical example to determine if the condition "provider's table is not the longest history table" is met:

s1_providers(i) represents the prediction table index corresponding to the provider for the i-th branch in the prediction block. Assume the provider is in prediction table T2. Then LowerMask(UIntToOH(s1_providers(i)), TageNTables) would be 0b0011. s1_provideds(i) indicates whether the provider for the i-th branch is in T1~T4. Based on the assumption, Fill(TageNTables, s1_provideds(i).asUInt) would be 0b1111. Their bitwise AND gives 0b0011, and then taking the bitwise NOT gives 0b1100. From this, it can be seen that T3 and T4 are prediction tables with longer history lengths than the provider.

Let's take another example. Initially, all prediction tables are empty, and there is no provider. At this time, the corresponding s1_providers(i) is 0, and s1_provideds(i) is false. Then Fill(TageNTables, s1_provideds(i).asUInt) is 0b0000. Their bitwise AND and then NOT will always result in 0b1111, indicating that T1~T4 all belong to the "history tables longer than the provider".
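The same mask arithmetic can be reproduced in plain Scala; the snippet below follows the two examples above (TageNTables = 4, with bit i of a mask standing for table T(i+1)):

```scala
// Reproduces the "tables longer than the provider" mask from the examples above.
object AllocMaskSketch {
  val TageNTables = 4

  def longerThanProviderMask(provider: Int, provided: Boolean): Int = {
    val oneHot    = 1 << provider                                 // UIntToOH(s1_providers(i))
    val lowerMask = (oneHot << 1) - 1                             // LowerMask: bits at and below the provider
    val fill      = if (provided) (1 << TageNTables) - 1 else 0   // Fill(TageNTables, s1_provideds(i))
    ~(lowerMask & fill) & ((1 << TageNTables) - 1)                // tables with longer history than the provider
  }

  def main(args: Array[String]): Unit = {
    println(longerThanProviderMask(provider = 1, provided = true).toBinaryString)  // "1100": T3, T4
    println(longerThanProviderMask(provider = 0, provided = false).toBinaryString) // "1111": T1~T4
  }
}
```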

The specific entry addition operation is as follows:

The entry addition operation first reads the usefulness of all prediction tables with history lengths longer than the provider. If the usefulness field of any of these tables is 0, a corresponding entry is allocated in that table. If no table with a usefulness field value of 0 is found, the allocation fails. When multiple prediction tables (e.g., Tj, Tk) have usefulness fields of 0, the probability of entry allocation is random. During allocation, some tables are randomly masked out using a generated random number so that allocation doesn't always happen in the same table. The specific implementation of this mask uses the LFSR64 primitive, a 64-bit Linear Feedback Shift Register from Chisel's util package, to generate pseudo-random numbers. In the Verilog code, this corresponds to the allocLFSR_lfsr register. During training, a 7-bit saturated counter bankTickCtrs is used to count the net difference between allocation failures and successes. When the number of allocation failures is sufficiently large such that the bankTickCtrs counter reaches saturation, a global useful bit reset is triggered, clearing all usefulness fields to zero.
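Below is a hedged plain-Scala sketch of this allocation step and the tick counter. The names, the 7-bit tick width, and the use of scala.util.Random in place of the LFSR64 stream are illustrative assumptions, not the actual implementation.

```scala
// Hedged sketch of entry allocation and the useful-reset tick counter.
// `canAlloc` has bit i set when table T(i+1) is longer than the provider and its usefulness bit is 0.
object AllocSketch {
  val TageNTables = 4
  val rng = new scala.util.Random(42)                          // stands in for the LFSR64 stream

  /** Returns the table index to allocate into, or None if allocation fails. */
  def chooseAllocTable(canAlloc: Int): Option[Int] = {
    val randomMask = rng.nextInt(1 << TageNTables)              // randomly masks some tables out
    val masked     = canAlloc & randomMask
    val pool       = if (masked != 0) masked else canAlloc      // fall back to the unmasked candidates
    if (pool == 0) None
    else Some(Integer.numberOfTrailingZeros(pool))              // shortest-history candidate wins
  }

  /** Tick counter: accumulates allocation failures net of successes; saturation clears all useful bits. */
  def updateTick(tick: Int, failures: Int, successes: Int, max: Int = 127): (Int, Boolean) = {
    val next  = math.min(math.max(tick + failures - successes, 0), max)
    val reset = next == max                                     // saturation triggers a global useful-bit clear
    (if (reset) 0 else next, reset)
  }
}
```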

Finally, during initialization/when TAGE table allocates a new entry, the ctr counters in all entries are set to 0, and all usefulness fields are set to 0.

Note: The usefulness values for the two branch instructions within the same prediction block are not necessarily equal. If the altpred predicts taken for the first branch and not taken for the second, while the provider predicts taken for both, only the usefulness for the second branch, where the provider and altpred disagree, can be set to one.

Note: meta refers to the data used by the predictor during the prediction phase. It is brought back during the update phase for updating. It is called meta because the composer integrates all predictors and interacts with the outside world using a common meta interface. The specific composition of TAGE's meta is shown in the figure below:

TAGE Meta Composition

TAGE: Alternate Prediction Logic (USE ALT ON NA)

When the confidence counter ctr of a tag-matching entry is 0, altpred is sometimes more accurate than the normal prediction. Therefore, a 4-bit counter useAltCtr is used in the implementation to dynamically decide whether to use the alternate prediction when the longest history match result is not confident. In the implementation, for timing considerations, the base prediction table result is always used as the alternate prediction, which introduces very little accuracy loss.

The specific implementation of the alternate prediction logic is as follows:

ProviderUnconf indicates that the longest history match result is not confident. When the provider's ctr value is 0b011 or 0b100 (the two weakest values of the 3-bit counter), the match is not confident and ProviderUnconf is true; for any other ctr value the match is considered confident and ProviderUnconf is false.

useAltOnNaCtrs is a group of 128 4-bit saturated counters. Each counter is initialized to 0b1000. When TAGE receives a training update request, if, in the prediction being trained, the provider's prediction result differs from altpred's and the provider's prediction result is not confident, then the correctness of the alternate prediction result is considered. If the alternate prediction result is correct and the provider is incorrect, the corresponding useAltOnNaCtrs counter value is incremented by 1. If the alternate prediction result is incorrect and the provider is correct, the corresponding useAltOnNaCtrs counter value is decremented by 1. Since useAltOnNaCtrs are saturated counters, if the useAltOnNaCtrs value is already 0b1111 and correct, or already 0b0000 and incorrect, the useAltOnNaCtrs value remains unchanged.

useAltOnNa is derived from the most significant bit of the counter result obtained by indexing the useAltOnNaCtrs counter group using pc(7, 1), i.e., the corresponding lower bits of the pc.

When ProviderUnconf && useAltOnNa is true, the alternate prediction result (i.e., the base prediction table result) is used as the final prediction result for TAGE, instead of the provider's prediction result.
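The use-alt-on-na mechanism can be summarized with the following hedged sketch (128 4-bit counters indexed by pc(7,1); names are illustrative, not the Chisel source):

```scala
// Hedged sketch of the use-alt-on-na decision and training described above.
object UseAltOnNaSketch {
  val ctrs = Array.fill(128)(8)                       // each counter starts at 0b1000

  def idx(pc: Long): Int = ((pc >> 1) & 0x7F).toInt   // pc(7, 1)

  /** Final direction: use the base-table prediction when the provider is weak
    * and the counter's MSB says the alternate prediction tends to be better. */
  def finalTaken(pc: Long, providerTaken: Boolean, providerUnconf: Boolean,
                 basePredTaken: Boolean): Boolean = {
    val useAltOnNa = (ctrs(idx(pc)) & 0x8) != 0       // MSB of the 4-bit counter
    if (providerUnconf && useAltOnNa) basePredTaken else providerTaken
  }

  /** Training: when provider and altpred disagreed and the provider was weak,
    * move the counter toward whichever of the two was actually correct. */
  def train(pc: Long, altCorrect: Boolean): Unit = {
    val i = idx(pc)
    ctrs(i) = if (altCorrect) math.min(ctrs(i) + 1, 15) else math.max(ctrs(i) - 1, 0)
  }
}
```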

TAGE: Wrbypass

Wrbypass contains both a Mem and a Cam and is used to order updates. Each time TAGE updates, it writes into this wrbypass and simultaneously writes into the SRAM of the corresponding prediction table. Every update first checks the wrbypass: on a hit, the ctr value read from the wrbypass is used as the old value, and the stale ctr value that was carried to the backend with the branch instruction and sent back to the frontend is discarded. This way, if a branch is updated repeatedly, the wrbypass ensures that each update sees the final value produced by the immediately preceding update.

T0 has a corresponding wrbypass, and T1~T4 each have 16 wrbypasses. In the wrbypass corresponding to each prediction table, Mem has 8 entries. T0's Mem stores 2 prediction table entries per entry (corresponding to two banks), while T1~T4's Mem stores 1 prediction table entry per entry (because the two banks are in different wrbypasses). Cam has 8 entries. Inputting the update idx and tag (T0 doesn't have a tag) allows reading the position of the corresponding data in Cam. Cam and Mem are written simultaneously, so the position of data in Cam is its position in Mem. Thus, using this Cam, we can check during updates if the data for the corresponding idx is in the wrbypass.
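Conceptually, the wrbypass behaves like a small fully associative buffer keyed by (idx, tag). The sketch below is a software analogy only; the FIFO-style replacement is an assumption made for illustration, since the replacement algorithm is not spelled out here.

```scala
import scala.collection.mutable

// Hedged sketch of the wrbypass idea: remember the most recently written ctr
// for an (idx, tag) so back-to-back updates to the same entry see fresh data.
object WrBypassSketch {
  val entries = 8
  private val cam = mutable.LinkedHashMap.empty[(Int, Int), Int] // (idx, tag) -> ctr

  def lookup(idx: Int, tag: Int): Option[Int] = cam.get((idx, tag))

  def write(idx: Int, tag: Int, ctr: Int): Unit = {
    if (!cam.contains((idx, tag)) && cam.size == entries)
      cam.remove(cam.head._1)                 // simple FIFO-style replacement (assumption)
    cam((idx, tag)) = ctr
  }
}
```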

Storage Structure

  • 4 history tables. Each table is divided into a total of 8 banks based on the lower bits of the pc. Each bank has 256 sets. Each set corresponds to the maximum of 2 branches in an FTB entry. Each bank has a total of 512 entries, each table has a total of 4096 entries, and all tables have a total of 16K entries.
  • Each entry contains 1-bit valid, 8-bit tag, 3-bit ctr, 1-bit useful. The useful bit is stored independently.
  • The base table has 2048 sets. Each set has one 2-bit saturated counter for each of the two branches, recording a total of 4K branches.
  • Each bank of the history table has an 8-entry write cache (wrbypass). The base table has an 8-entry wrbypass.
  • The use_alt_on_na prediction uses a total of 128 4-bit saturated counters.

Indexing Method

  • index = pc[11:1] ^ folded_hist(11bit)
  • tag = pc[11:1] ^ folded_hist(8bit) ^ (folded_hist(7bit) << 1)
  • History uses basic segmented XOR folding.
  • Two types of correspondence between the two branches per entry and the two FTB slots are selected using pc[1]. Due to the mechanism of FTB entry creation, the first slot is used more often than the second; this approach alleviates the uneven distribution (see the sketch after this list).
  • The base table and use_alt_on_na both use lower bits of pc for direct indexing.
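A tiny sketch of the pc[1]-based reordering mentioned above; the exact mapping chosen for each pc[1] value is an assumption made for illustration:

```scala
// Hedged sketch: map the two counters of an entry onto the two FTB slots,
// swapping the correspondence depending on pc[1].
object SlotReorderSketch {
  /** Map per-entry positions (a, b) to FTB slots (slot0, slot1). */
  def toSlots[T](pc: Long, a: T, b: T): (T, T) =
    if (((pc >> 1) & 1) == 0) (a, b) else (b, a)
}
```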

Prediction Flow

  • s0: Index calculation is performed, and the address is sent to SRAM. Only the corresponding banks are enabled.
  • s1: Tag matching, bank data selection, and reordering between the two slots are performed to obtain the overall result from each history table. These results are sent to the top level for the longest history selection. Combined with the use_alt_on_na mechanism, a selection is made between the longest history match and the base table (the original algorithm is simplified to always use the base table as the alternate prediction). After obtaining the prediction result for each branch, it is temporarily stored in s2.
  • s2: The prediction result is used. It is compared with the s1 result within the BPU to determine if the pipeline needs to be flushed.

Training Flow

Training Flow

After receiving a training request from the FTQ, updates are performed based on the information recorded during prediction. The update flow is divided into two pipeline stages. The first stage judges some update conditions outside the history table. Details are as follows:

  • Provider Update: The longest history table hit during prediction will be updated.
    • When the alternate prediction is different, the new useful bit is determined based on whether the provider was correct.
    • The increment/decrement of the ctr is decided based on the actual direction.
  • Alloc: Allocation is performed when there is a misprediction, excluding the case where the provider was correct but a wrong alternate prediction result was chosen during prediction.
    • During allocation, a mask is generated based on the useful bit information of all history tables recorded during prediction, and some bits are randomly masked out using a random number from the LFSR. It is then checked whether an allocatable entry remains in a history table longer than the provider:
      • If yes, use the entry with the shortest history among them.
      • If no, use the allocatable entry with the shortest history longer than the provider, before the LFSR mask was applied.
  • Useful Reset Counter: An 8-bit saturated counter is incremented/decremented based on the net number of allocation successes/failures. If the number of failures is too high and reaches saturation, all useful bits are cleared.
    • Let n be the number of history tables longer than the provider. Those with useful bit 0 are considered allocation successes, and vice versa are considered failures. The absolute difference between these counts is the value of the increment/decrement.
  • use_alt_on_na Training: When the provider's ctr is one of the two weakest values, and the alternate prediction direction differs from the provider, the use_alt_on_na saturated counter indexed by lower bits of pc is incremented or decremented based on the correctness of the alternate prediction.

The second pipeline stage sends the update request into each prediction table, attempting to write to SRAM.

  • The old value of the ctr might come from prediction content long ago, which could be significantly different from the current state of the ctr in the table. To solve this problem, the wrbypass mechanism is introduced:
    • Wrbypass is a fully associative write cache for the TAGE table. Whenever a write is attempted, this fully associative table is first queried based on the history table index. If it hits, the corresponding content is used as the old ctr. It is then updated according to the corresponding jump direction and written to both SRAM and the wrbypass simultaneously. If it does not hit, an entry is selected for replacement according to the replacement algorithm.
    • Each bank has a corresponding 8-entry fully associative wrbypass. Each table has a total of 64 entries (8 banks * 8 entries/bank).
  • Since single-port SRAM is used, read and write requests cannot be handled simultaneously. Therefore, to avoid read-write conflicts:
    • A bank-based approach is used.
    • No write is performed when saturated and the new value equals the old value.
  • If a read-write conflict still occurs, the write request is handled, but the read request is not blocked. When using the read result, it is treated as not hit by default. This might reduce accuracy.

SC: Design Concept

In some applications, some branch behaviors have a weak correlation with branch history or path, exhibiting a statistical prediction bias. For these branches, using counters to capture the statistical bias is more effective than TAGE. TAGE is very effective at predicting strongly correlated branches, but it fails to predict branches with a statistical bias, such as those with a small bias in one direction but no strong correlation with the history path.

The purpose of SC statistical correction is to detect unreliable predictions and restore them. SC is responsible for predicting conditional branch instructions with a statistical bias and, in such cases, reversing the result of the TAGE predictor.

SC: Hardware Implementation

SC maintains a total of 4 prediction tables. The specific parameters of the prediction tables are shown in the table below.

| ID | Number of Entries | ctr Length | Folded History Length | Design Principle |
| --- | --- | --- | --- | --- |
| T1 | 512 | 6 | 0 | ghv takes the low 0 bits from ptr, folds and XORs |
| T2 | 512 | 6 | 4 | ghv takes the low 4 bits from ptr, folds and XORs |
| T3 | 512 | 6 | 8 | ghv takes the low 10 bits from ptr, folds and XORs |
| T4 | 512 | 6 | 8 | ghv takes the low 16 bits from ptr, folds and XORs |

SC: Prediction Timing

SC receives the prediction result pred and ctr from TAGE, and folded branch history information and pc from the BPU. It indexes the SC prediction tables using ((index folded branch history) ^ (lower bits of (pc>>1))). Based on the ctr results from the SC prediction tables and the TAGE prediction result pred and ctr, it dynamically determines (see HasSC) whether to reverse the TAGE prediction.

The specific logic for SC's dynamic determination is as follows:

SC implements 2 banks of scThresholds register groups. The 2 banks correspond to TAGE's 2 banks, both due to the maximum of two branch instructions within a prediction block. These are the only two register groups for dynamic determination in SC. Each bank in scThresholds has a 5-bit saturated counter ctr (initial value 0b10000) and an 8-bit thres (initial value 6). To differentiate, we will use tage_ctr, sc_ctr, and thres_ctr to refer to the ctr passed from TAGE to SC, SC's own ctr, and the ctr within scThresholds, respectively.

Update of scThresholds: When SC receives training data, if SC flipped the TAGE prediction result at that time, and the absolute value of (sc_ctr_0*2+1)+(sc_ctr_1*2+1)+(sc_ctr_2*2+1)+(sc_ctr_3*2+1)+(2*(tage_ctr-4)+1)*8 (signed carry-propagating addition) is within the range [thres-4, thres-2], then scThresholds is updated as follows (a sketch follows the list below):

  • Update of ctr in scThresholds: If the prediction was correct, thres_ctr +1; if the prediction was incorrect, thres_ctr -1. If thres_ctr has already reached 0b11111 and was correct, or has already reached 0b00000 and was incorrect, thres_ctr remains unchanged.
  • Update of thres in scThresholds: If the updated value of thres_ctr reaches 0b11111 and thres <= 31, then thres +2; if the updated value of thres_ctr is 0 and thres >= 6, then thres -2. In other cases, thres remains unchanged.
  • After the thres update decision is finished, another check is performed on thres_ctr. If the updated thres_ctr is 0b11111 or 0, thres_ctr is reset to its initial value 0b10000.
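Putting the rules above together, here is a hedged sketch of one scThresholds bank's update, applied only when the flip-and-near-threshold condition above holds (plain Scala, not the Chisel source):

```scala
// Hedged sketch of the scThresholds update rule (one bank).
object ScThresholdSketch {
  case class Threshold(thresCtr: Int = 0x10, thres: Int = 6) // 5-bit thres_ctr, 8-bit thres

  def update(t: Threshold, predCorrect: Boolean): Threshold = {
    // 5-bit saturating update of thres_ctr
    val ctr1 = if (predCorrect) math.min(t.thresCtr + 1, 31) else math.max(t.thresCtr - 1, 0)
    // thres moves by 2 when thres_ctr has just saturated
    val thres1 =
      if (ctr1 == 31 && t.thres <= 31) t.thres + 2
      else if (ctr1 == 0 && t.thres >= 6) t.thres - 2
      else t.thres
    // a saturated thres_ctr is re-centered to its initial value
    val ctr2 = if (ctr1 == 31 || ctr1 == 0) 0x10 else ctr1
    Threshold(ctr2, thres1)
  }
}
```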

Let scTableSums be (sc_ctr_0*2+1)+(sc_ctr_1*2+1)+(sc_ctr_2*2+1)+(sc_ctr_3*2+1) (signed carry-propagating addition), tagePrvdCtrCentered be (2*(tage_ctr-4)+1)*8 (signed carry-propagating addition), and totalSum be scTableSums + tagePrvdCtrCentered (signed carry-propagating addition). If scTableSums > (signed thres - tagePrvdCtrCentered) and totalSum is positive, OR if scTableSums < (-signed thres - tagePrvdCtrCentered) and totalSum is negative, the threshold has been exceeded and the TAGE prediction result is reversed. Otherwise, the TAGE prediction result is used unchanged.
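The decision can be sketched as follows; when the threshold is cleared, the sign of totalSum gives the final direction, consistent with the prediction-flow description later in this section:

```scala
// Hedged sketch of the SC threshold decision described above.
object ScDecisionSketch {
  def decide(scCtrs: Seq[Int],     // the four 6-bit signed sc_ctr values, range [-32, 31]
             tageCtr: Int,         // 3-bit provider ctr, range [0, 7]
             thres: Int,           // current 8-bit threshold
             tageTaken: Boolean): Boolean = {
    val scTableSums         = scCtrs.map(c => c * 2 + 1).sum
    val tagePrvdCtrCentered = (2 * (tageCtr - 4) + 1) * 8
    val totalSum            = scTableSums + tagePrvdCtrCentered
    val aboveThres = scTableSums >  (thres - tagePrvdCtrCentered) && totalSum > 0
    val belowThres = scTableSums < (-thres - tagePrvdCtrCentered) && totalSum < 0
    if (aboveThres) true            // threshold exceeded on the taken side
    else if (belowThres) false      // threshold exceeded on the not-taken side
    else tageTaken                  // otherwise keep the TAGE result
  }
}
```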

SC's prediction algorithm depends on the provided signal from TAGE indicating a history table hit, and the taken result of the provider's prediction, to decide its own prediction. provided is one of the necessary conditions for using SC's prediction. The provider's taken result acts as a choose bit, selecting SC's final prediction. This is because SC may have different predictions in scenarios where the TAGE prediction results differ.

SC reversing the TAGE prediction result will cause a new entry to be added to the TAGE table.

SC requires 3 cycles of latency:

  • Cycle 0: Generate the addressing index to get s0_idx.
  • Cycle 1: Read out the counter data s1_scResps corresponding to s0_idx from the SCTable.
  • Cycle 2: Decide whether to reverse the prediction result based on s1_scResps.
  • Cycle 3: Output the complete prediction result.

SC: Wrbypass

Wrbypass contains both a Mem and a Cam and is used to order updates. Each time SC updates, it writes into this wrbypass and simultaneously writes into the SRAM of the corresponding prediction table. Every update first checks the wrbypass: on a hit, the ctr value read from the wrbypass is used as the old value, and the stale ctr value that was carried to the backend with the branch instruction and sent back to the frontend is discarded. This way, if a branch is updated repeatedly, the wrbypass ensures that each update sees the final value produced by the immediately preceding update.

SC's T1~T4 each have 2 wrbypasses. In the wrbypass for each prediction table, Mem has 16 entries, each storing 2 prediction table entries. Cam has 16 entries. Inputting the update idx allows reading the position of the corresponding data in Cam. Cam and Mem are written simultaneously, so the position of data in Cam is its position in Mem. Thus, using this Cam, we can check during updates if the data for the corresponding idx is in the wrbypass.

SC: Predictor Training

sc_ctr (see signedSatUpdate) is a 6-bit signed saturated counter. It increments by 1 after the instruction is actually taken, and decrements by 1 after it is not taken. The count range is [-32, 31].
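A one-line sketch of this signed saturating update:

```scala
// Sketch of the 6-bit signed saturating update for sc_ctr described above.
object ScCtrSketch {
  def signedSatUpdate(ctr: Int, taken: Boolean): Int =
    if (taken) math.min(ctr + 1, 31) else math.max(ctr - 1, -32)   // range [-32, 31]
}
```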

Note: meta refers to the data used by the predictor during the prediction phase. It is brought back during the update phase for updating. It is called meta because the composer integrates all predictors and interacts with the outside world using a common meta interface. The specific composition of SC's meta is shown in the figure below:

SC Meta Composition

Overall Block Diagram

Overall Block Diagram

Interface Timing

s0 input pc and folded history timing example

pc and folded history timing

The figure above shows an example of s0 input pc and folded history timing. When io_s0_fire is high, the input io_in_bits data is valid.

Storage Structure

  • 4 tables, with history lengths 0, 4, 10, and 16 respectively. Each table has 256 entries, and each entry contains 4 6-bit wide saturated counters, corresponding to 2 branches * 2 TAGE prediction results (Taken/Not Taken). Each table therefore holds 1024 counters, for a total storage overhead of 3KB across the four tables.
  • Each table has a 16-entry write cache (wrbypass).
  • 2 threshold registers, corresponding to the branches in one slot. Each threshold has an 8-bit wide threshold saturated counter and a 5-bit increment/decrement buffer saturated counter.

Indexing Method

  • index = pc[8:1] ^ folded_hist(8bit)
  • History uses basic segmented XOR folding.
  • Two types of correspondence between the two branches per entry and the two FTB slots are selected using pc[1]. Due to the mechanism of FTB entry creation, the usage rate of the first slot is higher than the second. This approach alleviates this uneven distribution.

Prediction Flow

  • s0: Index calculation is performed, and the address is sent to SRAM.
  • s1: Saturated counters are read out. Reordering between slots is performed. The SC ctrs corresponding to the two TAGE prediction results are sent to the top level for addition separately. The result is registered to s2.
  • s2: The SC's ctrs are added to the TAGE's provider ctrs to obtain the final result. The absolute value is taken. If it is greater than the threshold, and TAGE also had a hit, the sign of the addition result is used as the direction for the final prediction. The result is registered to s3.
  • s3: The direction result is used. It is compared with the s2 result within the BPU to determine if the pipeline needs to be flushed.

Training Flow

After receiving a training request from the FTQ, updates are performed based on the information recorded during prediction. The update flow is divided into two pipeline stages. The first stage judges some update conditions outside the history table. Details are as follows:

  • If TAGE hit during prediction, the old SC ctr and old TAGE ctr stored during prediction are added again and compared with the threshold used for the update (prediction threshold * 8 + 21). If the sum is less than this threshold, or if the prediction result does not match the actual execution direction, each counter is trained according to the perceptron training rules. The request is registered to the next cycle and sent to each table (see the sketch after this list).
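A hedged sketch of this training condition and the per-counter perceptron-style update (names and the exact comparison are illustrative of the text above, not the Chisel source):

```scala
// Hedged sketch of the SC update-condition check and counter training.
object ScTrainSketch {
  def needsTraining(scCtrs: Seq[Int], tageCtr: Int, thres: Int,
                    predTaken: Boolean, actualTaken: Boolean): Boolean = {
    val totalSum    = scCtrs.map(c => c * 2 + 1).sum + (2 * (tageCtr - 4) + 1) * 8
    val updateThres = thres * 8 + 21                 // update threshold from the text above
    math.abs(totalSum) < updateThres || predTaken != actualTaken
  }

  /** Perceptron-style rule: move every counter toward the actual direction. */
  def trainCtr(ctr: Int, actualTaken: Boolean): Int =
    if (actualTaken) math.min(ctr + 1, 31) else math.max(ctr - 1, -32)
}
```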

The second pipeline stage sends the update request into each prediction table, attempting to write to SRAM. It still attempts to query the write cache (wrbypass) to get the latest counter value (this query should ideally be done in the previous pipeline stage, instead of directly using the old ctr).

Training Flow