BPU Submodule RAS

Glossary

Abbreviation	Full Name	Description
TOSW	Top Of Speculative Queue Write Pointer	Speculative Queue Write Pointer
TOSR	Top Of Speculative Queue Read Pointer	Speculative Queue Read Pointer
NOS	Next Older Speculative Queue entry Pointer	Pointer to Next Older Speculative Queue entry
ssp	Speculative Stack Pointer	Speculative Stack Pointer
nsp	Next Stack Pointer	Commit Stack Pointer

Functionality

The RAS predictor uses a stack structure to predict control flow with paired matching characteristics, such as function calls and returns. Call (push/call) instructions are characterized as jal/jalr instructions with a destination register address of 1 or 5. Return (ret) instructions are characterized as jalr instructions with a source register address of 1 or 5.

In the implementation, the RAS predictor provides prediction results in both s2 and s3 stages.

The RAS predictor in the Kunming Lake architecture differs significantly from the Nanhu architecture. By introducing a persistent queue, the new RAS predictor structure solves the data pollution problem in local predictors caused by speculative execution. At the same time, the new structure also retains a stack structure similar to the Nanhu architecture as a commit stack, storing committed push information, to compensate for the low storage density of the persistent queue.

Overall Architecture Overview of RAS Predictor Based on Persistent Queue Design

The RAS predictor utilizes paired information about call/return instructions within the branch predictor's FTB blocks to predict return instructions. Since a normally called function returns to the instruction immediately following the call instruction, the RAS predictor can, upon encountering a call instruction, generate a prediction for its function call return address based on the current instruction PC and whether the current instruction is an RVC instruction (to determine instruction width as 2 or 4), and then push this onto the stack.

As modern superscalar out-of-order processors typically employ deep pipeline speculative execution techniques, the branch predictor generates prediction results for subsequent instructions before the execution results of previously predicted instructions are finalized. That is, when facing prediction requests for three consecutive branch instructions A, B, and C, the RAS predictor cannot obtain the final execution results of instructions within block A when predicting B; it can only obtain the final execution results of instructions within block Z (related to A) and speculative results between Z and B based on branch prediction. If there is an incorrect branch prediction result between Z and B, recovery of the branch predictor state involved is required due to the misprediction. As mentioned earlier, the RAS predictor predicts call-return pairs based on a stack structure, and the accuracy of pairing information is crucial to its accuracy. To precisely restore the state of the RAS predictor at the point of misprediction, the most intuitive approach is to backtrack from the current latest prediction point to the misprediction point and undo all changes made to the RAS stack during this interval. However, to improve the timeliness of misprediction recovery, modern processors typically cannot afford such high-complexity operations during recovery. In the Nanhu architecture design, when a misprediction occurs, the RAS predictor only restores the stack top entry corresponding to the mispredicted prediction block and the RAS stack top pointer. This recovery strategy can handle pollution of the RAS stack top and data above it caused by speculative execution-induced mispredictions, but it cannot handle pollution of data below the RAS stack top caused by a pop followed by a push during speculative execution.

This type of pollution can lead to incorrect jump targets for subsequent return instructions, resulting in mispredictions. To address this issue, the new RAS predictor in the Kunming Lake architecture introduces a persistent queue to save all local states during RAS speculative execution, achieving prediction resistant to the aforementioned pollution. Specifically, the stack structure simulated by the persistent queue has three pointers: read pointer TOSR, write pointer TOSW, and bottom pointer BOS. Each entry, in addition to recording its own data, also records a pointer to its previous entry in the persistent queue, NOS. During each stack push operation, TOSW is advanced once to allocate a new storage space for the current data being pushed, and the current TOSR is made to point to the original position of TOSW (i.e., the newly allocated space). The newly pushed entry records the TOSR position before the push operation in its NOS field. During each stack pop operation, TOSR is moved to the position indicated by the NOS pointer within the entry currently pointed to by TOSR. Through the above methods, RAS can traverse all entries of the current version (corresponding to a speculative execution path) of the stack via a forward linked list. This design also does not overwrite any RAS stack data belonging to other versions (corresponding to other speculative execution paths).

To improve the effective storage capacity of RAS, not all RAS entries are stored in the persistent queue form. Based on empirical data, the persistent queue needs about 28 entries at maximum to satisfy the needs of speculative execution paths, so the RTL implementation uses a 32-entry persistent queue. After the instruction block corresponding to a RAS entry (i.e., the prediction block containing the call instruction) is committed, we can release this block from the persistent queue and move it into the commit stack. The release operation is done by changing the BOS pointer to the TOSW pointer corresponding to the committed prediction block. The commit stack uses the same design as the Nanhu architecture: increment the commit stack pointer nsp when pushing and write the pushed data to the new stack top; decrement the stack pointer nsp when popping. Since it only stores committed deterministic information, there is no issue of speculative execution pollution. The range from the original BOS to the new TOSW in the persistent queue may contain push results from other erroneous paths, and these results are naturally released during this BOS movement.

Due to the introduction of both the persistent queue and the commit stack structures, the stack top entry could be in either one. When providing prediction results, it is necessary to dynamically determine the location. The persistent queue is a circular queue, and its pointers, in addition to the address value, also have a flag bit. This bit can help determine the positional relationship between BOS, TOSW, and TOSR. When TOSR is above BOS and below TOSW, the stack top entry is inside the persistent queue. When TOSR is below BOS, the stack top entry is outside the persistent queue, meaning it is in the commit stack. Therefore, we can dynamically select the stack top entry at runtime. Note that the stack top entry we obtain from the commit stack is not always consistent with the stack top of committed instructions. Therefore, we need to maintain another commit stack pointer ssp for the RAS predictor, which represents the position in the commit stack where the entry in the persistent queue will reside after being pushed into the commit stack. When accessing the stack top entry from the commit stack, ssp is used for data reading instead of nsp.

The above discussion is entirely based on the assumption that the persistent queue has sufficient capacity and does not overflow. If push operations are too dense in the current program segment and the backend execution speed is not fast enough, the persistent queue may run out of capacity and overflow. There are two possible handling schemes for this situation: forced overwrite or dynamic disabling of return stack prediction. Currently, the Kunming Lake architecture chooses the latter implementation strategy. When the current BOS and TOSW are about to overlap, BOS is forcibly advanced by one entry to prevent the persistent queue from being accidentally cleared. Since the entry recorded by BOS is not necessarily needed during this period, this strategy can slightly reduce frontend stalls. The drawback is that dense push operations might occur on erroneous paths, and if the overwritten entries are needed, it can lead to a small number of incorrect pops, and the causes of pop errors are complex. The drawback of the dynamic disabling scheme is that dense push operations might occur on the correct path, and because there are entries that have not been written to the return stack, it might lead to incorrect data in the stack. If it is recursion, the impact of this error might be lessened. For controllability, Kunming Lake currently uses the latter scheme.

For timing optimization, the reading/updating of the stack top entry is not completed in the cycle the read request is received, but rather in the previous cycle or N previous cycles based on the push/pop operations in the current cycle. To reduce write operations to the speculative queue, the push/pop results in the BPU stage 2 pipeline are not written directly to the speculative queue, but are delayed by one cycle. Considering the situation where data written in the current cycle needs to be read in the next cycle, a write bypass mechanism is designed. The data prepared for writing will first be used in the current cycle to update the writeEntry item in the write bypass. If the pointer requested for reading in the next cycle matches the pointer position recorded by the write bypass, the bypass value is used; otherwise, the value read from the stack top is used (in practice, due to timing optimization, this logic is moved to the cycle before reading the stack top).

Stage 2 Results

In BPU stage 2, as the prediction results of other branch predictors are not yet fully determined, there might be prediction results that need to be updated, and the current prediction result is not the final determined speculative execution path. The current prediction result assumes that the starting address of the current stage 2 prediction block will not change based on the branch prediction block located in stage 3 at the same time, and that other branch instructions before the call/ret instruction in the current stage 2 prediction block will not be predicted to jump. If the FTB entry transmitted from FTB in stage 2 is valid and contains a push-type instruction, the PC value of the instruction immediately following this instruction is pushed onto the RAS stack. If the FTB entry transmitted from the FTB predictor in stage 2 is valid and contains a pop-type instruction, the address at the current stack top is returned as the result, and the result is popped from the stack.

Within RAS, the above behavior is broken down as follows:

Upon seeing a predicted jump call instruction in the stage 2 FTB entry, the s2_spec_push signal is asserted, and the address information to be pushed is generated based on the current call instruction PC, instructing the internal RASStack module to act. When the RASStack module detects a push action, it uses the incoming stack top address as the predicted address for the newly written persistent queue entry. If the newly written address is the same as the original address and the original stack top counter is not saturated, the new entry's counter is counter+1 based on the original stack top entry's counter; otherwise, the new entry's counter is 0. The generated new entry is used for three purposes: 1. Updates the writeByassEntry register in the next cycle for continuous prediction reading of the stack top entry; 2. Updates the stack top entry in the current cycle for continuous prediction reading; 3. After being pipelined, it is used to update the persistent queue in the cycle after next. At the same time, as described above, TOSR is updated to the current TOSW, TOSW is updated to current TOSW+1. Global ssp and sctr follow similar update algorithms as the counter described above: if the new and old stack top addresses are the same and the original sctr (same as the original stack top entry's ctr) is not saturated, ssp remains unchanged, and sctr=sctr+1; otherwise, ssp=ssp+1, and sctr=0. To handle possible persistent queue overflow situations, if the persistent queue is close to overflowing at this time, the return stack predictor will be paused.

Upon seeing a predicted jump call instruction in the stage 2 FTB entry, the s2_spec_pop signal is asserted, instructing the internal RASStack module to act. When the RASStack module detects a pop action, if the current sctr is non-zero, then sctr=sctr-1 and ssp remains unchanged; otherwise, ssp=ssp-1 and sctr=the new stack top entry's sctr. TOSR = the original top's NOS, and TOSW remains unchanged. The separately maintained stack top entry is also updated using the updated ssp, sctr, TOSR, and TOSW.

Stage 3 Results

Upon seeing a predicted jump return instruction in the stage 3 FTB entry, the s3_push signal is asserted. Upon seeing a predicted jump call instruction in the stage 3 FTB entry, the s3_pop signal is asserted. The prediction result made for the current prediction block in stage 2 is also pipelined to stage 3 (s3_pushed_in_s2 and s3_poped_in_s2). If the action determined in stage 2 differs from that in stage 3, recovery is needed in stage 3. Regardless of whether recovery is needed, stage 3 uses the RAS stack top entry read out in stage 2 as the prediction result.

Since only the following situations exist for push/pop in stage 2 and push/pop in stage 3, stage 3 can undo stage 2 operations via push/pop operations.


S2 push	S3 push	No fix needed
S2 push	S3 keep	Fix by pop
S2 keep	S3 push	Fix by push
S2 keep	S3 keep	No fix needed
S2 keep	S3 pop	Fix by pop
S2 pop	S3 keep	Fix by push
S2 pop	S3 pop	No fix needed

The specific actions of stage 3 push/pop operations within RASStack are the same as in stage 2 and will not be repeated here.

Misprediction State Recovery

After a prediction block leaves the BPU, it may trigger a redirect during execution in the IFU or backend. When encountering a redirect, the RAS predictor needs to perform state recovery. Specifically, the RAS TOSR, TOSW, ssp, and sctr all need to be restored based on the corresponding meta-information before the misprediction occurred. Subsequently, depending on whether the mispredicted instruction itself is a call/return instruction, the stack structure also needs to be adjusted through push and pop operations, respectively. The specific actions of push and pop operations are the same as in the stage 2 and 3 prediction pipelines, and are omitted here.

Committed Entry Migration

As mentioned above, when a prediction block containing a call instruction is committed, BOS is updated to the TOSW value at the time of prediction. Simultaneously, the entry corresponding to the prediction block is written to the top of the commit stack (addressed by nsp) and nsp is updated. The nsp update algorithm is similar to ssp: if recursion exists and the stack top counter is not full, counter=counter+1 and nsp remains unchanged; otherwise, nsp=nsp+1 and counter=0.

Overall Block Diagram

Interface Timing

RAS Module Stage 2 Update Input/Output Interface

The figure above shows one pop and one push update in stage 2 of the RAS module. Since the preceding branch prediction slot for both the push and pop blocks is invalid, a return and a call instruction jump are observed in this pipeline stage, respectively, instructing the RASStack module to pop/push onto the stack. During the push, the FTB's fallThrough address is used as the jump return address. If the last instruction is a truncated RVI call instruction, then this address + 2 is the correct return address.

RAS Module Stage 3 Update Input/Output Interface

RASStack Module Input Interface

The figure above shows one push and one pop operation during stage 2 update of the RASStack module. It can be noted that after the push operation, the stack top read from the RASStack module is updated to the newly pushed value, while after the pop operation, the stack top is restored to the value before the push operation.

RAS Module Redirect Recovery Interface

The figure above shows the scenario of RAS and RASStack module redirect recovery where the recovery point instruction is a call instruction. The redirect signal input by the BPU is pipelined by one cycle within the RAS predictor before being sent to RASStack, used to restore the pointers of various entries in the persistent queue. Since the mispredicted instruction at this point is a call instruction, a new entry also needs to be pushed.

Similarly, if the mispredicted instruction at the point of misprediction is a return instruction, an entry needs to be popped based on the stack top at that time.

RAS Module Instruction Commit Training Interface

The figure shows one instruction commit containing a return and one containing a call instruction. It can be seen that upon commit, the commit stack top changes, and the BOS pointer is adjusted accordingly after the call instruction commits.

RAS Storage Structure

The return stack predictor is divided into a speculative queue (persistent queue) and a commit stack. The speculative queue has 32 entries, and the commit stack has 16 entries.

The structure of a speculative queue entry is as follows:

retAddr	sctr	nos
Return Address	Consecutive occurrences of the same return address	Position of the older entry in the speculative queue

The structure of a commit stack entry is as follows:

retAddr	sctr
Return Address	Consecutive occurrences of the same return address

Prediction and Update

The return stack is different from other predictors. For the speculative queue, prediction and update occur simultaneously. The commit stack is updated when instructions are retired. The following figures show the logic for obtaining the return stack predicted address, speculative Pop, and speculative Push updates to the speculative queue.

Getting Stack Top Address

The return stack top data may be in the speculative stack or in the commit stack. When the speculative stack is empty, the stack top data is in the commit stack. The logic for determining if the speculative stack is not empty is BOS <= TOSR < TOSW. Conversely, if this is false, the speculative stack is empty. (Explanation: When popping from the speculative stack, TOSR moves towards BOS. Therefore, if TOSR is not within the speculative stack range, it means the speculative stack is empty. Where did the speculative stack data go? It's in the commit stack, which is related to the BOS update logic, provided the speculative stack does not overflow.)

Speculative Pop

When performing a Pop operation on the speculative stack, only the TOSR pointer is moved, used to point to the current top of the speculative stack. TOSR := spec_nos(TOSR). NOS is not updated because each speculative stack entry records the nos point. As TOSR moves, it means the current NOS also moves with it. TOSW does not move, the purpose being to retain push history for ease of tracking and backtracking. There is also another pointer, ssp, used to record the position of the stack top during prediction. Its update follows the update rules of the original RAS stack structure. When popping, ssp decrements. Of course, whether ssp decrements depends on the value of sctr.

Speculative Push

When performing a Push operation on the speculative stack, TOSW := TOSW + 1, spec_nos(TOSW) := TOSR, TOSR := TOSW, spec_queue(TOSW) := io.spec_push_addr, ssp = ssp + 1.U. When a new entry is pushed, TOSW points to the new unallocated entry, TOSR points to the new stack top, the NOS pointer will record the position of the previous stack top, which is recorded in the spec_nos queue for subsequent recovery. The address of the newly pushed entry will be recorded in spec_queue. (You might wonder what happens if the stack keeps growing and crashes. --- This is handled by the BOS pointer update and stack overflow mechanisms.)

Commit Stack Update

The update of the commit stack is similar to the usual return stack structure and will not be repeated here. A slight difference is that its Push and Pop signals come from the retirement of corresponding call and ret instructions.