Kunming Lake IFU Module Documentation

Version: V2R2
Status: OK
Date: 2025/01/03
Commit: 7d889d887f665295eec9cdb987e037e008f875a6

Terminology

Abbreviation	Full Name	Description
CRU	Clock Reset Unit	Clock Reset Unit
RVC	RISC-V Compressed Instructions	16-bit compressed instructions defined by RISC-V "C" extension
RVI	RISC-V Integer Instructions	32-bit base integer instructions defined by RISC-V specification
IFU	Instruction Fetch Unit	Instruction Fetch Unit
FTQ	Fetch Target Queue	Fetch Target Queue
PreDecode	Predecoder Module	Predecoder Module
PredChecker	Prediction Check Module	Branch Prediction Result Checker
ICache	L1 Instruction Cache	Level 1 Instruction Cache
IBuffer	Instruction Buffer	Instruction Buffer
CFI	Control Flow Instruction	Control Flow Instruction
PC	Program Counter	Program Counter
ITLB	Instruction Translation Lookaside Buffer	Instruction Address Translation Lookaside Buffer
InstrUncache	Instruction Uncache Module	Instruction MMIO Fetch Handling Unit

Sub-module List

Sub-module	Description
PreDecoder	Predecoder Module
InstrUncache	Instruction MMIO Fetch Handling Unit

Functional Description

The FTQ sends each predicted block request to both the ICache and the IFU. The IFU waits for up to two cache lines of instruction code from the ICache, segments them to produce the initial instruction code within the range of the fetch request, and sends that code to the predecoder. In the next cycle, the valid instruction range is corrected based on the predecode information; at the same time, the instruction code is expanded and sent, together with other information, to the IBuffer module. When the address attribute returned by the ICache query indicates MMIO address space, the IFU must send the address to the MMIO handling unit to fetch the instruction. In this case the processor enters multi-cycle sequential execution mode: the IFU stalls the pipeline until it receives a commit signal from the ROB, and only then allows the next instruction fetch request. The IFU must also apply special handling (the resend mechanism) to 32-bit instructions that straddle two MMIO bus accesses.

Receiving FTQ Fetch Requests

The IFU receives instruction fetch requests from the FTQ in units of predicted blocks, including the predicted block start address, the start address of the next cache line following the start address's cache line, the start address of the next predicted block, the queue pointer of this predicted block in the FTQ, whether this predicted block has a taken CFI instruction and the position of this taken CFI instruction within the predicted block, and request control signals (whether the request is valid and whether the IFU is ready). Each predicted block contains a maximum of 32 bytes of instruction code, up to 16 instructions.
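The request fields listed above can be pictured as a small record. The following is a minimal sketch only: the class name, field names, and the 64-byte cache line size are assumptions made for illustration and are not the names used in IFU.scala.

```python
from dataclasses import dataclass

CACHE_LINE_BYTES = 64  # assumption: 64-byte ICache line


@dataclass(frozen=True)
class FtqFetchRequest:
    """Hypothetical model of one predicted-block request sent by the FTQ."""
    start_addr: int       # predicted block start address
    next_start_addr: int  # start address of the next predicted block
    ftq_idx: int          # queue pointer of this predicted block in the FTQ
    taken_valid: bool     # the block contains a taken CFI instruction
    taken_offset: int     # position of the taken CFI, in 2-byte slots

    @property
    def nextline_start_addr(self) -> int:
        """Start address of the cache line following start_addr's line."""
        line_base = self.start_addr & ~(CACHE_LINE_BYTES - 1)
        return line_base + CACHE_LINE_BYTES


req = FtqFetchRequest(0x8000_0030, 0x8000_0050, ftq_idx=5,
                      taken_valid=False, taken_offset=0)
```

In this sketch the next-line address is derived from the start address; in the hardware it arrives as a separate field of the request.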

Fetching Two Cache Lines

The IFU fetches two consecutive cache lines from the ICache only when the fetch address of a predicted block falls in the latter half of a cache line. This guarantees that the up to 34 bytes a predicted block may need (the 17×2 bytes segmented in F2) are always available. Exception information (page fault and access fault) is generated for each cache line, and the data is then segmented as described in "Instruction Segmentation to Generate Initial Instruction Code" below.

After 2024/06, the ICache implemented a low-power design that performs data selection and concatenation internally. Therefore, the IFU does not need to worry about how the data from the two cache lines are concatenated and selected. It only needs to simply copy the data returned by the ICache and concatenate them together for segmentation. Please refer to the ICache documentation.

You can also refer to the comments in IFU.scala.
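The two-line rule can be sketched as follows. This is an illustration under the assumption of a 64-byte cache line and 2-byte-aligned fetch addresses; the function names are hypothetical.

```python
CACHE_LINE_BYTES = 64  # assumption: 64-byte ICache line


def needs_second_line(start_addr: int) -> bool:
    """True when the start address lies in the latter half of its cache line."""
    return start_addr % CACHE_LINE_BYTES >= CACHE_LINE_BYTES // 2


def lines_to_fetch(start_addr: int) -> list:
    """Base addresses of the cache line(s) requested for this predicted block."""
    base = start_addr - start_addr % CACHE_LINE_BYTES
    lines = [base]
    if needs_second_line(start_addr):
        lines.append(base + CACHE_LINE_BYTES)
    return lines
```

Under these assumptions the rule is sufficient: the largest first-half start offset is 30, and 30 + 34 = 64, so a 34-byte block that starts in the first half never leaves its cache line.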

Instruction Segmentation to Generate Initial Instruction Code

In the next pipeline stage (F1), the PC of every 2-byte slot within the predicted block and some other information are calculated. The request then enters the F2 pipeline stage and waits for the ICache to return the instruction code. In F2, the IFU must check that the instruction code returned by the ICache matches the request in this pipeline stage (the IFU pipeline stages can be flushed while the ICache is not). Based on the cache line exception information returned by the ICache, the exception information (page fault and access fault) for each instruction is generated. At the same time, two valid instruction ranges are calculated from the taken information provided by the FTQ: jump_range, the range used when there is a jump (from the start address of this predicted block to the first jump address), and ftr_range, the range used when there is no jump (from the start address of this predicted block to the start address of the next predicted block). For timing reasons, each of the two ICache ports returns cache lines from two sources (miss and hit). These four cache lines form 4 combinations (two candidates from port 0 combined with two from port 1), which are predecoded simultaneously. F2 selects, in parallel, 17×2 bytes of initial instruction code from the returned 64 bytes of data (40 bytes of which are effective data) based on the predicted block's start address, and sends them to 4 PreDecode modules for predecoding.
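The per-slot PCs and the two ranges can be sketched as bit masks, one bit per 2-byte slot. This is an illustrative model only: the function names are hypothetical, and it assumes that jump_range includes the slot holding the taken CFI and that each range is all ones when it does not apply.

```python
PRED_SLOTS = 16                      # 2-byte slots per predicted block
ALL_SLOTS = (1 << PRED_SLOTS) - 1


def slot_pcs(start_addr: int) -> list:
    """PC of every 2-byte slot in the predicted block (computed in F1)."""
    return [start_addr + 2 * i for i in range(PRED_SLOTS)]


def jump_range(taken_valid: bool, taken_offset: int) -> int:
    """Slots from the block start up to and including the taken CFI slot."""
    if not taken_valid:
        return ALL_SLOTS             # assumption: no restriction without a jump
    return (1 << (taken_offset + 1)) - 1


def ftr_range(taken_valid: bool, start_addr: int, next_start_addr: int) -> int:
    """Slots from the block start up to the start of the next predicted block."""
    if taken_valid:
        return ALL_SLOTS             # assumption: jump_range decides in this case
    slots = (next_start_addr - start_addr) // 2
    slots = max(0, min(PRED_SLOTS, slots))
    return (1 << slots) - 1


def instr_range(taken_valid, taken_offset, start_addr, next_start_addr) -> int:
    """Valid instruction range of the block before PredChecker correction."""
    return (jump_range(taken_valid, taken_offset)
            & ftr_range(taken_valid, start_addr, next_start_addr))
```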

Generating Predecode Information

The PreDecode module accepts the 17 2-byte initial instruction codes segmented in F2 and predecodes them based on the decode table to obtain predecode information: whether each 2-byte code is the start of a valid instruction, whether it is an RVC instruction, whether it is a CFI instruction, the CFI instruction type (branch/jal/jalr/call/ret), and the target address calculation offset for CFI instructions. The encoding of the brType field in the output predecode information is as follows:

Table 1.2 CFI Instruction Type Encoding

CFI Instruction Type	Type Encoding (brType)
Non-CFI Instruction	00
branch Instruction	01
jal Instruction	10
jalr Instruction	11
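As an illustration of how the brType encoding in the table can be derived from an instruction code, the sketch below classifies an instruction using the standard RISC-V opcodes. It is not the decode table used by PreDecode; the function and constant names are hypothetical, and RV64 is assumed (so C.JAL does not exist).

```python
BR_NOT_CFI, BR_BRANCH, BR_JAL, BR_JALR = 0b00, 0b01, 0b10, 0b11


def is_rvc(instr: int) -> bool:
    """A 16-bit compressed instruction has low two bits other than 0b11."""
    return (instr & 0b11) != 0b11


def br_type(instr: int) -> int:
    """Classify an instruction code into the brType encoding of Table 1.2."""
    if is_rvc(instr):
        op = instr & 0b11
        funct3 = (instr >> 13) & 0b111
        if op == 0b01 and funct3 == 0b101:
            return BR_JAL                      # c.j
        if op == 0b01 and funct3 in (0b110, 0b111):
            return BR_BRANCH                   # c.beqz / c.bnez
        if op == 0b10 and funct3 == 0b100:
            rs1 = (instr >> 7) & 0x1F
            rs2 = (instr >> 2) & 0x1F
            if rs1 != 0 and rs2 == 0:
                return BR_JALR                 # c.jr / c.jalr
        return BR_NOT_CFI
    opcode = instr & 0x7F
    if opcode == 0b1100011:
        return BR_BRANCH                       # beq, bne, blt, ...
    if opcode == 0b1101111:
        return BR_JAL                          # jal
    if opcode == 0b1100111:
        return BR_JALR                         # jalr
    return BR_NOT_CFI
```

For example, br_type(0x00008067) (jalr x0, 0(x1), the usual ret) and br_type(0x8082) (c.jr x1) both yield the jalr encoding.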

Generating Instruction Code and Instruction Code Expansion

While generating predecode information, the initial instructions are combined into 4-byte groups (starting from the start address, incrementing the address by 2 bytes, and using the 4 bytes starting from the address as a 32-bit initial instruction code) to generate the instruction code for each instruction.

In the cycle after generating the instruction code and predecode information (F3), the instruction codes for 16 instructions are sent to 16 instruction expanders for 32-bit instruction expansion (RVC instructions are expanded according to the specification, RVI instructions retain their instruction code).
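The 4-byte grouping can be sketched as below: every 2-byte slot is paired with the slot that follows it, so 17 half-words give 16 candidate 32-bit codes. The function names are hypothetical, and the RVC expansion itself is not modeled.

```python
def group_instructions(halfwords: list) -> list:
    """Build 16 candidate 32-bit instruction codes from 17 16-bit half-words.

    Candidate i uses half-word i as its low 16 bits and half-word i + 1 as
    its high 16 bits, i.e. the 4 bytes starting at start_addr + 2 * i.
    """
    if len(halfwords) != 17:
        raise ValueError("expected 17 half-words")
    return [halfwords[i] | (halfwords[i + 1] << 16) for i in range(16)]


def is_rvc(code: int) -> bool:
    """RVC instructions have low two bits other than 0b11."""
    return (code & 0b11) != 0b11


# c.nop, then addi x0, x0, 0 (0x00000013), then c.nop padding
halfwords = [0x0001, 0x0013, 0x0000] + [0x0001] * 14
codes = group_instructions(halfwords)
```

Only the candidates that start at a valid instruction boundary are meaningful; for an RVC instruction the upper 16 bits of the candidate belong to the next instruction and are ignored by the expander.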

Branch Prediction Overriding Flushes the Pipeline

When the FTQ has not cached enough predicted blocks, the IFU may directly use the predicted address provided by a simple branch predictor to fetch instructions. In this case, when the accurate predictor discovers an error in the simple predictor, it needs to notify the IFU to cancel the ongoing fetch request. Specifically, when the BPU's S2 pipeline stage detects an error, it needs to flush the IFU's F0 pipeline stage; when the BPU's S3 pipeline stage detects an error, it needs to flush the IFU's F0/F1 pipeline stages (the BPU's simple predictor provides results in S1, and overriding occurs at the latest in S3, so the IFU's F2/F3 pipeline stages are always based on the best prediction and do not need to be flushed; similarly, there is no flush from BPU S2 to IFU F1).

Upon receiving a flush request from the BPU, the IFU will compare the fetch request pointer in the F0/F1 pipeline stage with the flush request pointer sent by the BPU. If the flush pointer is before the fetch pointer, it indicates that the current fetch request is on an incorrect execution path and the pipeline needs to be flushed; otherwise, the IFU can ignore this flush request sent by the BPU.
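The pointer comparison can be sketched with a circular-queue pointer made of a wrap flag and an index. This is an assumed representation for illustration; the names are hypothetical. The sketch follows the text literally and flushes only when the flush pointer is strictly before the fetch pointer; how equal pointers are treated is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FtqPtr:
    """Hypothetical circular-queue pointer: flag toggles on every wrap."""
    flag: bool
    value: int


def is_before(a: FtqPtr, b: FtqPtr) -> bool:
    """True if a is older than b, assuming they are less than one lap apart."""
    if a.flag == b.flag:
        return a.value < b.value
    return a.value > b.value


def should_flush(flush_valid: bool, flush_ptr: FtqPtr, fetch_ptr: FtqPtr) -> bool:
    """The fetch request is on a wrong path if the flush pointer precedes it."""
    return flush_valid and is_before(flush_ptr, fetch_ptr)


# Which IFU stages each BPU stage may flush, as described in the text.
FLUSHED_STAGES = {"S2": ("F0",), "S3": ("F0", "F1")}
```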

Early Branch Prediction Error Checking

To reduce flushes for some easily identifiable branch prediction errors, the IFU uses the predecode information generated in F2 to perform frontend branch prediction error checking in the F3 pipeline stage. The predecode information is first sent to the PredChecker module, which uses the CFI instruction types in it to check for jal type errors, ret type errors, invalid instruction prediction errors, and non-CFI instruction prediction errors. Simultaneously, it calculates 16 branch target addresses from the instruction code and compares them with the predicted target address to check for branch target address errors. The PredChecker corrects the prediction results for jal type errors and ret type errors, and regenerates a corrected instruction valid range vector fixedRange (1 indicates the instruction is within the predicted block). fixedRange is based on jump_range and ftr_range and, according to the jal and ret check results, narrows the range so that it runs from the start address up to the first detected jal or ret instruction. The error types checked by the PredChecker module are:

jal type error: There is a jal instruction within the predicted block range, but the predictor did not predict a jump for this instruction.
ret type error: There is a ret instruction within the predicted block range, but the predictor did not predict a jump for this instruction.
Invalid instruction prediction error: The predictor made a prediction for an invalid instruction (not within the predicted block range / is the middle of a 32-bit instruction).
Non-CFI instruction prediction error: The predictor made a prediction for a valid but non-CFI instruction.
Branch target address error: The branch target address provided by the predictor is incorrect.
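The five checks can be sketched as a single function over per-slot predecode results. This is a simplified model under stated assumptions, not the PredChecker logic itself: the Slot record, the string labels, and the function name are hypothetical, and the target check is applied only to branch and jal instructions, whose targets can be computed from the instruction code.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Slot:
    valid: bool          # this 2-byte slot is the start of an instruction
    kind: str = "none"   # "none", "branch", "jal", "jalr" or "ret"
    target: int = 0      # target computed from the instruction code


def pred_check(slots: List[Slot], in_range: List[bool],
               taken_idx: Optional[int], pred_target: int
               ) -> Tuple[Dict[int, str], List[bool]]:
    """Return ({slot index: error type}, fixedRange)."""
    errors: Dict[int, str] = {}
    # jal/ret type error: an in-range jal or ret that was not predicted taken.
    first_missed = None
    for i, s in enumerate(slots):
        if s.valid and in_range[i] and s.kind in ("jal", "ret") and i != taken_idx:
            errors[i] = s.kind + "_type"
            first_missed = i
            break
    # fixedRange ends at the first missed jal/ret, if there is one.
    fixed_range = [r and (first_missed is None or i <= first_missed)
                   for i, r in enumerate(in_range)]
    if taken_idx is not None:
        s = slots[taken_idx]
        if not s.valid or not in_range[taken_idx]:
            errors[taken_idx] = "invalid_instruction"
        elif s.kind == "none":
            errors[taken_idx] = "non_cfi"
        elif (fixed_range[taken_idx] and s.kind in ("branch", "jal")
              and s.target != pred_target):
            errors[taken_idx] = "target"
    return errors, fixed_range
```

For instance, a block whose slot 2 holds a jal that the predictor did not mark as taken yields a jal type error at slot 2 and a fixedRange covering slots 0 to 2 only.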

Frontend Redirection

If the F3 branch prediction check result shows that this predicted block has one of the 5 prediction errors described in "Early Branch Prediction Error Checking", the IFU will generate a frontend redirection in the next cycle, flushing pipeline stages except F3. The FTQ and predictor flushes will be completed by the FTQ after being written back by the IFU.

Sending Instruction Code and Frontend Instruction Information to IBuffer

The F3 pipeline stage finally obtains the expanded 32-bit instruction code, as well as the exception information, predecode information, FTQ pointer, and other information required by the backend (such as folded PC) for each of the 16 instructions. Besides the standard valid-ready control signals, the IFU also provides two special signals to the IBuffer: one is the 16-bit io_toIbuffer_bits_valid, which indicates the valid instructions in the predicted block (1 indicates the start of an instruction, 0 indicates the middle of an instruction). The other is the 16-bit io_toIbuffer_bits_enqEnable, which is the logical AND of io_toIbuffer_bits_valid and the corrected predicted block instruction range fixedRange. enqEnable being 1 means this 2-byte instruction code is the start of an instruction and is within the instruction range indicated by the predicted block.
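The relationship between the two 16-bit signals can be written as a bitwise AND, with bit i standing for the i-th 2-byte slot. The sketch below uses hypothetical function names.

```python
PRED_SLOTS = 16


def enq_enable(valid: int, fixed_range: int) -> int:
    """enqEnable = valid AND fixedRange, limited to 16 bits."""
    return valid & fixed_range & ((1 << PRED_SLOTS) - 1)


def enqueued_slots(valid: int, fixed_range: int) -> list:
    """Indices of the 2-byte slots that actually enter the IBuffer."""
    bits = enq_enable(valid, fixed_range)
    return [i for i in range(PRED_SLOTS) if (bits >> i) & 1]
```

With instruction starts at slots 0, 2, 3 and 5 (valid = 0b101101) and a corrected range covering slots 0 to 3 (fixedRange = 0b1111), only slots 0, 2 and 3 are enqueued.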

Writing Back Instruction Information and Misprediction Information to FTQ

In the WB stage, which is the stage after F3, the IFU writes back the instruction PC, predecode information, the position of the mispredicted instruction, the correct jump address, and the correct instruction range of the predicted block to the FTQ. At the same time, it passes the FTQ pointer of that predicted block to distinguish different requests.

Handling 32-bit Instructions Across Predicted Blocks

Because predicted blocks have length constraints, there are cases where the first two bytes and the last two bytes of an RVI instruction are in two different predicted blocks. The IFU first checks in the first predicted block whether the last 2 bytes are the start of an RVI instruction. If so, and if there is no jump in that predicted block, it sets a flag register f3_lastHalf_valid to indicate that the subsequent predicted block contains the latter half of the instruction. During F2 predecoding, two different instruction valid vectors are generated:

One assumes the predicted block start address is the start of an instruction, and generates the instruction valid vector based on whether subsequent instructions are RVC or RVI.
The other assumes the predicted block start address is the middle of an RVI instruction, and generates the instruction valid vector starting from the address + 2 as the beginning of an instruction.

In F3, the choice of which to use as the final instruction valid vector depends on whether the cross-predicted block RVI flag is set. If f3_lastHalf_valid is high, the latter option is chosen (meaning the first 2 bytes of this predicted block are not the start of an instruction). As described in "Fetching Two Cache Lines" above, two cache lines are fetched from the ICache only when the start address is in the latter half of a cache line. Therefore, even if this cross-predicted block RVI instruction also crosses cache lines, each predicted block will receive its complete instruction code. The IFU's processing is merely to count this instruction in the first predicted block and invalidate the 2 bytes at the start address position of the second predicted block by changing the instruction valid vector.
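The two candidate valid vectors and the flag can be sketched as follows. The function names are hypothetical; is_rvc[i] states whether the 2-byte code in slot i has the RVC encoding (low two bits other than 0b11).

```python
def valid_starts(is_rvc: list, first_is_start: bool = True) -> list:
    """Mark the slots that hold the start of an instruction.

    first_is_start=False models the case where slot 0 is the second half of
    an RVI instruction begun in the previous predicted block.
    """
    valid = [False] * len(is_rvc)
    i = 0 if first_is_start else 1
    while i < len(is_rvc):
        valid[i] = True
        i += 1 if is_rvc[i] else 2   # RVC takes one slot, RVI takes two
    return valid


def sets_last_half(valid: list, is_rvc: list, has_jump: bool) -> bool:
    """The last slot starts an RVI instruction and the block has no jump."""
    return valid[-1] and not is_rvc[-1] and not has_jump


def final_valid(is_rvc: list, last_half_valid: bool) -> list:
    """F3 selects one of the two vectors according to the flag."""
    return valid_starts(is_rvc, first_is_start=not last_half_valid)
```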

MMIO Instruction Fetch

When the processor powers on and resets, since memory initialization is not yet complete, the processor needs to fetch instructions from flash storage to run. In this case, the IFU needs to send a 64-bit request to the MMIO bus to fetch instructions from the flash address space. At the same time, the IFU prohibits speculative execution on the MMIO bus. That is, the IFU needs to wait until each instruction has completed execution and the accurate next instruction address is obtained before continuing to send requests to the bus.

After the processor powers on and resets, instruction fetching starts from address 0x10000000. The ICache translates the address through the ITLB to get the physical address. The physical address is checked against the PMP to see if it belongs to the MMIO space, and the check result is returned to the IFU F2 pipeline stage (see ICache documentation). If it is an instruction fetch request for the MMIO address space, the IFU will stall the request in F3 and control MMIO instruction fetching by a state machine, as shown in the figure below:

1. The state machine is in the m_idle state by default. If the F3 pipeline stage holds an MMIO instruction fetch request and no exception has occurred previously, the state machine enters the m_waitLastCmt state.
2. (m_waitLastCmt) The IFU queries the FTQ via the mmioCommitRead port to check whether all instructions before the predicted block in F3 have been committed. If not, it blocks and waits for the preceding instructions to commit¹.
3. (m_sendReq) The request is sent to the InstrUncache module, which sends a request to the MMIO bus.
4. (m_waitResp) After the InstrUncache module returns, the instruction code is extracted from the 64-bit data based on the pc.
   a. If the low bits of the pc are 3'b110, the upper 2B of the 4B at this pc are not valid data, because the MMIO bus bandwidth is limited to 8B and accesses are restricted to aligned regions. If the returned instruction data indicates that the instruction is not an RVC instruction, the request must be resent for pc+2 (i.e., aligned to the next 8B position) to fetch the complete 4B instruction code.
   b. Before resending, a new ITLB address translation and PMP check for pc+2 are required, because pc+2 might be on another page (m_sendTLB, m_TLBResp, m_sendPMP). If the ITLB or PMP reports an exception (access fault, page fault, guest page fault), or if the check reveals that pc+2 is not in the MMIO address space, the exception information is sent directly to the backend without fetching.
   c. If there is no exception, (m_resendReq, m_waitResendResp) a request is sent to InstrUncache and the instruction code is received, as in the m_sendReq and m_waitResp steps.
5. When the IFU has registered the complete instruction code, or an error has occurred (ITLB/PMP error during resend, or a corrupt response from the Uncache module's TileLink bus), (m_waitCommit) the instruction data and exception information can be sent to the IBuffer. Note that MMIO instruction fetching can only issue one non-speculative instruction fetch request to the bus at a time, so only one instruction can be sent to the IBuffer. The IFU then waits for the instruction to commit.
   a. If this instruction is a CFI instruction, the backend initiates a flush to the FTQ.
   b. If it is a sequential instruction, the IFU reuses the frontend redirection path to flush the pipeline, and also reuses the FTQ write-back mechanism, treating it as a mispredicted instruction to be flushed and redirecting to the instruction address +2 or +4 (depending on whether this instruction is RVC or RVI). This mechanism ensures that MMIO fetches only one instruction at a time.
6. After committing, (m_commited) the state machine resets to m_idle and clears all related registers.
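The state machine described above can be approximated by a small transition function. This is a behavioral sketch only, not the RTL: the state names follow the text, while the event names (mmio_req, prev_committed, req_fire and so on) and the helper functions are hypothetical. The skipping of the wait state for NC regions is not modeled.

```python
from enum import Enum, auto


class S(Enum):
    m_idle = auto()
    m_waitLastCmt = auto()
    m_sendReq = auto()
    m_waitResp = auto()
    m_sendTLB = auto()
    m_tlbResp = auto()
    m_sendPMP = auto()
    m_resendReq = auto()
    m_waitResendResp = auto()
    m_waitCommit = auto()
    m_commited = auto()


def halfword_at(data64: int, pc: int) -> int:
    """Select the 16 bits at pc from an aligned 8B MMIO response."""
    return (data64 >> (16 * ((pc >> 1) & 0b11))) & 0xFFFF


def needs_resend(pc: int, halfword: int) -> bool:
    """pc[2:0] == 3'b110 and the instruction at pc is not RVC."""
    return (pc & 0b111) == 0b110 and (halfword & 0b11) == 0b11


def next_state(s: S, ev: dict) -> S:
    """ev maps hypothetical event names to booleans seen in this cycle."""
    g = lambda name: bool(ev.get(name, False))
    if s is S.m_idle:
        return S.m_waitLastCmt if g("mmio_req") and not g("exception") else s
    if s is S.m_waitLastCmt:
        return S.m_sendReq if g("prev_committed") else s
    if s is S.m_sendReq:
        return S.m_waitResp if g("req_fire") else s
    if s is S.m_waitResp:
        if not g("resp"):
            return s
        return S.m_sendTLB if g("need_resend") else S.m_waitCommit
    if s is S.m_sendTLB:
        return S.m_tlbResp if g("tlb_req_fire") else s
    if s is S.m_tlbResp:
        if not g("tlb_resp"):
            return s
        return S.m_waitCommit if g("exception") else S.m_sendPMP
    if s is S.m_sendPMP:
        bad = g("exception") or not g("is_mmio")
        return S.m_waitCommit if bad else S.m_resendReq
    if s is S.m_resendReq:
        return S.m_waitResendResp if g("req_fire") else s
    if s is S.m_waitResendResp:
        return S.m_waitCommit if g("resp") else s
    if s is S.m_waitCommit:
        return S.m_commited if g("committed") else s
    return S.m_idle  # m_commited: reset and clear registers


def run(events: list) -> list:
    """Apply a sequence of per-cycle events starting from m_idle."""
    s, trace = S.m_idle, []
    for ev in events:
        s = next_state(s, ev)
        trace.append(s)
    return trace
```

Feeding run() one event per cycle for a fetch at a pc ending in 3'b110 that returns a non-RVC half-word walks through the resend path (m_sendTLB, m_tlbResp, m_sendPMP, m_resendReq, m_waitResendResp) before reaching m_waitCommit.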

Besides power-on, the debug extension and Svpbmt extension may also cause the processor to jump to an MMIO address space to fetch instructions at any point during execution. Please refer to the RISC-V manual. The handling of MMIO fetching in these cases is the same.

Trigger Implementation for PC Hardware Breakpoint Functionality

In the IFU's FrontendTrigger module, there are a total of 4 Triggers, numbered 0-3. The configuration information for each Trigger (breakpoint type, matching address, etc.) is stored in the tdata register.

When software writes specific values to the CSR registers tselect and tdata1/2, the CSR sends a tUpdate request to the IFU, updating the configuration information in the tdata registers within FrontendTrigger. Currently, the frontend Triggers can only be configured as PC breakpoints (the mcontrol.select bit is 0; when mcontrol.select=1, the Trigger never hits and does not generate an exception).

During instruction fetch, the IFU's F3 pipeline stage queries the FrontendTrigger module and receives the result in the same cycle. FrontendTrigger checks each instruction within the fetch block against each Trigger. When not in debug mode, if the relationship between the instruction's PC and the content of the tdata2 register satisfies the relationship indicated by the mcontrol.match field (XiangShan supports mcontrol.match values 0, 2, 3, corresponding to equal, greater than or equal, less than), the instruction is marked as a Trigger hit. When it executes, a breakpoint exception is generated in the backend, entering M-Mode or debug mode. Frontend Triggers support Chain functionality. When a Trigger's mcontrol.chain bit is set, an exception is generated only if this Trigger and the Trigger with the next higher number hit simultaneously².
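The match and chain rules can be sketched as below. This is an illustrative model, not the FrontendTrigger implementation: the function names are hypothetical, match value 2 is modeled as unsigned greater-than-or-equal following the RISC-V debug specification, and a chain bit set on the last Trigger (which has no successor) is assumed never to fire.

```python
MATCH_EQ, MATCH_GE, MATCH_LT = 0, 2, 3


def trigger_hit(pc: int, tdata2: int, match: int) -> bool:
    """Compare an instruction PC against tdata2 (both treated as unsigned)."""
    if match == MATCH_EQ:
        return pc == tdata2
    if match == MATCH_GE:
        return pc >= tdata2
    if match == MATCH_LT:
        return pc < tdata2
    return False  # unsupported match types never hit in this sketch


def breakpoint_fires(hits: list, chains: list) -> bool:
    """True if some Trigger, or some complete chain of Triggers, hits.

    hits[i] and chains[i] describe Trigger i. A set chain bit links Trigger i
    to Trigger i + 1; a chain fires only when every Trigger in it hits.
    """
    chain_ok = True
    for hit, chain in zip(hits, chains):
        chain_ok = chain_ok and hit
        if not chain:          # this Trigger ends the current chain
            if chain_ok:
                return True
            chain_ok = True    # start evaluating the next chain
    return False
```

A range breakpoint over [lo, hi) can then be expressed with two chained Triggers: Trigger 0 with match 2 and tdata2 = lo and its chain bit set, followed by Trigger 1 with match 3 and tdata2 = hi.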

Overall Design

Overall Block Diagram and Pipeline Stages

Interface Timing

FTQ Request Interface Timing Example

The figure above shows an example of three FTQ requests. req1 only requests cache line line0, followed by req2 requesting line1 and line2. When it comes to req3, due to the instruction cache SRAM write-first mechanism, the read request ready of the instruction cache is pulled low. The valid signal and address of req3 are held until the request is received.

ICache Return Interface and Interfaces to IBuffer and Writeback to FTQ Timing Example

The figure above shows the timing from the instruction cache returning data, through the IFU discovering a misprediction, to the FTQ sending the correct address. The request corresponding to group0 fetched two cache lines, line0 and line1, in the f2 stage. In the next cycle, the IFU performs misprediction checking and simultaneously sends the instructions to the IBuffer. However, at this point the backend pipeline is stalled, so the IBuffer is full and the ready signal on the IBuffer receiver side is pulled low. The signals related to group0 are held until the request is accepted by the IBuffer. The writeback from the IFU to the FTQ, however, is pulled high in the cycle after io_toIbuffer_valid is asserted, because by then the request has entered the wb stage without blocking. This stage latches the check result from PredChecker, reporting that the instruction at the 4th 2-byte position (counting from 0) in group0 was mispredicted and should be redirected to vaddrA. After 4 cycles (flushing and re-running the predictor pipeline), the FTQ resends a predicted block starting at vaddrA to the IFU.

MMIO Request Interface Timing Example

The figure above shows the instruction fetch timing for an MMIO request req1. First, the tlbExcp information returned by the ICache reports that this is an instruction in the MMIO space (other exception signals must be low). Two cycles later, the IFU sends a request to the InstrUncache, and after some time receives the response and the 32-bit instruction code. In the same cycle, the IFU sends this instruction as a predicted block to the IBuffer and simultaneously sends a writeback to the FTQ, reusing the misprediction signal port, with the redirection address being the address of the very next instruction. At this point, the IFU enters a wait state for instruction execution completion. After some time, the rob_commits port reports that this instruction has completed execution and there is no backend redirection. The IFU then re-initiates the instruction fetch request for the next MMIO instruction.

¹ Note that the Svpbmt extension adds an NC attribute, which marks a memory area as non-cacheable but idempotent. This means speculative execution is allowed on NC areas, i.e., fetch requests can be sent to the bus without needing to "wait for preceding instructions to commit". This manifests as the state machine skipping the wait state. See #3944 for the implementation.
² In the past (riscv-debug-spec-draft, corresponding to versions before XiangShan 2024.10.05 merging PR#3693), Chain also required that the mcontrol.timing of the two Triggers be the same. In the new version (riscv-debug-spec-v1.0.0), mcontrol.timing is removed. Currently, XiangShan's scala implementation still retains this bit, but its value is always 0 and cannot be written. The generated verilog code does not have this bit. Reference: https://github.com/riscv/riscv-debug-spec/pull/807.