
L1 TLB Submodule

Design Specifications

  1. Supports receiving address translation requests from Frontend and MemBlock
  2. Supports PLRU replacement algorithm
  3. Supports returning physical addresses to Frontend and MemBlock
  4. Both ITLB and DTLB use non-blocking access
  5. Both ITLB and DTLB entries are implemented using register files
  6. Both ITLB and DTLB entries are fully associative structures
  7. ITLB and DTLB use the current processor privilege level and effective privilege level for memory access execution, respectively
  8. Supports judging whether virtual memory is enabled and whether two-stage translation is enabled within the L1 TLB
  9. Supports sending PTW requests to L2 TLB
  10. DTLB supports duplicating the queried physical address
  11. Supports exception handling
  12. Supports TLB compression
  13. Supports TLB Hint mechanism
  14. Stores four types of TLB entries
  15. TLB refill merges page tables from two stages
  16. Logic for judging a TLB entry hit
  17. Supports resending PTW to get gpaddr after guest page fault

Functionality

Receiving Address Translation Requests from Frontend and MemBlock

Before any memory read or write inside the core, including frontend instruction fetch and backend memory access, the address must be translated by the L1 TLB. Because of the physical distance between the two clients and to avoid mutual interference, the L1 TLB is split into an ITLB (Instruction TLB) serving frontend instruction fetch and a DTLB (Data TLB) serving backend memory access. The ITLB is fully associative, with 48 entries storing pages of all sizes. It receives address translation requests from the Frontend: itlb_requestors(0) through itlb_requestors(2) come from the ICache, with itlb_requestors(2) carrying ICache prefetch requests, and itlb_requestors(3) comes from the IFU for address translation of MMIO instructions.

The ITLB entry configuration and request sources are shown in the two tables below.

ITLB Entry Configuration

| Entry Name | Number of Entries | Organization      | Replacement Algorithm | Stored Content |
| ---------- | ----------------- | ----------------- | --------------------- | -------------- |
| Page       | 48                | Fully associative | PLRU                  | All page sizes |

ITLB Request Sources

| Index         | Source              |
| ------------- | ------------------- |
| requestors(0) | ICache mainPipe     |
| requestors(1) | ICache mainPipe     |
| requestors(2) | ICache fdipPrefetch |
| requestors(3) | IFU                 |

XiangShan's memory access pipeline has 3 Load pipelines and 2 Store pipelines, as well as an SMS prefetcher and an L1 Load stream & stride prefetcher. To handle these numerous requests, the Load pipelines and the L1 Load stream & stride prefetcher share the Load DTLB, the two Store pipelines share the Store DTLB, and prefetch requests use the Prefetch DTLB. In total there are 3 DTLBs, all employing the PLRU replacement algorithm (see Section 5.1.1.2).

The DTLB adopts a fully associative mode, with 48 fully associative entries storing all page sizes. The DTLB receives address translation requests from MemBlock. dtlb_ld receives requests from loadUnits and the L1 Load stream & stride prefetcher, and is responsible for address translation of Load instructions; dtlb_st receives requests from StoreUnits, and is responsible for address translation of Store instructions. Specifically, for AMO instructions, it uses the dtlb_ld_requestor of loadUnit(0) to send requests to dtlb_ld. The SMSPrefetcher sends prefetch requests to a separate DTLB.

The DTLB entry configuration and request sources are shown in the two tables below.

DTLB Entry Configuration

| Entry Name | Number of Entries | Organization      | Replacement Algorithm | Stored Content |
| ---------- | ----------------- | ----------------- | --------------------- | -------------- |
| Page       | 48                | Fully associative | PLRU                  | All page sizes |

DTLB Request Sources

| Module  | Index            | Source                           |
| ------- | ---------------- | -------------------------------- |
| DTLB_LD | ld_requestors(0) | loadUnit(0), AtomicsUnit         |
| DTLB_LD | ld_requestors(1) | loadUnit(1)                      |
| DTLB_LD | ld_requestors(2) | loadUnit(2)                      |
| DTLB_LD | ld_requestors(3) | L1 Load stream & stride prefetch |
| DTLB_ST | st_requestors(0) | StoreUnit(0)                     |
| DTLB_ST | st_requestors(1) | StoreUnit(1)                     |
| DTLB_PF | pf_requestors(0) | SMSPrefetch                      |
| DTLB_PF | pf_requestors(1) | L2 Prefetch                      |

Adopting PLRU Replacement Algorithm

The L1 TLB adopts a configurable replacement policy, defaulting to the PLRU replacement algorithm. In the Nanhu architecture, both the ITLB and the DTLB comprise a NormalPage and a SuperPage, with a relatively complex refill policy. The Nanhu ITLB's NormalPage handles address translation for 4KB pages, while its SuperPage handles 2MB and 1GB pages, so a refilled entry must be steered into NormalPage or SuperPage according to its page size (4KB, 2MB, or 1GB). The Nanhu DTLB's NormalPage handles 4KB pages, while its SuperPage handles all page sizes. NormalPage is direct-mapped: although it has more entries, its utilization is low. SuperPage is fully associative with high utilization, but due to timing constraints it has few entries, resulting in a high miss rate.

The Kunming Lake architecture addresses these issues: within the timing budget, both the ITLB and the DTLB are unified into a 48-entry fully associative structure into which any page size can be refilled, and both use the PLRU replacement policy.
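As a reference for the algorithm, here is a minimal tree-PLRU sketch in plain Scala (the hardware uses an equivalent Chisel implementation); for brevity it assumes a power-of-two entry count, whereas the L1 TLB actually tracks 48 entries.

```scala
// Tree-PLRU: one flag per internal node of a complete binary tree
// (heap layout: children of node i are 2i+1 and 2i+2).
// false = the left subtree holds the older (LRU) entries.
class TreePLRU(n: Int) {
  require(n > 0 && (n & (n - 1)) == 0, "sketch assumes a power-of-two size")
  private val bits = Array.fill(n - 1)(false)

  // On a hit or refill of `way`, make every node on its path point at
  // the subtree NOT taken, marking that side as the older one.
  def access(way: Int): Unit = {
    var node = 0; var lo = 0; var hi = n
    while (hi - lo > 1) {
      val mid = (lo + hi) / 2
      val goRight = way >= mid
      bits(node) = !goRight                 // LRU side is the one not taken
      node = 2 * node + (if (goRight) 2 else 1)
      if (goRight) lo = mid else hi = mid
    }
  }

  // Victim selection: follow the LRU flags from the root down to a leaf.
  def victim: Int = {
    var node = 0; var lo = 0; var hi = n
    while (hi - lo > 1) {
      val mid = (lo + hi) / 2
      val goRight = bits(node)
      node = 2 * node + (if (goRight) 2 else 1)
      if (goRight) lo = mid else hi = mid
    }
    lo
  }
}
```

Typical use is `access(way)` on every hit or refill, and `victim` to pick the entry to evict when a refill arrives.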

The refill policy for the ITLB and DTLB is shown in the table below.

ITLB and DTLB Refill Policy

| Module | Entry Name | Policy                                                      |
| ------ | ---------- | ----------------------------------------------------------- |
| ITLB   | Page       | 48 fully associative entries; any page size can be refilled |
| DTLB   | Page       | 48 fully associative entries; any page size can be refilled |

Returning Physical Address to Frontend and MemBlock

After the L1 TLB translates a virtual address to a physical address, it returns the physical address for the corresponding request to the Frontend or MemBlock, along with whether the request missed and whether a guest page fault, page fault, or access fault occurred. For each request from the Frontend or MemBlock, the ITLB or DTLB sends a response, with tlb_requestor(i)_resp_valid indicating a valid response.

In the Nanhu architecture, although both SuperPage and NormalPage are implemented using register files physically, SuperPage is a 16-entry fully associative structure, and NormalPage is a direct-mapped structure. After reading data from the direct-mapped NormalPage, a tag comparison is still required. Although SuperPage has 16 fully associative entries, only one entry can hit at a time, which is marked by hitVec, selecting the data read from SuperPage. The time to read data + compare tags from NormalPage is much longer than reading data + selecting data from SuperPage. Therefore, from a timing perspective, the DTLB returns a fast_miss signal to MemBlock, indicating a SuperPage miss; the miss signal indicates a miss in both SuperPage and NormalPage.

Simultaneously, in the Nanhu architecture, due to tight timing for DTLB's PMP & PMA checks, PMP needs to be divided into dynamic and static check parts. (See Section 5.4) When an L2 TLB page table entry is refilled into the DTLB, the refilled page table entry is simultaneously sent to PMP and PMA for permission checks, and the check results are stored in the DTLB. The DTLB needs to additionally return signals to MemBlock indicating valid static checks and the check results.

Note that the Kunming Lake architecture optimizes the TLB query entry configuration and the corresponding timing: fast_miss has been removed, and no additional static PMP & PMA check is required. Since these may be restored later for timing or other reasons, the previous two paragraphs are kept for documentation completeness and compatibility.

Blocking and Non-blocking Access

In the Nanhu architecture, the Frontend's instruction fetch requires blocking access to the ITLB, while the backend's memory access requires non-blocking access to the DTLB. In fact, the TLB itself is non-blocking and does not store request information; whether access appears blocking or non-blocking is determined by the request source. When a Frontend fetch misses the TLB, the fetch must wait for the TLB to obtain the result before the instruction can be sent to the backend, which produces a blocking effect. Memory access operations can be scheduled out of order, so when one request misses, another load/store instruction can be scheduled to continue execution, which produces a non-blocking effect.

In the Nanhu architecture this behavior is implemented by the TLB itself: control logic keeps waiting for the page table entry to be fetched via PTW when the ITLB misses. In Kunming Lake, it is guaranteed by the ICache: when the ITLB reports a miss to the ICache, the ICache keeps resending the same request until it hits, producing the blocking effect while the ITLB itself remains non-blocking.

However, it should be noted that both the ITLB and DTLB in the Kunming Lake architecture are non-blocking. Whether the external effect is blocking or non-blocking is controlled by the fetch unit or memory access unit.

Storage Structure of L1 TLB Entries

XiangShan's TLB can be configured in terms of organization structure, including associativity mode, number of entries, and replacement policy. The default configuration is: both ITLB and DTLB are 48-entry fully associative structures, and both are implemented using register files (see Section 5.1.2.3). If an address is read and written in the same cycle, the result can be obtained directly through bypass.

Reference ITLB or DTLB configuration: both use a fully associative structure with 8 / 16 / 32 / 48 entries. Parameterization to change the TLB structure (fully associative / set associative / direct mapped) is currently not supported and requires manual code modification.

Supporting Judging Whether Virtual Memory is Enabled and Whether Two-Stage Translation is Enabled Within the L1 TLB

XiangShan supports the Sv39 page table from the RISC-V manual, with a virtual address length of 39 bits. XiangShan's physical address is 36 bits and can be parameterized.

Whether virtual memory is enabled is determined by the privilege level, the MODE field of the SATP register, etc. This judgment is made inside the TLB and is transparent to the outside of the TLB. For a description of privilege levels, see Section 5.1.2.7; regarding the MODE field of SATP, the Kunming Lake architecture of XiangShan only supports MODE field value 8, which is the Sv39 paging mechanism, otherwise an illegal instruction fault will be reported. From the perspective of modules outside the TLB (Frontend, LoadUnit, StoreUnit, AtomicsUnit, etc.), all addresses are translated by the TLB.

When the H extension is added, deciding whether address translation is enabled also requires determining whether two-stage address translation is active. Two-stage address translation is enabled under either of two conditions: a virtualized memory access instruction is currently executing, or virtualization mode is on and the MODE field of vsatp or hgatp is non-zero. The resulting translation modes are listed below; the translation mode is used to look up the corresponding type of page table in the TLB and to send the PTW request to the L2 TLB.

Two-Stage Translation Modes

| vsatp MODE | hgatp MODE | Translation Mode                              |
| ---------- | ---------- | --------------------------------------------- |
| Non-zero   | Non-zero   | allStage: both stages of translation present  |
| Non-zero   | Zero       | onlyStage1: only the first-stage translation  |
| Zero       | Non-zero   | onlyStage2: only the second-stage translation |
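The selection can be summarized with a small sketch in plain Scala; the enumeration values follow the table, while the function and parameter names are illustrative.

```scala
object TransMode extends Enumeration {
  val noS2xlate, allStage, onlyStage1, onlyStage2 = Value
}

// virtInst:  a virtualized memory-access instruction is executing
// vMode:     virtualization mode is enabled
// vsatpMode / hgatpMode: MODE fields of vsatp / hgatp
def translationMode(virtInst: Boolean, vMode: Boolean,
                    vsatpMode: Int, hgatpMode: Int): TransMode.Value = {
  val twoStage = virtInst || (vMode && (vsatpMode != 0 || hgatpMode != 0))
  if (!twoStage || (vsatpMode == 0 && hgatpMode == 0)) TransMode.noS2xlate
  else if (vsatpMode != 0 && hgatpMode != 0)           TransMode.allStage
  else if (vsatpMode != 0)                             TransMode.onlyStage1
  else                                                 TransMode.onlyStage2
}
```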

Privilege Level of L1 TLB

According to the RISC-V manual, the privilege level for frontend instruction fetch (ITLB) is the current processor privilege level, and the privilege level for backend memory access (DTLB) is the effective privilege level for memory access execution. Both are determined in the CSR module and passed to the ITLB and DTLB. The current processor privilege level is stored in the CSR module; the effective privilege level for memory access execution is jointly determined by the MPRV, MPV, and MPP bits of the mstatus register and the SPVP bit of hstatus. For a virtualized memory access instruction, the effective privilege level is the one stored in hstatus.SPVP. For other instructions with MPRV = 0, the effective privilege level equals the current processor privilege level, and the effective virtualization mode equals the current virtualization mode. With MPRV = 1, the effective privilege level is the one stored in mstatus.MPP, and the effective virtualization mode is the one stored in hstatus.MPV. The privilege levels for the ITLB and DTLB are shown in the table below.

ITLB and DTLB Privilege Levels

| Module | Privilege Level |
| ------ | --------------- |
| ITLB   | Current processor privilege level |
| DTLB   | For virtualized memory access instructions, the privilege level in hstatus.SPVP; otherwise, if mstatus.MPRV = 0, the current processor privilege level and virtualization mode; if mstatus.MPRV = 1, the privilege level stored in mstatus.MPP and the virtualization mode stored in hstatus.MPV |
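As an illustration of the table, a minimal sketch of the DTLB's effective-privilege selection in plain Scala follows; the case-class shape and parameter names are assumptions, not the actual CSR interface.

```scala
case class EffPriv(level: Int, virt: Boolean) // level: 0 = U, 1 = S, 3 = M

def dtlbEffectivePriv(isHyperInst: Boolean,  // virtualized memory-access instruction
                      imode: Int, vCur: Boolean, // current privilege / virt mode
                      mprv: Boolean, mpp: Int,   // mstatus.MPRV / mstatus.MPP
                      mpv: Boolean, spvp: Boolean // hstatus.MPV / hstatus.SPVP
                     ): EffPriv =
  if (isHyperInst) EffPriv(if (spvp) 1 else 0, virt = true) // hstatus.SPVP
  else if (!mprv)  EffPriv(imode, vCur)                     // follow current mode
  else             EffPriv(mpp, mpv)                        // mstatus.MPP / hstatus.MPV
```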

Sending PTW Requests

When an L1 TLB misses, it needs to send a Page Table Walk request to the L2 TLB. Due to the relatively long physical distance between the L1 TLB and L2 TLB, a pipeline stage, called Repeater, is needed in between. Additionally, the repeater needs to filter out duplicate requests to prevent duplicate entries in the L1 TLB. (See Section 5.2) Therefore, the first-level Repeater for ITLB or DTLB is also referred to as a Filter. The L1 TLB sends PTW requests and receives PTW replies via the Repeater. (See Section 5.3)

DTLB Duplicating the Queried Physical Address

In the physical implementation, the dcache and lsu in Memblock are physically distant. If a hitVec is generated in the load_s1 stage of the LoadUnit and then sent separately to the dcache and lsu, it will cause severe timing issues. Therefore, two hitVec signals need to be generated in parallel near the dcache and lsu and sent separately to the dcache and lsu. To cooperate with resolving timing issues in Memblock, the DTLB needs to duplicate the queried physical address into 2 copies, sending one to the dcache and one to the lsu. The physical addresses sent to the dcache and lsu are identical.

Exception Handling Mechanism

Exceptions that can be generated by the ITLB include inst guest page fault, inst page fault, and inst access fault, all of which are delivered to the requesting source, ICache or IFU, for handling. Exceptions that can be generated by the DTLB include: load guest page fault, load page fault, load access fault, store guest page fault, store page fault, and store access fault, all of which are delivered to the requesting source, LoadUnits, StoreUnits, or AtomicsUnit, for handling. The L1TLB does not store the gpaddr, so when a guest page fault occurs, PTW needs to be performed again to obtain the gpaddr. See Section 6 of this document: Exception Handling Mechanism.

Here, additional supplementary explanations are needed regarding exceptions related to virtual-to-physical address translation. We classify exceptions as follows:

  1. Page table related exceptions
    1. When not in virtualization, or during the virtualized VS-Stage: if a page table entry has non-zero reserved bits, is non-aligned, lacks write permission (w), etc. (see the manual for details), a page fault must be reported.
    2. During the virtualized G-Stage: if a page table entry has non-zero reserved bits, is non-aligned, lacks write permission (w), etc. (see the manual for details), a guest page fault must be reported.
  2. Virtual address or physical address related exceptions
    1. Exceptions related to virtual or physical addresses arising during the address translation process. These checks are performed during the L2 TLB PTW process:
      1. When not in virtualization, or during the virtualized allStage, the G-stage gvpn must be checked. If hgatp's mode is 8 (Sv39x4), bits [41-12 = 29] and above of the gvpn must all be 0; if hgatp's mode is 9 (Sv48x4), bits [50-12 = 38] and above of the gvpn must all be 0. Otherwise, a guest page fault is reported.
      2. When a page table entry is obtained during address translation, bits [48-12 = 36] and above of its PPN field must all be 0. Otherwise, an access fault is reported.
    2. Exceptions related to virtual or physical addresses in the original address, summarized below (see the sketch after this list). In theory, these checks should be performed in the L1 TLB; however, since the ITLB's redirect result comes entirely from the Backend, the corresponding ITLB exceptions are recorded when the Backend sends a redirect to the Frontend and are not checked again in the ITLB. Refer to the Backend documentation for that part.
      1. Sv39 mode: virtual memory enabled with virtualization disabled and satp mode 8, or virtual memory enabled with virtualization enabled and vsatp mode 8. Bits [63:39] of vaddr must all equal bit 38 of vaddr. Otherwise, an instruction page fault, load page fault, or store page fault is reported according to whether the request is an instruction fetch, load, or store.
      2. Sv48 mode: virtual memory enabled with virtualization disabled and satp mode 9, or virtual memory enabled with virtualization enabled and vsatp mode 9. Bits [63:48] of vaddr must all equal bit 47 of vaddr. Otherwise, an instruction page fault, load page fault, or store page fault is reported accordingly.
      3. Sv39x4 mode: virtual memory and virtualization enabled, with vsatp mode 0 and hgatp mode 8. (Note: when vsatp mode is 8/9 and hgatp mode is 8, the second-stage translation is also Sv39x4 and corresponding exceptions may occur, but those belong to "exceptions during the address translation process" and are handled during the L2 TLB page table walk; the L1 TLB handles only "exceptions in the original address".) Bits [63:41] of vaddr must all be 0. Otherwise, an instruction guest page fault, load guest page fault, or store guest page fault is reported accordingly.
      4. Sv48x4 mode: virtual memory and virtualization enabled, with vsatp mode 0 and hgatp mode 9. (The same note as for Sv39x4 applies, with the second stage in Sv48x4 mode.) Bits [63:50] of vaddr must all be 0. Otherwise, an instruction guest page fault, load guest page fault, or store guest page fault is reported accordingly.
      5. Bare mode: virtual memory disabled, so paddr = vaddr. Since the XiangShan processor's physical address is currently limited to 48 bits, bits [63:48] of vaddr must all be 0. Otherwise, an instruction access fault, load access fault, or store access fault is reported accordingly.
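The sketch below condenses these original-address checks into plain Scala; the mode strings and fault encoding are illustrative, and the per-access-type variants (instruction / load / store) are collapsed into one constructor each.

```scala
sealed trait VaFault
case object Pf  extends VaFault // instruction / load / store page fault
case object Gpf extends VaFault // instruction / load / store guest page fault
case object Af  extends VaFault // instruction / load / store access fault

def checkOriginalVaddr(fullva: BigInt, mode: String): Option[VaFault] = {
  def bit(i: Int): BigInt = (fullva >> i) & 1
  // all bits from `from` up to 63 must equal `expect`
  def highBits(from: Int, expect: BigInt): Boolean =
    (from to 63).forall(i => bit(i) == expect)
  mode match {
    case "Sv39"   => if (highBits(39, bit(38))) None else Some(Pf)
    case "Sv48"   => if (highBits(48, bit(47))) None else Some(Pf)
    case "Sv39x4" => if (highBits(41, 0)) None else Some(Gpf)
    case "Sv48x4" => if (highBits(50, 0)) None else Some(Gpf)
    case _        => if (highBits(48, 0)) None else Some(Af) // Bare
  }
}
```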

To support the "original address" exception checks described above, the L1 TLB adds fullva (64 bits) and checkfullva (1 bit) as input signals, and vaNeedExt as an output signal. Specifically:

  1. checkfullva is not a qualifier for fullva; fullva may carry valid content even when checkfullva is low.
  2. When must checkfullva be valid (high)?
    1. For the ITLB, checkfullva is always false, so when Chisel generates Verilog, checkfullva may be optimized out and not appear among the inputs.
    2. For the DTLB, the checkfullva check must be performed for every load / store / amo / vector instruction when it is first sent by the Backend to MemBlock. Note that the "original address" exception check targets only the vaddr (for load / store instructions, the computed vaddr is typically a 64-bit value obtained from a register value plus an immediate), so it does not need to wait for a TLB hit; moreover, when such an exception occurs, the TLB does not return miss, which marks the exception as valid. The exception can therefore always be discovered and reported when the instruction is first sent by the Backend to MemBlock: misaligned accesses will not enter the misalign buffer, loads will not enter the load replay queue, and stores will not be resent by the reservation station. Consequently, if the exception is not found on first dispatch, it cannot appear on a load-replay resend, and no checkfullva check is needed then. For prefetch instructions, checkfullva is never raised.
  3. When is fullva valid (when is it used)?
    1. Except for one specific case, fullva is valid only when checkfullva is high and carries the complete vaddr to be checked. Note that the original vaddr computed for a load / store instruction is 64 bits (the value read from the register is 64 bits); only the low 48 / 50 bits (Sv48 / Sv48x4) are used for the TLB lookup, while the full 64 bits are needed for the exception check.
    2. The specific case: a misaligned instruction causes a gpf and the gpaddr must be obtained. The current handling of misaligned exceptions on the memory access side is as follows:
      1. For example, the original vaddr is 0x81000ffb and 8 bytes must be loaded.
      2. The misalign buffer splits the instruction into two loads with vaddr 0x81000ff8 (load 1) and 0x81001000 (load 2); the two loads do not fall on the same virtual page.
      3. For load 1, the vaddr passed to the TLB is 0x81000ff8; for load 2, it is 0x81001000. In both cases fullva is always the original vaddr 0x81000ffb.
      4. If load 1 raises an exception, the offset written to the *tval register is conventionally that of the original addr (i.e., 0xffb); if load 2 raises an exception, the offset written is conventionally the start of the next page (0x000). In the virtualized onlyStage2 case, gpaddr = the faulting vaddr. Therefore, for a cross-page misaligned request that faults on the address after the page crossing, gpaddr is generated from vaddr alone (the offset is then 0x000) and fullva is not used; for a non-cross-page misaligned request, or a cross-page request that faults on the original address, gpaddr generation uses the offset of fullva (0xffb). In this case fullva is always valid, regardless of whether checkfullva is high.
  4. When is vaNeedExt valid (in what situation is it used)? In the memory access queues (load queue / store queue), the 64-bit original address is truncated to 50 bits for storage to save area, but a full 64-bit value must be written to the *tval register. As noted above, for exceptions related to "virtual or physical addresses in the original address" the complete 64-bit address must be retained, while for other page-table-related exceptions the high bits of the address already satisfy the requirements (see the sketch after this list). Two examples:
    1. fullva = 0xffff,ffff,8000,0000; vaddr = 0xffff,8000,0000; mode is non-virtualized Sv39. The original address raises no exception. Suppose this is a load request whose first TLB access misses: the load enters the load replay queue to await retransmission, and its address is truncated to 50 bits. When the load is retransmitted, the page table's V bit turns out to be 0, causing a page fault, and the vaddr must be written to the *tval register. Since the address was truncated in the load replay queue, sign extension is required (for Sv39, extending the bits above 39 with the value of bit 38), so the returned vaNeedExt is raised.
    2. fullva = 0x0000,ffff,8000,0000; vaddr = 0xffff,8000,0000; mode is non-virtualized Sv39. Here the original address itself raises an exception. The address is written directly into the corresponding exception buffer (which stores the complete 64-bit value). In this case the original value 0x0000,ffff,8000,0000 must be written to *tval as-is; sign extension must not be performed, so vaNeedExt is low.
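A minimal sketch of this recovery step, using the first example above (Sv39, sign bit 38); the helper name and the signBit parameter are illustrative.

```scala
// Recover the 64-bit *tval value from the truncated address kept in the
// load/store queues, per the vaNeedExt rule above.
def restoreTvalVaddr(stored: BigInt, vaNeedExt: Boolean, signBit: Int): BigInt =
  if (!vaNeedExt) stored // full 64-bit value kept elsewhere (exception buffer)
  else {
    val high = 63 - signBit // number of bits above the sign position
    if (((stored >> signBit) & 1) == 1)
      stored | (((BigInt(1) << high) - 1) << (signBit + 1)) // sign-extend with 1s
    else
      stored & ((BigInt(1) << (signBit + 1)) - 1)           // sign-extend with 0s
  }

// e.g. restoreTvalVaddr(BigInt("ffff80000000", 16), vaNeedExt = true, signBit = 38)
// yields BigInt("ffffffff80000000", 16), the value written to *tval.
```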

Supports Pointer Masking Extension

Currently, the XiangShan processor supports the pointer masking extension.

The essence of the pointer masking extension is to change the fullva for memory access from the original value "register value + immediate value" to "effective vaddr", where the high bits may be ignored. When the value of pmm is 2, the high 7 bits are ignored; when the value of pmm is 3, the high 16 bits are ignored. pmm = 0 means high bits are not ignored, and pmm = 1 is reserved.

The value of pmm can come from the PMM ([33:32]) bits of mseccfg/menvcfg/henvcfg/senvcfg, or from the HUPMM ([49:48]) bits of the hstatus register. The specific selection is as follows:

  1. For frontend instruction fetch requests, or an hlvx instruction as specified in the manual, pointer masking is not used (pmm is 0).
  2. If the current effective privilege level for memory access (dmode) is M mode, select the PMM ([33:32]) bits of mseccfg.
  3. In a non-virtualized scenario, and if the current effective privilege level for memory access is S mode (HS), select the PMM ([33:32]) bits of menvcfg.
  4. In a virtualized scenario, and if the current effective privilege level for memory access is S mode (VS), select the PMM ([33:32]) bits of henvcfg.
  5. If it is a virtualized instruction, and the current processor privilege level (imode) is U mode, select the HUPMM ([49:48]) bits of hstatus.
  6. In other U mode scenarios, select the PMM ([33:32]) bits of senvcfg.

Since pointer masking only applies to memory access and not frontend instruction fetch, there is no concept of "effective vaddr" in the ITLB, and these signals passed from CSR are not introduced into the ports.

Since the high bits of these addresses are only checked in the "Exceptions related to virtual or physical addresses in the original address" mentioned above, for cases where high bits are masked, we simply prevent exceptions from being triggered. Specifically:

  1. For non-virtualized scenarios with enabled virtual memory, or virtualized scenarios that are not onlyStage2 (where vsatp mode is not 0), perform sign extension on the high 7 or 16 bits of the address, depending on whether pmm is 2 or 3, respectively.
  2. For onlyStage2 scenarios in virtualization, or when virtual memory is not enabled, perform zero extension on the high 7 or 16 bits of the address, depending on whether pmm is 2 or 3, respectively (see the sketch below).
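The following is a minimal sketch of this masking rule in plain Scala; the function name and the signExtend input (true for the sign-extension cases above) are illustrative, while the pmm encoding follows the text.

```scala
// pmm: 0 = no masking, 1 = reserved, 2 = ignore high 7 bits, 3 = ignore high 16.
def applyPointerMask(fullva: BigInt, pmm: Int, signExtend: Boolean): BigInt = {
  val ignored = pmm match { case 2 => 7; case 3 => 16; case _ => 0 }
  if (ignored == 0) fullva
  else {
    val keep = 64 - ignored
    val low  = fullva & ((BigInt(1) << keep) - 1)
    // sign-extend cases (stage-1 translation active): replicate bit keep-1;
    // zero-extend cases (onlyStage2 or no virtual memory): leave zeros.
    if (signExtend && ((fullva >> (keep - 1)) & 1) == 1)
      low | (((BigInt(1) << ignored) - 1) << keep)
    else low
  }
}
```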

Supports TLB Compression

TLB Compression Diagram

The Kunming Lake architecture supports TLB compression. Each compressed TLB entry saves 8 consecutive page table entries, as shown in the figure above. The theoretical basis for TLB compression is that when the operating system allocates pages, due to mechanisms like the buddy allocation algorithm, it tends to allocate contiguous physical pages to contiguous virtual pages. Although page allocation becomes increasingly disordered as programs run, this page contiguity is generally present. Therefore, multiple consecutive page table entries can be merged into a single TLB entry in hardware, thereby increasing the effective capacity of the TLB.

That is, for page table entries whose virtual page numbers share the same high bits, when the high bits of their physical page numbers and their page table attributes are also the same, these page table entries can be compressed into a single entry for storage, increasing the effective capacity of the TLB. The compressed TLB entry shares the high bits of the physical page number and the page table attributes; each page table entry keeps its own low bits of the physical page number, and a valid bit indicates whether that page table entry is valid within the compressed TLB entry, as shown in the table below.

The table below compares the stored content before and after compression. Before compression, the tag is the full vpn. After compression, the tag is the high 24 bits of the vpn; the low 3 bits need not be stored, since for the i-th of the 8 consecutive page table entries, i is exactly the low 3 bits of its vpn. The high 21 bits of the ppn are shared, and ppn_low separately stores the low 3 bits of the ppn for each of the 8 entries. valididx indicates the validity of the 8 page table entries; entry i is valid only when valididx(i) is 1. pteidx(i) marks the entry corresponding to the original request, i.e., the value of the low 3 bits of the original request's vpn.

Here is an example for illustration. For example, if a vpn is 0x0000154, its low three bits are 100, which is 4. When refilled into L1 TLB, the 8 page table entries from vpn 0x0000150 to 0x0000157 will all be refilled and compressed into 1 entry. For example, if the high 21 bits of the ppn for vpn 0x0000154 are PPN0 and the page table attributes are PERM0, and if the high 21 bits of the ppn and page table attributes for the i-th of these 8 page table entries are also PPN0 and PERM0, then valididx(i) is 1, and the low 3 bits of the i-th page table entry are stored in ppn_low(i). Additionally, pteidx(i) represents the i-th entry corresponding to the original request. Here, the low three bits of the original request's vpn are 4, so pteidx(4) is 1, and all other pteidx(i) are 0.

Furthermore, the TLB does not compress query results for large pages (1GB, 2MB). For large pages, when returning, each bit of valididx(i) will be set to 1. According to the page table query rules, large pages do not actually use ppn_low, so the value of ppn_low can be arbitrary.

Stored Content Per Entry Before and After TLB Compression

| Compressed? | tag     | asid    | level  | ppn     | perm                  | valididx   | pteidx     | ppn_low    |
| ----------- | ------- | ------- | ------ | ------- | --------------------- | ---------- | ---------- | ---------- |
| No          | 27 bits | 16 bits | 2 bits | 24 bits | Page table attributes | Not stored | Not stored | Not stored |
| Yes         | 24 bits | 16 bits | 2 bits | 21 bits | Page table attributes | 8 bits     | 8 bits     | 8 × 3 bits |

After implementing TLB compression, the L1 TLB hit condition changes from a simple TAG hit to: the high bits of the vpn must match the tag, and the valididx(i) bit indexed by the low 3 bits of the vpn must be valid. The PPN is obtained by concatenating ppn (the high 21 bits) with ppn_low(i).
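A minimal sketch of this hit check in plain Scala, using the field widths from the table; the class and function names are illustrative.

```scala
case class CompressedEntry(tag: BigInt,               // high 24 bits of vpn
                           ppnHigh: BigInt,           // high 21 bits of ppn
                           ppnLow: Vector[Int],       // 8 x low 3 bits of ppn
                           valididx: Vector[Boolean]) // 8 valid bits

def lookup4K(vpn: BigInt, e: CompressedEntry): Option[BigInt] = {
  val idx = (vpn & 7).toInt                        // low 3 bits of vpn
  if ((vpn >> 3) == e.tag && e.valididx(idx))      // tag hit && valid bit set
    Some((e.ppnHigh << 3) | BigInt(e.ppnLow(idx))) // ppn = high bits ++ low bits
  else None
}
// e.g. for vpn 0x0000154 the index is 4 (low bits 100); the lookup hits
// only if the entry's valididx(4) is set.
```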

Note that when the H extension is added, L1TLB entries are divided into four types. The TLB compression mechanism is not enabled for virtualized TLB entries (but TLB compression is still used in L2TLB). The four types will be described in detail next.

Storing Four Types of TLB Entries

In the L1TLB with the H extension added, the TLB entries are modified as shown in the figure below.

TLB Entry Diagram

Compared to the original design, g_perm, vmid, and s2xlate have been added. g_perm is used to store the perm of the second-stage page table, vmid is used to store the vmid of the second-stage page table, and s2xlate is used to distinguish the type of the TLB entry. Depending on s2xlate, the content stored in the TLB entry varies.

Types of TLB Entries

| Type       | s2xlate | tag | ppn | perm | g_perm | level |
| ---------- | ------- | --- | --- | ---- | ------ | ----- |
| noS2xlate  | b00 | Virtual page number without virtualization | Physical page number without virtualization | Page table entry perm without virtualization | Not used | Page table entry level without virtualization |
| allStage   | b11 | Virtual page number of the first-stage page table | Physical page number of the second-stage page table | Perm of the first-stage page table | Perm of the second-stage page table | Maximum level across the two translation stages |
| onlyStage1 | b01 | Virtual page number of the first-stage page table | Physical page number of the first-stage page table | Perm of the first-stage page table | Not used | Level of the first-stage page table |
| onlyStage2 | b10 | Virtual page number of the second-stage page table | Physical page number of the second-stage page table | Not used | Perm of the second-stage page table | Level of the second-stage page table |

TLB compression is enabled for the noS2xlate and onlyStage1 cases and disabled otherwise. For the allStage and onlyStage2 cases, the L1TLB hit logic uses pteidx to compute the tag and ppn of the effective pte, and refill also differs between these two cases. In addition, asid is valid for noS2xlate, allStage, and onlyStage1, while vmid is valid for allStage and onlyStage2.

TLB Refill Merging Page Tables from Two Stages

In the MMU with the H extension added, the structure returned by PTW is divided into three parts. The first part, s1, is the original PtwSectorResp, storing the first-stage translation page table. The second part, s2, is the HptwResp, storing the second-stage translation page table. The third part is s2xlate, representing the type of this response, again one of noS2xlate, allStage, onlyStage1, and onlyStage2, as shown in the figure below. PtwSectorEntry is a PtwEntry that uses TLB compression; the main difference between the two is the width of tag and ppn.

PTW resp Structure Diagram

For noS2xlate and onlyStage1 cases, only the result of s1 needs to be filled into the TLB entry. The writing method is similar to the original design, filling the corresponding fields of the returned s1 into the corresponding fields of the entry. It should be noted that in the noS2xlate case, the vmid field is invalid.

For the onlyStage2 case, we fill the result of s2 into the TLB entry, with some special handling to fit the TLB compression structure. The asid and perm of the entry are unused, so their fill values do not matter. vmid is filled with the vmid from s1 (the PTW module always fills that field regardless of the case, so it can be used directly). The tag of s2 is filled into the tag of the TLB entry, and pteidx is determined from the low sectortlbwidth bits of the s2 tag. If s2 is a large page, all valididx bits of the TLB entry are set valid; otherwise, only the valididx bit selected by pteidx is valid. The filling of ppn reuses the allStage logic and is described under the allStage case.

For allStage, the page tables from the two stages must be merged. First, the tag, asid, vmid, etc. are filled from s1. Since the entry has only one level field, it is filled with the maximum of the s1 and s2 levels: if the first stage is a large page and the second stage is a small page, a query could otherwise hit the large page even though the address lies outside the range covered by the second-stage page table. The tag must be merged accordingly. For example, if the first tag comes from a level 1 page table and the second tag from a level 2 page table, the level 1 page number from the first tag is combined with the level 2 page number from the second tag (the level 3 page number can be padded with zeros) to form the tag of the new entry. In addition, the perms of s1 and s2 and the s2xlate field are filled. For the ppn, since the guest physical address is not saved, directly storing the ppn of s2 when the first stage is a small page and the second stage is a large page would make the physical address computed on a later query incorrect. Therefore, the tag and ppn of s2 are first concatenated according to the level of s2 (s2ppn provides the high bits, and s2ppn_tmp is constructed to compute the low bits); the high bits are then filled into the ppn field of the TLB entry and the low bits into its ppn_low field.
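The tag merge can be sketched as follows in plain Scala, under an assumed Sv39 layout (three 9-bit vpn segments) and a level convention of 1 = 1 GiB, 2 = 2 MiB, 3 = 4 KiB; the names are illustrative, and the real Chisel code also merges the ppn as described above.

```scala
def mergeTag(s1Tag: BigInt, s1Level: Int,
             s2Tag: BigInt, s2Level: Int): (BigInt, Int) = {
  val level = s1Level max s2Level         // finer of the two page sizes
  def seg(tag: BigInt, i: Int): BigInt =  // i-th 9-bit segment, 1 = highest
    (tag >> (9 * (3 - i))) & 0x1ff
  val segs = (1 to 3).map { i =>
    if (i > level) BigInt(0)              // below the merged level: zeros
    else if (i <= s1Level) seg(s1Tag, i)  // resolved by the stage-1 tag
    else seg(s2Tag, i)                    // finer bits from the stage-2 tag
  }
  (segs.foldLeft(BigInt(0))((acc, s) => (acc << 9) | s), level)
}
// e.g. s1 at level 1 (1 GiB) merged with s2 at level 2 (2 MiB) keeps the
// level-1 segment of s1, the level-2 segment of s2, and zeros at level 3.
```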

Logic for Judging a TLB Entry Hit

There are three types of hits used in L1TLB: query TLB hit, fill TLB hit, and PTW request response hit.

For the query TLB hit, new parameters are added, including vmid, hasS2xlate, onlyS2, onlyS1, etc. During only-second-stage translation, the asid hit is always true. The H extension adds a pteidx hit, enabled for small pages in the allStage and onlyS2 cases, which masks the TLB compression mechanism.

For the fill TLB hit (wbhit), the input is PtwRespS2. The vpn to compare is selected as follows: for only-second-stage translation, the high bits of the s2 tag are used; in other cases, the s1 tag is used, padded with zeros in the low sectortlbwidth bits, and the resulting vpn is compared with the tag of the TLB entry. The H extension modifies the wb_valid judgment and adds pteidx_hit and s2xlate_hit. For a PTW response of only-second-stage translation, wb_valididx is determined from the s2 tag; otherwise, it is connected directly to the valididx of s1. s2xlate_hit compares the s2xlate of the TLB entry with the s2xlate of the PTW response, filtering by TLB entry type. pteidx_hit is used to disable TLB compression: for only-second-stage translation, the low bits of the s2 tag are compared with the pteidx of the TLB entry; for the other two-stage cases, the pteidx of the TLB entry is compared with the pteidx of s1.

For the PTW request response hit, it is mainly used when a PTW response arrives to determine whether the PTW request sent by the TLB corresponds exactly to this response, or during TLB lookup to determine whether the PTW response is the result needed by this TLB request. This method is defined in PtwRespS2 and distinguishes three cases. For noS2_hit (noS2xlate), only the s1 hit needs to be checked. For onlyS2_hit (onlyStage2), only the s2 hit needs to be checked. For all_onlyS1_hit (allStage or onlyStage1), the vpn-hit logic must be redesigned and cannot simply check the s1 hit: the level used for the vpn hit is the maximum of the s1 and s2 levels, and the hit is then judged per level, as sketched below. Additionally, a vasid (from vsatp) hit and a vmid hit are added.
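A minimal sketch of the level-aware vpn comparison, under the same assumed Sv39 layout and level convention as the merge sketch above.

```scala
// Only the vpn segments that the (merged) level actually resolves
// participate in the comparison; lower segments are ignored.
def vpnHitByLevel(reqVpn: BigInt, entryTag: BigInt, level: Int): Boolean =
  (1 to level).forall { i =>
    val shift = 9 * (3 - i)
    ((reqVpn >> shift) & 0x1ff) == ((entryTag >> shift) & 0x1ff)
  }
```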

Supports Resending PTW to Get gpaddr After Guest Page Fault

Since L1TLB does not store the gpaddr in the translation result, when a guest page fault occurs after querying a TLB entry, PTW needs to be performed again to obtain the gpaddr. In this case, the TLB response is still a miss. Some new registers are added here.

New Registers for Getting gpaddr

| Name            | Type        | Purpose                                                      |
| --------------- | ----------- | ------------------------------------------------------------ |
| need_gpa        | Bool        | A request is currently getting gpaddr                         |
| need_gpa_robidx | RobPtr      | robidx of the request getting gpaddr                          |
| need_gpa_vpn    | vpnLen bits | vpn of the request getting gpaddr                             |
| need_gpa_gvpn   | vpnLen bits | Stores the gvpn of the obtained gpaddr                        |
| need_gpa_refill | Bool        | The gpaddr of this request has been filled into need_gpa_gvpn |

When a TLB request finds a guest page fault in the queried TLB entry, PTW must be performed again. need_gpa is set valid, the vpn of the request is filled into need_gpa_vpn, the robidx of the request into need_gpa_robidx, and need_gpa_refill is initialized to false. When a PTW response arrives and, by checking need_gpa_vpn, is identified as answering the previously sent gpaddr request, the s2 tag of the PTW response is filled into need_gpa_gvpn and need_gpa_refill is set valid, indicating that the gvpn of the gpaddr has been obtained. When the original request re-enters the TLB, it can use need_gpa_gvpn to compute the gpaddr and return it. After a request completes this process, need_gpa is invalidated. need_gpa_refill remains valid, so the refilled gvpn can also serve other TLB requests, as long as their vpn equals need_gpa_vpn.
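The register flow can be read as a one-entry buffer, sketched below in plain Scala with simplified types (the actual design uses Chisel registers and a RobPtr for the robidx); method names are illustrative.

```scala
class GetGpaBuffer {
  var needGpa       = false
  var needGpaVpn    = BigInt(0)
  var needGpaRobidx = 0
  var needGpaGvpn   = BigInt(0)
  var needGpaRefill = false

  // A queried TLB entry reports a guest page fault: resend PTW for gpaddr.
  def onGuestPageFault(vpn: BigInt, robidx: Int): Unit = {
    needGpa = true; needGpaVpn = vpn; needGpaRobidx = robidx
    needGpaRefill = false
  }

  // A PTW response arrives; if it answers the pending request, latch the
  // s2 tag as the gvpn of the gpaddr.
  def onPtwResp(vpn: BigInt, s2Tag: BigInt): Unit =
    if (needGpa && vpn == needGpaVpn) { needGpaGvpn = s2Tag; needGpaRefill = true }

  // The faulting request re-enters the TLB: return the gvpn once refilled
  // (need_gpa_vpn_hit && need_gpa_refill), then clear need_gpa.
  def query(vpn: BigInt): Option[BigInt] =
    if (vpn == needGpaVpn && needGpaRefill) { needGpa = false; Some(needGpaGvpn) }
    else None

  // A flush / redirect covering the saved robidx invalidates the buffer.
  def onRedirect(isFlushed: Int => Boolean): Unit =
    if (needGpa && isFlushed(needGpaRobidx)) { needGpa = false; needGpaRefill = false }
}
```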

Additionally, a redirect situation might occur, causing the entire instruction stream to change. The request that was getting gpaddr will no longer enter the TLB. So, if a redirect occurs, we check the saved need_gpa_robidx to determine if the registers related to getting gpaddr within the TLB need to be invalidated.

Furthermore, to ensure that the PTW request for getting gpaddr does not refill the TLB when it returns, a new output signal getGpa is added when sending the PTW request. The path of this signal is similar to memidx; please refer to memidx. This signal will be passed into the Repeater. When the PTW response returns to the TLB, this signal will also be sent back. If this signal is valid, it indicates that this PTW request was only for getting gpaddr, so the TLB will not be refilled at this time.

Regarding the process of getting gpaddr after a guest page fault occurs, here are some key points reiterated:

  1. The mechanism for getting the gpa can be seen as a buffer with a single entry. When a request causes a guest page fault, the corresponding need_gpa information is written into this buffer and held until the condition need_gpa_vpn_hit && need_gpa_refill becomes valid, or a flush (ITLB) / redirect (DTLB) signal flushes the gpa information.

  2. need_gpa_vpn_hit means: after a request causes a guest page fault, its vpn is written into need_gpa_vpn. If the same vpn queries the TLB again, the need_gpa_vpn_hit signal is raised, indicating that the obtained gpaddr corresponds to the original get-gpa request. If need_gpa_refill is also high, the vpn has already obtained its gpaddr, and the gpaddr can be returned to the frontend fetch / backend memory access for exception handling.

  3. Therefore, for any request from the frontend or memory access that triggers gpa, the following two conditions must be met later:

    1. The request that triggered gpa must be able to be resent (the TLB will continue to return miss for this request until the gpaddr result is obtained).
    2. The gpa request needs to be flushed by sending a flush or redirect signal to the TLB. Specifically, for all possible requests:

      1. ITLB fetch requests: If a fetch request causing gpf is on the speculative path and an incorrect speculation is detected, it will be flushed via the flushPipe signal (including backend redirect, or when the prediction result of a later stage predictor in the frontend multi-stage branch predictor updates the prediction result of an earlier stage predictor, etc.); for other cases, since the ITLB will return miss for this request, the frontend will ensure that the same vpn request is resent.
      2. DTLB load requests: If a load request causing gpf is on the speculative path and an incorrect speculation is detected, it will be flushed via the redirect signal (it is necessary to judge the relationship between the robidx where gpf occurred and the robidx of the incoming redirect); for other cases, since the DTLB will return miss for this request and will also raise the tlbreplay signal, the load queue replay will definitely resend this request.
      3. DTLB store requests: If a store request causing gpf is on the speculative path and an incorrect speculation is detected, it will be flushed via the redirect signal (it is necessary to judge the relationship between the robidx where gpf occurred and the robidx of the incoming redirect); for other cases, since the DTLB will return miss for this request, the backend will definitely schedule this store instruction to resend the request again.
      4. DTLB prefetch requests: The returned gpf signal will be raised, indicating that the address of this prefetch request caused a gpf, but it will not be written to the gpa* series of registers and will not trigger the mechanism to look up gpaddr, so it does not need to be considered.
      5. In the current handling mechanism, it is necessary to ensure that the TLB entry that caused gpf and is waiting for gpa is not replaced during the waiting process. Here, we simply prevent the TLB refill when waiting for gpa, thereby avoiding the replacement operation. Since an exception handling program is required when gpf occurs, and subsequent instructions will be redirected and flushed, preventing refill while waiting for gpa will not cause performance issues.

Overall Block Diagram

The overall block diagram of the L1 TLB is shown in the figure below, with the ITLB and DTLB in the green box. The ITLB receives address translation requests from the Frontend, and the DTLB receives address translation requests from Memblock. The Frontend's requests comprise 3 from the ICache and 1 from the IFU. Memblock's requests comprise 3 from the LoadUnits (AtomicsUnit shares loadUnit(0)'s request channel), 1 from the L1 Load stream & stride prefetcher, 2 from the StoreUnits, and 1 from the SMSPrefetcher.

After the ITLB and DTLB query results are obtained, PMP and PMA checks need to be performed. Since the area of L1 TLB is small, backups of PMP and PMA registers are not stored inside the L1 TLB, but in the Frontend or Memblock, providing checks for ITLB and DTLB respectively. After ITLB and DTLB miss, they need to go through a repeater to send a query request to L2 TLB.

Overall Block Diagram of L1 TLB Module

Interface Timing

ITLB and Frontend Interface Timing

Frontend Sends PTW Request Hitting ITLB

The timing diagram for a PTW request sent by the Frontend to the ITLB that hits the ITLB is shown in the figure below.

Timing Diagram of PTW Request Sent by Frontend to ITLB Hitting ITLB

When a PTW request sent by the Frontend to the ITLB hits the ITLB, the resp_miss signal remains 0. At the next clock rising edge after req_valid is 1, the ITLB will set the resp_valid signal to 1 and simultaneously return the physical address translated from the virtual address to the Frontend, as well as information on whether a guest page fault, page fault, or access fault occurred. The timing description is as follows:

  • Cycle 0: Frontend sends a PTW request to ITLB, setting req_valid to 1.
  • Cycle 1: ITLB returns the physical address to Frontend, setting resp_valid to 1.

Frontend Sends PTW Request Missing ITLB

The timing diagram for a PTW request sent by the Frontend to the ITLB that misses the ITLB is shown in the figure below.

Timing Diagram of PTW Request Sent by Frontend to ITLB Missing ITLB

When a PTW request sent by the Frontend to the ITLB misses the ITLB, the resp_miss signal is returned in the next cycle, indicating an ITLB miss. At this point, this requestor channel of the ITLB no longer receives new PTW requests; the Frontend repeatedly sends the request until the page table in the L2 TLB or memory is queried and returned. (Note that "this requestor channel no longer receives new PTW requests" is controlled by the Frontend: whether the Frontend chooses not to resend the missed request or to resend other requests, its behavior is transparent to the TLB, and if the Frontend sends a new request, the ITLB simply discards the old one.)

When the ITLB misses, it sends a PTW request to the L2 TLB until the result is obtained. The timing interaction between the ITLB and the L2 TLB, and the return of the physical address and related information to the Frontend, are shown in the timing diagram above and described below:

  • Cycle 0: Frontend sends a PTW request to ITLB, setting req_valid to 1.
  • Cycle 1: The ITLB query misses; the ITLB returns resp_miss = 1 to the Frontend with resp_valid set to 1. In the same cycle, the ITLB sends a PTW request to the L2 TLB (actually to itlbrepeater1), setting ptw_req_valid to 1.
  • Cycle X: L2 TLB returns a PTW reply to ITLB, including the virtual page number of the PTW request, the obtained physical page number, page table information, etc., with ptw_resp_valid as 1. In this cycle, ITLB has received the PTW reply from L2 TLB, and ptw_req_valid is set to 0.
  • Cycle X+1: ITLB now hits, resp_valid is 1, and resp_miss is 0. ITLB returns the physical address and information on whether an access fault, page fault, etc. occurred to the Frontend.
  • Cycle X+2: The resp_valid signal returned by ITLB to Frontend is set to 0.

DTLB and Memblock Interface Timing

Memblock Sends PTW Request Hitting DTLB

The timing diagram for a PTW request sent by Memblock to the DTLB that hits the DTLB is shown in the figure below.

Timing Diagram of PTW Request Sent by Memblock to DTLB Hitting DTLB

When a PTW request sent by Memblock to the DTLB hits the DTLB, the resp_miss signal remains 0. At the next clock rising edge after req_valid is 1, the DTLB will set the resp_valid signal to 1 and simultaneously return the physical address translated from the virtual address to Memblock, as well as information on whether a page fault or access fault occurred. The timing description is as follows:

  • Cycle 0: Memblock sends a PTW request to DTLB, setting req_valid to 1.
  • Cycle 1: DTLB returns the physical address to Memblock, setting resp_valid to 1.

Memblock Sends PTW Request Missing DTLB

Both DTLB and ITLB are non-blocking access (i.e., there is no blocking logic inside the TLB. If the request source remains unchanged, i.e., it continuously resends the same request after a miss, it exhibits a blocking-like effect; if the request source schedules other different requests to query the TLB after receiving the miss feedback, it exhibits a non-blocking-like effect). Unlike frontend instruction fetch, when a PTW request sent by Memblock to the DTLB misses the DTLB, it does not block the pipeline. The DTLB will return the request miss and resp_valid signals to Memblock in the cycle after req_valid. Memblock can perform scheduling after receiving the miss signal and continue querying other requests.

After Memblock's access to the DTLB misses, the DTLB sends a PTW request to the L2 TLB to retrieve the page table from the L2 TLB or memory. The DTLB passes the request to the L2 TLB through a Filter, which merges duplicate requests sent by the DTLB to the L2 TLB, ensuring no duplicate entries in the DTLB and improving L2 TLB utilization. The timing diagram for a PTW request sent by Memblock to the DTLB that misses the DTLB is shown in the figure below; the diagram only covers the process from the request missing to the DTLB sending the PTW request to the L2 TLB.

Timing Diagram of PTW Request Sent by Memblock to DTLB Missing DTLB

After the DTLB receives the PTW reply from the L2 TLB, it stores the page table entry in the DTLB. When Memblock accesses the DTLB again, it hits, which is the same situation as in the hit timing diagram above. The timing interaction between the DTLB and the L2 TLB is the same as the ptw_req and ptw_resp parts of the ITLB miss timing diagram.

TLB and tlbRepeater Interface Timing

TLB Sends PTW Request to tlbRepeater

The interface timing diagram for the TLB sending a PTW request to the tlbRepeater is shown in the figure below.

Timing Diagram of TLB Sending PTW Request to Repeater

In the Kunming Lake architecture, both ITLB and DTLB use non-blocking access. Upon TLB miss, they send a PTW request to L2 TLB, but they do not block the pipeline or the PTW channel between TLB and Repeater due to not receiving a PTW reply. The TLB can continuously send PTW requests to the tlbRepeater. The tlbRepeater will merge duplicate requests based on the virtual page numbers of these requests, avoiding wasting L2 TLB resources and preventing duplicate entries in L1 TLB.

From the timing relationship in the figure above, it can be seen that in the cycle after the TLB sends a PTW request to the Repeater, the Repeater passes the PTW request onward. Since the Repeater has already sent a PTW request with virtual page number vpn1 to the L2 TLB, when it receives another PTW request with the same virtual page number, it does not pass it on to the L2 TLB.

itlbRepeater Returns PTW Reply to ITLB

The interface timing diagram for the itlbRepeater returning a PTW reply to the ITLB is shown in the figure below.

Timing Diagram of itlbRepeater Returning PTW Reply to ITLB

Timing description is as follows:

  • Cycle X: itlbRepeater receives the PTW reply from L2 TLB passed through the lower-level itlbRepeater, and itlbrepeater_ptw_resp_valid is high.
  • Cycle X+1: ITLB receives the PTW reply from itlbRepeater.

dtlbRepeater Returns PTW Reply to DTLB

The interface timing diagram for the dtlbRepeater returning a PTW reply to the DTLB is shown in the figure below.

Timing Diagram of dtlbRepeater Returning PTW Reply to DTLB

Timing description is as follows:

  • Cycle X: dtlbRepeater receives the PTW reply from L2 TLB passed through the lower-level dtlbRepeater, and dtlbrepeater_ptw_resp_valid is high.
  • Cycle X+1: dtlbRepeater passes the PTW reply to memblock.
  • Cycle X+2: DTLB receives the PTW reply.