L1 TLB Secondary Module
Design Specifications
- Supports receiving address translation requests from Frontend and MemBlock
- Supports PLRU replacement algorithm
- Supports returning physical addresses to Frontend and MemBlock
- Both ITLB and DTLB use non-blocking access
- Both ITLB and DTLB entries are implemented using register files
- Both ITLB and DTLB entries are fully associative structures
- ITLB and DTLB use the current processor privilege level and effective privilege level for memory access execution, respectively
- Supports judging whether virtual memory is enabled and whether two-stage translation is enabled within the L1 TLB
- Supports sending PTW requests to L2 TLB
- DTLB supports duplicating the queried physical address
- Supports exception handling
- Supports TLB compression
- Supports TLB Hint mechanism
- Stores four types of TLB entries
- TLB refill merges page tables from two stages
- Logic for judging a TLB entry hit
- Supports resending PTW to get gpaddr after guest page fault
Functionality
Receiving Address Translation Requests from Frontend and MemBlock
Before any memory read or write inside the core, including frontend instruction fetch and backend memory access, address translation must be performed by the L1 TLB. Because of physical distance and to avoid mutual interference, the L1 TLB is split into the ITLB (Instruction TLB), serving frontend instruction fetch, and the DTLB (Data TLB), serving backend memory access. The ITLB is fully associative, with 48 entries storing all page sizes. The ITLB receives address translation requests from the Frontend: `itlb_requestors(0)` to `itlb_requestors(2)` come from the ICache, with `itlb_requestors(2)` carrying ICache prefetch requests, and `itlb_requestors(3)` comes from the IFU for address translation of MMIO instructions.
The ITLB entry configuration and request sources are shown in the two tables below.
Entry Name | Number of Entries | Organization Structure | Replacement Algorithm | Stored Content |
---|---|---|---|---|
Page | 48 | Fully Associative | PLRU | All Page Sizes |
Index | Source |
---|---|
requestors(0) | Icache, mainPipe |
requestors(1) | Icache, mainPipe |
requestors(2) | Icache, fdipPrefetch |
requestors(3) | IFU |
XiangShan's memory access pipeline has 2 Load pipelines, 2 Store pipelines, as well as an SMS prefetcher and an L1 Load stream & stride prefetcher. To handle numerous requests, the two Load pipelines and the L1 Load stream & stride prefetcher use the Load DTLB, the two Store pipelines use the Store DTLB, and prefetch requests use the Prefetch DTLB. In total, there are 3 DTLBs, all employing the PLRU replacement algorithm (see Section 5.1.1.2).
The DTLB is fully associative, with 48 entries storing all page sizes. The DTLB receives address translation requests from MemBlock: `dtlb_ld` receives requests from the LoadUnits and the L1 Load stream & stride prefetcher and handles address translation for load instructions; `dtlb_st` receives requests from the StoreUnits and handles address translation for store instructions. For AMO instructions, specifically, the `dtlb_ld_requestor` of `loadUnit(0)` is used to send requests to `dtlb_ld`. The SMSPrefetcher sends prefetch requests to a separate DTLB.
The DTLB entry configuration and request sources are shown in the two tables below.
Entry Name | Number of Entries | Organization Structure | Replacement Algorithm | Stored Content |
---|---|---|---|---|
Page | 48 | Fully Associative | PLRU | All Page Sizes |
Module | Index | Source |
---|---|---|
DTLB_LD | ld_requestors(0) | loadUnit(0), AtomicsUnit |
DTLB_LD | ld_requestors(1) | loadUnit(1) |
DTLB_LD | ld_requestors(2) | loadUnit(2) |
DTLB_LD | ld_requestors(3) | L1 Load stream & stride Prefetch |
DTLB_ST | st_requestors(0) | StoreUnit(0) |
DTLB_ST | st_requestors(1) | StoreUnit(1) |
DTLB_PF | pf_requestors(0) | SMSPrefetch |
DTLB_PF | pf_requestors(1) | L2 Prefetch |
Adopting PLRU Replacement Algorithm
The L1 TLB uses a configurable replacement policy, defaulting to the PLRU replacement algorithm. In the Nanhu architecture, both the ITLB and DTLB comprise a NormalPage and a SuperPage, with a relatively complex refill policy. In the Nanhu ITLB, NormalPage handles address translation for 4KB pages while SuperPage handles 2MB and 1GB pages, so a refilled entry must go to NormalPage or SuperPage according to its page size (4KB, 2MB, or 1GB). In the Nanhu DTLB, NormalPage handles 4KB pages while SuperPage handles all page sizes. NormalPage is direct-mapped; despite having more entries, its utilization is low. SuperPage is fully associative with high utilization, but timing constraints keep its entry count small, resulting in a high miss rate.
Please note that the Kunming Lake architecture proposes optimizations for the above issues. Under timing constraints, both ITLB and DTLB are unified into a 48-entry fully associative structure, allowing any page size to be refilled. Both ITLB and DTLB use the PLRU replacement policy.
The refill policies for the ITLB and DTLB are shown in the table below; a software model of the PLRU policy follows the table.
Module | Entry Name | Policy |
---|---|---|
ITLB | Page | 48 fully associative entries, can refill any page size |
DTLB | Page | 48 fully associative entries, can refill any page size |
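The tree-based PLRU policy itself is easy to model in software. Below is a minimal Python sketch, assuming a power-of-two number of ways for clarity (the actual 48-entry arrays use the same tree scheme generalized to a non-power-of-two size); it illustrates the policy, not the RTL implementation.

```python
class TreePLRU:
    """Tree-based pseudo-LRU over n_ways entries (power-of-two sketch)."""

    def __init__(self, n_ways):
        assert n_ways > 1 and (n_ways & (n_ways - 1)) == 0
        self.n = n_ways
        self.bit = [0] * (n_ways - 1)  # per node: 0 = victim on the left, 1 = right

    def access(self, way):
        """On a hit or refill, point every node on the path away from 'way'."""
        idx, lo, hi = 0, 0, self.n
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if way < mid:
                self.bit[idx] = 1           # victim pointer now avoids the left half
                idx, hi = 2 * idx + 1, mid
            else:
                self.bit[idx] = 0           # victim pointer now avoids the right half
                idx, lo = 2 * idx + 2, mid

    def victim(self):
        """Follow the victim pointers to the pseudo-least-recently-used way."""
        idx, lo, hi = 0, 0, self.n
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if self.bit[idx] == 0:
                idx, hi = 2 * idx + 1, mid
            else:
                idx, lo = 2 * idx + 2, mid
        return lo

plru = TreePLRU(16)
for w in (3, 5, 3, 9):
    plru.access(w)
print(plru.victim())  # a way far from the recent accesses (6 for this sequence)
```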
Returning Physical Address to Frontend and MemBlock
After the L1 TLB translates a virtual address into a physical address, it returns the physical address for the corresponding request to the Frontend or MemBlock, together with whether the request missed and whether a guest page fault, page fault, or access fault occurred. For each request from the Frontend or MemBlock, the ITLB or DTLB sends a response, with `tlb_requestor(i)_resp_valid` indicating a valid response.
In the Nanhu architecture, although both SuperPage and NormalPage are physically implemented with register files, SuperPage is a 16-entry fully associative structure while NormalPage is direct-mapped. After reading data from the direct-mapped NormalPage, a tag comparison is still required. Although SuperPage has 16 fully associative entries, only one entry can hit at a time; that entry is marked by `hitVec`, which selects the data read from SuperPage. Reading data and comparing tags in NormalPage takes much longer than reading and selecting data in SuperPage. Therefore, for timing reasons, the DTLB returns a `fast_miss` signal to MemBlock indicating a SuperPage miss, while the `miss` signal indicates a miss in both SuperPage and NormalPage.
Simultaneously, in the Nanhu architecture, due to tight timing for DTLB's PMP & PMA checks, PMP needs to be divided into dynamic and static check parts. (See Section 5.4) When an L2 TLB page table entry is refilled into the DTLB, the refilled page table entry is simultaneously sent to PMP and PMA for permission checks, and the check results are stored in the DTLB. The DTLB needs to additionally return signals to MemBlock indicating valid static checks and the check results.
Note that the Kunming Lake architecture optimizes the TLB entry organization and the corresponding timing: `fast_miss` has been removed, and the additional static PMP & PMA check is no longer required. Because these mechanisms might be restored later for timing or other reasons, the previous two paragraphs are kept for documentation completeness and compatibility.
Blocking and Non-blocking Access
In the Nanhu architecture, the Frontend's instruction fetch requires blocking access to the ITLB, while the backend's memory access requires non-blocking access to the DTLB. In fact, the TLB itself is non-blocking and stores no request information; whether it appears blocking is determined by the request source. When a frontend fetch misses the TLB, the instruction cannot be sent to the backend until the TLB retrieves the result, which produces a blocking effect; memory accesses can be scheduled out of order, so when one request misses, another load/store instruction can be scheduled to continue execution, which produces a non-blocking effect.
In the Nanhu architecture this behavior is implemented inside the TLB, whose control logic keeps waiting for the page table entry to be fetched via PTW when the ITLB misses. In Kunming Lake, it is guaranteed by the ICache instead: when the ITLB misses and reports this to the ICache, the ICache keeps resending the same request until it hits, producing the blocking effect.
However, it should be noted that both the ITLB and DTLB in the Kunming Lake architecture are non-blocking. Whether the external effect is blocking or non-blocking is controlled by the fetch unit or memory access unit.
Storage Structure of L1 TLB Entries
XiangShan's TLB can be configured in terms of organization structure, including associativity mode, number of entries, and replacement policy. The default configuration is: both ITLB and DTLB are 48-entry fully associative structures, and both are implemented using register files (see Section 5.1.2.3). If an address is read and written in the same cycle, the result can be obtained directly through bypass.
Reference ITLB or DTLB configuration: both use a fully associative structure with 8 / 16 / 32 / 48 entries. Parameterization to change the TLB structure (fully associative / set associative / direct mapped) is currently not supported and requires manual code modification.
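As a rough software mirror of these configuration knobs (a sketch only; the field names below are illustrative, not the actual Chisel parameter names):

```python
from dataclasses import dataclass

@dataclass
class L1TlbParams:
    n_entries: int = 48        # reference points: 8 / 16 / 32 / 48
    organization: str = "fa"   # fully associative; set-associative / direct-mapped
                               # would require manual code changes, as noted above
    replacer: str = "plru"     # configurable, PLRU by default
    reg_file: bool = True      # entries in register files, same-cycle bypass

itlb_params = L1TlbParams()
dtlb_ld_params = L1TlbParams()
```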
Supporting Judging Whether Virtual Memory is Enabled and Whether Two-Stage Translation is Enabled Within the L1 TLB
XiangShan supports the Sv39 page table from the RISC-V manual, with a 39-bit virtual address. XiangShan's physical address is 48 bits; both widths are parameterizable.
Whether virtual memory is enabled is determined by the privilege level, the MODE field of the SATP register, etc. This judgment is made inside the TLB and is transparent to the outside of the TLB. For a description of privilege levels, see Section 5.1.2.7; regarding the MODE field of SATP, the Kunming Lake architecture of XiangShan only supports MODE field value 8, which is the Sv39 paging mechanism, otherwise an illegal instruction fault will be reported. From the perspective of modules outside the TLB (Frontend, LoadUnit, StoreUnit, AtomicsUnit, etc.), all addresses are translated by the TLB.
When the H extension is added, deciding whether address translation is enabled also requires determining whether two-stage address translation is active. Two-stage address translation is enabled under either of two conditions: a virtualized memory access instruction is currently executing, or virtualization mode is on and the MODE field of VSATP or HGATP is non-zero. The resulting translation modes are listed in the table below; the translation mode is used to look up the corresponding type of page table in the TLB and to send PTW requests to the L2 TLB (a small selection helper is sketched after the table).
VSATP Mode | HGATP Mode | Translation Mode |
---|---|---|
Non-zero | Non-zero | allStage, both stages of translation present |
Non-zero | Zero | onlyStage1, only the first stage of translation |
Zero | Non-zero | onlyStage2, only the second stage of translation |
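This is the helper promised above: an illustrative Python sketch of the mode selection, assuming the virtualization flag and the two MODE fields are available as plain values (names are invented for the sketch).

```python
def translation_mode(is_virt_access, virt_mode, vsatp_mode, hgatp_mode):
    """Pick the translation mode used for TLB lookup and PTW requests."""
    # Two-stage translation applies for a virtualized memory access
    # instruction, or in virtualization mode with a non-zero vsatp/hgatp MODE.
    enabled = is_virt_access or (virt_mode and (vsatp_mode or hgatp_mode))
    if not enabled:
        return "noS2xlate"
    if vsatp_mode and hgatp_mode:
        return "allStage"       # both stages of translation present
    if vsatp_mode:
        return "onlyStage1"     # only the first stage
    if hgatp_mode:
        return "onlyStage2"     # only the second stage
    return "noS2xlate"          # both MODE fields zero: bare
```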
Privilege Level of L1 TLB
According to the RISC-V manual, the privilege level for frontend instruction fetch (ITLB) is the current processor privilege level, while the privilege level for backend memory access (DTLB) is the effective privilege level of the memory access. Both are determined in the CSR module and passed to the ITLB and DTLB. The current processor privilege level is stored in the CSR module. The effective privilege level for memory access is determined jointly by the MPRV, MPV, and MPP bits of the mstatus register and the SPVP bit of hstatus: for a virtualized memory access instruction, it is the privilege level stored in the SPVP bit of hstatus; for a non-virtualized memory access instruction with MPRV = 0, it equals the current processor privilege level, and the effective virtualization mode equals the current virtualization mode; with MPRV = 1, it is the privilege level stored in the MPP field of mstatus, and the effective virtualization mode is the one stored in the MPV bit of mstatus. The privilege levels used by the ITLB and DTLB are shown in the table below (a sketch of this selection follows the table).
Module | Privilege Level |
---|---|
ITLB | Current processor privilege level |
DTLB | For virtualized memory access instructions, the privilege level stored in hstatus.SPVP; otherwise, if mstatus.MPRV=0, the current processor privilege level and virtualization mode; if mstatus.MPRV=1, the privilege level stored in mstatus.MPP and the virtualization mode stored in mstatus.MPV |
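The sketch promised above: a minimal Python model of the effective privilege selection, with inputs mirroring mstatus.MPRV/MPP/MPV and hstatus.SPVP (signal names are illustrative).

```python
def dtlb_effective_priv(imode, virt, mprv, mpp, mpv, spvp, is_virt_access):
    """Return (privilege level, virtualization mode) for a data access."""
    if is_virt_access:        # virtualized memory access instruction
        return spvp, True     # privilege from hstatus.SPVP, guest access
    if mprv == 0:
        return imode, virt    # follow the current hart state
    return mpp, bool(mpv)     # MPRV = 1: use mstatus.MPP and mstatus.MPV
```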
Sending PTW Requests
When an L1 TLB misses, it needs to send a Page Table Walk request to the L2 TLB. Due to the relatively long physical distance between the L1 TLB and L2 TLB, a pipeline stage, called Repeater, is needed in between. Additionally, the repeater needs to filter out duplicate requests to prevent duplicate entries in the L1 TLB. (See Section 5.2) Therefore, the first-level Repeater for ITLB or DTLB is also referred to as a Filter. The L1 TLB sends PTW requests and receives PTW replies via the Repeater. (See Section 5.3)
DTLB Duplicating the Queried Physical Address
In the physical implementation, the DCache and the LSU in MemBlock are physically far apart. If a `hitVec` were generated in the `load_s1` stage of the LoadUnit and then sent separately to the DCache and the LSU, it would cause severe timing problems; instead, two `hitVec` signals are generated in parallel, one near the DCache and one near the LSU, and sent to each. To cooperate with this timing fix in MemBlock, the DTLB duplicates the queried physical address into two copies, sending one to the DCache and one to the LSU. The two physical addresses are identical.
Exception Handling Mechanism
Exceptions generated by the ITLB include `inst guest page fault`, `inst page fault`, and `inst access fault`, all of which are delivered to the requesting source, the ICache or IFU, for handling. Exceptions generated by the DTLB include `load guest page fault`, `load page fault`, `load access fault`, `store guest page fault`, `store page fault`, and `store access fault`, all of which are delivered to the requesting source, the LoadUnits, StoreUnits, or AtomicsUnit, for handling. The L1 TLB does not store the `gpaddr`, so when a guest page fault occurs, PTW must be performed again to obtain the `gpaddr`. See Section 6 of this document: Exception Handling Mechanism.
Here, additional supplementary explanations are needed regarding exceptions related to virtual-to-physical address translation. We classify exceptions as follows:

- Page table related exceptions
  - When not in virtualization, or during the virtualized VS-Stage: if a page table entry has non-zero reserved bits, is misaligned, lacks write permission (w), etc. (see the manual for details), a page fault must be reported.
  - During the virtualized G-Stage: if a page table entry has non-zero reserved bits, is misaligned, lacks write permission (w), etc. (see the manual for details), a guest page fault must be reported.
- Virtual address or physical address related exceptions
  - Exceptions related to virtual or physical addresses during the address translation process. These checks are performed during the L2 TLB PTW process.
    - When not in virtualization, or during the virtualized all-Stage, the G-stage `gvpn` must be checked. If `hgatp`'s mode is 8 (Sv39x4), bits [41-12 = 29] and above of `gvpn` must all be 0; if `hgatp`'s mode is 9 (Sv48x4), bits [50-12 = 38] and above of `gvpn` must all be 0. Otherwise, a guest page fault is reported.
    - When a page table entry is obtained during address translation, the bits above bit [48-12 = 36] of its PPN field must all be 0. Otherwise, an access fault is reported.
  - Exceptions related to virtual or physical addresses in the original address, summarized below (a condensed software model follows this list). In theory these checks should be performed in the L1 TLB, but since the ITLB's `redirect` result comes entirely from the Backend, the corresponding ITLB exceptions are recorded when the Backend sends a `redirect` to the Frontend and are not checked again in the ITLB; please refer to the Backend documentation for that part.
    - Sv39 mode: virtual memory enabled with virtualization disabled and `satp` mode 8, or virtual memory enabled with virtualization enabled and `vsatp` mode 8. Bits [63:39] of `vaddr` must equal bit 38 of `vaddr`. Otherwise, an `instruction page fault`, `load page fault`, or `store page fault` is reported according to whether the request is an instruction fetch, load, or store.
    - Sv48 mode: virtual memory enabled with virtualization disabled and `satp` mode 9, or virtual memory enabled with virtualization enabled and `vsatp` mode 9. Bits [63:48] of `vaddr` must equal bit 47 of `vaddr`. Otherwise, an `instruction page fault`, `load page fault`, or `store page fault` is reported according to the request type.
    - Sv39x4 mode: virtual memory and virtualization enabled, with `vsatp` mode 0 and `hgatp` mode 8. (Note: when `vsatp` mode is 8/9 and `hgatp` mode is 8, the second-stage translation is also Sv39x4 and corresponding exceptions may occur, but those belong to "exceptions during the address translation process" and are handled during the L2 TLB page table walk; the L1 TLB only handles "exceptions in the original address".) Bits [63:41] of `vaddr` must all be 0. Otherwise, an `instruction guest page fault`, `load guest page fault`, or `store guest page fault` is reported according to the request type.
    - Sv48x4 mode: virtual memory and virtualization enabled, with `vsatp` mode 0 and `hgatp` mode 9. (The same note as for Sv39x4 applies when `vsatp` mode is 8/9 and `hgatp` mode is 9.) Bits [63:50] of `vaddr` must all be 0. Otherwise, an `instruction guest page fault`, `load guest page fault`, or `store guest page fault` is reported according to the request type.
    - Bare mode: virtual memory is disabled, so `paddr = vaddr`. Since the physical address of the XiangShan processor is currently limited to 48 bits, bits [63:48] of `vaddr` must all be 0. Otherwise, an `instruction access fault`, `load access fault`, or `store access fault` is reported according to the request type.
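The condensed model promised above: a Python sketch of the original-address high-bit checks, assuming the effective mode has already been determined. It returns only the fault class; the requestor maps it to the instruction / load / store variant.

```python
def check_raw_vaddr(fullva, mode):
    """Check the 64-bit original vaddr; return None or a fault class."""
    def sign_ok(msb):
        # Bits [63:msb] must all be equal (all zeros or all ones).
        ext = fullva >> msb
        return ext == 0 or ext == (1 << (64 - msb)) - 1

    if mode == "sv39":
        return None if sign_ok(38) else "page_fault"
    if mode == "sv48":
        return None if sign_ok(47) else "page_fault"
    if mode == "sv39x4":
        return None if (fullva >> 41) == 0 else "guest_page_fault"
    if mode == "sv48x4":
        return None if (fullva >> 50) == 0 else "guest_page_fault"
    return None if (fullva >> 48) == 0 else "access_fault"  # bare
```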
To support handling the "original address" related exceptions above, the L1 TLB adds `fullva` (64 bits) and `checkfullva` (1 bit) as input signals, and adds `vaNeedExt` to the output (a small model of the resulting `*tval` write-back follows this list). Specifically:

- `checkfullva` is not a gating signal for `fullva`; that is, the content of `fullva` is not valid only while `checkfullva` is high.
- When must `checkfullva` be high?
  - For the ITLB, `checkfullva` is always false, so when Chisel generates Verilog, `checkfullva` may be optimized out and not appear in the input.
  - For the DTLB, the `checkfullva` check must be performed for all load / store / amo / vector instructions when they are first sent by the Backend to MemBlock. Note that the "exceptions related to virtual or physical addresses in the original address" check targets only `vaddr` (for load / store instructions, the calculated `vaddr` is typically a 64-bit value obtained from a register value plus an immediate), so it does not need to wait for a TLB hit; moreover, when this exception occurs, the TLB does not return `miss`, indicating that the exception is valid. The exception is therefore guaranteed to be discovered and reported when the instruction is first sent by the Backend to MemBlock. An access that raises it will not enter the misalign buffer, a load will not enter the load replay queue, and a store will not be resent by the reservation station; so if the exception is not found on first issue, it cannot appear on a load replay resend, and no `checkfullva` check is needed there. For prefetch instructions, `checkfullva` is not raised.
- When is `fullva` valid (when is it used)?
  - Except for one specific case, `fullva` is valid only when `checkfullva` is high, and represents the complete `vaddr` to be checked. Note that the original `vaddr` computed for a load / store instruction is 64 bits (the value read from the register is 64 bits); only the low 48 / 50 bits (Sv48 / Sv48x4) are used for the TLB lookup, while the full 64 bits are needed for the exception check.
  - The specific case: a misaligned instruction causes a gpf and `gpaddr` must be obtained. The memory access side currently handles misaligned exceptions as follows:
    - Suppose the original `vaddr` is 0x81000ffb and the instruction loads 8 bytes.
    - The misalign buffer splits the instruction into two loads, with `vaddr` 0x81000ff8 (load 1) and 0x81001000 (load 2); the two loads do not fall on the same virtual page.
    - For load 1, the `vaddr` passed to the TLB is 0x81000ff8; for load 2, it is 0x81001000. For both loads, `fullva` is always the original `vaddr`, 0x81000ffb.
    - If load 1 raises an exception, the offset written to the `*tval` register is, by convention, the offset of the original address (0xffb); if load 2 raises an exception, the offset written to `*tval` is the start of the next page (0x000). In the virtualized onlyStage2 case, `gpaddr = vaddr` at the faulting access. Therefore, for a page-crossing misaligned request whose exception occurs at the address beyond the page boundary, `gpaddr` is generated from `vaddr` only (the offset is then 0x000) and `fullva` is not used; for a non-crossing misaligned request, or a crossing one whose exception occurs at the original address, `gpaddr` is generated using the offset of `fullva` (0xffb). In this case `fullva` is always valid, regardless of whether `checkfullva` is high.
- When is `vaNeedExt` valid (in what situation is it used)?
  - In the memory access queues (load queue / store queue), the 64-bit original address is truncated to 50 bits to save area. However, the value written to the `*tval` register must be 64 bits. As noted above, for exceptions related to "virtual or physical addresses in the original address", the complete 64-bit original address must be kept; for other page-table-related exceptions, the high bits of the address already satisfy the requirements. Two examples:
    - `fullva = 0xffff,ffff,8000,0000`; `vaddr = 0xffff,8000,0000`; mode is non-virtualized Sv39. The original address does not raise an exception. Suppose this is a load request whose first TLB access misses: the load enters the load replay queue to wait for retransmission, and the address is truncated to 50 bits. When the load is replayed, the V bit of the page table turns out to be 0, causing a page fault, and the `vaddr` must be written to the `*tval` register. Since the address was truncated in the load replay queue, sign extension is needed (e.g., for Sv39, extending the bits above 39 with the value of bit 38), and the returned `vaNeedExt` is raised.
    - `fullva = 0x0000,ffff,8000,0000`; `vaddr = 0xffff,8000,0000`; mode is non-virtualized Sv39. Here the original address itself raises an exception. The address is written directly into the corresponding exception buffer (which stores the complete 64-bit value). The original value 0x0000,ffff,8000,0000 must then be written to `*tval` as-is; sign extension must not be performed, so `vaNeedExt` is low.
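The model promised above: a Python sketch of the `*tval` reconstruction, assuming a 50-bit stored address and Sv39/Sv48-style sign extension (names are illustrative).

```python
MASK64 = (1 << 64) - 1

def tval_from_vaddr(stored_vaddr, va_need_ext, va_bits=39):
    """Rebuild the 64-bit *tval value from a (possibly truncated) vaddr.

    va_bits is 39 for Sv39 and 48 for Sv48; with vaNeedExt set, the queues
    have truncated the address and it is sign-extended from bit va_bits-1.
    """
    if not va_need_ext:
        return stored_vaddr & MASK64          # original 64-bit value kept as-is
    msb = va_bits - 1
    low = stored_vaddr & ((1 << va_bits) - 1)
    if (low >> msb) & 1:                      # negative half: fill high bits with 1s
        return low | (MASK64 & ~((1 << va_bits) - 1))
    return low

# First example above: truncated 0xffff_8000_0000 with vaNeedExt high
assert tval_from_vaddr(0xffff_8000_0000, True) == 0xffff_ffff_8000_0000
# Second example: the full original value is kept, vaNeedExt low
assert tval_from_vaddr(0x0000_ffff_8000_0000, False) == 0x0000_ffff_8000_0000
```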
Supports Pointer Masking Extension
Currently, the XiangShan processor supports the pointer masking extension.
The essence of the pointer masking extension is to change the `fullva` used for memory access from the raw "register value + immediate" into an "effective vaddr" whose high bits may be ignored: when `pmm` is 2, the high 7 bits are ignored; when `pmm` is 3, the high 16 bits are ignored. `pmm` = 0 means no high bits are ignored, and `pmm` = 1 is reserved.
The value of `pmm` can come from the PMM bits ([33:32]) of `mseccfg` / `menvcfg` / `henvcfg` / `senvcfg`, or from the HUPMM bits ([49:48]) of the `hstatus` register, selected as follows:
- For frontend instruction fetch requests, or an `hlvx` instruction as specified in the manual, pointer masking is not used (`pmm` is 0).
- If the current effective privilege level for memory access (`dmode`) is M mode, select the PMM bits ([33:32]) of `mseccfg`.
- In a non-virtualized scenario, if the current effective privilege level for memory access is S mode (HS), select the PMM bits ([33:32]) of `menvcfg`.
- In a virtualized scenario, if the current effective privilege level for memory access is S mode (VS), select the PMM bits ([33:32]) of `henvcfg`.
- For a virtualized memory access instruction when the current processor privilege level (`imode`) is U mode, select the HUPMM bits ([49:48]) of `hstatus`.
- In all other U-mode scenarios, select the PMM bits ([33:32]) of `senvcfg`.
Since pointer masking only applies to memory access and not frontend instruction fetch, there is no concept of "effective vaddr" in the ITLB, and these signals passed from CSR are not introduced into the ports.
Since the high bits of these addresses are only checked in the "Exceptions related to virtual or physical addresses in the original address" mentioned above, for cases where high bits are masked, we simply prevent exceptions from being triggered. Specifically:
- For non-virtualized scenarios with virtual memory enabled, or virtualized scenarios other than onlyStage2 (i.e., `vsatp` mode is not 0), sign-extend the high 7 or 16 bits of the address, depending on whether `pmm` is 2 or 3.
- For virtualized onlyStage2 scenarios, or when virtual memory is disabled, zero-extend the high 7 or 16 bits of the address, depending on whether `pmm` is 2 or 3 (see the sketch after this list).
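The sketch referenced above: both the `pmm` source selection and the resulting extension in one Python model. The `csr` dictionary keys and the function names are invented for illustration.

```python
def select_pmm(is_fetch_or_hlvx, dmode, imode, virt, is_virt_access, csr):
    """Pick the effective pmm value per the rules above."""
    if is_fetch_or_hlvx:
        return 0                                # pointer masking never applies
    if dmode == "M":
        return csr["mseccfg_pmm"]
    if dmode == "S":
        return csr["henvcfg_pmm"] if virt else csr["menvcfg_pmm"]  # VS / HS
    if is_virt_access and imode == "U":
        return csr["hstatus_hupmm"]
    return csr["senvcfg_pmm"]                   # remaining U-mode cases

def apply_pointer_mask(va, pmm, sign_extend):
    """Ignore the high 7 (pmm=2) or 16 (pmm=3) bits by sign/zero extension."""
    ignore = {0: 0, 2: 7, 3: 16}[pmm]           # pmm = 1 is reserved
    if ignore == 0:
        return va & ((1 << 64) - 1)
    keep = 64 - ignore
    low = va & ((1 << keep) - 1)
    if sign_extend and (low >> (keep - 1)) & 1:
        return low | (((1 << 64) - 1) & ~((1 << keep) - 1))
    return low
```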
Supports TLB Compression
The Kunming Lake architecture supports TLB compression. Each compressed TLB entry saves 8 consecutive page table entries, as shown in the figure above. The theoretical basis for TLB compression is that when the operating system allocates pages, due to mechanisms like the buddy allocation algorithm, it tends to allocate contiguous physical pages to contiguous virtual pages. Although page allocation becomes increasingly disordered as programs run, this page contiguity is generally present. Therefore, multiple consecutive page table entries can be merged into a single TLB entry in hardware, thereby increasing the effective capacity of the TLB.
That is, for page table entries whose virtual page numbers share the same high bits, when the high bits of their physical page numbers and their page attributes are also the same, these entries can be compressed into a single entry, increasing the effective capacity of the TLB. The compressed TLB entry shares the high bits of the physical page number and the page attributes; each member page table entry keeps its own low bits of the physical page number, and a `valid` bit indicates whether that member is valid within the compressed entry, as shown in Table 5.1.8.
Table 5.1.8 compares the entry format before and after compression. Before compression, the `tag` is the full `vpn`. After compression, the `tag` is the high 24 bits of the `vpn`; the low 3 bits need not be stored, since for the i-th of the 8 consecutive page table entries, i is exactly the low 3 bits. The high 21 bits of the `ppn` are shared, and `ppn_low` separately stores the low 3 bits of the `ppn` for each of the 8 entries. `valididx` indicates the validity of the 8 entries: entry i is valid only when `valididx(i)` is 1. `pteidx(i)` marks the entry corresponding to the original request, i.e., i equals the low 3 bits of the original request's `vpn`.
Here is an example. Suppose a `vpn` is 0x0000154; its low three bits are 100, i.e., 4. When refilled into the L1 TLB, the 8 page table entries with `vpn` 0x0000150 through 0x0000157 are all refilled and compressed into one entry. Suppose the high 21 bits of the `ppn` for `vpn` 0x0000154 are PPN0 and the page attributes are PERM0. If the i-th of these 8 page table entries also has high `ppn` bits PPN0 and attributes PERM0, then `valididx(i)` is 1 and the low 3 bits of its `ppn` are stored in `ppn_low(i)`. In addition, `pteidx(i)` marks the entry corresponding to the original request: here the low three bits of the original request's `vpn` are 4, so `pteidx(4)` is 1 and all other bits of `pteidx` are 0. (A software sketch of this folding follows the table below.)
Furthermore, the TLB does not compress query results for large pages (1GB, 2MB). For a large page, every bit of `valididx` is set to 1 on refill. By the page table lookup rules, large pages never actually use `ppn_low`, so its value can be arbitrary.
Compressed? | tag | asid | level | ppn | perm | valididx | pteidx | ppn_low |
---|---|---|---|---|---|---|---|---|
No | 27 bits | 16 bits | 2 bits | 24 bits | Page Table Attributes | Not Stored | Not Stored | Not Stored |
Yes | 24 bits | 16 bits | 2 bits | 21 bits | Page Table Attributes | 8 bits | 8 bits | 8×3 bits |
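As a software illustration of the refill side of this table, the sketch below folds 8 consecutive page table entries into one compressed entry; the dictionary fields mirror the table columns (a behavioral model, not the RTL).

```python
def compress_ptes(vpn, ptes):
    """Fold 8 consecutive PTEs into one compressed entry.

    'ptes' maps i in 0..7 to (ppn, perm) for the page whose vpn has low
    3 bits equal to i; the requested page provides the shared high bits.
    """
    i_req = vpn & 0x7
    ppn_req, perm_req = ptes[i_req]
    entry = {
        "tag": vpn >> 3,              # high 24 of the 27-bit vpn
        "ppn": ppn_req >> 3,          # shared high 21 bits of the ppn
        "perm": perm_req,             # shared page table attributes
        "ppn_low": [0] * 8,
        "valididx": [0] * 8,
        "pteidx": [int(i == i_req) for i in range(8)],
    }
    for i, (ppn, perm) in ptes.items():
        if ppn >> 3 == entry["ppn"] and perm == entry["perm"]:
            entry["valididx"][i] = 1      # shares high ppn bits and attributes
            entry["ppn_low"][i] = ppn & 0x7
    return entry
```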
With TLB compression, the L1 TLB hit condition changes from a plain tag hit to a tag hit (high bits of the `vpn` match) that additionally requires the `valididx(i)` indexed by the low 3 bits of the `vpn` to be valid. The PPN is obtained by concatenating `ppn` (the high 21 bits) with `ppn_low(i)`, as sketched below.
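Continuing the sketch above, the compressed-entry hit check and PPN concatenation:

```python
def lookup(entry, vpn):
    """Return the full PPN on a hit, or None on a miss."""
    i = vpn & 0x7
    if (vpn >> 3) == entry["tag"] and entry["valididx"][i]:
        return (entry["ppn"] << 3) | entry["ppn_low"][i]
    return None   # tag mismatch, or this 4KB page was not folded in
```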
Note that with the H extension, L1 TLB entries are divided into four types. The TLB compression mechanism is not enabled for virtualized TLB entries (TLB compression is still used in the L2 TLB). The four types are described in detail next.
Storing Four Types of TLB Entries
In the L1 TLB with the H extension, the TLB entries are modified as shown in the figure below.
Compared with the original design, `g_perm`, `vmid`, and `s2xlate` have been added: `g_perm` stores the `perm` of the second-stage page table, `vmid` stores the `vmid` of the second-stage page table, and `s2xlate` distinguishes the type of the TLB entry. Depending on `s2xlate`, the content stored in the TLB entry varies.
Type | s2xlate | tag | ppn | perm | g_perm | level |
---|---|---|---|---|---|---|
noS2xlate | b00 | Virtual page number in non-virtualization | Physical page number in non-virtualization | Page table entry perm in non-virtualization | Not used | Page table entry level in non-virtualization |
allStage | b11 | Virtual page number of first-stage page table | Physical page number of second-stage page table | Perm of first-stage page table | Perm of second-stage page table | Maximum level in two-stage translation |
onlyStage1 | b01 | Virtual page number of first-stage page table | Physical page number of first-stage page table | Perm of first-stage page table | Not used | Level of first-stage page table |
onlyStage2 | b10 | Virtual page number of second-stage page table | Physical page number of second-stage page table | Not used | Perm of second-stage page table | Level of second-stage page table |
TLB compression is enabled for the `noS2xlate` and `onlyStage1` cases and disabled otherwise. For the `allStage` and `onlyStage2` cases, the L1 TLB hit logic uses `pteidx` to compute the tag and ppn of the effective pte, and these two cases are also refilled differently. In addition, `asid` is valid for `noS2xlate`, `allStage`, and `onlyStage1`, while `vmid` is valid for `allStage` and `onlyStage2`.
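A field-level Python mirror of these entry types (encodings follow the table; a descriptive model only):

```python
from dataclasses import dataclass
from enum import Enum

class S2xlate(Enum):
    noS2xlate  = 0b00
    onlyStage1 = 0b01
    onlyStage2 = 0b10
    allStage   = 0b11

@dataclass
class TlbEntry:
    tag: int            # stage-1 vpn, or stage-2 vpn for onlyStage2
    asid: int           # valid for noS2xlate / allStage / onlyStage1
    vmid: int           # valid for allStage / onlyStage2
    level: int          # max of the two levels for allStage
    ppn: int
    perm: int           # stage-1 permissions (unused for onlyStage2)
    g_perm: int         # stage-2 permissions (unused for noS2xlate/onlyStage1)
    s2xlate: S2xlate
```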
TLB Refill Merging Page Tables from Two Stages
In the MMU with the H extension, the structure returned by PTW has three parts. The first part, `s1`, is the original `PtwSectorResp`, storing the first-stage translation page table. The second part, `s2`, is the `HptwResp`, storing the second-stage translation page table. The third part is `s2xlate`, the type of this response, again one of `noS2xlate`, `allStage`, `onlyStage1`, and `onlyStage2`, as shown in the figure below. `PtwSectorEntry` is a `PtwEntry` that uses TLB compression; the main difference between the two is the width of `tag` and `ppn`.
For the `noS2xlate` and `onlyStage1` cases, only the result of `s1` needs to be filled into the TLB entry. The write is similar to the original design: the corresponding fields of the returned `s1` are filled into the corresponding fields of the entry. Note that in the `noS2xlate` case, the `vmid` field is invalid.
For the `onlyStage2` case, the result of `s2` is filled into the TLB entry, with some special handling to fit the TLB compression structure. First, the `asid` and `perm` of the entry are unused, so their fill values do not matter. `vmid` is filled from the `vmid` of `s1` (the PTW module always fills this field regardless of the case, so it can be used directly). The `tag` of `s2` is filled into the `tag` of the TLB entry. `pteidx` is determined from the low `sectortlbwidth` bits of the `tag` of `s2`. If `s2` is a large page, all bits of `valididx` in the TLB entry are set valid; otherwise, only the `valididx` bit selected by `pteidx` is valid. The filling of `ppn` reuses the logic of the `allStage` case and is described there.
For `allStage`, the page tables of both stages must be merged. First, `tag`, `asid`, `vmid`, etc. are filled from `s1`. Since the entry has a single `level` field, `level` is filled with the maximum (finer) of the levels of `s1` and `s2`; this accounts for the case where the first stage is a large page and the second stage is a small page, in which a query might hit the large page while the actual address lies outside the range of the second-stage page table. The `tag` for such requests must also be merged: for example, if the first tag comes from a level-1 page table and the second from a level-2 page table, the level-1 page number from the first tag is combined with the level-2 page number from the second tag (the level-3 page number can be padded with zeros) to form the tag of the new entry. In addition, the `perm` of `s1`, the `perm` of `s2` (into `g_perm`), and `s2xlate` are filled. For `ppn`: since the guest physical address is not saved, directly storing the `ppn` of `s2` when the first stage is a small page and the second stage is a large page would make the physical address computed on a later lookup incorrect. Therefore, the `tag` and `ppn` of `s2` are first concatenated according to the `level` of `s2`: `s2ppn` is the high part of the resulting `ppn`, and `s2ppn_tmp` is constructed to compute the low bits. The high bits are then filled into the `ppn` field of the TLB entry and the low bits into `ppn_low`, as sketched below.
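A rough Python model of the merge, assuming Sv39 with 9 bits per level (level 0 the coarsest, level 2 the 4KB level) and 27-bit page-number fields; the RTL works on `PtwSectorEntry` fields and differs in detail.

```python
def merge_all_stage_tag(s1_tag, s1_level, s2_tag, s2_level):
    """Merge the two tags at the finer (max) of the two levels."""
    level = max(s1_level, s2_level)
    tag = 0
    for lvl in range(3):
        if lvl > level:
            part = 0                                    # below merged level: zeros
        elif lvl <= s1_level:
            part = (s1_tag >> (9 * (2 - lvl))) & 0x1FF  # coarse bits from s1
        else:
            part = (s2_tag >> (9 * (2 - lvl))) & 0x1FF  # finer bits from s2
        tag = (tag << 9) | part
    return tag, level

def merge_all_stage_ppn(s2_tag, s2_ppn, s2_level):
    """Realign the stage-2 ppn: below s2's level, the physical bits come
    from the s2 tag (identity mapping inside the large page). The high
    bits go to the entry's ppn field, the low 3 bits to ppn_low."""
    full = 0
    for lvl in range(3):
        src = s2_ppn if lvl <= s2_level else s2_tag
        full = (full << 9) | ((src >> (9 * (2 - lvl))) & 0x1FF)
    return full
```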
Logic for Judging a TLB Entry Hit
There are three types of hits used in L1TLB: query TLB hit, fill TLB hit, and PTW request response hit.
For the query TLB hit, new parameters are added, including `vmid`, `hasS2xlate`, `onlyS2`, `onlyS1`, etc. The asid hit is always treated as true during second-stage translation. The H extension adds a `pteidx hit`, enabled for small pages in the `allStage` and `onlyS2` cases, which masks the TLB compression mechanism.
For the fill TLB hit (`wbhit`), the input is a `PtwRespS2`. The `vpn` to compare must first be chosen: if the response is second-stage-only translation, the high bits of the `tag` of `s2` are used; otherwise the `tag` of `s1` is used as the `vpn`, padded with 0 in the low `sectortlbwidth` bits, and the resulting `vpn` is compared with the `tag` of the TLB entry. The H extension modifies the `wb_valid` judgment and adds `pteidx_hit` and `s2xlate_hit`. If the response is a PTW response for second-stage-only translation, `wb_valididx` is determined from the `tag` of `s2`; otherwise it is directly connected to the `valididx` of `s1`. `s2xlate_hit` compares the `s2xlate` of the TLB entry with the `s2xlate` of the PTW response, filtering the entry type. `pteidx_hit` invalidates TLB compression: for second-stage-only translation, the low bits of the `tag` of `s2` are compared with the `pteidx` of the TLB entry; for the other two-stage translation cases, the `pteidx` of the TLB entry is compared with the `pteidx` of `s1`.
The PTW request response hit is mainly used when a PTW response arrives, to decide whether the PTW request previously sent by the TLB corresponds exactly to this response, or, during a TLB lookup, whether the PTW response is the PTW result needed by the current TLB request. This method is defined in `PtwRespS2` and distinguishes three cases. For `noS2_hit` (`noS2xlate`), only the `s1` hit needs to be checked. For `onlyS2_hit` (`onlyStage2`), only the `s2` hit needs to be checked. For `all_onlyS1_hit` (`allStage` or `onlyStage1`), the `vpn_hit` logic must be redesigned and cannot simply check the `s1` hit: the level used for judging `vpn_hit` is the maximum of the levels of `s1` and `s2`, and the hit is then judged at that level. Additionally, `vasid` (from `vsatp`) hit and `vmid` hit checks are added.
Supports Resending PTW to Get gpaddr After Guest Page Fault
Since the L1 TLB does not store the `gpaddr` of a translation result, when a guest page fault is found on a TLB query, PTW must be performed again to obtain the `gpaddr`; meanwhile the TLB response remains a miss. Several new registers are added for this, listed below.
Name | Type | Purpose |
---|---|---|
need_gpa | Bool | Indicates that a request is currently getting gpaddr |
need_gpa_robidx | RobPtr | The robidx of the request getting gpaddr |
need_gpa_vpn | vpnLen | The vpn of the request getting gpaddr |
need_gpa_gvpn | vpnLen | Stores the gvpn of the obtained gpaddr |
need_gpa_refill | Bool | Indicates that the gpaddr of this request has been filled into need_gpa_gvpn |
When a TLB request finds a guest page fault in the queried entry, PTW must be performed again. In that case `need_gpa` becomes valid, the `vpn` of the request is written into `need_gpa_vpn`, the `robidx` of the request is written into `need_gpa_robidx`, and `resp_gpa_refill` is initialized to false. When a PTW response arrives and `need_gpa_vpn` shows that it answers the gpaddr-fetching request sent earlier, the `s2 tag` of the PTW response is written into `need_gpa_gvpn` and `need_gpa_refill` is set valid, indicating that the `gvpn` of the `gpaddr` has been obtained. When the original request re-enters the TLB, it can use this `need_gpa_gvpn` to compute the `gpaddr` and return it. After a request completes this process, `need_gpa` is invalidated. `resp_gpa_refill` remains valid, so the refilled `gvpn` may also be used by other TLB requests (as long as their vpn equals `need_gpa_vpn`).
Additionally, a `redirect` may occur and change the entire instruction stream, so the request that was fetching the `gpaddr` will never enter the TLB again. Therefore, when a `redirect` occurs, the saved `need_gpa_robidx` is checked to decide whether the gpaddr-related registers inside the TLB must be invalidated.
Furthermore, to ensure that the PTW request issued to fetch the `gpaddr` does not refill the TLB when it returns, a new output signal `getGpa` is added to the PTW request. Its path is similar to `memidx`; refer to `memidx` for details. The signal is passed into the Repeater and comes back with the PTW response to the TLB; if it is valid, the PTW request was issued only to fetch the `gpaddr`, so the TLB is not refilled. A compact model of this flow, including the `redirect` flush, is sketched below.
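The model promised above: a one-entry buffer covering the fault, refill, retry, and redirect cases (a behavioral Python sketch; method names are illustrative, field names loosely follow the register table).

```python
class GpaBuffer:
    """One-entry buffer for re-walking to fetch gpaddr after a guest page fault."""

    def __init__(self):
        self.need_gpa = False
        self.need_gpa_vpn = None
        self.need_gpa_robidx = None
        self.need_gpa_gvpn = None
        self.need_gpa_refill = False

    def on_gpf(self, vpn, robidx):
        # A lookup hit an entry with a guest page fault: record the request
        # and re-issue PTW with getGpa set; the TLB response stays a miss.
        self.need_gpa, self.need_gpa_refill = True, False
        self.need_gpa_vpn, self.need_gpa_robidx = vpn, robidx

    def on_ptw_resp(self, vpn, s2_tag):
        # The getGpa walk came back: capture the gvpn; the TLB is not refilled.
        if self.need_gpa and vpn == self.need_gpa_vpn:
            self.need_gpa_gvpn, self.need_gpa_refill = s2_tag, True

    def on_lookup(self, vpn):
        # A retry with the same vpn: hand back the gvpn used to build gpaddr.
        # need_gpa drops, but the refilled gvpn persists for identical vpns.
        if self.need_gpa_refill and vpn == self.need_gpa_vpn:
            self.need_gpa = False
            return self.need_gpa_gvpn
        return None

    def on_redirect(self, flushed):
        # 'flushed(robidx)' reports whether the waiting request was squashed.
        if self.need_gpa and flushed(self.need_gpa_robidx):
            self.__init__()
```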
Regarding the process of obtaining the `gpaddr` after a guest page fault, some key points are reiterated:

- The mechanism for getting the `gpa` can be seen as a buffer with only one entry. When a request causes a guest page fault, the corresponding `need_gpa` information is written into this buffer, and it persists until the condition `need_gpa_vpn_hit && resp_gpa_refill` holds, or a `flush` (ITLB) / `redirect` (DTLB) signal flushes the `gpa` information.
- `need_gpa_vpn_hit` means: after a request causes a guest page fault, its `vpn` is written into `need_gpa_vpn`. If the same `vpn` queries the TLB again, the `need_gpa_vpn_hit` signal is raised, indicating that the obtained `gpaddr` corresponds to the original `get_gpa` request. If `resp_gpa_refill` is also high at that moment, the `vpn` has already obtained its `gpaddr`, which can be returned to the frontend fetch / backend memory access for exception handling.
- Therefore, for any frontend or memory access request that triggers `gpa` fetching, one of the following must hold later:
  - The request that triggered `gpa` fetching can be resent (the TLB keeps returning `miss` for this request until the `gpaddr` result is obtained).
  - The `gpa` request is flushed by sending a `flush` or `redirect` signal to the TLB. Specifically, for all possible requests:
    - ITLB fetch requests: if a fetch request causing `gpf` is on a speculative path and the misspeculation is detected, it is flushed via the `flushPipe` signal (including backend `redirect`, or when a later-stage predictor in the frontend multi-stage branch predictor overrides the prediction of an earlier stage, etc.); otherwise, since the ITLB keeps returning `miss` for this request, the frontend guarantees that the same `vpn` request is resent.
    - DTLB load requests: if a load request causing `gpf` is on a speculative path and the misspeculation is detected, it is flushed via the `redirect` signal (the `robidx` at which the `gpf` occurred must be compared with the `robidx` of the incoming `redirect`); otherwise, since the DTLB returns `miss` for this request and also raises the `tlbreplay` signal, the load replay queue is guaranteed to resend it.
    - DTLB store requests: if a store request causing `gpf` is on a speculative path and the misspeculation is detected, it is flushed via the `redirect` signal (again comparing the `robidx` of the `gpf` with the `robidx` of the incoming `redirect`); otherwise, since the DTLB returns `miss` for this request, the backend is guaranteed to reschedule the store instruction and resend the request.
    - DTLB prefetch requests: the returned `gpf` signal is raised, indicating that the address of the prefetch request caused a `gpf`, but it is not written into the `gpa*` registers and does not trigger the `gpaddr` lookup mechanism, so it need not be considered.
    - In the current mechanism, the TLB entry that caused the `gpf` must not be replaced while waiting for the `gpa`. This is ensured simply by blocking TLB refills while waiting for the `gpa`, which avoids any replacement. Since a `gpf` leads to an exception handler and subsequent instructions are redirected and flushed anyway, blocking refills while waiting for the `gpa` does not hurt performance.
Overall Block Diagram
The overall block diagram of the L1 TLB is shown in the figure below, with the ITLB and DTLB in the green box. The ITLB receives address translation requests from the Frontend, and the DTLB receives address translation requests from Memblock. The Frontend requests comprise 3 requests from the ICache and 1 request from the IFU; the Memblock requests comprise 2 requests from the LoadUnits (the AtomicsUnit shares one LoadUnit request channel), 1 request from the L1 Load stream & stride prefetcher, 2 requests from the StoreUnits, and 1 request from the SMSPrefetcher.
After the ITLB and DTLB query results are obtained, PMP and PMA checks need to be performed. Since the area of L1 TLB is small, backups of PMP and PMA registers are not stored inside the L1 TLB, but in the Frontend or Memblock, providing checks for ITLB and DTLB respectively. After ITLB and DTLB miss, they need to go through a repeater to send a query request to L2 TLB.
Interface Timing
ITLB and Frontend Interface Timing
Frontend Sends Translation Request Hitting ITLB
The timing for a Frontend translation request that hits the ITLB is shown in the figure below.
When a translation request sent by the Frontend to the ITLB hits, the `resp_miss` signal stays 0. At the next clock rising edge after `req_valid` is 1, the ITLB sets the `resp_valid` signal to 1 and returns the translated physical address to the Frontend, together with whether a guest page fault, page fault, or access fault occurred. The timing is as follows:

- Cycle 0: the Frontend sends a translation request to the ITLB, setting `req_valid` to 1.
- Cycle 1: the ITLB returns the physical address to the Frontend, setting `resp_valid` to 1.
Frontend Sends Translation Request Missing ITLB
The timing for a Frontend translation request that misses the ITLB is shown in the figure below.
When a translation request sent by the Frontend to the ITLB misses, the `resp_miss` signal is returned in the next cycle, indicating an ITLB miss. This requestor channel of the ITLB then accepts no new requests, and the Frontend repeatedly resends the request until the page table is fetched from the L2 TLB or memory. (Note that "this requestor channel of the ITLB accepts no new requests" is enforced by the Frontend: whether the Frontend chooses to resend the missed request or to send other requests, its behavior is transparent to the TLB, and if the Frontend does send a new request, the ITLB simply discards the old one.)
When the ITLB misses, it sends a PTW request to the L2 TLB until the result is obtained. The timing interaction between the ITLB and the L2 TLB, and the return of the physical address and related information to the Frontend, are shown in the timing diagram in Figure 4.4 and the description below:

- Cycle 0: the Frontend sends a translation request to the ITLB, setting `req_valid` to 1.
- Cycle 1: the ITLB lookup misses; it returns `resp_miss` as 1 and sets `resp_valid` to 1 toward the Frontend. In the same cycle, the ITLB sends a PTW request to the L2 TLB (actually to `itlbrepeater1`), setting `ptw_req_valid` to 1.
- Cycle X: the L2 TLB returns a PTW reply to the ITLB, including the virtual page number of the PTW request, the obtained physical page number, page table information, etc., with `ptw_resp_valid` at 1. In this cycle the ITLB has received the PTW reply, and `ptw_req_valid` is set to 0.
- Cycle X+1: the ITLB now hits: `resp_valid` is 1 and `resp_miss` is 0. The ITLB returns the physical address and the page fault / access fault information to the Frontend.
- Cycle X+2: the `resp_valid` signal returned by the ITLB to the Frontend is set to 0.
DTLB and Memblock Interface Timing
Memblock Sends Translation Request Hitting DTLB
The timing for a Memblock translation request that hits the DTLB is shown in the figure below.
When a translation request sent by Memblock to the DTLB hits, the `resp_miss` signal stays 0. At the next clock rising edge after `req_valid` is 1, the DTLB sets the `resp_valid` signal to 1 and returns the translated physical address to Memblock, together with whether a page fault or access fault occurred. The timing is as follows:

- Cycle 0: Memblock sends a translation request to the DTLB, setting `req_valid` to 1.
- Cycle 1: the DTLB returns the physical address to Memblock, setting `resp_valid` to 1.
Memblock Sends Translation Request Missing DTLB
Both the DTLB and the ITLB are non-blocking (there is no blocking logic inside the TLB: if the request source keeps resending the same request after a miss, the effect looks blocking; if the request source schedules other requests after receiving the miss feedback, the effect looks non-blocking). Unlike frontend instruction fetch, a Memblock request that misses the DTLB does not block the pipeline: the DTLB returns the miss and `resp_valid` signals to Memblock in the cycle after `req_valid`, and Memblock can reschedule and continue with other requests after seeing the miss.
After a Memblock access misses the DTLB, the DTLB sends a PTW request to the L2 TLB to fetch the page table from the L2 TLB or memory. The DTLB passes the request to the L2 TLB through a Filter, which merges duplicate requests, ensuring no duplicate entries in the DTLB and improving L2 TLB utilization. The timing for a Memblock translation request that misses the DTLB is shown in the figure below; it covers only the span from the request miss to the DTLB sending the PTW request to the L2 TLB.
After the DTLB receives the PTW reply from the L2 TLB, it stores the page table entry in the DTLB. When Memblock accesses the DTLB again, the request hits, which is the same situation as the hit case above. The timing interaction between the DTLB and the L2 TLB is the same as the `ptw_req` and `ptw_resp` parts of the ITLB miss timing diagram.
TLB and tlbRepeater Interface Timing
TLB Sends PTW Request to tlbRepeater
The interface timing diagram for the TLB sending a PTW request to the `tlbRepeater` is shown in the figure below.
In the Kunming Lake architecture, both the ITLB and DTLB use non-blocking access. On a TLB miss they send a PTW request to the L2 TLB, but they block neither the pipeline nor the PTW channel between the TLB and the Repeater while waiting for the PTW reply; the TLB can keep sending PTW requests to the `tlbRepeater`. The `tlbRepeater` merges duplicate requests by virtual page number, avoiding wasted L2 TLB resources and preventing duplicate entries in the L1 TLB.
As the timing relationship in the figure shows, in the cycle after the TLB sends a PTW request to the Repeater, the Repeater passes the request onward. Since the Repeater has already sent a PTW request with virtual page number vpn1 to the L2 TLB, when it receives another PTW request with the same virtual page number, it does not pass it on to the L2 TLB.
itlbRepeater Returns PTW Reply to ITLB
The interface timing diagram for the `itlbRepeater` returning a PTW reply to the ITLB is shown in the figure below. The timing is as follows:

- Cycle X: the `itlbRepeater` receives the PTW reply from the L2 TLB, passed through the lower-level `itlbRepeater`, and `itlbrepeater_ptw_resp_valid` is high.
- Cycle X+1: the ITLB receives the PTW reply from the `itlbRepeater`.
dtlbRepeater Returns PTW Reply to DTLB
The interface timing diagram for the `dtlbRepeater` returning a PTW reply to the DTLB is shown in the figure below. The timing is as follows:

- Cycle X: the `dtlbRepeater` receives the PTW reply from the L2 TLB, passed through the lower-level `dtlbRepeater`, and `dtlbrepeater_ptw_resp_valid` is high.
- Cycle X+1: the `dtlbRepeater` passes the PTW reply to Memblock.
- Cycle X+2: the DTLB receives the PTW reply.