Level 2 Module L2 TLB
L2TLBWrapper refers to the following module:
- L2TLBWrapper ptw: provides an abstraction layer for the L2 TLB.
L2 TLB refers to:
- L2TLB ptw
Design Specification
This subsection describes the overall design specification of the L2 TLB module. For the design specification of the L2 TLB module's sub-modules, please refer to the content in the Level 3 Modules section of this document.
- Supports receiving PTW requests from the L1 TLB.
- Supports returning PTW replies to the L1 TLB.
- Supports signal and register replication.
- Supports the exception handling mechanism.
- Supports TLB compression.
- Supports two-stage address translation.
Functionality
The L2 TLB is a larger page table cache, shared by the ITLB and DTLB. When an L1 TLB miss occurs, a Page Table Walk request is sent to the L2 TLB. The L2 TLB is divided into six parts: Page Cache (see Section 5.3.7), Page Table Walker (see Section 5.3.8), Last Level Page Table Walker (see Section 5.3.9), Hypervisor Page Table Walker (see Section 5.3.10), Miss Queue (see Section 5.3.11), and Prefetcher (see Section 5.3.12).
Requests from the L1 TLB will first access the Page Cache. For requests without two-stage address translation, if a leaf node is hit, it is directly returned to the L1 TLB. Otherwise, depending on the page table level hit in the Page Cache and the availability of the Page Table Walker and Last Level Page Table Walker, the request enters the Page Table Walker, Last Level Page Table Walker, or Miss Queue (see Section 5.3.7). For requests with two-stage address translation, if the request is onlyStage1, the handling is the same as requests without two-stage address translation; if the request is onlyStage2, and a leaf page table is hit, it is returned directly; if not hit, it is sent to the Page Table Walker for translation; if the request is allStage, since the Page Cache can only query one page table at a time, it first queries the stage one page table. There are two cases: if the stage one page table hits, it is sent to the Page Table Walker for the subsequent translation process; if the stage one page table does not hit a leaf node, depending on the hit page table level and the availability of the Page Table Walker and Last Level Page Table Walker, the request enters the Page Table Walker, Last Level Page Table Walker, or Miss Queue. To speed up page table access, the Page Cache caches all three levels of page tables separately and can query all three levels simultaneously (see Section 5.3.7). The Page Cache supports ECC checking. If an ECC check error occurs, the current entry is invalidated, and a Page Walk is re-initiated.
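The dispatch decision on a Page Cache miss can be summarized as follows. This is a minimal plain-Scala sketch under stated assumptions, not the actual Chisel RTL; the names `hitLevel`, `ptwReady`, `llptwReady`, and `dispatchOnMiss` are illustrative.

```scala
// Hedged sketch: where a request that misses a leaf entry in the Page Cache may
// be sent, based on the deepest page table level already cached and on whether
// the PTW and LLPTW can accept a new request.
sealed trait Dispatch
case object ToPTW       extends Dispatch // upper levels (1GB/2MB) still need walking
case object ToLLPTW     extends Dispatch // only the last (4KB) level is missing
case object ToMissQueue extends Dispatch // wait in the Miss Queue, retry the Page Cache later

// hitLevel: deepest page table level found in the Page Cache for this VPN
// (0 = nothing cached, 1 = 1GB-level entry cached, 2 = 2MB-level entry cached)
def dispatchOnMiss(hitLevel: Int, ptwReady: Boolean, llptwReady: Boolean): Dispatch =
  if (hitLevel >= 2) {          // only the 4KB walk remains: handled by the LLPTW
    if (llptwReady) ToLLPTW else ToMissQueue
  } else {                      // upper levels still missing: full walk in the PTW
    if (ptwReady) ToPTW else ToMissQueue
  }
```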
The Page Table Walker receives requests from the Page Cache and performs Hardware Page Table Walk. For requests without two-stage address translation, the Page Table Walker only accesses the first two levels (1GB and 2MB) of page tables and does not access the 4KB page table. The access to the 4KB page table is handled by the Last Level Page Table Walker. If the Page Table Walker accesses a leaf node (large page), it is returned to the L1 TLB. Otherwise, it needs to be returned to the Last Level Page Table Walker, which performs the access to the last level page table. The Page Table Walker can only handle one request at a time and cannot access the first two levels of page tables in parallel. For requests with two-stage address translation, in the first case, if it is an allStage request and the stage one translated page table hits, the PTW sends a stage two request to the Page Cache for query. If it misses, it is sent to the Hypervisor Page Table Walker, and the stage two translation result is returned to the PTW; in the second case, if it is an allStage request and the stage one translated leaf node misses, the PTW translation process is similar to non-virtualized requests, with the difference being that the physical addresses encountered during the PTW translation process are guest physical addresses, which require a stage two address translation before memory access. For details, see the Page Table Walker module description; in the third case, if it is an onlyStage2 request, the PTW sends an external request for stage two translation. After receiving the response, it is returned to the L1TLB; the fourth type of request is an onlyStage1 request. The handling process for this request inside the PTW is consistent with the handling process for non-virtualized requests.
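The ordering of the allStage case can be illustrated with a plain-Scala sketch. It is not the RTL: `stageTwo` and `readMem` are placeholders, and the loop glosses over the fact that the PTW walks only the upper levels while the LLPTW handles the 4KB level. Its point is only that every PTE address produced by the stage one walk is a guest physical address and must pass through stage two before memory is accessed.

```scala
// Hedged sketch of the allStage walk order: stage-two translation of each
// guest-physical PTE address, then the stage-one PTE read.
case class WalkPte(ppn: Long, isLeaf: Boolean)

def stageTwo(gpa: Long): Long   = gpa                         // placeholder: Page Cache hit or HPTW walk
def readMem(pa: Long): WalkPte  = WalkPte(pa >> 12, isLeaf = true) // placeholder memory read

def allStageWalk(rootTableGpa: Long, vpnSlices: Seq[Long]): WalkPte = {
  var tableGpa = rootTableGpa
  var pte      = WalkPte(0, isLeaf = false)
  for (slice <- vpnSlices if !pte.isLeaf) {
    val pteGpa = tableGpa + slice * 8   // stage-one PTE address (guest physical)
    val ptePa  = stageTwo(pteGpa)       // stage-two translation before the access
    pte        = readMem(ptePa)         // then read the stage-one PTE
    tableGpa   = pte.ppn << 12          // next-level table, again a guest PA
  }
  pte // the stage-one leaf still needs its own stage-two translation before the reply
}
```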
The Miss Queue receives requests from the Page Cache and Last Level Page Table Walker, waiting for the next access to the Page Cache. The Prefetcher uses the Next-Line prefetching algorithm. When a miss occurs, or when a hit occurs but the hit entry is a prefetched entry, it generates the next prefetch request.
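The prefetch trigger described above amounts to a simple condition. The sketch below is plain Scala with illustrative names; in particular, the stride (next page versus the next 512-bit group of PTEs) is an assumption, not taken from the RTL.

```scala
// Hedged sketch of the Next-Line prefetch trigger: prefetch on a miss, or on a
// hit whose entry was itself brought in by a prefetch.
def nextLinePrefetch(vpn: Long, hit: Boolean, hitWasPrefetched: Boolean,
                     stride: Long = 1): Option[Long] =
  if (!hit || hitWasPrefetched) Some(vpn + stride) // issue the next prefetch request
  else None                                        // demand hit: no prefetch
```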
Receiving Requests from L1 TLB and Returning Replies
The L2 TLB, as a whole, receives PTW requests from the L1 TLB. PTW requests sent by the L1 TLB are passed into the L2 TLB through two levels of Repeaters. The L2 TLB returns replies to the itlbRepeater or dtlbRepeater depending on whether the request originated from the itlbRepeater or dtlbRepeater. The L2 TLB receives the virtual page number sent by the L1 TLB and returns information to the L1 TLB that includes the stage one page table, the stage two page table, and so on. The behavior of the L2 TLB is transparent to the L1 TLB; the two levels interact only through their interface signals.
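A hedged Chisel sketch of the interface implied above: a virtual page number in, stage one and stage two page table information out, with valid-ready handshakes in both directions. The type names, field names, and widths are illustrative assumptions, not the actual XiangShan bundles.

```scala
import chisel3._
import chisel3.util._

class PtwReqSketch(vpnBits: Int) extends Bundle {
  val vpn = UInt(vpnBits.W)                 // virtual page number to translate
}

class PtwRespSketch(s1Bits: Int, s2Bits: Int) extends Bundle {
  val s1 = UInt(s1Bits.W)                   // stage one (compressed) page table info
  val s2 = UInt(s2Bits.W)                   // stage two page table info
}

class L2TlbIoSketch(vpnBits: Int, s1Bits: Int, s2Bits: Int) extends Bundle {
  val req  = Flipped(Decoupled(new PtwReqSketch(vpnBits)))   // from the repeater
  val resp = Decoupled(new PtwRespSketch(s1Bits, s2Bits))    // back to the repeater
}
```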
Sending PTW Requests to L2 Cache
The L2 TLB sends PTW requests to the L2 Cache via the TileLink bus and is connected to the L2 Cache through the ptw_to_l2_buffer. It drives the TileLink A-channel signals to, and receives the D-channel signals from, ptw_to_l2_buffer.
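The sketch below shows a subset of what such a read carries: a Get on the A channel for one 64-byte (512-bit) line of page table entries, answered by AccessAckData on the D channel. Field names and opcodes follow the TileLink specification, but the bundles are illustrative stand-ins for the diplomacy-generated channel types used in the real code, and the widths are assumptions.

```scala
import chisel3._

class TlASketch(addrBits: Int, sourceBits: Int) extends Bundle {
  val opcode  = UInt(3.W)          // Get = 4
  val size    = UInt(3.W)          // log2(64 bytes) = 6 for a 512-bit line
  val source  = UInt(sourceBits.W)
  val address = UInt(addrBits.W)   // PMP/PMA-checked physical address
}

class TlDSketch(dataBits: Int, sourceBits: Int) extends Bundle {
  val opcode = UInt(3.W)           // AccessAckData = 1
  val source = UInt(sourceBits.W)
  val data   = UInt(dataBits.W)    // 512 bits = 8 page table entries
}
```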
Signal and Register Replication
Because the L2 TLB is a large module, the sfence signal and the CSR registers must drive logic spread across many sub-modules, so they are replicated several times. The replication exists purely for timing optimization and physical implementation and has no effect on functionality: every copy is identical, and each copy drives components in a different part of the module.
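A minimal Chisel sketch of such duplication is shown below. It is illustrative only: whether each copy is registered (as with RegNext here) or is a pure fan-out wire is an assumption, and the width and copy count are parameters rather than the actual values.

```scala
import chisel3._

class DupSketch(w: Int, nDup: Int) extends Module {
  val io = IO(new Bundle {
    val in  = Input(UInt(w.W))              // incoming sfence/csr bits
    val out = Output(Vec(nDup, UInt(w.W)))  // one functionally identical copy per consumer
  })
  // Every copy carries the same value; only placement and fan-out differ.
  io.out := VecInit(Seq.fill(nDup)(RegNext(io.in)))
}
```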
The replication situations and the driven components are shown in the following table:
| Replicated Signal | Index | Driven Component |
| --- | --- | --- |
| sfence | sfence_dup(0) | Prefetcher |
| | sfence_dup(1) | Last Level Page Table Walker |
| | sfence_dup(2) | cache(0) |
| | sfence_dup(3) | cache(1) |
| | sfence_dup(4) | cache(2) |
| | sfence_dup(5) | cache(3) |
| | sfence_dup(6) | Miss Queue |
| | sfence_dup(7) | Page Table Walker |
| | sfence_dup(8) | Hypervisor Page Table Walker |
| csr | csr_dup(0) | Prefetcher |
| | csr_dup(1) | Last Level Page Table Walker |
| | csr_dup(2) | cache(0) |
| | csr_dup(3) | cache(1) |
| | csr_dup(4) | cache(2) |
| | csr_dup(5) | Miss Queue |
| | csr_dup(6) | Page Table Walker |
| | csr_dup(7) | Hypervisor Page Table Walker |
Exception Handling Mechanism
Exceptions that may occur in the L2 TLB include guest page fault, page fault, access fault, and ECC check error. Guest page faults, page faults, and access faults are handed over to the L1 TLB, which handles them according to the request source. ECC check errors are handled inside the L2 TLB: the affected entry is invalidated, a miss result is returned, and a Page Walk is re-initiated. See Section 6 of this document: Exception Handling Mechanism.
TLB Compression
With the virtualization extension added, the stage one logic in the L2 TLB reuses the TLB compression design, and the structure returned to the L1 TLB is still a compressed TLB structure, although compression is not enabled in the L1 TLB. Stage two does not use the compressed structure; only a single page table is returned in the response.
The L2 TLB accesses memory with a width of 512 bits each time and can return 8 page table entries each time. The L3 entries of the Page Cache are composed of SRAM and can store 8 consecutive page table entries when refilling; the SP entries of the Page Cache are composed of a register file and only store a single page table entry when refilling. Therefore, when the Page Cache hits and returns to the L1 TLB (in the case of non-two-stage address translation; if it is two-stage address translation, a stage one hit is sent to the PTW for subsequent processing), if a 4KB page table is hit, the 8 consecutive page table entries stored in the Page Cache can be compressed; if a large page is hit, compression is not performed, and it is refilled directly into the L1 TLB. (In fact, misses for 1GB or 2MB large pages are rare, so compression is only considered for 4KB pages. For 4KB pages, the low 3 bits of the physical address of the page table are the low 3 bits of the virtual page number. Thus, after compressing 8 consecutive entries, it is required that the high 21 bits of the PPN of these page tables are the same for compression. For 1GB or 2MB large pages, the low 9 bits of the PPN are not used to generate the physical address, so they are meaningless in the current design.)
When the Page Cache misses, after accessing the page table in memory through the Page Table Walker or Last Level Page Table Walker, the page table from memory is returned to the L1 TLB and refilled into the Page Cache. If two-stage address translation is required, the page table accessed in the Hypervisor Page Table Walker is also refilled into the Page Cache. The HPTW returns the final translation result to the PTW or LLPTW, and the PTW or LLPTW returns both stage one and stage two page tables to the L1TLB. Non-two-stage translation requests, when returned to the L1 TLB, can also compress the 8 consecutive page table entries. Since the Page Table Walker only returns directly to the L1 TLB when it accesses a leaf node, the page tables returned by the Page Table Walker to the L1 TLB are all large pages. Since large pages have little impact on performance, considering the simplicity of the optimization solution implementation and reusing the data path of the SP entry in the Page Cache, large pages returned by the Page Table Walker are not compressed.
The L2 TLB only performs compression for 4KB page tables. According to the Sv39 paging mechanism in the RISC-V privileged architecture manual, the low 3 bits of the physical address of the 4KB page table are the low 3 bits of the virtual page number. Therefore, the 8 consecutive page table entries returned by the Page Cache or Last Level Page Table Walker can be indexed by the low 3 bits of the virtual page number. Among them, the valid bit indicates whether the compressed page table entry is valid. By indexing using the low three bits of the virtual page number from the page table lookup request sent by the L1 TLB, the page table entry corresponding to that virtual page number is found. The valid bit of this page table entry must be 1. For the other 7 page table entries consecutive to this entry, their high bits of the physical page number and page table attributes are compared. If the high bits of the physical page number and the page table attributes are the same as the page table entry indexed by the low three bits of the virtual page number, the valid bit is 1, otherwise it is 0. At the same time, the L2 TLB also returns pteidx, which indicates which entry among these 8 consecutive page table entries corresponds to the VPN sent by the L1 TLB. L2 TLB compression is illustrated in the accompanying figures.
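A hedged plain-Scala model of this rule is given below; the names, and the use of Long/Int values where the hardware works on bit vectors combinationally, are illustrative.

```scala
// Among the 8 consecutive PTEs of one 512-bit line (indexed by vpn[2:0]), an
// entry is marked valid only when its high PPN bits and its attributes match
// the entry selected by the requested VPN.
case class PteSketch(ppn: Long, attrs: Int)

def compress(ptes: Seq[PteSketch], vpnLow3: Int): (Seq[Boolean], Int) = {
  require(ptes.length == 8 && vpnLow3 >= 0 && vpnLow3 < 8)
  val base  = ptes(vpnLow3)                 // the entry the request actually maps to
  val valid = ptes.map { p =>               // valid(i): entry i may share the compressed entry
    (p.ppn >> 3) == (base.ppn >> 3) &&      // drop the low 3 PPN bits, compare the high bits
      p.attrs == base.attrs                 // attributes must match as well
  }
  (valid, vpnLow3)                          // pteidx = low 3 bits of the request VPN
}
```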
After implementing TLB compression, each entry in the L1 TLB is a compressed TLB entry, looked up by the high bits of the virtual page number. The hit condition for a TLB entry is that in addition to the high bits of the virtual page number being the same, the valid bit corresponding to the low bits of the virtual page number must also be 1, indicating that the looked-up page table entry is valid in the compressed TLB entry. The part related to L1 TLB and TLB compression is detailed in the L1TLB module description.
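The hit condition on a compressed entry reduces to a tag match plus one valid bit, as in the hedged plain-Scala model below (names are illustrative, not the L1 TLB's actual fields).

```scala
// Hit on a compressed L1 TLB entry: the high VPN bits must match the stored tag
// and the valid bit selected by the low 3 VPN bits must be set.
case class CompressedEntrySketch(tagHighVpn: Long, valid: Seq[Boolean])

def compressedHit(e: CompressedEntrySketch, vpn: Long): Boolean =
  (vpn >> 3) == e.tagHighVpn && e.valid((vpn & 0x7).toInt)
```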
Overall Block Diagram
As shown in the block diagram, the L2 TLB is divided into six parts: Page Cache, Page Table Walker, Last Level Page Table Walker, Hypervisor Page Table Walker, Miss Queue, and Prefetcher.
Requests from the L1 TLB will first access the Page Cache. For requests without two-stage address translation, if a leaf node is hit, it is directly returned to the L1 TLB. Otherwise, depending on the page table level hit in the Page Cache and the availability of the Page Table Walker and Last Level Page Table Walker, the request enters the Page Table Walker, Last Level Page Table Walker, or Miss Queue (see Section 5.3.7). For requests with two-stage address translation, if the request is onlyStage1, the handling is the same as requests without two-stage address translation; if the request is onlyStage2, and a leaf page table is hit, it is returned directly; if not hit, it is sent to the Page Table Walker for translation; if the request is allStage, since the Page Cache can only query one page table at a time, it first queries the stage one page table. There are two cases: if the stage one page table hits, it is sent to the Page Table Walker for the subsequent translation process; if the stage one page table does not hit a leaf node, depending on the hit page table level and the availability of the Page Table Walker and Last Level Page Table Walker, the request enters the Page Table Walker, Last Level Page Table Walker, or Miss Queue. To speed up page table access, the Page Cache caches all three levels of page tables separately and can query all three levels simultaneously (see Section 5.3.7). The Page Cache supports ECC checking. If an ECC check error occurs, the current entry is invalidated, and a Page Walk is re-initiated.
The Page Table Walker receives requests from the Page Cache and performs Hardware Page Table Walk. For requests without two-stage address translation, the Page Table Walker only accesses the first two levels (1GB and 2MB) of page tables and does not access the 4KB page table. The access to the 4KB page table is handled by the Last Level Page Table Walker. If the Page Table Walker accesses a leaf node (large page), it is returned to the L1 TLB. Otherwise, it needs to be returned to the Last Level Page Table Walker, which performs the access to the last level page table. The Page Table Walker can only handle one request at a time and cannot access the first two levels of page tables in parallel. For requests with two-stage address translation, in the first case, if it is an allStage request and the stage one translated page table hits, the PTW sends a stage two request to the Page Cache for query. If it misses, it is sent to the Hypervisor Page Table Walker, and the stage two translation result is returned to the PTW; in the second case, if it is an allStage request and the stage one translated leaf node misses, the PTW translation process is similar to non-virtualized requests, with the difference being that the physical addresses encountered during the PTW translation process are guest physical addresses, which require a stage two address translation before memory access. For details, see the Page Table Walker module description; in the third case, if it is an onlyStage2 request, the PTW sends an external request for stage two translation. After receiving the response, it is returned to the L1TLB; the fourth type of request is an onlyStage1 request. The handling process for this request inside the PTW is consistent with the handling process for non-virtualized requests.
The Miss Queue receives requests from the Page Cache and Last Level Page Table Walker, waiting for the next access to the Page Cache. The Prefetcher uses the Next-Line prefetching algorithm. When a miss occurs, or when a hit occurs but the hit entry is a prefetched entry, it generates the next prefetch request.
The following arbiters are involved in the diagram, named according to the Chisel code; a wiring sketch follows the list:
- arb1: 2 to 1 arbiter, shown as Arbiter 2 to 1 in the diagram. Inputs are ITLB (itlbRepeater2) and DTLB (dtlbRepeater2); output to Arbiter 5 to 1.
- arb2: 5 to 1 arbiter, shown as Arbiter 5 to 1 in the diagram. Inputs are Miss Queue, Page Table Walker, arb1, hptw_req_arb, and Prefetcher; output to Page Cache.
- hptw_req_arb: 2 to 1 arbiter. Inputs are Page Table Walker and Last Level Page Table Walker, output to Page Cache.
- hptw_resp_arb: 2 to 1 arbiter. Inputs are Page Cache and Hypervisor Page Table Walker, output to PTW or LLPTW.
- outArb: 1 to 1 arbiter. Input is mergeArb, output to L1TLB resp.
- mergeArb: 3 to 1 arbiter. Inputs are Page Cache, Page Table Walker, and Last Level Page Table Walker, output to outArb.
- mq_arb: 2 to 1 arbiter. Inputs are Page Cache and Last Level Page Table Walker; output to Miss Queue.
- mem_arb: 3 to 1 arbiter. Inputs are Page Table Walker, Last Level Page Table Walker, and Hypervisor Page Table Walker; output to L2 Cache. (There is also a mem_arb inside the Last Level Page Table Walker that arbitrates the memory requests from all Last Level Page Table Walker entries before passing them to this mem_arb.)
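The Chisel sketch below mirrors only the fan-in structure of arb1, hptw_req_arb, and arb2 from the list above; the bare 27-bit VPN payload is an assumed stand-in for the full request bundles (with source and bypass bookkeeping) that the real arbiters carry.

```scala
import chisel3._
import chisel3.util._

class ArbTreeSketch extends Module {
  private def gen = UInt(27.W)  // assumed payload: a bare VPN

  val io = IO(new Bundle {
    val itlb, dtlb                 = Flipped(Decoupled(gen)) // from the two repeaters
    val missQueue, ptw, prefetch   = Flipped(Decoupled(gen)) // replayed / internal sources
    val hptwFromPtw, hptwFromLlptw = Flipped(Decoupled(gen)) // hptw queries into the cache
    val toPageCache                = Decoupled(gen)
  })

  val arb1 = Module(new Arbiter(gen, 2))          // ITLB vs DTLB
  arb1.io.in(0) <> io.itlb
  arb1.io.in(1) <> io.dtlb

  val hptw_req_arb = Module(new Arbiter(gen, 2))  // hptw requests from PTW vs LLPTW
  hptw_req_arb.io.in(0) <> io.hptwFromPtw
  hptw_req_arb.io.in(1) <> io.hptwFromLlptw

  val arb2 = Module(new Arbiter(gen, 5))          // all sources -> Page Cache
  arb2.io.in(0) <> io.missQueue
  arb2.io.in(1) <> io.ptw
  arb2.io.in(2) <> arb1.io.out
  arb2.io.in(3) <> hptw_req_arb.io.out
  arb2.io.in(4) <> io.prefetch
  io.toPageCache <> arb2.io.out
}
```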
The hit path of the L2 TLB module is shown in the figure. Requests from the ITLB and DTLB are first arbitrated and then sent to the Page Cache for lookup. For requests without two-stage address translation, with only stage two translation, or with only stage one translation, if the Page Cache hits, the hit page table entry and physical address information from the Page Cache are returned directly to the L1 TLB. For allStage requests, the Page Cache first queries the stage one page table. If stage one hits, the request is sent to the PTW. The PTW then issues an hptw request, which queries the Page Cache; if it hits, the result is returned to the PTW, and if it misses, the request is sent to the HPTW. After the HPTW query completes, the result is sent to the PTW, and the page table accessed by the HPTW is also refilled into the Page Cache. All PTW requests sent by the ITLB and DTLB, as well as hptw requests sent by the PTW or LLPTW, always query the Page Cache first.
For miss situations, all modules may be involved. Requests from the ITLB and DTLB are first arbitrated and then sent to the Page Cache for lookup. If the Page Cache misses, the request may enter the Miss Queue depending on the situation (hptw requests sent by the PTW or LLPTW to the Page Cache, and prefetch requests, do not enter the Miss Queue). A missed request enters the Miss Queue in the following cases: bypass requests; requests sent by the L1 TLB to the Page Cache (isFirst) that are to enter the PTW; requests from the Miss Queue that are to enter the PTW while the PTW is busy; and requests that are to enter the LLPTW while the LLPTW is busy. Based on the page table level hit in the Page Cache, the Page Cache decides whether the request enters the Page Table Walker or the Last Level Page Table Walker for the walk (hptw requests are sent to the HPTW). The Page Table Walker can handle only one request at a time and accesses the first two levels of the page table in memory; the Last Level Page Table Walker is responsible for accessing the last-level 4KB page table. The Hypervisor Page Table Walker can also handle only one request at a time.
Page Table Walker, Last Level Page Table Walker, and Hypervisor Page Table Walker can all send requests to memory to access the page table content in memory. Before accessing the page table content in memory via the physical address, the physical address needs to be checked by the PMP and PMA modules. If the PMP and PMA checks fail, no request is sent to memory. Requests from Page Table Walker, Last Level Page Table Walker, and Hypervisor Page Table Walker are arbitrated and then sent to the L2 Cache via the TileLink bus.
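The gating described above can be summarized as a simple condition; the plain-Scala sketch below uses illustrative names and callbacks rather than the RTL interfaces.

```scala
// A walker only issues a memory request when both the PMP and PMA checks on the
// physical address pass; otherwise it raises an access fault instead.
def issueMemRequest(pa: Long, pmpPass: Boolean, pmaPass: Boolean,
                    sendToL2: Long => Unit, raiseAccessFault: () => Unit): Unit =
  if (pmpPass && pmaPass) sendToL2(pa) else raiseAccessFault()
```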
Both the Page Table Walker and Last Level Page Table Walker may return PTW replies to the L1 TLB. The Page Table Walker may generate a reply in the following situations:
- Non-two-stage translation requests and only stage one translation requests: when a leaf node is accessed (a 1GB or 2MB large page), it is returned directly to the L1 TLB.
- Only stage two translation requests: after receiving the stage two translation result.
- Requests with both two-stage translations: after both the stage one leaf page table and stage two leaf page table are obtained.
- A Page fault or Access fault occurs in the stage two translation result.
- A Page fault or Access fault occurs during PMP or PMA checks; this also needs to be returned to the L1 TLB.
The Last Level Page Table Walker always returns a reply to the L1 TLB, in one of the following cases (a condensed sketch of the PTW and LLPTW reply conditions follows this list):
- Non-two-stage translation requests and only stage one translation requests: when a leaf node is accessed (4KB page).
- Requests with both two-stage translations: after both the stage one leaf page table and stage two leaf page table are obtained.
- An Access fault occurs during PMP or PMA checks.
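The two lists above reduce to a disjunction of conditions, summarized in the hedged plain-Scala sketch below; all parameter names are illustrative.

```scala
// When the PTW sends a reply to the L1 TLB, per the first list above.
def ptwReplies(leafLargePageHit: Boolean, onlyStage2Done: Boolean,
               bothStageLeavesDone: Boolean, stage2Fault: Boolean,
               pmpPmaFault: Boolean): Boolean =
  leafLargePageHit || onlyStage2Done || bothStageLeavesDone || stage2Fault || pmpPmaFault

// The LLPTW always replies; these are the cases from the second list above.
def llptwReplies(leaf4kDone: Boolean, bothStageLeavesDone: Boolean,
                 pmpPmaAccessFault: Boolean): Boolean =
  leaf4kDone || bothStageLeavesDone || pmpPmaAccessFault
```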
Interface Timing
L2 TLB and Repeater Interface Timing Diagram
The interface timing diagram between the L2 TLB and Repeater is shown in 此图. The L2 TLB and Repeater handshake using valid-ready signals. The Repeater sends the PTW request and the virtual address of the request issued by the L1 TLB to the L2 TLB; after querying the result, the L2 TLB returns the physical address and the corresponding page table to the Repeater.
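A minimal Chisel sketch of this valid-ready handshake is shown below; it is illustrative only, since the real repeater buffers full request/response bundles rather than a bare 27-bit VPN.

```scala
import chisel3._
import chisel3.util._

class RepeaterLinkSketch extends Module {
  val io = IO(new Bundle {
    val fromL1 = Flipped(Decoupled(UInt(27.W))) // request arriving from the L1 side
    val toL2   = Decoupled(UInt(27.W))          // request forwarded toward the L2 TLB
  })
  // A one-entry queue decouples the two sides; a beat moves across either
  // interface only in a cycle where both valid and ready are asserted.
  io.toL2 <> Queue(io.fromL1, 1)
}
```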
L2 TLB and L2 Cache Interface Timing Diagram
The interface timing diagram between the L2 TLB and L2 Cache follows the TileLink bus protocol.