
Repeater Module

The Repeater includes the following modules:

  • PTWFilter itlbRepeater1
  • PTWRepeaterNB itlbRepeater2
  • PTWRepeaterNB itlbRepeater3
  • PTWNewFilter dtlbRepeater

Design Specifications

  1. Supports transferring PTW requests and replies between the L1 TLB and the L2 TLB.
  2. Supports filtering duplicate requests.
  3. Supports the TLB Hint mechanism.

Functionality

Transferring L1 TLB PTW Requests to L2 TLB

There is a relatively long physical distance between the L1 TLB and the L2 TLB, which results in long wire latency, so the Repeater module is needed to add pipeline stages in between. Since both the ITLB and the DTLB support multiple outstanding requests, the Repeater also acts much like an MSHR: its Filter removes duplicate requests, preventing duplicate entries from appearing in the L1 TLB. The number of entries in the Filter, to some extent, determines the parallelism of the L2 TLB. (See Section 5.1.1.2)

In the Kunming Lake architecture, the L2 TLB is located in the Memblock module, but it is still some distance away from both the ITLB and the DTLB. XiangShan's MMU therefore includes three itlbRepeaters and one dtlbRepeater, which add pipeline stages between the L1 TLB and the L2 TLB. Adjacent Repeater stages interact via valid-ready handshaking. The ITLB sends PTW requests together with the virtual page number to itlbRepeater1, which after arbitration forwards them to itlbRepeater2, which in turn forwards them to itlbRepeater3; itlbRepeater3 passes the PTW requests to the L2 TLB. The L2 TLB returns the virtual page number corresponding to the PTW request, the physical page number obtained from the L2 TLB lookup, the page table permission bits, the page table level, whether an exception occurred, and other signals to itlbRepeater3, then itlbRepeater2, and finally back to the ITLB via itlbRepeater1. The interaction between the DTLB and the dtlbRepeater is similar. The dtlbRepeater and itlbRepeater1 are Filter modules that can merge duplicate requests from the L1 TLBs. Since both the ITLB and the DTLB in the Kunming Lake architecture are non-blocking, these Repeaters are also non-blocking.
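To make the pipelining concrete, the following is a minimal Chisel sketch of a single non-blocking repeater stage. It is not the actual PTWRepeaterNB implementation: the bundle fields, widths, and the use of a one-entry queue per direction are illustrative assumptions.

```scala
import chisel3._
import chisel3.util._

// Simplified request/response payloads; field names and widths (vpn, ppn,
// level, pf) are illustrative assumptions, not the actual XiangShan bundles.
class PtwReqSketch extends Bundle {
  val vpn = UInt(27.W)
}
class PtwRespSketch extends Bundle {
  val vpn   = UInt(27.W)
  val ppn   = UInt(24.W)
  val level = UInt(2.W)
  val pf    = Bool() // whether an exception (page fault) occurred
}

// One repeater stage: a one-entry buffer in each direction adds a pipeline
// stage between the previous stage (or the L1 TLB) and the next stage (or the
// L2 TLB), with valid-ready handshaking on every port.
class RepeaterStageSketch extends Module {
  val io = IO(new Bundle {
    val reqIn   = Flipped(Decoupled(new PtwReqSketch))  // request from the previous stage
    val reqOut  = Decoupled(new PtwReqSketch)           // request to the next stage
    val respIn  = Flipped(Decoupled(new PtwRespSketch)) // response from the next stage
    val respOut = Decoupled(new PtwRespSketch)          // response to the previous stage
  })
  io.reqOut  <> Queue(io.reqIn,  entries = 1, pipe = true)
  io.respOut <> Queue(io.respIn, entries = 1, pipe = true)
}
```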

Filtering Duplicate Requests

Both the ITLB and the DTLB have multiple channels, and miss requests from different channels, or from the same channel, may be duplicates. If only a simple Arbiter were used, processing one request at a time, the other requests would be replayed in the L1 TLB, continue to miss, and be sent to the L2 TLB again. This leads to low utilization of the L2 TLB, and the replays also consume processor resources. Therefore, the Filter module is used. In essence, a Filter is a multi-input, single-output queue that filters out duplicate requests.
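The following Chisel sketch illustrates the idea of such a filtering queue under simplified assumptions; the entry count, vpn width, and the timing of entry release (here at issue, rather than at the PTW response as in the real Filter) are illustrative, not the actual design.

```scala
import chisel3._
import chisel3.util._

// Minimal sketch of duplicate filtering: an incoming vpn is compared against
// all valid entries; only a non-duplicate request allocates a new entry.
// For brevity an entry is released when it is issued to the L2 TLB, whereas
// the real Filter keeps it until the PTW response returns so that later
// duplicates can still be merged against it.
class FilterSketch(nEntries: Int = 8, vpnBits: Int = 27) extends Module {
  val io = IO(new Bundle {
    val in  = Flipped(Decoupled(UInt(vpnBits.W))) // miss request from an L1 TLB channel
    val out = Decoupled(UInt(vpnBits.W))          // deduplicated request toward the L2 TLB
  })
  val valids = RegInit(VecInit(Seq.fill(nEntries)(false.B)))
  val vpns   = Reg(Vec(nEntries, UInt(vpnBits.W)))

  val isDup   = (valids zip vpns).map { case (v, p) => v && p === io.in.bits }.reduce(_ || _)
  val hasFree = !valids.asUInt.andR
  val freeIdx = PriorityEncoder(valids.map(!_))

  // A duplicate is accepted but dropped (the existing entry already covers it);
  // a new vpn is accepted only when a free entry exists.
  io.in.ready := isDup || hasFree
  when(io.in.valid && io.in.ready && !isDup) {
    valids(freeIdx) := true.B
    vpns(freeIdx)   := io.in.bits
  }

  // Issue one pending entry per cycle (lowest index first in this sketch).
  val issueIdx = PriorityEncoder(valids)
  io.out.valid := valids.asUInt.orR
  io.out.bits  := vpns(issueIdx)
  when(io.out.valid && io.out.ready) { valids(issueIdx) := false.B }
}
```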

It should be noted that in the Kunming Lake architecture, the dtlbRepeater consists of three parts: load entries, store entries, and prefetch entries. Requests from the load DTLB, store DTLB, and prefetch DTLB are sent to these three groups of entries respectively for processing. A round-robin arbiter selects among the three groups and sends the winning request to the L2 TLB. Furthermore, the itlbRepeater checks all incoming ITLB requests against each other to filter duplicates, whereas the dtlbRepeater checks for duplicates only within each group: requests from the same DTLB (load, store, or prefetch) are guaranteed not to be duplicated, but requests sent to the L2 TLB from different DTLBs (e.g., the load DTLB and the store DTLB) may still be duplicates.
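A sketch of this three-way arbitration is shown below, using Chisel's round-robin arbiter between one candidate request from each entry group; the port names and widths are illustrative assumptions.

```scala
import chisel3._
import chisel3.util._

// Sketch of the arbitration described above: the load, store, and prefetch
// entry groups each present one candidate request, and a round-robin arbiter
// picks which one is sent to the L2 TLB each cycle.
class DtlbRepeaterArbSketch(vpnBits: Int = 27) extends Module {
  val io = IO(new Bundle {
    val load     = Flipped(Decoupled(UInt(vpnBits.W)))
    val store    = Flipped(Decoupled(UInt(vpnBits.W)))
    val prefetch = Flipped(Decoupled(UInt(vpnBits.W)))
    val toL2     = Decoupled(UInt(vpnBits.W))
  })
  val arb = Module(new RRArbiter(UInt(vpnBits.W), 3))
  arb.io.in(0) <> io.load
  arb.io.in(1) <> io.store
  arb.io.in(2) <> io.prefetch
  io.toL2 <> arb.io.out
  // Note: duplicates are only filtered inside each group, so the same vpn may
  // still reach the L2 TLB from two different groups, as described in the text.
}
```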

Supporting the TLB Hint Mechanism

TLB Hint Diagram

A TLB hit does not affect the lifecycle of a load instruction (the loadunit queries the TLB in cycle 0, and the TLB returns the result in cycle 1). On a TLB miss, the lookup continues in the L2 TLB and the in-memory page tables until a result is returned. From the perspective of the load instruction's lifecycle, however, after missing in the TLB the load enters the load replay queue to wait. Only after the load is replayed from the load replay queue and its TLB lookup hits, yielding the physical address, can subsequent operations based on the physical address proceed.

The point at which a load instruction is replayed is therefore a critical factor in reducing load execution time. If the load cannot be replayed promptly, then even if the TLB refill latency is shortened, overall memory access performance will not improve. For this reason, the Kunming Lake architecture implements the TLB Hint mechanism to selectively wake up load instructions that need to be replayed due to a TLB miss. Specifically, at the load_s0 stage the vaddr is sent to the TLB; if it misses, the miss information is returned at the load_s1 stage. Concurrently, at the load_s1 stage, the TLB sends this miss information to the dtlbRepeater for processing.

Processing in the dtlbRepeater yields one of two results: an MSHRid or a full signal. In the load entries of the dtlbRepeater, the new request is first checked against the existing entries for duplication. If it duplicates an existing entry, the MSHRid of that entry is returned. If it does not, the dtlbRepeater looks for an available entry: if one exists, its MSHRid is returned; otherwise a full signal is returned. If two load channels send requests with the same virtual address to the dtlbRepeater in the same cycle, the MSHRid from loadunit(0) is used.
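The sketch below shows this allocation logic for a single load channel under simplified assumptions: entry count, vpn width, and signal names are illustrative, and arbitration between two load channels as well as entry release on refill are omitted.

```scala
import chisel3._
import chisel3.util._

// Sketch of the hint response described above: for a DTLB miss from a load
// channel, return the MSHR id of a matching entry, the id of a newly
// allocated entry, or a full indication.
class HintRespSketch(nEntries: Int = 8, vpnBits: Int = 27) extends Module {
  val idBits = log2Ceil(nEntries)
  val io = IO(new Bundle {
    val reqValid = Input(Bool())
    val reqVpn   = Input(UInt(vpnBits.W))
    val mshrId   = Output(UInt(idBits.W))
    val full     = Output(Bool())
  })
  val valids = RegInit(VecInit(Seq.fill(nEntries)(false.B)))
  val vpns   = Reg(Vec(nEntries, UInt(vpnBits.W)))

  val hitVec  = VecInit((valids zip vpns).map { case (v, p) => v && p === io.reqVpn })
  val hit     = hitVec.asUInt.orR
  val hasFree = !valids.asUInt.andR
  val freeIdx = PriorityEncoder(valids.map(!_))

  // Duplicate: reuse the existing entry's id. New request: allocate if possible,
  // otherwise report full so the load replay queue replays on its own.
  io.full   := io.reqValid && !hit && !hasFree
  io.mshrId := Mux(hit, OHToUInt(hitVec), freeIdx)
  when(io.reqValid && !hit && hasFree) {
    valids(freeIdx) := true.B
    vpns(freeIdx)   := io.reqVpn
  }
  // Entry release on PTW refill is omitted in this sketch.
}
```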

In the Kunming Lake architecture, all instructions that enter the load replay queue due to a TLB miss can only wait for a wake-up signal before being replayed. If a load instruction entered the load replay queue and never received a wake-up signal, it would stall indefinitely. To avoid this, when the DTLB sends a request to the dtlbRepeater and the dtlbRepeater has no available entry to receive it, the dtlbRepeater must return a full signal, indicating that it is full and cannot accept the PTW request corresponding to this load instruction. In that case the load replay queue will not receive a Hint signal, and the load replay queue itself is responsible for ensuring the load is replayed without stalling. Besides this situation, when a refilled entry has arrived at the DTLB or the dtlbRepeater but has not yet actually been written into a DTLB entry, a full signal is also returned to the loadunit, indicating that a replay is needed.

At the load_s2 stage, the dtlbRepeater returns the MSHRid information to the loadunit, and this information is written into the load replay queue at the load_s3 stage. If the MSHRid is valid, the load replay queue waits for the PTW refill information to hit the MSHRid stored in the dtlbRepeater. At that point, the dtlbRepeater sends a wake-up (Hint) signal to the load replay queue, indicating that the entry for this MSHRid has been refilled and the load should be replayed, since it can now hit in the DTLB. Additionally, when a single PTW refill corresponds to multiple MSHR entries (e.g., two VPNs within the same 2MB region while the refilled page table level is a 2MB page), the dtlbRepeater sends a replay_all signal to the load replay queue, indicating that all load requests blocked by a DTLB miss need to be replayed. Since this situation is rare, this simple solution causes almost no performance loss.
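A sketch of the wake-up decision on a PTW refill is given below, under the same kind of simplified assumptions: exactly one matching entry produces a Hint with its MSHRid, while multiple matches (as with a 2MB refill covering several waiting VPNs) raise replay_all. Names and widths are illustrative, not the actual XiangShan signals.

```scala
import chisel3._
import chisel3.util._

// Sketch of the wake-up logic described above: when a PTW refill arrives,
// every entry whose vpn falls in the refilled range is matched.
class HintWakeupSketch(nEntries: Int = 8, vpnBits: Int = 27) extends Module {
  val io = IO(new Bundle {
    val refillValid = Input(Bool())
    val refillVpn   = Input(UInt(vpnBits.W))
    val refillIs2M  = Input(Bool())  // refilled page table level is a 2MB page
    val hintValid   = Output(Bool())
    val hintId      = Output(UInt(log2Ceil(nEntries).W))
    val replayAll   = Output(Bool())
  })
  val valids = RegInit(VecInit(Seq.fill(nEntries)(false.B)))
  val vpns   = Reg(Vec(nEntries, UInt(vpnBits.W)))

  // A 4KB refill must match the full vpn; a 2MB refill matches if the vpn bits
  // above the 9-bit last-level index agree.
  def matches(entryVpn: UInt): Bool =
    Mux(io.refillIs2M,
        entryVpn(vpnBits - 1, 9) === io.refillVpn(vpnBits - 1, 9),
        entryVpn === io.refillVpn)

  val matchVec = VecInit((valids zip vpns).map { case (v, p) => v && matches(p) })
  val nMatch   = PopCount(matchVec)

  // One match: wake up just that MSHRid. Several matches: replay everything.
  io.hintValid := io.refillValid && nMatch === 1.U
  io.hintId    := OHToUInt(matchVec)
  io.replayAll := io.refillValid && nMatch > 1.U
}
```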

Overall Block Diagram

The overall block diagram of the Repeater is shown in the figure below. It includes three itlbRepeaters and one dtlbRepeater, which add pipeline stages between the L1 TLB and the L2 TLB. Adjacent Repeater stages interact via valid-ready handshaking. The Repeater receives PTW requests from the ITLB and the DTLB above it. Both the ITLB and the DTLB are non-blocking, so these Repeaters are also non-blocking. The Repeater sends the L1 TLB's PTW requests downwards to the L2 TLB. The dtlbRepeater and itlbRepeater1 are Filter modules that can merge duplicate requests from the L1 TLBs.

Other than itlbRepeater1, the remaining two stages of itlbRepeater are essentially just simple pipeline stages. The number of added pipeline stages depends on the physical distance. In the XiangShan Kunming Lake architecture, the L2 TLB is located in the Memblock and is physically distant from the Frontend module where the ITLB is located. Therefore, two stages of repeaters were added in the Frontend and one stage of repeater in the Memblock. The DTLB, however, is located in the Memblock and is closer to the L2 TLB, requiring only one stage of Repeater to meet timing requirements.

Repeater Module Overall Block Diagram

Interface List

See Interface List Document.

Interface Timing

Interface Timing of Repeater1 and L1 TLB

See Interface Timing of TLB and tlbRepeater.

Interface Timing of itlbRepeater3 and dtlbRepeater1 with L2 TLB

The interface timing of itlbRepeater3 and dtlbRepeater1 with the L2 TLB is shown in the figure below. Handshaking between them uses valid-ready signals. The Repeater sends the L1 TLB's PTW request and the requested virtual address to the L2 TLB; after the L2 TLB obtains the lookup result, it returns the physical address and the corresponding page table information to the Repeater.
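Expressed as a hypothetical Chisel bundle from the Repeater's point of view, this interface might look as follows; the field names and widths are assumptions, not the actual XiangShan definitions.

```scala
import chisel3._
import chisel3.util._

// Sketch of the valid-ready handshake described above: the Repeater drives a
// request carrying the vpn, and the L2 TLB returns a response carrying the
// translation result.
class RepeaterToL2TlbSketch extends Bundle {
  val req = Decoupled(new Bundle {
    val vpn = UInt(27.W)        // virtual page number of the PTW request
  })
  val resp = Flipped(Decoupled(new Bundle {
    val vpn   = UInt(27.W)      // vpn this response corresponds to
    val ppn   = UInt(24.W)      // physical page number from the L2 TLB lookup
    val perm  = UInt(8.W)       // page table permission bits
    val level = UInt(2.W)       // page table level (1GB / 2MB / 4KB)
    val pf    = Bool()          // whether an exception (page fault) occurred
  }))
}
```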

Interface Timing of itlbRepeater3 and dtlbRepeater1 with L2 TLB

Interface Timing between Multiple itlbRepeater Stages

The interface timing between multiple itlbRepeater stages is shown in the figure below. Handshaking between adjacent Repeater stages uses valid-ready signals.

Interface Timing between Multiple itlbRepeater Stages