\newpage

Load-Load Violation Check After Read: LoadQueueRAR

Functional Description

Load-to-load violations can occur in multi-core environments. In a single-core environment, out-of-order execution of loads to the same address is usually harmless. However, if another core stores to that address between two same-address loads on the current core, and those two loads execute out of order, the load that is younger in program order (but executed first) may miss the store's result while the older load (executed later) observes it, producing an ordering error.

A characteristic of load-load violations in a multi-core environment is that the current core's DCache is guaranteed to receive a Probe request from the L2 cache, forcing the DCache to release its copy of the data. At that point, the DCache notifies the load queue to mark same-address entries that have already completed their memory access as "released". A subsequent load instruction issued to the pipeline queries the load queue for same-address loads younger than itself; if any carries the "released" flag, a load-load violation has occurred.

The LoadQueueRAR stores information about completed load instructions for load-to-load violation detection. When a load instruction is in the s2 stage of the load pipeline, it queries the freelist, allocates a free entry, and saves its information into the LoadQueueRAR. In the s3 stage of the pipeline, the result of the load-to-load violation check is obtained. If a violation has occurred, the pipeline must be flushed: a redirect request is sent to the RedirectGenerator component, and all instructions after the violating load are flushed.

The LoadQueueRAR needs to mark the following information:

  • Allocated: Indicates whether the entry is valid.
  • Uop: MicroOp related information.
  • Paddr: The compressed physical address of the instruction entering the LoadQueueRAR, a total of 16 bits.
  • Released: Indicates whether the cacheline accessed by this instruction has been released. In a multi-core environment, the L1 cache receives probe requests from the L2 cache. It should be noted that if the instruction is an NC (Non-Cacheable) instruction, it will be marked as released upon enqueuing.
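The four fields above can be sketched as a simple record. This is an illustrative model only, not the actual Chisel entry definition; field names follow the documentation, and widths are indicative.

```python
from dataclasses import dataclass

@dataclass
class RarEntry:
    """Hypothetical model of one LoadQueueRAR entry."""
    allocated: bool = False  # entry holds a valid, completed load
    uop: object = None       # MicroOp bookkeeping (age, ROB index, etc.)
    paddr: int = 0           # compressed physical address, 16 bits
    released: bool = False   # cacheline was released (or the load is NC)
```

An NC load would be constructed with `released=True` directly at enqueue time, matching the note above.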

Feature 1: Request Enqueue

When a query reaches the s2 stage of the load pipeline, the enqueue condition is checked: if older load instructions have not yet completed and the current instruction has not been flushed, the current load can be enqueued.

The free entry and index are obtained from the freelist.

The enqueue information is saved in the PaddrModule, including the compressed physical address of the query (16 bits) and the index of the allocated entry.
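The enqueue steps above can be sketched as follows. This is a behavioral model under stated assumptions: the names `free_list`, `paddr_module`, and `try_enqueue` are illustrative, not the actual hardware interfaces, and entries are modeled as plain dicts.

```python
def try_enqueue(free_list, paddr_module, entries, paddr16, uop, is_nc,
                has_older_uncompleted, flushed):
    """Hypothetical model of the s2-stage enqueue into the LoadQueueRAR."""
    # Enqueue condition: older loads still outstanding, and not flushed.
    if not has_older_uncompleted or flushed:
        return None
    if not free_list:
        return None                      # no free entry available
    idx = free_list.pop(0)               # index handed out by the FreeList
    paddr_module[idx] = paddr16          # PaddrModule stores the 16-bit paddr
    entries[idx] = {"allocated": True, "uop": uop,
                    "paddr": paddr16, "released": is_nc}  # NC => released
    return idx
```

A load with no older outstanding loads simply skips the queue, since no younger load could later be observed out of order ahead of it.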

Feature 2: Load-to-Load Violation Check

When a load reaches the s2 stage of the pipeline, it checks the RAR queue for load instructions with the same physical address that are younger than the current instruction. If these loads have already received data and have been marked as released, it means a load-load violation has occurred, and all instructions after the current violating load need to be flushed.

This process takes two cycles:

  • Cycle 1 performs the condition matching to generate a mask.
  • Cycle 2 generates the response signal indicating whether a violation occurred.
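The two cycles above can be sketched as two functions over the modeled entries. This is an illustrative sketch: `is_younger` stands in for the hardware age comparison (in the real design, queue/ROB indices are compared), and entries are plain dicts.

```python
def check_cycle1(entries, query_paddr16, query_uop, is_younger):
    """Cycle 1: build the mask of allocated, same-paddr, younger entries."""
    return [idx for idx, e in entries.items()
            if e["allocated"]
            and e["paddr"] == query_paddr16
            and is_younger(e["uop"], query_uop)]

def check_cycle2(entries, match_mask):
    """Cycle 2: a violation exists if any matched entry was already released."""
    return any(entries[idx]["released"] for idx in match_mask)
```

Splitting the match and the reduction across two cycles mirrors the pipelined response described above: the mask is formed in the query cycle, and the valid response follows one cycle later.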

Feature 3: Release Conditions

There are four situations where a load instruction in the LoadQueueRAR is marked as released:

  • The missQueue module's replace_req initiates a release in the s3 stage of the mainpipe to free a DCache block; the release signal enters the loadQueue in the next cycle.
  • The probeQueue module's probe_req initiates a release in the s3 stage of the mainpipe to free a DCache block; the release signal enters the loadQueue in the next cycle.
  • A request from the atomicsUnit module that misses in the s3 stage of the mainpipe requires releasing a DCache block; the release signal enters the loadQueue in the next cycle.
  • If an enqueued request is NC, it is marked as released upon enqueuing.
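The first three situations all reduce to the same update one cycle after the mainpipe s3 release: every entry matching the released address gets its flag set. A minimal sketch, assuming dict-modeled entries and matching at the granularity of the 16-bit compressed paddr:

```python
def mark_released(entries, release_paddr16):
    """Hypothetical model: mark all allocated same-address entries released."""
    for e in entries.values():
        if e["allocated"] and e["paddr"] == release_paddr16:
            e["released"] = True
```

The fourth situation (NC requests) needs no separate pass, since such entries are created with the released flag already set at enqueue.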

Overall Block Diagram

LoadQueueRAR Overall Block Diagram

\newpage

Interface Timing Diagrams

LoadQueueRAR Request Enqueue Timing Example

LoadQueueRAR Request Enqueue Timing

When both io_query_*_req_valid and io_query_*_req_ready are high, the handshake succeeds. When both needEnqueue and io_canAllocate_* are high, io_doAllocate_* is asserted, indicating that the query needs to be enqueued and the FreeList can allocate an entry. io_allocateSlot_* indicates the entry into which the query is enqueued, and the information written to that entry arrives via io_w*.

Load-to-Load Violation Check Timing Example

Load-to-Load Violation Check Timing

When both io_query_*_req_valid and io_query_*_req_ready are high, it indicates a successful handshake, and the LoadQueueRAR receives the ld-ld violation query request. The mask result is obtained in the current cycle, and io_query_*_resp_valid is asserted high in the next cycle to provide the response.

In cycle 3 of the diagram, the first violation query request is received, and the response to the violation query request is obtained in cycle 4. The request information is io_query_*_req_bits_, and the response information is io_query_*_resp_bits_. When both io_query_*_resp_valid and io_query_*_resp_bits_rep_frm_fetch are high, it indicates that an ld-ld violation has occurred, and all instructions after the current violating load are flushed.