Atomic Instruction Execution Unit AtomicsUnit
Function Description
The AtomicsUnit executes atomic instructions, covering the A extension (LR/SC and AMO instructions) and the Zacas extension (AMOCAS.W, AMOCAS.D, and AMOCAS.Q). Under the default PMA configuration, the DDR address space supports all AMO and AMOCAS instructions.
The basic execution flow of atomic instructions is as follows:
- sta dispatch: AtomicsUnit shares the dispatch port with StoreUnit, listening for sta uops from the Reservation Station.
- std dispatch: Atomic instructions share the StdExeUnit execution unit with store instructions. The execution results from StdExeUnit are sent to AtomicsUnit. AtomicsUnit is responsible for collecting all data required for atomic instruction execution.
- Address translation: AtomicsUnit shares the DTLB port with LoadUnit_0 for address translation, and also needs to perform physical address checks such as PMA / PMP.
- Clear SBuffer: Currently, every atomic instruction is executed as if its aq/rl bits were set, so the SBuffer must be cleared before execution.
- Access DCache: Send an atomic operation request to the DCache. After the DCache completes, it returns the result to AtomicsUnit.
- Write back: AtomicsUnit writes the execution result back to the register file.
Overall Block Diagram
The finite state machine of the AtomicsUnit is shown in the figure:
- s_invalid: The AtomicsUnit is idle. Upon receiving an sta uop dispatched from the Reservation Station, it enters the s_tlb_and_flush_sb_req state.
- s_tlb_and_flush_sb_req: Access the TLB for address translation and, in parallel, request that the SBuffer be cleared. If the TLB misses, keep retrying until it hits. After a TLB hit, if a debug trigger fires or the address is misaligned, go directly to the s_finish state to write back to the backend. Otherwise, enter the s_pm state for physical address permission checks and further exception checks. During TLB access:
  - For LR instructions, read permission is required.
  - For SC and all other atomic instructions, write permission is required.
- s_pm: Perform physical address permission checks and exception handling. If any of the following exceptions occurs, enter the s_finish state to write back to the backend:
  - If the TLB access for an LR instruction returned an exception, report the corresponding LoadPageFault / LoadAccessFault / LoadGuestPageFault exception.
  - If the TLB access for an atomic instruction other than LR returned an exception, report the corresponding StorePageFault / StoreAccessFault / StoreGuestPageFault exception.
  - If the PBMT attribute is PMA and the PMA attribute is MMIO, report LoadAccessFault for an LR instruction and StoreAccessFault otherwise.
  - If the PBMT attribute is IO or NC, report LoadAccessFault for an LR instruction and StoreAccessFault otherwise.
  - If the PMP attribute is MMIO, or the read/write permission check returns an exception, report LoadAccessFault for an LR instruction and StoreAccessFault otherwise.
  - If none of the above exceptions occurs, start clearing the SBuffer. If the SBuffer is not empty, enter the s_wait_flush_sbuffer_resp state to wait for it to be cleared. If the SBuffer is already empty, enter the s_cache_req state to access the DCache.
- s_wait_flush_sbuffer_resp: Wait for the SBuffer to be cleared, then enter the s_cache_req state to access the DCache.
- s_cache_req: Once all std uops have been collected, send an access request to the DCache. After a successful handshake, enter the s_cache_resp state to wait for the DCache response.
  - AMOCAS instructions receive multiple std uops from the backend, so in the s_cache_req state the AtomicsUnit must wait until all of them have arrived before sending the request to the DCache.
- s_cache_resp: Wait for the DCache to process the atomic operation and return the result.
  - If the DCache cannot handle the request for now and asks the AtomicsUnit to resend it, return to the s_cache_req state and resend the request.
  - Otherwise, no resend is required, and the AtomicsUnit enters the s_cache_resp_latch state.
- s_cache_resp_latch: Shift and sign/zero-extend the data returned from the DCache. This state adds one cycle of latency for timing reasons. In the next cycle, enter the s_finish state.
  - If the DCache returns an error, record the corresponding LoadAccessFault / StoreAccessFault.
- s_finish: Write back the execution result of the atomic instruction.
  - For LR or AMO instructions, write back the old value read from memory.
  - For SC instructions, write back the success status: 0 if the SC succeeded, 1 if it failed.
  - Next state for AMOCAS.Q instructions: a total of 16B of data must be written back. An AMOCAS.Q instruction is dispatched as 2 sta uops (see Uop Splitting of Atomic Instructions), so it is also written back in 2 cycles, with the pdest of each write-back taken from the corresponding dispatched sta uop. The 2 sta uops have no fixed dispatch order, but the write-backs must happen in order. The first write-back, performed in the s_finish state, therefore has to wait until the first sta uop has been received, so that its pdest is correct. After the first write-back handshake succeeds, enter the s_finish2 state for the second write-back.
  - Next state for all other instructions: after a successful write-back handshake, enter the s_invalid state, and the state machine finishes.
- s_finish2: For AMOCAS.Q instructions, the AtomicsUnit performs a second write-back that returns the high 8B of the 16B result. This write-back can only happen once the second sta uop has been received. After a successful write-back handshake, enter the s_invalid state, and the state machine finishes.
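The state transitions listed above can be summarized as a small behavioural model. This is a minimal sketch in Python, not the RTL: the function `next_state` and all of its boolean inputs (`sta_valid`, `early_exception`, `dcache_replay`, and so on) are hypothetical names chosen only to mirror the conditions in the text, and exception bookkeeping is left out.

```python
from enum import Enum, auto


class State(Enum):
    S_INVALID = auto()
    S_TLB_AND_FLUSH_SB_REQ = auto()
    S_PM = auto()
    S_WAIT_FLUSH_SBUFFER_RESP = auto()
    S_CACHE_REQ = auto()
    S_CACHE_RESP = auto()
    S_CACHE_RESP_LATCH = auto()
    S_FINISH = auto()
    S_FINISH2 = auto()


def next_state(state, *, sta_valid=False, tlb_hit=False, early_exception=False,
               pm_exception=False, sbuffer_empty=False, all_std_received=False,
               dcache_req_fire=False, dcache_resp_valid=False,
               dcache_replay=False, writeback_fire=False, is_amocas_q=False):
    """Return the next FSM state for one cycle (all input names are assumptions)."""
    S = State
    if state is S.S_INVALID:
        return S.S_TLB_AND_FLUSH_SB_REQ if sta_valid else state
    if state is S.S_TLB_AND_FLUSH_SB_REQ:
        if not tlb_hit:
            return state  # keep retrying the TLB until it hits
        # early_exception: debug trigger hit or misaligned address
        return S.S_FINISH if early_exception else S.S_PM
    if state is S.S_PM:
        if pm_exception:
            return S.S_FINISH
        return S.S_CACHE_REQ if sbuffer_empty else S.S_WAIT_FLUSH_SBUFFER_RESP
    if state is S.S_WAIT_FLUSH_SBUFFER_RESP:
        return S.S_CACHE_REQ if sbuffer_empty else state
    if state is S.S_CACHE_REQ:
        # The request is only sent once every std uop has been collected.
        fired = all_std_received and dcache_req_fire
        return S.S_CACHE_RESP if fired else state
    if state is S.S_CACHE_RESP:
        if not dcache_resp_valid:
            return state
        return S.S_CACHE_REQ if dcache_replay else S.S_CACHE_RESP_LATCH
    if state is S.S_CACHE_RESP_LATCH:
        return S.S_FINISH  # one extra cycle for shift and extension
    if state is S.S_FINISH:
        # writeback_fire is assumed to already include "matching sta uop received".
        if not writeback_fire:
            return state
        return S.S_FINISH2 if is_amocas_q else S.S_INVALID
    if state is S.S_FINISH2:
        return S.S_INVALID if writeback_fire else state
    raise ValueError(f"unknown state: {state}")
```

For example, `next_state(State.S_PM, sbuffer_empty=False)` yields `State.S_WAIT_FLUSH_SBUFFER_RESP`, matching the s_pm description.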
Zacas Extension
- AMOCAS.W instruction loads 4B of data pointed to by rs1 from memory, and compares it with the low 4B of data in rd. If they are equal, the low 4B of rs2 is written to the memory location pointed to by rs1; finally, the old value loaded from memory is written back to the rd register.
- AMOCAS.D instruction loads 8B of data pointed to by rs1 from memory, and compares it with rd. If they are equal, rs2 is written to the memory location pointed to by rs1; finally, the old value loaded from memory is written back to the rd register.
- AMOCAS.Q instruction loads 16B of data pointed to by rs1 from memory, and compares it with the concatenated data of rd and rd+1. If they are equal, the 16B concatenated data of rs2 and rs2+1 is written to the memory location pointed to by rs1; finally, the low 8B of the old value loaded from memory is written back to the rd register, and the high 8B is written back to the rd+1 register.
- For the rs2 and rd register pairs: if the source operand is the x0 register, the register pair reads as all zeros; if the destination register is the x0 register, neither register in the pair is written.
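The compare-and-swap behaviour described above can be illustrated with a small software model. This is only a sketch under stated assumptions: `amocas` and `split_quad` are hypothetical helper names, memory is modelled as a little-endian `bytearray`, and the model does not cover the x0 special cases, atomicity, or any extension of the AMOCAS.W result to the full register width.

```python
MASK64 = (1 << 64) - 1


def amocas(memory, addr, compare_value, swap_value, size):
    """Model AMOCAS.W/D/Q on a bytearray; size is 4, 8, or 16 bytes.

    compare_value stands for rd (or the rd/rd+1 pair), swap_value for rs2
    (or the rs2/rs2+1 pair). Returns the old memory value as an unsigned int.
    """
    if size not in (4, 8, 16):
        raise ValueError("size must be 4, 8, or 16")
    if addr % size != 0:
        raise ValueError("misaligned address")
    mask = (1 << (8 * size)) - 1
    old = int.from_bytes(memory[addr:addr + size], "little")
    # Only the low `size` bytes of the compare and swap operands take part.
    if old == (compare_value & mask):
        memory[addr:addr + size] = (swap_value & mask).to_bytes(size, "little")
    return old


def split_quad(value):
    """Split a 16B AMOCAS.Q result into (low 8B for rd, high 8B for rd+1)."""
    return value & MASK64, (value >> 64) & MASK64
```

A successful AMOCAS.W on a location holding `0x11223344`, with `0x11223344` as the compare value and `0xDEADBEEF` as the swap value, returns `0x11223344` and leaves `0xDEADBEEF` in memory; a second call with a non-matching compare value returns `0xDEADBEEF` and leaves memory unchanged.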
Uop Splitting of Atomic Instructions
Each instruction in the A extension is split into one sta uop and one std uop, with one write-back (the number of write-backs is the same as the number of sta uops; std uops do not require write-back).
AMOCAS instructions differ from A extension instructions in uop splitting, dispatch, and write-back. At dispatch, an AMOCAS instruction must provide not only the data to be written to memory but also the data to compare against, so it is split into multiple std uops and, for AMOCAS.Q, multiple sta uops.
AMOCAS instructions reuse the fuOpType to distinguish multiple std uops or multiple sta uops. The fuOpType has 9 bits. Atomic instructions only use 6 bits, so the high 3 bits are used to mark the uopIdx.
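The bit layout described above (a 9-bit fuOpType with the atomic operation in the low 6 bits and uopIdx in the high 3 bits) can be sketched as follows. The helper names `pack_fu_op_type` and `unpack_fu_op_type` and the example operation value are hypothetical; only the 6 + 3 bit split comes from the text.

```python
FU_OP_TYPE_BITS = 9   # total width of fuOpType
ATOMIC_OP_BITS = 6    # low bits used by the atomic operation encoding
UOP_IDX_BITS = FU_OP_TYPE_BITS - ATOMIC_OP_BITS  # high 3 bits carry uopIdx


def pack_fu_op_type(atomic_op, uop_idx):
    """Place uop_idx in the high 3 bits above a 6-bit atomic operation code."""
    if not 0 <= atomic_op < (1 << ATOMIC_OP_BITS):
        raise ValueError("atomic_op does not fit in 6 bits")
    if not 0 <= uop_idx < (1 << UOP_IDX_BITS):
        raise ValueError("uop_idx does not fit in 3 bits")
    return (uop_idx << ATOMIC_OP_BITS) | atomic_op


def unpack_fu_op_type(fu_op_type):
    """Return (atomic_op, uop_idx) recovered from a 9-bit fuOpType value."""
    atomic_op = fu_op_type & ((1 << ATOMIC_OP_BITS) - 1)
    uop_idx = fu_op_type >> ATOMIC_OP_BITS
    return atomic_op, uop_idx
```

With an arbitrary operation code `0b101010` and uopIdx 3, `pack_fu_op_type` gives `0b011101010`, and `unpack_fu_op_type` recovers both fields.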
The specific uop splitting rules are as follows:
- A extension instructions (including LR / SC and standard AMO instructions): The sta uop and the std uop both have uopIdx 0. They carry the data from rs1 and rs2 respectively, which is stored in the rs1 and rs2_l registers within the AtomicsUnit. The AtomicsUnit performs one write-back, with uopIdx 0, and the write-back pdest equals the pdest of the sta uop.
- AMOCAS.W and AMOCAS.D instructions: The backend dispatches 1 sta uop and 2 std uops:
  - The uopIdx of the sta uop is 0.
  - The uopIdx values of the 2 std uops are 0 and 1. They carry rd (the data for comparison) and rs2 (the data to be stored if the comparison succeeds) respectively, which are written to the rd_l and rs2_l registers within the AtomicsUnit.
  - Finally, one write-back is performed, with uopIdx 0. The write-back pdest equals the pdest of the sta uop.
- AMOCAS.Q instruction: The backend dispatches 2 sta uops and 4 std uops:
  - The uopIdx values of the 2 sta uops are 0 and 2. Their pdest values are denoted pdest1 and pdest2.
  - The uopIdx values of the 4 std uops are 0 to 3. Uops 0 and 2 carry the low and high parts of rd, which are written to the rd_l and rd_h registers; uops 1 and 3 carry the low and high parts of rs2, which are written to the rs2_l and rs2_h registers.
  - Finally, two write-backs are performed, with uopIdx 0 and 2 and pdest pdest1 and pdest2 respectively. The write-back data are the low and high parts of the old value loaded from memory.
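The splitting rules above can be collected into a lookup table, which also makes the "wait for all std uops" condition of the s_cache_req state easy to express. This is an illustrative sketch: the table `UOP_SPLIT`, its keys, and the helpers `std_target_register` and `all_std_received` are hypothetical names, while the uopIdx values and internal register names come from the rules above.

```python
# For each instruction class: sta uopIdx values, std uopIdx -> internal
# AtomicsUnit register, and the uopIdx values used for write-back.
UOP_SPLIT = {
    "a_ext": {          # LR / SC / standard AMO
        "sta": [0],
        "std": {0: "rs2_l"},
        "writeback": [0],
    },
    "amocas_w_d": {     # AMOCAS.W and AMOCAS.D
        "sta": [0],
        "std": {0: "rd_l", 1: "rs2_l"},
        "writeback": [0],
    },
    "amocas_q": {       # AMOCAS.Q
        "sta": [0, 2],
        "std": {0: "rd_l", 1: "rs2_l", 2: "rd_h", 3: "rs2_h"},
        "writeback": [0, 2],
    },
}


def std_target_register(kind, uop_idx):
    """Return the internal register that a std uop with this uopIdx fills."""
    return UOP_SPLIT[kind]["std"][uop_idx]


def all_std_received(kind, received_uop_idxs):
    """True once every std uop of the instruction has arrived (in any order)."""
    return set(UOP_SPLIT[kind]["std"]) <= set(received_uop_idxs)
```

For instance, `all_std_received("amocas_q", {0, 1, 3})` is `False` because the std uop with uopIdx 2 (the high part of rd) is still missing.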
Exception Summary
Exceptions that may occur for atomic instructions include:
- Address Misalignment Exception: The address of an atomic operation must be aligned to its operand size (4B for Word, 8B for Doubleword, 16B for Quadword); otherwise an address misalignment exception is reported.
- Illegal Instruction Exception (checked at the backend decode stage, unrelated to memory access): AMOCAS.Q instruction requires the register numbers of the register pair rs2 and rd to be even. If they are odd, an illegal instruction exception must be reported.
- Breakpoint Exception: If a trigger comparison hits, a breakpoint exception needs to be reported.
- Exceptions related to Address Translation and Permission Checking:
  - If TLB address translation returns an exception, report the corresponding Load or Store PageFault / AccessFault / GuestPageFault exception based on whether it is an LR instruction.
  - If the PMP attribute is MMIO, or if PMP lacks the corresponding read/write permissions, report LoadAccessFault / StoreAccessFault.
  - If the PMA + PBMT attribute is IO or NC (including the following 3 cases), report LoadAccessFault / StoreAccessFault:
    - PBMT = IO
    - PBMT = NC
    - PBMT = PMA and PMA = MMIO
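The alignment and access-fault rules summarized above can be expressed as two small predicates. This is a hedged sketch rather than the actual check logic: the function names, the string encoding of the PBMT attribute ("PMA", "IO", "NC"), and the boolean parameters are assumptions made for illustration, and TLB-reported faults and breakpoint exceptions are not modelled.

```python
def is_misaligned(addr, size):
    """True if addr is not aligned to the operand size (4, 8, or 16 bytes)."""
    if size not in (4, 8, 16):
        raise ValueError("size must be 4, 8, or 16")
    return addr % size != 0


def access_fault(is_lr, pbmt, pma_is_mmio, pmp_is_mmio, pmp_permitted):
    """Return the access-fault name for an atomic access, or None if allowed.

    pbmt is one of "PMA", "IO", "NC" (hypothetical string encoding).
    """
    if pbmt not in ("PMA", "IO", "NC"):
        raise ValueError("unknown PBMT attribute")
    uncacheable = pbmt in ("IO", "NC") or (pbmt == "PMA" and pma_is_mmio)
    pmp_fault = pmp_is_mmio or not pmp_permitted
    if not (uncacheable or pmp_fault):
        return None
    # LR reports load faults; every other atomic instruction reports store faults.
    return "LoadAccessFault" if is_lr else "StoreAccessFault"
```

An AMOCAS.Q to an address that is 8B-aligned but not 16B-aligned is misaligned under `is_misaligned`, and an SC to a page with PBMT = NC yields `"StoreAccessFault"` under `access_fault`.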