Uncache Processing Unit (Uncache)
| Update Time | Code Version | Updater | Note |
| --- | --- | --- | --- |
| 2025.02.26 | eca6983 | Maxpicca-Li | First version completed |
Functional Description
The Uncache unit serves as a bridge between the LSQ and the bus; it mainly handles uncache access requests to the bus and the corresponding responses. Currently, the Uncache unit does not support vector access, unaligned access, or atomic access.
The functions of the Uncache unit are summarized as follows:
- Receive uncache requests from the LSQ, including uncache load requests from LoadQueueUncache and uncache store requests from StoreQueue.
- Select pending uncache requests to send to the bus, wait for and receive bus responses.
- Return completed uncache requests to the LSQ.
- Forward data from registered uncache store requests to the LoadUnit that is currently executing a load.
Structurally, the Uncache Buffer currently has 4 (a configurable number of) entries with per-entry states, plus an overall state `uState`. The specific details of each item are as follows.
The structure of an Uncache Entry is as follows:
- `cmd`: Identifies whether the request is a load or a store. In the current version, 0 is load and 1 is store.
- `addr`: Physical address of the request.
- `vaddr`: Virtual address of the request. Primarily used for checking whether the virtual and physical addresses match during forwarding.
- `data`: Data to be written for a store, or data read back for a load. Currently only data accesses of up to 64 bits are supported.
- `mask`: Access mask of the request. Each byte uses one bit to indicate whether data is present, for a total of 8 bits.
- `nc`: Indicates whether the request is an NC access.
- `atomic`: Indicates whether the request is an atomic access.
- `memBackTypeMM`: Indicates whether the address accessed by the request has a PMA type of main memory but a PBMT type of NC. Primarily used for the L2 Cache NC-related logic.
- `resp_nderr`: Indication from the bus to Uncache of whether the request could be handled.
The structure of an Uncache State is as follows:
- `valid`: Indicates whether the entry is valid.
- `inflight`: 1 indicates that the request of this entry has been sent to the bus.
- `waitSame`: 1 indicates that another request in the buffer, overlapping with the data block accessed by this request, has already been sent to the bus.
- `waitReturn`: 1 indicates that the request of this entry has received its bus response and is waiting to be written back to the LSQ.
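For illustration only, the two structures above can be modeled in plain Scala roughly as follows (a behavioral sketch, not the actual Chisel RTL; field widths and types are assumptions based on the descriptions above):

```scala
// Behavioral model of one Uncache Buffer slot. Widths are illustrative:
// data/mask model a 64-bit (8-byte) access as described above.
case class UncacheEntry(
  cmd: Boolean,            // false = load, true = store (0/1 in the RTL)
  addr: BigInt,            // physical address of the request
  vaddr: BigInt,           // virtual address, used for forwarding checks
  data: BigInt,            // store data to write, or load data read back
  mask: Int,               // 8-bit byte mask of the access
  nc: Boolean,             // NC access
  atomic: Boolean,         // atomic access
  memBackTypeMM: Boolean,  // PMA type is main memory but PBMT type is NC
  respNderr: Boolean       // bus reported that it could not handle the request
)

// Per-entry state flags of the Uncache Buffer.
case class UncacheState(
  valid: Boolean      = false, // entry holds a request
  inflight: Boolean   = false, // request has been sent to the bus
  waitSame: Boolean   = false, // an overlapping request is already on the bus
  waitReturn: Boolean = false  // bus response received, waiting to return to the LSQ
)
```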
The `uState` of the Uncache unit represents the state of a request entry when outstanding is not considered:
- `s_idle`: Default state.
- `s_inflight`: A request has been sent to the bus, but no response has been received yet.
- `s_wait_return`: A response has been received but has not yet been returned to the LSQ.
The state transitions are as follows:
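Since the transition diagram itself cannot be reproduced here, the following plain-Scala sketch summarizes the transitions implied by the three states above (event names such as `sentToBus` are illustrative, not RTL signal names):

```scala
// The three uState states described above.
sealed trait UState
case object SIdle       extends UState // default state
case object SInflight   extends UState // request on the bus, no response yet
case object SWaitReturn extends UState // response received, not yet returned to the LSQ

// One step of the uState machine when outstanding is ignored.
def nextUState(cur: UState,
               sentToBus: Boolean,     // a request was issued to the bus this cycle
               respReceived: Boolean,  // the bus response arrived this cycle
               returnedToLsq: Boolean  // the result was written back to the LSQ this cycle
              ): UState = cur match {
  case SIdle       if sentToBus     => SInflight
  case SInflight   if respReceived  => SWaitReturn
  case SWaitReturn if returnedToLsq => SIdle
  case other                        => other
}
```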
Feature 1: Enqueue Logic
(1) In each cycle, at most one request from the LSQ is processed. The logic first checks whether the request can enter the buffer and, if so, whether it should merge with an old entry or be allocated a new one. The possible enqueue behaviors for the request are:
- Allocate a new entry and mark it valid:
  - there is no entry with the same block address.
- Allocate a new entry and mark it valid and waitSame:
  - there is an entry with the same block address that satisfies the primary merge condition but not the secondary merge condition.
- Merge into an old entry:
  - there is an entry with the same block address that satisfies both the primary and the secondary merge condition.
- Reject:
  - the UBuffer is full, or
  - there is an entry with the same block address that does not satisfy the primary merge condition.
Here, block address refers to the starting address of each 8-byte block. The primary merge condition means that the incoming item and the old item are both NC accesses with the same attributes, that the mask after merging with the old item is contiguous and naturally aligned, and that the old item has neither been selected for bus access in the current cycle nor already completed its bus access. The secondary merge condition means that the old entry is valid, has not yet been sent to the bus, and has not been selected for bus access in the current cycle (if it has been selected or is already on the bus, the bus request can no longer be changed, so a new entry must be allocated, which waits for the old entry to receive its bus response before this request is sent).
In addition, allocating a new entry sets every field of that entry, while merging into an old entry updates its mask, data, addr, and so on. When updating addr, natural alignment must be maintained.
Since bus accesses are not necessarily ordered, especially when outstanding is enabled, multiple uncache requests to the same address could otherwise be in flight on the bus at the same time. To guarantee the access order of a data block, requests to the same address must therefore not appear on the bus concurrently; this is why a new item can only be merged into an old item when it satisfies both the primary and the secondary merge condition (see the sketch below).
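The enqueue decision described above can be summarized by the following plain-Scala sketch (a behavioral illustration; helpers such as `primaryMergeOk` and `secondaryMergeOk` stand in for the real condition checks in the RTL):

```scala
// Possible outcomes of the enqueue decision for one incoming LSQ request.
sealed trait EnqueueAction
case object AllocNew         extends EnqueueAction // no entry with the same block address
case object AllocNewWaitSame extends EnqueueAction // primary condition holds, secondary does not
case object MergeIntoOld     extends EnqueueAction // both merge conditions hold
case object Reject           extends EnqueueAction // buffer full, or primary condition fails

// sameBlockIdx: index of an existing entry with the same block address, if any.
// primaryMergeOk / secondaryMergeOk: the merge conditions described above.
def enqueueAction(bufferFull: Boolean,
                  sameBlockIdx: Option[Int],
                  primaryMergeOk: Int => Boolean,
                  secondaryMergeOk: Int => Boolean): EnqueueAction = sameBlockIdx match {
  case None                           => if (bufferFull) Reject else AllocNew
  case Some(i) if !primaryMergeOk(i)  => Reject
  case Some(i) if secondaryMergeOk(i) => MergeIntoOld
  case Some(_)                        => if (bufferFull) Reject else AllocNewWaitSame
}
```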
(2) In the next cycle, the ID of the allocated Uncache Buffer entry is returned. This ID is kept by LoadQueueUncache or StoreQueue and used to associate the uncache response later returned by Uncache with the corresponding queue entries. Because the Uncache Buffer can merge requests, a single response it returns may correspond to multiple entries in LoadQueueUncache.
Feature 2: Dequeue Logic
From the entries that have completed bus access in the current cycle (i.e., whose state has both `valid` and `waitReturn` set), one entry is selected and returned to the LSQ, and all state flags of that entry are cleared.
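A minimal sketch of this selection, assuming a simple lowest-index priority (the actual selection policy may differ):

```scala
// Pick the first entry whose state has both valid and waitReturn set; the chosen
// entry is returned to the LSQ and all of its state flags are then cleared.
def selectDequeue(valid: Seq[Boolean], waitReturn: Seq[Boolean]): Option[Int] =
  valid.indices.find(i => valid(i) && waitReturn(i))
```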
Feature 3: Bus Interaction and Outstanding Logic
Bus interaction and outstanding logic are divided into the following two parts:
(1) Initiating Requests
When outstanding is not enabled, a request can only be sent to the bus while `uState` is `s_idle`. One currently sendable request is selected from the entries, i.e., an entry whose state bits have only `valid` set, and it is sent to the bus. When outstanding is enabled, a request entry can be selected and sent to the bus regardless of `uState`. The `source` field of the bus request is the ID of the request entry.
When a request is sent to the bus, the entries are iterated over, and the `waitSame` bit of every other entry with the same block address is set.
(2) Receiving Replies
When a bus reply is received, the corresponding buffer entry is identified from the `source` field; its data is updated and its `waitReturn` bit is set.
Additionally, the entries are iterated over, and the `waitSame` bit of every entry with the same block address is cleared.
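Both directions of the bus interaction can be sketched as follows (a behavioral illustration in plain Scala; the block-address comparison, the outstanding switch, and the helper names are simplified assumptions):

```scala
// Select an entry that may be sent to the bus this cycle: only its valid bit is set.
// Without outstanding this is further gated by uState being s_idle. The index of the
// chosen entry is used as the source field of the bus request.
def selectSendable(valid: Seq[Boolean], inflight: Seq[Boolean],
                   waitSame: Seq[Boolean], waitReturn: Seq[Boolean],
                   outstandingEnabled: Boolean, uStateIsIdle: Boolean): Option[Int] =
  if (!outstandingEnabled && !uStateIsIdle) None
  else valid.indices.find(i => valid(i) && !inflight(i) && !waitSame(i) && !waitReturn(i))

// When the selected request is issued, mark it inflight and set waitSame on every
// other entry with the same block address.
def onBusSend(sel: Int, blockAddr: Array[BigInt],
              inflight: Array[Boolean], waitSame: Array[Boolean]): Unit = {
  inflight(sel) = true
  for (i <- waitSame.indices if i != sel && blockAddr(i) == blockAddr(sel))
    waitSame(i) = true
}

// Handle a bus reply: source selects the buffer entry, its data is updated and
// waitReturn is set; waitSame is cleared on entries sharing the same block address.
def onBusReply(source: Int, respData: BigInt, isLoad: Boolean,
               blockAddr: Array[BigInt], data: Array[BigInt],
               waitSame: Array[Boolean], waitReturn: Array[Boolean]): Unit = {
  if (isLoad) data(source) = respData
  waitReturn(source) = true
  for (i <- waitSame.indices if i != source && blockAddr(i) == blockAddr(source))
    waitSame(i) = false
}
```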
Feature 4: Forwarding Logic
In principle, forwarding logic mainly serves NC accesses. When outstanding is enabled, once an uncache NC store from the StoreQueue has successfully been written into the Uncache Buffer, the StoreQueue dequeues that item and no longer tracks it. From that point on, the Uncache Buffer is responsible for forwarding the data of this store. Because of the merging feature in the Uncache Buffer's enqueue logic, at most two entries for the same address can exist in the Uncache Buffer at any given time. If two entries exist, one must be `inflight` and the other must be `waitSame`; because the StoreQueue dequeues in order, the former holds the older data and the latter holds the newer data.
In actual processing, when an uncache NC load sends a forwarding request to the Uncache Buffer, Uncache compares the block addresses of the existing entries and may find a matching entry. The matching entry may be one that has already been sent to the bus or one that has not yet been sent; the former holds older data, while the latter holds newer data and therefore has higher priority. In the first cycle (`f0`), virtual block address matching is performed so that `forwardMaskFast` can be returned in the same cycle. In the second cycle (`f1`), physical block address matching and data merging are performed, and the result is returned.
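A byte-granularity sketch of the data merge performed in `f1`, assuming 8-byte blocks (an illustration; the names and exact priority handling are assumptions consistent with the description above):

```scala
// Merge forwarded store data for one 8-byte block: bytes covered by the newer
// (not-yet-sent) entry take priority over bytes from the older (inflight) entry.
// Returns the merged data and the combined forward mask.
def mergeForwardData(oldData: BigInt, oldMask: Int,
                     newData: BigInt, newMask: Int): (BigInt, Int) = {
  var merged = BigInt(0)
  for (byte <- 0 until 8) {
    val src =
      if (((newMask >> byte) & 1) == 1) newData       // newer data wins
      else if (((oldMask >> byte) & 1) == 1) oldData  // otherwise older data
      else BigInt(0)                                  // byte not forwarded
    merged |= ((src >> (8 * byte)) & 0xff) << (8 * byte)
  }
  (merged, oldMask | newMask)
}
```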
Feature 5: Flush Logic
Flush means that the Uncache Buffer must complete the bus access of all existing entries and return them to the LSQ before accepting any new entries. The Uncache Buffer is flushed when a fence, an atomic operation, or a CMO occurs, or when a virtual-physical address mismatch is detected during forwarding. At that point `do_uarch_drain` is set and new entries are no longer accepted; once all entries have finished their tasks, `do_uarch_drain` is cleared and new entries are accepted normally.
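A minimal sketch of the `do_uarch_drain` behavior, under the assumption that the trigger and completion conditions are exactly those described above:

```scala
// do_uarch_drain: set on a flush trigger (fence/atomic/CMO, or a virtual-physical
// address mismatch during forwarding); cleared only once no valid entry remains.
def nextDrain(drain: Boolean, flushTrigger: Boolean, anyEntryValid: Boolean): Boolean =
  if (flushTrigger) true
  else if (drain && !anyEntryValid) false
  else drain
```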
Overall Block Diagram
Interface Timing
LSQ Interface Timing Example
The following figure shows a detailed interface example with 4 uncache accesses. Before cycle 5, m1, m2, and m3 are received in turn, and `idResp` is returned in the cycle following each request. In cycle 6, Uncache is full and m4 is stalled. In cycle 9+n, all accesses of s1 are completed and written back, releasing one entry; therefore, in cycle 10+n, `io_lsq_req_ready` is asserted high and m4 is accepted. In the following cycles, the remaining uncache accesses are written back one after another.
{
signal: [
{name: 'clk', wave: 'p......|.......'},
{name: 'io_lsq_req_valid', wave: '0101...|..0....'},
{name: 'io_lsq_req_ready', wave: '1....0.|.1.....'},
{name: 'io_lsq_req_bits_id', wave: 'x3x456.|..x....', data:['m1','m2','m3','m4']},
{name: 'io_lsq_idResp_valid', wave: '0.101.0|..10...'},
{name: 'io_lsq_idResp_bits_mid', wave: 'x.3x45x|..6x...', data: ['m1', 'm2', 'm3', 'm4']},
{name: 'io_lsq_idResp_bits_sid', wave: 'x.3x45x|..5x...', data: ['s1', 's2', 's3', 's4']},
{name: 'io_lsq_resp_valid', wave: '0......|10.1010'},
{name: 'io_lsq_resp_bits_id', wave: 'x......|3x.4x5x', data: ['s1', 's2', 's3']},
],
config: { hscale: 1 },
head: {
text:'LSQ <=> Uncache',
tick:1,
every:1
},
}
Bus Interface Timing Examples
(1) Without outstanding, only one uncache request can be on the bus at a time (controlled by `uState`); the next uncache request can only be initiated after the response of the previous one has been received on the d channel.
{
signal: [
{name: 'clk', wave: 'p..|.....|...'},
{name: 'auto_client_out_a_ready', wave: '1..|.....|...'},
{name: 'auto_client_out_a_valid', wave: '010|...10|...'},
{name: 'auto_client_out_a_bits_source', wave: 'x3x|...4x|...', data: ['s1','s2']},
{name: 'auto_client_out_d_valid', wave: '0..|10...|10.'},
{name: 'auto_client_out_d_bits_source', wave: 'x..|3x...|3x.', data: ['s1', 's2']},
],
config: { hscale: 1 },
head: {
text:'Uncache <=> Bus',
tick:1,
every:1
},
}
(2) With outstanding, multiple uncache accesses can be outstanding on the bus at the same time (limited only by `auto_client_out_a_ready`). As shown in the figure below, two requests are sent back-to-back in cycles 2 and 3, and their results are received in cycles 6+n and 8+n.
{
signal: [
{name: 'clk', wave: 'p..|......'},
{name: 'auto_client_out_a_ready', wave: '1..|......'},
{name: 'auto_client_out_a_valid', wave: '01.0|.....'},
{name: 'auto_client_out_a_bits_source', wave: 'x34x|.....', data: ['s1','s2']},
{name: 'auto_client_out_d_valid', wave: '0...|1010.'},
{name: 'auto_client_out_d_bits_source', wave: 'x...|3x4x.', data: ['s1', 's2']},
],
config: { hscale: 1 },
head: {
text:'Uncache <=> Bus when outstanding',
tick:1,
every:1
},
}