\newpage

Store Submission Buffer (SBuffer)

Functional Description

Each entry in the SBuffer holds one cacheline. A cacheline is 64 bytes, organized as 4 vwords of 16 bytes each.

Each byte has one corresponding mask bit that indicates whether the byte holds valid data.
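The layout above can be illustrated with a small address-splitting sketch. It assumes only the 64-byte cacheline and 16-byte vword sizes stated in the text; the function names `split_address` and `byte_mask_for_store` are hypothetical and chosen for illustration, not taken from the design.

```python
CACHELINE_BYTES = 64   # from the text: one cacheline per entry
VWORD_BYTES = 16       # from the text: 4 vwords per cacheline
OFFSET_BITS = CACHELINE_BYTES.bit_length() - 1  # 6 offset bits for 64 bytes


def split_address(addr: int):
    """Return (tag, vword index, byte offset inside the vword)."""
    tag = addr >> OFFSET_BITS                    # everything above the line offset
    line_offset = addr & (CACHELINE_BYTES - 1)   # byte position inside the line
    return tag, line_offset // VWORD_BYTES, line_offset % VWORD_BYTES


def byte_mask_for_store(addr: int, size: int) -> int:
    """64-bit mask with one bit set for each byte the store touches."""
    line_offset = addr & (CACHELINE_BYTES - 1)
    # Assumption for this sketch: a store never crosses a cacheline boundary.
    assert line_offset + size <= CACHELINE_BYTES
    return ((1 << size) - 1) << line_offset
```

For instance, a 4-byte store to address 0x1234 lands in vword 3 at byte 4 of that vword, and sets four consecutive mask bits starting at line offset 52.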

The meta information includes ptag, vtag, state, cohCount, and missqReplayCount. Their specific functions are:

- ptag: Physical address tag, that is, the physical address with the cacheline offset removed.
- vtag: Virtual address tag, that is, the virtual address with the cacheline offset removed.
- state: The current status of the entry, made up of the following flags:
  - state_valid: The entry is valid.
  - state_inflight: The entry has sent a write request to dcache and either has not yet received a response, or received a response indicating a miss.
  - w_timeout: The request sent to dcache missed, and the entry is waiting to be resent.
  - w_sameblock_inflight: This entry has just been allocated while other entries with the same cache block address are already inflight. It must wait until those entries have completed their writeback to dcache.
- cohCount: Timeout counter. It counts up from 0; when it reaches 1M, the entry is written to dcache.
- missqReplayCount: Replay counter for an entry whose previous request to dcache missed. It counts up from 0; when it reaches 16, the entry is resent to dcache.
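As a rough illustration of how the two counters might drive an entry, the sketch below models the meta fields as a Python dataclass. The field names follow the text; the `tick` method, the action strings, and the reading of "1M" as 2^20 are assumptions made for this example, and the exact conditions under which each counter advances are not specified in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EntryMeta:
    ptag: int = 0
    vtag: int = 0
    state_valid: bool = False
    state_inflight: bool = False
    w_timeout: bool = False
    w_sameblock_inflight: bool = False
    coh_count: int = 0
    missq_replay_count: int = 0
    coh_limit: int = 1 << 20      # assumption: "1M" interpreted as 2**20
    missq_replay_limit: int = 16  # from the text

    def tick(self) -> Optional[str]:
        """Advance one cycle; return an action name when a counter expires."""
        if self.w_timeout:
            # The previous request missed: wait before resending.
            self.missq_replay_count += 1
            if self.missq_replay_count >= self.missq_replay_limit:
                self.missq_replay_count = 0
                return "resend"
            return None
        if self.state_valid and not self.state_inflight:
            # Assumed: a valid entry that is not inflight ages toward writeback.
            self.coh_count += 1
            if self.coh_count >= self.coh_limit:
                self.coh_count = 0
                return "write"
        return None
```

Passing a small `coh_limit` makes the timeout easy to observe in a simulation without stepping through a million cycles.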

Feature 1: SBuffer Enqueue Logic

Up to two requests from the StoreQueue can be processed per cycle. Each request is first checked to determine whether it needs a new entry. If both requests need new entries, two free entries are selected based on parity for allocation. If the two requests have the same ptag, they are allocated to the same free entry.
If an entry for the same cacheline already exists, no new entry is needed and the request is merged into the existing entry. However, if that entry has already been sent to dcache (state_inflight is true), the request cannot be merged and a new entry must be allocated. The newly allocated entry records its dependency on the inflight entry: w_sameblock_inflight is set to true and waitInflightMask is set to the ID of the inflight entry. Recording the dependency ensures that the new entry is written to dcache only after the inflight entry has completed its write, which preserves store order.
When an entry is allocated, its state is set to valid.
If a request is being merged into an entry at the moment that entry is selected for writing to dcache, the write to dcache is blocked until the merge completes.
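The merge-or-allocate decision described above can be sketched for a single request as follows. This is a simplified model under stated assumptions: it handles one request at a time (the two-request, parity-based selection is omitted), it picks the lowest free index, and it represents waitInflightMask as a one-hot bit for the inflight entry. The `Entry` class and `enqueue` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Entry:
    valid: bool = False
    inflight: bool = False
    ptag: int = 0
    w_sameblock_inflight: bool = False
    wait_inflight_mask: int = 0  # assumed one-hot encoding of the awaited entry


def enqueue(entries: List[Entry], ptag: int) -> Tuple[Optional[int], str]:
    """Merge into a matching non-inflight entry, else allocate a free one."""
    same_inflight = None
    for i, e in enumerate(entries):
        if e.valid and e.ptag == ptag:
            if not e.inflight:
                return i, "merge"      # same cacheline, not yet sent to dcache
            same_inflight = i          # same cacheline, but already inflight
    for i, e in enumerate(entries):
        if not e.valid:
            e.valid, e.inflight, e.ptag = True, False, ptag
            if same_inflight is not None:
                # Record the dependency so store order is preserved.
                e.w_sameblock_inflight = True
                e.wait_inflight_mask = 1 << same_inflight
            return i, "allocate"
    return None, "full"                # no free entry: the request must wait
```

In this model, a second store to a cacheline whose entry is already inflight lands in a fresh entry that waits on the first, and later stores to that cacheline merge into the fresh entry.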

Feature 2: SBuffer Dequeue Logic

Entries in the SBuffer are written to dcache under passive and active conditions.
- Passive: The number of entries in the SBuffer reaches a threshold, requiring replacement.
- Active: A flush sbuffer signal from the atomicsUnit or fenceUnit, a tag mismatch during merging or forwarding to a load, or resending a previously missed request.
Dequeuing an entry from the SBuffer takes two cycles. In the first cycle, the entry to be written to dcache is selected and latched. In the second cycle, the write request is sent to dcache.

Feature 3: Writing SBuffer Data

When a request arrives at the SBuffer, it is either allocated a new entry or merged into an existing entry. Writing the data and mask takes two cycles. In the first cycle, the request is latched. In the second cycle, the data is written according to the request's byte mask (determined by the store width: sb, sh, sw, or sd), and the corresponding mask bits, which indicate whether each byte of the cacheline is valid, are set.

For example, a request arrives at the SBuffer in stage S0 and, according to the enqueue logic, can be merged into an existing entry, say entry 2. A one-hot write encoding 16'b0000000000000100 is generated, and from it the write signal for entry 2 is generated and latched into S1. The write address (e.g., an offset of 0 within the cache block), the mask (e.g., sw, writing 4 bytes), and the data are also latched into S1. In S1, based on the latched information, the data write signals for the lower 4 bytes of word 0 of entry 2 are asserted and the corresponding data is written. The mask write signals for those same 4 bytes are also asserted, setting their mask bits to true.
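The two-stage write in this example can be modeled in a few lines. The sketch assumes 16 entries (inferred from the 16-bit one-hot encoding) and a byte mask whose bit i corresponds to byte i of the store data relative to the write offset; the class and method names are hypothetical.

```python
NUM_ENTRIES = 16  # assumption: inferred from the 16-bit one-hot encoding
LINE_BYTES = 64   # from the text


class SBufferData:
    def __init__(self):
        self.data = [bytearray(LINE_BYTES) for _ in range(NUM_ENTRIES)]
        self.mask = [[False] * LINE_BYTES for _ in range(NUM_ENTRIES)]
        self.latched = None  # pipeline register between S0 and S1

    def s0_latch(self, entry_idx: int, offset: int, byte_mask: int, data: bytes):
        """S0: build the one-hot entry select and latch the request."""
        self.latched = (1 << entry_idx, offset, byte_mask, bytes(data))

    def s1_write(self):
        """S1: write the latched bytes and set their mask bits."""
        if self.latched is None:
            return
        onehot, offset, byte_mask, data = self.latched
        self.latched = None
        for idx in range(NUM_ENTRIES):
            if not (onehot >> idx) & 1:
                continue  # this entry's write enable is low
            for i, value in enumerate(data):
                if (byte_mask >> i) & 1:
                    self.data[idx][offset + i] = value
                    self.mask[idx][offset + i] = True
```

Latching a 4-byte write for entry 2 at offset 0 produces the one-hot value 0b0000000000000100 in S0, and the following `s1_write` call updates only bytes 0 to 3 of entry 2.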

Feature 4: SBuffer Forwarding Logic

A load must obtain its data from the most recent preceding store to the same address. That store may be in the StoreQueue, in the SBuffer, or already written to the cache.
When searching the SBuffer, the load's tag is compared against the tags of the existing entries, and more than one entry may match. A matching entry may not yet have sent its request to dcache, or it may already have sent it. An entry that has not been sent holds the newest data, so it has higher priority. The matching data is forwarded to the load.
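The priority rule can be sketched as a two-pass merge in which inflight (older) matches are applied first and non-inflight (newer) matches overwrite them. Combining the matches byte by byte is an assumption of this sketch, as are the dictionary-based entry representation and the `forward` function name.

```python
def forward(entries, tag, line_bytes=64):
    """Return (data, mask) lists for a load that queries `tag`."""
    out_data = [0] * line_bytes
    out_mask = [False] * line_bytes
    # Pass 1 applies inflight matches, pass 2 applies non-inflight matches,
    # so the newer (not yet sent) data wins wherever both are valid.
    for want_inflight in (True, False):
        for e in entries:
            if not (e["valid"] and e["tag"] == tag):
                continue
            if e["inflight"] != want_inflight:
                continue
            for i in range(line_bytes):
                if e["mask"][i]:
                    out_data[i] = e["data"][i]
                    out_mask[i] = True
    return out_data, out_mask
```

A byte that is valid only in the inflight entry is still forwarded from it; a byte valid in both comes from the entry that has not yet been sent.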

As shown in the figure below, the forwarding query matches both SBuffer entry 0 and entry 15. Entry 0 holds the newest data and entry 15 holds older data, so entry 0 takes priority over entry 15 in the forwarding result.

Overall Block Diagram

Interface Timing

Example Timing for Receiving Store Instruction Writes

When io_in_valid and io_in_ready handshake, the SBuffer receives a write request from the StoreQueue. It checks the request's address to decide whether to allocate a new entry or merge into an existing one, and updates the entry with the information from io_in*_bits.

Receiving Store Instruction Write Timing

Example Timing for Writing to dcache

When io_dcache_req_ready and io_dcache_req_valid handshake, io_dcache_req_bits_* is provided to the dcache, passing the request for dcache to process.

Example Timing for Forwarding Request

Forwarding requests have no ready signal. Whenever io_forward_valid is high, the request must be processed. The request's paddr and vaddr are used for the query. The data and other response information are valid in the cycle after io_forward_valid is high.