
HPM

  • Version: V2R2
  • Status: OK
  • Date: 2025/02/27
  • commit: xxx

Basic Information

Terminology Description

Terminology Description
Abbreviation Full Name Description
HPM Hardware performance monitor Hardware performance counter unit

Submodule List

Submodule List
Submodule Description
HPerfCounter Individual counter module
HPerfMonitor Counter organization module
PFEvent Copy of Hpmevent register

Design Specifications

  • Implemented basic hardware performance monitoring functions based on the RISC-V privileged specification, with additional support for sstc and sscofpmf extensions.
  • Clock cycles executed by a hardware thread (cycle)
  • Number of instructions committed by a hardware thread (minstret)
  • Hardware timer (time)
  • Counter overflow flags (scountovf)
  • 29 hardware performance counters (hpmcounter3 - hpmcounter31)
  • 29 hardware performance event selectors (mhpmevent3 - mhpmevent31)
  • Supports definition of up to 2^10 types of performance events

Features

The basic functions of HPM are as follows (a bare-metal programming sketch is given after this list):

  • Disable all performance event monitoring via the mcountinhibit register.
  • Initialize performance event counters for each monitoring unit, including: mcycle, minstret, mhpmcounter3 - mhpmcounter31.
  • Configure performance event selectors for each monitoring unit, including: mhpmcounter3 - mhpmcounter31. In the XiangShan Kunming Lake architecture, each event selector can configure up to four event combinations. By writing the event index value, event combination method, and sampling privilege level to the event selector, configured events can be counted normally at the specified sampling privilege level, and the results are accumulated in the event counter based on the combination result.
  • Configure xcounteren for access permission authorization.
  • Enable all performance event monitoring via the mcountinhibit register to start counting.
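
A minimal bare-metal sketch of this flow, assuming an M-mode environment, a GCC-style RISC-V toolchain that recognizes these CSR names, and an arbitrarily chosen event index (22) for illustration:

```c
#include <stdint.h>

#define CSR_WRITE(csr, val) __asm__ volatile("csrw " #csr ", %0" :: "r"(val))
#define CSR_READ(csr, out)  __asm__ volatile("csrr %0, " #csr : "=r"(out))

static inline void hpm_setup_example(void)
{
    uint64_t cnt;

    /* 1. Inhibit all counters while they are being programmed. */
    CSR_WRITE(mcountinhibit, ~0UL);

    /* 2. Clear the counters that will be used. */
    CSR_WRITE(mcycle, 0);
    CSR_WRITE(minstret, 0);
    CSR_WRITE(mhpmcounter3, 0);

    /* 3. Select a performance event: EVENT0 = 22, other fields left at 0
     *    (OR combination, counting in every privilege mode). */
    CSR_WRITE(mhpmevent3, 22);

    /* 4. Optionally allow S-mode and U-mode reads of the counters. */
    CSR_WRITE(mcounteren, ~0UL);
    CSR_WRITE(scounteren, ~0UL);

    /* 5. Re-enable counting. */
    CSR_WRITE(mcountinhibit, 0);

    /* ... run the workload, then read back the result ... */
    CSR_READ(mhpmcounter3, cnt);
    (void)cnt;
}
```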

HPM Event Overflow Interrupt

The counter overflow interrupt LCOFIP raised by the Kunming Lake performance monitoring unit uses interrupt code 13. The enabling and handling of this interrupt are consistent with the other standard local interrupts.
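
The following hedged sketch shows one way this interrupt could be enabled and serviced from M-mode; the surrounding trap entry code and the focus on mhpmcounter3 are assumptions made only for illustration:

```c
#include <stdint.h>

#define IRQ_LCOFI    13                  /* interrupt code of LCOFI */
#define MIE_LCOFIE   (1UL << IRQ_LCOFI)
#define MHPMEVENT_OF (1UL << 63)         /* OF bit in mhpmeventX */

static inline void lcofi_enable(void)
{
    uint64_t mie;
    __asm__ volatile("csrr %0, mie" : "=r"(mie));
    mie |= MIE_LCOFIE;                   /* unmask the counter overflow interrupt */
    __asm__ volatile("csrw mie, %0" :: "r"(mie));
}

/* Called from the trap handler when mcause reports interrupt code 13. */
static inline void lcofi_handle(void)
{
    uint64_t ovf;
    __asm__ volatile("csrr %0, scountovf" : "=r"(ovf));
    if (ovf & (1UL << 3)) {
        /* mhpmcounter3 overflowed: clear its OF bit so a later overflow
         * can raise the interrupt again. */
        uint64_t ev;
        __asm__ volatile("csrr %0, mhpmevent3" : "=r"(ev));
        ev &= ~MHPMEVENT_OF;
        __asm__ volatile("csrw mhpmevent3, %0" :: "r"(ev));
    }
    /* Clear the pending bit. */
    __asm__ volatile("csrc mip, %0" :: "r"(MIE_LCOFIE));
}
```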

Overall Design

Performance events are defined in each submodule. Submodules assemble performance events into io_perf output by calling generatePerfEvent, sending them to four main modules: Frontend, Backend, MemBlock, and CoupledL2.

The aforementioned four modules obtain the performance event output from submodules by calling the get_perf method. Simultaneously, the PFEvent module is instantiated within each main module as a copy of the CSR mhpmevent register. It aggregates the required performance event selector data and the performance event output from submodules and connects them to the HPerfMonitor module to calculate the increment results applied to the performance event counters.

Finally, the CSR module collects the increment results of the performance event counters from the four top-level modules and inputs them into the CSR registers mhpmcounter3-31 for cumulative counting.

Specifically, performance events from CoupledL2 are directly input into the CSR module. Based on the event selection information read from the mhpmevent register, they are processed by the HPerfMonitor module instantiated within the CSR and input into the CSR registers mhpmcounter26-31 for cumulative counting.

See the figure below for the overall HPM design block diagram:

HPM Overall Design

HPerfMonitor Counter Organization Module

Distributes the event selection information (events) to the corresponding HPerfCounter module and broadcasts all performance event counting information to every HPerfCounter module.

Collects the outputs of all HPerfCounter modules.

HPerfCounter Individual Counter Module

Based on the input event selection information, selects the required performance event counting information and combines the input performance events according to the counting mode specified in the event selection information for output.

PFEvent Copy of Hpmevent Register

A copy of the CSR register mhpmevent: Collects CSR write information and synchronizes changes to mhpmevent.

Machine-mode Performance Event Counter Inhibit Register (MCOUNTINHIBIT)

The Machine-mode Performance Event Counter Inhibit Register (mcountinhibit) is a 32-bit WARL register, primarily used to control whether hardware performance monitoring counters count. In scenarios where performance analysis is not needed, counters can be disabled to reduce processor power consumption.

Machine-mode Performance Event Counter Inhibit Register Description
| Name | Bit Field | R/W | Behavior | Reset Value |
| ---- | --------- | --- | -------- | ----------- |
| HPMx | 31:3 | RW | Inhibit bit for the mhpmcounterx register. 0: count normally; 1: inhibit counting | 0 |
| IR | 2 | RW | Inhibit bit for the minstret register. 0: count normally; 1: inhibit counting | 0 |
| -- | 1 | RO 0 | Reserved | 0 |
| CY | 0 | RW | Inhibit bit for the mcycle register. 0: count normally; 1: inhibit counting | 0 |
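
For reference, the field layout above can be expressed as bit masks (a sketch; the macro names are illustrative):

```c
/* Bit masks for mcountinhibit, following the layout in the table above. */
#define MCOUNTINHIBIT_CY     (1U << 0)    /* inhibit mcycle */
#define MCOUNTINHIBIT_IR     (1U << 2)    /* inhibit minstret */
#define MCOUNTINHIBIT_HPM(n) (1U << (n))  /* inhibit mhpmcounterN, 3 <= N <= 31 */
#define MCOUNTINHIBIT_ALL    0xFFFFFFFDU  /* every implemented inhibit bit (bit 1 is reserved) */
```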

Machine-mode Performance Event Counter Access Enable Register (MCOUNTEREN)

The Machine-mode Performance Event Counter Access Enable Register (mcounteren) is a 32-bit WARL register, primarily used to control access permissions for user-mode performance monitoring counters in privilege modes lower than Machine mode (HS-mode/VS-mode/HU-mode/VU-mode).

Machine-mode Performance Event Counter Access Enable Register Description
| Name | Bit Field | R/W | Behavior | Reset Value |
| ---- | --------- | --- | -------- | ----------- |
| HPMx | 31:3 | RW | Access permission bit for the hpmcounterx registers in modes below M-mode. 0: accessing hpmcounterx raises an illegal instruction exception; 1: access is allowed | 0 |
| IR | 2 | RW | Access permission bit for the instret register in modes below M-mode. 0: accessing instret raises an illegal instruction exception; 1: access is allowed | 0 |
| TM | 1 | RW | Access permission bit for the time / stimecmp registers in modes below M-mode. 0: accessing time raises an illegal instruction exception; 1: access is allowed | 0 |
| CY | 0 | RW | Access permission bit for the cycle register in modes below M-mode. 0: accessing cycle raises an illegal instruction exception; 1: access is allowed | 0 |

Supervisor-mode Performance Event Counter Access Enable Register (SCOUNTEREN)

The Supervisor-mode Performance Event Counter Access Enable Register (scounteren) is a 32-bit WARL register, primarily used to control access permissions for user-mode performance monitoring counters in User mode (HU-mode/VU-mode).

Supervisor-mode Performance Event Counter Access Enable Register Description
| Name | Bit Field | R/W | Behavior | Reset Value |
| ---- | --------- | --- | -------- | ----------- |
| HPMx | 31:3 | RW | User-mode access permission bit for the hpmcounterx registers. 0: accessing hpmcounterx raises an illegal instruction exception; 1: access is allowed | 0 |
| IR | 2 | RW | User-mode access permission bit for the instret register. 0: accessing instret raises an illegal instruction exception; 1: access is allowed | 0 |
| TM | 1 | RW | User-mode access permission bit for the time register. 0: accessing time raises an illegal instruction exception; 1: access is allowed | 0 |
| CY | 0 | RW | User-mode access permission bit for the cycle register. 0: accessing cycle raises an illegal instruction exception; 1: access is allowed | 0 |

Virtualization-mode Performance Event Counter Access Enable Register (HCOUNTEREN)

The Virtualization-mode Performance Event Counter Access Enable Register (hcounteren) is a 32-bit WARL register, primarily used to control access permissions for user-mode performance monitoring counters in guest virtual machines (VS-mode/VU-mode).

Virtualization-mode Performance Event Counter Access Enable Register Description
| Name | Bit Field | R/W | Behavior | Reset Value |
| ---- | --------- | --- | -------- | ----------- |
| HPMx | 31:3 | RW | Guest virtual machine access permission bit for the hpmcounterx registers. 0: accessing hpmcounterx raises a virtual instruction exception; 1: access is allowed | 0 |
| IR | 2 | RW | Guest virtual machine access permission bit for the instret register. 0: accessing instret raises a virtual instruction exception; 1: access is allowed | 0 |
| TM | 1 | RW | Guest virtual machine access permission bit for the time / vstimecmp (accessed via stimecmp) registers. 0: accessing time raises a virtual instruction exception; 1: access is allowed | 0 |
| CY | 0 | RW | Guest virtual machine access permission bit for the cycle register. 0: accessing cycle raises a virtual instruction exception; 1: access is allowed | 0 |
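
The three access enable registers form a delegation chain: each level only opens access for the levels below it, so all of them must agree. A sketch of granting a VU-mode guest read access to cycle, time, instret, and hpmcounter3 (written as if executed from M-mode, with illustrative macro names) could look like:

```c
#include <stdint.h>

#define CTREN_CY   (1UL << 0)
#define CTREN_TM   (1UL << 1)
#define CTREN_IR   (1UL << 2)
#define CTREN_HPM3 (1UL << 3)

static inline void counter_access_delegate(void)
{
    uint64_t en = CTREN_CY | CTREN_TM | CTREN_IR | CTREN_HPM3;

    /* M-mode: allow HS/VS/HU/VU-mode accesses. */
    __asm__ volatile("csrw mcounteren, %0" :: "r"(en));
    /* (H)S-mode view: allow its user mode (HU/VU) to access the counters. */
    __asm__ volatile("csrw scounteren, %0" :: "r"(en));
    /* HS-mode view: allow the guest (VS/VU) to access the counters. */
    __asm__ volatile("csrw hcounteren, %0" :: "r"(en));
}
```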

Supervisor-mode Timer Compare Register (STIMECMP)

The Supervisor-mode Timer Compare Register (stimecmp) is a 64-bit WARL register, primarily used to manage timer interrupts (STIP) in Supervisor mode.

STIMECMP Register Behavior Description:

  • Reset value is a 64-bit unsigned number 64'hffff_ffff_ffff_ffff.
  • If menvcfg.STCE is 0 and the current privilege level is lower than M-mode (HS-mode/VS-mode/HU-mode/VU-mode), accessing the stimecmp register causes an illegal instruction exception and does not generate an STIP interrupt.
  • The stimecmp register is the source of STIP interrupts: When the unsigned integer comparison time ≥ stimecmp is true, the STIP interrupt pending signal is asserted.
  • Supervisor-mode software can control the generation of timer interrupts by writing to stimecmp, as sketched below.
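
A minimal sketch of this flow, assuming a bare-metal environment and a toolchain that recognizes the menvcfg/stimecmp CSR names:

```c
#include <stdint.h>

#define MENVCFG_STCE  (1UL << 63)
#define MCOUNTEREN_TM (1UL << 1)

/* Run once in M-mode: let S-mode use stimecmp and read time. */
static inline void sstc_enable_from_m(void)
{
    uint64_t v;
    __asm__ volatile("csrr %0, menvcfg" : "=r"(v));
    v |= MENVCFG_STCE;
    __asm__ volatile("csrw menvcfg, %0" :: "r"(v));
    __asm__ volatile("csrs mcounteren, %0" :: "r"(MCOUNTEREN_TM));
}

/* Run in S-mode: raise an STIP interrupt `delta` ticks from now. */
static inline void stimer_arm(uint64_t delta)
{
    uint64_t now;
    __asm__ volatile("csrr %0, time" : "=r"(now));
    __asm__ volatile("csrw stimecmp, %0" :: "r"(now + delta));
}
```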

Guest Virtual Machine Supervisor-mode Timer Compare Register (VSTIMECMP)

The Guest Virtual Machine Supervisor-mode Timer Compare Register (vstimecmp) is a 64-bit WARL register, primarily used to manage timer interrupts (STIP) in guest virtual machine Supervisor mode.

VSTIMECMP Register Behavior Description:

  • Reset value is a 64-bit unsigned number 64'hffff_ffff_ffff_ffff.
  • If henvcfg.STCE is 0 or hcounteren.TM is 0, accessing the vstimecmp register through the stimecmp register raises a virtual instruction exception and does not generate a VSTIP interrupt.
  • The vstimecmp register is the source of VSTIP interrupts: When the unsigned integer comparison time + htimedelta ≥ vstimecmp is true, the VSTIP interrupt pending signal is asserted.
  • Guest virtual machine Supervisor-mode software can control the generation of timer interrupts in VS-mode by writing to vstimecmp.

Machine-mode Performance Event Selector Registers (MHPMEVENT3-31)

Machine-mode performance event selectors (mhpmevent3 - mhpmevent31) are 64-bit WARL registers used to select the performance event counted by each performance event counter. In the XiangShan Kunming Lake architecture, each counter can be configured with up to four performance events for combined counting. After the user writes the event index values, the event combination method, and the sampling privilege level to a given event selector, the event counter matched by that selector begins counting normally.

Machine-mode Performance Event Selector Description
| Name | Bit Field | R/W | Behavior | Reset Value |
| ---- | --------- | --- | -------- | ----------- |
| OF | 63 | RW | Performance counter overflow flag bit. 0: set to 1 by hardware when the corresponding counter overflows, generating an overflow interrupt; 1: left unchanged when the corresponding counter overflows, and no overflow interrupt is generated | 0 |
| MINH | 62 | RW | When set to 1, inhibits counting (sampling) in M-mode | 0 |
| SINH | 61 | RW | When set to 1, inhibits counting in S-mode | 0 |
| UINH | 60 | RW | When set to 1, inhibits counting in U-mode | 0 |
| VSINH | 59 | RW | When set to 1, inhibits counting in VS-mode | 0 |
| VUINH | 58 | RW | When set to 1, inhibits counting in VU-mode | 0 |
| -- | 57:55 | RW | -- | 0 |
| OP_TYPE2 | 54:50 | RW | Counter event combination control bits. 5'b00000: combine with OR; 5'b00001: combine with AND; 5'b00010: combine with XOR; 5'b00100: combine with ADD | 0 |
| OP_TYPE1 | 49:45 | RW | Same encoding as OP_TYPE2 | 0 |
| OP_TYPE0 | 44:40 | RW | Same encoding as OP_TYPE2 | 0 |
| EVENT3 | 39:30 | RW | Counter performance event index. 0: this event slot does not count (noEvent); non-zero: count the event with this index | -- |
| EVENT2 | 29:20 | RW | Same as EVENT3 | -- |
| EVENT1 | 19:10 | RW | Same as EVENT3 | -- |
| EVENT0 | 9:0 | RW | Same as EVENT3 | -- |

Among these, the event combination method for counters is:

  • EVENT0 and EVENT1 events are combined using the OP_TYPE0 operation to form RESULT0.
  • EVENT2 and EVENT3 events are combined using the OP_TYPE1 operation to form RESULT1.
  • RESULT0 and RESULT1 are combined using the OP_TYPE2 operation to form RESULT2.
  • RESULT2 is accumulated in the corresponding event counter.

The reset value for the event index portion of the performance event selector is specified as 0.
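
The field layout and combination rules above can be modeled in software as follows (a sketch; the helper names are illustrative, and the actual combination is performed in hardware by HPerfCounter):

```c
#include <stdint.h>

enum op_type { OP_OR = 0x00, OP_AND = 0x01, OP_XOR = 0x02, OP_ADD = 0x04 };

/* Pack an mhpmevent value from its fields, following the selector table. */
static inline uint64_t mhpmevent_pack(unsigned ev0, unsigned ev1,
                                      unsigned ev2, unsigned ev3,
                                      enum op_type op0, enum op_type op1,
                                      enum op_type op2)
{
    return ((uint64_t)(ev0 & 0x3FF) << 0)  |
           ((uint64_t)(ev1 & 0x3FF) << 10) |
           ((uint64_t)(ev2 & 0x3FF) << 20) |
           ((uint64_t)(ev3 & 0x3FF) << 30) |
           ((uint64_t)(op0 & 0x1F)  << 40) |
           ((uint64_t)(op1 & 0x1F)  << 45) |
           ((uint64_t)(op2 & 0x1F)  << 50);
}

/* Combine two per-cycle event increments according to one OP_TYPE field. */
static inline uint64_t combine(enum op_type op, uint64_t a, uint64_t b)
{
    switch (op) {
    case OP_OR:  return a | b;
    case OP_AND: return a & b;
    case OP_XOR: return a ^ b;
    case OP_ADD: return a + b;
    default:     return 0;
    }
}

/* Per-cycle increment applied to the matching counter:
 * RESULT2 = op2(op0(ev0, ev1), op1(ev2, ev3)). */
static inline uint64_t hpm_increment(enum op_type op0, enum op_type op1,
                                     enum op_type op2,
                                     uint64_t ev0, uint64_t ev1,
                                     uint64_t ev2, uint64_t ev3)
{
    return combine(op2, combine(op0, ev0, ev1), combine(op1, ev2, ev3));
}
```

For example, mhpmevent_pack(22, 0, 0, 0, OP_OR, OP_OR, OP_OR) selects only event 22 through EVENT0, leaving the other slots idle.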

The Kunming Lake architecture classifies the provided performance events into four categories based on their source: frontend, backend, memory access, and cache. It also divides the counters into four parts, which respectively record performance events from the four sources mentioned above:

  • Frontend: mhpmevent3-10
  • Backend: mhpmevent11-18
  • Memory Access: mhpmevent19-26
  • Cache: mhpmevent27-31
Kunming Lake Frontend Performance Event Index Table
Index Event
0 noEvent
1 frontendFlush
2 ifu_req
3 ifu_miss
4 ifu_req_cacheline_0
5 ifu_req_cacheline_1
6 ifu_req_cacheline_0_hit
7 ifu_req_cacheline_1_hit
8 only_0_hit
9 only_0_miss
10 hit_0_hit_1
11 hit_0_miss_1
12 miss_0_hit_1
13 miss_0_miss_1
14 IBuffer_Flushed
15 IBuffer_hungry
16 IBuffer_1_4_valid
17 IBuffer_2_4_valid
18 IBuffer_3_4_valid
19 IBuffer_4_4_valid
20 IBuffer_full
21 Front_Bubble
22 Fetch_Latency_Bound
23 icache_miss_cnt
24 icache_miss_penalty
25 bpu_s2_redirect
26 bpu_s3_redirect
27 bpu_to_ftq_stall
28 mispredictRedirect
29 replayRedirect
30 predecodeRedirect
31 to_ifu_bubble
32 from_bpu_real_bubble
33 BpInstr
34 BpBInstr
35 BpRight
36 BpWrong
37 BpBRight
38 BpBWrong
39 BpJRight
40 BpJWrong
41 BpIRight
42 BpIWrong
43 BpCRight
44 BpCWrong
45 BpRRight
46 BpRWrong
47 ftb_false_hit
48 ftb_hit
49 fauftb_commit_hit
50 fauftb_commit_miss
51 tage_tht_hit
52 sc_update_on_mispred
53 sc_update_on_unconf
54 ftb_commit_hits
55 ftb_commit_misses
Kunming Lake Backend Performance Event Index Table
Index Event
0 noEvent
1 decoder_fused_instr
2 decoder_waitInstr
3 decoder_stall_cycle
4 decoder_utilization
5 INST_SPEC
6 RECOVERY_BUBBLE
7 rename_in
8 rename_waitinstr
9 rename_stall
10 rename_stall_cycle_walk
11 rename_stall_cycle_dispatch
12 rename_stall_cycle_int
13 rename_stall_cycle_fp
14 rename_stall_cycle_vec
15 rename_stall_cycle_v0
16 rename_stall_cycle_vl
17 me_freelist_1_4_valid
18 me_freelist_2_4_valid
19 me_freelist_3_4_valid
20 me_freelist_4_4_valid
21 std_freelist_1_4_valid
22 std_freelist_2_4_valid
23 std_freelist_3_4_valid
24 std_freelist_4_4_valid
25 std_freelist_1_4_valid
26 std_freelist_2_4_valid
27 std_freelist_3_4_valid
28 std_freelist_4_4_valid
29 std_freelist_1_4_valid
30 std_freelist_2_4_valid
31 std_freelist_3_4_valid
32 std_freelist_4_4_valid
33 std_freelist_1_4_valid
34 std_freelist_2_4_valid
35 std_freelist_3_4_valid
36 std_freelist_4_4_valid
37 dispatch_in
38 dispatch_empty
39 dispatch_utili
40 dispatch_waitinstr
41 dispatch_stall_cycle_lsq
42 dispatch_stall_cycle_rob
43 dispatch_stall_cycle_int_dq
44 dispatch_stall_cycle_fp_dq
45 dispatch_stall_cycle_ls_dq
46 rob_interrupt_num
47 rob_exception_num
48 rob_flush_pipe_num
49 rob_replay_inst_num
50 rob_commitUop
51 rob_commitInstr
52 rob_commitInstrFused
53 rob_commitInstrLoad
54 rob_commitInstrBranch
55 rob_commitInstrStore
56 rob_walkInstr
57 rob_walkCycle
58 rob_1_4_valid
59 rob_2_4_valid
60 rob_3_4_valid
61 rob_4_4_valid
62 BR_MIS_PRED
63 TOTAL_FLUSH
64 EXEC_STALL_CYCLE
65 MEMSTALL_STORE
66 MEMSTALL_L1MISS
67 MEMSTALL_L2MISS
68 MEMSTALL_L3MISS
69 issueQueue_enq_fire_cnt
70 IssueQueueAluMulBkuBrhJmp_full
71 IssueQueueAluMulBkuBrhJmp_full
72 IssueQueueAluBrhJmpI2fVsetriwiVsetriwvfI2v_full
73 IssueQueueAluCsrFenceDiv_full
74 issueQueue_enq_fire_cnt
75 IssueQueueFaluFcvtF2vFmacFdiv_full
76 IssueQueueFaluFmacFdiv_full
77 IssueQueueFaluFmac_full
78 issueQueue_enq_fire_cnt
79 IssueQueueVfmaVialuFixVimacVppuVfaluVfcvtVipuVsetrvfwvf_full
80 IssueQueueVfmaVialuFixVfalu_full
81 IssueQueueVfdivVidiv_full
82 issueQueue_enq_fire_cnt
83 IssueQueueStaMou_full
84 IssueQueueStaMou_full
85 IssueQueueLdu_full
86 IssueQueueLdu_full
87 IssueQueueLdu_full
88 IssueQueueVlduVstuVseglduVsegstu_full
89 IssueQueueVlduVstu_full
90 IssueQueueStdMoud_full
91 IssueQueueStdMoud_full
Kunming Lake Memory Access Performance Event Index Table
Index Event
0 noEvent
1 load_s0_in_fire
2 load_to_load_forward
3 stall_dcache
4 load_s1_in_fire
5 load_s1_tlb_miss
6 load_s2_in_fire
7 load_s2_dcache_miss
8 load_s0_in_fire
9 load_to_load_forward
10 stall_dcache
11 load_s1_in_fire
12 load_s1_tlb_miss
13 load_s2_in_fire
14 load_s2_dcache_miss
15 load_s0_in_fire
16 load_to_load_forward
17 stall_dcache
18 load_s1_in_fire
19 load_s1_tlb_miss
20 load_s2_in_fire
21 load_s2_dcache_miss
22 sbuffer_req_valid
23 sbuffer_req_fire
24 sbuffer_merge
25 sbuffer_newline
26 dcache_req_valid
27 dcache_req_fire
28 sbuffer_idle
29 sbuffer_flush
30 sbuffer_replace
31 mpipe_resp_valid
32 replay_resp_valid
33 coh_timeout
34 sbuffer_1_4_valid
35 sbuffer_2_4_valid
36 sbuffer_3_4_valid
37 sbuffer_full_valid
38 MEMSTALL_ANY_LOAD
39 enq
40 ld_ld_violation
41 enq
42 stld_rollback
43 enq
44 deq
45 deq_block
46 replay_full
47 replay_rar_nack
48 replay_raw_nack
49 replay_nuke
50 replay_mem_amb
51 replay_tlb_miss
52 replay_bank_conflict
53 replay_dcache_replay
54 replay_forward_fail
55 replay_dcache_miss
56 full_mask_000
57 full_mask_001
58 full_mask_010
59 full_mask_011
60 full_mask_100
61 full_mask_101
62 full_mask_110
63 full_mask_111
64 nuke_rollback
65 nack_rollback
66 mmioCycle
67 mmioCnt
68 mmio_wb_success
69 mmio_wb_blocked
70 stq_1_4_valid
71 stq_2_4_valid
72 stq_3_4_valid
73 stq_4_4_valid
74 dcache_wbq_req
75 dcache_wbq_1_4_valid
76 dcache_wbq_2_4_valid
77 dcache_wbq_3_4_valid
78 dcache_wbq_4_4_valid
79 dcache_mp_req
80 dcache_mp_total_penalty
81 dcache_missq_req
82 dcache_missq_1_4_valid
83 dcache_missq_2_4_valid
84 dcache_missq_3_4_valid
85 dcache_missq_4_4_valid
86 dcache_probq_req
87 dcache_probq_1_4_valid
88 dcache_probq_2_4_valid
89 dcache_probq_3_4_valid
90 dcache_probq_4_4_valid
91 load_req
92 load_replay
93 load_replay_for_data_nack
94 load_replay_for_no_mshr
95 load_replay_for_conflict
96 load_req
97 load_replay
98 load_replay_for_data_nack
99 load_replay_for_no_mshr
100 load_replay_for_conflict
101 load_req
102 load_replay
103 load_replay_for_data_nack
104 load_replay_for_no_mshr
105 load_replay_for_conflict
106 PTW_tlbllptw_incount
107 PTW_tlbllptw_inblock
108 PTW_tlbllptw_memcount
109 PTW_tlbllptw_memcycle
110 PTW_access
111 PTW_l2_hit
112 PTW_l1_hit
113 PTW_l0_hit
114 PTW_sp_hit
115 PTW_pte_hit
116 PTW_rwHarzad
117 PTW_out_blocked
118 PTW_fsm_count
119 PTW_fsm_busy
120 PTW_fsm_idle
121 PTW_resp_blocked
122 PTW_mem_count
123 PTW_mem_cycle
124 PTW_mem_blocked
125 ldDeqCount
126 stDeqCount
Kunming Lake Cache Performance Event Index Table
Index Event
0 noEvent
1 Slice0_l2_cache_refill
2 Slice0_l2_cache_rd_refill
3 Slice0_l2_cache_wr_refill
4 Slice0_l2_cache_long_miss
5 Slice0_l2_cache_access
6 Slice0_l2_cache_l2wb
7 Slice0_l2_cache_l1wb
8 Slice0_l2_cache_wb_victim
9 Slice0_l2_cache_wb_cleaning_coh
10 Slice0_l2_cache_access_rd
11 Slice0_l2_cache_access_wr
12 Slice0_l2_cache_inv
13 Slice1_l2_cache_refill
14 Slice1_l2_cache_rd_refill
15 Slice1_l2_cache_wr_refill
16 Slice1_l2_cache_long_miss
17 Slice1_l2_cache_access
18 Slice1_l2_cache_l2wb
19 Slice1_l2_cache_l1wb
20 Slice1_l2_cache_wb_victim
21 Slice1_l2_cache_wb_cleaning_coh
22 Slice1_l2_cache_access_rd
23 Slice1_l2_cache_access_wr
24 Slice1_l2_cache_inv
25 Slice2_l2_cache_refill
26 Slice2_l2_cache_rd_refill
27 Slice2_l2_cache_wr_refill
28 Slice2_l2_cache_long_miss
29 Slice2_l2_cache_access
30 Slice2_l2_cache_l2wb
31 Slice2_l2_cache_l1wb
32 Slice2_l2_cache_wb_victim
33 Slice2_l2_cache_wb_cleaning_coh
34 Slice2_l2_cache_access_rd
35 Slice2_l2_cache_access_wr
36 Slice2_l2_cache_inv
37 Slice3_l2_cache_refill
38 Slice3_l2_cache_rd_refill
39 Slice3_l2_cache_wr_refill
40 Slice3_l2_cache_long_miss
41 Slice3_l2_cache_access
42 Slice3_l2_cache_l2wb
43 Slice3_l2_cache_l1wb
44 Slice3_l2_cache_wb_victim
45 Slice3_l2_cache_wb_cleaning_coh
46 Slice3_l2_cache_access_rd
47 Slice3_l2_cache_access_wr
48 Slice3_l2_cache_inv

Topdown PMU

Topdown performance analysis is a top-down analysis method used to quickly analyze CPU performance bottlenecks. Its core idea is to progressively decompose performance issues from high-level categories downward, refining the problem layer by layer, ultimately pinpointing the root cause. We have implemented a three-layer Topdown performance event structure, as shown below:

Three-layer Topdown Performance Events
| Level 1 | Level 2 | Level 3 | Description | Formula |
| ------- | ------- | ------- | ----------- | ------- |
| Retiring | - | - | Instruction commit impact | INST_RETIRED / (IssueBW * CPU_CYCLES) |
| FrontEnd Bound | - | - | Frontend impact | IF_FETCH_BUBBLE / (IssueBW * CPU_CYCLES) |
| - | Fetch Latency Bound | - | Fetch latency impact | IF_FETCH_BUBBLE_EQ_MAX / CPU_CYCLES |
| - | Fetch Bandwidth Bound | - | Fetch bandwidth impact | FrontEnd Bound - Fetch Latency Bound |
| Bad Speculation | - | - | Misprediction impact | (INST_SPEC - INST_RETIRED + RECOVERY_BUBBLE) / (IssueBW * CPU_CYCLES) |
| - | Branch Mispredict | - | Mispredicted branch instruction impact | Bad Speculation * BR_MIS_PRED / TOTAL_FLUSH |
| - | Machine Clears | - | Machine clear event impact | Bad Speculation - Branch Mispredict |
| BackEnd Bound | - | - | Backend impact | 1 - (FrontEnd Bound + Bad Speculation + Retiring) |
| - | Core Bound | - | Core impact | (EXEC_STALL_CYCLE - MEMSTALL_ANY_LOAD - MEMSTALL_STORE) / CPU_CYCLES |
| - | Memory Bound | - | Memory access impact | (MEMSTALL_ANY_LOAD + MEMSTALL_STORE) / CPU_CYCLES |
| - | - | L1 Bound | L1 impact | (MEMSTALL_ANY_LOAD - MEMSTALL_L1MISS) / CPU_CYCLES |
| - | - | L2 Bound | L2 impact | (MEMSTALL_L1MISS - MEMSTALL_L2MISS) / CPU_CYCLES |
| - | - | L3 Bound | L3 impact | (MEMSTALL_L2MISS - MEMSTALL_L3MISS) / CPU_CYCLES |
| - | - | Mem Bound | External memory impact | MEMSTALL_L3MISS / CPU_CYCLES |
| - | - | Store Bound | Store instruction impact | MEMSTALL_STORE / CPU_CYCLES |

Here IssueBW is the issue width; the XiangShan Kunming Lake architecture currently issues up to 6 instructions per cycle.
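
Given raw event counts collected over a measurement interval, the Level-1 metrics follow directly from the formulas above. The sketch below assumes the counts have already been read out; all names are illustrative:

```c
#define ISSUE_BW 6.0   /* issue width of the Kunming Lake architecture */

struct topdown_l1 {
    double retiring, frontend_bound, bad_speculation, backend_bound;
};

/* Level-1 Topdown breakdown computed from raw event counts. */
static struct topdown_l1 topdown_level1(double cpu_cycles, double inst_retired,
                                        double inst_spec, double fetch_bubble,
                                        double recovery_bubble)
{
    struct topdown_l1 t;
    double slots = ISSUE_BW * cpu_cycles;

    t.retiring        = inst_retired / slots;
    t.frontend_bound  = fetch_bubble / slots;
    t.bad_speculation = (inst_spec - inst_retired + recovery_bubble) / slots;
    t.backend_bound   = 1.0 - (t.frontend_bound + t.bad_speculation + t.retiring);
    return t;
}
```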

Topdown Performance Events
| Name | Corresponding Performance Event | Description |
| ---- | ------------------------------- | ----------- |
| CPU_CYCLES | - | Total clock cycles taken to commit all instructions |
| INST_RETIRED | rob_commitInstr | Number of successfully committed instructions |
| INST_SPEC | - | Number of speculatively executed instructions |
| IF_FETCH_BUBBLE | Front_Bubble | Number of bubbles at the fetch buffer output while no backend stall exists |
| IF_FETCH_BUBBLE_EQ_MAX | Fetch_Latency_Bound | Cycles in which 0 instructions are fetched from the fetch buffer while no backend stall exists |
| BR_MIS_PRED | - | Number of mispredicted branch instructions |
| TOTAL_FLUSH | - | Number of pipeline flush events |
| RECOVERY_BUBBLE | - | Number of bubble cycles spent recovering from earlier mispredictions |
| EXEC_STALL_CYCLE | - | Number of cycles in which few uops are issued |
| MEMSTALL_ANY_LOAD | - | No uops are issued and at least one load instruction has not completed |
| MEMSTALL_STORE | - | Only non-store uops are issued and at least one store instruction has not completed |
| MEMSTALL_L1MISS | - | No uops are issued, at least one load instruction has not completed, and an L1-cache miss occurred |
| MEMSTALL_L2MISS | - | No uops are issued, at least one load instruction has not completed, and an L2-cache miss occurred |
| MEMSTALL_L3MISS | - | No uops are issued, at least one load instruction has not completed, and an L3-cache miss occurred |

To count the impact of frontend fetch latency over a period, we can set the EVENT0 field of mhpmevent3 to 22, leaving the other bits at their default values. Then, run the test. After the test is completed, the mhpmcounter3 register can be read via a CSR read instruction to obtain the number of cycles of frontend fetch latency during this period. The impact caused by frontend fetch latency can then be calculated.
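
A hedged sketch of this measurement, assuming a bare-metal M-mode environment and an illustrative run_test() workload that stands in for the test:

```c
#include <stdint.h>

extern void run_test(void);   /* hypothetical workload, not part of the HPM interface */

static double measure_fetch_latency_bound(void)
{
    uint64_t cycles, bubbles;

    __asm__ volatile("csrw mcountinhibit, %0" :: "r"(~0UL));        /* stop counters   */
    __asm__ volatile("csrw mhpmevent3, %0"    :: "r"((uint64_t)22)); /* EVENT0 = 22     */
    __asm__ volatile("csrw mhpmcounter3, %0"  :: "r"(0UL));          /* clear counters  */
    __asm__ volatile("csrw mcycle, %0"        :: "r"(0UL));
    __asm__ volatile("csrw mcountinhibit, %0" :: "r"(0UL));          /* start counting  */

    run_test();

    __asm__ volatile("csrr %0, mhpmcounter3" : "=r"(bubbles));
    __asm__ volatile("csrr %0, mcycle"       : "=r"(cycles));

    /* Fetch Latency Bound = IF_FETCH_BUBBLE_EQ_MAX / CPU_CYCLES */
    return (double)bubbles / (double)cycles;
}
```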

The performance event counters in the XiangShan Kunming Lake architecture are divided into three groups: Machine-mode event counters, Supervisor-mode event counters, and User-mode event counters.

Machine-mode Event Counter List
| Name | Index | R/W | Description | Reset Value |
| ---- | ----- | --- | ----------- | ----------- |
| MCYCLE | 0xB00 | RW | Machine-mode clock cycle counter | - |
| MINSTRET | 0xB02 | RW | Machine-mode retired instruction counter | - |
| MHPMCOUNTER3-31 | 0xB03-0xB1F | RW | Machine-mode performance event counters | 0 |

Each MHPMCOUNTERx counter is controlled by the corresponding MHPMEVENTx selector, which specifies the performance events to count.

Supervisor-mode event counters include the Supervisor-mode Counter Overflow Interrupt Flag Register (SCOUNTOVF).

Supervisor-mode Counter Overflow Interrupt Flag Register (SCOUNTOVF) Description
| Name | Bit Field | R/W | Behavior | Reset Value |
| ---- | --------- | --- | -------- | ----------- |
| OFVEC | 31:3 | RO | Overflow flag bits of the mhpmcounter3-31 registers. 1: overflow occurred; 0: no overflow occurred | 0 |
| -- | 2:0 | RO 0 | Reserved | 0 |

scountovf serves as a read-only mapping of the OF bits of the mhpmevent registers, and its readability is controlled by xcounteren:

  • M-mode accessing scountovf can read the correct value.
  • HS-mode accessing scountovf: If mcounteren.HPMx is 1, the corresponding OFVECx can be read correctly; otherwise, it reads 0.
  • VS-mode accessing scountovf: If both mcounteren.HPMx and hcounteren.HPMx are 1, the corresponding OFVECx can be read correctly; otherwise, it reads 0.
User-mode Event Counter List
| Name | Index | R/W | Description | Reset Value |
| ---- | ----- | --- | ----------- | ----------- |
| CYCLE | 0xC00 | RO | User-mode read-only copy of the mcycle register | - |
| TIME | 0xC01 | RO | User-mode read-only copy of the memory-mapped mtime register | - |
| INSTRET | 0xC02 | RO | User-mode read-only copy of the minstret register | - |
| HPMCOUNTER3-31 | 0xC03-0xC1F | RO | User-mode read-only copies of the mhpmcounter3-31 registers | 0 |