HPM

Version: V2R2
Status: OK
Date: 2025/02/27
commit: xxx

Basic Information

Terminology Description

Abbreviation	Full Name	Description
HPM	Hardware performance monitor	Hardware performance counter unit

Submodule List

Submodule	Description
HPerfCounter	Individual counter module
HPerfMonitor	Counter organization module
PFEvent	Copy of the mhpmevent registers

Design Specifications

Implements basic hardware performance monitoring functions based on the RISC-V privileged specification, with additional support for the Sstc and Sscofpmf extensions:
Clock cycles executed by a hardware thread (cycle)
Number of instructions committed by a hardware thread (minstret)
Hardware timer (time)
Counter overflow flag (scountovf)
29 hardware performance counters (hpmcounter3 - hpmcounter31)
29 hardware performance event selectors (mhpmevent3 - mhpmevent31)
Supports definition of up to 2^10 types of performance events

Features

The basic functions of HPM are as follows:

Disable all performance event monitoring via the mcountinhibit register.
Initialize performance event counters for each monitoring unit, including: mcycle, minstret, mhpmcounter3 - mhpmcounter31.
Configure performance event selectors for each monitoring unit, including: mhpmevent3 - mhpmevent31. In the XiangShan Kunming Lake architecture, each event selector can combine up to four events. After software writes the event index values, event combination method, and sampling privilege level to the event selector, the configured events are counted at the specified sampling privilege level, and the combined result is accumulated in the event counter.
Configure xcounteren for access permission authorization.
Enable all performance event monitoring via the mcountinhibit register to start counting.
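
The five steps above can be sketched against a toy register model. This is only an illustration, not the actual driver code: the `CsrFile` class and `start_monitoring` helper are hypothetical names, a dictionary stands in for the hardware CSRs, and only mcounteren is written for the xcounteren step.

```python
class CsrFile:
    """Toy stand-in for the hardware CSRs (assumption: a plain dictionary)."""

    def __init__(self):
        self.regs = {}

    def write(self, name, value):
        # Registers are at most 64 bits wide.
        self.regs[name] = value & 0xFFFF_FFFF_FFFF_FFFF

    def read(self, name):
        return self.regs.get(name, 0)


def start_monitoring(csr, selectors, counteren=0):
    """Apply the five steps; `selectors` maps a counter index to an mhpmevent value."""
    # Step 1: inhibit every counter while it is being reconfigured.
    csr.write("mcountinhibit", 0xFFFF_FFFF)
    # Step 2: initialize the counters.
    csr.write("mcycle", 0)
    csr.write("minstret", 0)
    for i in range(3, 32):
        csr.write(f"mhpmcounter{i}", 0)
    # Step 3: program the event selectors.
    for i, value in selectors.items():
        if not 3 <= i <= 31:
            raise ValueError(f"mhpmevent{i} does not exist")
        csr.write(f"mhpmevent{i}", value)
    # Step 4: grant access to lower privilege modes (only mcounteren here).
    csr.write("mcounteren", counteren)
    # Step 5: clear the inhibit bits so that counting starts.
    csr.write("mcountinhibit", 0)


csr = CsrFile()
start_monitoring(csr, {3: 22})  # count event index 22 on mhpmcounter3
```

On real hardware each `write` would be a CSR write instruction executed in M-mode, and reserved WARL bits would ignore the written value.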

HPM Event Overflow Interrupt

The overflow interrupt LCOFIP initiated by the Kunming Lake performance monitoring unit has a unified interrupt vector number of 13, as defined by the Sscofpmf extension. The enabling and processing of this interrupt are consistent with standard private interrupts.

Overall Design

Performance events are defined in each submodule. Submodules assemble performance events into io_perf output by calling generatePerfEvent, sending them to four main modules: Frontend, Backend, MemBlock, and CoupledL2.

The aforementioned four modules obtain the performance event output from submodules by calling the get_perf method. Simultaneously, the PFEvent module is instantiated within each main module as a copy of the CSR mhpmevent register. It aggregates the required performance event selector data and the performance event output from submodules and connects them to the HPerfMonitor module to calculate the increment results applied to the performance event counters.

Finally, the CSR module collects the increment results of the performance event counters from the four top-level modules and inputs them into the CSR registers mhpmcounter3-31 for cumulative counting.

Specifically, performance events from CoupledL2 are directly input into the CSR module. Based on the event selection information read from the mhpmevent register, they are processed by the HPerfMonitor module instantiated within the CSR and input into the CSR registers mhpmcounter26-31 for cumulative counting.

See the figure for the overall design block diagram of HPM.

HPerfMonitor Counter Organization Module

Inputs the event selection information (events) to the corresponding HPerfCounter module and copies all performance event counting information to every HPerfCounter module.

Collects the output from all HPerfCounter modules.

HPerfCounter Individual Counter Module

Based on the input event selection information, selects the required performance event counting information and combines the input performance events according to the counting mode specified in the event selection information for output.

PFEvent Copy of mhpmevent Register

A copy of the CSR register mhpmevent: Collects CSR write information and synchronizes changes to mhpmevent.

Machine-mode Performance Event Counter Inhibit Register (MCOUNTINHIBIT)

The Machine-mode Performance Event Counter Inhibit Register (mcountinhibit) is a 32-bit WARL register, primarily used to control whether hardware performance monitoring counters count. In scenarios where performance analysis is not needed, counters can be disabled to reduce processor power consumption.

Machine-mode Performance Event Counter Inhibit Register Description
Name	Bit Field	R/W	Behavior	Reset Value
HPMx	31:3	RW	Inhibit bit for the mhpmcounterx register. 0: count normally; 1: inhibit counting	0
IR	2	RW	Inhibit bit for the minstret register. 0: count normally; 1: inhibit counting	0
--	1	RO 0	Reserved	0
CY	0	RW	Inhibit bit for the mcycle register. 0: count normally; 1: inhibit counting	0

Machine-mode Performance Event Counter Access Enable Register (MCOUNTEREN)

The Machine-mode Performance Event Counter Access Enable Register (mcounteren) is a 32-bit WARL register, primarily used to control access permissions for user-mode performance monitoring counters in privilege modes lower than Machine mode (HS-mode/VS-mode/HU-mode/VU-mode).

Machine-mode Performance Event Counter Access Enable Register Description
Name	Bit Field	R/W	Behavior	Reset Value
HPMx	31:3	RW	Access permission bit for the hpmcounterx register below M-mode. 0: accessing hpmcounterx causes an illegal instruction exception; 1: allows normal access to hpmcounterx	0
IR	2	RW	Access permission bit for the instret register below M-mode. 0: accessing instret causes an illegal instruction exception; 1: allows normal access	0
TM	1	RW	Access permission bit for the time/stimecmp register below M-mode. 0: accessing time causes an illegal instruction exception; 1: allows normal access	0
CY	0	RW	Access permission bit for the cycle register below M-mode. 0: accessing cycle causes an illegal instruction exception; 1: allows normal access	0

Supervisor-mode Performance Event Counter Access Enable Register (SCOUNTEREN)

The Supervisor-mode Performance Event Counter Access Enable Register (scounteren) is a 32-bit WARL register, primarily used to control access permissions for user-mode performance monitoring counters in User mode (HU-mode/VU-mode).

Supervisor-mode Performance Event Counter Access Enable Register Description
Name	Bit Field	R/W	Behavior	Reset Value
HPMx	31:3	RW	User-mode access permission bit for the hpmcounterx register. 0: accessing hpmcounterx causes an illegal instruction exception; 1: allows normal access to hpmcounterx	0
IR	2	RW	User-mode access permission bit for the instret register. 0: accessing instret causes an illegal instruction exception; 1: allows normal access	0
TM	1	RW	User-mode access permission bit for the time register. 0: accessing time causes an illegal instruction exception; 1: allows normal access	0
CY	0	RW	User-mode access permission bit for the cycle register. 0: accessing cycle causes an illegal instruction exception; 1: allows normal access	0

Virtualization-mode Performance Event Counter Access Enable Register (HCOUNTEREN)

The Virtualization-mode Performance Event Counter Access Enable Register (hcounteren) is a 32-bit WARL register, primarily used to control access permissions for user-mode performance monitoring counters in guest virtual machines (VS-mode/VU-mode).

Virtualization-mode Performance Event Counter Access Enable Register Description
Name	Bit Field	R/W	Behavior	Reset Value
HPMx	31:3	RW	Guest virtual machine access permission bit for the hpmcounterx register. 0: accessing hpmcounterx causes an illegal instruction exception; 1: allows normal access to hpmcounterx	0
IR	2	RW	Guest virtual machine access permission bit for the instret register. 0: accessing instret causes an illegal instruction exception; 1: allows normal access	0
TM	1	RW	Guest virtual machine access permission bit for the time/vstimecmp (via stimecmp) register. 0: accessing time causes an illegal instruction exception; 1: allows normal access	0
CY	0	RW	Guest virtual machine access permission bit for the cycle register. 0: accessing cycle causes an illegal instruction exception; 1: allows normal access	0
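
How mcounteren, scounteren, and hcounteren stack can be sketched as a small check. This is a minimal model rather than the RTL: `counter_access_allowed` is a hypothetical helper, the mode names follow the text above, `bit` is the position of the enable bit for the counter being read, and the sketch only reports whether access is granted, not which exception is raised.

```python
def counter_access_allowed(mode, bit, mcounteren, scounteren, hcounteren):
    """Return True if `mode` may read the counter guarded by enable bit `bit`."""
    if mode not in ("M", "HS", "HU", "VS", "VU"):
        raise ValueError(f"unknown privilege mode: {mode}")

    def enabled(reg):
        return bool((reg >> bit) & 1)

    if mode == "M":
        return True                      # M-mode is never restricted
    if not enabled(mcounteren):
        return False                     # mcounteren gates every mode below M
    if mode in ("VS", "VU") and not enabled(hcounteren):
        return False                     # hcounteren gates the guest modes
    if mode in ("HU", "VU") and not enabled(scounteren):
        return False                     # scounteren gates the user modes
    return True
```

For example, with only the mcounteren bit set, HS-mode is granted access while HU-mode, VS-mode, and VU-mode are still denied.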

Supervisor-mode Timer Compare Register (STIMECMP)

The Supervisor-mode Timer Compare Register (stimecmp) is a 64-bit WARL register, primarily used to manage timer interrupts (STIP) in Supervisor mode.

STIMECMP Register Behavior Description:

Reset value is a 64-bit unsigned number 64'hffff_ffff_ffff_ffff.
If menvcfg.STCE is 0 and the current privilege level is lower than M-mode (HS-mode/VS-mode/HU-mode/VU-mode), accessing the stimecmp register causes an illegal instruction exception and does not generate an STIP interrupt.
The stimecmp register is the source of STIP interrupts: When the unsigned integer comparison time ≥ stimecmp is true, the STIP interrupt pending signal is asserted.
Supervisor-mode software can control the generation of timer interrupts by writing to stimecmp.

Guest Virtual Machine Supervisor-mode Timer Compare Register (VSTIMECMP)

The Guest Virtual Machine Supervisor-mode Timer Compare Register (vstimecmp) is a 64-bit WARL register, primarily used to manage timer interrupts (VSTIP) in guest virtual machine Supervisor mode.

VSTIMECMP Register Behavior Description:

Reset value is a 64-bit unsigned number 64'hffff_ffff_ffff_ffff.
If henvcfg.STCE is 0 or hcounteren.TM is 0, accessing the vstimecmp register via the stimecmp register causes a virtual instruction exception and does not generate a VSTIP interrupt.
The vstimecmp register is the source of VSTIP interrupts: When the unsigned integer comparison time + htimedelta ≥ vstimecmp is true, the VSTIP interrupt pending signal is asserted.
Guest virtual machine Supervisor-mode software can control the generation of timer interrupts in VS-mode by writing to vstimecmp.
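
The two pending conditions (time ≥ stimecmp for STIP, time + htimedelta ≥ vstimecmp for VSTIP) can be written out as unsigned 64-bit comparisons. This is a minimal sketch: the function names are hypothetical, and the sum time + htimedelta is assumed to wrap at 64 bits as it would in a 64-bit adder.

```python
MASK64 = (1 << 64) - 1  # all values are treated as unsigned 64-bit integers


def stip_pending(time, stimecmp):
    """STIP is pending when time >= stimecmp (unsigned comparison)."""
    return (time & MASK64) >= (stimecmp & MASK64)


def vstip_pending(time, htimedelta, vstimecmp):
    """VSTIP is pending when (time + htimedelta) >= vstimecmp (unsigned)."""
    guest_time = (time + htimedelta) & MASK64  # assumed to wrap at 64 bits
    return guest_time >= (vstimecmp & MASK64)
```

With the reset value 64'hffff_ffff_ffff_ffff in stimecmp, `stip_pending` stays false until time itself reaches the maximum value, so no timer interrupt fires before software programs the register.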

Machine-mode performance event selectors (mhpmevent3 - 31) are 64-bit WARL registers used to select the performance event corresponding to each performance event counter. In the XiangShan Kunming Lake architecture, each counter can be configured with up to four performance events for combined counting. After the user writes the event index value, event combination method, and sampling privilege level to the specified event selector, the event counter matched by that event selector begins counting normally.

Machine-mode Performance Event Selector Description
Name	Bit Field	R/W	Behavior	Reset Value
OF	63	RW	Performance counter overflow flag bit. 0: set to 1 when the corresponding performance counter overflows, generating an overflow interrupt; 1: value remains unchanged when the corresponding performance counter overflows, no overflow interrupt is generated	0
MINH	62	RW	When set to 1, inhibits M mode sampling count	0
SINH	61	RW	When set to 1, inhibits S mode sampling count	0
UINH	60	RW	When set to 1, inhibits U mode sampling count	0
VSINH	59	RW	When set to 1, inhibits VS mode sampling count	0
VUINH	58	RW	When set to 1, inhibits VU mode sampling count	0
--	57:55	RW	--	0
OP_TYPE2	54:50	RW	Counter event combination method control bits. 5'b00000: combine with OR; 5'b00001: combine with AND; 5'b00010: combine with XOR; 5'b00100: combine with ADD	0
OP_TYPE1	49:45	RW	Same encoding as OP_TYPE2	0
OP_TYPE0	44:40	RW	Same encoding as OP_TYPE2	0
EVENT3	39:30	RW	Counter performance event index value. 0: the corresponding event counter does not count; otherwise: the corresponding event counter counts the event with this index	--
EVENT2	29:20	RW	Same as EVENT3	--
EVENT1	19:10	RW	Same as EVENT3	--
EVENT0	9:0	RW	Same as EVENT3	--
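
The field layout in the table can be packed into a 64-bit selector value as follows. This is a sketch for illustration: `encode_mhpmevent` and its parameter names are hypothetical, and the reserved bits 57:55 are simply left at 0.

```python
def encode_mhpmevent(events=(0, 0, 0, 0), op_types=(0, 0, 0),
                     of=0, minh=0, sinh=0, uinh=0, vsinh=0, vuinh=0):
    """Pack selector fields into a 64-bit mhpmevent value.

    events   = (EVENT0, EVENT1, EVENT2, EVENT3), 10 bits each
    op_types = (OP_TYPE0, OP_TYPE1, OP_TYPE2), 5 bits each
    """
    if len(events) != 4 or len(op_types) != 3:
        raise ValueError("expected 4 event indices and 3 operation types")
    value = 0
    for i, event in enumerate(events):
        if not 0 <= event < (1 << 10):
            raise ValueError(f"EVENT{i} out of range: {event}")
        value |= event << (10 * i)       # EVENT0 at 9:0 ... EVENT3 at 39:30
    for i, op in enumerate(op_types):
        if not 0 <= op < (1 << 5):
            raise ValueError(f"OP_TYPE{i} out of range: {op}")
        value |= op << (40 + 5 * i)      # OP_TYPE0 at 44:40 ... OP_TYPE2 at 54:50
    value |= (vuinh & 1) << 58
    value |= (vsinh & 1) << 59
    value |= (uinh & 1) << 60
    value |= (sinh & 1) << 61
    value |= (minh & 1) << 62
    value |= (of & 1) << 63
    return value
```

For instance, `encode_mhpmevent(events=(22, 0, 0, 0))` yields 22, since only EVENT0 is set and every other field keeps its default of 0.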

Among these, the event combination method for counters is:

EVENT0 and EVENT1 events are combined using the OP_TYPE0 operation to form RESULT0.
EVENT2 and EVENT3 events are combined using the OP_TYPE1 operation to form RESULT1.
The combined result of RESULT0 and RESULT1 uses the OP_TYPE2 operation to form RESULT2.
RESULT2 is accumulated in the corresponding event counter.
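
The two-level combination can be modeled in a few lines. This is a sketch under stated assumptions, not the HPerfCounter implementation: each event is assumed to contribute a small non-negative integer per cycle, OR/AND/XOR are assumed to act bitwise on those integers, and encodings not listed in the selector table raise an error here because their hardware behavior is not described. The function and constant names are hypothetical.

```python
OP_OR, OP_AND, OP_XOR, OP_ADD = 0b00000, 0b00001, 0b00010, 0b00100


def combine(a, b, op_type):
    """Combine two event values with one of the four documented operations."""
    if op_type == OP_OR:
        return a | b
    if op_type == OP_AND:
        return a & b
    if op_type == OP_XOR:
        return a ^ b
    if op_type == OP_ADD:
        return a + b
    raise ValueError(f"undocumented OP_TYPE encoding: {op_type:#07b}")


def counter_increment(events, op_types):
    """Return RESULT2, the value added to the event counter."""
    event0, event1, event2, event3 = events
    op_type0, op_type1, op_type2 = op_types
    result0 = combine(event0, event1, op_type0)   # EVENT0 with EVENT1
    result1 = combine(event2, event3, op_type1)   # EVENT2 with EVENT3
    return combine(result0, result1, op_type2)    # RESULT0 with RESULT1
```

With all three operations set to ADD, the counter accumulates the sum of the four selected events. With the default OR and only EVENT0 configured, it follows EVENT0 alone.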

The reset value for the event index portion of the performance event selector is specified as 0.

The Kunming Lake architecture classifies the provided performance events into four categories based on their source: frontend, backend, memory access, and cache. It also divides the counters into four parts, which respectively record performance events from the four sources mentioned above:

Frontend: mhpmevent3-10
Backend: mhpmevent11-18
Memory Access: mhpmevent19-26
Cache: mhpmevent27-31

Kunming Lake Frontend Performance Event Index Table
Index	Event
0	noEvent
1	frontendFlush
2	ifu_req
3	ifu_miss
4	ifu_req_cacheline_0
5	ifu_req_cacheline_1
6	ifu_req_cacheline_0_hit
7	ifu_req_cacheline_1_hit
8	only_0_hit
9	only_0_miss
10	hit_0_hit_1
11	hit_0_miss_1
12	miss_0_hit_1
13	miss_0_miss_1
14	IBuffer_Flushed
15	IBuffer_hungry
16	IBuffer_1_4_valid
17	IBuffer_2_4_valid
18	IBuffer_3_4_valid
19	IBuffer_4_4_valid
20	IBuffer_full
21	Front_Bubble
22	Fetch_Latency_Bound
23	icache_miss_cnt
24	icache_miss_penalty
25	bpu_s2_redirect
26	bpu_s3_redirect
27	bpu_to_ftq_stall
28	mispredictRedirect
29	replayRedirect
30	predecodeRedirect
31	to_ifu_bubble
32	from_bpu_real_bubble
33	BpInstr
34	BpBInstr
35	BpRight
36	BpWrong
37	BpBRight
38	BpBWrong
39	BpJRight
40	BpJWrong
41	BpIRight
42	BpIWrong
43	BpCRight
44	BpCWrong
45	BpRRight
46	BpRWrong
47	ftb_false_hit
48	ftb_hit
49	fauftb_commit_hit
50	fauftb_commit_miss
51	tage_tht_hit
52	sc_update_on_mispred
53	sc_update_on_unconf
54	ftb_commit_hits
55	ftb_commit_misses

Kunming Lake Backend Performance Event Index Table
Index	Event
0	noEvent
1	decoder_fused_instr
2	decoder_waitInstr
3	decoder_stall_cycle
4	decoder_utilization
5	INST_SPEC
6	RECOVERY_BUBBLE
7	rename_in
8	rename_waitinstr
9	rename_stall
10	rename_stall_cycle_walk
11	rename_stall_cycle_dispatch
12	rename_stall_cycle_int
13	rename_stall_cycle_fp
14	rename_stall_cycle_vec
15	rename_stall_cycle_v0
16	rename_stall_cycle_vl
17	me_freelist_1_4_valid
18	me_freelist_2_4_valid
19	me_freelist_3_4_valid
20	me_freelist_4_4_valid
21	std_freelist_1_4_valid
22	std_freelist_2_4_valid
23	std_freelist_3_4_valid
24	std_freelist_4_4_valid
25	std_freelist_1_4_valid
26	std_freelist_2_4_valid
27	std_freelist_3_4_valid
28	std_freelist_4_4_valid
29	std_freelist_1_4_valid
30	std_freelist_2_4_valid
31	std_freelist_3_4_valid
32	std_freelist_4_4_valid
33	std_freelist_1_4_valid
34	std_freelist_2_4_valid
35	std_freelist_3_4_valid
36	std_freelist_4_4_valid
37	dispatch_in
38	dispatch_empty
39	dispatch_utili
40	dispatch_waitinstr
41	dispatch_stall_cycle_lsq
42	dispatch_stall_cycle_rob
43	dispatch_stall_cycle_int_dq
44	dispatch_stall_cycle_fp_dq
45	dispatch_stall_cycle_ls_dq
46	rob_interrupt_num
47	rob_exception_num
48	rob_flush_pipe_num
49	rob_replay_inst_num
50	rob_commitUop
51	rob_commitInstr
52	rob_commitInstrFused
53	rob_commitInstrLoad
54	rob_commitInstrBranch
55	rob_commitInstrStore
56	rob_walkInstr
57	rob_walkCycle
58	rob_1_4_valid
59	rob_2_4_valid
60	rob_3_4_valid
61	rob_4_4_valid
62	BR_MIS_PRED
63	TOTAL_FLUSH
64	EXEC_STALL_CYCLE
65	MEMSTALL_STORE
66	MEMSTALL_L1MISS
67	MEMSTALL_L2MISS
68	MEMSTALL_L3MISS
69	issueQueue_enq_fire_cnt
70	IssueQueueAluMulBkuBrhJmp_full
71	IssueQueueAluMulBkuBrhJmp_full
72	IssueQueueAluBrhJmpI2fVsetriwiVsetriwvfI2v_full
73	IssueQueueAluCsrFenceDiv_full
74	issueQueue_enq_fire_cnt
75	IssueQueueFaluFcvtF2vFmacFdiv_full
76	IssueQueueFaluFmacFdiv_full
77	IssueQueueFaluFmac_full
78	issueQueue_enq_fire_cnt
79	IssueQueueVfmaVialuFixVimacVppuVfaluVfcvtVipuVsetrvfwvf_full
80	IssueQueueVfmaVialuFixVfalu_full
81	IssueQueueVfdivVidiv_full
82	issueQueue_enq_fire_cnt
83	IssueQueueStaMou_full
84	IssueQueueStaMou_full
85	IssueQueueLdu_full
86	IssueQueueLdu_full
87	IssueQueueLdu_full
88	IssueQueueVlduVstuVseglduVsegstu_full
89	IssueQueueVlduVstu_full
90	IssueQueueStdMoud_full
91	IssueQueueStdMoud_full

Kunming Lake Memory Access Performance Event Index Table
Index	Event
0	noEvent
1	load_s0_in_fire
2	load_to_load_forward
3	stall_dcache
4	load_s1_in_fire
5	load_s1_tlb_miss
6	load_s2_in_fire
7	load_s2_dcache_miss
8	load_s0_in_fire
9	load_to_load_forward
10	stall_dcache
11	load_s1_in_fire
12	load_s1_tlb_miss
13	load_s2_in_fire
14	load_s2_dcache_miss
15	load_s0_in_fire
16	load_to_load_forward
17	stall_dcache
18	load_s1_in_fire
19	load_s1_tlb_miss
20	load_s2_in_fire
21	load_s2_dcache_miss
22	sbuffer_req_valid
23	sbuffer_req_fire
24	sbuffer_merge
25	sbuffer_newline
26	dcache_req_valid
27	dcache_req_fire
28	sbuffer_idle
29	sbuffer_flush
30	sbuffer_replace
31	mpipe_resp_valid
32	replay_resp_valid
33	coh_timeout
34	sbuffer_1_4_valid
35	sbuffer_2_4_valid
36	sbuffer_3_4_valid
37	sbuffer_full_valid
38	MEMSTALL_ANY_LOAD
39	enq
40	ld_ld_violation
41	enq
42	stld_rollback
43	enq
44	deq
45	deq_block
46	replay_full
47	replay_rar_nack
48	replay_raw_nack
49	replay_nuke
50	replay_mem_amb
51	replay_tlb_miss
52	replay_bank_conflict
53	replay_dcache_replay
54	replay_forward_fail
55	replay_dcache_miss
56	full_mask_000
57	full_mask_001
58	full_mask_010
59	full_mask_011
60	full_mask_100
61	full_mask_101
62	full_mask_110
63	full_mask_111
64	nuke_rollback
65	nack_rollback
66	mmioCycle
67	mmioCnt
68	mmio_wb_success
69	mmio_wb_blocked
70	stq_1_4_valid
71	stq_2_4_valid
72	stq_3_4_valid
73	stq_4_4_valid
74	dcache_wbq_req
75	dcache_wbq_1_4_valid
76	dcache_wbq_2_4_valid
77	dcache_wbq_3_4_valid
78	dcache_wbq_4_4_valid
79	dcache_mp_req
80	dcache_mp_total_penalty
81	dcache_missq_req
82	dcache_missq_1_4_valid
83	dcache_missq_2_4_valid
84	dcache_missq_3_4_valid
85	dcache_missq_4_4_valid
86	dcache_probq_req
87	dcache_probq_1_4_valid
88	dcache_probq_2_4_valid
89	dcache_probq_3_4_valid
90	dcache_probq_4_4_valid
91	load_req
92	load_replay
93	load_replay_for_data_nack
94	load_replay_for_no_mshr
95	load_replay_for_conflict
96	load_req
97	load_replay
98	load_replay_for_data_nack
99	load_replay_for_no_mshr
100	load_replay_for_conflict
101	load_req
102	load_replay
103	load_replay_for_data_nack
104	load_replay_for_no_mshr
105	load_replay_for_conflict
106	PTW_tlbllptw_incount
107	PTW_tlbllptw_inblock
108	PTW_tlbllptw_memcount
109	PTW_tlbllptw_memcycle
110	PTW_access
111	PTW_l2_hit
112	PTW_l1_hit
113	PTW_l0_hit
114	PTW_sp_hit
115	PTW_pte_hit
116	PTW_rwHarzad
117	PTW_out_blocked
118	PTW_fsm_count
119	PTW_fsm_busy
120	PTW_fsm_idle
121	PTW_resp_blocked
122	PTW_mem_count
123	PTW_mem_cycle
124	PTW_mem_blocked
125	ldDeqCount
126	stDeqCount

Kunming Lake Cache Performance Event Index Table
Index	Event
0	noEvent
1	Slice0_l2_cache_refill
2	Slice0_l2_cache_rd_refill
3	Slice0_l2_cache_wr_refill
4	Slice0_l2_cache_long_miss
5	Slice0_l2_cache_access
6	Slice0_l2_cache_l2wb
7	Slice0_l2_cache_l1wb
8	Slice0_l2_cache_wb_victim
9	Slice0_l2_cache_wb_cleaning_coh
10	Slice0_l2_cache_access_rd
11	Slice0_l2_cache_access_wr
12	Slice0_l2_cache_inv
13	Slice1_l2_cache_refill
14	Slice1_l2_cache_rd_refill
15	Slice1_l2_cache_wr_refill
16	Slice1_l2_cache_long_miss
17	Slice1_l2_cache_access
18	Slice1_l2_cache_l2wb
19	Slice1_l2_cache_l1wb
20	Slice1_l2_cache_wb_victim
21	Slice1_l2_cache_wb_cleaning_coh
22	Slice1_l2_cache_access_rd
23	Slice1_l2_cache_access_wr
24	Slice1_l2_cache_inv
25	Slice2_l2_cache_refill
26	Slice2_l2_cache_rd_refill
27	Slice2_l2_cache_wr_refill
28	Slice2_l2_cache_long_miss
29	Slice2_l2_cache_access
30	Slice2_l2_cache_l2wb
31	Slice2_l2_cache_l1wb
32	Slice2_l2_cache_wb_victim
33	Slice2_l2_cache_wb_cleaning_coh
34	Slice2_l2_cache_access_rd
35	Slice2_l2_cache_access_wr
36	Slice2_l2_cache_inv
37	Slice3_l2_cache_refill
38	Slice3_l2_cache_rd_refill
39	Slice3_l2_cache_wr_refill
40	Slice3_l2_cache_long_miss
41	Slice3_l2_cache_access
42	Slice3_l2_cache_l2wb
43	Slice3_l2_cache_l1wb
44	Slice3_l2_cache_wb_victim
45	Slice3_l2_cache_wb_cleaning_coh
46	Slice3_l2_cache_access_rd
47	Slice3_l2_cache_access_wr
48	Slice3_l2_cache_inv

Topdown PMU

Topdown performance analysis is a top-down analysis method used to quickly analyze CPU performance bottlenecks. Its core idea is to progressively decompose performance issues from high-level categories downward, refining the problem layer by layer, ultimately pinpointing the root cause. We have implemented a three-layer Topdown performance event structure, as shown below:

Three-layer Topdown Performance Events
Level 1	Level 2	Level 3	Description	Formula
Retiring	-	-	Instruction commit impact	INST_RETIRED / (IssueBW * CPU_CYCLES)
FrontEnd Bound	-	-	Frontend impact	IF_FETCH_BUBBLE / (IssueBW * CPU_CYCLES)
-	Fetch Latency Bound	-	Fetch latency impact	IF_FETCH_BUBBLE_EQ_MAX / CPU_CYCLES
-	Fetch Bandwidth Bound	-	Fetch bandwidth impact	FrontEnd Bound - Fetch Latency Bound
Bad Speculation	-	-	Misprediction impact	(INST_SPEC - INST_RETIRED + RECOVERY_BUBBLE) / (IssueBW * CPU_CYCLES)
-	Branch Mispredict	-	Mispredicted branch instruction impact	Bad Speculation * BR_MIS_PRED / TOTAL_FLUSH
-	Machine Clears	-	Machine clears event impact	Bad Speculation - Branch Mispredict
BackEnd Bound	-	-	Backend impact	1 - (FrontEnd Bound + Bad Speculation + Retiring)
-	Core Bound	-	Core impact	(EXEC_STALL_CYCLE - MEMSTALL_ANY_LOAD - MEMSTALL_STORE) / CPU_CYCLES
-	Memory Bound	-	Memory access impact	(MEMSTALL_ANY_LOAD + MEMSTALL_STORE) / CPU_CYCLES
-	-	L1 Bound	L1 impact	(MEMSTALL_ANY_LOAD - MEMSTALL_L1MISS) / CPU_CYCLES
-	-	L2 Bound	L2 impact	(MEMSTALL_L1MISS - MEMSTALL_L2MISS) / CPU_CYCLES
-	-	L3 Bound	L3 impact	(MEMSTALL_L2MISS - MEMSTALL_L3MISS) / CPU_CYCLES
-	-	Mem Bound	External memory impact	MEMSTALL_L3MISS / CPU_CYCLES
-	-	Store Bound	Store instruction impact	MEMSTALL_STORE / CPU_CYCLES

Where IssueBW is the issue width. The XiangShan Kunming Lake architecture currently has an issue width of 6.
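
The Level 1 formulas can be evaluated directly from raw counter readings. The sketch below is illustrative only: `topdown_level1` and its argument names are hypothetical, and the sample numbers in the usage line are made up rather than measured.

```python
def topdown_level1(inst_retired, inst_spec, recovery_bubble,
                   if_fetch_bubble, cpu_cycles, issue_bw=6):
    """Compute the four Level 1 Topdown fractions from raw event counts."""
    if cpu_cycles <= 0:
        raise ValueError("cpu_cycles must be positive")
    slots = issue_bw * cpu_cycles  # total issue slots in the measured interval
    retiring = inst_retired / slots
    frontend_bound = if_fetch_bubble / slots
    bad_speculation = (inst_spec - inst_retired + recovery_bubble) / slots
    backend_bound = 1 - (frontend_bound + bad_speculation + retiring)
    return {
        "Retiring": retiring,
        "FrontEnd Bound": frontend_bound,
        "Bad Speculation": bad_speculation,
        "BackEnd Bound": backend_bound,
    }


# Made-up counts, for illustration only.
metrics = topdown_level1(inst_retired=3000, inst_spec=3300,
                         recovery_bubble=300, if_fetch_bubble=600,
                         cpu_cycles=1000)
```

Because BackEnd Bound is defined as the remainder, the four fractions always sum to 1.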

Topdown Performance Events
Name	Corresponding Performance Event	Description
CPU_CYCLES	-	Total clock cycles after all instructions are committed
INST_RETIRED	rob_commitInstr	Number of successfully committed instructions
INST_SPEC	-	Number of speculatively executed instructions
IF_FETCH_BUBBLE	Front_Bubble	Number of bubbles fetched from the fetch buffer while no backend stall exists
IF_FETCH_BUBBLE_EQ_MAX	Fetch_Latency_Bound	Cycles where 0 instructions are fetched from the fetch buffer while no backend stall exists
BR_MIS_PRED	-	Number of mispredicted branch instructions
TOTAL_FLUSH	-	Number of pipeline flush events
RECOVERY_BUBBLE	-	Number of cycles recovered from early mispredictions
EXEC_STALL_CYCLE	-	Number of cycles where few uops are issued
MEMSTALL_ANY_LOAD	-	No uops are issued, and at least one Load instruction is not completed
MEMSTALL_STORE	-	Non-Store instruction uops are issued, and there is a Store instruction not completed
MEMSTALL_L1MISS	-	No uops are issued, at least one Load instruction is not completed, and an L1-cache Miss occurred
MEMSTALL_L2MISS	-	No uops are issued, at least one Load instruction is not completed, and an L2-cache Miss occurred
MEMSTALL_L3MISS	-	No uops are issued, at least one Load instruction is not completed, and an L3-cache Miss occurred

To count the impact of frontend fetch latency over a period, we can set the EVENT0 field of mhpmevent3 to 22, leaving the other bits at their default values. Then, run the test. After the test is completed, the mhpmcounter3 register can be read via a CSR read instruction to obtain the number of cycles of frontend fetch latency during this period. The impact caused by frontend fetch latency can then be calculated.
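
As a worked version of this example, the selector value and the final ratio can be computed as below. This is a sketch: the helper name is hypothetical, taking the cycle count from mcycle over the same interval is an assumption (the event table lists no corresponding event for CPU_CYCLES), and the sample readings are made up.

```python
FETCH_LATENCY_BOUND_INDEX = 22  # Fetch_Latency_Bound in the frontend index table

# EVENT0 = 22, all other fields left at their default of 0.
mhpmevent3_value = FETCH_LATENCY_BOUND_INDEX


def fetch_latency_bound(mhpmcounter3_delta, mcycle_delta):
    """Fetch Latency Bound = IF_FETCH_BUBBLE_EQ_MAX / CPU_CYCLES."""
    if mcycle_delta <= 0:
        raise ValueError("mcycle_delta must be positive")
    return mhpmcounter3_delta / mcycle_delta


# Made-up readings: 250 fetch-latency cycles out of 1000 total cycles.
ratio = fetch_latency_bound(250, 1000)
```

Both deltas are the difference between the counter values read after and before the test, so that counts from earlier code are excluded.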

The performance event counters in the XiangShan Kunming Lake architecture are divided into three groups: Machine-mode event counters, Supervisor-mode event counters, and User-mode event counters.

Machine-mode Event Counter List
Name	Index	R/W	Description	Reset Value
MCYCLE	0xB00	RW	Machine-mode clock cycle counter	-
MINSTRET	0xB02	RW	Machine-mode retired instruction counter	-
MHPMCOUNTER3-31	0xB03-0xB1F	RW	Machine-mode performance event counter	0

The MHPMCOUNTERx counters are controlled by the corresponding MHPMEVENTx, which specifies the corresponding performance events to count.

Supervisor-mode event counters include the Supervisor-mode Counter Overflow Interrupt Flag Register (SCOUNTOVF).

Supervisor-mode Counter Overflow Interrupt Flag Register (SCOUNTOVF) Description
Name	Bit Field	R/W	Behavior	Reset Value
OFVEC	31:3	RO	Overflow flag bits for the mhpmcounterx registers. 1: overflow occurred; 0: no overflow occurred	0
--	2:0	RO 0	--	0

scountovf serves as a read-only mapping of the OF bits in the mhpmevent registers and is controlled by xcounteren:

M-mode accessing scountovf can read the correct value.
HS-mode accessing scountovf: If mcounteren.HPMx is 1, the corresponding OFVECx can be read correctly; otherwise, it reads 0.
VS-mode accessing scountovf: If both mcounteren.HPMx and hcounteren.HPMx are 1, the corresponding OFVECx can be read correctly; otherwise, it reads 0.
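
The three read rules amount to masking the overflow flags with the enable registers. The following sketch models this: `read_scountovf` is a hypothetical helper, and `of_bits` is assumed to hold the per-counter overflow flag of counter x at bit x.

```python
OFVEC_MASK = 0xFFFF_FFF8  # OFVEC occupies bits 31:3; bits 2:0 read as 0


def read_scountovf(mode, of_bits, mcounteren, hcounteren):
    """Return the scountovf value visible to `mode`."""
    value = of_bits & OFVEC_MASK
    if mode == "M":
        return value                            # M-mode sees every flag
    if mode == "HS":
        return value & mcounteren               # needs mcounteren.HPMx
    if mode == "VS":
        return value & mcounteren & hcounteren  # needs both enable bits
    raise ValueError(f"scountovf read not modeled for mode: {mode}")
```

A flag whose enable bit is clear is simply read as 0 in HS-mode or VS-mode; the read itself does not trap in this model.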

User-mode Event Counter List
Name	Index	R/W	Description	Reset Value
CYCLE	0xC00	RO	User-mode read-only copy of mcycle register	-
TIME	0xC01	RO	User-mode read-only copy of memory-mapped register mtime	-
INSTRET	0xC02	RO	User-mode read-only copy of minstret register	-
HPMCOUNTER3-31	0xC03-0xC1F	RO	User-mode read-only copy of mhpmcounter3-31 registers	0
