HPM
- Version: V2R2
- Status: OK
- Date: 2025/02/27
- commit: xxx
Basic Information
Terminology Description
Abbreviation | Full Name | Description |
---|---|---|
HPM | Hardware performance monitor | Hardware performance counter unit |
Submodule List
Submodule | Description |
---|---|
HPerfCounter | Individual counter module |
HPerfMonitor | Counter organization module |
PFEvent | Copy of the mhpmevent registers |
Design Specifications
- Implemented basic hardware performance monitoring functions based on the RISC-V privileged specification, with additional support for sstc and sscofpmf extensions.
- Clock cycles executed by a hardware thread (mcycle)
- Number of instructions committed by a hardware thread (minstret)
- Hardware timer (time)
- Counter overflow flag (scountovf)
- 29 hardware performance counters (mhpmcounter3 - mhpmcounter31)
- 29 hardware performance event selectors (mhpmevent3 - mhpmevent31)
- Supports definition of up to 2^10 types of performance events
Features
The basic functions of HPM are as follows:
- Disable all performance event monitoring via the mcountinhibit register.
- Initialize performance event counters for each monitoring unit, including: mcycle, minstret, mhpmcounter3 - mhpmcounter31.
- Configure performance event selectors for each monitoring unit, including: mhpmevent3 - mhpmevent31. In the XiangShan Kunming Lake architecture, each event selector can combine up to four events. Software writes the event index values, the event combination method, and the sampling privilege level to the event selector. The configured events are then counted at the specified privilege levels, and the combined result is accumulated in the matching event counter.
- Configure xcounteren for access permission authorization.
- Enable all performance event monitoring via the mcountinhibit register to start counting.
HPM Event Overflow Interrupt
The overflow interrupt LCOFIP raised by the Kunming Lake performance monitoring unit has a unified interrupt vector number of 13, which is the LCOFIP bit position defined by the Sscofpmf extension. It is enabled and handled in the same way as standard local interrupts.
Overall Design
Performance events are defined in each submodule. Submodules assemble performance events into the io_perf output by calling generatePerfEvent and send them to four main modules: Frontend, Backend, MemBlock, and CoupledL2.
The four main modules obtain the performance event output from their submodules by calling the get_perf method. The PFEvent module is also instantiated within each main module as a copy of the CSR mhpmevent registers. The main module aggregates the required performance event selector data and the performance event output from its submodules and connects them to the HPerfMonitor module, which calculates the increments applied to the performance event counters.
Finally, the CSR module collects the increment results of the performance event counters from the four top-level modules and inputs them into the CSR registers mhpmcounter3-31 for cumulative counting.
CoupledL2 is a special case: its performance events are input directly into the CSR module. Based on the event selection information read from the mhpmevent registers, they are processed by an HPerfMonitor module instantiated within the CSR, and the results are accumulated in the CSR registers mhpmcounter27-31.
See the following figure for the overall design block diagram of HPM:
HPerfMonitor Counter Organization Module
HPerfMonitor routes the event selection information (events) to the corresponding HPerfCounter module and copies all performance event counting information to every HPerfCounter module.
It then collects the output from all HPerfCounter modules.
HPerfCounter Individual Counter Module
Based on the input event selection information, HPerfCounter selects the required performance event counting information, combines the selected events according to the counting mode specified in the event selection information, and outputs the result.
PFEvent Copy of mhpmevent Register
PFEvent is a copy of the CSR register mhpmevent: it collects CSR write information and keeps its copy synchronized with changes to mhpmevent.
HPM Related Control Registers
Machine-mode Performance Event Counter Inhibit Register (MCOUNTINHIBIT)
The Machine-mode Performance Event Counter Inhibit Register (mcountinhibit) is a 32-bit WARL register, primarily used to control whether hardware performance monitoring counters count. In scenarios where performance analysis is not needed, counters can be disabled to reduce processor power consumption.
Name | Bit Field | R/W | Behavior | Reset Value |
---|---|---|---|---|
HPMx | 31:3 | RW | Inhibit bit for the mhpmcounterx register (x = 3 to 31): 0 = count normally; 1 = inhibit counting | 0 |
IR | 2 | RW | Inhibit bit for the minstret register: 0 = count normally; 1 = inhibit counting | 0 |
-- | 1 | RO 0 | Reserved | 0 |
CY | 0 | RW | Inhibit bit for the mcycle register: 0 = count normally; 1 = inhibit counting | 0 |
Machine-mode Performance Event Counter Access Enable Register (MCOUNTEREN)
The Machine-mode Performance Event Counter Access Enable Register (mcounteren) is a 32-bit WARL register, primarily used to control access permissions for user-mode performance monitoring counters in privilege modes lower than Machine mode (HS-mode/VS-mode/HU-mode/VU-mode).
Name | Bit Field | R/W | Behavior | Reset Value |
---|---|---|---|---|
HPMx | 31:3 | RW | Access permission bit for the hpmcounterx register (x = 3 to 31) below M-mode: 0 = accessing hpmcounterx causes an illegal instruction exception; 1 = allows normal access to hpmcounterx | 0 |
IR | 2 | RW | Access permission bit for the instret register below M-mode: 0 = accessing instret causes an illegal instruction exception; 1 = allows normal access | 0 |
TM | 1 | RW | Access permission bit for the time/stimecmp registers below M-mode: 0 = accessing time causes an illegal instruction exception; 1 = allows normal access | 0 |
CY | 0 | RW | Access permission bit for the cycle register below M-mode: 0 = accessing cycle causes an illegal instruction exception; 1 = allows normal access | 0 |
Supervisor-mode Performance Event Counter Access Enable Register (SCOUNTEREN)
The Supervisor-mode Performance Event Counter Access Enable Register (scounteren) is a 32-bit WARL register, primarily used to control access permissions for user-mode performance monitoring counters in User mode (HU-mode/VU-mode).
Name | Bit Field | R/W | Behavior | Reset Value |
---|---|---|---|---|
HPMx | 31:3 | RW | User-mode access permission bit for the hpmcounterx register (x = 3 to 31): 0 = accessing hpmcounterx causes an illegal instruction exception; 1 = allows normal access to hpmcounterx | 0 |
IR | 2 | RW | User-mode access permission bit for the instret register: 0 = accessing instret causes an illegal instruction exception; 1 = allows normal access | 0 |
TM | 1 | RW | User-mode access permission bit for the time register: 0 = accessing time causes an illegal instruction exception; 1 = allows normal access | 0 |
CY | 0 | RW | User-mode access permission bit for the cycle register: 0 = accessing cycle causes an illegal instruction exception; 1 = allows normal access | 0 |
Virtualization-mode Performance Event Counter Access Enable Register (HCOUNTEREN)
The Virtualization-mode Performance Event Counter Access Enable Register (hcounteren) is a 32-bit WARL register, primarily used to control access permissions for user-mode performance monitoring counters in guest virtual machines (VS-mode/VU-mode).
Name | Bit Field | R/W | Behavior | Reset Value |
---|---|---|---|---|
HPMx | 31:3 | RW | Guest virtual machine access permission bit for the hpmcounterx register (x = 3 to 31): 0 = accessing hpmcounterx causes an illegal instruction exception; 1 = allows normal access to hpmcounterx | 0 |
IR | 2 | RW | Guest virtual machine access permission bit for the instret register: 0 = accessing instret causes an illegal instruction exception; 1 = allows normal access | 0 |
TM | 1 | RW | Guest virtual machine access permission bit for the time/vstimecmp (via stimecmp) registers: 0 = accessing time causes an illegal instruction exception; 1 = allows normal access | 0 |
CY | 0 | RW | Guest virtual machine access permission bit for the cycle register: 0 = accessing cycle causes an illegal instruction exception; 1 = allows normal access | 0 |
Supervisor-mode Timer Compare Register (STIMECMP)
The Supervisor-mode Timer Compare Register (stimecmp) is a 64-bit WARL register, primarily used to manage timer interrupts (STIP) in Supervisor mode.
STIMECMP Register Behavior Description:
- Reset value is a 64-bit unsigned number 64'hffff_ffff_ffff_ffff.
- If menvcfg.STCE is 0 and the current privilege level is lower than M-mode (HS-mode/VS-mode/HU-mode/VU-mode), accessing the stimecmp register causes an illegal instruction exception, and no STIP interrupt is generated.
- The stimecmp register is the source of STIP interrupts: when the unsigned integer comparison time ≥ stimecmp is true, the STIP interrupt pending signal is asserted.
- Supervisor-mode software can control the generation of timer interrupts by writing to stimecmp.
Guest Virtual Machine Supervisor-mode Timer Compare Register (VSTIMECMP)
The Guest Virtual Machine Supervisor-mode Timer Compare Register (vstimecmp) is a 64-bit WARL register, primarily used to manage timer interrupts (VSTIP) in guest virtual machine Supervisor mode.
VSTIMECMP Register Behavior Description:
- Reset value is a 64-bit unsigned number 64'hffff_ffff_ffff_ffff.
- If henvcfg.STCE is 0 or hcounteren.TM is 0, accessing the vstimecmp register via the stimecmp register causes a virtual instruction exception, and no VSTIP interrupt is generated.
- The vstimecmp register is the source of VSTIP interrupts: when the unsigned integer comparison time + htimedelta ≥ vstimecmp is true, the VSTIP interrupt pending signal is asserted.
- Guest virtual machine Supervisor-mode software can control the generation of timer interrupts in VS-mode by writing to vstimecmp.
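The two comparison rules above can be sketched as a small software model. This is an illustration only, not the hardware implementation: the function names are hypothetical, and the model assumes that time + htimedelta wraps modulo 2^64 like any 64-bit register value.

```python
MASK64 = (1 << 64) - 1
RESET_TIMECMP = 0xFFFF_FFFF_FFFF_FFFF  # reset value of stimecmp and vstimecmp


def stip_pending(time: int, stimecmp: int) -> bool:
    """STIP is pending when time >= stimecmp, compared as unsigned 64-bit values."""
    return (time & MASK64) >= (stimecmp & MASK64)


def vstip_pending(time: int, htimedelta: int, vstimecmp: int) -> bool:
    """VSTIP is pending when (time + htimedelta) >= vstimecmp, unsigned.

    Assumption: the sum is truncated to 64 bits before the comparison.
    """
    guest_time = (time + htimedelta) & MASK64
    return guest_time >= (vstimecmp & MASK64)
```

With the reset value of all ones, neither interrupt becomes pending until software programs a smaller compare value or the timer reaches its maximum.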
HPM Related Performance Event Selectors
Machine-mode performance event selectors (mhpmevent3 - 31) are 64-bit WARL registers used to select the performance event corresponding to each performance event counter. In the XiangShan Kunming Lake architecture, each counter can be configured with up to four performance events for combined counting. After the user writes the event index value, event combination method, and sampling privilege level to the specified event selector, the event counter matched by that event selector begins counting normally.
Name | Bit Field | R/W | Behavior | Reset Value |
---|---|---|---|---|
OF | 63 | RW | Performance counter overflow flag bit: 0 = set to 1 when the corresponding performance counter overflows, and an overflow interrupt is generated; 1 = value remains unchanged when the corresponding performance counter overflows, and no overflow interrupt is generated | 0 |
MINH | 62 | RW | When set to 1, inhibits M mode sampling count | 0 |
SINH | 61 | RW | When set to 1, inhibits S mode sampling count | 0 |
UINH | 60 | RW | When set to 1, inhibits U mode sampling count | 0 |
VSINH | 59 | RW | When set to 1, inhibits VS mode sampling count | 0 |
VUINH | 58 | RW | When set to 1, inhibits VU mode sampling count | 0 |
-- | 57:55 | RW | -- | 0 |
OP_TYPE2 | 54:50 | RW | Counter event combination method control bits: 5'b00000 = OR; 5'b00001 = AND; 5'b00010 = XOR; 5'b00100 = ADD | 0 |
OP_TYPE1 | 49:45 | RW | Counter event combination method control bits, same encoding as OP_TYPE2 | 0 |
OP_TYPE0 | 44:40 | RW | Counter event combination method control bits, same encoding as OP_TYPE2 | 0 |
EVENT3 | 39:30 | RW | Counter performance event index value: 0 = the corresponding event counter does not count; nonzero = the corresponding event counter counts the event with that index | 0 |
EVENT2 | 29:20 | RW | Counter performance event index value, same encoding as EVENT3 | 0 |
EVENT1 | 19:10 | RW | Counter performance event index value, same encoding as EVENT3 | 0 |
EVENT0 | 9:0 | RW | Counter performance event index value, same encoding as EVENT3 | 0 |
Among these, the event combination method for counters is:
- EVENT0 and EVENT1 events are combined using the OP_TYPE0 operation to form RESULT0.
- EVENT2 and EVENT3 events are combined using the OP_TYPE1 operation to form RESULT1.
- RESULT0 and RESULT1 are combined using the OP_TYPE2 operation to form RESULT2.
- RESULT2 is accumulated in the corresponding event counter.
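The field layout and the combination tree described above can be sketched in software as follows. This is a minimal model under stated assumptions: the helper names (pack_mhpmevent, combine) are hypothetical, and each event input is treated as a plain integer increment per cycle, which may differ from how the hardware represents event values.

```python
OP_OR, OP_AND, OP_XOR, OP_ADD = 0b00000, 0b00001, 0b00010, 0b00100
INHIBIT_BITS = {"MINH": 62, "SINH": 61, "UINH": 60, "VSINH": 59, "VUINH": 58}


def pack_mhpmevent(events=(0, 0, 0, 0), ops=(OP_OR, OP_OR, OP_OR), inhibit=()):
    """Build a 64-bit mhpmevent value from EVENT0..3, OP_TYPE0..2 and xINH names."""
    if len(events) != 4 or len(ops) != 3:
        raise ValueError("expected 4 event indices and 3 operations")
    value = 0
    for i, ev in enumerate(events):  # EVENT0 at 9:0 ... EVENT3 at 39:30
        if not 0 <= ev < (1 << 10):
            raise ValueError("event index must fit in 10 bits")
        value |= ev << (10 * i)
    for i, op in enumerate(ops):  # OP_TYPE0 at 44:40 ... OP_TYPE2 at 54:50
        if op not in (OP_OR, OP_AND, OP_XOR, OP_ADD):
            raise ValueError("unknown combination operation")
        value |= op << (40 + 5 * i)
    for name in inhibit:  # privilege-level inhibit bits 62:58
        value |= 1 << INHIBIT_BITS[name]
    return value


def apply_op(op, a, b):
    """Combine two event values with one OP_TYPE encoding."""
    if op == OP_OR:
        return a | b
    if op == OP_AND:
        return a & b
    if op == OP_XOR:
        return a ^ b
    if op == OP_ADD:
        return a + b
    raise ValueError("unknown combination operation")


def combine(ops, increments):
    """RESULT0 = E0 op0 E1, RESULT1 = E2 op1 E3, RESULT2 = RESULT0 op2 RESULT1."""
    e0, e1, e2, e3 = increments
    result0 = apply_op(ops[0], e0, e1)
    result1 = apply_op(ops[1], e2, e3)
    return apply_op(ops[2], result0, result1)
```

For example, with only EVENT0 configured and all OP_TYPE fields left at their reset value (OR), the unused slots contribute 0, so RESULT2 is simply the EVENT0 increment.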
The reset value for the event index portion of the performance event selector is specified as 0.
The Kunming Lake architecture classifies the provided performance events into four categories based on their source: frontend, backend, memory access, and cache. It also divides the counters into four parts, which respectively record performance events from the four sources mentioned above:
- Frontend: mhpmevent3-10
- Backend: mhpmevent11-18
- Memory Access: mhpmevent19-26
- Cache: mhpmevent27-31
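The partition above can be expressed as a lookup from selector number to event source. The sketch below is illustrative only; the names GROUPS and event_source are hypothetical.

```python
# Selector ranges taken from the list above (range end is exclusive).
GROUPS = {
    "Frontend": range(3, 11),
    "Backend": range(11, 19),
    "Memory Access": range(19, 27),
    "Cache": range(27, 32),
}


def event_source(index: int) -> str:
    """Return the event source that mhpmevent<index> selects from."""
    for name, selectors in GROUPS.items():
        if index in selectors:
            return name
    raise ValueError("mhpmevent index must be in 3..31")
```

An event index written to a selector is interpreted against the event table of that selector's source, so the same index value means different events in different groups.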
Frontend performance events:
Index | Event |
---|---|
0 | noEvent |
1 | frontendFlush |
2 | ifu_req |
3 | ifu_miss |
4 | ifu_req_cacheline_0 |
5 | ifu_req_cacheline_1 |
6 | ifu_req_cacheline_0_hit |
7 | ifu_req_cacheline_1_hit |
8 | only_0_hit |
9 | only_0_miss |
10 | hit_0_hit_1 |
11 | hit_0_miss_1 |
12 | miss_0_hit_1 |
13 | miss_0_miss_1 |
14 | IBuffer_Flushed |
15 | IBuffer_hungry |
16 | IBuffer_1_4_valid |
17 | IBuffer_2_4_valid |
18 | IBuffer_3_4_valid |
19 | IBuffer_4_4_valid |
20 | IBuffer_full |
21 | Front_Bubble |
22 | Fetch_Latency_Bound |
23 | icache_miss_cnt |
24 | icache_miss_penalty |
25 | bpu_s2_redirect |
26 | bpu_s3_redirect |
27 | bpu_to_ftq_stall |
28 | mispredictRedirect |
29 | replayRedirect |
30 | predecodeRedirect |
31 | to_ifu_bubble |
32 | from_bpu_real_bubble |
33 | BpInstr |
34 | BpBInstr |
35 | BpRight |
36 | BpWrong |
37 | BpBRight |
38 | BpBWrong |
39 | BpJRight |
40 | BpJWrong |
41 | BpIRight |
42 | BpIWrong |
43 | BpCRight |
44 | BpCWrong |
45 | BpRRight |
46 | BpRWrong |
47 | ftb_false_hit |
48 | ftb_hit |
49 | fauftb_commit_hit |
50 | fauftb_commit_miss |
51 | tage_tht_hit |
52 | sc_update_on_mispred |
53 | sc_update_on_unconf |
54 | ftb_commit_hits |
55 | ftb_commit_misses |
Backend performance events:
Index | Event |
---|---|
0 | noEvent |
1 | decoder_fused_instr |
2 | decoder_waitInstr |
3 | decoder_stall_cycle |
4 | decoder_utilization |
5 | INST_SPEC |
6 | RECOVERY_BUBBLE |
7 | rename_in |
8 | rename_waitinstr |
9 | rename_stall |
10 | rename_stall_cycle_walk |
11 | rename_stall_cycle_dispatch |
12 | rename_stall_cycle_int |
13 | rename_stall_cycle_fp |
14 | rename_stall_cycle_vec |
15 | rename_stall_cycle_v0 |
16 | rename_stall_cycle_vl |
17 | me_freelist_1_4_valid |
18 | me_freelist_2_4_valid |
19 | me_freelist_3_4_valid |
20 | me_freelist_4_4_valid |
21 | std_freelist_1_4_valid |
22 | std_freelist_2_4_valid |
23 | std_freelist_3_4_valid |
24 | std_freelist_4_4_valid |
25 | std_freelist_1_4_valid |
26 | std_freelist_2_4_valid |
27 | std_freelist_3_4_valid |
28 | std_freelist_4_4_valid |
29 | std_freelist_1_4_valid |
30 | std_freelist_2_4_valid |
31 | std_freelist_3_4_valid |
32 | std_freelist_4_4_valid |
33 | std_freelist_1_4_valid |
34 | std_freelist_2_4_valid |
35 | std_freelist_3_4_valid |
36 | std_freelist_4_4_valid |
37 | dispatch_in |
38 | dispatch_empty |
39 | dispatch_utili |
40 | dispatch_waitinstr |
41 | dispatch_stall_cycle_lsq |
42 | dispatch_stall_cycle_rob |
43 | dispatch_stall_cycle_int_dq |
44 | dispatch_stall_cycle_fp_dq |
45 | dispatch_stall_cycle_ls_dq |
46 | rob_interrupt_num |
47 | rob_exception_num |
48 | rob_flush_pipe_num |
49 | rob_replay_inst_num |
50 | rob_commitUop |
51 | rob_commitInstr |
52 | rob_commitInstrFused |
53 | rob_commitInstrLoad |
54 | rob_commitInstrBranch |
55 | rob_commitInstrStore |
56 | rob_walkInstr |
57 | rob_walkCycle |
58 | rob_1_4_valid |
59 | rob_2_4_valid |
60 | rob_3_4_valid |
61 | rob_4_4_valid |
62 | BR_MIS_PRED |
63 | TOTAL_FLUSH |
64 | EXEC_STALL_CYCLE |
65 | MEMSTALL_STORE |
66 | MEMSTALL_L1MISS |
67 | MEMSTALL_L2MISS |
68 | MEMSTALL_L3MISS |
69 | issueQueue_enq_fire_cnt |
70 | IssueQueueAluMulBkuBrhJmp_full |
71 | IssueQueueAluMulBkuBrhJmp_full |
72 | IssueQueueAluBrhJmpI2fVsetriwiVsetriwvfI2v_full |
73 | IssueQueueAluCsrFenceDiv_full |
74 | issueQueue_enq_fire_cnt |
75 | IssueQueueFaluFcvtF2vFmacFdiv_full |
76 | IssueQueueFaluFmacFdiv_full |
77 | IssueQueueFaluFmac_full |
78 | issueQueue_enq_fire_cnt |
79 | IssueQueueVfmaVialuFixVimacVppuVfaluVfcvtVipuVsetrvfwvf_full |
80 | IssueQueueVfmaVialuFixVfalu_full |
81 | IssueQueueVfdivVidiv_full |
82 | issueQueue_enq_fire_cnt |
83 | IssueQueueStaMou_full |
84 | IssueQueueStaMou_full |
85 | IssueQueueLdu_full |
86 | IssueQueueLdu_full |
87 | IssueQueueLdu_full |
88 | IssueQueueVlduVstuVseglduVsegstu_full |
89 | IssueQueueVlduVstu_full |
90 | IssueQueueStdMoud_full |
91 | IssueQueueStdMoud_full |
Memory access performance events:
Index | Event |
---|---|
0 | noEvent |
1 | load_s0_in_fire |
2 | load_to_load_forward |
3 | stall_dcache |
4 | load_s1_in_fire |
5 | load_s1_tlb_miss |
6 | load_s2_in_fire |
7 | load_s2_dcache_miss |
8 | load_s0_in_fire |
9 | load_to_load_forward |
10 | stall_dcache |
11 | load_s1_in_fire |
12 | load_s1_tlb_miss |
13 | load_s2_in_fire |
14 | load_s2_dcache_miss |
15 | load_s0_in_fire |
16 | load_to_load_forward |
17 | stall_dcache |
18 | load_s1_in_fire |
19 | load_s1_tlb_miss |
20 | load_s2_in_fire |
21 | load_s2_dcache_miss |
22 | sbuffer_req_valid |
23 | sbuffer_req_fire |
24 | sbuffer_merge |
25 | sbuffer_newline |
26 | dcache_req_valid |
27 | dcache_req_fire |
28 | sbuffer_idle |
29 | sbuffer_flush |
30 | sbuffer_replace |
31 | mpipe_resp_valid |
32 | replay_resp_valid |
33 | coh_timeout |
34 | sbuffer_1_4_valid |
35 | sbuffer_2_4_valid |
36 | sbuffer_3_4_valid |
37 | sbuffer_full_valid |
38 | MEMSTALL_ANY_LOAD |
39 | enq |
40 | ld_ld_violation |
41 | enq |
42 | stld_rollback |
43 | enq |
44 | deq |
45 | deq_block |
46 | replay_full |
47 | replay_rar_nack |
48 | replay_raw_nack |
49 | replay_nuke |
50 | replay_mem_amb |
51 | replay_tlb_miss |
52 | replay_bank_conflict |
53 | replay_dcache_replay |
54 | replay_forward_fail |
55 | replay_dcache_miss |
56 | full_mask_000 |
57 | full_mask_001 |
58 | full_mask_010 |
59 | full_mask_011 |
60 | full_mask_100 |
61 | full_mask_101 |
62 | full_mask_110 |
63 | full_mask_111 |
64 | nuke_rollback |
65 | nack_rollback |
66 | mmioCycle |
67 | mmioCnt |
68 | mmio_wb_success |
69 | mmio_wb_blocked |
70 | stq_1_4_valid |
71 | stq_2_4_valid |
72 | stq_3_4_valid |
73 | stq_4_4_valid |
74 | dcache_wbq_req |
75 | dcache_wbq_1_4_valid |
76 | dcache_wbq_2_4_valid |
77 | dcache_wbq_3_4_valid |
78 | dcache_wbq_4_4_valid |
79 | dcache_mp_req |
80 | dcache_mp_total_penalty |
81 | dcache_missq_req |
82 | dcache_missq_1_4_valid |
83 | dcache_missq_2_4_valid |
84 | dcache_missq_3_4_valid |
85 | dcache_missq_4_4_valid |
86 | dcache_probq_req |
87 | dcache_probq_1_4_valid |
88 | dcache_probq_2_4_valid |
89 | dcache_probq_3_4_valid |
90 | dcache_probq_4_4_valid |
91 | load_req |
92 | load_replay |
93 | load_replay_for_data_nack |
94 | load_replay_for_no_mshr |
95 | load_replay_for_conflict |
96 | load_req |
97 | load_replay |
98 | load_replay_for_data_nack |
99 | load_replay_for_no_mshr |
100 | load_replay_for_conflict |
101 | load_req |
102 | load_replay |
103 | load_replay_for_data_nack |
104 | load_replay_for_no_mshr |
105 | load_replay_for_conflict |
106 | PTW_tlbllptw_incount |
107 | PTW_tlbllptw_inblock |
108 | PTW_tlbllptw_memcount |
109 | PTW_tlbllptw_memcycle |
110 | PTW_access |
111 | PTW_l2_hit |
112 | PTW_l1_hit |
113 | PTW_l0_hit |
114 | PTW_sp_hit |
115 | PTW_pte_hit |
116 | PTW_rwHarzad |
117 | PTW_out_blocked |
118 | PTW_fsm_count |
119 | PTW_fsm_busy |
120 | PTW_fsm_idle |
121 | PTW_resp_blocked |
122 | PTW_mem_count |
123 | PTW_mem_cycle |
124 | PTW_mem_blocked |
125 | ldDeqCount |
126 | stDeqCount |
Cache performance events:
Index | Event |
---|---|
0 | noEvent |
1 | Slice0_l2_cache_refill |
2 | Slice0_l2_cache_rd_refill |
3 | Slice0_l2_cache_wr_refill |
4 | Slice0_l2_cache_long_miss |
5 | Slice0_l2_cache_access |
6 | Slice0_l2_cache_l2wb |
7 | Slice0_l2_cache_l1wb |
8 | Slice0_l2_cache_wb_victim |
9 | Slice0_l2_cache_wb_cleaning_coh |
10 | Slice0_l2_cache_access_rd |
11 | Slice0_l2_cache_access_wr |
12 | Slice0_l2_cache_inv |
13 | Slice1_l2_cache_refill |
14 | Slice1_l2_cache_rd_refill |
15 | Slice1_l2_cache_wr_refill |
16 | Slice1_l2_cache_long_miss |
17 | Slice1_l2_cache_access |
18 | Slice1_l2_cache_l2wb |
19 | Slice1_l2_cache_l1wb |
20 | Slice1_l2_cache_wb_victim |
21 | Slice1_l2_cache_wb_cleaning_coh |
22 | Slice1_l2_cache_access_rd |
23 | Slice1_l2_cache_access_wr |
24 | Slice1_l2_cache_inv |
25 | Slice2_l2_cache_refill |
26 | Slice2_l2_cache_rd_refill |
27 | Slice2_l2_cache_wr_refill |
28 | Slice2_l2_cache_long_miss |
29 | Slice2_l2_cache_access |
30 | Slice2_l2_cache_l2wb |
31 | Slice2_l2_cache_l1wb |
32 | Slice2_l2_cache_wb_victim |
33 | Slice2_l2_cache_wb_cleaning_coh |
34 | Slice2_l2_cache_access_rd |
35 | Slice2_l2_cache_access_wr |
36 | Slice2_l2_cache_inv |
37 | Slice3_l2_cache_refill |
38 | Slice3_l2_cache_rd_refill |
39 | Slice3_l2_cache_wr_refill |
40 | Slice3_l2_cache_long_miss |
41 | Slice3_l2_cache_access |
42 | Slice3_l2_cache_l2wb |
43 | Slice3_l2_cache_l1wb |
44 | Slice3_l2_cache_wb_victim |
45 | Slice3_l2_cache_wb_cleaning_coh |
46 | Slice3_l2_cache_access_rd |
47 | Slice3_l2_cache_access_wr |
48 | Slice3_l2_cache_inv |
Topdown PMU
Topdown performance analysis is a top-down analysis method used to quickly analyze CPU performance bottlenecks. Its core idea is to progressively decompose performance issues from high-level categories downward, refining the problem layer by layer, ultimately pinpointing the root cause. We have implemented a three-layer Topdown performance event structure, as shown below:
Level 1 | Level 2 | Level 3 | Description | Formula |
---|---|---|---|---|
Retiring | - | - | Instruction commit impact | INST_RETIRED / (IssueBW * CPU_CYCLES) |
FrontEnd Bound | - | - | Frontend impact | IF_FETCH_BUBBLE / (IssueBW * CPU_CYCLES) |
- | Fetch Latency Bound | - | Fetch latency impact | IF_FETCH_BUBBLE_EQ_MAX / CPU_CYCLES |
- | Fetch Bandwidth Bound | - | Fetch bandwidth impact | FrontEnd Bound - Fetch Latency Bound |
Bad Speculation | - | - | Misprediction impact | (INST_SPEC - INST_RETIRED + RECOVERY_BUBBLE) / (IssueBW * CPU_CYCLES) |
- | Branch Mispredict | - | Mispredicted branch instruction impact | Bad Speculation * BR_MIS_PRED / TOTAL_FLUSH |
- | Machine Clears | - | Machine clears event impact | Bad Speculation - Branch Mispredict |
BackEnd Bound | - | - | Backend impact | 1 - (FrontEnd Bound + Bad Speculation + Retiring) |
- | Core Bound | - | Core impact | (EXEC_STALL_CYCLE - MEMSTALL_ANY_LOAD - MEMSTALL_STORE) / CPU_CYCLES |
- | Memory Bound | - | Memory access impact | (MEMSTALL_ANY_LOAD + MEMSTALL_STORE) / CPU_CYCLES |
- | - | L1 Bound | L1 impact | (MEMSTALL_ANY_LOAD - MEMSTALL_L1MISS) / CPU_CYCLES |
- | - | L2 Bound | L2 impact | (MEMSTALL_L1MISS - MEMSTALL_L2MISS) / CPU_CYCLES |
- | - | L3 Bound | L3 impact | (MEMSTALL_L2MISS - MEMSTALL_L3MISS) / CPU_CYCLES |
- | - | Mem Bound | External memory impact | MEMSTALL_L3MISS / CPU_CYCLES |
- | - | Store Bound | Store instruction impact | MEMSTALL_STORE / CPU_CYCLES |
Here IssueBW is the issue width. The XiangShan Kunming Lake architecture currently has an issue width of 6.
Name | Corresponding Performance Event | Description |
---|---|---|
CPU_CYCLES | - | Total clock cycles after all instructions are committed |
INST_RETIRED | rob_commitInstr | Number of successfully committed instructions |
INST_SPEC | - | Number of speculatively executed instructions |
IF_FETCH_BUBBLE | Front_Bubble | Number of bubbles fetched from the fetch buffer while no backend stall exists |
IF_FETCH_BUBBLE_EQ_MAX | Fetch_Latency_Bound | Cycles in which 0 instructions are fetched from the fetch buffer while no backend stall exists |
BR_MIS_PRED | - | Number of mispredicted branch instructions |
TOTAL_FLUSH | - | Number of pipeline flush events |
RECOVERY_BUBBLE | - | Number of cycles spent recovering from early mispredictions |
EXEC_STALL_CYCLE | - | Number of cycles in which few uops are issued |
MEMSTALL_ANY_LOAD | - | No uops are issued, and at least one Load instruction is not completed |
MEMSTALL_STORE | - | Non-Store instruction uops are issued, and there is a Store instruction not completed |
MEMSTALL_L1MISS | - | No uops are issued, at least one Load instruction is not completed, and an L1-cache miss occurred |
MEMSTALL_L2MISS | - | No uops are issued, at least one Load instruction is not completed, and an L2-cache miss occurred |
MEMSTALL_L3MISS | - | No uops are issued, at least one Load instruction is not completed, and an L3-cache miss occurred |
To count the impact of frontend fetch latency over a period, we can set the EVENT0 field of mhpmevent3 to 22, leaving the other bits at their default values. Then, run the test. After the test is completed, the mhpmcounter3 register can be read via a CSR read instruction to obtain the number of cycles of frontend fetch latency during this period. The impact caused by frontend fetch latency can then be calculated.
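Once the raw counts have been read back, the Topdown formulas can be evaluated offline. The following is a minimal sketch: the function names are hypothetical, the counts are passed in as a dictionary keyed by the names used in the tables above, and the numbers in the usage note are made up for illustration rather than measured.

```python
def topdown_level1(counts, issue_bw=6):
    """Evaluate the four Level 1 categories from raw event counts."""
    slots = issue_bw * counts["CPU_CYCLES"]
    retiring = counts["INST_RETIRED"] / slots
    frontend_bound = counts["IF_FETCH_BUBBLE"] / slots
    bad_speculation = (
        counts["INST_SPEC"] - counts["INST_RETIRED"] + counts["RECOVERY_BUBBLE"]
    ) / slots
    backend_bound = 1 - (frontend_bound + bad_speculation + retiring)
    return {
        "Retiring": retiring,
        "FrontEnd Bound": frontend_bound,
        "Bad Speculation": bad_speculation,
        "BackEnd Bound": backend_bound,
    }


def fetch_breakdown(counts, issue_bw=6):
    """Split FrontEnd Bound into Fetch Latency Bound and Fetch Bandwidth Bound."""
    frontend_bound = counts["IF_FETCH_BUBBLE"] / (issue_bw * counts["CPU_CYCLES"])
    latency = counts["IF_FETCH_BUBBLE_EQ_MAX"] / counts["CPU_CYCLES"]
    return {
        "Fetch Latency Bound": latency,
        "Fetch Bandwidth Bound": frontend_bound - latency,
    }
```

As a usage example with hypothetical counts, a run of 1000 cycles in which mhpmcounter3 (configured with EVENT0 = 22) reads 50 gives a Fetch Latency Bound of 50 / 1000, that is, 5% of cycles.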
HPM Related Performance Counters
The performance event counters in the XiangShan Kunming Lake architecture are divided into three groups: Machine-mode event counters, Supervisor-mode event counters, and User-mode event counters.
Name | Index | R/W | Description | Reset Value |
---|---|---|---|---|
MCYCLE | 0xB00 | RW | Machine-mode clock cycle counter | - |
MINSTRET | 0xB02 | RW | Machine-mode retired instruction counter | - |
MHPMCOUNTER3-31 | 0xB03-0xB1F | RW | Machine-mode performance event counter | 0 |
The MHPMCOUNTERx counters are controlled by the corresponding MHPMEVENTx, which specifies the corresponding performance events to count.
Supervisor-mode event counters include the Supervisor-mode Counter Overflow Interrupt Flag Register (SCOUNTOVF).
Name | Bit Field | R/W | Behavior | Reset Value |
---|---|---|---|---|
OFVEC | 31:3 | RO | Overflow flag bits for the mhpmcounterx registers: 1 = overflow occurred; 0 = no overflow occurred | 0 |
-- | 2:0 | RO 0 | -- | 0 |
scountovf serves as a read-only mapping of the OF bits in the mhpmevent registers and is controlled by xcounteren:
- M-mode access to scountovf reads the correct value.
- HS-mode access to scountovf: if mcounteren.HPMx is 1, the corresponding OFVECx can be read correctly; otherwise, it reads 0.
- VS-mode access to scountovf: if both mcounteren.HPMx and hcounteren.HPMx are 1, the corresponding OFVECx can be read correctly; otherwise, it reads 0.
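The read rules above amount to masking the overflow bits with the enable registers. The sketch below models that behavior; the function name and mode strings are hypothetical, and it assumes that bit x of scountovf is gated by bit x of mcounteren and hcounteren.

```python
OFVEC_MASK = 0xFFFF_FFF8  # OFVEC occupies bits 31:3; bits 2:0 read as zero


def read_scountovf(of_bits: int, mode: str, mcounteren: int = 0, hcounteren: int = 0) -> int:
    """Return the scountovf value visible in the given privilege mode.

    of_bits holds the OF flag of counter x at bit position x.
    """
    value = of_bits & OFVEC_MASK
    if mode == "M":
        return value
    if mode == "HS":
        return value & mcounteren
    if mode == "VS":
        return value & mcounteren & hcounteren
    raise ValueError("mode must be 'M', 'HS' or 'VS'")
```

A masked bit reads as 0 rather than raising an exception, so lower-privilege software cannot tell a hidden overflow from no overflow.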
User-mode event counters are read-only copies of the Machine-mode counters:
Name | Index | R/W | Description | Reset Value |
---|---|---|---|---|
CYCLE | 0xC00 | RO | User-mode read-only copy of mcycle register | - |
TIME | 0xC01 | RO | User-mode read-only copy of memory-mapped register mtime | - |
INSTRET | 0xC02 | RO | User-mode read-only copy of minstret register | - |
HPMCOUNTER3-31 | 0xC03-0xC1F | RO | User-mode read-only copy of mhpmcounter3-31 registers | 0 |