JasonWang's Blog

A Case of High CPU Steal Time

Word count: 3.2k · Reading time: 14 min
2025/05/07

A few days ago I ran into a problem on a project built on a QNX-based virtualization platform. A colleague reported that the system was very sluggish, with noticeable delays and severe stutter when tapping through pages. Checking the Android system's load with top, there was still about 20% idle, and the user, kernel, and interrupt shares all looked normal; the one anomaly was a %host column that was unusually high, peaking above 60%. So what does this host share mean? In this article, we use this problem as a starting point to analyze high host usage on a virtualization platform in detail, and to look at how a KVM-based virtualization stack actually computes it.

800%cpu 45%user 4%nice 126%sys 141%idle 3%iow 41%irq 16%sirq 425%host

Locating and Troubleshooting the Problem

Entering the device with adb shell and running top, we can see that the system still has plenty of free memory, while CPU idle is below 30% and the single biggest consumer is the %host column.

[Figure: top output showing high steal time]

I first asked DeepSeek what the host field in top actually means, and it quickly gave an answer:

In Linux's top command, the "host CPU usage" (usually the %host or st field) is a performance metric specific to virtualized environments. It reflects the percentage of CPU time for which the virtual machine was preempted by the host (hypervisor).

In other words, a high host share may indicate a problem on the host side of the virtualization platform, such as the host being under heavy load or an unreasonable CPU configuration (too few resources for the guest), leaving the guest Android system unable to get the CPU and stuck waiting. So how do we troubleshoot this class of problem? I asked DeepSeek how to investigate high steal CPU usage, and it offered several possible causes:

[Figure: DeepSeek's suggested causes of high CPU steal time]

Following that lead, I went back into the device and checked the overall load with mpstat -P ALL, which showed the same picture: the %steal column indicated that a sizable share of Android's load was time spent waiting on the host. The per-core %steal values here add up to the %host figure shown by top.

[Figure: mpstat output showing high CPU steal time]

Based on this data, we suspected a problem on the QNX host side (earlier builds showed no such anomaly). After syncing with the developers, we confirmed that the VM's CPU allocation was fine (the system has only one guest, Android, which can use all physical CPUs), so a misconfigured resource shortage could essentially be ruled out. Had some recent change introduced the problem?

The developers reported only two recent changes: an update to the vehicle-control signal matrix, and a new PPS node added on QNX. Updating the signal matrix should not noticeably affect system load; the data showed no Android process under heavy load, and on the bench there were few vehicle signals to transmit anyway, so that change could be ruled out. That left the new PPS node as the likely cause of the high QNX-side load starving Android of CPU. In the end, the developers traced it to a misconfigured PPS node: one process was writing data at a very high frequency, driving QNX-side load so high (QNX idle was close to 0) that the Android guest could not get enough CPU.

PPS (Persistent Publish-Subscribe) is a QNX protocol for inter-process communication.

That closed out the problem itself. But to understand this class of issue better, so that similar problems can be located and analyzed quickly in the future, I decided to dig into the code and see how steal time is actually computed on a virtualization platform.

CPU Steal Time in Virtualization

Since it is "stolen time", it is not CPU usage caused by the guest executing its own instructions, but time spent waiting for the host to allocate resources, while the host may be serving other guests or busy with its own internal work.

Steal time is the percentage of time a virtual CPU waits for a real CPU while the hypervisor is servicing another virtual processor (or host itself).

Both the %host shown by top and the %steal shown by mpstat come from the kernel's CPU accounting in /proc/stat. Looking at the Android source in external/toybox/toys/posix/ps.c, the steal time is the eighth field of the cpu line in /proc/stat (the last one toybox reads). Let's follow this into the kernel code.

static void top_common(
    int (*filter)(long long *oslot, long long *nslot, int milis))
{
  long long timeout = 0, now, stats[16];
  struct proclist {
    struct procpid **tb;
    int count;
    long long whence;
  } plist[2], *plold, *plnew, old, new, mix;
  char scratch[16], *pos, *cpufields[] = {"user", "nice", "sys", "idle",
    "iow", "irq", "sirq", "host"}; // the labels top uses to display CPU usage
  ....

  memset(plist, 0, sizeof(plist));
  memset(stats, 0, sizeof(stats));

  do {

    // read the CPU usage counters; the last field read here is the steal time
    if (readfile("/proc/stat", pos = toybuf, sizeof(toybuf))) {
      long long *st = stats+8*(tock&1);

      // user nice system idle iowait irq softirq host
      sscanf(pos, "cpu %lld %lld %lld %lld %lld %lld %lld %lld",
             st, st+1, st+2, st+3, st+4, st+5, st+6, st+7);
    }

    ...
  } while (!done);

  if (!FLAG(b)) tty_reset();
}
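
top does not display these counters raw: it samples /proc/stat twice and shows each field's share of the delta over the interval, scaled by the number of CPUs (which is why the figures in the 800%cpu line can exceed 100%). As a rough illustration (my own standalone sketch, not the toybox code; the 1-second interval is arbitrary), the steal percentage can be derived like this:

#include <stdio.h>
#include <unistd.h>

/* Read the aggregate "cpu" line of /proc/stat into f[0..7]:
 * user nice system idle iowait irq softirq steal */
static int read_cpu_line(long long f[8])
{
    FILE *fp = fopen("/proc/stat", "r");
    int n;

    if (!fp)
        return -1;
    n = fscanf(fp, "cpu %lld %lld %lld %lld %lld %lld %lld %lld",
               &f[0], &f[1], &f[2], &f[3], &f[4], &f[5], &f[6], &f[7]);
    fclose(fp);
    return n == 8 ? 0 : -1;
}

int main(void)
{
    long long a[8], b[8], total = 0;
    int i;

    if (read_cpu_line(a))
        return 1;
    sleep(1);                       /* sampling interval, like top's default */
    if (read_cpu_line(b))
        return 1;

    for (i = 0; i < 8; i++)
        total += b[i] - a[i];
    printf("steal: %.1f%% of total CPU time\n",
           total ? 100.0 * (b[7] - a[7]) / total : 0.0);
    return 0;
}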


The /proc/stat output is implemented in the kernel in fs/proc/stat.c. As the code shows, the steal time is read from the per-cpu kernel_cpustat structure, using the CPUTIME_STEAL index into its cpustat array:

static int show_stat(struct seq_file *p, void *v)
{
    int i, j;
    u64 user, nice, system, idle, iowait, irq, softirq, steal;
    u64 guest, guest_nice;
    u64 sum = 0;
    u64 sum_softirq = 0;
    unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
    struct timespec64 boottime;

    user = nice = system = idle = iowait =
        irq = softirq = steal = 0;
    guest = guest_nice = 0;
    getboottime64(&boottime);

    for_each_possible_cpu(i) {
        struct kernel_cpustat kcpustat;
        u64 *cpustat = kcpustat.cpustat;

        kcpustat_cpu_fetch(&kcpustat, i);

        user += cpustat[CPUTIME_USER];
        nice += cpustat[CPUTIME_NICE];
        system += cpustat[CPUTIME_SYSTEM];
        idle += get_idle_time(&kcpustat, i);
        iowait += get_iowait_time(&kcpustat, i);
        irq += cpustat[CPUTIME_IRQ];
        softirq += cpustat[CPUTIME_SOFTIRQ];
        // fetch the per-cpu steal time
        steal += cpustat[CPUTIME_STEAL];
        guest += cpustat[CPUTIME_GUEST];
        guest_nice += cpustat[CPUTIME_GUEST_NICE];
        sum += kstat_cpu_irqs_sum(i);
        sum += arch_irq_stat_cpu(i);

        for (j = 0; j < NR_SOFTIRQS; j++) {
            unsigned int softirq_stat = kstat_softirqs_cpu(j, i);

            per_softirq_sums[j] += softirq_stat;
            sum_softirq += softirq_stat;
        }
    }
    sum += arch_irq_stat();

    seq_put_decimal_ull(p, "cpu ", nsec_to_clock_t(user));
    seq_put_decimal_ull(p, " ", nsec_to_clock_t(nice));
    seq_put_decimal_ull(p, " ", nsec_to_clock_t(system));
    seq_put_decimal_ull(p, " ", nsec_to_clock_t(idle));
    seq_put_decimal_ull(p, " ", nsec_to_clock_t(iowait));
    seq_put_decimal_ull(p, " ", nsec_to_clock_t(irq));
    seq_put_decimal_ull(p, " ", nsec_to_clock_t(softirq));
    seq_put_decimal_ull(p, " ", nsec_to_clock_t(steal));
    seq_put_decimal_ull(p, " ", nsec_to_clock_t(guest));
    seq_put_decimal_ull(p, " ", nsec_to_clock_t(guest_nice));
    seq_putc(p, '\n');

    for_each_online_cpu(i) {
        struct kernel_cpustat kcpustat;
        u64 *cpustat = kcpustat.cpustat;

        kcpustat_cpu_fetch(&kcpustat, i);

        /* Copy values here to work around gcc-2.95.3, gcc-2.96 */
        user = cpustat[CPUTIME_USER];
        nice = cpustat[CPUTIME_NICE];
        system = cpustat[CPUTIME_SYSTEM];
        idle = get_idle_time(&kcpustat, i);
        iowait = get_iowait_time(&kcpustat, i);
        irq = cpustat[CPUTIME_IRQ];
        softirq = cpustat[CPUTIME_SOFTIRQ];
        steal = cpustat[CPUTIME_STEAL];
        guest = cpustat[CPUTIME_GUEST];
        guest_nice = cpustat[CPUTIME_GUEST_NICE];
        seq_printf(p, "cpu%d", i);
        seq_put_decimal_ull(p, " ", nsec_to_clock_t(user));
        seq_put_decimal_ull(p, " ", nsec_to_clock_t(nice));
        seq_put_decimal_ull(p, " ", nsec_to_clock_t(system));
        seq_put_decimal_ull(p, " ", nsec_to_clock_t(idle));
        seq_put_decimal_ull(p, " ", nsec_to_clock_t(iowait));
        seq_put_decimal_ull(p, " ", nsec_to_clock_t(irq));
        seq_put_decimal_ull(p, " ", nsec_to_clock_t(softirq));
        seq_put_decimal_ull(p, " ", nsec_to_clock_t(steal));
        seq_put_decimal_ull(p, " ", nsec_to_clock_t(guest));
        seq_put_decimal_ull(p, " ", nsec_to_clock_t(guest_nice));
        seq_putc(p, '\n');
    }
    seq_put_decimal_ull(p, "intr ", (unsigned long long)sum);

    show_all_irqs(p);

    seq_printf(p,
        "\nctxt %llu\n"
        "btime %llu\n"
        "processes %lu\n"
        "procs_running %lu\n"
        "procs_blocked %lu\n",
        nr_context_switches(),
        (unsigned long long)boottime.tv_sec,
        total_forks,
        nr_running(),
        nr_iowait());

    seq_put_decimal_ull(p, "softirq ", (unsigned long long)sum_softirq);

    for (i = 0; i < NR_SOFTIRQS; i++)
        seq_put_decimal_ull(p, " ", per_softirq_sums[i]);
    seq_putc(p, '\n');

    return 0;
}
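
One detail worth noting: the kernel keeps these counters in nanoseconds, and show_stat converts them with nsec_to_clock_t, so the values in /proc/stat are in clock ticks (USER_HZ). From userspace, dividing the steal column by sysconf(_SC_CLK_TCK) turns it back into seconds; a minimal sketch:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long hz = sysconf(_SC_CLK_TCK);     /* USER_HZ, typically 100 */
    long long f[8] = {0};
    FILE *fp = fopen("/proc/stat", "r");

    if (!fp)
        return 1;
    if (fscanf(fp, "cpu %lld %lld %lld %lld %lld %lld %lld %lld",
               &f[0], &f[1], &f[2], &f[3], &f[4], &f[5], &f[6], &f[7]) != 8) {
        fclose(fp);
        return 1;
    }
    fclose(fp);
    printf("steal since boot: %.2f s\n", (double)f[7] / hz);
    return 0;
}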


The kernel implements the accounting for all of these CPU-time categories in kernel/sched/cputime.c. Steal time is only computed on virtualization platforms with the kernel config CONFIG_PARAVIRT enabled; otherwise the function below simply returns 0. When enabled, paravirt_steal_clock returns the accumulated steal time of the given CPU for the guest:

/*
 * When a guest is interrupted for a longer amount of time, missed clock
 * ticks are not redelivered later. Due to that, this function may on
 * occasion account more time than the calling functions think elapsed.
 */
static __always_inline u64 steal_account_process_time(u64 maxtime)
{
#ifdef CONFIG_PARAVIRT
    if (static_key_false(&paravirt_steal_enabled)) {
        u64 steal;

        steal = paravirt_steal_clock(smp_processor_id());
        steal -= this_rq()->prev_steal_time;
        steal = min(steal, maxtime);
        account_steal_time(steal);
        this_rq()->prev_steal_time += steal;

        return steal;
    }
#endif
    return 0;
}


/*
 * Account for involuntary wait time.
 * @cputime: the CPU time spent in involuntary wait
 */
void account_steal_time(u64 cputime)
{
    u64 *cpustat = kcpustat_this_cpu->cpustat;

    cpustat[CPUTIME_STEAL] += cputime;
}
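
For context, steal_account_process_time is invoked from the periodic tick accounting path. Abridged from kernel/sched/cputime.c (details differ between kernel versions), account_process_tick deducts the stolen share from the tick before charging the remainder, which is why steal shows up as its own column instead of inflating user/system/idle:

void account_process_tick(struct task_struct *p, int user_tick)
{
    u64 cputime, steal;

    if (vtime_accounting_enabled_this_cpu())
        return;

    if (sched_clock_irqtime) {
        irqtime_account_process_tick(p, user_tick, 1);
        return;
    }

    cputime = TICK_NSEC;
    steal = steal_account_process_time(ULONG_MAX);

    if (steal >= cputime)
        return;

    cputime -= steal;

    if (user_tick)
        account_user_time(p, cputime);
    else if (p != this_rq()->idle || (irq_count() != HARDIRQ_OFFSET))
        account_system_time(p, HARDIRQ_OFFSET, cputime);
    else
        account_idle_time(cputime);
}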

The function paravirt_steal_clock is defined in arch/arm64/include/asm/paravirt.h; the steal time is actually obtained through the steal_clock function pointer in the pv_time_ops structure.

#ifdef CONFIG_PARAVIRT
struct pv_time_ops {
    unsigned long long (*steal_clock)(int cpu); // the function that actually reads the steal time
};

struct paravirt_patch_template {
    struct pv_time_ops time;
};

extern struct paravirt_patch_template pv_ops;

static inline u64 paravirt_steal_clock(int cpu)
{
    return pv_ops.time.steal_clock(cpu);
}

int __init pv_time_init(void);

#else

#define pv_time_init() do {} while (0)

#endif // CONFIG_PARAVIRT

On a virtualization platform, the system initializes pv_ops during boot via the pv_time_init function (arch/arm64/kernel/paravirt.c):

  • pv_time_init_stolen_time: sets up the memory region used to hold the steal time (shared between the host hypervisor and the guest)
  • pv_ops.time.steal_clock: assigns the function that the rest of the kernel calls to read the steal time

int __init pv_time_init(void)
{
    int ret;

    if (!has_pv_steal_clock())
        return 0;

    ret = pv_time_init_stolen_time();
    if (ret)
        return ret;

    pv_ops.time.steal_clock = pv_steal_clock;

    static_key_slow_inc(&paravirt_steal_enabled);
    if (steal_acc)
        static_key_slow_inc(&paravirt_steal_rq_enabled);

    pr_info("using stolen time PV\n");

    return 0;
}
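
pv_steal_clock, assigned above, is what paravirt_steal_clock ends up calling. For reference, in mainline kernels of this vintage it simply reads the 64-bit counter out of the per-cpu shared region that is set up below (quoted from arch/arm64/kernel/paravirt.c; the exact shape varies between kernel versions):

static u64 pv_steal_clock(int cpu)
{
    struct pv_time_stolen_time_region *reg;

    reg = per_cpu_ptr(&stolen_time_region, cpu);
    if (!reg->kaddr) {
        pr_warn_once("stolen time enabled but not configured for cpu %d\n",
                     cpu);
        return 0;
    }

    return le64_to_cpu(READ_ONCE(reg->kaddr->stolen_time));
}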

The function pv_time_init_stolen_time registers a CPU hotplug callback; when a CPU comes online, the callback stolen_time_cpu_online runs and sets up the steal-time plumbing for that CPU:

static int __init pv_time_init_stolen_time(void)
{
    int ret;

    ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
                            "hypervisor/arm/pvtime:online",
                            stolen_time_cpu_online,
                            stolen_time_cpu_down_prepare);
    if (ret < 0)
        return ret;
    return 0;
}

As the code below shows, stolen_time_cpu_online is mainly responsible for negotiating a fixed memory region between the guest and the hypervisor for exchanging the steal time, in two main steps:

  • First, an HVC trap instruction (used to switch between Exception Levels, ELs) takes the guest from EL1 (kernel mode) into the hypervisor at EL2, which looks up the host memory address where the steal time is kept and returns it to the guest via arm_smccc_res
  • The kernel then maps that address with memremap, so kernel code can access this shared region directly and read the steal time from it

static int stolen_time_cpu_online(unsigned int cpu)
{
    struct pv_time_stolen_time_region *reg;
    struct arm_smccc_res res;

    reg = this_cpu_ptr(&stolen_time_region);

    arm_smccc_1_1_invoke(ARM_SMCCC_HV_PV_TIME_ST, &res);

    if (res.a0 == SMCCC_RET_NOT_SUPPORTED)
        return -EINVAL;

    reg->kaddr = memremap(res.a0,
                          sizeof(struct pvclock_vcpu_stolen_time),
                          MEMREMAP_WB);

    if (!reg->kaddr) {
        pr_warn("Failed to map stolen time data structure\n");
        return -ENOMEM;
    }

    if (le32_to_cpu(reg->kaddr->revision) != 0 ||
        le32_to_cpu(reg->kaddr->attributes) != 0) {
        pr_warn_once("Unexpected revision or attributes in stolen time data\n");
        return -ENXIO;
    }

    return 0;
}
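
The layout of the shared region that reg->kaddr points to is fixed by Arm's paravirtualized time specification (DEN0057A) and defined in include/linux/arm-smccc.h; stolen_time is the counter both sides exchange:

struct pvclock_vcpu_stolen_time {
    __le32 revision;
    __le32 attributes;
    __le64 stolen_time;
    /* Structure must be 64 byte aligned, pad to that size */
    u8 padding[48];
} __packed;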

Let's take Linux's own virtualization solution, KVM (Kernel-based Virtual Machine), to see how the hypervisor side computes steal time and hands it to the guest. When KVM receives the HVC call ARM_SMCCC_HV_PV_TIME_ST, it eventually lands in the corresponding handler kvm_hvc_call_handler (arch/arm64/kvm/hypercalls.c):

int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
{
    u32 func_id = smccc_get_function(vcpu);
    long val = SMCCC_RET_NOT_SUPPORTED;
    u32 feature;
    gpa_t gpa;

    switch (func_id) {
    case ARM_SMCCC_VERSION_FUNC_ID:
        val = ARM_SMCCC_VERSION_1_1;
        break;
    case ARM_SMCCC_ARCH_FEATURES_FUNC_ID:
        feature = smccc_get_arg1(vcpu);
        switch (feature) {
        case ARM_SMCCC_ARCH_WORKAROUND_1:
            switch (arm64_get_spectre_v2_state()) {
            case SPECTRE_VULNERABLE:
                break;
            case SPECTRE_MITIGATED:
                val = SMCCC_RET_SUCCESS;
                break;
            case SPECTRE_UNAFFECTED:
                val = SMCCC_ARCH_WORKAROUND_RET_UNAFFECTED;
                break;
            }
            break;
        case ARM_SMCCC_ARCH_WORKAROUND_2:
            switch (arm64_get_spectre_v4_state()) {
            case SPECTRE_VULNERABLE:
                break;
            case SPECTRE_MITIGATED:
                /*
                 * SSBS everywhere: Indicate no firmware
                 * support, as the SSBS support will be
                 * indicated to the guest and the default is
                 * safe.
                 *
                 * Otherwise, expose a permanent mitigation
                 * to the guest, and hide SSBS so that the
                 * guest stays protected.
                 */
                if (cpus_have_final_cap(ARM64_SSBS))
                    break;
                fallthrough;
            case SPECTRE_UNAFFECTED:
                val = SMCCC_RET_NOT_REQUIRED;
                break;
            }
            break;
        case ARM_SMCCC_HV_PV_TIME_FEATURES:
            val = SMCCC_RET_SUCCESS;
            break;
        }
        break;
    case ARM_SMCCC_HV_PV_TIME_FEATURES:
        val = kvm_hypercall_pv_features(vcpu);
        break;
    case ARM_SMCCC_HV_PV_TIME_ST:
        gpa = kvm_init_stolen_time(vcpu);
        if (gpa != GPA_INVALID)
            val = gpa;
        break;
    default:
        return kvm_psci_call(vcpu);
    }

    smccc_set_retval(vcpu, val, 0, 0, 0);
    return 1;
}
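
The ARM_SMCCC_HV_PV_TIME_FEATURES branch is what answers the guest's has_pv_steal_clock probe. Roughly (abridged from arch/arm64/kvm/pvtime.c; details are version-dependent), it just reports success for the two PV-time function IDs:

long kvm_hypercall_pv_features(struct kvm_vcpu *vcpu)
{
    u32 feature = smccc_get_arg1(vcpu);
    long val = SMCCC_RET_NOT_SUPPORTED;

    switch (feature) {
    case ARM_SMCCC_HV_PV_TIME_FEATURES:
    case ARM_SMCCC_HV_PV_TIME_ST:
        val = SMCCC_RET_SUCCESS;
        break;
    }

    return val;
}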

As we can see, the hypervisor then calls kvm_init_stolen_time to initialize the steal-time state:

  • It obtains the base address base of the steal-time region (a guest physical address) from the guest's vCPU structure kvm_vcpu; this base is configured ahead of time by the VMM through the KVM_ARM_VCPU_PVTIME_CTRL vcpu device attribute
  • It writes a zero-initialized pvclock_vcpu_stolen_time to base via kvm_write_guest; this is the structure (shown above) that holds the guest's steal time

gpa_t kvm_init_stolen_time(struct kvm_vcpu *vcpu)
{
    struct pvclock_vcpu_stolen_time init_values = {};
    struct kvm *kvm = vcpu->kvm;
    u64 base = vcpu->arch.steal.base;
    int idx;

    if (base == GPA_INVALID)
        return base;

    /*
     * Start counting stolen time from the time the guest requests
     * the feature enabled.
     */
    vcpu->arch.steal.last_steal = current->sched_info.run_delay;

    idx = srcu_read_lock(&kvm->srcu);
    kvm_write_guest(kvm, base, &init_values, sizeof(init_values));
    srcu_read_unlock(&kvm->srcu, idx);

    return base;
}


Finally, while KVM is running it listens for task context-switch events and calls kvm_update_stolen_time whenever the vCPU thread is scheduled, to bring the steal time up to date. From the code below we can also see what the guest's steal time really is: the scheduler statistic run_delay of the vCPU thread, i.e. the time the vCPU spent waiting in the host's run queue.

void kvm_update_stolen_time(struct kvm_vcpu *vcpu)
{
    struct kvm *kvm = vcpu->kvm;
    u64 base = vcpu->arch.steal.base;
    u64 last_steal = vcpu->arch.steal.last_steal;
    u64 offset = offsetof(struct pvclock_vcpu_stolen_time, stolen_time);
    u64 steal = 0;
    int idx;

    if (base == GPA_INVALID)
        return;

    idx = srcu_read_lock(&kvm->srcu);
    if (!kvm_get_guest(kvm, base + offset, steal)) {
        steal = le64_to_cpu(steal);
        vcpu->arch.steal.last_steal = READ_ONCE(current->sched_info.run_delay);
        steal += vcpu->arch.steal.last_steal - last_steal;
        kvm_put_guest(kvm, base + offset, cpu_to_le64(steal));
    }
    srcu_read_unlock(&kvm->srcu, idx);
}
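
Since run_delay is an ordinary per-task scheduler statistic, the same quantity can be watched from userspace on the Linux host via /proc/<pid>/schedstat, whose three fields are time on CPU, run-queue wait time (both in nanoseconds), and timeslice count. A quick, purely illustrative sketch for eyeballing the delay of a vCPU thread, given its pid:

#include <stdio.h>

int main(int argc, char **argv)
{
    char path[64];
    unsigned long long on_cpu, run_delay, slices;
    FILE *fp;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/proc/%s/schedstat", argv[1]);
    fp = fopen(path, "r");
    if (!fp)
        return 1;
    if (fscanf(fp, "%llu %llu %llu", &on_cpu, &run_delay, &slices) == 3)
        printf("on_cpu=%llu ns  run_delay=%llu ns  slices=%llu\n",
               on_cpu, run_delay, slices);
    fclose(fp);
    return 0;
}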

Summary

In this post, starting from a high CPU usage problem on a virtualization project, we went through the troubleshooting approach, then used KVM as an example to dig into how a virtualization platform computes and updates CPU steal time, so that similar problems can be handled with much more confidence in the future. In day-to-day project work, when a hard problem exposes a blind spot, spending some time studying the underlying principles and mechanisms teaches you far more than purely theoretical reading.

Author: Jason Wang

Last updated: 2025-05-14, 16:36:36

License: this post is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license
