JasonWang's Blog

Linux Containers (LXC) in Depth, Part 2: LXC Source Code Analysis

2025/06/18

The previous article, "Linux Containers (LXC) in Depth, Part 1: How LXC Works", focused on how LXC containers are implemented, and gave us a working understanding of the basics. One question remained, though: after a container starts, two processes named init show up inside it. Why are there two init processes? To answer that, we need a deeper look at the LXC source code. This article is organized around that question and falls into two major parts:

  • How lxc-create works: how an LXC container is created
  • How lxc-start works: how an LXC container is started

The LXC source can be downloaded from its GitHub repository; the core code is organized into the following parts:

  • config: common configuration, such as AppArmor profiles, container startup configuration, init process configuration, and SELinux policy rules
  • doc: documentation for the LXC tools and common configuration templates
  • src: the source tree, including the headers, the core code, and the test code
  • templates: container image templates, including a few common ones such as busybox, local and download; more templates are available from the LXC website

Next, let's analyze the LXC source along these two paths: container creation and container startup.

The LXC container creation process

The sources for the LXC command-line tools live under src/lxc/tools. Here we trace how the LXC source works by following the two flows of creating and starting a container. First, let's look at the call flow of lxc-create; for example, creating a container with the following command invokes the lxc-create tool:


sudo lxc-create -n busybox-lxc -t busybox

## To see more logs, add the -l and -o options; the log then shows the creation process in detail
sudo lxc-create -n busybox-lxc1 -t busybox -l TRACE -o /home/jason/Downloads/busybox-lxc-create.log

The implementation of lxc-create lives in lxc_create.c. Starting from its entry function lxc_create_main, container creation goes through the following key steps:

  • lxc_arguments_parse: parse the arguments, mainly the container name (-n) and the template (-t), and set the file path used to store the container's persistent configuration
  • lxc_log_init: initialize the container's log destination (by default no log is stored; pass -o to set the log file path)
  • lxc_mkdir_p: create the container's directory, which holds the configuration and the rootfs image; on Ubuntu the default path is /var/lib/lxc
  • lxc_container_new: allocate a new struct lxc_container and initialize it; if the container already exists, bail out
  • call the container's load_config function to load the configuration; on first creation the configuration file is empty (the file lives at /var/lib/lxc/<lxc-name>/config)
  • call create to create the container object: this creates the directories needed for the container's rootfs, runs the template script, and saves the container configuration

int __attribute__((weak, alias("lxc_create_main"))) main(int argc, char *argv[]);
int lxc_create_main(int argc, char *argv[])
{
struct lxc_container *c;
struct bdev_specs spec;
struct lxc_log log;
int flags = 0;

if (lxc_arguments_parse(&my_args, argc, argv))
exit(EXIT_FAILURE);

log.name = my_args.name;
log.file = my_args.log_file;
log.level = my_args.log_priority;
log.prefix = my_args.progname;
log.quiet = my_args.quiet;
log.lxcpath = my_args.lxcpath[0];

if (lxc_log_init(&log))
exit(EXIT_FAILURE);

···

if (lxc_mkdir_p(my_args.lxcpath[0], 0755))
exit(EXIT_FAILURE);

...

c = lxc_container_new(my_args.name, my_args.lxcpath[0]);
if (!c) {
ERROR("Failed to create lxc container");
exit(EXIT_FAILURE);
}

if (c->is_defined(c)) {
lxc_container_put(c);
ERROR("Container already exists");
exit(EXIT_FAILURE);
}

if (my_args.configfile)
c->load_config(c, my_args.configfile);
else
c->load_config(c, lxc_get_global_config_item("lxc.default_config"));

...

if (!c->create(c, my_args.template, my_args.bdevtype, &spec, flags, &argv[optind])) {
ERROR("Failed to create container %s", c->name);
lxc_container_put(c);
exit(EXIT_FAILURE);
}

lxc_container_put(c);
exit(EXIT_SUCCESS);
}

Let's focus on the key creation step, lxc->create. When the container object is created in lxc_container_new, its member functions are initialized; in other words, lxc->create actually points at lxcapi_create, which ultimately calls the core function __lxcapi_create (see lxccontainer.c):


//lxc_container_new
c->daemonize = true;
c->pidfile = NULL;

/* Assign the member functions. */
c->is_defined = lxcapi_is_defined;
c->state = lxcapi_state;
c->is_running = lxcapi_is_running;
c->freeze = lxcapi_freeze;
c->unfreeze = lxcapi_unfreeze;
c->console = lxcapi_console;
c->console_getfd = lxcapi_console_getfd;
c->devpts_fd = lxcapi_devpts_fd;
c->init_pid = lxcapi_init_pid;
c->init_pidfd = lxcapi_init_pidfd;
c->load_config = lxcapi_load_config;
c->want_daemonize = lxcapi_want_daemonize;
c->want_close_all_fds = lxcapi_want_close_all_fds;
c->start = lxcapi_start;
c->startl = lxcapi_startl;
c->stop = lxcapi_stop;
c->config_file_name = lxcapi_config_file_name;
c->wait = lxcapi_wait;
c->set_config_item = lxcapi_set_config_item;
c->destroy = lxcapi_destroy;
c->destroy_with_snapshots = lxcapi_destroy_with_snapshots;
c->rename = lxcapi_rename;
c->save_config = lxcapi_save_config;
c->get_keys = lxcapi_get_keys;
c->create = lxcapi_create;
c->createl = lxcapi_createl;
c->shutdown = lxcapi_shutdown;
c->reboot = lxcapi_reboot;
c->reboot2 = lxcapi_reboot2;
c->clear_config = lxcapi_clear_config;
c->clear_config_item = lxcapi_clear_config_item;
c->get_config_item = lxcapi_get_config_item;
c->get_running_config_item = lxcapi_get_running_config_item;
c->get_cgroup_item = lxcapi_get_cgroup_item;
c->set_cgroup_item = lxcapi_set_cgroup_item;
c->get_config_path = lxcapi_get_config_path;
c->set_config_path = lxcapi_set_config_path;
c->clone = lxcapi_clone;
c->get_interfaces = lxcapi_get_interfaces;
c->get_ips = lxcapi_get_ips;
c->attach = lxcapi_attach;
c->attach_run_wait = lxcapi_attach_run_wait;
c->attach_run_waitl = lxcapi_attach_run_waitl;
c->snapshot = lxcapi_snapshot;
c->snapshot_list = lxcapi_snapshot_list;
c->snapshot_restore = lxcapi_snapshot_restore;
c->snapshot_destroy = lxcapi_snapshot_destroy;
c->snapshot_destroy_all = lxcapi_snapshot_destroy_all;
c->may_control = lxcapi_may_control;
c->add_device_node = lxcapi_add_device_node;
c->remove_device_node = lxcapi_remove_device_node;
c->attach_interface = lxcapi_attach_interface;
c->detach_interface = lxcapi_detach_interface;
c->checkpoint = lxcapi_checkpoint;
c->restore = lxcapi_restore;
c->migrate = lxcapi_migrate;
c->console_log = lxcapi_console_log;
c->mount = lxcapi_mount;
c->umount = lxcapi_umount;
c->seccomp_notify_fd = lxcapi_seccomp_notify_fd;
c->seccomp_notify_fd_active = lxcapi_seccomp_notify_fd_active;
c->set_timeout = lxcapi_set_timeout;

The creation function __lxcapi_create goes through the following stages:

  • first, get_template_path resolves the template path (a template is a script that builds the rootfs, i.e. the system boot environment)
  • then create_container_dir creates the directory that holds the container's rootfs
  • a child process is forked to create the directories needed for the container's persistent backing store and to save the current configuration to the persistent device
  • the busybox template script is run via create_run_template (the templates bundled with the LXC source are under src/templates) to build the container's initial boot environment
  • finally, load_config_locked loads the saved configuration file (/var/lib/lxc/busybox/config) into memory

To keep a container usable across host reboots, its rootfs must be persisted. LXC supports several storage backends for this: a plain directory (the default), a logical volume, or an overlay filesystem, among others; see src/lxc/storage/storage.c for details.
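The fork-then-reload pattern that __lxcapi_create uses for storage creation can be boiled down to a few lines of standalone C. This is an illustrative sketch, not LXC code: the config path and its contents are invented for the demo. The child stands in for do_storage_create() plus do_lxcapi_save_config(), and the parent waits for it (like wait_exited()) before re-reading the persisted file:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

/* Sketch of __lxcapi_create's pattern: do the storage work in a forked
 * child, wait for it, then re-read the config the child persisted. */
static int create_storage_in_child(const char *cfg_path)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) { /* child: stand-in for do_storage_create() + save_config */
        FILE *f = fopen(cfg_path, "w");
        if (!f)
            _exit(EXIT_FAILURE);
        fputs("lxc.rootfs.path = dir:/var/lib/lxc/demo/rootfs\n", f);
        fclose(f);
        _exit(EXIT_SUCCESS);
    }

    /* parent: wait for the child, then reload what it wrote */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != EXIT_SUCCESS)
        return -1;

    char line[128] = {0};
    FILE *f = fopen(cfg_path, "r");
    if (!f)
        return -1;
    fgets(line, sizeof(line), f);
    fclose(f);
    return strncmp(line, "lxc.rootfs.path", 15) == 0 ? 0 : -1;
}
```

As the comment in the real code explains, the split into a separate task exists because of backends like zfs, which mount the new filesystem only in the initial namespace.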


static bool __lxcapi_create(struct lxc_container *c, const char *t,
const char *bdevtype, struct bdev_specs *specs,
int flags, char *const argv[])
{
...

if (t) {
path_template = get_template_path(t);
if (!path_template)
return log_error(false, "Template \"%s\" not found", t);
}

...

fd_rootfs = create_container_dir(c);
if (fd_rootfs < 0)
return log_error(false, "Failed to create container %s", c->name);


/* Mark that this container as being created */
partial_fd = create_partial(fd_rootfs, c);
if (partial_fd < 0) {
SYSERROR("Failed to mark container as being partially created");
goto out;
}

/* No need to get disk lock bc we have the partial lock. */

mask = umask(0022);

/* Create the storage.
* Note we can't do this in the same task as we use to execute the
* template because of the way zfs works.
* After you 'zfs create', zfs mounts the fs only in the initial
* namespace.
*/
pid = fork();
...

if (pid == 0) { /* child */
struct lxc_storage *bdev = NULL;

bdev = do_storage_create(c, bdevtype, specs);
...

/* Save config file again to store the new rootfs location. */
if (!do_lxcapi_save_config(c, NULL)) {
ERROR("Failed to save initial config for %s", c->name);
/* Parent task won't see the storage driver in the
* config so we delete it.
*/
bdev->ops->umount(bdev);
bdev->ops->destroy(bdev);
_exit(EXIT_FAILURE);
}

_exit(EXIT_SUCCESS);
}

if (!wait_exited(pid))
goto out_unlock;

/* Reload config to get the rootfs. */
lxc_conf_free(c->lxc_conf);
c->lxc_conf = NULL;

if (!load_config_locked(c, c->configfile))
goto out_unlock;

if (!create_run_template(c, path_template, !!(flags & LXC_CREATE_QUIET), argv))
goto out_unlock;

/* Now clear out the lxc_conf we have, reload from the created
* container.
*/
do_lxcapi_clear_config(c);

if (t) {
if (!prepend_lxc_header(c->configfile, path_template, argv)) {
ERROR("Failed to prepend header to config file");
goto out_unlock;
}
}

bret = load_config_locked(c, c->configfile);
...

return bret;
}

The LXC container startup process

lxc-start

Similar to lxc-create, the implementation of lxc-start lives in lxc_start.c. Starting from its entry function lxc_start_main, startup goes through these steps:

  • lxc_caps_init: check the privileges the program runs with; if not running as root, exit
  • lxc_arguments_parse: parse the given arguments; here we only pass -n with the container name
  • lxc_container_new: create a container object and check whether the container is already running; if it is, finish
  • call lxc_container->start to boot the container system and create the new init process

int __attribute__((weak, alias("lxc_start_main"))) main(int argc, char *argv[]);
int lxc_start_main(int argc, char *argv[])
{
...

if (lxc_caps_init())
exit(err);

if (lxc_arguments_parse(&my_args, argc, argv))
exit(err);

if (!my_args.argc)
args = default_args;
else
args = my_args.argv;

log.name = my_args.name;
log.file = my_args.log_file;
log.level = my_args.log_priority;
log.prefix = my_args.progname;
log.quiet = my_args.quiet;
log.lxcpath = my_args.lxcpath[0];

if (lxc_log_init(&log))
exit(err);

lxcpath = my_args.lxcpath[0];
if (access(lxcpath, O_RDONLY) < 0) {
ERROR("You lack access to %s", lxcpath);
exit(err);
}


if (my_args.rcfile) {
...
/* no rc config file specified */
} else {
int rc;

rc = asprintf(&rcfile, "%s/%s/config", lxcpath, my_args.name);
if (rc == -1) {
ERROR("Failed to allocate memory");
exit(err);
}

/* container configuration does not exist */
if (access(rcfile, F_OK)) {
free(rcfile);
rcfile = NULL;
}

c = lxc_container_new(my_args.name, lxcpath);
if (!c) {
ERROR("Failed to create lxc_container");
exit(err);
}
}
...
if (c->is_running(c)) {
ERROR("Container is already running");
err = EXIT_SUCCESS;
goto out;
}
...

/* use the default start arguments */
if (args == default_args)
err = c->start(c, 0, NULL) ? EXIT_SUCCESS : EXIT_FAILURE;
else
err = c->start(c, 0, args) ? EXIT_SUCCESS : EXIT_FAILURE;

...
}

do_lxcapi_start

As we saw in the creation section, lxc_container->start actually points at lxcapi_start, which in turn calls do_lxcapi_start. It takes three arguments:

  • struct lxc_container: the container object that was created
  • int useinit: whether to use an init process; when starting a full container image, useinit is 0; otherwise init is used to launch a new process
  • char *const argv[]: the arguments for the system being started; NULL is passed here

The key steps in starting a container are:

  • ongoing_create: check whether the container was created properly; only if creation completed does the flow continue

  • lxc_init_handler: create the handler object for the container's state, including the UNIX socket used for state management

  • containers start in the background by default (daemonize). To report startup reliably, fork is called twice: the first parent waits for the container's startup status, returns it to the lxc-start caller, and then exits; the first child becomes the container's monitor process, named [lxc monitor] <config> <lxc-name>, which maintains the process state and forks once more to produce the process that boots the container system, while the parent of that second fork exits immediately

  • as a background process, it has to set up the following state:

    • chdir: change the working directory to the root directory /
    • inherit_fds: close unrelated file descriptors
    • null_stdfds: redirect stdin/stdout/stderr to /dev/null
    • setsid: become the leader of a new session
  • finally, lxc_start is called to complete the container's startup and initialization



static bool do_lxcapi_start(struct lxc_container *c, int useinit, char * const argv[])
{
int ret;
struct lxc_handler *handler;
struct lxc_conf *conf;
char *default_args[] = {
"/sbin/init",
NULL,
};
...

ret = ongoing_create(c);
switch (ret) {
case LXC_CREATE_FAILED:
ERROR("Failed checking for incomplete container creation");
return false;
case LXC_CREATE_ONGOING:
ERROR("Ongoing container creation detected");
return false;
case LXC_CREATE_INCOMPLETE:
ERROR("Failed to create container");
do_lxcapi_destroy(c);
return false;
}

...

/* initialize handler */
handler = lxc_init_handler(NULL, c->name, conf, c->config_path, c->daemonize);
...

/* ... otherwise use default_args. */
if (!argv) {
...
argv = default_args;
}

/* I'm not sure what locks we want here.Any? Is liblxc's locking enough
* here to protect the on disk container? We don't want to exclude
* things like lxc_info while the container is running.
*/
if (c->daemonize) {
bool started;
char title[2048];
pid_t pid_first, pid_second;

pid_first = fork();
...

/* first parent */
if (pid_first != 0) {
...
/* Wait for container to tell us whether it started
* successfully.
*/
started = wait_on_daemonized_start(handler, pid_first);

free_init_cmd(init_cmd);
lxc_put_handler(handler);
return started;
}

/* first child */

/* We don't really care if this doesn't print all the
* characters. All that it means is that the proctitle will be
* ugly. Similarly, we also don't care if setproctitle() fails.
*/
ret = strnprintf(title, sizeof(title), "[lxc monitor] %s %s", c->config_path, c->name);
if (ret > 0) {
ret = setproctitle(title);
if (ret < 0)
INFO("Failed to set process title to %s", title);
else
INFO("Set process title to %s", title);
}

/* We fork() a second time to be reparented to init. Like
* POSIX's daemon() function we change to "/" and redirect
* std{in,out,err} to /dev/null.
*/
pid_second = fork();
if (pid_second < 0) {
SYSERROR("Failed to fork first child process");
_exit(EXIT_FAILURE);
}

/* second parent */
if (pid_second != 0) {
free_init_cmd(init_cmd);
lxc_put_handler(handler);
_exit(EXIT_SUCCESS);
}

/* second child */

/* change to / directory */
ret = chdir("/");
if (ret < 0) {
SYSERROR("Failed to change to \"/\" directory");
_exit(EXIT_FAILURE);
}

ret = inherit_fds(handler, true);
if (ret < 0)
_exit(EXIT_FAILURE);

/* redirect std{in,out,err} to /dev/null */
ret = null_stdfds();
if (ret < 0) {
ERROR("Failed to redirect std{in,out,err} to /dev/null");
_exit(EXIT_FAILURE);
}

/* become session leader */
ret = setsid();
if (ret < 0)
TRACE("Process %d is already process group leader", lxc_raw_getpid());
}
...

reboot:
...

if (useinit)
ret = lxc_execute(c->name, argv, 1, handler, c->config_path,
c->daemonize, &c->error_num);
else
ret = lxc_start(argv, handler, c->config_path, c->daemonize,
&c->error_num);

if (conf->reboot == REBOOT_REQ) {
INFO("Container requested reboot");
conf->reboot = REBOOT_INIT;
goto reboot;
}
...
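The double fork in do_lxcapi_start is the classic POSIX daemon() dance. Below is a minimal standalone sketch (illustrative, not LXC code): the pipe-based status report stands in for wait_on_daemonized_start(), the first child plays the "[lxc monitor]" role, and the second child ends up reparented to init before detaching:

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

/* Sketch of the daemonize sequence: fork twice, let the intermediate
 * process exit so the worker is reparented to init, then chdir("/"),
 * redirect stdio to /dev/null, and setsid(). */
static int daemonize_like_lxc_start(void)
{
    int pipefd[2];
    if (pipe(pipefd) < 0)
        return -1;

    pid_t first = fork();
    if (first < 0)
        return -1;

    if (first != 0) { /* first parent: wait for the report, like lxc-start */
        close(pipefd[1]);
        char ok = 0;
        ssize_t n = read(pipefd[0], &ok, 1);
        close(pipefd[0]);
        waitpid(first, NULL, 0); /* reap the intermediate process */
        return (n == 1 && ok == '1') ? 0 : -1;
    }

    /* first child: the "[lxc monitor]" role forks again */
    pid_t second = fork();
    if (second != 0)
        _exit(second < 0 ? EXIT_FAILURE : EXIT_SUCCESS);

    /* second child: reparented to init; become a proper daemon */
    close(pipefd[0]);
    if (chdir("/") < 0)
        _exit(EXIT_FAILURE);

    int devnull = open("/dev/null", O_RDWR);
    if (devnull < 0 || setsid() < 0) /* setsid: new session, no tty */
        _exit(EXIT_FAILURE);
    dup2(devnull, STDIN_FILENO);
    dup2(devnull, STDOUT_FILENO);
    dup2(devnull, STDERR_FILENO);

    if (write(pipefd[1], "1", 1) != 1) /* tell the original caller we started */
        _exit(EXIT_FAILURE);
    close(pipefd[1]);
    _exit(EXIT_SUCCESS);
}
```

Note that setsid() can only succeed in the second child: a process that is already a process group leader cannot create a new session, which is one reason for the second fork.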


lxc_start

The function lxc_start installs a set of startup callbacks, struct lxc_operations, which are used to run busybox's init process once the container has been initialized; the actual initialization is done by __lxc_start:


static struct lxc_operations start_ops = {
.start = start,
.post_start = post_start
};


int lxc_start(char *const argv[], struct lxc_handler *handler,
const char *lxcpath, bool daemonize, int *error_num)
{
struct start_args start_arg = {
.argv = argv,
};

TRACE("Doing lxc_start");
return __lxc_start(handler, &start_ops, &start_arg, lxcpath, daemonize, error_num);
}


__lxc_start is mainly responsible for initializing the container's namespaces and cgroups, mounting the container's rootfs, and executing the container's init process:

  • lxc_init: initialize the container, e.g. set environment variables and listen for fatal signals (SIGBUS, SIGILL, SIGSEGV); also initialize the cgroup operations
  • attach_block_device: based on the rootfs root directory, determine whether the container needs to be mounted on a block device
  • monitor_create / monitor_delegate_controllers / monitor_enter: cgroup-related operations that set up the cgroup of the container's monitor process
  • resolve_clone_flags / lxc_inherit_namespaces: resolve the process clone flags from the configuration file and check which namespaces the container should inherit
  • lxc_rootfs_init: initialize the container's rootfs and pin its location, so the mount cannot be tampered with during container initialization
  • lxc_spawn: set up the container's execution environment and create the container's init process, completing startup
  • lxc_poll: monitor the state of the container's init process; if the container exits abnormally, perform recovery and resource cleanup

int __lxc_start(struct lxc_handler *handler, struct lxc_operations *ops,
void *data, const char *lxcpath, bool daemonize, int *error_num)
{
int ret, status;
const char *name = handler->name;
struct lxc_conf *conf = handler->conf;
struct cgroup_ops *cgroup_ops;

ret = lxc_init(name, handler);
...
handler->ops = ops;
handler->data = data;
handler->daemonize = daemonize;
cgroup_ops = handler->cgroup_ops;

if (!attach_block_device(handler->conf)) {
ERROR("Failed to attach block device");
ret = -1;
goto out_abort;
}

if (!cgroup_ops->monitor_create(cgroup_ops, handler)) {
ERROR("Failed to create monitor cgroup");
ret = -1;
goto out_abort;
}

if (!cgroup_ops->monitor_delegate_controllers(cgroup_ops)) {
ERROR("Failed to delegate controllers to monitor cgroup");
ret = -1;
goto out_abort;
}

if (!cgroup_ops->monitor_enter(cgroup_ops, handler)) {
ERROR("Failed to enter monitor cgroup");
ret = -1;
goto out_abort;
}

ret = resolve_clone_flags(handler);
...
ret = lxc_inherit_namespaces(handler);
...

/* If the rootfs is not a blockdev, prevent the container from marking
* it readonly.
* If the container is unprivileged then skip rootfs pinning.
*/
ret = lxc_rootfs_init(conf, !list_empty(&conf->id_map));
...

ret = lxc_spawn(handler);
if (ret < 0) {
ERROR("Failed to spawn container \"%s\"", name);
goto out_detach_blockdev;
}

handler->conf->reboot = REBOOT_NONE;

ret = lxc_poll(name, handler);
if (ret) {
ERROR("LXC mainloop exited with error: %d", ret);
goto out_delete_network;
}

if (!handler->init_died && handler->pid > 0) {
ERROR("Child process is not killed");
ret = -1;
goto out_delete_network;
}

status = lxc_wait_for_pid_status(handler->pid);
if (status < 0)
SYSERROR("Failed to retrieve status for %d", handler->pid);

/* If the child process exited but was not signaled, it didn't call
* reboot. This should mean it was an lxc-execute which simply exited.
* In any case, treat it as a 'halt'.
*/
if (WIFSIGNALED(status)) {
int signal_nr = WTERMSIG(status);
switch(signal_nr) {
case SIGINT: /* halt */
DEBUG("%s(%d) - Container \"%s\" is halting", signal_name(signal_nr), signal_nr, name);
break;
case SIGHUP: /* reboot */
DEBUG("%s(%d) - Container \"%s\" is rebooting", signal_name(signal_nr), signal_nr, name);
handler->conf->reboot = REBOOT_REQ;
break;
case SIGSYS: /* seccomp */
DEBUG("%s(%d) - Container \"%s\" violated its seccomp policy", signal_name(signal_nr), signal_nr, name);
break;
default:
DEBUG("%s(%d) - Container \"%s\" init exited", signal_name(signal_nr), signal_nr, name);
break;
}
}
...
}
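The signal-decoding switch at the end of __lxc_start is easy to reproduce in isolation. In the sketch below (illustrative, not LXC code) a forked child stands in for the container's init and kills itself with the requested signal; the parent decodes the wait status exactly as the switch above does, mapping SIGHUP to "reboot" and SIGINT to "halt":

```c
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

/* Decode a signaled child exit the way __lxc_start decodes its init's. */
static const char *decode_init_exit(int sig)
{
    pid_t pid = fork();
    if (pid < 0)
        return "error";

    if (pid == 0) {
        signal(sig, SIG_DFL); /* make sure the signal is fatal */
        raise(sig);           /* container init "requests" halt/reboot */
        _exit(EXIT_FAILURE);  /* not reached for fatal signals */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFSIGNALED(status))
        return "exited";

    switch (WTERMSIG(status)) {
    case SIGINT:
        return "halt";
    case SIGHUP:
        return "reboot"; /* the real code sets conf->reboot = REBOOT_REQ */
    default:
        return "init exited";
    }
}
```

This is how lxc-stop and in-container reboot are implemented on top of plain signals to the container's PID 1.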

lxc_spawn

lxc_spawn is the core function of container startup; it is mainly responsible for creating the container's cgroup and loading /sbin/init as the container's PID 1:

  • lxc_sync_init creates a local socket pair used to synchronize the initialization steps of the parent and child processes
  • cgroup_ops->payload_create creates the container's cgroup (corresponding to lxc.payload.busybox-lxc)
  • since we did not configure any namespaces to inherit, lxc_clone3 is called to create a new process
  • in the child, do_start completes the creation and startup of the container; this is where busybox's init process is started
  • the child's initialization waits on the parent (the lxc monitor) to finish its work; for example, once the cgroup configuration is done, the parent tells the child via lxc_sync_barrier_child to continue initializing
  • after the final configuration, the container's state is set to RUNNING (busybox may not have finished loading at this point)

static int lxc_spawn(struct lxc_handler *handler)
{
...
if (!lxc_sync_init(handler))
return -1;

ret = socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0,
handler->data_sock);
if (ret < 0)
goto out_sync_fini;
data_sock0 = handler->data_sock[0];
data_sock1 = handler->data_sock[1];

if (container_uses_namespace(handler, CLONE_NEWNET)) {
ret = lxc_find_gateway_addresses(handler);
if (ret) {
ERROR("Failed to find gateway addresses");
goto out_sync_fini;
}
}

if (!cgroup_ops->payload_create(cgroup_ops, handler)) {
ERROR("Failed creating cgroups");
goto out_delete_net;
}

/* Create a process in a new set of namespaces. */
// if there are no namespaces to inherit, the branch below runs
if (inherits_namespaces(handler)) {
...
} else {
int cgroup_fd = -EBADF;

struct clone_args clone_args = {
.flags = handler->clone_flags,
.pidfd = ptr_to_u64(&handler->pidfd),
.exit_signal = SIGCHLD,
};
...
/* Try to spawn directly into target cgroup. */
handler->pid = lxc_clone3(&clone_args, CLONE_ARGS_SIZE_VER2);
...

if (handler->pid == 0) {
(void)do_start(handler);
_exit(EXIT_FAILURE);
}
}

...
if (!cgroup_ops->setup_limits_legacy(cgroup_ops, handler->conf, false)) {
ERROR("Failed to setup cgroup limits for container \"%s\"", name);
goto out_delete_net;
}

if (!cgroup_ops->payload_delegate_controllers(cgroup_ops)) {
ERROR("Failed to delegate controllers to payload cgroup");
goto out_delete_net;
}

if (!cgroup_ops->payload_enter(cgroup_ops, handler)) {
ERROR("Failed to enter cgroups");
goto out_delete_net;
}

if (!cgroup_ops->setup_limits(cgroup_ops, handler)) {
ERROR("Failed to setup cgroup limits for container \"%s\"", name);
goto out_delete_net;
}

if (!cgroup_ops->chown(cgroup_ops, handler->conf))
goto out_delete_net;

if (!lxc_sync_barrier_child(handler, START_SYNC_STARTUP))
goto out_delete_net;

...

ret = setup_proc_filesystem(conf, handler->pid);
if (ret < 0) {
ERROR("Failed to setup procfs limits");
goto out_delete_net;
}

ret = setup_resource_limits(conf, handler->pid);
if (ret < 0) {
ERROR("Failed to setup resource limits");
goto out_delete_net;
}

/* Tell the child to continue its initialization. */
if (!lxc_sync_wake_child(handler, START_SYNC_POST_CONFIGURE))
goto out_delete_net;

...

ret = handler->ops->post_start(handler, handler->data);
if (ret < 0)
goto out_abort;

ret = lxc_set_state(name, handler, RUNNING);
...

return 0;
...
}
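The socketpair handshake that lxc_spawn builds on lxc_sync_init / lxc_sync_wake_parent / lxc_sync_barrier_child can be boiled down to a standalone sketch. The SYNC_* values below are invented for the demo, not the real lxc_sync constants: the child announces "configure me", the parent does its setup, then wakes the child:

```c
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

/* Illustrative stand-ins for LXC's sync sequence numbers. */
enum { SYNC_CONFIGURE = 1, SYNC_POST_CONFIGURE = 2 };

static int sync_handshake_demo(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) { /* child: the future container init */
        int msg = SYNC_CONFIGURE;
        close(sv[0]);
        /* like lxc_sync_wake_parent(): tell the parent to configure us */
        if (write(sv[1], &msg, sizeof(msg)) != sizeof(msg))
            _exit(EXIT_FAILURE);
        /* block until the parent has finished cgroup/limits setup */
        if (read(sv[1], &msg, sizeof(msg)) != sizeof(msg) ||
            msg != SYNC_POST_CONFIGURE)
            _exit(EXIT_FAILURE);
        _exit(EXIT_SUCCESS); /* would now go on to exec /sbin/init */
    }

    /* parent: the monitor process */
    int msg = 0, status;
    close(sv[1]);
    if (read(sv[0], &msg, sizeof(msg)) != sizeof(msg) || msg != SYNC_CONFIGURE)
        return -1;
    /* ... setup_limits(), payload_enter() etc. would run here ... */
    msg = SYNC_POST_CONFIGURE;
    if (write(sv[0], &msg, sizeof(msg)) != sizeof(msg))
        return -1;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == EXIT_SUCCESS) ? 0 : -1;
}
```

The barrier matters because steps such as entering the payload cgroup must happen in the parent before the child proceeds; a plain fork gives no such ordering guarantee.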


do_start

By the time execution reaches do_start, we are at the last key step of container startup: a new init process is launched and the execution environment of the minimal busybox system is initialized. The key flow is:

  • lxc_sync_wake_parent: notify the parent with START_SYNC_CONFIGURE, then carry out the container's initialization, such as unsharing into a new cgroup namespace via unshare
  • lxc_setup: container configuration and initialization, e.g. mounting the container's rootfs, setting the container's hostname, and bringing up the container's network devices
  • other initialization: redirect standard I/O to /dev/null, call setsid to start a new session, and set the container's environment variables via lxc_set_environment
  • handler->ops->start, i.e. the start function (see lxc_start), ultimately calls execvp to run the container's executable /sbin/init


static int do_start(void *data)
{
struct lxc_handler *handler = data;
__lxc_unused __do_close int data_sock0 = handler->data_sock[0],
data_sock1 = handler->data_sock[1];
__do_close int devnull_fd = -EBADF, status_fd = -EBADF;
...

/*
* Tell the parent task it can begin to configure the container and wait
* for it to finish.
*/
if (!lxc_sync_wake_parent(handler, START_SYNC_CONFIGURE))
goto out_error;

/* Unshare cgroup namespace after we have setup our cgroups. If we do it
* earlier we end up with a wrong view of /proc/self/cgroup. For
* example, assume we unshare(CLONE_NEWCGROUP) first, and then create
* the cgroup for the container, say /sys/fs/cgroup/cpuset/lxc/c, then
* /proc/self/cgroup would show us:
*
* 8:cpuset:/lxc/c
*
* whereas it should actually show
*
* 8:cpuset:/
*/
if (handler->ns_unshare_flags & CLONE_NEWCGROUP) {
ret = unshare(CLONE_NEWCGROUP);
if (ret < 0) {
if (errno != EINVAL) {
SYSERROR("Failed to unshare CLONE_NEWCGROUP");
goto out_warn_father;
}

handler->ns_clone_flags &= ~CLONE_NEWCGROUP;
SYSINFO("Kernel does not support CLONE_NEWCGROUP");
} else {
INFO("Unshared CLONE_NEWCGROUP");
}
}
...

/* Setup the container, ip, names, utsname, ... */
ret = lxc_setup(handler);
if (ret < 0) {
ERROR("Failed to setup container \"%s\"", handler->name);
goto out_warn_father;
}

...

if (handler->conf->console.pty < 0 && handler->daemonize) {
if (devnull_fd < 0) {
devnull_fd = open_devnull();
if (devnull_fd < 0)
goto out_warn_father;
}

ret = set_stdfds(devnull_fd);
if (ret < 0) {
ERROR("Failed to redirect std{in,out,err} to \"/dev/null\"");
goto out_warn_father;
}
}

...

setsid();
...

/* Reset the environment variables the user requested in a clear
* environment.
*/
ret = clearenv();
/* Don't error out though. */
if (ret < 0)
SYSERROR("Failed to clear environment.");

ret = lxc_set_environment(handler->conf);
if (ret < 0)
goto out_warn_father;

ret = putenv("container=lxc");
if (ret < 0) {
SYSERROR("Failed to set environment variable: container=lxc");
goto out_warn_father;
}

...

/*
* After this call, we are in error because this ops should not return
* as it execs.
*/
handler->ops->start(handler, handler->data);
...
return -1;
}
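The final handler->ops->start step is simply an execvp() that never returns on success: the calling process *becomes* the container's init. A minimal sketch of that pattern (illustrative, not LXC code; it execs /bin/true instead of /sbin/init so the demo terminates):

```c
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

/* Fork a child that replaces itself via execvp(), then collect its status. */
static int exec_like_start_ops(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        char *const argv[] = { "true", NULL };
        execvp("true", argv);
        /* only reached if exec failed, just like the comment in do_start */
        _exit(EXIT_FAILURE);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

In LXC's case there is no fork at this point: do_start itself is already the cloned child, so after the exec the clone created by lxc_clone3 is the container's PID 1.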

Why are there two init processes?

Once the container is up, pstree shows several new processes: lxc-start(83578) is the container's monitor process, and init(83579) is the init process of the minimal busybox system, i.e. PID 1 of the entire container; it runs the initialization scripts that finish booting the container system.


~$ sudo pstree -p |grep lxc
|-lxc-monitord(762)
|-lxcfs(670)-+-{lxcfs}(680)
| `-{lxcfs}(681)
| |-lxc-start(83578)---init(83579)-+-getty(83668)


Looking at the init script /etc/inittab under the container's rootfs directory, /var/lib/lxc/busybox-lxc/rootfs, we can see that the system performs three initialization actions:

  • sysinit runs the system initialization script /etc/init.d/rcS, which for example starts syslogd, mounts all fstab filesystems, and starts the DHCP client
  • respawn means the process is restarted whenever it dies; here a /bin/getty daemon is started to handle system login
  • askfirst asks for user input before running; here it creates an sh process (/bin/sh is actually a symlink to busybox)


::sysinit:/etc/init.d/rcS
tty1::respawn:/bin/getty -L tty1 115200 vt100
console::askfirst:/bin/sh


## /etc/init.d/rcS
#!/bin/sh
/bin/syslogd
/bin/mount -a
/bin/udhcpc

In the previous article we noted that after attaching into the container, two init processes are visible. The process with PID 1 is the container process launched by lxc-start, while the process with PID 16 is actually the one created by the console::askfirst:/bin/sh line; if we comment out console::askfirst:/bin/sh in /var/lib/lxc/busybox-lxc/rootfs/etc/inittab and run top inside the container again, we find that the init process with PID 16 is gone.


Mem: 40201368K used, 549680K free, 1329880K shrd, 738404K buff, 19484428K cached
CPU: 5% usr 0% sys 0% nic 94% idle 0% io 0% irq 0% sirq
Load average: 0.91 1.13 1.47 2/2955 21
PID PPID USER STAT VSZ %VSZ %CPU COMMAND
1 0 root S 2456 0% 0% init
4 1 root S 2456 0% 0% /bin/syslogd
14 1 root S 2456 0% 0% /bin/udhcpc
15 1 root S 2456 0% 0% /bin/getty -L tty1 115200 vt100
16 1 root S 2456 0% 0% init
17 0 root S 2456 0% 0% /bin/sh
21 17 root R 2456 0% 0% top



Author: Jason Wang

Last updated: 2025-07-09, 19:29:59

License: this article is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License
