<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>JasonWang&#39;s Blog</title>
  
  <subtitle>Be true to yourself as a person; play your role in your work</subtitle>
  <link href="https://sniffer.site/atom.xml" rel="self"/>
  
  <link href="https://sniffer.site/"/>
  <updated>2026-01-30T11:07:07.249Z</updated>
  <id>https://sniffer.site/</id>
  
  <author>
    <name>Jason Wang</name>
    
  </author>
  
  <generator uri="https://hexo.io/">Hexo</generator>
  
  <entry>
    <title>MACVLAN Explained</title>
    <link href="https://sniffer.site/2026/01/17/MACVLAN%E8%AF%A6%E8%A7%A3/"/>
    <id>https://sniffer.site/2026/01/17/MACVLAN%E8%AF%A6%E8%A7%A3/</id>
    <published>2026-01-17T02:30:00.000Z</published>
    <updated>2026-01-30T11:07:07.249Z</updated>
    
    <content type="html"><![CDATA[<p><code>MACVLAN</code> (<code>MAC Virtual LAN</code>) is a layer-2 (<code>L2</code>) network virtualization technology provided by the Linux kernel. It lets you create multiple virtual sub-interfaces on top of a single physical interface, each with its own <code>MAC</code> address. Compared with <code>Linux Bridge</code>, MACVLAN removes a layer of packet processing, which gives a simpler network architecture and lower per-packet overhead. <code>MACVLAN</code> is commonly used for container and virtual machine networking, where it gives containers and VMs direct access to the external network.</p><p>Drawing on real deployment scenarios and the <code>Linux 5.15</code> kernel source, this article takes a close look at how <code>MACVLAN</code> is implemented and how it works.</p><span id="more"></span><h2 id="MACVLAN-简介"><a href="#MACVLAN-简介" class="headerlink" title="MACVLAN 简介"></a><strong>MACVLAN Overview</strong></h2><p><code>MACVLAN</code> lets you create multiple virtual sub-interfaces on a single physical network interface (the parent interface). Each sub-interface has its own MAC address, so to other devices on the network it looks like an independent physical device. Compared with other network virtualization techniques, <code>MACVLAN</code> has the following advantages:</p><ul><li><strong>Independent MAC addresses</strong>: each sub-interface has a unique MAC address and can be identified on the network on its own</li><li><strong>Multiple modes</strong>: private, VEPA, bridge, passthru, and source modes are supported</li><li><strong>High performance</strong>: packets do not pass through an extra bridging layer, which reduces processing overhead</li><li><strong>Simpler topology</strong>: no Linux Bridge is needed; traffic goes directly through the physical interface</li></ul><h2 id="MACVLAN实现原理"><a href="#MACVLAN实现原理" class="headerlink" title="MACVLAN实现原理"></a><strong>How MACVLAN Is Implemented</strong></h2><p>The Linux kernel ships a dedicated <code>MACVLAN</code> network driver whose core code lives in <code>drivers/net/macvlan.c</code>. It is built around two key data structures (shown here in simplified form):</p><ul><li><code>struct macvlan_dev</code>: represents one MACVLAN sub-device and holds its own net device, the parent device, the port it belongs to, and its mode</li><li><code>struct macvlan_port</code>: the per-parent-device state, including the list of MACVLAN sub-devices and the MAC address hash table</li></ul><figure class="highlight c"><table><tr><td class="code"><pre><span class="line"><span class="comment">// include/linux/if_macvlan.h and drivers/net/macvlan.c (simplified)</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">macvlan_dev</span> &#123;</span></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">net_device</span>       *<span class="title">dev</span>;</span>           <span class="comment">// the MACVLAN device itself</span></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">net_device</span>       *<span class="title">lowerdev</span>;</span>      <span class="comment">// physical parent device</span></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">macvlan_port</span>     *<span class="title">port</span>;</span>          <span class="comment">// port this device belongs to</span></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">hlist_node</span>       <span class="title">hlist</span>;</span>          <span class="comment">// node in the MAC hash table</span></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">list_head</span>        <span class="title">list</span>;</span>           <span class="comment">// node in the port's device list</span></span><br><span class="line">    <span class="class"><span class="keyword">enum</span> <span class="title">macvlan_mode</span>       <span class="title">mode</span>;</span>           <span class="comment">// operating mode</span></span><br><span class="line">    u16                     flags;          <span class="comment">// device flags</span></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">macvlan_source_entry</span> __<span class="title">rcu</span> *<span class="title">source_list</span>;</span></span><br><span class="line">    <span class="type">unsigned</span> <span class="type">int</span>            macaddr_count;</span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">macvlan_port</span> &#123;</span></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">net_device</span>       *<span class="title">dev</span>;</span>           <span class="comment">// physical parent device</span></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">hlist_head</span>       <span class="title">vlan_hash</span>[<span class="title">MACVLAN_HASH_SIZE</span>];</span> <span class="comment">// MAC address hash table</span></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">list_head</span>        <span class="title">vlans</span>;</span>          <span class="comment">// list of MACVLAN devices</span></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">rcu_head</span>         <span class="title">rcu</span>;</span>            <span class="comment">// for RCU-deferred freeing</span></span><br><span class="line">    <span class="type">bool</span>                    passthru;       <span class="comment">// whether the port is in passthru mode</span></span><br><span class="line">    <span class="type">int</span>                     count;          <span class="comment">// number of MACVLAN devices</span></span><br><span class="line">&#125;;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>For a <code>MACVLAN</code> virtual NIC, almost all of the work happens at the data link layer (L2). The next two sections use device registration and packet reception to show how <code>MACVLAN</code> works.</p><h3 id="MACVLAN-的注册流程"><a href="#MACVLAN-的注册流程" class="headerlink" title="MACVLAN 的注册流程"></a><strong>MACVLAN 
Registration Flow</strong></h3><p>Creating and registering a <code>MACVLAN</code> device involves the following main steps:</p><ul><li>Look up and validate the parent device: check that the parent device exists and can host a <code>MACVLAN</code>.</li><li>Initialize the <code>MACVLAN</code> device: create the <code>MACVLAN</code> port on the parent if this is its first sub-device, then fill in the device fields.</li><li>Register the <code>MACVLAN</code> device: add the <code>MACVLAN</code> device to the kernel's list of network devices.</li></ul><figure class="highlight c"><table><tr><td class="code"><pre><span class="line"><span class="comment">// drivers/net/macvlan.c (simplified; error handling and mode parsing omitted)</span></span><br><span class="line"><span class="type">static</span> <span class="type">int</span> <span class="title function_">macvlan_newlink</span><span class="params">(<span class="keyword">struct</span> net *src_net, <span class="keyword">struct</span> net_device *dev,</span></span><br><span class="line"><span class="params">                            <span class="keyword">struct</span> nlattr *tb[], <span class="keyword">struct</span> nlattr *data[])</span></span><br><span class="line">&#123;</span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">macvlan_dev</span> *<span class="title">vlan</span> =</span> netdev_priv(dev);</span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">net_device</span> *<span class="title">lowerdev</span>;</span></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">macvlan_port</span> *<span class="title">port</span>;</span></span><br><span class="line">    <span class="type">int</span> err;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// 1. Look up and validate the parent device</span></span><br><span class="line">    <span class="keyword">if</span> (!tb[IFLA_LINK])</span><br><span class="line">        <span class="keyword">return</span> -EINVAL;</span><br><span class="line">    lowerdev = __dev_get_by_index(src_net, nla_get_u32(tb[IFLA_LINK]));</span><br><span class="line">    <span class="keyword">if</span> (!lowerdev)</span><br><span class="line">        <span class="keyword">return</span> -ENODEV;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// 2. Make sure the parent device has a MACVLAN port</span></span><br><span class="line">    <span class="keyword">if</span> (!macvlan_port_exists(lowerdev)) &#123;</span><br><span class="line">        <span class="comment">// First MACVLAN on this device: create and initialize the port</span></span><br><span class="line">        err = macvlan_port_create(lowerdev);</span><br><span class="line">        <span class="keyword">if</span> (err)</span><br><span class="line">            <span class="keyword">return</span> err;</span><br><span class="line">    &#125;</span><br><span class="line">    port = macvlan_port_get_rtnl(lowerdev);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// 3. Initialize the MACVLAN device</span></span><br><span class="line">    vlan-&gt;lowerdev = lowerdev;</span><br><span class="line">    vlan-&gt;dev = dev;</span><br><span class="line">    vlan-&gt;port = port;</span><br><span class="line">    vlan-&gt;mode = MACVLAN_MODE_VEPA;  <span class="comment">// VEPA is the default mode</span></span><br><span class="line"></span><br><span class="line">    <span class="comment">// 4. Set the MAC address</span></span><br><span class="line">    <span class="keyword">if</span> (tb[IFLA_ADDRESS])</span><br><span class="line">        eth_hw_addr_set(dev, nla_data(tb[IFLA_ADDRESS]));</span><br><span class="line">    <span class="keyword">else</span></span><br><span class="line">        eth_hw_addr_random(dev);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// 5. Add the device to the port's list and MAC hash table</span></span><br><span class="line">    list_add_tail_rcu(&amp;vlan-&gt;<span class="built_in">list</span>, &amp;port-&gt;vlans);</span><br><span class="line">    macvlan_hash_add(vlan);</span><br><span class="line">    port-&gt;count++;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// 6. Register the network device</span></span><br><span class="line">    err = register_netdevice(dev);</span><br><span class="line">    <span class="keyword">if</span> (err)</span><br><span class="line">        <span class="keyword">goto</span> cleanup;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line"></span><br><span class="line">cleanup:</span><br><span class="line">    macvlan_delete(vlan);</span><br><span class="line">    <span class="keyword">return</span> err;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h3 id="数据包接收流程"><a href="#数据包接收流程" class="headerlink" title="数据包接收流程"></a>Packet Receive Flow</h3><p>When the physical interface receives a packet, it is normally processed in the network softirq and then handed to <code>MACVLAN</code> through the <code>rx_handler</code> registered on the parent device. <code>MACVLAN</code> handles it as follows:</p><ul><li>Get the <code>MACVLAN</code> port: look up the port structure from <code>skb-&gt;dev</code></li><li>Handle multicast&#x2F;broadcast frames: these take a separate path that delivers the frame to the sub-devices that should see it</li><li>Source-MAC matching: deliver the frame to any source-mode sub-device whose allow-list contains the sender's <code>MAC</code> address (this is a forwarding step, not an anti-spoofing check)</li><li>Unicast delivery: look up the destination <code>MAC</code> in the port's hash table and hand the frame to the matching <code>MACVLAN</code> device</li></ul><figure class="highlight c"><table><tr><td class="code"><pre><span class="line"><span class="comment">// drivers/net/macvlan.c (simplified)</span></span><br><span class="line"><span class="type">static</span> <span class="type">rx_handler_result_t</span> <span class="title function_">macvlan_handle_frame</span><span class="params">(<span class="keyword">struct</span> sk_buff **pskb)</span></span><br><span class="line">&#123;</span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">macvlan_port</span> *<span class="title">port</span>;</span></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">sk_buff</span> *<span class="title">skb</span> =</span> *pskb;</span><br><span class="line">    <span class="type">const</span> <span class="class"><span class="keyword">struct</span> <span class="title">ethhdr</span> *<span class="title">eth</span> =</span> eth_hdr(skb);</span><br><span class="line">    <span class="type">const</span> <span class="class"><span class="keyword">struct</span> <span class="title">macvlan_dev</span> *<span class="title">vlan</span>;</span></span><br><span class="line"></span><br><span class="line">    <span class="comment">// Get the MACVLAN port attached to the receiving device</span></span><br><span class="line">    port = macvlan_port_get_rcu(skb-&gt;dev);</span><br><span class="line">    <span class="keyword">if</span> (!port)</span><br><span class="line">        <span class="keyword">return</span> RX_HANDLER_PASS;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Multicast/broadcast frames take a separate path</span></span><br><span class="line">    <span class="keyword">if</span> (is_multicast_ether_addr(eth-&gt;h_dest)) 
&#123;</span><br><span class="line">        <span class="keyword">return</span> macvlan_handle_multicast(pskb, port, eth);</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Source mode: forward to sub-devices whose allow-list contains the source MAC</span></span><br><span class="line">    macvlan_forward_source(skb, port, eth-&gt;h_source);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Passthru mode: use the first (only) MACVLAN device</span></span><br><span class="line">    <span class="keyword">if</span> (macvlan_passthru(port))</span><br><span class="line">        vlan = list_first_or_null_rcu(&amp;port-&gt;vlans, <span class="keyword">struct</span> macvlan_dev, <span class="built_in">list</span>);</span><br><span class="line">    <span class="keyword">else</span></span><br><span class="line">        vlan = macvlan_hash_lookup(port, eth-&gt;h_dest);</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (!vlan || vlan-&gt;mode == MACVLAN_MODE_SOURCE)</span><br><span class="line">        <span class="keyword">return</span> RX_HANDLER_PASS;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Hand the frame to the matching MACVLAN device</span></span><br><span class="line">    skb-&gt;dev = vlan-&gt;dev;</span><br><span class="line">    skb-&gt;pkt_type = PACKET_HOST;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> RX_HANDLER_ANOTHER;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>The transmit path lives mainly in <code>macvlan_queue_xmit</code> and is not covered here; interested readers can trace it in the source.</p><h3 id="工作模式"><a href="#工作模式" class="headerlink" title="工作模式"></a><strong>Operating Modes</strong></h3><p>MACVLAN supports five operating modes, defined as bit flags:</p><ul><li><code>MACVLAN_MODE_PRIVATE</code>: private mode. Sub-interfaces on the same parent are fully isolated from each other and from the host that owns the parent interface. They can only talk to the external network, and the host never forwards packets between them.</li><li><code>MACVLAN_MODE_VEPA</code>: Virtual Ethernet Port Aggregator mode. Sub-interfaces of the same <code>MACVLAN</code> instance may not talk to each other directly. All of their traffic, including traffic to a sibling, must leave through the physical NIC to an external switch, which forwards, filters, or isolates it and sends it back to the target sub-interface. This only works if the switch supports hairpin (reflective relay) forwarding.</li><li><code>MACVLAN_MODE_BRIDGE</code>: bridge mode. <code>MACVLAN</code> devices on the same parent can talk to each other directly and form a virtual network; the kernel forwards traffic between them without going through the physical NIC or an external switch.</li><li><code>MACVLAN_MODE_PASSTHRU</code>: passthru mode. A physical NIC can carry only one sub-interface, which takes over the layer-2 behaviour of the parent. It can reuse the parent's MAC address (or be given its own), its layer-2 traffic maps directly onto the physical NIC, and the kernel does little more than pass packets through.</li><li><code>MACVLAN_MODE_SOURCE</code>: source mode. Frames are accepted based on their source MAC address, using a per-device list of allowed addresses.</li></ul><figure class="highlight c"><table><tr><td class="code"><pre><span class="line"><span class="comment">// include/uapi/linux/if_link.h</span></span><br><span class="line"><span class="class"><span class="keyword">enum</span> <span class="title">macvlan_mode</span> &#123;</span></span><br><span class="line">    MACVLAN_MODE_PRIVATE  = <span class="number">1</span>, </span><br><span class="line">    MACVLAN_MODE_VEPA     = <span class="number">2</span>,</span><br><span class="line">    MACVLAN_MODE_BRIDGE   = <span class="number">4</span>,</span><br><span class="line">    MACVLAN_MODE_PASSTHRU = <span class="number">8</span>,</span><br><span class="line">    MACVLAN_MODE_SOURCE   = <span class="number">16</span>,</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure><h2 id="应用场景"><a href="#应用场景" class="headerlink" title="应用场景"></a><strong>Use Cases</strong></h2><p>As a lightweight network virtualization technology on Linux, <code>MACVLAN</code> is widely used for containers, virtual machines, and virtualized networks.</p><h3 id="容器网络"><a href="#容器网络" class="headerlink" title="容器网络"></a><strong>Container Networking</strong></h3><p>Connect containers directly to the physical network so that each one gets its own IP address:</p><figure class="highlight bash"><table><tr><td class="code"><pre><span class="line"><span class="comment"># Create a MACVLAN network</span></span><br><span class="line">docker network create -d macvlan \</span><br><span class="line">  --subnet=192.168.1.0/24 \</span><br><span class="line">  --gateway=192.168.1.1 \</span><br><span class="line">  -o parent=eth0 \</span><br><span class="line">  macvlan_net</span><br><span class="line"></span><br><span class="line"><span class="comment"># Run a container attached to the MACVLAN network</span></span><br><span class="line">docker run --network macvlan_net \</span><br><span class="line">  --ip=192.168.1.100 \</span><br><span class="line">  -it ubuntu:latest</span><br></pre></td></tr></table></figure><h3 id="虚拟机网络"><a href="#虚拟机网络" class="headerlink" title="虚拟机网络"></a><strong>Virtual Machine Networking</strong></h3><p>Give KVM&#x2F;QEMU virtual machines direct network access:</p><figure class="highlight bash"><table><tr><td class="code"><pre><span class="line"><span class="comment"># Create a MACVLAN interface</span></span><br><span class="line">ip <span class="built_in">link</span> add macvlan0 <span class="built_in">link</span> eth0 <span class="built_in">type</span> macvlan mode bridge</span><br><span class="line">ip <span class="built_in">link</span> <span class="built_in">set</span> macvlan0 address 00:11:22:33:44:55</span><br><span class="line">ip <span class="built_in">link</span> <span class="built_in">set</span> macvlan0 up</span><br><span class="line">ip addr add 192.168.1.200/24 dev macvlan0</span><br></pre></td></tr></table></figure><h3 id="网络隔离"><a href="#网络隔离" class="headerlink" title="网络隔离"></a><strong>Network Isolation</strong></h3><p>Create a separate network interface for each service to isolate services from one another:</p><figure class="highlight bash"><table><tr><td class="code"><pre><span class="line"><span class="comment"># Create a separate MACVLAN interface for each service</span></span><br><span class="line">ip <span class="built_in">link</span> add web-vlan <span class="built_in">link</span> eth0 <span class="built_in">type</span> macvlan mode private</span><br><span class="line">ip <span class="built_in">link</span> add db-vlan <span class="built_in">link</span> eth0 <span class="built_in">type</span> macvlan mode private</span><br><span class="line">ip <span class="built_in">link</span> add cache-vlan <span class="built_in">link</span> eth0 <span class="built_in">type</span> macvlan mode private</span><br><span class="line"></span><br><span class="line"><span class="comment"># Assign IP addresses</span></span><br><span class="line">ip addr add 10.0.1.10/24 dev web-vlan</span><br><span class="line">ip addr add 10.0.1.20/24 dev db-vlan</span><br><span class="line">ip addr add 10.0.1.30/24 dev cache-vlan</span><br></pre></td></tr></table></figure><h2 id="总结"><a href="#总结" class="headerlink" title="总结"></a><strong>Summary</strong></h2><p><code>MACVLAN</code> is a lightweight network virtualization technology that carves a single physical interface into multiple sub-interfaces with independent MAC addresses, all of which share the parent interface's link and physical bandwidth. Compared with Linux Bridge, it removes a layer of packet processing, which is an advantage in both performance and simplicity. Its five operating modes (<code>PRIVATE, VEPA, BRIDGE, PASSTHRU, SOURCE</code>) cover different isolation and communication requirements, which is why it is widely used for container networking, virtual machine networking, and service isolation. Understanding how the kernel implements it helps you make better network architecture decisions in real deployments.</p><h2 id="参考文献"><a href="#参考文献" class="headerlink" title="参考文献"></a>References</h2><ol><li>Linux kernel source: <a href="https://github.com/torvalds/linux/tree/master/drivers/net/macvlan.c">https://github.com/torvalds/linux/tree/master/drivers/net/macvlan.c</a></li><li>MACVLAN kernel documentation: <a href="https://www.kernel.org/doc/Documentation/networking/macvlan.txt">https://www.kernel.org/doc/Documentation/networking/macvlan.txt</a></li><li>Docker MACVLAN documentation: <a href="https://docs.docker.com/network/macvlan/">https://docs.docker.com/network/macvlan/</a></li><li><a href="http://man7.org/linux/man-pages/man8/ip-link.8.html">Linux IP link</a></li><li><a href="https://developers.redhat.com/blog/2018/10/22/introduction-to-linux-interfaces-for-virtual-networking#vxlan">Linux Interfaces for Virtual Networking</a></li></ol>]]></content>
    
    
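    <summary type="html">&lt;p&gt;&lt;code&gt;MACVLAN&lt;/code&gt; (&lt;code&gt;MAC Virtual LAN&lt;/code&gt;) is a layer-2 (&lt;code&gt;L2&lt;/code&gt;) network virtualization technology provided by the Linux kernel. It lets you create multiple virtual sub-interfaces on top of a single physical interface, each with its own &lt;code&gt;MAC&lt;/code&gt; address. Compared with &lt;code&gt;Linux Bridge&lt;/code&gt;, MACVLAN removes a layer of packet processing, which gives a simpler network architecture and lower per-packet overhead. &lt;code&gt;MACVLAN&lt;/code&gt; is commonly used for container and virtual machine networking, where it gives containers and VMs direct access to the external network.&lt;/p&gt;
&lt;p&gt;Drawing on real deployment scenarios and the &lt;code&gt;Linux 5.15&lt;/code&gt; kernel source, this article takes a close look at how &lt;code&gt;MACVLAN&lt;/code&gt; is implemented and how it works.&lt;/p&gt;</summary>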
    
    
    
    <category term="网络协议" scheme="https://sniffer.site/categories/%E7%BD%91%E7%BB%9C%E5%8D%8F%E8%AE%AE/"/>
    
    
    <category term="Linux" scheme="https://sniffer.site/tags/Linux/"/>
    
    <category term="网络虚拟化" scheme="https://sniffer.site/tags/%E7%BD%91%E7%BB%9C%E8%99%9A%E6%8B%9F%E5%8C%96/"/>
    
    <category term="MACVLAN" scheme="https://sniffer.site/tags/MACVLAN/"/>
    
    <category term="容器网络" scheme="https://sniffer.site/tags/%E5%AE%B9%E5%99%A8%E7%BD%91%E7%BB%9C/"/>
    
  </entry>
  
  <entry>
    <title>Love, AI and Death</title>
    <link href="https://sniffer.site/2025/12/31/%E7%88%B1-AI%E4%B8%8E%E6%AD%BB%E4%BA%A1/"/>
    <id>https://sniffer.site/2025/12/31/%E7%88%B1-AI%E4%B8%8E%E6%AD%BB%E4%BA%A1/</id>
    <published>2025-12-31T14:27:26.000Z</published>
    <updated>2026-01-02T06:08:36.868Z</updated>
    
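    <content type="html"><![CDATA[<blockquote><p>A man who dares to waste even one hour of time has not yet learned to cherish the full value of life.</p><pre><code>Darwin</code></pre></blockquote><p><img src="https://md-files.oss-cn-shenzhen.aliyuncs.com/horse-success-1.png" alt="2026-horse"></p><span id="more"></span><p>2026 is quietly arriving. Looking out at the night sky, fewer than six hours remain before the new year. As usual, I am stopping to slow down, look carefully at what happened and what changed over the past year, and write down what I feel, as a way of settling accounts with the past. The title borrows a little from Netflix's animated anthology <a href="https://en.wikipedia.org/wiki/Love,_Death_%26_Robots"><strong>Love, Death &amp; Robots</strong></a>, and names the three themes at the core of this year.</p><h2 id="爱"><a href="#爱" class="headerlink" title="爱"></a><strong>Love</strong></h2><p>When I think of love, I often recall a line from the writer Mu Xin's Literary Memoirs (《文学回忆录》): "Some people love, some people are loved; it is very happy, and also very troublesome." He also said, "Love is a form of self-education." In essence, love is a sincere and deep emotional bond between the "self" and the outside world. It has two sides: one is the "self", the other is the "object outside", such as parents, children, a spouse, friends, colleagues, anything in nature, or even oneself.</p><p>Growing up in a collectivist culture, I was often told to love my parents and my family, but rarely told to love myself, as if reminding someone to love themselves were selfish and shameful. Yet on reflection, every kind of love starts from the "self": how we see ourselves, how we position ourselves, how we face our inner desires, how we handle relationships with the people around us, and how we practice self-love. All happiness and all distress start from the "self" too. If we want our ties to the outside world to bring happiness rather than pain, the first thing to do is to understand and position ourselves clearly, learn to accept things as they are, and keep an open, empty-cup mindset. Here is how I understand love, from three angles.</p><p>My relationship with my parents. For a long time, for lack of real communication, my relationship with my parents was tense, and their years of quarrelling often left me anxious. I did not know how to handle it. I had labelled my father as a conservative, chauvinistic man who favoured sons over daughters and found it hard to think of others, and my mother as a hard-working, long-suffering, stubborn woman who easily lost control of her emotions. Whenever they argued over trivial things, I would lecture them and then blame one side, and everyone ended up stuck in a swamp of bad feelings. Over the last two years, each time I went home and saw my mother's white hair, I gradually realised that my parents are old and that the time they have left with me keeps shrinking. Whenever I think of that, their faults stop mattering. What matters is that I truly love them and want them to be healthy and happy. Given that, I have no reason to blame them. Instead I try to see things from their side, talk with them, and learn more about their past. In turn, because I show that I care, they trust me more and talk with me about more things, and our time together is more relaxed and pleasant. So the root of the problem between my parents and me lies in "me": in how I see my relationship with them, and in whether I can look at their "problems" in a different way or from a different angle.</p><p>My relationship with my daughter. Because I am sometimes too strict and lose my temper with her, she has always felt somewhat distant from me and prefers to be with her mother. I saw the difference but changed very slowly. Looking back, the core issue is that I did not understand how to handle my relationship with her, just as with my parents. I had defined her as an immature, playful child with no self-control, and completely overlooked what shines in her: she sticks with the things she likes even when she is tired; she insists on lining up her shoes and belongings neatly; when I am in a meeting or working, she tries to play by herself and not disturb me. Thinking it over, the source of every problem is me. I am too self-centred and insist on my own authority; I fail to sense the changes in her feelings and did not give her comfort and a sense of security when she cried; I do not manage my own emotions well and have not found a good way to express the love I feel.</p><p>My relationship with my body. When we are young we tend to neglect our health to some degree: eating carelessly, staying up late, chasing illusory things without restraint, as if the body were merely a tool of the "self". We forget that the body is part of the "self". If "I" want to live more happily, more contentedly, and more peacefully, I must first learn to take care of my body, maintain it, and keep body and mind in harmony, rather than treating the body as something outside the "self".</p><p>What is love? Love is a discipline of self-cultivation, a continual self-education. We experience love because, through it, we can feel our own existence more fully, become more fully ourselves, and help the people around us become better.</p><h2 id="AI"><a href="#AI" class="headerlink" title="AI"></a>AI</h2><p>2025 was a year of great leaps in AI capability. If the arrival of ChatGPT marked year one of large language models, then 2025 was the year large models burst past the limits of what we imagined productivity could be. The pace of AI development clearly accelerated.</p><ol><li>Early in the year, <code>Deepseek-R1</code> appeared and let everyone feel the delight and convenience of a sudden jump in AI</li><li>Nvidia's revenue and profit kept growing fast, and its market capitalization passed an astonishing 5 trillion US dollars, truly rivalling the wealth of nations</li><li>Agent tools such as Claude Code&#x2F;Cursor&#x2F;Manus AI appeared and broadened the range of AI applications</li><li>At the end of the year, the capability jump in Google's Gemini 3 and Nano Banana was genuinely astonishing</li><li>Robots went from a geek's toy to versatile performers dancing on stage alongside people</li></ol><p>Today, large AI models can drive more skilfully and can also take top prizes in international mathematics competitions; they can write code and can also handle more complex user tasks from multimodal signals; they can play chess and video games and can also do scientific exploration, finding patterns that humans could not find, and doing so more efficiently. Large AI models are quietly changing how people live, and a new technological revolution has arrived. How should our society respond to it?</p><ul><li>Knowledge and skills alone matter less; what matters more is creativity and the ability to discover and pose problems</li><li>Traditional schooling urgently needs to change: it can no longer centre on transmitting knowledge and should centre on cultivating creativity</li><li>Many jobs, such as copywriters, news editors, customer service and after-sales staff, and junior accountants and financial analysts, will gradually disappear and be replaced by AI agents</li><li>Many business processes inside organizations will gradually be taken over by AI agents and will run faster and more efficiently, with people acting merely as collaborators to those agents</li></ul><p>It is time for humanity to rethink its relationship with machine intelligence. In the future, robots will enter every household and become another family member; falling in love with, or even marrying, a robot that can "think" may not seem strange; intelligent robots may evolve on their own and become a new species on Earth. One day, robots may really become humanity's competitors, as depicted in Westworld (《西部世界》). How, then, should people position themselves? Should an intelligent robot be defined as a tool or as a partner? Will human social relationships have to be rebuilt? No one knows. Until artificial general intelligence (AGI) arrives, as individuals we can only keep creating and providing more value if we are to have any chance of competing with AI agents.</p><h2 id="死亡"><a href="#死亡" class="headerlink" title="死亡"></a>Death</h2><p>Before I turned thirty, the subject of death rarely entered my mind, and even when it did, it did not frighten me much. But as I grow older and see my parents ageing, my daughter growing up, and relatives passing away one by one, the fear of death surfaces from time to time. Yet the more I think about death, the more clear-headed and self-aware I become. The fear gradually fades, and I pay more attention to the good things in the present.</p><p>"Death is nothing to us; when we exist, death is not present, and when death is present, we no longer exist," said Epicurus. Death, then, is only our fear of the unknown, because we can never experience it. What we can experience is life: its sweetness and bitterness, a family member's embrace, a parent's greeting, a child's smile; the rhythm of music, the charm of literature, the boundlessness of imagination, every small good thing in life. Seen this way, whether we reason about it or simply live in the world, death should not become an inner fear. It is just one part of nature's cycle of life.</p><p>On the contrary, for the sake of being alive, we should work to build a better experience of existence: to create, to think, to explore, to provide value, to put down our prejudices and preconceptions, and to do things that are truly meaningful and influential.</p><p>May 2026 bring swift success (马到成功). Happy New Year!</p>]]></content>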
    
    
    <summary type="html">&lt;blockquote&gt;&lt;p&gt;A man who dares to waste even one hour of time has not yet learned to cherish the full value of life.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Darwin
&lt;/code&gt;&lt;/pre&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;img src=&quot;https://md-files.oss-cn-shenzhen.aliyuncs.com/horse-success-1.png&quot; alt=&quot;2026-horse&quot;&gt;&lt;/p&gt;</summary>
    
    
    
    <category term="思考" scheme="https://sniffer.site/categories/%E6%80%9D%E8%80%83/"/>
    
    
    <category term="成长" scheme="https://sniffer.site/tags/%E6%88%90%E9%95%BF/"/>
    
    <category term="探索" scheme="https://sniffer.site/tags/%E6%8E%A2%E7%B4%A2/"/>
    
  </entry>
  
  <entry>
    <title>PELT and WALT in Process Scheduling</title>
    <link href="https://sniffer.site/2025/10/02/%E8%BF%9B%E7%A8%8B%E8%B0%83%E5%BA%A6%E4%B8%AD%E7%9A%84PELT%E4%B8%8EWALT/"/>
    <id>https://sniffer.site/2025/10/02/%E8%BF%9B%E7%A8%8B%E8%B0%83%E5%BA%A6%E4%B8%AD%E7%9A%84PELT%E4%B8%8EWALT/</id>
    <published>2025-10-02T02:58:01.000Z</published>
    <updated>2025-10-02T03:54:25.603Z</updated>
    
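    <content type="html"><![CDATA[<p>Whenever a scheduling decision is made, the task scheduler has to decide which CPU a given process should run on, and it also has to adjust the CPU frequency in real time according to the current system load. To suit devices as different as servers and embedded systems, the Linux process scheduler usually has to trade off several performance goals against one another:</p><ul><li>Scheduling should be as fair as possible, so that every task gets a chance to run</li><li>User requests should be served quickly; for interactive tasks, for example, scheduling latency must be kept low so that the task runs promptly</li><li>System throughput should be high, so that more tasks can run concurrently</li><li>Power consumption should be kept as low as possible, by lowering the operating frequency when the system is idle or the load drops</li></ul><span id="more"></span><p>These goals generally conflict with one another, and before the Completely Fair Scheduler (<code>CFS</code>) appeared, earlier schedulers did not handle them well. <code>Linux</code> originally used the <code>O(n)</code> scheduler: on every scheduling pass it scanned all processes on the ready queue, computed their priorities, and picked the highest-priority task to run. Its scheduling time grows linearly with the number of processes, so it scales poorly. To address this, the <code>O(1)</code> scheduler was introduced into the kernel in 2002. Its basic idea is to split a process's priority into a dynamic priority and a static priority: the static priority determines the length of the time slice a task may run, while the dynamic priority is used at scheduling time, and each scheduling pass picks the task with the highest dynamic priority. The <code>O(1)</code> scheduler keeps runnable tasks in an array indexed by dynamic priority, so it can pick the best task in <code>O(1)</code> time.</p><p>Although the <code>O(1)</code> scheduler solved the scheduling-time problem, it could not reliably tell interactive tasks from batch tasks, which led to poor interactive latency. <code>Ingo Molnar</code>, the author of the <code>O(1)</code> scheduler, implemented a brand-new scheduler, <code>CFS</code>, building on <code>RSDL (The Rotating Staircase Deadline Scheduler)</code> by <code>Con Kolivas</code>. <code>CFS</code> takes a different approach: instead of picking tasks purely by priority, it decides which task to schedule from the <code>CPU</code> time each process has consumed. It derives a virtual runtime (<code>vruntime</code>) for each task from the task's priority, and at scheduling time it simply picks the task with the smallest <code>vruntime</code>.</p><p><code>CFS</code> is sufficiently <code>fair</code>, but it does not take a task's <code>lag</code> into account. For example, if a task was expected to run for <code>20ms</code> but actually received only <code>10ms</code> of <code>CPU</code> time, the difference is regarded as that task's lag. Under the <code>CFS</code> policy, in extreme cases such a task may not be scheduled immediately, which delays its response. To reduce this scheduling latency, the <a href="https://lwn.net/Articles/925371/">new scheduler (<code>EEVDF (Earliest Eligible Virtual Deadline First)</code>)</a> computes a deadline (<code>deadline</code>) for each task based on this lag, and at scheduling time it picks, among the eligible tasks, the one with the earliest deadline, thereby improving scheduling latency.</p><p><code>CFS</code> works well on symmetric <code>SMP</code> systems, because scheduling does not have to consider differences between individual <code>CPU</code>s, but it struggles with more complex cases such as <code>AMP</code> (asymmetric systems) and <code>HMP</code> (heterogeneous systems). In kernels before <code>3.8</code>, the <code>CFS</code> scheduler tracked load per run queue (<code>PRLT, Per Runqueue Load Tracking</code>). <code>PRLT</code> has an obvious drawback: it cannot tell how much each process contributes to the overall load, so it is hard to judge accurately how heavy a task is. On <code>ARM</code> big.LITTLE architectures this can lead to task misplacement, for example a <code>heavy</code> task placed on a little core while a <code>light</code> task goes to a big core, which lowers the efficiency of the system. Starting with version <code>3.8</code>, the <code>Linux</code> kernel introduced <code>PELT (Per-Entity Load Tracking)</code> to track system load more accurately. To address the slow response of <code>PELT</code>, Qualcomm proposed the <code>WALT (Window Assisted Load Tracking)</code> algorithm, and all Qualcomm platforms currently use <code>WALT</code> by default to track load.</p><p>This article introduces how the load-tracking algorithms <code>PELT</code> and <code>WALT</code> are implemented and compares the two.</p><blockquote><p>This article is based on Linux kernel version 5.4</p></blockquote><h2 id="PELT"><a href="#PELT" class="headerlink" title="PELT"></a><strong>PELT</strong></h2><p>The <code>Entity</code> in <code>PELT (Per-Entity Load Tracking)</code> is a scheduling entity, which can be a single process or all the processes in a <code>cgroup</code>. To track the load of each <code>Entity</code>, system time (physical time, not virtual scheduling time) is divided into a sequence of <code>1024us</code> (roughly <code>1ms</code>) periods. Within each <code>1024us</code> period, an entity's contribution to system load is computed from the time it spends in the <code>runnable</code> state (running, or waiting for a CPU): if the <code>runnable</code> time within the period is <code>x</code>, the entity's contribution to system load is <code>x/1024</code>. To account for load over a longer time span (more than <code>1ms</code>), the loads of multiple periods are combined with weights (the further a period lies in the past, the less it contributes to the current load) to obtain the overall load. Let $L_i$ denote the entity's load in period $p_i$, where $p_0$ is the current period; the entity's contribution to system load can then be written as an exponentially weighted moving average (EWMA):</p><p>$$ L &#x3D; L_0 + L_1 * y + L_2 * y^2 + L_3 * y^3 + … &#x3D; \sum_{i&#x3D;0}^n L_i * y^i $$</p><p>Here <code>y</code> is the decay factor, chosen so that $y^{32} &#x3D; 0.5$, i.e. $y ≈ 0.9786$: after 32 periods, an entity's load carries a weight of <code>0.5</code> in the current system load, while the load of the current period carries a weight of <code>1</code>. Similarly, rolling forward by one period (<code>1ms</code>) gives the task's new load contribution $L'$, where $L_0'$ is the load of the new current period:</p><p>$$ L' &#x3D; L_0' + y * (L_0 + L_1 * y + L_2 * y^2 + L_3 * y^3 + …) &#x3D; L_0' + L_0 * y + L_1 * y^2 + L_2 * y^3 + L_3 * y^4 + … $$</p><p>So the new load is simply the entity's load in the current period plus the previous total multiplied by the decay factor <code>y</code>. To compute the current load we therefore generally only need $val * y^n$. Since $y^{32} &#x3D; \frac{1}{2}$, the computation of $y^n$ can be rewritten as:</p><p>$$ y^n &#x3D; \frac{1}{2^{\lfloor n/p \rfloor}} * y^{n \bmod p} $$</p><p>Here <code>p</code> is the half-life of the load decay, normally <code>32</code> periods (about <code>32ms</code>). In practice only the values for <code>0 &lt;= n &lt; 32</code> need to be precomputed; for <code>n &gt;= 32</code> the result follows from the formula above, for example for <code>n=35</code>:</p><p>$$ y^{35} &#x3D; 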
\frac{1}{2^{\lfloor 35/32 \rfloor}} * y^{35 \bmod 32} &#x3D; \frac{1}{2} * y^3 $$</p><p>To avoid floating-point arithmetic and keep the computation efficient, the kernel precomputes $y^n * 2^{32}$ for <code>0 &lt;= n &lt; 32</code> and stores the values in the <code>runnable_avg_yN_inv</code> array, so that at run time $y^n$ is obtained by a table lookup. The code that generates the table is in <code>Documentation/scheduler/sched-pelt.c</code>:</p><p>$$ \text{runnable\_avg\_yN\_inv}[i] &#x3D; (2^{32} - 1) * y^i $$</p><p>With this precomputed table, the kernel uses the function <code>decay_load</code> to decay a load value by the number of periods that have elapsed:</p><figure class="highlight c"><table><tr><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * Approximate:</span></span><br><span class="line"><span class="comment"> *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="type">static</span> u64 <span class="title function_">decay_load</span><span class="params">(u64 val, u64 
n)</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="type">unsigned</span> <span class="type">int</span> local_n;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (unlikely(n &gt; LOAD_AVG_PERIOD * <span class="number">63</span>))</span><br><span class="line"><span class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line"></span><br><span class="line"><span class="comment">/* after bounds checking we can collapse to 32-bit */</span></span><br><span class="line">local_n = n;</span><br><span class="line"></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * As y^PERIOD = 1/2, we can combine</span></span><br><span class="line"><span class="comment"> *    y^n = 1/2^(n/PERIOD) * y^(n%PERIOD)</span></span><br><span class="line"><span class="comment"> * With a look-up table which covers y^n (n&lt;PERIOD)</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> * To achieve constant time decay_load.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="keyword">if</span> (unlikely(local_n &gt;= LOAD_AVG_PERIOD)) &#123;</span><br><span class="line">val &gt;&gt;= local_n / LOAD_AVG_PERIOD;</span><br><span class="line">local_n %= LOAD_AVG_PERIOD;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">val = mul_u64_u32_shr(val, runnable_avg_yN_inv[local_n], <span class="number">32</span>);</span><br><span class="line"><span class="keyword">return</span> val;</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Computing the load at the current moment is somewhat more involved in practice; see the function <code>accumulate_sum</code> in the kernel source file <code>kernel/sched/pelt.c</code>. To track load, the kernel adds to the task structure <code>struct task_struct</code> (through its embedded <code>struct sched_entity</code>) a <code>struct 
sched_avg</code> structure that holds the load-related state:</p><ul><li><code>last_update_time</code>: the time at which the load was last updated; the corresponding <code>load_avg/util_avg</code> values are computed from the time elapsed since then</li><li><code>load_sum/runnable_load_sum/util_sum</code>: load values obtained from the decaying weighted series described above. <code>load_sum</code> is the decayed sum of the load over past periods and covers time in both the <code>running</code> and <code>runnable</code> states; <code>util_sum</code> covers only <code>running</code> time. For an ordinary task, <code>runnable_load_sum</code> equals <code>load_sum</code>; for a group scheduling entity, <code>runnable_load_sum</code> is the sum of the <code>running+runnable</code> time of all tasks in the group</li><li><code>period_contrib</code>: an intermediate value that holds the part of the current, still incomplete period that has already been accounted for</li><li><code>load_avg/runnable_avg/util_avg</code>: the averages computed from the corresponding <code>*_sum</code> values</li><li><code>util_est</code>: once a task blocks, its load keeps decaying. If a heavy task stays blocked for a long time, the load computed by standard <code>PELT</code> becomes very small, and when the task wakes up this small value can lead the scheduler to a wrong decision. This member was introduced to record the task's utilization (<code>util_avg</code>) from before it blocked</li></ul><figure class="highlight c"><table><tr><td class="code"><pre><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">sched_avg</span> &#123;</span></span><br><span class="line">u64 last_update_time;</span><br><span class="line">u64 load_sum;</span><br><span class="line">u64 runnable_sum;</span><br><span class="line">u32 util_sum;</span><br><span class="line">u32 period_contrib;</span><br><span class="line"><span class="type">unsigned</span> <span class="type">long</span> load_avg;</span><br><span class="line"><span class="type">unsigned</span> <span class="type">long</span> runnable_avg;</span><br><span class="line"><span 
class="type">unsigned</span> <span class="type">long</span> util_avg;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">util_est</span> <span class="title">util_est</span>;</span></span><br><span class="line">&#125; ____cacheline_aligned;</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>The <code>Linux</code> scheduler keeps the load up to date whenever it schedules a task or a timer tick arrives. For example, the load is updated when a task is enqueued (<code>enqueue_task_fair</code>), dequeued (<code>dequeue_task_fair</code>), or moved to a different <code>cgroup</code> (<code>task_change_group_fair</code>). Taking the enqueue path <code>enqueue_entity</code> as an example, once <code>update_curr</code> has updated the runtime accounting of the current task, <code>update_load_avg</code> is called to update the load.</p><figure class="highlight c"><table><tr><td class="code"><pre><span class="line"></span><br><span class="line"><span class="type">static</span> <span class="type">void</span></span><br><span class="line"><span class="title function_">enqueue_entity</span><span class="params">(<span class="keyword">struct</span> cfs_rq *cfs_rq, <span class="keyword">struct</span> sched_entity *se, <span class="type">int</span> flags)</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="type">bool</span> renorm = !(flags &amp; ENQUEUE_WAKEUP) || (flags &amp; ENQUEUE_MIGRATED);</span><br><span class="line"><span class="type">bool</span> curr = cfs_rq-&gt;curr == se;</span><br><span class="line"></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * If we&#x27;re the current task, we must renormalise before calling</span></span><br><span class="line"><span class="comment"> * update_curr().</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="keyword">if</span> (renorm &amp;&amp; curr)</span><br><span class="line">se-&gt;vruntime += cfs_rq-&gt;min_vruntime;</span><br><span class="line"></span><br><span class="line">update_curr(cfs_rq);</span><br><span 
class="line"></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * Otherwise, renormalise after, such that we&#x27;re placed at the current</span></span><br><span class="line"><span class="comment"> * moment in time, instead of some random moment in the past. Being</span></span><br><span class="line"><span class="comment"> * placed in the past could significantly boost this task to the</span></span><br><span class="line"><span class="comment"> * fairness detriment of existing tasks.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="keyword">if</span> (renorm &amp;&amp; !curr)</span><br><span class="line">se-&gt;vruntime += cfs_rq-&gt;min_vruntime;</span><br><span class="line"></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * When enqueuing a sched_entity, we must:</span></span><br><span class="line"><span class="comment"> *   - Update loads to have both entity and cfs_rq synced with now.</span></span><br><span class="line"><span class="comment"> *   - Add its load to cfs_rq-&gt;runnable_avg</span></span><br><span class="line"><span class="comment"> *   - For group_entity, update its weight to reflect the new share of</span></span><br><span class="line"><span class="comment"> *     its group cfs_rq</span></span><br><span class="line"><span class="comment"> *   - Add its new weight to cfs_rq-&gt;load.weight</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line">update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);  <span class="comment">/* update the load of se and cfs_rq, and the task group load */</span></span><br><span class="line">se_update_runnable(se);</span><br><span class="line">update_cfs_group(se);</span><br><span class="line">account_entity_enqueue(cfs_rq, se);</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (flags &amp; 
ENQUEUE_WAKEUP)</span><br><span class="line">place_entity(cfs_rq, se, <span class="number">0</span>);</span><br><span class="line"></span><br><span class="line">check_schedstat_required();</span><br><span class="line">update_stats_enqueue(cfs_rq, se, flags);</span><br><span class="line">check_spread(cfs_rq, se);</span><br><span class="line"><span class="keyword">if</span> (!curr)</span><br><span class="line">__enqueue_entity(cfs_rq, se);</span><br><span class="line">se-&gt;on_rq = <span class="number">1</span>;</span><br><span class="line"></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * When bandwidth control is enabled, cfs might have been removed</span></span><br><span class="line"><span class="comment"> * because of a parent been throttled but cfs-&gt;nr_running &gt; 1. Try to</span></span><br><span class="line"><span class="comment"> * add it unconditionnally.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="keyword">if</span> (cfs_rq-&gt;nr_running == <span class="number">1</span> || cfs_bandwidth_used())</span><br><span class="line">list_add_leaf_cfs_rq(cfs_rq);</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (cfs_rq-&gt;nr_running == <span class="number">1</span>)</span><br><span class="line">check_enqueue_throttle(cfs_rq);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>The main job of <code>update_load_avg</code> is to update the load state: the load of the scheduling entity itself, the load of its run queue, and the load of the task group the entity belongs to:</p><ul><li><code>__update_load_avg_se</code>: updates the load of the current scheduling entity</li><li><code>update_cfs_rq_load_avg</code>: updates the load of the run queue (<code>cfs_rq</code>) the entity belongs to</li><li><code>propagate_entity_load_avg</code>: propagates the entity's load through its control group hierarchy</li><li><code>cfs_rq_util_change</code>: if the load has changed, notifies the CPU frequency scaling module so that it can pick a suitable CPU frequency</li></ul><figure class="highlight c"><table><tr><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment">/* Update task and its cfs_rq load average */</span></span><br><span class="line"><span class="type">static</span> <span class="keyword">inline</span> <span class="type">void</span> <span class="title function_">update_load_avg</span><span class="params">(<span class="keyword">struct</span> cfs_rq *cfs_rq, <span class="keyword">struct</span> sched_entity *se, <span class="type">int</span> flags)</span></span><br><span class="line">&#123;</span><br><span class="line">u64 now = cfs_rq_clock_pelt(cfs_rq);</span><br><span class="line"><span class="type">int</span> decayed;</span><br><span class="line"></span><br><span 
class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * Track task load average for carrying it to new CPU after migrated, and</span></span><br><span class="line"><span class="comment"> * track group sched_entity load average for task_h_load calc in migration</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="keyword">if</span> (se-&gt;avg.last_update_time &amp;&amp; !(flags &amp; SKIP_AGE_LOAD))</span><br><span class="line">__update_load_avg_se(now, cfs_rq, se);</span><br><span class="line"></span><br><span class="line">decayed  = update_cfs_rq_load_avg(now, cfs_rq);</span><br><span class="line">decayed |= propagate_entity_load_avg(se);</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!se-&gt;avg.last_update_time &amp;&amp; (flags &amp; DO_ATTACH)) &#123;</span><br><span class="line"></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * DO_ATTACH means we&#x27;re here from enqueue_entity().</span></span><br><span class="line"><span class="comment"> * !last_update_time means we&#x27;ve passed through</span></span><br><span class="line"><span class="comment"> * migrate_task_rq_fair() indicating we migrated.</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> * IOW we&#x27;re enqueueing a task on a new CPU.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line">attach_entity_load_avg(cfs_rq, se);</span><br><span class="line">update_tg_load_avg(cfs_rq);</span><br><span class="line"></span><br><span class="line">&#125; <span class="keyword">else</span> <span class="keyword">if</span> (decayed) &#123;</span><br><span class="line">cfs_rq_util_change(cfs_rq, <span class="number">0</span>);</span><br><span class="line"></span><br><span class="line"><span 
class="keyword">if</span> (flags &amp; UPDATE_TG)</span><br><span class="line">update_tg_load_avg(cfs_rq);</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><h2 id="WALT"><a href="#WALT" class="headerlink" title="WALT"></a><strong>WALT</strong></h2><p><code>WALT (Window Assisted Load Tracking)</code> is a load-tracking algorithm based on time windows (the default window is typically <code>16ms</code>; Qualcomm's code adjusts the window size according to the timer interrupt frequency <code>HZ</code>), proposed by Qualcomm for mobile phone platforms. Its basic idea is to compute task load over a fixed time window while also taking the load of the past few windows into account (by default the average over <code>5</code> windows); the task's current load is taken as the maximum of the most recent window's load and the historical average.</p><p><code>WALT</code> embeds a <code>struct walt_task_struct</code> in <code>struct task_struct</code> to hold load-related information, including the load of the current window, the load history, and the load normalized to <code>CPU</code> capacity:</p><ul><li><code>mark_start</code>: timestamp marking the beginning of an event (the task waking up, starting to execute, or being preempted) within a window</li><li><code>sum</code>: the task's total runnable time in the current window (running time plus wait time, frequency scaled)</li><li><code>sum_history</code>: the history of <code>sum</code> over the previous windows (windows in which the task was entirely asleep are ignored)</li><li><code>demand</code>: the maximum <code>sum</code> seen over the previous windows, which can be used to drive the CPU frequency</li><li><code>curr_window_cpu</code>: the task's run time on each <code>CPU</code> in the current window</li><li><code>prev_window_cpu</code>: the task's run time on each <code>CPU</code> in the previous window</li><li><code>pred_demand</code>: the task's currently predicted load (used for <code>EAS</code>, energy-aware scheduling)</li><li><code>demand_scaled</code>: the task's demand normalized to a scale of 1024</li></ul><figure class="highlight c"><table><tr><td class="code"><pre><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">walt_task_struct</span> &#123;</span></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * &#x27;mark_start&#x27; marks the beginning of an event (task waking up, task</span></span><br><span 
class="comment"> * starting to execute, task being preempted) within a window</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> * &#x27;sum&#x27; represents how runnable a task has been within current</span></span><br><span class="line"><span class="comment"> * window. It incorporates both running time and wait time and is</span></span><br><span class="line"><span class="comment"> * frequency scaled.</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> * &#x27;sum_history&#x27; keeps track of history of &#x27;sum&#x27; seen over previous</span></span><br><span class="line"><span class="comment"> * RAVG_HIST_SIZE windows. Windows where task was entirely sleeping are</span></span><br><span class="line"><span class="comment"> * ignored.</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> * &#x27;demand&#x27; represents maximum sum seen over previous</span></span><br><span class="line"><span class="comment"> * sysctl_sched_ravg_hist_size windows. 
&#x27;demand&#x27; could drive frequency</span></span><br><span class="line"><span class="comment"> * demand for tasks.</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> * &#x27;curr_window_cpu&#x27; represents task&#x27;s contribution to cpu busy time on</span></span><br><span class="line"><span class="comment"> * various CPUs in the current window</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> * &#x27;prev_window_cpu&#x27; represents task&#x27;s contribution to cpu busy time on</span></span><br><span class="line"><span class="comment"> * various CPUs in the previous window</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> * &#x27;curr_window&#x27; represents the sum of all entries in curr_window_cpu</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> * &#x27;prev_window&#x27; represents the sum of all entries in prev_window_cpu</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> * &#x27;pred_demand&#x27; represents task&#x27;s current predicted cpu busy time</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> * &#x27;busy_buckets&#x27; groups historical busy time into different buckets</span></span><br><span class="line"><span class="comment"> * used for prediction</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> * &#x27;demand_scaled&#x27; represents task&#x27;s demand scaled to 1024</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line">u64 mark_start;</span><br><span class="line">u32 sum, demand;</span><br><span class="line">u32 coloc_demand;</span><br><span class="line">u32 sum_history[RAVG_HIST_SIZE_MAX];</span><br><span class="line">u32 *curr_window_cpu, *prev_window_cpu;</span><br><span class="line">u32 curr_window, prev_window;</span><br><span class="line">u32 pred_demand;</span><br><span class="line">u8 busy_buckets[NUM_BUSY_BUCKETS];</span><br><span class="line">u16 demand_scaled;</span><br><span class="line">u16 pred_demand_scaled;</span><br><span class="line">u64 active_time;</span><br><span class="line"><span class="type">int</span> boost;</span><br><span class="line"><span class="type">bool</span> wake_up_idle;</span><br><span class="line"><span class="type">bool</span> misfit;</span><br><span class="line"><span class="type">bool</span> rtg_high_prio;</span><br><span class="line">u8 low_latency;</span><br><span class="line">u64 boost_period;</span><br><span class="line">u64 boost_expires;</span><br><span class="line">u64 last_sleep_ts;</span><br><span class="line">u32 init_load_pct;</span><br><span class="line">u32 unfilter;</span><br><span class="line">u64 last_wake_ts;</span><br><span class="line">u64 last_enqueued_ts;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">walt_related_thread_group</span> __<span class="title">rcu</span> *<span class="title">grp</span>;</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">list_head</span> <span class="title">grp_list</span>;</span></span><br><span class="line">u64 cpu_cycles;</span><br><span class="line"><span class="type">cpumask_t</span> cpus_requested;</span><br><span class="line"><span class="type">bool</span> iowaited;</span><br><span class="line">&#125;;</span><br><span 
class="line"></span><br></pre></td></tr></table></figure><p><code>WALT</code> load computation is triggered by system events: the timer tick, task state changes (task creation, wakeup, context switch and so on), load balancing, and interrupts all cause the task's load to be recomputed. For example, when the scheduling function <code>__schedule</code> runs, it calls <code>walt_update_task_ravg</code> to update the task's load:</p><ul><li>It first calls <code>update_window_start</code> to update the start time of the current window</li><li><code>update_task_rq_cpu_cycles</code> updates the CPU cycle count of the run queue</li><li><code>update_task_demand</code> updates the task's load; this is the key function of the <code>WALT</code> algorithm</li><li><code>update_cpu_busy_time</code> updates the CPU's busy time within the window</li><li><code>update_task_pred_demand</code> updates the task's predicted load</li><li><code>run_walt_irq_work</code> queues an <code>irq_work</code> that is used to update the operating frequency of the current CPU</li></ul><figure class="highlight c"><table><tr><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment">/* Reflect task activity on its demand and cpu&#x27;s busy time statistics */</span></span><br><span class="line"><span class="type">void</span> 
<span class="title function_">walt_update_task_ravg</span><span class="params">(<span class="keyword">struct</span> task_struct *p, <span class="keyword">struct</span> rq *rq, <span class="type">int</span> event,</span></span><br><span class="line"><span class="params">u64 wallclock, u64 irqtime)</span></span><br><span class="line">&#123;</span><br><span class="line">u64 old_window_start;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!rq-&gt;wrq.window_start || p-&gt;wts.mark_start == wallclock)</span><br><span class="line"><span class="keyword">return</span>;</span><br><span class="line"></span><br><span class="line">lockdep_assert_held(&amp;rq-&gt;lock);</span><br><span class="line"></span><br><span class="line">old_window_start = update_window_start(rq, wallclock, event);</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!p-&gt;wts.mark_start) &#123;</span><br><span class="line">update_task_cpu_cycles(p, cpu_of(rq), wallclock);</span><br><span class="line"><span class="keyword">goto</span> done;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">update_task_rq_cpu_cycles(p, rq, event, wallclock, irqtime);</span><br><span class="line">update_task_demand(p, rq, event, wallclock);</span><br><span class="line">update_cpu_busy_time(p, rq, event, wallclock, irqtime);</span><br><span class="line">update_task_pred_demand(rq, p, event);</span><br><span class="line"><span class="keyword">if</span> (event == PUT_PREV_TASK &amp;&amp; p-&gt;state)</span><br><span class="line">p-&gt;wts.iowaited = p-&gt;in_iowait;</span><br><span class="line">...</span><br><span class="line"></span><br><span class="line">done:</span><br><span class="line">p-&gt;wts.mark_start = wallclock;</span><br><span class="line"></span><br><span class="line">run_walt_irq_work(old_window_start, rq);</span><br><span class="line">&#125;</span><br><span 
class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>The function that actually updates a task's load is <code>update_task_demand</code>. It handles three cases, where <code>ms</code> is the task's <code>mark_start</code>, <code>ws</code> is the window start, and <code>wc</code> is the wallclock:</p><p><img src="https://sniffer-site.oss-cn-shenzhen.aliyuncs.com/walt-demand-calculation.png" alt="walt-demand-calculation"></p><ol><li>If the event falls within the current window (<code>ws &lt; ms &lt; wc</code>), only the task's load in the current window needs to be updated</li><li>If the event spans two windows (<code>ms &lt; ws &lt; wc</code>), that is, the task's <code>mark_start</code> lies in the previous window, both the previous window and the current window need to be updated</li><li>If the event spans more than two windows (<code>ms &lt; ws &lt; wc</code> with at least one full window in between, so <code>nr_full_windows</code> is non-zero), that is, the task's <code>mark_start</code> lies in some earlier window, every elapsed history window needs to be updated</li></ol><p>A task's run time is obtained by scaling the difference between <code>ws</code> and <code>ms</code> relative to <code>1024</code> (the default maximum CPU capacity). Based on this run time, <code>WALT</code> takes the larger of the average load over the history windows and the task's current load as the effective load; after scaling, this yields <code>demand_scaled</code>, which drives decisions such as frequency selection and task placement. See the implementation of <code>update_history</code> for details. The scaled execution time of a task is:</p><p>$$ delta &#x3D; (ws - ms) \times \frac{cur\_cpu\_freq}{max\_cpu\_freq} \times \frac{cpu\_capacity}{1024} $$</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span 
class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * Account cpu demand of task and/or update task&#x27;s cpu demand history</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> * ms = p-&gt;wts.mark_start;</span></span><br><span class="line"><span class="comment"> * wc = wallclock</span></span><br><span class="line"><span class="comment"> * ws = rq-&gt;wrq.window_start</span></span><br><span class="line"><span 
class="comment"> */</span></span><br><span class="line"><span class="type">static</span> u64 <span class="title function_">update_task_demand</span><span class="params">(<span class="keyword">struct</span> task_struct *p, <span class="keyword">struct</span> rq *rq,</span></span><br><span class="line"><span class="params">       <span class="type">int</span> event, u64 wallclock)</span></span><br><span class="line">&#123;</span><br><span class="line">u64 mark_start = p-&gt;wts.mark_start;</span><br><span class="line">u64 delta, window_start = rq-&gt;wrq.window_start;</span><br><span class="line"><span class="type">int</span> new_window, nr_full_windows;</span><br><span class="line">u32 window_size = sched_ravg_window;</span><br><span class="line">u64 runtime;</span><br><span class="line"></span><br><span class="line">new_window = mark_start &lt; window_start;</span><br><span class="line"><span class="keyword">if</span> (!account_busy_for_task_demand(rq, p, event)) &#123;</span><br><span class="line"><span class="keyword">if</span> (new_window)</span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * If the time accounted isn&#x27;t being accounted as</span></span><br><span class="line"><span class="comment"> * busy time, and a new window started, only the</span></span><br><span class="line"><span class="comment"> * previous window need be closed out with the</span></span><br><span class="line"><span class="comment"> * pre-existing demand. 
Multiple windows may have</span></span><br><span class="line"><span class="comment"> * elapsed, but since empty windows are dropped,</span></span><br><span class="line"><span class="comment"> * it is not necessary to account those.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line">update_history(rq, p, p-&gt;wts.sum, <span class="number">1</span>, event);</span><br><span class="line"><span class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!new_window) &#123;</span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * The simple case - busy time contained within the existing</span></span><br><span class="line"><span class="comment"> * window.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="keyword">return</span> add_to_task_demand(rq, p, wallclock - mark_start);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * Busy time spans at least two windows. 
Temporarily rewind</span></span><br><span class="line"><span class="comment"> * window_start to first window boundary after mark_start.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line">delta = window_start - mark_start;</span><br><span class="line">nr_full_windows = div64_u64(delta, window_size);</span><br><span class="line">window_start -= (u64)nr_full_windows * (u64)window_size;</span><br><span class="line"></span><br><span class="line"><span class="comment">/* Process (window_start - mark_start) first */</span></span><br><span class="line">runtime = add_to_task_demand(rq, p, window_start - mark_start);</span><br><span class="line"></span><br><span class="line"><span class="comment">/* Push new sample(s) into task&#x27;s demand history */</span></span><br><span class="line">update_history(rq, p, p-&gt;wts.sum, <span class="number">1</span>, event);</span><br><span class="line"><span class="keyword">if</span> (nr_full_windows) &#123;</span><br><span class="line">u64 scaled_window = scale_exec_time(window_size, rq);</span><br><span class="line"></span><br><span class="line">update_history(rq, p, scaled_window, nr_full_windows, event);</span><br><span class="line">runtime += nr_full_windows * scaled_window;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * Roll window_start back to current to process any remainder</span></span><br><span class="line"><span class="comment"> * in current window.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line">window_start += (u64)nr_full_windows * (u64)window_size;</span><br><span class="line"></span><br><span class="line"><span class="comment">/* Process (wallclock - window_start) next */</span></span><br><span class="line">mark_start = window_start;</span><br><span class="line">runtime += 
add_to_task_demand(rq, p, wallclock - mark_start);</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> runtime;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Finally, when the system selects a CPU frequency, the load computed by <code>WALT</code> is used to adjust the frequency the CPU runs at; see <code>cpufreq_schedutil.c</code>:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="type">static</span> <span class="type">unsigned</span> <span class="type">long</span> <span class="title function_">sugov_get_util</span><span class="params">(<span class="keyword">struct</span> sugov_cpu *sg_cpu)</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">rq</span> *<span class="title">rq</span> =</span> cpu_rq(sg_cpu-&gt;cpu);</span><br><span class="line"><span class="type">unsigned</span> <span class="type">long</span> max = arch_scale_cpu_capacity(sg_cpu-&gt;cpu);</span><br><span class="line"><span class="type">unsigned</span> <span class="type">long</span> util;</span><br><span class="line"></span><br><span 
class="line">sg_cpu-&gt;max = max;</span><br><span class="line">sg_cpu-&gt;bw_dl = cpu_bw_dl(rq);</span><br><span class="line"></span><br><span class="line"><span class="meta">#<span class="keyword">ifdef</span> CONFIG_SCHED_WALT</span></span><br><span class="line">util = cpu_util_freq_walt(sg_cpu-&gt;cpu, &amp;sg_cpu-&gt;walt_load);</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> uclamp_rq_util_with(rq, util, <span class="literal">NULL</span>);</span><br><span class="line"><span class="meta">#<span class="keyword">else</span></span></span><br><span class="line">util = cpu_util_cfs(rq);</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> schedutil_cpu_util(sg_cpu-&gt;cpu, util, max, FREQUENCY_UTIL, <span class="literal">NULL</span>);</span><br><span class="line"><span class="meta">#<span class="keyword">endif</span></span></span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><h2 id="PELT与WALT的对比"><a href="#PELT与WALT的对比" class="headerlink" title="PELT与WALT的对比"></a><strong>PELT vs. WALT</strong></h2><p><code>PELT</code> is the load-tracking algorithm used by the mainline Linux scheduler; it predicts task load with a decaying moving average. Because it lacks support for big.LITTLE (<code>big.LITTLE</code>) architectures and its load decays slowly, it is a poor fit for phones and similar devices, whose load is bursty and changes frequently. <code>WALT</code> was designed to address these shortcomings. In summary, the two differ as follows:</p><ul><li><code>PELT</code> load tracking only considers <code>SMP</code> architectures and does not suit the heterogeneous <code>HMP</code> architecture of today's phone chips. It also tracks load only for the <code>CFS</code> scheduling class and ignores the other scheduling classes such as <code>RT</code>&#x2F;<code>DEADLINE</code>, so it lacks a system-wide view of load.</li><li><code>WALT</code> takes the big.LITTLE (<code>big.LITTLE</code>) architecture of mobile platforms into account and normalizes load by <code>CPU</code> capacity and frequency, so it can judge more accurately how well a task's load matches a CPU's capacity</li><li>In terms of scheduling and frequency-scaling latency, <code>PELT</code> averages over a longer period and therefore cannot react quickly to load changes, which can make mobile platforms such as <code>Android</code> feel less responsive; <code>WALT</code> responds to bursty load faster, but potentially at the cost of higher energy consumption</li></ul><p>The <code>Linaro</code> website hosts a document that analyzes the comparison between <code>PELT</code> and <code>WALT</code> in detail; see <a 
href="https://static.linaro.org/connect/bkk16/Presentations/Tuesday/BKK16-208.pdf">PELT vs Window tracking and EAS on SMP multi-cluster</a>.</p><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a><strong>References</strong></h2><ul><li><a href="https://blog.csdn.net/lizhijun_buaa/article/details/135954017">https://blog.csdn.net/lizhijun_buaa/article/details/135954017</a></li><li><a href="https://www.cnblogs.com/lingjiajun/p/12317090.html">https://www.cnblogs.com/lingjiajun/p/12317090.html</a></li><li><a href="https://lwn.net/Articles/531853/">https://lwn.net/Articles/531853/</a></li><li><a href="http://www.wowotech.net/process_management/PELT.html">http://www.wowotech.net/process_management/PELT.html</a></li><li><a href="http://www.wowotech.net/tag/pelt">http://www.wowotech.net/tag/pelt</a></li><li><a href="https://blog.csdn.net/feelabclihu/article/details/108414156">https://blog.csdn.net/feelabclihu/article/details/108414156</a></li><li><a href="https://www.anandtech.com/show/12620/improving-the-exynos-9810-galaxy-s9-part-2/2">https://www.anandtech.com/show/12620/improving-the-exynos-9810-galaxy-s9-part-2/2</a></li><li><a href="https://lwn.net/Articles/925371/">https://lwn.net/Articles/925371/</a></li><li><a href="https://android.googlesource.com/kernel/msm/+/android-msm-bullhead-3.10-marshmallow-dr/Documentation/scheduler/sched-hmp.txt">https://android.googlesource.com/kernel/msm/+/android-msm-bullhead-3.10-marshmallow-dr/Documentation/scheduler/sched-hmp.txt</a></li></ul>]]></content>
    
    
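    <summary type="html">&lt;p&gt;When a scheduling decision is made, the task scheduler must choose which CPU a task will run on, and it must also adjust the CPU frequency in real time according to the current system load. To suit devices as different as servers and embedded systems, the Linux process scheduler usually has to trade off several performance goals:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Keep scheduling as fair as possible so that every task gets a chance to run&lt;/li&gt;
&lt;li&gt;Respond quickly to user requests; interactive tasks, for example, need low scheduling latency so that they run promptly&lt;/li&gt;
&lt;li&gt;Achieve higher system throughput so that more tasks can run concurrently&lt;/li&gt;
&lt;li&gt;Keep power consumption as low as possible by lowering the operating frequency when the system is idle or the load drops&lt;/li&gt;
&lt;/ul&gt;</summary>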
    
    
    
    <category term="Linux" scheme="https://sniffer.site/categories/Linux/"/>
    
    
    <category term="PELT" scheme="https://sniffer.site/tags/PELT/"/>
    
    <category term="WALT" scheme="https://sniffer.site/tags/WALT/"/>
    
    <category term="进程调度" scheme="https://sniffer.site/tags/%E8%BF%9B%E7%A8%8B%E8%B0%83%E5%BA%A6/"/>
    
  </entry>
  
  <entry>
    <title>All About the PTP Time Synchronization Protocol</title>
    <link href="https://sniffer.site/2025/08/16/%E6%97%B6%E9%97%B4%E5%90%8C%E6%AD%A5%E5%8D%8F%E8%AE%AEPTP%E9%82%A3%E4%BA%9B%E4%BA%8B/"/>
    <id>https://sniffer.site/2025/08/16/%E6%97%B6%E9%97%B4%E5%90%8C%E6%AD%A5%E5%8D%8F%E8%AE%AEPTP%E9%82%A3%E4%BA%9B%E4%BA%8B/</id>
    <published>2025-08-16T09:36:55.000Z</published>
    <updated>2025-08-19T05:56:17.507Z</updated>
    
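    <content type="html"><![CDATA[<p>Modern life is inseparable from time: whether commuting to work or traveling, we need to know the accurate time where we are. In daily life, minute- or second-level accuracy is usually enough, but engineering fields such as aircraft navigation, machine control, and network management require much more precise time measurement, and we need to know exactly how much time elapsed between two events. Measuring time precisely is a complex technical undertaking, and synchronizing time across different networks and devices around the world is very challenging. The first problem is how to measure time precisely; the second is how to distribute that time accurately to other systems or devices. The first problem is solved with <a href="https://en.wikipedia.org/wiki/Atomic_clock">atomic clocks</a>: the cesium atomic clocks behind standard time reach an error of about one second in 100 million years, and satellite navigation systems such as <code>GPS</code> and BeiDou carry an atomic clock for high-precision navigation, so a <code>GPS</code> signal can also serve as a clock source for time distribution. The second problem, synchronization, is generally solved with standard protocols; this article focuses on one widely used synchronization protocol, <a href="https://en.wikipedia.org/wiki/Precision_Time_Protocol"><code>PTP(Precision Time Protocol)</code></a>.</p><span id="more"></span><p>Time synchronization protocols such as <a href="https://www.rfc-editor.org/rfc/rfc1305"><code>NTP(Network Time Protocol)</code> or <code>SNTP(Simple Network Time Protocol)</code></a> are <code>UDP</code>-based protocols (port <code>123</code>) that synchronize a host's time with Coordinated Universal Time (<code>UTC</code>). <code>NTP</code> was first proposed in 1981 and has been refined over several versions; the latest version, <code>NTPv4</code>, supports both <code>IPv4/IPv6</code> and provides cryptographic authentication, a significant improvement in security.</p><p><img src="https://md-files.oss-cn-shenzhen.aliyuncs.com/ntp-version-history.png" alt="NTP的版本历史"></p><p><code>NTP</code> uses a distributed, hierarchical synchronization architecture in which each level is called a <code>stratum</code>; the top level, <code>stratum 0</code>, is usually an atomic clock or a <code>GPS</code> clock and acts as the time source for the whole system. The protocol follows a client-server model: the client sends a synchronization request to an <code>NTP</code> server (which in turn takes its time from an atomic clock or <code>GPS</code>). The steps are as follows:</p><ol><li>The client sends an <code>NTP</code> request to the server, containing the timestamp <code>T1</code> at which the request left the client</li><li>The request arrives at the <code>NTP</code> server at server time <code>T2</code>. After processing it, the server sends an <code>NTP</code> reply at time <code>T3</code>. The reply carries the timestamp <code>T1</code> at which the request left the client, the timestamp <code>T2</code> at which it arrived at the server, and the timestamp <code>T3</code> at which the reply left the server</li><li>When the client receives the reply, it records the arrival timestamp <code>T4</code></li></ol><p>From these four timestamps, the client can compute the round-trip network delay between client and server:</p><p>$$ delay &#x3D; (T4 - T1) - (T3 - T2) $$</p><p>Let <code>offset</code> be the time difference between the server and the client (server clock minus client clock), and assume the delay is the same in both directions. Then:</p><p>$$ T4 &#x3D; T3 - offset + 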
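\frac{delay}{2}$$</p><p>The request direction likewise gives:</p><p>$$ T2 &#x3D; T1 + offset + \frac{delay}{2} $$</p><p>Subtracting the first equation from the second yields the time difference <code>offset</code>:</p><p>$$ offset &#x3D; \frac{(T2 - T1) + (T3 - T4)}{2} $$</p><p>The client adjusts its system time by this offset, which completes the synchronization.</p><h2 id="PTP协议基本概念"><a href="#PTP协议基本概念" class="headerlink" title="PTP协议基本概念"></a><strong>PTP Basic Concepts</strong></h2><p><code>NTP</code> is generally used to synchronize operating system time; systems such as <code>Windows</code> and <code>Android</code> support it. However, because <code>NTP</code> is an application-layer protocol, it is affected by system scheduling latency, and factors such as network asymmetry and the temperature drift and aging of the system <code>RTC(Real-Time Clock)</code> reduce its accuracy, which typically only reaches the <code>ms</code> level. Industrial automation (such as robot control), 5G communications, high-frequency trading, and power systems need more precise time synchronization. To that end, <code>IEEE</code> published a new time synchronization protocol, <code>1588v1</code>, in <code>2002</code>, followed by a second version, <code>1588v2</code>, in <code>2008</code>; an improved version, <code>1588v2.1</code>, with better security and compatibility followed in <code>2019</code>. The <code>PTP</code> implementations in common use today are based on <code>1588v2</code>; the <code>gPTP(Generalized PTP)</code> protocol, for example, is an extension of <code>1588v2</code>.</p><p>The core concepts of <code>PTP</code> are:</p><ul><li><code>PTP</code> domain: a network running <code>PTP</code> is called a <code>PTP</code> domain. A domain has exactly one reference clock, and all other devices in the domain synchronize to it. A node that distributes time is called the <code>master</code>, and a node that receives time is called a <code>slave</code></li><li>A <code>PTP</code> domain contains several types of clock:<ul><li><code>OC(Ordinary Clock)</code>: has a single physical port for time synchronization. It can act as the first node (<code>Grandmaster Clock</code>) and distribute time to downstream nodes, or as a leaf node (<code>slave clock</code>) and synchronize time from an upstream node</li><li><code>BC(Boundary Clock)</code>: has multiple physical ports for network communication; one port synchronizes time from the upstream device, and the remaining ports distribute time to downstream devices</li><li><code>TC(Transparent Clock)</code>: has multiple physical ports for network communication but does not synchronize its own time; it only processes and forwards <code>PTP</code> messages. There are two kinds, <code>E2E(End-to-End)</code> and <code>P2P(Peer-to-Peer)</code>. An <code>E2E TC</code> measures the forwarding delay of each message passing through it and writes the correction into the <code>PTP</code> message; a <code>P2P TC</code> corrects the forwarding delay and additionally measures and corrects the delay of the link attached to each of its ports.</li></ul></li></ul><p><img src="https://sniffer-site.oss-cn-shenzhen.aliyuncs.com/ptp-clock-types.png" alt="PTP clock types"></p><h2 id="PTP时间同步原理"><a href="#PTP时间同步原理" class="headerlink" title="PTP时间同步原理"></a><strong>How PTP Synchronizes Time</strong></h2><p>Before synchronizing, <code>PTP</code> usually selects a best clock from the domain, the grandmaster clock (<code>Grandmaster Clock, GM</code>), which is the time source for the whole <code>PTP</code> domain. The grandmaster can be specified by static configuration or elected dynamically with the <code>BMC(Best Master Clock)</code> algorithm:</p><ol><li>Each clock node uses <code>Announce</code> messages to advertise the clock source information on its ports (clock priority, clock class, clock accuracy, stability of the local oscillator, and so on), maintains the clock data sets it has collected, selects the best time source by strictly ordered comparison of these attributes, and determines its port states. The election builds a loop-free, fully connected spanning tree rooted at the <code>GM</code> across the <code>PTP</code> domain.</li><li>Afterwards, the <code>master</code> node periodically sends <code>Announce</code> messages to the other nodes. If the network changes, or a slave node stops receiving <code>Announce</code> messages from the master, the best clock is selected again</li></ol><p>As mentioned above, <code>PTP</code> has two delay measurement mechanisms, end-to-end <code>E2E(End-to-End)</code> and peer-to-peer <code>P2P(Peer-to-Peer)</code>. They differ in how the link delay between the master clock node (<code>master</code>) and the slave clock node (<code>slave</code>) is measured:</p><ul><li><code>E2E</code> directly measures the total path delay between two <code>OC</code> or <code>BC</code> nodes, including all intermediate <code>TC</code> nodes.</li><li><code>P2P</code> only measures the hop-by-hop link delay between two directly connected <code>OC</code>, <code>BC</code>, or <code>TC</code> nodes</li></ul><h3 id="E2E同步"><a href="#E2E同步" class="headerlink" title="E2E同步"></a><strong>E2E Synchronization</strong></h3><p><code>E2E</code> synchronization follows a <code>master-slave</code> model: by exchanging <code>Sync</code>, <code>Delay_Req</code>, and <code>Delay_Resp</code> messages, the slave computes its time offset from the master and corrects its system time. The procedure is:</p><ol><li>The <code>Master</code> sends a <code>Sync</code> message at time <code>t1</code>. In one-step mode (<code>one-step</code>) the <code>t1</code> timestamp is carried in the <code>Sync</code> message itself and no <code>Follow_Up</code> message is needed; in two-step mode (<code>two-step</code>) it is carried in a subsequent <code>Follow_Up</code> message</li><li>The <code>Slave</code> receives the <code>Sync</code> message at time <code>t2</code>, generates the local timestamp <code>t2</code>, and extracts the <code>t1</code> timestamp from the message</li><li>The <code>Slave</code> sends a <code>Delay_Req</code> message at time <code>t3</code> and generates the local timestamp <code>t3</code></li><li>The <code>Master</code> receives the <code>Delay_Req</code> message at time <code>t4</code>, generates the local timestamp <code>t4</code>, and returns it to the <code>Slave</code> in a <code>Delay_Resp</code> message</li><li>The <code>Slave</code> receives the <code>Delay_Resp</code> message and extracts the <code>t4</code> timestamp. The <code>Slave</code> now holds the set of timestamps <code>(t1, t2, t3, t4)</code></li></ol><p><img src="https://sniffer-site.oss-cn-shenzhen.aliyuncs.com/ptp-e2e.png" alt="PTP 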
E2E"></p><p>Let the link delay from the <code>Master</code> to the <code>Slave</code> be $t_{ms}$, the link delay from the <code>Slave</code> to the <code>Master</code> be $t_{sm}$, and the time offset between the <code>Slave</code> and the <code>Master</code> be <code>offset</code>. Then:</p><p>$$ t2 - t1 &#x3D; t_{ms} + offset $$</p><p>$$ t4 - t3 &#x3D; t_{sm} - offset $$</p><p>$$ (t2 - t1) - (t4 - t3) &#x3D; (t_{ms} + offset) - (t_{sm} - offset) $$</p><p>$$ offset &#x3D; [(t2 - t1) - (t4 - t3) - (t_{ms} - t_{sm})] &#x2F; 2 $$</p><p>If $t_{ms} &#x3D; t_{sm}$, that is, the link delay between the <code>Master</code> and the <code>Slave</code> is symmetric in both directions, then:</p><p>$$offset &#x3D; [(t2 - t1) - (t4 - t3)] &#x2F; 2 $$</p><p>The <code>Slave</code> can thus compute its offset from the <code>Master</code> out of the four timestamps <code>t1, t2, t3, t4</code> and complete the synchronization. If the link delays in the two directions are asymmetric, however, a synchronization error remains, equal to half the difference between the two one-way link delays. High-precision scenarios therefore need to compensate for the link delay asymmetry between <code>Master</code> and <code>Slave</code>.</p><h3 id="P2P同步"><a href="#P2P同步" class="headerlink" title="P2P同步"></a><strong>P2P Synchronization</strong></h3><p>In <code>P2P</code> mode every node exchanges messages with its directly connected neighbors, so each node can compute the link delay to each neighbor; the actual time synchronization, however, still happens only between the <code>Master</code> and <code>Slave</code> nodes. As in <code>E2E</code> mode, messages can be sent in one-step or two-step fashion. In one-step mode the <code>Pdelay_Resp</code> message itself carries its transmit timestamp; in two-step mode the <code>Pdelay_Resp</code> message does not carry it, the sender only records the transmit time and delivers it in a subsequent <code>Pdelay_Resp_Follow_Up</code> message. The link delay between neighbors is measured as follows:</p><p><img src="https://sniffer-site.oss-cn-shenzhen.aliyuncs.com/ptp-p2p.png" alt="PTP P2P"></p><ol><li>Node 2 sends a <code>Pdelay_Req</code> message at time <code>t1</code></li><li>Node 1 receives the <code>Pdelay_Req</code> message at time <code>t2</code> and generates its receive timestamp <code>t2</code></li><li>Node 1 sends a <code>Pdelay_Resp</code> message at time <code>t3</code> and generates its transmit timestamp <code>t3</code></li></ol><ul><li>In one-step mode, <code>t3 - t2</code> is carried in the <code>Pdelay_Resp</code> message</li><li>In two-step mode, <code>t3 - t2</code> is carried in the <code>Pdelay_Resp_Follow_Up</code> message; alternatively, the <code>Pdelay_Resp</code> message carries <code>t2</code> and the <code>Pdelay_Resp_Follow_Up</code> message carries <code>t3</code></li></ul><ol 
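start="4"><li>Node 2 receives the <code>Pdelay_Resp</code> message at time <code>t4</code> and generates the local timestamp <code>t4</code>; node 2 now holds the set of timestamps <code>(t1, t2, t3, t4)</code></li></ol><p>Let the link delay from node 2 to node 1 be $t_{reqresp}$ and the link delay from node 1 to node 2 be $t_{respreq}$. The total round-trip link delay between node 2 and node 1 is then:</p><p>$$ t_{reqresp} + t_{respreq} &#x3D; (t4 - t1) - (t3 - t2) $$</p><p>If $t_{reqresp} &#x3D; t_{respreq}$, that is, the link delay between node 2 and node 1 is symmetric in both directions, the mean link delay between the two nodes is:</p><p>$$ MeanPathDelay &#x3D; \frac{(t4 - t1) - (t3 - t2)}{2} $$</p><p>The procedure above only keeps measuring and updating the link delay between neighboring nodes in real time; it does not synchronize time. Synchronization additionally requires the <code>Master</code> and the <code>Slave</code> to exchange <code>Sync</code>&#x2F;<code>Pdelay_Req</code>&#x2F;<code>Pdelay_Resp</code> messages (see the figure below): the <code>Master</code> periodically sends <code>Sync</code> messages to the <code>Slave</code>, which gives the <code>Slave</code> the two timestamps <code>t5/t6</code> (the transmit and receive times of the <code>Sync</code> message). The time offset between the <code>Slave</code> and the <code>Master</code> is then:</p><p>$$ offset &#x3D; t6 - t5 - MeanPathDelay $$</p><p><img src="https://sniffer-site.oss-cn-shenzhen.aliyuncs.com/ptp-p2p-time-sync.png" alt="PTP time-sync"></p><h2 id="PTP协议在Linux中是如何实现的"><a href="#PTP协议在Linux中是如何实现的" class="headerlink" title="PTP协议在Linux中是如何实现的"></a><strong>How PTP Is Implemented in Linux</strong></h2><p><code>PTP</code> is a general-purpose time synchronization protocol that can be used in different network environments: it can send synchronization messages over <code>UDP(V4/V6)</code>, or synchronize directly over <code>L2</code> (<code>MAC</code>). From the mechanisms described above, synchronization accuracy depends mainly on the following factors:</p><ul><li>The layer at which messages are timestamped: the closer to the hardware the timestamp is taken, the more accurate it is. Software timestamps are subject to jitter from software scheduling and the network stack</li><li>Delay variation between the two synchronizing nodes: asymmetric delay between the nodes can degrade accuracy, and so can delay variation in intermediate nodes</li><li>The accuracy of the system clock and of the hardware clock in the <code>MAC/PHY</code> (<code>PHC, PTP Hardware Clock</code>), which is sensitive to temperature, voltage, and other environmental factors</li></ul><p><code>Linux</code> has an open-source <a href="https://linuxptp.sourceforge.net/"><code>PTP</code> protocol stack (<code>LinuxPTP</code> for short)</a> with two core tools. <code>ptp4l</code> sends and receives <code>PTP</code> messages and performs the time synchronization; <code>phc2sys</code> synchronizes the different clocks within a system, for example the <code>PHC</code> and the system clock. To reach nanosecond-level accuracy, hardware timestamping is usually required: the Ethernet <code>MAC/PHY</code> includes a <code>TSU(TimeStamping 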
Unit)</code> module dedicated to parsing <code>PTP</code> messages and stamping them with the time of the <code>PHC</code> hardware clock, which minimizes the <code>jitter</code> introduced by the system. <code>gPTP</code>, the time synchronization protocol of <code>IEEE 802.1AS</code>, for example, requires hardware timestamping.</p><p>As the figure below shows, <code>PTP</code> time synchronization involves the following core modules:</p><ul><li>The <code>LinuxPTP</code> stack, which implements <code>IEEE1588v2</code> and includes the two core tools <code>ptp4l</code> and <code>phc2sys</code></li><li>The <code>TSU(TimeStamping Unit)</code> module in the <code>MAC/PHY</code>, which provides hardware timestamping</li><li>The <code>PHC</code> clock, which provides the reference clock for the <code>TSU</code> module</li><li><code>Linux</code> kernel drivers, including the <code>PTP</code> clock driver, the <code>posix</code> clock driver, and the <code>UDP</code> protocol stack</li></ul><p><img src="https://sniffer-site.oss-cn-shenzhen.aliyuncs.com/linux-ptp-stack.png" alt="Linux PTP stack"></p><p>To check whether an Ethernet NIC supports hardware timestamping, use the <code>ethtool</code> command: <code>ethtool -T eth0</code>. If the output lists the <code>hardware-transmit/hardware-receive</code> capabilities, the NIC supports hardware timestamping.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br></pre></td><td class="code"><pre><span class="line">~$ ethtool -T enp0s31f6</span><br><span class="line">Time stamping parameters <span class="keyword">for</span> enp0s31f6:</span><br><span class="line">Capabilities:</span><br><span 
class="line">hardware-transmit</span><br><span class="line">software-transmit</span><br><span class="line">hardware-receive</span><br><span class="line">software-receive</span><br><span class="line">software-system-clock</span><br><span class="line">hardware-raw-clock</span><br><span class="line">PTP Hardware Clock: 0</span><br><span class="line">Hardware Transmit Timestamp Modes:</span><br><span class="line">off</span><br><span class="line">on</span><br><span class="line">Hardware Receive Filter Modes:</span><br><span class="line">none</span><br><span class="line">all</span><br><span class="line">ptpv1-l4-sync</span><br><span class="line">ptpv1-l4-delay-req</span><br><span class="line">ptpv2-l4-sync</span><br><span class="line">ptpv2-l4-delay-req</span><br><span class="line">ptpv2-l2-sync</span><br><span class="line">ptpv2-l2-delay-req</span><br><span class="line">ptpv2-event</span><br><span class="line">ptpv2-sync</span><br><span class="line">ptpv2-delay-req</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Next, we walk through the implementation of <code>PTP</code> using the source code of the <code>LinuxPTP</code> stack and the <code>Linux</code> kernel. There are two parts: the implementation of <code>ptp4l</code>, and the <code>Linux</code> kernel driver side, including the implementation of the <code>PTP</code> clock.</p><blockquote><p>This article uses <code>LinuxPTP</code> version <code>V4.2</code></p></blockquote><h3 id="ptp4l的实现"><a href="#ptp4l的实现" class="headerlink" title="ptp4l的实现"></a><strong>Implementation of <code>ptp4l</code></strong></h3><p>Besides the usual command-line options, <code>ptp4l</code> can read its synchronization parameters from a configuration file. Commonly used configuration options include:</p><ul><li><code>logSyncInterval</code>: the <code>PTP</code> synchronization interval, expressed as the base-2 logarithm of the interval in seconds; a shorter interval generally improves local time accuracy</li><li><code>delay_mechanism</code>: the delay mechanism, one of <code>E2E</code>&#x2F;<code>P2P</code>&#x2F;<code>Auto</code>; the default is <code>E2E</code></li><li><code>network_transport</code>: the network transport, one of <code>UDPv4/UDPv6</code>&#x2F;<code>L2</code>; the default is <code>UDPv4</code></li><li><code>twoStepFlag</code>: whether two-step synchronization is used; it is enabled by default</li><li><code>serverOnly</code> (<code>masterOnly</code> in older releases): whether the port may only act as a <code>master</code> node; the default is <code>false</code></li><li><code>ptp_dst_mac</code>: 
the destination MAC address of <code>PTP</code> messages; it sets the destination <code>MAC</code> address used when the <code>L2</code> transport is selected</li><li><code>clock_type</code>: the clock type, one of <code>OC</code>&#x2F;<code>BC</code>&#x2F;<code>P2P_TC</code>&#x2F;<code>E2E_TC</code>; the default is <code>OC</code></li><li><code>BMCA</code>: the best master clock algorithm, used to assign the <code>master</code> and <code>slave</code> roles; when set to <code>noop</code>, the regular <code>BMCA</code> procedure is skipped and a static configuration is used</li></ul><p>The <code>LinuxPTP</code> source tree already contains sample <code>ptp4l</code> configuration files for reference. Taking the <code>master</code> node configuration for automotive networks (<code>configs/automotive-master.cfg</code>) as an example:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment">#</span></span><br><span class="line"><span class="comment"># Automotive Profile example configuration for master containing those</span></span><br><span class="line"><span class="comment"># attributes which differ from the defaults.  
See the file, default.cfg, for</span></span><br><span class="line"><span class="comment"># the complete list of available options.</span></span><br><span class="line"><span class="comment">#</span></span><br><span class="line">[global]</span><br><span class="line"><span class="comment"># Options carried over from gPTP.</span></span><br><span class="line">gmCapable               1</span><br><span class="line">priority1               248</span><br><span class="line">priority2               248</span><br><span class="line">logSyncInterval         -3</span><br><span class="line">syncReceiptTimeout      3</span><br><span class="line">neighborPropDelayThresh 800</span><br><span class="line">min_neighbor_prop_delay -20000000</span><br><span class="line">assume_two_step         1</span><br><span class="line">path_trace_enabled      1</span><br><span class="line">follow_up_info          1</span><br><span class="line">transportSpecific       0x1</span><br><span class="line">ptp_dst_mac             01:80:C2:00:00:0E</span><br><span class="line">network_transport       L2</span><br><span class="line">delay_mechanism         P2P</span><br><span class="line"><span class="comment">#</span></span><br><span class="line"><span class="comment"># Automotive Profile specific options</span></span><br><span class="line"><span class="comment">#</span></span><br><span class="line">BMCA                    noop</span><br><span class="line">serverOnly              1</span><br><span class="line">inhibit_announce        1</span><br><span class="line">asCapable               <span class="literal">true</span></span><br><span class="line">inhibit_delay_req       1</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>The <code>ptp4l</code> source code has several interrelated parts:</p><ul><li><code>port</code>: corresponds to a network port of the Ethernet NIC; a port can be in one of several states (<code>enum port_state</code>), such as initializing, running, or listening</li><li><code>clock</code>: the <code>PTP</code> clock object; it may contain many <code>port</code> objects and also holds some of the synchronization state</li><li>Encapsulation and transmission of <code>PTP</code> protocol messages, and handling of the different clocks in the system</li></ul><p>The entry point of <code>ptp4l</code> is the <code>main()</code> function, whose core logic consists of the following steps:</p><ul><li><code>clock_create</code>: 
creates the <code>PTP</code> clock object from the user-supplied arguments and the configuration file, and also creates the clock's <code>port</code> objects via <code>clock_add_port</code></li><li><code>clock_poll</code>: continuously monitors the <code>port</code>s, then performs state transitions and processing according to the type of event seen on each <code>port</code></li></ul><figure class="highlight c"><table><tr><td class="code"><pre><span class="line"></span><br><span class="line"><span class="type">int</span> <span class="title function_">main</span><span class="params">(<span class="type">int</span> argc, <span class="type">char</span> *argv[])</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="type">char</span> *config = <span class="literal">NULL</span>, *req_phc = <span class="literal">NULL</span>, *progname;</span><br><span class="line"><span class="class"><span class="keyword">enum</span> <span class="title">clock_type</span> <span 
class="title">type</span> =</span> CLOCK_TYPE_ORDINARY;</span><br><span class="line"><span class="type">int</span> c, err = <span class="number">-1</span>, index, print_level;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">clock</span> *<span class="title">clock</span> =</span> <span class="literal">NULL</span>;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">option</span> *<span class="title">opts</span>;</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">config</span> *<span class="title">cfg</span>;</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (handle_term_signals())</span><br><span class="line"><span class="keyword">return</span> <span class="number">-1</span>;</span><br><span class="line"></span><br><span class="line">cfg = config_create();</span><br><span class="line"><span class="keyword">if</span> (!cfg) &#123;</span><br><span class="line"><span class="keyword">return</span> <span class="number">-1</span>;</span><br><span class="line">&#125;</span><br><span class="line">  ...</span><br><span class="line"></span><br><span class="line">print_set_progname(progname);</span><br><span class="line">print_set_tag(config_get_string(cfg, <span class="literal">NULL</span>, <span class="string">&quot;message_tag&quot;</span>));</span><br><span class="line">print_set_verbose(config_get_int(cfg, <span class="literal">NULL</span>, <span class="string">&quot;verbose&quot;</span>));</span><br><span class="line">print_set_syslog(config_get_int(cfg, <span class="literal">NULL</span>, <span class="string">&quot;use_syslog&quot;</span>));</span><br><span class="line">print_set_level(config_get_int(cfg, <span class="literal">NULL</span>, <span class="string">&quot;logging_level&quot;</span>));</span><br><span class="line"></span><br><span 
class="line">assume_two_step = config_get_int(cfg, <span class="literal">NULL</span>, <span class="string">&quot;assume_two_step&quot;</span>);</span><br><span class="line">sk_check_fupsync = config_get_int(cfg, <span class="literal">NULL</span>, <span class="string">&quot;check_fup_sync&quot;</span>);</span><br><span class="line">sk_tx_timeout = config_get_int(cfg, <span class="literal">NULL</span>, <span class="string">&quot;tx_timestamp_timeout&quot;</span>);</span><br><span class="line">sk_hwts_filter_mode = config_get_int(cfg, <span class="literal">NULL</span>, <span class="string">&quot;hwts_filter&quot;</span>);</span><br><span class="line"></span><br><span class="line">ptp_hdr_ver = config_get_int(cfg, <span class="literal">NULL</span>, <span class="string">&quot;ptp_minor_version&quot;</span>);</span><br><span class="line">ptp_hdr_ver = (ptp_hdr_ver &lt;&lt; <span class="number">4</span>) | PTP_MAJOR_VERSION;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (config_get_int(cfg, <span class="literal">NULL</span>, <span class="string">&quot;clock_servo&quot;</span>) == CLOCK_SERVO_NTPSHM) &#123;</span><br><span class="line">config_set_int(cfg, <span class="string">&quot;kernel_leap&quot;</span>, <span class="number">0</span>);</span><br><span class="line">config_set_int(cfg, <span class="string">&quot;sanity_freq_limit&quot;</span>, <span class="number">0</span>);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (STAILQ_EMPTY(&amp;cfg-&gt;interfaces)) &#123;</span><br><span class="line"><span class="built_in">fprintf</span>(<span class="built_in">stderr</span>, <span class="string">&quot;no interface specified\n&quot;</span>);</span><br><span class="line">usage(progname);</span><br><span class="line"><span class="keyword">goto</span> out;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">type = 
config_get_int(cfg, <span class="literal">NULL</span>, <span class="string">&quot;clock_type&quot;</span>);</span><br><span class="line"><span class="keyword">switch</span> (type) &#123;</span><br><span class="line"><span class="keyword">case</span> CLOCK_TYPE_ORDINARY:</span><br><span class="line"><span class="keyword">if</span> (cfg-&gt;n_interfaces &gt; <span class="number">1</span>) &#123;</span><br><span class="line">type = CLOCK_TYPE_BOUNDARY;</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line"><span class="keyword">case</span> CLOCK_TYPE_BOUNDARY:</span><br><span class="line"><span class="keyword">if</span> (cfg-&gt;n_interfaces &lt; <span class="number">2</span>) &#123;</span><br><span class="line"><span class="built_in">fprintf</span>(<span class="built_in">stderr</span>, <span class="string">&quot;BC needs at least two interfaces\n&quot;</span>);</span><br><span class="line"><span class="keyword">goto</span> out;</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line"><span class="keyword">case</span> CLOCK_TYPE_P2P:</span><br><span class="line"><span class="keyword">if</span> (cfg-&gt;n_interfaces &lt; <span class="number">2</span>) &#123;</span><br><span class="line"><span class="built_in">fprintf</span>(<span class="built_in">stderr</span>, <span class="string">&quot;TC needs at least two interfaces\n&quot;</span>);</span><br><span class="line"><span class="keyword">goto</span> out;</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">if</span> (DM_P2P != config_get_int(cfg, <span class="literal">NULL</span>, <span class="string">&quot;delay_mechanism&quot;</span>)) &#123;</span><br><span class="line"><span class="built_in">fprintf</span>(<span class="built_in">stderr</span>, <span class="string">&quot;P2P_TC needs P2P delay mechanism\n&quot;</span>);</span><br><span 
class="line"><span class="keyword">goto</span> out;</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line"><span class="keyword">case</span> CLOCK_TYPE_E2E:</span><br><span class="line"><span class="keyword">if</span> (cfg-&gt;n_interfaces &lt; <span class="number">2</span>) &#123;</span><br><span class="line"><span class="built_in">fprintf</span>(<span class="built_in">stderr</span>, <span class="string">&quot;TC needs at least two interfaces\n&quot;</span>);</span><br><span class="line"><span class="keyword">goto</span> out;</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">if</span> (DM_E2E != config_get_int(cfg, <span class="literal">NULL</span>, <span class="string">&quot;delay_mechanism&quot;</span>)) &#123;</span><br><span class="line"><span class="built_in">fprintf</span>(<span class="built_in">stderr</span>, <span class="string">&quot;E2E_TC needs E2E delay mechanism\n&quot;</span>);</span><br><span class="line"><span class="keyword">goto</span> out;</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line"><span class="keyword">case</span> CLOCK_TYPE_MANAGEMENT:</span><br><span class="line"><span class="keyword">goto</span> out;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">clock = clock_create(type, cfg, req_phc);</span><br><span class="line"><span class="keyword">if</span> (!clock) &#123;</span><br><span class="line"><span class="built_in">fprintf</span>(<span class="built_in">stderr</span>, <span class="string">&quot;failed to create a clock\n&quot;</span>);</span><br><span class="line"><span class="keyword">goto</span> out;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">err = <span class="number">0</span>;</span><br><span class="line"></span><br><span class="line"><span 
class="keyword">while</span> (is_running()) &#123;</span><br><span class="line"><span class="keyword">if</span> (clock_poll(clock))</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">  ...</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>For brevity the details are omitted here; interested readers can refer to <code>clock.c</code>, <code>port.c</code> and related files in the source. Next, let's take a brief look at how the <code>PTP</code> clock framework is implemented in the kernel.</p><h3 id="PTP物理时钟驱动框架"><a href="#PTP物理时钟驱动框架" class="headerlink" title="PTP物理时钟驱动框架"></a><strong>PTP Hardware Clock Driver Framework</strong></h3><p>The <a href="https://docs.kernel.org/driver-api/ptp.html">driver framework for <code>PTP</code> hardware clocks</a> has two parts. The first is the common core that provides the shared interfaces and the driver framework; its type definitions live in <code>include/linux/ptp_clock_kernel.h</code>. The second is the hardware-specific clock driver, usually part of an Ethernet driver. Each <code>PTP</code> hardware clock is described by a <code>struct ptp_clock_info</code>, which holds the clock's configuration and the callbacks for reading and setting the clock parameters.</p><figure class="highlight c"><table><tr><td class="code"><pre><span class="line"></span><br><span 
class="line"><span class="class"><span class="keyword">struct</span> <span class="title">ptp_clock_info</span> &#123;</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">module</span> *<span class="title">owner</span>;</span></span><br><span class="line"><span class="type">char</span> name[<span class="number">16</span>];</span><br><span class="line">s32 max_adj;</span><br><span class="line"><span class="type">int</span> n_alarm;</span><br><span class="line"><span class="type">int</span> n_ext_ts;</span><br><span class="line"><span class="type">int</span> n_per_out;</span><br><span class="line"><span class="type">int</span> n_pins;</span><br><span class="line"><span class="type">int</span> pps;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">ptp_pin_desc</span> *<span class="title">pin_config</span>;</span></span><br><span class="line"><span class="type">int</span> (*adjfine)(<span class="keyword">struct</span> ptp_clock_info *ptp, <span class="type">long</span> scaled_ppm);</span><br><span class="line"><span class="type">int</span> (*adjfreq)(<span class="keyword">struct</span> ptp_clock_info *ptp, s32 delta);</span><br><span class="line"><span class="type">int</span> (*adjphase)(<span class="keyword">struct</span> ptp_clock_info *ptp, s32 phase);</span><br><span class="line"><span class="type">int</span> (*adjtime)(<span class="keyword">struct</span> ptp_clock_info *ptp, s64 delta);</span><br><span class="line"><span class="type">int</span> (*gettime64)(<span class="keyword">struct</span> ptp_clock_info *ptp, <span class="keyword">struct</span> timespec64 *ts);</span><br><span class="line"><span class="type">int</span> (*gettimex64)(<span class="keyword">struct</span> ptp_clock_info *ptp, <span class="keyword">struct</span> timespec64 *ts,</span><br><span class="line">  <span class="keyword">struct</span> ptp_system_timestamp *sts);</span><br><span 
class="line"><span class="type">int</span> (*getcrosststamp)(<span class="keyword">struct</span> ptp_clock_info *ptp,</span><br><span class="line">      <span class="keyword">struct</span> system_device_crosststamp *cts);</span><br><span class="line"><span class="type">int</span> (*settime64)(<span class="keyword">struct</span> ptp_clock_info *p, <span class="type">const</span> <span class="keyword">struct</span> timespec64 *ts);</span><br><span class="line"><span class="type">int</span> (*enable)(<span class="keyword">struct</span> ptp_clock_info *ptp,</span><br><span class="line">      <span class="keyword">struct</span> ptp_clock_request *request, <span class="type">int</span> on);</span><br><span class="line"><span class="type">int</span> (*verify)(<span class="keyword">struct</span> ptp_clock_info *ptp, <span class="type">unsigned</span> <span class="type">int</span> pin,</span><br><span class="line">      <span class="keyword">enum</span> ptp_pin_function func, <span class="type">unsigned</span> <span class="type">int</span> chan);</span><br><span class="line"><span class="type">long</span> (*do_aux_work)(<span class="keyword">struct</span> ptp_clock_info *ptp);</span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>An Ethernet driver that supports a <code>PTP</code> hardware clock must call <code>ptp_clock_register</code> during initialization to register the clock, and call <code>ptp_clock_unregister</code> to unregister it when the driver is removed. Taking Intel's Gigabit Ethernet driver (<code>drivers/net/ethernet/intel/igb/igb_main.c</code>) as an example, the NIC driver calls <code>igb_ptp_init</code> from its probe routine <code>igb_probe</code> to register the <code>PTP</code> clock:</p><figure class="highlight c"><table><tr><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment">/**</span></span><br><span class="line"><span class="comment"> * igb_ptp_init - Initialize PTP functionality</span></span><br><span class="line"><span class="comment"> * @adapter: Board private structure</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> * This function is called at device probe to initialize the PTP</span></span><br><span class="line"><span class="comment"> * functionality.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="type">void</span> <span class="title function_">igb_ptp_init</span><span class="params">(<span class="keyword">struct</span> igb_adapter *adapter)</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="class"><span 
class="keyword">struct</span> <span class="title">e1000_hw</span> *<span class="title">hw</span> =</span> &amp;adapter-&gt;hw;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">net_device</span> *<span class="title">netdev</span> =</span> adapter-&gt;netdev;</span><br><span class="line"><span class="type">int</span> i;</span><br><span class="line"></span><br><span class="line"><span class="keyword">switch</span> (hw-&gt;mac.type) &#123;</span><br><span class="line"><span class="keyword">case</span> e1000_82576:</span><br><span class="line"><span class="built_in">snprintf</span>(adapter-&gt;ptp_caps.name, <span class="number">16</span>, <span class="string">&quot;%pm&quot;</span>, netdev-&gt;dev_addr);</span><br><span class="line">adapter-&gt;ptp_caps.owner = THIS_MODULE;</span><br><span class="line">adapter-&gt;ptp_caps.max_adj = <span class="number">999999881</span>;</span><br><span class="line">adapter-&gt;ptp_caps.n_ext_ts = <span class="number">0</span>;</span><br><span class="line">adapter-&gt;ptp_caps.pps = <span class="number">0</span>;</span><br><span class="line">adapter-&gt;ptp_caps.adjfreq = igb_ptp_adjfreq_82576;</span><br><span class="line">adapter-&gt;ptp_caps.adjtime = igb_ptp_adjtime_82576;</span><br><span class="line">adapter-&gt;ptp_caps.gettimex64 = igb_ptp_gettimex_82576;</span><br><span class="line">adapter-&gt;ptp_caps.settime64 = igb_ptp_settime_82576;</span><br><span class="line">adapter-&gt;ptp_caps.enable = igb_ptp_feature_enable;</span><br><span class="line">adapter-&gt;cc.read = igb_ptp_read_82576;</span><br><span class="line">adapter-&gt;cc.mask = CYCLECOUNTER_MASK(<span class="number">64</span>);</span><br><span class="line">adapter-&gt;cc.mult = <span class="number">1</span>;</span><br><span class="line">adapter-&gt;cc.shift = IGB_82576_TSYNC_SHIFT;</span><br><span class="line">adapter-&gt;ptp_flags |= IGB_PTP_OVERFLOW_CHECK;</span><br><span class="line"><span 
class="keyword">break</span>;</span><br><span class="line"><span class="keyword">case</span> e1000_82580:</span><br><span class="line"><span class="keyword">case</span> e1000_i354:</span><br><span class="line"><span class="keyword">case</span> e1000_i350:</span><br><span class="line"><span class="built_in">snprintf</span>(adapter-&gt;ptp_caps.name, <span class="number">16</span>, <span class="string">&quot;%pm&quot;</span>, netdev-&gt;dev_addr);</span><br><span class="line">adapter-&gt;ptp_caps.owner = THIS_MODULE;</span><br><span class="line">adapter-&gt;ptp_caps.max_adj = <span class="number">62499999</span>;</span><br><span class="line">adapter-&gt;ptp_caps.n_ext_ts = <span class="number">0</span>;</span><br><span class="line">adapter-&gt;ptp_caps.pps = <span class="number">0</span>;</span><br><span class="line">adapter-&gt;ptp_caps.adjfine = igb_ptp_adjfine_82580;</span><br><span class="line">adapter-&gt;ptp_caps.adjtime = igb_ptp_adjtime_82576;</span><br><span class="line">adapter-&gt;ptp_caps.gettimex64 = igb_ptp_gettimex_82580;</span><br><span class="line">adapter-&gt;ptp_caps.settime64 = igb_ptp_settime_82576;</span><br><span class="line">adapter-&gt;ptp_caps.enable = igb_ptp_feature_enable;</span><br><span class="line">adapter-&gt;cc.read = igb_ptp_read_82580;</span><br><span class="line">adapter-&gt;cc.mask = CYCLECOUNTER_MASK(IGB_NBITS_82580);</span><br><span class="line">adapter-&gt;cc.mult = <span class="number">1</span>;</span><br><span class="line">adapter-&gt;cc.shift = <span class="number">0</span>;</span><br><span class="line">adapter-&gt;ptp_flags |= IGB_PTP_OVERFLOW_CHECK;</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line"><span class="keyword">case</span> e1000_i210:</span><br><span class="line"><span class="keyword">case</span> e1000_i211:</span><br><span class="line"><span class="keyword">for</span> (i = <span class="number">0</span>; i &lt; IGB_N_SDP; i++) &#123;</span><br><span class="line"><span 
class="class"><span class="keyword">struct</span> <span class="title">ptp_pin_desc</span> *<span class="title">ppd</span> =</span> &amp;adapter-&gt;sdp_config[i];</span><br><span class="line"></span><br><span class="line"><span class="built_in">snprintf</span>(ppd-&gt;name, <span class="keyword">sizeof</span>(ppd-&gt;name), <span class="string">&quot;SDP%d&quot;</span>, i);</span><br><span class="line">ppd-&gt;index = i;</span><br><span class="line">ppd-&gt;func = PTP_PF_NONE;</span><br><span class="line">&#125;</span><br><span class="line"><span class="built_in">snprintf</span>(adapter-&gt;ptp_caps.name, <span class="number">16</span>, <span class="string">&quot;%pm&quot;</span>, netdev-&gt;dev_addr);</span><br><span class="line">adapter-&gt;ptp_caps.owner = THIS_MODULE;</span><br><span class="line">adapter-&gt;ptp_caps.max_adj = <span class="number">62499999</span>;</span><br><span class="line">adapter-&gt;ptp_caps.n_ext_ts = IGB_N_EXTTS;</span><br><span class="line">adapter-&gt;ptp_caps.n_per_out = IGB_N_PEROUT;</span><br><span class="line">adapter-&gt;ptp_caps.n_pins = IGB_N_SDP;</span><br><span class="line">adapter-&gt;ptp_caps.pps = <span class="number">1</span>;</span><br><span class="line">adapter-&gt;ptp_caps.pin_config = adapter-&gt;sdp_config;</span><br><span class="line">adapter-&gt;ptp_caps.adjfine = igb_ptp_adjfine_82580;</span><br><span class="line">adapter-&gt;ptp_caps.adjtime = igb_ptp_adjtime_i210;</span><br><span class="line">adapter-&gt;ptp_caps.gettimex64 = igb_ptp_gettimex_i210;</span><br><span class="line">adapter-&gt;ptp_caps.settime64 = igb_ptp_settime_i210;</span><br><span class="line">adapter-&gt;ptp_caps.enable = igb_ptp_feature_enable_i210;</span><br><span class="line">adapter-&gt;ptp_caps.verify = igb_ptp_verify_pin;</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line"><span class="keyword">default</span>:</span><br><span class="line">adapter-&gt;ptp_clock = <span 
class="literal">NULL</span>;</span><br><span class="line"><span class="keyword">return</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">spin_lock_init(&amp;adapter-&gt;tmreg_lock);</span><br><span class="line">INIT_WORK(&amp;adapter-&gt;ptp_tx_work, igb_ptp_tx_work);</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (adapter-&gt;ptp_flags &amp; IGB_PTP_OVERFLOW_CHECK)</span><br><span class="line">INIT_DELAYED_WORK(&amp;adapter-&gt;ptp_overflow_work,</span><br><span class="line">  igb_ptp_overflow_check);</span><br><span class="line"></span><br><span class="line">adapter-&gt;tstamp_config.rx_filter = HWTSTAMP_FILTER_NONE;</span><br><span class="line">adapter-&gt;tstamp_config.tx_type = HWTSTAMP_TX_OFF;</span><br><span class="line"></span><br><span class="line">igb_ptp_reset(adapter);</span><br><span class="line"></span><br><span class="line">adapter-&gt;ptp_clock = ptp_clock_register(&amp;adapter-&gt;ptp_caps,</span><br><span class="line">&amp;adapter-&gt;pdev-&gt;dev);</span><br><span class="line"><span class="keyword">if</span> (IS_ERR(adapter-&gt;ptp_clock)) &#123;</span><br><span class="line">adapter-&gt;ptp_clock = <span class="literal">NULL</span>;</span><br><span class="line">dev_err(&amp;adapter-&gt;pdev-&gt;dev, <span class="string">&quot;ptp_clock_register failed\n&quot;</span>);</span><br><span class="line">&#125; <span class="keyword">else</span> <span class="keyword">if</span> (adapter-&gt;ptp_clock) &#123;</span><br><span class="line">dev_info(&amp;adapter-&gt;pdev-&gt;dev, <span class="string">&quot;added PHC on %s\n&quot;</span>,</span><br><span class="line"> adapter-&gt;netdev-&gt;name);</span><br><span class="line">adapter-&gt;ptp_flags |= IGB_PTP_ENABLED;</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span 
class="line"></span><br></pre></td></tr></table></figure><p>The function <code>ptp_clock_register</code> mainly does the following:</p><ol><li>Creates a character device and sets its device number and name; once registered, the <code>PTP</code> clock can be accessed at <code>/dev/ptpx</code> (where x is the clock index)</li><li>If the clock itself supports <code>PPS(Pulse-per-Second)</code>, it also registers a <code>PPS</code> source via <code>pps_register_source</code></li><li>Finally, it registers the <code>PTP</code> clock as a standard <code>posix</code> clock, so that the <code>PTP</code> clock can be accessed through the standard <code>posix</code> clock interfaces</li></ol><figure class="highlight c"><table><tr><td class="code"><pre><span class="line"></span><br><span class="line"><span class="keyword">struct</span> ptp_clock *<span class="title function_">ptp_clock_register</span><span class="params">(<span class="keyword">struct</span> ptp_clock_info *info,</span></span><br><span class="line"><span class="params">     <span class="keyword">struct</span> device *parent)</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">ptp_clock</span> *<span class="title">ptp</span>;</span></span><br><span class="line"><span class="type">int</span> err = <span class="number">0</span>, index, major = MAJOR(ptp_devt);</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (info-&gt;n_alarm &gt; 
PTP_MAX_ALARMS)</span><br><span class="line"><span class="keyword">return</span> ERR_PTR(-EINVAL);</span><br><span class="line"></span><br><span class="line"><span class="comment">/* Initialize a clock structure. */</span></span><br><span class="line">err = -ENOMEM;</span><br><span class="line">ptp = kzalloc(<span class="keyword">sizeof</span>(<span class="keyword">struct</span> ptp_clock), GFP_KERNEL);</span><br><span class="line"><span class="keyword">if</span> (ptp == <span class="literal">NULL</span>)</span><br><span class="line"><span class="keyword">goto</span> no_memory;</span><br><span class="line"></span><br><span class="line">index = ida_simple_get(&amp;ptp_clocks_map, <span class="number">0</span>, MINORMASK + <span class="number">1</span>, GFP_KERNEL);</span><br><span class="line"><span class="keyword">if</span> (index &lt; <span class="number">0</span>) &#123;</span><br><span class="line">err = index;</span><br><span class="line"><span class="keyword">goto</span> no_slot;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">ptp-&gt;clock.ops = ptp_clock_ops;</span><br><span class="line">ptp-&gt;info = info;</span><br><span class="line">ptp-&gt;devid = MKDEV(major, index);</span><br><span class="line">ptp-&gt;index = index;</span><br><span class="line">spin_lock_init(&amp;ptp-&gt;tsevq.lock);</span><br><span class="line">mutex_init(&amp;ptp-&gt;tsevq_mux);</span><br><span class="line">mutex_init(&amp;ptp-&gt;pincfg_mux);</span><br><span class="line">init_waitqueue_head(&amp;ptp-&gt;tsev_wq);</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (ptp-&gt;info-&gt;do_aux_work) &#123;</span><br><span class="line">kthread_init_delayed_work(&amp;ptp-&gt;aux_work, ptp_aux_kworker);</span><br><span class="line">ptp-&gt;kworker = kthread_create_worker(<span class="number">0</span>, <span class="string">&quot;ptp%d&quot;</span>, ptp-&gt;index);</span><br><span class="line"><span 
class="keyword">if</span> (IS_ERR(ptp-&gt;kworker)) &#123;</span><br><span class="line">err = PTR_ERR(ptp-&gt;kworker);</span><br><span class="line">pr_err(<span class="string">&quot;failed to create ptp aux_worker %d\n&quot;</span>, err);</span><br><span class="line"><span class="keyword">goto</span> kworker_err;</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">err = ptp_populate_pin_groups(ptp);</span><br><span class="line"><span class="keyword">if</span> (err)</span><br><span class="line"><span class="keyword">goto</span> no_pin_groups;</span><br><span class="line"></span><br><span class="line"><span class="comment">/* Register a new PPS source. */</span></span><br><span class="line"><span class="keyword">if</span> (info-&gt;pps) &#123;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">pps_source_info</span> <span class="title">pps</span>;</span></span><br><span class="line"><span class="built_in">memset</span>(&amp;pps, <span class="number">0</span>, <span class="keyword">sizeof</span>(pps));</span><br><span class="line"><span class="built_in">snprintf</span>(pps.name, PPS_MAX_NAME_LEN, <span class="string">&quot;ptp%d&quot;</span>, index);</span><br><span class="line">pps.mode = PTP_PPS_MODE;</span><br><span class="line">pps.owner = info-&gt;owner;</span><br><span class="line">ptp-&gt;pps_source = pps_register_source(&amp;pps, PTP_PPS_DEFAULTS);</span><br><span class="line"><span class="keyword">if</span> (IS_ERR(ptp-&gt;pps_source)) &#123;</span><br><span class="line">err = PTR_ERR(ptp-&gt;pps_source);</span><br><span class="line">pr_err(<span class="string">&quot;failed to register pps source\n&quot;</span>);</span><br><span class="line"><span class="keyword">goto</span> no_pps;</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span 
class="comment">/* Initialize a new device of our class in our clock structure. */</span></span><br><span class="line">device_initialize(&amp;ptp-&gt;dev);</span><br><span class="line">ptp-&gt;dev.devt = ptp-&gt;devid;</span><br><span class="line">ptp-&gt;dev.class = ptp_class;</span><br><span class="line">ptp-&gt;dev.parent = parent;</span><br><span class="line">ptp-&gt;dev.groups = ptp-&gt;pin_attr_groups;</span><br><span class="line">ptp-&gt;dev.release = ptp_clock_release;</span><br><span class="line">dev_set_drvdata(&amp;ptp-&gt;dev, ptp);</span><br><span class="line">dev_set_name(&amp;ptp-&gt;dev, <span class="string">&quot;ptp%d&quot;</span>, ptp-&gt;index);</span><br><span class="line"></span><br><span class="line"><span class="comment">/* Create a posix clock and link it to the device. */</span></span><br><span class="line">err = posix_clock_register(&amp;ptp-&gt;clock, &amp;ptp-&gt;dev);</span><br><span class="line"><span class="keyword">if</span> (err) &#123;</span><br><span class="line">pr_err(<span class="string">&quot;failed to create posix clock\n&quot;</span>);</span><br><span class="line"><span class="keyword">goto</span> no_clock;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> ptp;</span><br><span class="line">  ...</span><br><span class="line">&#125;</span><br><span class="line">EXPORT_SYMBOL(ptp_clock_register);</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><h2 id="总结"><a href="#总结" class="headerlink" title="总结"></a><strong>Summary</strong></h2><p>High-precision time synchronization is hard to get right. From the original <code>NTP</code> to today's <code>PTP</code>, synchronization accuracy has reached the microsecond level. As real-time audio/video, autonomous driving, and similar scenarios demand ever higher accuracy, new synchronization methods have been derived from the <code>PTP</code> protocol. The more common ones are:</p><ul><li><code>TSN (Time Sensitive Networking)</code> is a set of protocols that provides deterministic latency, high reliability, and priority control, and is generally used to carry real-time audio/video streams; the <code>TSN</code> protocol suite includes a refined <a 
href="https://blog.meinbergglobal.com/2024/03/27/what-is-gptp/"><code>PTP</code> profile, <code>gPTP (generalized Precision Time Protocol, IEEE 802.1AS)</code></a></li><li><a href="https://www.intel.com/content/www/us/en/docs/programmable/683140/21-4-4-0-0/precision-time-measurement-ptm-58323.html"><code>PTM (Precision Time Measurement)/ePTM (enhanced Precision Time Measurement)</code></a>, a time synchronization protocol implemented over the <code>PCIe</code> bus. <code>PTM</code> is currently used in desktop and server systems, while <code>PTP</code> remains the most widely used protocol in automotive systems</li></ul><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a><strong>References</strong></h2><ul><li><a href="https://info.support.huawei.com/info-finder/encyclopedia/zh/1588v2.html">https://info.support.huawei.com/info-finder/encyclopedia/zh/1588v2.html</a></li><li><a href="https://blog.csdn.net/woswod/article/details/82345380">https://blog.csdn.net/woswod/article/details/82345380</a></li><li><a href="https://blog.csdn.net/yaojiawan/article/details/124601694">https://blog.csdn.net/yaojiawan/article/details/124601694</a></li><li><a href="https://linux.die.net/man/8/ptp4l">https://linux.die.net/man/8/ptp4l</a></li><li><a href="https://info.support.huawei.com/info-finder/encyclopedia/zh/NTP.html">https://info.support.huawei.com/info-finder/encyclopedia/zh/NTP.html</a></li><li><a href="https://eci.intel.com/docs/3.3/development/performance/tsnrefsw/tsn-overview.html">https://eci.intel.com/docs/3.3/development/performance/tsnrefsw/tsn-overview.html</a></li><li><a href="https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/7/html/system_administrators_guide/ch-configuring_ptp_using_ptp4l">https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/7/html/system_administrators_guide/ch-configuring_ptp_using_ptp4l</a></li></ul>]]></content>
    
    
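    <summary type="html">&lt;p&gt;Modern life is inseparable from time. Whether we are commuting to work or traveling, we need to know the exact time where we are. In everyday life, accuracy to the minute or second is usually enough, but engineering applications such as aircraft cruise control, machine control, and network management need far higher precision: we have to know exactly how much time elapsed between two events. Measuring time precisely is a complex technical task, and synchronizing time across different networks and devices around the world is very challenging. The first problem is how to measure time precisely; the second is how to distribute that time accurately to other systems and devices. The first can be solved with &lt;a href=&quot;https://en.wikipedia.org/wiki/Atomic_clock&quot;&gt;atomic clocks&lt;/a&gt;: the cesium atomic clocks used for standard time can reach an error of 1 second in 100 million years, and satellite navigation systems such as &lt;code&gt;GPS&lt;/code&gt; and BeiDou carry atomic clocks for high-precision navigation, so a &lt;code&gt;GPS&lt;/code&gt; signal can also serve as a clock source for time distribution. The second problem, time synchronization, is generally solved with standard protocols. This article focuses on one widely used synchronization protocol, &lt;a href=&quot;https://en.wikipedia.org/wiki/Precision_Time_Protocol&quot;&gt;&lt;code&gt;PTP (Precision Time Protocol)&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;</summary>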
    
    
    
    <category term="网络协议" scheme="https://sniffer.site/categories/%E7%BD%91%E7%BB%9C%E5%8D%8F%E8%AE%AE/"/>
    
    <category term="Linux" scheme="https://sniffer.site/categories/%E7%BD%91%E7%BB%9C%E5%8D%8F%E8%AE%AE/Linux/"/>
    
    
    <category term="时间同步" scheme="https://sniffer.site/tags/%E6%97%B6%E9%97%B4%E5%90%8C%E6%AD%A5/"/>
    
    <category term="PTP" scheme="https://sniffer.site/tags/PTP/"/>
    
    <category term="Precise Time Protocol" scheme="https://sniffer.site/tags/Precise-Time-Protocol/"/>
    
    <category term="gPTP" scheme="https://sniffer.site/tags/gPTP/"/>
    
  </entry>
  
  <entry>
    <title>Linux Containers (LXC) in Depth, Part 2: LXC Source Code Analysis</title>
    <link href="https://sniffer.site/2025/06/18/%E6%B7%B1%E5%85%A5Linux%E5%AE%B9%E5%99%A8LXC%E4%B9%8B%E4%BA%8C-LXC%E6%BA%90%E7%A0%81%E5%88%86%E6%9E%90/"/>
    <id>https://sniffer.site/2025/06/18/%E6%B7%B1%E5%85%A5Linux%E5%AE%B9%E5%99%A8LXC%E4%B9%8B%E4%BA%8C-LXC%E6%BA%90%E7%A0%81%E5%88%86%E6%9E%90/</id>
    <published>2025-06-18T09:55:26.000Z</published>
    <updated>2025-07-09T11:29:59.537Z</updated>
    
    <content type="html"><![CDATA[<p>The previous article, <a href="https://sniffer.site/2025/06/06/%E6%B7%B1%E5%85%A5linux%E5%AE%B9%E5%99%A8lxc%E4%B9%8B%E4%B8%80-lxc%E7%9A%84%E5%AE%9E%E7%8E%B0%E5%8E%9F%E7%90%86/">Linux Containers (LXC) in Depth, Part 1: How LXC Is Implemented</a>, covered the principles behind <code>LXC</code> containers and gave us a basic understanding of how they work. One question was left open: after a container starts, the container system has two processes named <code>init</code>. Why are there two <code>init</code> processes? Answering that requires a closer look at the <code>LXC</code> source code. This article is built around that question and is split into two main parts:</p><ul><li>The implementation of <code>lxc-create</code>: how an <code>LXC</code> container is created</li><li>The implementation of <code>lxc-start</code>: how an <code>LXC</code> container is started</li></ul><span id="more"></span><p>The <code>LXC</code> source code can be downloaded from its <a href="https://github.com/lxc/lxc"><code>GitHub</code> repository</a>. Its core consists of the following parts:</p><ul><li><code>config</code>: common configuration, such as the <code>apparmor</code> configuration, container startup configuration, <code>init</code> process configuration, and <code>selinux</code> rules</li><li><code>doc</code>: documentation for the <code>LXC</code> tools and common configuration templates</li><li><code>src</code>: the source directory, including header files, the core code, and the test code</li><li><code>templates</code>: container image templates, including commonly used ones such as <code>busybox</code>, <code>local</code>, and <code>download</code>; more templates can be downloaded from the <a href="https://linuxcontainers.org/lxc/downloads/"><code>LXC</code> website</a></li></ul><p>Next, we analyze the <code>LXC</code> source code by following two processes: container creation and container startup.</p><h2 id="LXC容器的创建过程"><a href="#LXC容器的创建过程" class="headerlink" title="LXC容器的创建过程"></a><strong>How an <code>LXC</code> Container Is Created</strong></h2><p>The source code of the <code>LXC</code> tools lives under <code>src/lxc/tools</code>. We walk through the implementation by following the creation and startup flows. First, consider the call flow of <code>lxc-create</code>. Creating a container with the following commands invokes the <code>lxc-create</code> tool:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">sudo lxc-create -n busybox-lxc -t busybox</span><br><span class="line"></span><br><span class="line"><span class="comment">## 
To see more log output, add the -l and -o options, which record the creation process in detail</span></span><br><span class="line">sudo lxc-create -n busybox-lxc1 -t busybox -l TRACE -o /home/jason/Downloads/busybox-lxc-create.log</span><br><span class="line"></span><br></pre></td></tr></table></figure><p><code>lxc-create</code> is implemented in the file <code>lxc_create.c</code>. Starting from its entry function <code>lxc_create_main</code>, container creation goes through the following key steps:</p><ul><li><code>lxc_arguments_parse</code>: parses the command-line arguments, mainly the container name (<code>-n</code>) and the container template (<code>-t</code>), and sets the path used to store the container's persistent configuration</li><li><code>lxc_log_init</code>: initializes logging (no log file is written by default; the log path has to be set with the <code>-o</code> option)</li><li><code>lxc_mkdir_p</code>: creates the directory that holds the container's configuration and <code>rootfs</code> image; on <code>ubuntu</code> the default path is <code>/var/lib/lxc</code></li><li><code>lxc_container_new</code>: creates and initializes a new <code>struct lxc_container</code>; if the container is already defined, <code>lxc-create</code> reports an error and exits</li><li>Calls the container's <code>load_config</code> function to load the configuration: the file given on the command line if there is one, otherwise the global <code>lxc.default_config</code>. On first creation the container's own configuration file (<code>/var/lib/lxc/&lt;lxc-name&gt;/config</code>) does not exist yet</li><li>Calls <code>create</code> to create the container: this mainly creates the directory that stores the container's <code>rootfs</code>, runs the script for the chosen template, and saves the resulting container configuration</li></ul><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span 
class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="type">int</span> __attribute__((weak, alias(<span class="string">&quot;lxc_create_main&quot;</span>))) main(<span class="type">int</span> argc, <span class="type">char</span> *argv[]);</span><br><span class="line"><span class="type">int</span> <span class="title function_">lxc_create_main</span><span class="params">(<span class="type">int</span> argc, <span class="type">char</span> *argv[])</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">lxc_container</span> *<span class="title">c</span>;</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">bdev_specs</span> <span class="title">spec</span>;</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">lxc_log</span> <span class="title">log</span>;</span></span><br><span 
class="line"><span class="type">int</span> flags = <span class="number">0</span>;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (lxc_arguments_parse(&amp;my_args, argc, argv))</span><br><span class="line"><span class="built_in">exit</span>(EXIT_FAILURE);</span><br><span class="line"></span><br><span class="line"><span class="built_in">log</span>.name = my_args.name;</span><br><span class="line"><span class="built_in">log</span>.file = my_args.log_file;</span><br><span class="line"><span class="built_in">log</span>.level = my_args.log_priority;</span><br><span class="line"><span class="built_in">log</span>.prefix = my_args.progname;</span><br><span class="line"><span class="built_in">log</span>.quiet = my_args.quiet;</span><br><span class="line"><span class="built_in">log</span>.lxcpath = my_args.lxcpath[<span class="number">0</span>];</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (lxc_log_init(&amp;<span class="built_in">log</span>))</span><br><span class="line"><span class="built_in">exit</span>(EXIT_FAILURE);</span><br><span class="line"></span><br><span class="line">···</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (lxc_mkdir_p(my_args.lxcpath[<span class="number">0</span>], <span class="number">0755</span>))</span><br><span class="line"><span class="built_in">exit</span>(EXIT_FAILURE);</span><br><span class="line"></span><br><span class="line">...</span><br><span class="line"></span><br><span class="line">c = lxc_container_new(my_args.name, my_args.lxcpath[<span class="number">0</span>]);</span><br><span class="line"><span class="keyword">if</span> (!c) &#123;</span><br><span class="line">ERROR(<span class="string">&quot;Failed to create lxc container&quot;</span>);</span><br><span class="line"><span class="built_in">exit</span>(EXIT_FAILURE);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span 
class="line"><span class="keyword">if</span> (c-&gt;is_defined(c)) &#123;</span><br><span class="line">lxc_container_put(c);</span><br><span class="line">ERROR(<span class="string">&quot;Container already exists&quot;</span>);</span><br><span class="line"><span class="built_in">exit</span>(EXIT_FAILURE);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (my_args.configfile)</span><br><span class="line">c-&gt;load_config(c, my_args.configfile);</span><br><span class="line"><span class="keyword">else</span></span><br><span class="line">c-&gt;load_config(c, lxc_get_global_config_item(<span class="string">&quot;lxc.default_config&quot;</span>));</span><br><span class="line"></span><br><span class="line">...</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!c-&gt;create(c, my_args.template, my_args.bdevtype, &amp;spec, flags, &amp;argv[optind])) &#123;</span><br><span class="line">ERROR(<span class="string">&quot;Failed to create container %s&quot;</span>, c-&gt;name);</span><br><span class="line">lxc_container_put(c);</span><br><span class="line"><span class="built_in">exit</span>(EXIT_FAILURE);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">lxc_container_put(c);</span><br><span class="line"><span class="built_in">exit</span>(EXIT_SUCCESS);</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Now look at the key step of container creation, the call to <code>c-&gt;create</code>. When <code>lxc_container_new</code> builds the container object, it initializes the container's member function pointers. As a result, <code>c-&gt;create</code> resolves to <code>lxcapi_create</code>, which in turn calls the core function <code>__lxcapi_create</code> (see <code>lxccontainer.c</code> for details):</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span 
class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span 
class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">   <span class="comment">//lxc_container_new</span></span><br><span class="line">c-&gt;daemonize= <span class="literal">true</span>;</span><br><span class="line">c-&gt;pidfile= <span class="literal">NULL</span>;</span><br><span class="line"></span><br><span class="line"><span class="comment">/* Assign the member functions. */</span></span><br><span class="line">c-&gt;is_defined= lxcapi_is_defined;</span><br><span class="line">c-&gt;state= lxcapi_state;</span><br><span class="line">c-&gt;is_running= lxcapi_is_running;</span><br><span class="line">c-&gt;freeze= lxcapi_freeze;</span><br><span class="line">c-&gt;unfreeze= lxcapi_unfreeze;</span><br><span class="line">c-&gt;console= lxcapi_console;</span><br><span class="line">c-&gt;console_getfd= lxcapi_console_getfd;</span><br><span class="line">c-&gt;devpts_fd= lxcapi_devpts_fd;</span><br><span class="line">c-&gt;init_pid= lxcapi_init_pid;</span><br><span class="line">c-&gt;init_pidfd= lxcapi_init_pidfd;</span><br><span class="line">c-&gt;load_config= lxcapi_load_config;</span><br><span class="line">c-&gt;want_daemonize= lxcapi_want_daemonize;</span><br><span class="line">c-&gt;want_close_all_fds= lxcapi_want_close_all_fds;</span><br><span class="line">c-&gt;start= lxcapi_start;</span><br><span class="line">c-&gt;startl= lxcapi_startl;</span><br><span class="line">c-&gt;stop= lxcapi_stop;</span><br><span class="line">c-&gt;config_file_name= lxcapi_config_file_name;</span><br><span class="line">c-&gt;wait= lxcapi_wait;</span><br><span class="line">c-&gt;set_config_item= lxcapi_set_config_item;</span><br><span class="line">c-&gt;destroy= lxcapi_destroy;</span><br><span class="line">c-&gt;destroy_with_snapshots= lxcapi_destroy_with_snapshots;</span><br><span class="line">c-&gt;rename= lxcapi_rename;</span><br><span class="line">c-&gt;save_config= 
lxcapi_save_config;</span><br><span class="line">c-&gt;get_keys= lxcapi_get_keys;</span><br><span class="line">c-&gt;create= lxcapi_create;</span><br><span class="line">c-&gt;createl= lxcapi_createl;</span><br><span class="line">c-&gt;shutdown= lxcapi_shutdown;</span><br><span class="line">c-&gt;reboot= lxcapi_reboot;</span><br><span class="line">c-&gt;reboot2= lxcapi_reboot2;</span><br><span class="line">c-&gt;clear_config= lxcapi_clear_config;</span><br><span class="line">c-&gt;clear_config_item= lxcapi_clear_config_item;</span><br><span class="line">c-&gt;get_config_item= lxcapi_get_config_item;</span><br><span class="line">c-&gt;get_running_config_item= lxcapi_get_running_config_item;</span><br><span class="line">c-&gt;get_cgroup_item= lxcapi_get_cgroup_item;</span><br><span class="line">c-&gt;set_cgroup_item = lxcapi_set_cgroup_item;</span><br><span class="line">c-&gt;get_config_path = lxcapi_get_config_path;</span><br><span class="line">c-&gt;set_config_path = lxcapi_set_config_path;</span><br><span class="line">c-&gt;clone= lxcapi_clone;</span><br><span class="line">c-&gt;get_interfaces= lxcapi_get_interfaces;</span><br><span class="line">c-&gt;get_ips= lxcapi_get_ips;</span><br><span class="line">c-&gt;attach= lxcapi_attach;</span><br><span class="line">c-&gt;attach_run_wait= lxcapi_attach_run_wait;</span><br><span class="line">c-&gt;attach_run_waitl= lxcapi_attach_run_waitl;</span><br><span class="line">c-&gt;snapshot= lxcapi_snapshot;</span><br><span class="line">c-&gt;snapshot_list= lxcapi_snapshot_list;</span><br><span class="line">c-&gt;snapshot_restore= lxcapi_snapshot_restore;</span><br><span class="line">c-&gt;snapshot_destroy = lxcapi_snapshot_destroy;</span><br><span class="line">c-&gt;snapshot_destroy_all= lxcapi_snapshot_destroy_all;</span><br><span class="line">c-&gt;may_control= lxcapi_may_control;</span><br><span class="line">c-&gt;add_device_node= lxcapi_add_device_node;</span><br><span class="line">c-&gt;remove_device_node= 
lxcapi_remove_device_node;</span><br><span class="line">c-&gt;attach_interface= lxcapi_attach_interface;</span><br><span class="line">c-&gt;detach_interface = lxcapi_detach_interface;</span><br><span class="line">c-&gt;checkpoint= lxcapi_checkpoint;</span><br><span class="line">c-&gt;restore= lxcapi_restore;</span><br><span class="line">c-&gt;migrate = lxcapi_migrate;</span><br><span class="line">c-&gt;console_log= lxcapi_console_log;</span><br><span class="line">c-&gt;mount= lxcapi_mount;</span><br><span class="line">c-&gt;umount= lxcapi_umount;</span><br><span class="line">c-&gt;seccomp_notify_fd= lxcapi_seccomp_notify_fd;</span><br><span class="line">c-&gt;seccomp_notify_fd_active= lxcapi_seccomp_notify_fd_active;</span><br><span class="line">c-&gt;set_timeout= lxcapi_set_timeout;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>The container creation function <code>__lxcapi_create</code> goes through the following steps:</p><ul><li>It first calls <code>get_template_path</code> to resolve the template path (a template is a script that builds the <code>rootfs</code>, the environment the container system boots from)</li><li>It then calls <code>create_container_dir</code> to create the directory that stores the container's <code>rootfs</code></li><li>It forks a child process that creates the container's backing storage and saves the current configuration, so that the new <code>rootfs</code> location is recorded</li><li>It runs the template script (<code>busybox</code> in this example) through <code>create_run_template</code> to build the container's initial boot environment (the templates shipped with the <code>LXC</code> source are in the <code>templates</code> directory)</li><li>Finally, it calls <code>load_config_locked</code> to load the saved configuration file (<code>/var/lib/lxc/&lt;lxc-name&gt;/config</code>) into memory</li></ul><blockquote><p>To keep a container available after the system reboots, its rootfs has to live on persistent backing storage. lxc currently supports several storage backends, such as a plain directory (the default), a logical block device, or an overlay filesystem; see <code>src/lxc/storage/storage.c</code> for details</p></blockquote><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span 
class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span 
class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="type">static</span> <span class="type">bool</span> __lxcapi_create(<span class="keyword">struct</span> lxc_container *c, <span class="type">const</span> <span class="type">char</span> *t,</span><br><span class="line">    <span class="type">const</span> <span class="type">char</span> *bdevtype, <span class="keyword">struct</span> bdev_specs *specs,</span><br><span class="line">    <span class="type">int</span> flags, <span class="type">char</span> *<span class="type">const</span> argv[])</span><br><span class="line">&#123;</span><br><span class="line">...</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (t) &#123;</span><br><span class="line">path_template = get_template_path(t);</span><br><span class="line"><span class="keyword">if</span> (!path_template)</span><br><span class="line"><span class="keyword">return</span> log_error(<span class="literal">false</span>, <span class="string">&quot;Template \&quot;%s\&quot; not found&quot;</span>, t);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">...</span><br><span class="line"></span><br><span class="line">fd_rootfs = create_container_dir(c);</span><br><span class="line"><span class="keyword">if</span> (fd_rootfs &lt; <span class="number">0</span>)</span><br><span class="line"><span 
class="keyword">return</span> log_error(<span class="literal">false</span>, <span class="string">&quot;Failed to create container %s&quot;</span>, c-&gt;name);</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="comment">/* Mark that this container as being created */</span></span><br><span class="line">partial_fd = create_partial(fd_rootfs, c);</span><br><span class="line"><span class="keyword">if</span> (partial_fd &lt; <span class="number">0</span>) &#123;</span><br><span class="line">SYSERROR(<span class="string">&quot;Failed to mark container as being partially created&quot;</span>);</span><br><span class="line"><span class="keyword">goto</span> out;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">/* No need to get disk lock bc we have the partial lock. */</span></span><br><span class="line"></span><br><span class="line">mask = umask(<span class="number">0022</span>);</span><br><span class="line"></span><br><span class="line"><span class="comment">/* Create the storage.</span></span><br><span class="line"><span class="comment"> * Note we can&#x27;t do this in the same task as we use to execute the</span></span><br><span class="line"><span class="comment"> * template because of the way zfs works.</span></span><br><span class="line"><span class="comment"> * After you &#x27;zfs create&#x27;, zfs mounts the fs only in the initial</span></span><br><span class="line"><span class="comment"> * namespace.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line">pid = fork();</span><br><span class="line">...</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (pid == <span class="number">0</span>) &#123; <span class="comment">/* child */</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">lxc_storage</span> *<span 
class="title">bdev</span> =</span> <span class="literal">NULL</span>;</span><br><span class="line"></span><br><span class="line">bdev = do_storage_create(c, bdevtype, specs);</span><br><span class="line">...</span><br><span class="line"></span><br><span class="line"><span class="comment">/* Save config file again to store the new rootfs location. */</span></span><br><span class="line"><span class="keyword">if</span> (!do_lxcapi_save_config(c, <span class="literal">NULL</span>)) &#123;</span><br><span class="line">ERROR(<span class="string">&quot;Failed to save initial config for %s&quot;</span>, c-&gt;name);</span><br><span class="line"><span class="comment">/* Parent task won&#x27;t see the storage driver in the</span></span><br><span class="line"><span class="comment"> * config so we delete it.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line">bdev-&gt;ops-&gt;umount(bdev);</span><br><span class="line">bdev-&gt;ops-&gt;destroy(bdev);</span><br><span class="line">_exit(EXIT_FAILURE);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">_exit(EXIT_SUCCESS);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!wait_exited(pid))</span><br><span class="line"><span class="keyword">goto</span> out_unlock;</span><br><span class="line"></span><br><span class="line"><span class="comment">/* Reload config to get the rootfs. 
*/</span></span><br><span class="line">lxc_conf_free(c-&gt;lxc_conf);</span><br><span class="line">c-&gt;lxc_conf = <span class="literal">NULL</span>;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!load_config_locked(c, c-&gt;configfile))</span><br><span class="line"><span class="keyword">goto</span> out_unlock;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!create_run_template(c, path_template, !!(flags &amp; LXC_CREATE_QUIET), argv))</span><br><span class="line"><span class="keyword">goto</span> out_unlock;</span><br><span class="line"></span><br><span class="line"><span class="comment">/* Now clear out the lxc_conf we have, reload from the created</span></span><br><span class="line"><span class="comment"> * container.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line">do_lxcapi_clear_config(c);</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (t) &#123;</span><br><span class="line"><span class="keyword">if</span> (!prepend_lxc_header(c-&gt;configfile, path_template, argv)) &#123;</span><br><span class="line">ERROR(<span class="string">&quot;Failed to prepend header to config file&quot;</span>);</span><br><span class="line"><span class="keyword">goto</span> out_unlock;</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">bret = load_config_locked(c, c-&gt;configfile);</span><br><span class="line">...</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> bret;</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><h2 id="LXC容器的启动过程"><a href="#LXC容器的启动过程" class="headerlink" title="LXC Container Startup Process"></a><strong>LXC Container Startup Process</strong></h2><h3 id="lxc-create"><a href="#lxc-create" class="headerlink" 
title="lxc-start"></a><strong>lxc-start</strong></h3><p>As with <code>lxc-create</code>, the implementation of <code>lxc-start</code> lives in the source file <code>lxc_start</code>. Starting from the entry function <code>lxc_start_main</code>, startup proceeds through the following main steps:</p><ul><li><code>lxc_caps_init</code>: checks the privileges the program is running with; if the check fails, the program exits</li><li><code>lxc_arguments_parse</code>: parses the specified command-line arguments; here we only pass <code>-n</code> with the container name</li><li><code>lxc_container_new</code>: creates a container object, after which the program checks whether the container is already running; if it is, execution ends there</li><li>Calls <code>lxc_container-&gt;start</code> to boot the container system and create its new <code>init</code> process</li></ul><figure class="highlight c"><table><tr><td class="code"><pre><span class="line"></span><br><span class="line"><span class="type">int</span> __attribute__((weak, alias(<span class="string">&quot;lxc_start_main&quot;</span>))) main(<span class="type">int</span> argc, <span class="type">char</span> *argv[]);</span><br><span class="line"><span class="type">int</span> <span class="title function_">lxc_start_main</span><span class="params">(<span class="type">int</span> argc, <span class="type">char</span> *argv[])</span></span><br><span class="line">&#123;</span><br><span class="line">...</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (lxc_caps_init())</span><br><span class="line"><span class="built_in">exit</span>(err);</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (lxc_arguments_parse(&amp;my_args, argc, argv))</span><br><span class="line"><span class="built_in">exit</span>(err);</span><br><span class="line"></span><br><span 
class="line"><span class="keyword">if</span> (!my_args.argc)</span><br><span class="line">args = default_args;</span><br><span class="line"><span class="keyword">else</span></span><br><span class="line">args = my_args.argv;</span><br><span class="line"></span><br><span class="line"><span class="built_in">log</span>.name = my_args.name;</span><br><span class="line"><span class="built_in">log</span>.file = my_args.log_file;</span><br><span class="line"><span class="built_in">log</span>.level = my_args.log_priority;</span><br><span class="line"><span class="built_in">log</span>.prefix = my_args.progname;</span><br><span class="line"><span class="built_in">log</span>.quiet = my_args.quiet;</span><br><span class="line"><span class="built_in">log</span>.lxcpath = my_args.lxcpath[<span class="number">0</span>];</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (lxc_log_init(&amp;<span class="built_in">log</span>))</span><br><span class="line"><span class="built_in">exit</span>(err);</span><br><span class="line"></span><br><span class="line">lxcpath = my_args.lxcpath[<span class="number">0</span>];</span><br><span class="line"><span class="keyword">if</span> (access(lxcpath, O_RDONLY) &lt; <span class="number">0</span>) &#123;</span><br><span class="line">ERROR(<span class="string">&quot;You lack access to %s&quot;</span>, lxcpath);</span><br><span class="line"><span class="built_in">exit</span>(err);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (my_args.rcfile) &#123;</span><br><span class="line">...</span><br><span class="line"><span class="comment">// no rc config file specified</span></span><br><span class="line">&#125; <span class="keyword">else</span> &#123;</span><br><span class="line"><span class="type">int</span> rc;</span><br><span class="line"></span><br><span class="line">rc = asprintf(&amp;rcfile, <span 
class="string">&quot;%s/%s/config&quot;</span>, lxcpath, my_args.name);</span><br><span class="line"><span class="keyword">if</span> (rc == <span class="number">-1</span>) &#123;</span><br><span class="line">ERROR(<span class="string">&quot;Failed to allocate memory&quot;</span>);</span><br><span class="line"><span class="built_in">exit</span>(err);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">/* container configuration does not exist */</span></span><br><span class="line"><span class="keyword">if</span> (access(rcfile, F_OK)) &#123;</span><br><span class="line"><span class="built_in">free</span>(rcfile);</span><br><span class="line">rcfile = <span class="literal">NULL</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">c = lxc_container_new(my_args.name, lxcpath);</span><br><span class="line"><span class="keyword">if</span> (!c) &#123;</span><br><span class="line">ERROR(<span class="string">&quot;Failed to create lxc_container&quot;</span>);</span><br><span class="line"><span class="built_in">exit</span>(err);</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line">...</span><br><span class="line"><span class="keyword">if</span> (c-&gt;is_running(c)) &#123;</span><br><span class="line">ERROR(<span class="string">&quot;Container is already running&quot;</span>);</span><br><span class="line">err = EXIT_SUCCESS;</span><br><span class="line"><span class="keyword">goto</span> out;</span><br><span class="line">&#125;</span><br><span class="line">    ...</span><br><span class="line"></span><br><span class="line"><span class="comment">// use the default startup arguments</span></span><br><span class="line"><span class="keyword">if</span> (args == default_args)</span><br><span class="line">err = c-&gt;start(c, <span class="number">0</span>, <span class="literal">NULL</span>) ? 
EXIT_SUCCESS : EXIT_FAILURE;</span><br><span class="line"><span class="keyword">else</span></span><br><span class="line">err = c-&gt;start(c, <span class="number">0</span>, args) ? EXIT_SUCCESS : EXIT_FAILURE;</span><br><span class="line"></span><br><span class="line">...</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><h3 id="do-lxcapi-start"><a href="#do-lxcapi-start" class="headerlink" title="do_lxcapi_start"></a><strong>do_lxcapi_start</strong></h3><p>As the container creation walkthrough above showed, <code>lxc_container-&gt;start</code> actually invokes the function <code>lxcapi_start</code>, which in turn ends up in <code>do_lxcapi_start</code>. It takes three parameters:</p><ul><li><code>struct lxc_container</code>: the container object that has already been created</li><li><code>int useinit</code>: whether to use the <code>init</code> helper; when booting a complete container image, <code>useinit</code> is <code>0</code> (the <code>lxc_start</code> path); otherwise <code>init</code> is used to launch a new process (the <code>lxc_execute</code> path)</li><li><code>char *const argv[]</code>: the arguments for booting the system; <code>NULL</code> is passed here, so the function falls back to its default arguments</li></ul><p>The key steps in starting a container are:</p><ul><li><p><code>ongoing_create</code>: checks whether the container was created properly; the flow below continues only if creation completed successfully</p></li><li><p><code>lxc_init_handler</code>: creates the handler object for the container's state, including things such as the <code>unix</code> socket used for state management</p></li><li><p>By default the container starts in the background (<code>daemonize</code>). To make sure it starts cleanly, <code>fork</code> is called twice. The first parent waits for the container's startup status, returns it to the <code>lxc-start</code> caller, and eventually exits. The first child sets its process title to <code>[lxc monitor] &lt;config&gt; &lt;lxc-name&gt;</code>, forks again, and then exits immediately. The second child, now reparented to <code>init</code>, carries on as the container's monitor process: it maintains the container's state and boots the container system</p></li><li><p>A background process mainly needs the following set up:</p><ul><li><code>chdir</code>: changes the working directory to the root directory <code>/</code></li><li><code>inherit_fds</code>: closes unrelated file descriptors</li><li><code>null_stdfds</code>: redirects standard input, output, and error to the <code>/dev/null</code> device</li><li><code>setsid</code>: makes the process the leader (<code>leader</code>) of a new session</li></ul></li><li><p>Calls <code>lxc_start</code> to complete the final startup and initialization of the container</p></li></ul><figure class="highlight c"><table><tr><td class="code"><pre><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="type">static</span> <span class="type">bool</span> <span class="title function_">do_lxcapi_start</span><span class="params">(<span class="keyword">struct</span> lxc_container *c, <span class="type">int</span> useinit, <span class="type">char</span> * <span class="type">const</span> argv[])</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="type">int</span> ret;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">lxc_handler</span> *<span class="title">handler</span>;</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">lxc_conf</span> *<span class="title">conf</span>;</span></span><br><span class="line"><span class="type">char</span> *default_args[] = &#123;</span><br><span class="line"><span class="string">&quot;/sbin/init&quot;</span>,</span><br><span class="line"><span class="literal">NULL</span>,</span><br><span class="line">&#125;;</span><br><span class="line">...</span><br><span class="line"></span><br><span class="line">ret = ongoing_create(c);</span><br><span class="line"><span class="keyword">switch</span> (ret) &#123;</span><br><span class="line"><span class="keyword">case</span> LXC_CREATE_FAILED:</span><br><span class="line">ERROR(<span class="string">&quot;Failed checking for incomplete container creation&quot;</span>);</span><br><span class="line"><span class="keyword">return</span> <span 
class="literal">false</span>;</span><br><span class="line"><span class="keyword">case</span> LXC_CREATE_ONGOING:</span><br><span class="line">ERROR(<span class="string">&quot;Ongoing container creation detected&quot;</span>);</span><br><span class="line"><span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line"><span class="keyword">case</span> LXC_CREATE_INCOMPLETE:</span><br><span class="line">ERROR(<span class="string">&quot;Failed to create container&quot;</span>);</span><br><span class="line">do_lxcapi_destroy(c);</span><br><span class="line"><span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">...</span><br><span class="line"></span><br><span class="line"><span class="comment">/* initialize handler */</span></span><br><span class="line">handler = lxc_init_handler(<span class="literal">NULL</span>, c-&gt;name, conf, c-&gt;config_path, c-&gt;daemonize);</span><br><span class="line">...</span><br><span class="line"></span><br><span class="line"><span class="comment">/* ... otherwise use default_args. */</span></span><br><span class="line"><span class="keyword">if</span> (!argv) &#123;</span><br><span class="line">...</span><br><span class="line">argv = default_args;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">/* I&#x27;m not sure what locks we want here.Any? Is liblxc&#x27;s locking enough</span></span><br><span class="line"><span class="comment"> * here to protect the on disk container?  
We don&#x27;t want to exclude</span></span><br><span class="line"><span class="comment"> * things like lxc_info while the container is running.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="keyword">if</span> (c-&gt;daemonize) &#123;</span><br><span class="line"><span class="type">bool</span> started;</span><br><span class="line"><span class="type">char</span> title[<span class="number">2048</span>];</span><br><span class="line"><span class="type">pid_t</span> pid_first, pid_second;</span><br><span class="line"></span><br><span class="line">pid_first = fork();</span><br><span class="line">...</span><br><span class="line"></span><br><span class="line"><span class="comment">/* first parent */</span></span><br><span class="line"><span class="keyword">if</span> (pid_first != <span class="number">0</span>) &#123;</span><br><span class="line">...</span><br><span class="line"><span class="comment">/* Wait for container to tell us whether it started</span></span><br><span class="line"><span class="comment"> * successfully.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line">started = wait_on_daemonized_start(handler, pid_first);</span><br><span class="line"></span><br><span class="line">free_init_cmd(init_cmd);</span><br><span class="line">lxc_put_handler(handler);</span><br><span class="line"><span class="keyword">return</span> started;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">/* first child */</span></span><br><span class="line"></span><br><span class="line"><span class="comment">/* We don&#x27;t really care if this doesn&#x27;t print all the</span></span><br><span class="line"><span class="comment"> * characters. All that it means is that the proctitle will be</span></span><br><span class="line"><span class="comment"> * ugly. 
Similarly, we also don&#x27;t care if setproctitle() fails.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line">ret = strnprintf(title, <span class="keyword">sizeof</span>(title), <span class="string">&quot;[lxc monitor] %s %s&quot;</span>, c-&gt;config_path, c-&gt;name);</span><br><span class="line"><span class="keyword">if</span> (ret &gt; <span class="number">0</span>) &#123;</span><br><span class="line">ret = setproctitle(title);</span><br><span class="line"><span class="keyword">if</span> (ret &lt; <span class="number">0</span>)</span><br><span class="line">INFO(<span class="string">&quot;Failed to set process title to %s&quot;</span>, title);</span><br><span class="line"><span class="keyword">else</span></span><br><span class="line">INFO(<span class="string">&quot;Set process title to %s&quot;</span>, title);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">/* We fork() a second time to be reparented to init. 
Like</span></span><br><span class="line"><span class="comment"> * POSIX&#x27;s daemon() function we change to &quot;/&quot; and redirect</span></span><br><span class="line"><span class="comment"> * std&#123;in,out,err&#125; to /dev/null.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line">pid_second = fork();</span><br><span class="line"><span class="keyword">if</span> (pid_second &lt; <span class="number">0</span>) &#123;</span><br><span class="line">SYSERROR(<span class="string">&quot;Failed to fork first child process&quot;</span>);</span><br><span class="line">_exit(EXIT_FAILURE);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">/* second parent */</span></span><br><span class="line"><span class="keyword">if</span> (pid_second != <span class="number">0</span>) &#123;</span><br><span class="line">free_init_cmd(init_cmd);</span><br><span class="line">lxc_put_handler(handler);</span><br><span class="line">_exit(EXIT_SUCCESS);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">/* second child */</span></span><br><span class="line"></span><br><span class="line"><span class="comment">/* change to / directory */</span></span><br><span class="line">ret = chdir(<span class="string">&quot;/&quot;</span>);</span><br><span class="line"><span class="keyword">if</span> (ret &lt; <span class="number">0</span>) &#123;</span><br><span class="line">SYSERROR(<span class="string">&quot;Failed to change to \&quot;/\&quot; directory&quot;</span>);</span><br><span class="line">_exit(EXIT_FAILURE);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">ret = inherit_fds(handler, <span class="literal">true</span>);</span><br><span class="line"><span class="keyword">if</span> (ret &lt; <span class="number">0</span>)</span><br><span 
class="line">_exit(EXIT_FAILURE);</span><br><span class="line"></span><br><span class="line"><span class="comment">/* redirect std&#123;in,out,err&#125; to /dev/null */</span></span><br><span class="line">ret = null_stdfds();</span><br><span class="line"><span class="keyword">if</span> (ret &lt; <span class="number">0</span>) &#123;</span><br><span class="line">ERROR(<span class="string">&quot;Failed to redirect std&#123;in,out,err&#125; to /dev/null&quot;</span>);</span><br><span class="line">_exit(EXIT_FAILURE);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">/* become session leader */</span></span><br><span class="line">ret = setsid();</span><br><span class="line"><span class="keyword">if</span> (ret &lt; <span class="number">0</span>)</span><br><span class="line">TRACE(<span class="string">&quot;Process %d is already process group leader&quot;</span>, lxc_raw_getpid());</span><br><span class="line">&#125;</span><br><span class="line">...</span><br><span class="line"></span><br><span class="line">reboot:</span><br><span class="line">    ...</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (useinit)</span><br><span class="line">ret = lxc_execute(c-&gt;name, argv, <span class="number">1</span>, handler, c-&gt;config_path,</span><br><span class="line">  c-&gt;daemonize, &amp;c-&gt;error_num);</span><br><span class="line"><span class="keyword">else</span></span><br><span class="line">ret = lxc_start(argv, handler, c-&gt;config_path, c-&gt;daemonize,</span><br><span class="line">&amp;c-&gt;error_num);</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (conf-&gt;reboot == REBOOT_REQ) &#123;</span><br><span class="line">INFO(<span class="string">&quot;Container requested reboot&quot;</span>);</span><br><span class="line">conf-&gt;reboot = REBOOT_INIT;</span><br><span class="line"><span class="keyword">goto</span> 
reboot;</span><br><span class="line">&#125;</span><br><span class="line">...</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><h3 id="lxc-start"><a href="#lxc-start" class="headerlink" title="lxc_start"></a><strong>lxc_start</strong></h3><p>The function <code>lxc_start</code> registers a set of startup callbacks, <code>struct lxc_operations</code>, which run <code>busybox</code>'s <code>init</code> process once the container has been initialized. The real initialization work is done by <code>__lxc_start</code>:</p><figure class="highlight c"><table><tr><td class="code"><pre><span class="line"></span><br><span class="line"><span class="type">static</span> <span class="class"><span class="keyword">struct</span> <span class="title">lxc_operations</span> <span class="title">start_ops</span> =</span> &#123;</span><br><span class="line">.start = start,</span><br><span class="line">.post_start = post_start</span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="type">int</span> <span class="title function_">lxc_start</span><span class="params">(<span class="type">char</span> *<span class="type">const</span> argv[], <span class="keyword">struct</span> lxc_handler *handler,</span></span><br><span class="line"><span class="params">      <span class="type">const</span> <span class="type">char</span> *lxcpath, <span 
class="type">bool</span> daemonize, <span class="type">int</span> *error_num)</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">start_args</span> <span class="title">start_arg</span> =</span> &#123;</span><br><span class="line">.argv = argv,</span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line">TRACE(<span class="string">&quot;Doing lxc_start&quot;</span>);</span><br><span class="line"><span class="keyword">return</span> __lxc_start(handler, &amp;start_ops, &amp;start_arg, lxcpath, daemonize, error_num);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>The function <code>__lxc_start</code> is mainly responsible for initializing the container's <code>namespace</code> and <code>cgroup</code> settings, mounting the container's <code>rootfs</code>, and running the container's <code>init</code> process:</p><ul><li><code>lxc_init</code>: initializes the container, for example setting environment variables and watching for the signals raised when a process exits abnormally, <code>SIGBUS, SIGILL, SIGSEGV</code>; it also initializes the functions used for <code>cgroup</code> operations</li><li><code>attach_block_device</code>: based on the root directory of the <code>rootfs</code>, decides whether the container needs to be attached to a block device</li><li><code>monitor_create/monitor_delegate_controllers/monitor_enter</code>: several <code>cgroups</code>-related operations, mainly used to configure the <code>cgroup</code> of the container's monitor process</li><li><code>resolve_clone_flags/lxc_inherit_namespaces</code>: parses the process clone flags from the configuration file and checks which namespaces the container needs to inherit</li><li><code>lxc_rootfs_init</code>: initializes the container's <code>rootfs</code> and pins its location, so that the <code>rootfs</code> cannot be modified while the container is setting up its mounts and cause failures</li><li><code>lxc_spawn</code>: sets up the container's execution environment and creates the container's <code>init</code> process, completing container startup</li><li><code>lxc_poll</code>: monitors the state of the container's <code>init</code> process; if the container exits abnormally, it performs recovery, resource cleanup, and similar operations</li></ul><figure class="highlight c"><table><tr><td class="code"><pre><span class="line"></span><br><span class="line"><span class="type">int</span> __lxc_start(<span class="keyword">struct</span> lxc_handler *handler, <span class="keyword">struct</span> lxc_operations *ops,</span><br><span class="line"><span class="type">void</span> *data, <span class="type">const</span> <span class="type">char</span> *lxcpath, <span class="type">bool</span> daemonize, <span class="type">int</span> *error_num)</span><br><span class="line">&#123;</span><br><span class="line"><span class="type">int</span> ret, status;</span><br><span class="line"><span class="type">const</span> <span class="type">char</span> *name = handler-&gt;name;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">lxc_conf</span> *<span class="title">conf</span> =</span> handler-&gt;conf;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">cgroup_ops</span> *<span 
class="title">cgroup_ops</span>;</span></span><br><span class="line"></span><br><span class="line">ret = lxc_init(name, handler);</span><br><span class="line">...</span><br><span class="line">handler-&gt;ops = ops;</span><br><span class="line">handler-&gt;data = data;</span><br><span class="line">handler-&gt;daemonize = daemonize;</span><br><span class="line">cgroup_ops = handler-&gt;cgroup_ops;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!attach_block_device(handler-&gt;conf)) &#123;</span><br><span class="line">ERROR(<span class="string">&quot;Failed to attach block device&quot;</span>);</span><br><span class="line">ret = <span class="number">-1</span>;</span><br><span class="line"><span class="keyword">goto</span> out_abort;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!cgroup_ops-&gt;monitor_create(cgroup_ops, handler)) &#123;</span><br><span class="line">ERROR(<span class="string">&quot;Failed to create monitor cgroup&quot;</span>);</span><br><span class="line">ret = <span class="number">-1</span>;</span><br><span class="line"><span class="keyword">goto</span> out_abort;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!cgroup_ops-&gt;monitor_delegate_controllers(cgroup_ops)) &#123;</span><br><span class="line">ERROR(<span class="string">&quot;Failed to delegate controllers to monitor cgroup&quot;</span>);</span><br><span class="line">ret = <span class="number">-1</span>;</span><br><span class="line"><span class="keyword">goto</span> out_abort;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!cgroup_ops-&gt;monitor_enter(cgroup_ops, handler)) &#123;</span><br><span class="line">ERROR(<span class="string">&quot;Failed to enter monitor cgroup&quot;</span>);</span><br><span 
class="line">ret = <span class="number">-1</span>;</span><br><span class="line"><span class="keyword">goto</span> out_abort;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">ret = resolve_clone_flags(handler);</span><br><span class="line">...</span><br><span class="line">ret = lxc_inherit_namespaces(handler);</span><br><span class="line">...</span><br><span class="line"></span><br><span class="line"><span class="comment">/* If the rootfs is not a blockdev, prevent the container from marking</span></span><br><span class="line"><span class="comment"> * it readonly.</span></span><br><span class="line"><span class="comment"> * If the container is unprivileged then skip rootfs pinning.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line">ret = lxc_rootfs_init(conf, !list_empty(&amp;conf-&gt;id_map));</span><br><span class="line">...</span><br><span class="line"></span><br><span class="line">ret = lxc_spawn(handler);</span><br><span class="line"><span class="keyword">if</span> (ret &lt; <span class="number">0</span>) &#123;</span><br><span class="line">ERROR(<span class="string">&quot;Failed to spawn container \&quot;%s\&quot;&quot;</span>, name);</span><br><span class="line"><span class="keyword">goto</span> out_detach_blockdev;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">handler-&gt;conf-&gt;reboot = REBOOT_NONE;</span><br><span class="line"></span><br><span class="line">ret = lxc_poll(name, handler);</span><br><span class="line"><span class="keyword">if</span> (ret) &#123;</span><br><span class="line">ERROR(<span class="string">&quot;LXC mainloop exited with error: %d&quot;</span>, ret);</span><br><span class="line"><span class="keyword">goto</span> out_delete_network;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!handler-&gt;init_died &amp;&amp; 
handler-&gt;pid &gt; <span class="number">0</span>) &#123;</span><br><span class="line">ERROR(<span class="string">&quot;Child process is not killed&quot;</span>);</span><br><span class="line">ret = <span class="number">-1</span>;</span><br><span class="line"><span class="keyword">goto</span> out_delete_network;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">status = lxc_wait_for_pid_status(handler-&gt;pid);</span><br><span class="line"><span class="keyword">if</span> (status &lt; <span class="number">0</span>)</span><br><span class="line">SYSERROR(<span class="string">&quot;Failed to retrieve status for %d&quot;</span>, handler-&gt;pid);</span><br><span class="line"></span><br><span class="line"><span class="comment">/* If the child process exited but was not signaled, it didn&#x27;t call</span></span><br><span class="line"><span class="comment"> * reboot. This should mean it was an lxc-execute which simply exited.</span></span><br><span class="line"><span class="comment"> * In any case, treat it as a &#x27;halt&#x27;.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="keyword">if</span> (WIFSIGNALED(status)) &#123;</span><br><span class="line"><span class="type">int</span> signal_nr = WTERMSIG(status);</span><br><span class="line"><span class="keyword">switch</span>(signal_nr) &#123;</span><br><span class="line"><span class="keyword">case</span> SIGINT: <span class="comment">/* halt */</span></span><br><span class="line">DEBUG(<span class="string">&quot;%s(%d) - Container \&quot;%s\&quot; is halting&quot;</span>, signal_name(signal_nr), signal_nr, name);</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line"><span class="keyword">case</span> SIGHUP: <span class="comment">/* reboot */</span></span><br><span class="line">DEBUG(<span class="string">&quot;%s(%d) - Container \&quot;%s\&quot; is rebooting&quot;</span>, 
signal_name(signal_nr), signal_nr, name);</span><br><span class="line">handler-&gt;conf-&gt;reboot = REBOOT_REQ;</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line"><span class="keyword">case</span> SIGSYS: <span class="comment">/* seccomp */</span></span><br><span class="line">DEBUG(<span class="string">&quot;%s(%d) - Container \&quot;%s\&quot; violated its seccomp policy&quot;</span>, signal_name(signal_nr), signal_nr, name);</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line"><span class="keyword">default</span>:</span><br><span class="line">DEBUG(<span class="string">&quot;%s(%d) - Container \&quot;%s\&quot; init exited&quot;</span>, signal_name(signal_nr), signal_nr, name);</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line">...</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><h2 id="lxc-spawn"><a href="#lxc-spawn" class="headerlink" title="lxc_spawn"></a><strong>lxc_spawn</strong></h2><p><code>lxc_spawn</code> is the core function of container startup. It is mainly responsible for creating the container's <code>cgroup</code> and loading <code>/sbin/init</code> to start the container's process <code>1</code>:</p><ul><li><code>lxc_sync_init</code> creates a local socket pair used to synchronize the initialization steps of the parent and child processes</li><li><code>cgroup_ops-&gt;payload_create</code> creates the container's <code>cgroup</code> (corresponding to <code>lxc.payload.busybox-lxc</code>)</li><li>Because we did not configure any namespaces to inherit, <code>lxc_clone3</code> is called to create a new process</li><li>In the child process, <code>do_start</code> is called to finish creating and starting the container; this is where the <code>init</code> process of <code>busybox</code> is started</li><li>The child's initialization has to wait for the parent process <code>lxc-monitor</code> to finish its work; for example, once the <code>cgroup</code> configuration is in place, the parent calls <code>lxc_sync_barrier_child</code> to tell the child to continue initializing</li><li>After the final configuration steps, the container's state is set to <code>RUNNING</code> (at this point <code>busybox</code> may not have finished loading)</li></ul><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span 
class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span 
class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br><span class="line">101</span><br><span class="line">102</span><br><span class="line">103</span><br><span class="line">104</span><br><span class="line">105</span><br><span class="line">106</span><br><span class="line">107</span><br><span class="line">108</span><br><span class="line">109</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="type">static</span> <span class="type">int</span> <span class="title function_">lxc_spawn</span><span class="params">(<span class="keyword">struct</span> lxc_handler *handler)</span></span><br><span class="line">&#123;</span><br><span class="line">...</span><br><span class="line"><span class="keyword">if</span> (!lxc_sync_init(handler))</span><br><span class="line"><span class="keyword">return</span> <span class="number">-1</span>;</span><br><span 
class="line"></span><br><span class="line">ret = socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, <span class="number">0</span>,</span><br><span class="line"> handler-&gt;data_sock);</span><br><span class="line"><span class="keyword">if</span> (ret &lt; <span class="number">0</span>)</span><br><span class="line"><span class="keyword">goto</span> out_sync_fini;</span><br><span class="line">data_sock0 = handler-&gt;data_sock[<span class="number">0</span>];</span><br><span class="line">data_sock1 = handler-&gt;data_sock[<span class="number">1</span>];</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (container_uses_namespace(handler, CLONE_NEWNET)) &#123;</span><br><span class="line">ret = lxc_find_gateway_addresses(handler);</span><br><span class="line"><span class="keyword">if</span> (ret) &#123;</span><br><span class="line">ERROR(<span class="string">&quot;Failed to find gateway addresses&quot;</span>);</span><br><span class="line"><span class="keyword">goto</span> out_sync_fini;</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!cgroup_ops-&gt;payload_create(cgroup_ops, handler)) &#123;</span><br><span class="line">ERROR(<span class="string">&quot;Failed creating cgroups&quot;</span>);</span><br><span class="line"><span class="keyword">goto</span> out_delete_net;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">/* Create a process in a new set of namespaces. 
*/</span></span><br><span class="line"><span class="comment">// If no namespaces need to be inherited, the else branch below is executed</span></span><br><span class="line"><span class="keyword">if</span> (inherits_namespaces(handler)) &#123;</span><br><span class="line">...</span><br><span class="line">&#125; <span class="keyword">else</span> &#123;</span><br><span class="line"><span class="type">int</span> cgroup_fd = -EBADF;</span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">clone_args</span> <span class="title">clone_args</span> =</span> &#123;</span><br><span class="line">.flags = handler-&gt;clone_flags,</span><br><span class="line">.pidfd = ptr_to_u64(&amp;handler-&gt;pidfd),</span><br><span class="line">.exit_signal = SIGCHLD,</span><br><span class="line">&#125;;</span><br><span class="line">...</span><br><span class="line"><span class="comment">/* Try to spawn directly into target cgroup. */</span></span><br><span class="line">handler-&gt;pid = lxc_clone3(&amp;clone_args, CLONE_ARGS_SIZE_VER2);</span><br><span class="line">...</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (handler-&gt;pid == <span class="number">0</span>) &#123;</span><br><span class="line">(<span class="type">void</span>)do_start(handler);</span><br><span class="line">_exit(EXIT_FAILURE);</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">...</span><br><span class="line"><span class="keyword">if</span> (!cgroup_ops-&gt;setup_limits_legacy(cgroup_ops, handler-&gt;conf, <span class="literal">false</span>)) &#123;</span><br><span class="line">ERROR(<span class="string">&quot;Failed to setup cgroup limits for container \&quot;%s\&quot;&quot;</span>, name);</span><br><span class="line"><span class="keyword">goto</span> out_delete_net;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span 
class="line"><span class="keyword">if</span> (!cgroup_ops-&gt;payload_delegate_controllers(cgroup_ops)) &#123;</span><br><span class="line">ERROR(<span class="string">&quot;Failed to delegate controllers to payload cgroup&quot;</span>);</span><br><span class="line"><span class="keyword">goto</span> out_delete_net;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!cgroup_ops-&gt;payload_enter(cgroup_ops, handler)) &#123;</span><br><span class="line">ERROR(<span class="string">&quot;Failed to enter cgroups&quot;</span>);</span><br><span class="line"><span class="keyword">goto</span> out_delete_net;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!cgroup_ops-&gt;setup_limits(cgroup_ops, handler)) &#123;</span><br><span class="line">ERROR(<span class="string">&quot;Failed to setup cgroup limits for container \&quot;%s\&quot;&quot;</span>, name);</span><br><span class="line"><span class="keyword">goto</span> out_delete_net;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!cgroup_ops-&gt;chown(cgroup_ops, handler-&gt;conf))</span><br><span class="line"><span class="keyword">goto</span> out_delete_net;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!lxc_sync_barrier_child(handler, START_SYNC_STARTUP))</span><br><span class="line"><span class="keyword">goto</span> out_delete_net;</span><br><span class="line"></span><br><span class="line">...</span><br><span class="line"></span><br><span class="line">ret = setup_proc_filesystem(conf, handler-&gt;pid);</span><br><span class="line"><span class="keyword">if</span> (ret &lt; <span class="number">0</span>) &#123;</span><br><span class="line">ERROR(<span class="string">&quot;Failed to setup procfs limits&quot;</span>);</span><br><span 
class="line"><span class="keyword">goto</span> out_delete_net;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">ret = setup_resource_limits(conf, handler-&gt;pid);</span><br><span class="line"><span class="keyword">if</span> (ret &lt; <span class="number">0</span>) &#123;</span><br><span class="line">ERROR(<span class="string">&quot;Failed to setup resource limits&quot;</span>);</span><br><span class="line"><span class="keyword">goto</span> out_delete_net;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">/* Tell the child to continue its initialization. */</span></span><br><span class="line"><span class="keyword">if</span> (!lxc_sync_wake_child(handler, START_SYNC_POST_CONFIGURE))</span><br><span class="line"><span class="keyword">goto</span> out_delete_net;</span><br><span class="line"></span><br><span class="line">    ...</span><br><span class="line"></span><br><span class="line">ret = handler-&gt;ops-&gt;post_start(handler, handler-&gt;data);</span><br><span class="line"><span class="keyword">if</span> (ret &lt; <span class="number">0</span>)</span><br><span class="line"><span class="keyword">goto</span> out_abort;</span><br><span class="line"></span><br><span class="line">ret = lxc_set_state(name, handler, RUNNING);</span><br><span class="line">...</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line">...</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><h3 id="do-start"><a href="#do-start" class="headerlink" title="do_start"></a><strong>do_start</strong></h3><p>Reaching <code>do_start</code> is the last key step of container startup: it starts a new <code>init</code> process and initializes the execution environment of the <code>busybox</code> minimal system. The key steps are:</p><ul><li><code>lxc_sync_wake_parent</code>:
notifies the parent process, via <code>START_SYNC_CONFIGURE</code>, that it can begin configuring the container and waits for it to finish; the child then carries on with container initialization, for example calling <code>unshare</code> to create a new <code>cgroup</code> namespace</li><li><code>lxc_setup</code>: initializes the container configuration, for example mounting the container's <code>rootfs</code>, setting the container's hostname, and setting the state of the container's network devices</li><li>Performs the remaining initialization, such as redirecting standard input and output to <code>/dev/null</code>, calling <code>setsid</code> to start a new session, and calling <code>lxc_set_environment</code> to set the container's environment variables</li><li><code>handler-&gt;ops-&gt;start</code> corresponds to the <code>start</code> function (see <code>lxc_start</code>), which actually calls <code>execvp</code> to run the container's executable <code>/sbin/init</code></li></ul><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span 
class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br><span class="line">101</span><br><span class="line">102</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="type">static</span> <span class="type">int</span> 
<span class="title function_">do_start</span><span class="params">(<span class="type">void</span> *data)</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">lxc_handler</span> *<span class="title">handler</span> =</span> data;</span><br><span class="line">__lxc_unused __do_close <span class="type">int</span> data_sock0 = handler-&gt;data_sock[<span class="number">0</span>],</span><br><span class="line">    data_sock1 = handler-&gt;data_sock[<span class="number">1</span>];</span><br><span class="line">__do_close <span class="type">int</span> devnull_fd = -EBADF, status_fd = -EBADF;</span><br><span class="line">...</span><br><span class="line"></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * Tell the parent task it can begin to configure the container and wait</span></span><br><span class="line"><span class="comment"> * for it to finish.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="keyword">if</span> (!lxc_sync_wake_parent(handler, START_SYNC_CONFIGURE))</span><br><span class="line"><span class="keyword">goto</span> out_error;</span><br><span class="line"></span><br><span class="line"><span class="comment">/* Unshare cgroup namespace after we have setup our cgroups. If we do it</span></span><br><span class="line"><span class="comment"> * earlier we end up with a wrong view of /proc/self/cgroup. 
For</span></span><br><span class="line"><span class="comment"> * example, assume we unshare(CLONE_NEWCGROUP) first, and then create</span></span><br><span class="line"><span class="comment"> * the cgroup for the container, say /sys/fs/cgroup/cpuset/lxc/c, then</span></span><br><span class="line"><span class="comment"> * /proc/self/cgroup would show us:</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> *8:cpuset:/lxc/c</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> * whereas it should actually show</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> *8:cpuset:/</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="keyword">if</span> (handler-&gt;ns_unshare_flags &amp; CLONE_NEWCGROUP) &#123;</span><br><span class="line">ret = unshare(CLONE_NEWCGROUP);</span><br><span class="line"><span class="keyword">if</span> (ret &lt; <span class="number">0</span>) &#123;</span><br><span class="line"><span class="keyword">if</span> (errno != EINVAL) &#123;</span><br><span class="line">SYSERROR(<span class="string">&quot;Failed to unshare CLONE_NEWCGROUP&quot;</span>);</span><br><span class="line"><span class="keyword">goto</span> out_warn_father;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">handler-&gt;ns_clone_flags &amp;= ~CLONE_NEWCGROUP;</span><br><span class="line">SYSINFO(<span class="string">&quot;Kernel does not support CLONE_NEWCGROUP&quot;</span>);</span><br><span class="line">&#125; <span class="keyword">else</span> &#123;</span><br><span class="line">INFO(<span class="string">&quot;Unshared CLONE_NEWCGROUP&quot;</span>);</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line">...</span><br><span 
class="line"></span><br><span class="line"><span class="comment">/* Setup the container, ip, names, utsname, ... */</span></span><br><span class="line">ret = lxc_setup(handler);</span><br><span class="line"><span class="keyword">if</span> (ret &lt; <span class="number">0</span>) &#123;</span><br><span class="line">ERROR(<span class="string">&quot;Failed to setup container \&quot;%s\&quot;&quot;</span>, handler-&gt;name);</span><br><span class="line"><span class="keyword">goto</span> out_warn_father;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">...</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (handler-&gt;conf-&gt;console.pty &lt; <span class="number">0</span> &amp;&amp; handler-&gt;daemonize) &#123;</span><br><span class="line"><span class="keyword">if</span> (devnull_fd &lt; <span class="number">0</span>) &#123;</span><br><span class="line">devnull_fd = open_devnull();</span><br><span class="line"><span class="keyword">if</span> (devnull_fd &lt; <span class="number">0</span>)</span><br><span class="line"><span class="keyword">goto</span> out_warn_father;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">ret = set_stdfds(devnull_fd);</span><br><span class="line"><span class="keyword">if</span> (ret &lt; <span class="number">0</span>) &#123;</span><br><span class="line">ERROR(<span class="string">&quot;Failed to redirect std&#123;in,out,err&#125; to \&quot;/dev/null\&quot;&quot;</span>);</span><br><span class="line"><span class="keyword">goto</span> out_warn_father;</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">...</span><br><span class="line"></span><br><span class="line">setsid();</span><br><span class="line">...</span><br><span class="line"></span><br><span class="line"><span class="comment">/* Reset the environment variables the user 
requested in a clear</span></span><br><span class="line"><span class="comment"> * environment.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line">ret = clearenv();</span><br><span class="line"><span class="comment">/* Don&#x27;t error out though. */</span></span><br><span class="line"><span class="keyword">if</span> (ret &lt; <span class="number">0</span>)</span><br><span class="line">SYSERROR(<span class="string">&quot;Failed to clear environment.&quot;</span>);</span><br><span class="line"></span><br><span class="line">ret = lxc_set_environment(handler-&gt;conf);</span><br><span class="line"><span class="keyword">if</span> (ret &lt; <span class="number">0</span>)</span><br><span class="line"><span class="keyword">goto</span> out_warn_father;</span><br><span class="line"></span><br><span class="line">ret = putenv(<span class="string">&quot;container=lxc&quot;</span>);</span><br><span class="line"><span class="keyword">if</span> (ret &lt; <span class="number">0</span>) &#123;</span><br><span class="line">SYSERROR(<span class="string">&quot;Failed to set environment variable: container=lxc&quot;</span>);</span><br><span class="line"><span class="keyword">goto</span> out_warn_father;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">...</span><br><span class="line"></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * After this call, we are in error because this ops should not return</span></span><br><span class="line"><span class="comment"> * as it execs.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line">handler-&gt;ops-&gt;start(handler, handler-&gt;data);</span><br><span class="line">...</span><br><span class="line"><span class="keyword">return</span> <span class="number">-1</span>;</span><br><span class="line">&#125;</span><br><span 
class="line"></span><br></pre></td></tr></table></figure><h2 id="为什么有两个init进程"><a href="#为什么有两个init进程" class="headerlink" title="为什么有两个init进程"></a><strong>Why Are There Two init Processes?</strong></h2><p>Once the container has started, <code>pstree</code> shows several new processes on the host. <code>lxc-start(83578)</code> is the container's monitor process, and <code>init(83579)</code> is the <code>init</code> process of the minimal <code>busybox</code> system, that is, process 1 of the container. It runs the initialization scripts that bring the container system up.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">~$ sudo pstree -p |grep lxc</span><br><span class="line">           |-lxc-monitord(762)</span><br><span class="line">           |-lxcfs(670)-+-&#123;lxcfs&#125;(680)</span><br><span class="line">           |            `-&#123;lxcfs&#125;(681)</span><br><span class="line">           |               |-lxc-start(83578)---init(83579)-+-getty(83668)</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>In the container's <code>rootfs</code> directory, <code>/var/lib/lxc/busybox-lxc/rootfs</code>, the initialization script <code>/etc/inittab</code> shows that the system performs three initialization actions:</p><ul><li><code>sysinit</code> runs the system initialization script <code>/etc/init.d/rcS</code>, which starts <code>syslogd</code>, mounts every file system listed in <code>fstab</code>, and starts the <code>DHCP</code> client (<code>udhcpc</code>)</li><li><code>respawn</code> restarts the process whenever it exits; here it launches <code>/bin/getty</code> to handle logins on <code>tty1</code></li><li><code>askfirst</code> behaves like <code>respawn</code> but waits for the user to press Enter before running the command; here it creates an <code>sh</code> process (<code>/bin/sh</code> is actually a symlink to <code>busybox</code>)</li></ul><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span 
class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"></span><br><span class="line">::sysinit:/etc/init.d/rcS</span><br><span class="line">tty1::respawn:/bin/getty -L tty1 115200 vt100</span><br><span class="line">console::askfirst:/bin/sh</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="comment">## /etc/init.d/rcS</span></span><br><span class="line"><span class="comment">#!/bin/sh</span></span><br><span class="line">/bin/syslogd</span><br><span class="line">/bin/mount -a</span><br><span class="line">/bin/udhcpc</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>The previous article noted that after <code>attach</code>ing to the container we can see two <code>init</code> processes. The one with <code>pid</code> <code>1</code> is the container process started by <code>lxc-start</code>, while the one with <code>pid</code> <code>16</code> is the process created for the <code>console::askfirst:/bin/sh</code> entry. For an <code>askfirst</code> entry, <code>busybox</code> <code>init</code> forks a child that waits for Enter on the console before it executes <code>/bin/sh</code>, so until then the child still appears under the name <code>init</code>. If we comment out <code>console::askfirst:/bin/sh</code> in <code>/var/lib/lxc/busybox-lxc/rootfs/etc/inittab</code> and check with <code>top</code> inside the container again, the <code>init</code> process with <code>pid</code> <code>16</code> is gone.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">Mem: 40201368K used, 549680K free, 1329880K shrd, 738404K buff, 19484428K cached</span><br><span class="line">CPU:   5% usr   0% sys   0% nic  94% idle   0% 
io   0% irq   0% sirq</span><br><span class="line">Load average: 0.91 1.13 1.47 2/2955 21</span><br><span class="line">  PID  PPID USER     STAT   VSZ %VSZ %CPU COMMAND</span><br><span class="line">    1     0 root     S     2456   0%   0% init</span><br><span class="line">    4     1 root     S     2456   0%   0% /bin/syslogd</span><br><span class="line">   14     1 root     S     2456   0%   0% /bin/udhcpc</span><br><span class="line">   15     1 root     S     2456   0%   0% /bin/getty -L tty1 115200 vt100</span><br><span class="line">   16     1 root     S     2456   0%   0% init</span><br><span class="line">   17     0 root     S     2456   0%   0% /bin/sh</span><br><span class="line">   21    17 root     R     2456   0%   0% top</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a><strong>References</strong></h2><ul><li><a href="https://github.com/lxc/lxc">https://github.com/lxc/lxc</a></li></ul>]]></content>
    
    
    <summary type="html">&lt;p&gt;The previous article, &lt;a href=&quot;https://sniffer.site/2025/06/06/%E6%B7%B1%E5%85%A5linux%E5%AE%B9%E5%99%A8lxc%E4%B9%8B%E4%B8%80-lxc%E7%9A%84%E5%AE%9E%E7%8E%B0%E5%8E%9F%E7%90%86/&quot;&gt;Deep Dive into Linux Containers (LXC), Part 1: How LXC Works&lt;/a&gt;, covered how &lt;code&gt;LXC&lt;/code&gt; containers work and gave us a basic understanding of the principles behind them. One question was left open, though: after a container starts, the container system contains two processes named &lt;code&gt;init&lt;/code&gt;. Why are there two &lt;code&gt;init&lt;/code&gt; processes? Answering that requires a closer look at the &lt;code&gt;LXC&lt;/code&gt; source code. This article is built around that question and falls into two main parts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How &lt;code&gt;lxc-create&lt;/code&gt; is implemented: how an &lt;code&gt;LXC&lt;/code&gt; container is created&lt;/li&gt;
&lt;li&gt;How &lt;code&gt;lxc-start&lt;/code&gt; is implemented: how an &lt;code&gt;LXC&lt;/code&gt; container is started&lt;/li&gt;
&lt;/ul&gt;</summary>
    
    
    
    <category term="Linux" scheme="https://sniffer.site/categories/Linux/"/>
    
    
    <category term="虚拟化" scheme="https://sniffer.site/tags/%E8%99%9A%E6%8B%9F%E5%8C%96/"/>
    
    <category term="LXC" scheme="https://sniffer.site/tags/LXC/"/>
    
    <category term="容器" scheme="https://sniffer.site/tags/%E5%AE%B9%E5%99%A8/"/>
    
  </entry>
  
  <entry>
    <title>Deep Dive into Linux Containers (LXC), Part 1: How LXC Works</title>
    <link href="https://sniffer.site/2025/06/06/%E6%B7%B1%E5%85%A5Linux%E5%AE%B9%E5%99%A8LXC%E4%B9%8B%E4%B8%80-LXC%E7%9A%84%E5%AE%9E%E7%8E%B0%E5%8E%9F%E7%90%86/"/>
    <id>https://sniffer.site/2025/06/06/%E6%B7%B1%E5%85%A5Linux%E5%AE%B9%E5%99%A8LXC%E4%B9%8B%E4%B8%80-LXC%E7%9A%84%E5%AE%9E%E7%8E%B0%E5%8E%9F%E7%90%86/</id>
    <published>2025-06-06T01:30:52.000Z</published>
    <updated>2025-07-18T09:20:12.843Z</updated>
    
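    <content type="html"><![CDATA[<p>Containers are a technology for creating lightweight, <code>virtual</code> execution environments for applications. With containers we can easily build multiple isolated, virtual runtime environments on a single operating system. Unlike hardware-level isolation based on virtualization (a <code>hypervisor</code>), containers rely on namespaces (<code>Namespace</code>) and control groups (<code>Cgroups</code>) in the <code>Linux</code> kernel to isolate and manage process-level resources such as CPU, memory, I/O, and networking. The common container solutions today are <a href="https://linuxcontainers.org/"><code>Linux Containers(LXC)</code></a> and <a href="https://en.wikipedia.org/wiki/Docker_(software)"><code>Docker</code></a>. <code>LXC</code> can run a single process or boot a system image (a complete system environment including a <code>rootfs</code>), while <code>Docker</code> is generally used to package and run applications in cloud computing.</p><p>This series on Linux containers is planned in two parts. The first part covers the basic principles behind <code>LXC</code> containers and how to create your own container on <code>Ubuntu</code>; the second analyzes how <code>LXC</code> is implemented from the source code. This article focuses on how <code>LXC</code> works, in two steps:</p><ul><li>Introduce the basic principles of <code>LXC</code> through its two underlying concepts, <code>namespace</code> and <code>cgroups</code></li><li>Build and start a complete <code>LXC</code> container on <code>Ubuntu</code></li></ul><span id="more"></span><blockquote><p>This article is based on the kernel 5.10 source code</p></blockquote><h2 id="LXC的实现原理"><a href="#LXC的实现原理" class="headerlink" title="LXC的实现原理"></a><strong>How LXC Works</strong></h2><p><code>LXC</code> is an operating-system-level isolation solution. Containers use <code>namespace</code> and <code>cgroups</code> to isolate and control resources, while <code>SELinux</code> enforces security isolation and access control between the host and its containers, and between one container and another. A container runtime sits between the containers and the kernel and manages the creation, startup, and destruction of containers in one place. All containers share a single kernel and all of the hardware devices, which is what sets this approach apart from virtualization solutions such as <code>XEN</code> and <code>QNX</code>:</p><p><img src="https://md-files.oss-cn-shenzhen.aliyuncs.com/lxc-architecture.png" alt="LXC-architecture"></p><p>The sections below look at the two building blocks of <code>LXC</code>: <code>namespace</code> and <code>cgroups</code>.</p><h3 id="namespace-命名空间"><a href="#namespace-命名空间" class="headerlink" title="namespace(命名空间)"></a><strong>namespace</strong></h3><p>The kernel's <a 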
href="https://man7.org/linux/man-pages/man7/namespaces.7.html"><code>namespace</code></a> mechanism isolates the resources of different processes. Every process is associated with a set of namespaces when it is created, and those namespaces scope the system resources it can see, such as networking, IPC (inter-process communication), and PIDs. Resources of processes in different namespaces are isolated from each other and cannot be seen or accessed from the other side. This is somewhat similar to namespaces in programming languages: in both cases the point is to separate resources so that they do not interfere with each other.</p><p>The kernel currently provides 8 types of namespaces:</p><table><thead><tr><th>Namespace</th><th>Flag</th><th>Isolates</th></tr></thead><tbody><tr><td>Cgroup</td><td>CLONE_NEWCGROUP</td><td>cgroup root directory</td></tr><tr><td>IPC</td><td>CLONE_NEWIPC</td><td>System V IPC, POSIX message queues</td></tr><tr><td>Network</td><td>CLONE_NEWNET</td><td>Network devices, protocol stacks, ports</td></tr><tr><td>Mount</td><td>CLONE_NEWNS</td><td>File system mount points</td></tr><tr><td>PID</td><td>CLONE_NEWPID</td><td>Process IDs</td></tr><tr><td>Time</td><td>CLONE_NEWTIME</td><td>Boot-time and monotonic clocks</td></tr><tr><td>User</td><td>CLONE_NEWUSER</td><td>UIDs and GIDs</td></tr><tr><td>UTS</td><td>CLONE_NEWUTS</td><td>Hostname and NIS domain name</td></tr></tbody></table><p>In the kernel source, the namespaces of a process are wrapped in a single structure, <code>struct nsproxy</code>, and the process descriptor <code>struct task_struct</code> holds a pointer to it that identifies the namespaces the process belongs to:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span 
class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span 
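class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">task_struct</span> &#123;</span></span><br><span class="line"><span class="meta">#<span class="keyword">ifdef</span> CONFIG_THREAD_INFO_IN_TASK</span></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * For reasons of header soup (see current_thread_info()), this</span></span><br><span class="line"><span class="comment"> * must be the first element of task_struct.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">thread_info</span> <span class="title">thread_info</span>;</span></span><br><span class="line"><span class="meta">#<span class="keyword">endif</span></span></span><br><span class="line"><span class="comment">/* -1 unrunnable, 0 runnable, &gt;0 stopped: */</span></span><br><span class="line"><span class="keyword">volatile</span> <span class="type">long</span> state;</span><br><span class="line"></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * This begins the randomizable portion of task_struct. 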
Only</span></span><br><span class="line"><span class="comment"> * scheduling-critical items should be added above here.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line">randomized_struct_fields_start</span><br><span class="line"></span><br><span class="line"><span class="type">void</span> *<span class="built_in">stack</span>;</span><br><span class="line"><span class="type">refcount_t</span> usage;</span><br><span class="line"><span class="comment">/* Per task flags (PF_*), defined further below: */</span></span><br><span class="line"><span class="type">unsigned</span> <span class="type">int</span> flags;</span><br><span class="line"><span class="type">unsigned</span> <span class="type">int</span> ptrace;</span><br><span class="line">...</span><br><span class="line"><span class="type">int</span> on_rq;</span><br><span class="line"></span><br><span class="line"><span class="type">int</span> prio;</span><br><span class="line"><span class="type">int</span> static_prio;</span><br><span class="line"><span class="type">int</span> normal_prio;</span><br><span class="line"><span class="type">unsigned</span> <span class="type">int</span> rt_priority;</span><br><span class="line"></span><br><span class="line"><span class="type">const</span> <span class="class"><span class="keyword">struct</span> <span class="title">sched_class</span> *<span class="title">sched_class</span>;</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">sched_entity</span> <span class="title">se</span>;</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">sched_rt_entity</span> <span class="title">rt</span>;</span></span><br><span class="line"><span class="meta">#<span class="keyword">ifdef</span> CONFIG_CGROUP_SCHED</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">task_group</span> *<span 
class="title">sched_task_group</span>;</span></span><br><span class="line"><span class="meta">#<span class="keyword">endif</span></span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">sched_dl_entity</span> <span class="title">dl</span>;</span></span><br><span class="line"></span><br><span class="line">...</span><br><span class="line"></span><br><span class="line"><span class="type">unsigned</span> <span class="type">int</span> policy;</span><br><span class="line"><span class="type">int</span> nr_cpus_allowed;</span><br><span class="line"><span class="type">const</span> <span class="type">cpumask_t</span> *cpus_ptr;</span><br><span class="line"><span class="type">cpumask_t</span> cpus_mask;</span><br><span class="line">...</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">sched_info</span> <span class="title">sched_info</span>;</span></span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">list_head</span> <span class="title">tasks</span>;</span></span><br><span class="line">...</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">mm_struct</span> *<span class="title">mm</span>;</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">mm_struct</span> *<span class="title">active_mm</span>;</span></span><br><span class="line">...</span><br><span class="line"><span class="type">int</span> exit_state;</span><br><span class="line"><span class="type">int</span> exit_code;</span><br><span class="line"><span class="type">int</span> exit_signal;</span><br><span class="line"><span class="comment">/* The signal sent when the parent dies: */</span></span><br><span class="line"><span class="type">int</span> pdeath_signal;</span><br><span class="line"><span class="comment">/* JOBCTL_*, siglock 
protected: */</span></span><br><span class="line"><span class="type">unsigned</span> <span class="type">long</span> jobctl;</span><br><span class="line"></span><br><span class="line"><span class="comment">/* Used for emulating ABI behavior of previous Linux versions: */</span></span><br><span class="line"><span class="type">unsigned</span> <span class="type">int</span> personality;</span><br><span class="line"></span><br><span class="line"><span class="comment">/* Scheduler bits, serialized by scheduler locks: */</span></span><br><span class="line"><span class="type">unsigned</span> sched_reset_on_fork:<span class="number">1</span>;</span><br><span class="line"><span class="type">unsigned</span> sched_contributes_to_load:<span class="number">1</span>;</span><br><span class="line"><span class="type">unsigned</span> sched_migrated:<span class="number">1</span>;</span><br><span class="line"><span class="meta">#<span class="keyword">ifdef</span> CONFIG_PSI</span></span><br><span class="line"><span class="type">unsigned</span> sched_psi_wake_requeue:<span class="number">1</span>;</span><br><span class="line"><span class="meta">#<span class="keyword">endif</span></span></span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="comment">/* Bit to tell LSMs we&#x27;re in execve(): */</span></span><br><span class="line"><span class="type">unsigned</span> in_execve:<span class="number">1</span>;</span><br><span class="line"><span class="type">unsigned</span> in_iowait:<span class="number">1</span>;</span><br><span class="line">...</span><br><span class="line"><span class="meta">#<span class="keyword">ifdef</span> CONFIG_CGROUPS</span></span><br><span class="line"><span class="comment">/* disallow userland-initiated cgroup migration */</span></span><br><span class="line"><span class="type">unsigned</span> no_cgroup_migration:<span class="number">1</span>;</span><br><span class="line"><span class="comment">/* task is frozen/stopped (used by 
the cgroup freezer) */</span></span><br><span class="line"><span class="type">unsigned</span> frozen:<span class="number">1</span>;</span><br><span class="line"><span class="meta">#<span class="keyword">endif</span></span></span><br><span class="line"><span class="meta">#<span class="keyword">ifdef</span> CONFIG_BLK_CGROUP</span></span><br><span class="line"><span class="type">unsigned</span> use_memdelay:<span class="number">1</span>;</span><br><span class="line"><span class="meta">#<span class="keyword">endif</span></span></span><br><span class="line"><span class="meta">#<span class="keyword">ifdef</span> CONFIG_PSI</span></span><br><span class="line"><span class="comment">/* Stalled due to lack of memory */</span></span><br><span class="line"><span class="type">unsigned</span> in_memstall:<span class="number">1</span>;</span><br><span class="line"><span class="meta">#<span class="keyword">endif</span></span></span><br><span class="line">...</span><br><span class="line"><span class="type">pid_t</span> pid;</span><br><span class="line"><span class="type">pid_t</span> tgid;</span><br><span class="line">...</span><br><span class="line"><span class="comment">/* Namespaces: */</span>  <span class="comment">/* &lt;-- namespaces */</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">nsproxy</span> *<span class="title">nsproxy</span>;</span></span><br><span class="line">    ...</span><br><span class="line">&#125;;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>As the definition shows, <code>struct nsproxy</code> is essentially a collection of the namespace types listed above: it holds pointers to the individual namespaces, plus a <code>count</code> field that records how many processes share this <code>nsproxy</code>.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span 
class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">nsproxy</span> &#123;</span></span><br><span class="line"><span class="type">atomic_t</span> count;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">uts_namespace</span> *<span class="title">uts_ns</span>;</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">ipc_namespace</span> *<span class="title">ipc_ns</span>;</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">mnt_namespace</span> *<span class="title">mnt_ns</span>;</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">pid_namespace</span> *<span class="title">pid_ns_for_children</span>;</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">net</span>      *<span class="title">net_ns</span>;</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">time_namespace</span> *<span class="title">time_ns</span>;</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">time_namespace</span> *<span class="title">time_ns_for_children</span>;</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">cgroup_namespace</span> *<span class="title">cgroup_ns</span>;</span></span><br><span class="line">&#125;;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>The <code>proc</code> file system exposes the namespaces of each process under <code>/proc/[pid]/ns/</code>. Before kernel 3.8 these entries were hard links (<code>hard link</code>); since 3.8 they are symbolic links (<code>symbolic 
link</code>) whose target is a string made up of the namespace type and the corresponding <code>inode</code> number.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">~<span class="comment"># ls -al /proc/1/ns/</span></span><br><span class="line">total 0</span><br><span class="line">dr-x--x--x 2 root root 0  6月  5 09:33 .</span><br><span class="line">dr-xr-xr-x 9 root root 0  6月  5 09:33 ..</span><br><span class="line">lrwxrwxrwx 1 root root 0  6月  5 09:33 cgroup -&gt; <span class="string">&#x27;cgroup:[4026531835]&#x27;</span></span><br><span class="line">lrwxrwxrwx 1 root root 0  6月  5 15:41 ipc -&gt; <span class="string">&#x27;ipc:[4026531839]&#x27;</span></span><br><span class="line">lrwxrwxrwx 1 root root 0  6月  5 09:33 mnt -&gt; <span class="string">&#x27;mnt:[4026531841]&#x27;</span></span><br><span class="line">lrwxrwxrwx 1 root root 0  6月  5 15:41 net -&gt; <span class="string">&#x27;net:[4026531840]&#x27;</span></span><br><span class="line">lrwxrwxrwx 1 root root 0  6月  5 15:41 pid -&gt; <span class="string">&#x27;pid:[4026531836]&#x27;</span></span><br><span class="line">lrwxrwxrwx 1 root root 0  6月  5 15:41 pid_for_children -&gt; <span class="string">&#x27;pid:[4026531836]&#x27;</span></span><br><span class="line">lrwxrwxrwx 1 root root 0  6月  5 15:41 time -&gt; <span class="string">&#x27;time:[4026531834]&#x27;</span></span><br><span class="line">lrwxrwxrwx 1 root root 0  6月  5 15:41 time_for_children -&gt; <span 
class="string">&#x27;time:[4026531834]&#x27;</span></span><br><span class="line">lrwxrwxrwx 1 root root 0  6月  5 15:41 user -&gt; <span class="string">&#x27;user:[4026531837]&#x27;</span></span><br><span class="line">lrwxrwxrwx 1 root root 0  6月  5 15:41 uts -&gt; <span class="string">&#x27;uts:[4026531838]&#x27;</span></span><br><span class="line"></span><br></pre></td></tr></table></figure><h2 id="cgroups-控制分组"><a href="#cgroups-控制分组" class="headerlink" title="cgroups(控制分组)"></a><strong>cgroups (control groups)</strong></h2><p><code>cgroups(control groups)</code> is a mechanism for allocating and limiting the resources of tasks. For example, the <code>cpuset</code> controller can assign specific CPUs to a group, and the <code>memory</code> controller can cap the memory used by a set of processes. Like the task hierarchy in Linux, <code>cgroups</code> are organized as a tree, and a child process automatically inherits the <code>cgroups</code> of its parent. The difference is that <code>cgroups</code> has several subsystems at once, and in <code>cgroup1</code> each subsystem can have its own independent hierarchy. The common <code>cgroups</code> subsystems in the <code>Linux</code> kernel are listed below (see <code>include/linux/cgroup_subsys.h</code> for the full set of definitions):</p><ul><li><code>cpu</code>: gives the scheduler the parameters that limit the CPU usage of processes</li><li><code>cpuacct</code>: accounts for the CPU usage of the processes in a <code>cgroup</code></li><li><code>cpuset</code>: assigns dedicated CPU nodes or memory nodes to the processes in a <code>cgroup</code></li><li><code>memory</code>: limits the amount of memory processes may use</li><li><code>blkio</code>: limits the block device I&#x2F;O requests of processes</li><li><code>devices</code>: controls which devices the processes in a <code>cgroup</code> may access</li><li><code>net_cls</code>: tags the network packets of the processes in a <code>cgroup</code> so that <code>tc(traffic control)</code> can act on them</li><li><code>net_prio</code>: dynamically sets the priority of traffic on a given network interface</li><li><code>freezer</code>: suspends or resumes the processes in a <code>cgroup</code></li><li><code>ns</code>: lets processes in different <code>cgroups</code> use different <code>namespace</code>s (a legacy subsystem that was removed from mainline kernels well before 5.10)</li><li><code>perf_event</code>: profiles the performance of the processes in a <code>cgroup</code></li><li><code>pids</code>: limits the number of processes in a <code>cgroup</code></li><li><code>hugetlb</code>: limits the huge page usage of the processes in a <code>cgroup</code></li><li><code>rdma</code>: limits the use of <code>RDMA(Remote Direct Memory Access)</code> resources in a <code>cgroup</code></li></ul><p>To implement <code>cgroups</code>, the kernel adds a pointer to a <code>struct 
css_set</code> to every task structure. A <code>css_set</code> holds an array of pointers to reference-counted <code>cgroup_subsys_state</code> objects, one for each <code>cgroup</code> subsystem registered in the system. This saves space, because a <code>task_struct</code> stores a single <code>css_set</code> pointer instead of one pointer per subsystem, and it keeps process creation and exit cheap, because only one <code>css_set</code> has to be updated rather than the state of every subsystem:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br></pre></td><td 
class="code"><pre><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">css_set</span> &#123;</span></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * Set of subsystem states, one for each subsystem. This array is</span></span><br><span class="line"><span class="comment"> * immutable after creation apart from the init_css_set during</span></span><br><span class="line"><span class="comment"> * subsystem registration (at boot time).</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">cgroup_subsys_state</span> *<span class="title">subsys</span>[<span class="title">CGROUP_SUBSYS_COUNT</span>];</span></span><br><span class="line"></span><br><span class="line"><span class="comment">/* reference count */</span></span><br><span class="line"><span class="type">refcount_t</span> refcount;</span><br><span class="line"></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * For a domain cgroup, the following points to self.  If threaded,</span></span><br><span class="line"><span class="comment"> * to the matching cset of the nearest domain ancestor.  
The</span></span><br><span class="line"><span class="comment"> * dom_cset provides access to the domain cgroup and its csses to</span></span><br><span class="line"><span class="comment"> * which domain level resource consumptions should be charged.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">css_set</span> *<span class="title">dom_cset</span>;</span></span><br><span class="line"></span><br><span class="line"><span class="comment">/* the default cgroup associated with this css_set */</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">cgroup</span> *<span class="title">dfl_cgrp</span>;</span></span><br><span class="line"></span><br><span class="line"><span class="comment">/* internal task count, protected by css_set_lock */</span></span><br><span class="line"><span class="type">int</span> nr_tasks;</span><br><span class="line"></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * Lists running through all tasks using this cgroup group.</span></span><br><span class="line"><span class="comment"> * mg_tasks lists tasks which belong to this cset but are in the</span></span><br><span class="line"><span class="comment"> * process of being migrated out or in.  
Protected by</span></span><br><span class="line"><span class="comment"> * css_set_rwsem, but, during migration, once tasks are moved to</span></span><br><span class="line"><span class="comment"> * mg_tasks, it can be read safely while holding cgroup_mutex.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">list_head</span> <span class="title">tasks</span>;</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">list_head</span> <span class="title">mg_tasks</span>;</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">list_head</span> <span class="title">dying_tasks</span>;</span></span><br><span class="line"></span><br><span class="line"><span class="comment">/* all css_task_iters currently walking this cset */</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">list_head</span> <span class="title">task_iters</span>;</span></span><br><span class="line"></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * On the default hierarhcy, -&gt;subsys[ssid] may point to a css</span></span><br><span class="line"><span class="comment"> * attached to an ancestor instead of the cgroup this css_set is</span></span><br><span class="line"><span class="comment"> * associated with.  
The following node is anchored at</span></span><br><span class="line"><span class="comment"> * -&gt;subsys[ssid]-&gt;cgroup-&gt;e_csets[ssid] and provides a way to</span></span><br><span class="line"><span class="comment"> * iterate through all css&#x27;s attached to a given cgroup.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">list_head</span> <span class="title">e_cset_node</span>[<span class="title">CGROUP_SUBSYS_COUNT</span>];</span></span><br><span class="line">...</span><br><span class="line">&#125;;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>The kernel treats <code>cgroup</code> as a special filesystem, so to browse and manage <code>cgroup</code>s a user first mounts the <code>cgroup</code> filesystem and then manages the whole <code>cgroup</code> hierarchy with ordinary file operations. The kernel currently supports <a href="https://man7.org/linux/man-pages/man7/cgroups.7.html">two versions, <code>cgroup1</code> and <code>cgroup2</code></a>, and each one must be mounted with its own arguments:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment">#cgroup1 </span></span><br><span class="line">mount -t cgroup -o all cgroup /sys/fs/cgroup</span><br><span class="line"></span><br><span class="line"><span class="comment">#cgroup2</span></span><br><span class="line">mount -t cgroup2 none /sys/fs/cgroup</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>During boot the kernel calls <code>cgroup_init</code> to initialize the whole <code>cgroup</code> subsystem and registers two special filesystems, <code>cgroup/cgroup2</code>, so that users can operate on <code>cgroup</code>s much as they would on regular files (in the Linux kernel, almost everything really can be a file). See <code>kernel/cgroup/cgroup.c</code> in the kernel source for details.</p><figure class="highlight c"><table><tr><td 
class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">file_system_type</span> <span class="title">cgroup_fs_type</span> =</span> &#123;</span><br><span class="line">.name= <span class="string">&quot;cgroup&quot;</span>,</span><br><span class="line">.init_fs_context= cgroup_init_fs_context,</span><br><span class="line">.parameters= cgroup1_fs_parameters,</span><br><span class="line">.kill_sb= cgroup_kill_sb,</span><br><span 
class="line">.fs_flags= FS_USERNS_MOUNT,</span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line"><span class="type">static</span> <span class="class"><span class="keyword">struct</span> <span class="title">file_system_type</span> <span class="title">cgroup2_fs_type</span> =</span> &#123;</span><br><span class="line">.name= <span class="string">&quot;cgroup2&quot;</span>,</span><br><span class="line">.init_fs_context= cgroup_init_fs_context,</span><br><span class="line">.parameters= cgroup2_fs_parameters,</span><br><span class="line">.kill_sb= cgroup_kill_sb,</span><br><span class="line">.fs_flags= FS_USERNS_MOUNT,</span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line"><span class="comment">/**</span></span><br><span class="line"><span class="comment"> * cgroup_init - cgroup initialization</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> * Register cgroup filesystem and /proc file, and initialize</span></span><br><span class="line"><span class="comment"> * any subsystems that didn&#x27;t request early init.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="type">int</span> __init <span class="title function_">cgroup_init</span><span class="params">(<span class="type">void</span>)</span></span><br><span class="line">&#123;</span><br><span class="line">...</span><br><span class="line"></span><br><span class="line"><span class="comment">/* init_css_set.subsys[] has been updated, re-hash */</span></span><br><span class="line">hash_del(&amp;init_css_set.hlist);</span><br><span class="line">hash_add(css_set_table, &amp;init_css_set.hlist,</span><br><span class="line"> css_set_hash(init_css_set.subsys));</span><br><span class="line"></span><br><span class="line">WARN_ON(sysfs_create_mount_point(fs_kobj, <span 
class="string">&quot;cgroup&quot;</span>));</span><br><span class="line">WARN_ON(register_filesystem(&amp;cgroup_fs_type));</span><br><span class="line">WARN_ON(register_filesystem(&amp;cgroup2_fs_type));</span><br><span class="line">WARN_ON(!proc_create_single(<span class="string">&quot;cgroups&quot;</span>, <span class="number">0</span>, <span class="literal">NULL</span>, proc_cgroupstats_show));</span><br><span class="line"><span class="meta">#<span class="keyword">ifdef</span> CONFIG_CPUSETS</span></span><br><span class="line">WARN_ON(register_filesystem(&amp;cpuset_fs_type));</span><br><span class="line"><span class="meta">#<span class="keyword">endif</span></span></span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p><code>/proc/cgroups</code> lists the <code>cgroup</code> controllers the kernel supports, and <code>/proc/&lt;pid&gt;/cgroup</code> shows which <code>cgroup</code>s a particular process belongs to.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">~$ <span class="built_in">cat</span> /proc/cgroups </span><br><span class="line"><span class="comment">#subsys_name    hierarchy   num_cgroups   enabled</span></span><br><span class="line">cpuset          0           259       
     1</span><br><span class="line">cpu             0           259            1</span><br><span class="line">cpuacct         0           259            1</span><br><span class="line">blkio           0           259            1</span><br><span class="line">memory          0           259            1</span><br><span class="line">devices         0           259            1</span><br><span class="line">freezer         0           259            1</span><br><span class="line">net_cls         0           259            1</span><br><span class="line">perf_event      0           259            1</span><br><span class="line">net_prio        0           259            1</span><br><span class="line">hugetlb         0           259            1</span><br><span class="line">pids            0           259            1</span><br><span class="line">rdma            0           259            1</span><br><span class="line">misc            0           259            1</span><br><span class="line"></span><br></pre></td></tr></table></figure><h2 id="如何启动LXC容器"><a href="#如何启动LXC容器" class="headerlink" title="如何启动LXC容器"></a><strong>How to start an <code>LXC</code> container</strong></h2><p>The code and tools behind <a href="https://linuxcontainers.org/"><code>LXC</code> containers</a> are all open source, and the project consists of several separate components:</p><ul><li>The <code>liblxc</code> library, which holds the core container implementation</li><li>Bindings for other programming languages such as <code>python/lua/Go/ruby/Haskell</code></li><li>A full set of tools for creating, starting, monitoring, and destroying containers</li><li>Container templates for different system environments; reference templates are available on the <a href="https://linuxcontainers.org/lxc/downloads/">LXC website</a></li></ul><p>Next, we use <code>Ubuntu</code> as an example to show how to start a container with the <code>LXC</code> tools. First, install the packages that <code>LXC</code> containers depend on with the following commands:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">sudo apt-get install lxc</span><br><span class="line"></span><br><span class="line">sudo 
apt-get install lxc-templates</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Once installation succeeds, a number of <code>LXC</code> tools appear on the system, such as <code>lxc-create</code>&#x2F;<code>lxc-start</code>&#x2F;<code>lxc-stop</code>. To make sure containers will work properly, run <code>lxc-checkconfig</code> before creating one to check whether the current system configuration meets the requirements. The command prints the status of each relevant configuration option, and the output below shows that this system can run containers.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">~$ lxc-checkconfig </span><br><span 
class="line">LXC version 5.0.0~git2209-g5a7b9ce67</span><br><span class="line">Kernel configuration not found at /proc/config.gz; searching...</span><br><span class="line">Kernel configuration found at /boot/config-6.2.0-36-generic</span><br><span class="line">--- Namespaces ---</span><br><span class="line">Namespaces: enabled</span><br><span class="line">Utsname namespace: enabled</span><br><span class="line">Ipc namespace: enabled</span><br><span class="line">Pid namespace: enabled</span><br><span class="line">User namespace: enabled</span><br><span class="line">Network namespace: enabled</span><br><span class="line"></span><br><span class="line">--- Control <span class="built_in">groups</span> ---</span><br><span class="line">Cgroups: enabled</span><br><span class="line">Cgroup namespace: enabled</span><br><span class="line"></span><br><span class="line">Cgroup v1 mount points: </span><br><span class="line"></span><br><span class="line"></span><br><span class="line">Cgroup v2 mount points: </span><br><span class="line">/sys/fs/cgroup</span><br><span class="line"></span><br><span class="line">Cgroup v1 systemd controller: missing</span><br><span class="line">Cgroup v1 freezer controller: missing</span><br><span class="line">Cgroup ns_cgroup: required</span><br><span class="line">Cgroup device: enabled</span><br><span class="line">Cgroup <span class="built_in">sched</span>: enabled</span><br><span class="line">Cgroup cpu account: enabled</span><br><span class="line">Cgroup memory controller: enabled</span><br><span class="line">Cgroup cpuset: enabled</span><br><span class="line"></span><br><span class="line">--- Misc ---</span><br><span class="line">Veth pair device: enabled, not loaded</span><br><span class="line">Macvlan: enabled, not loaded</span><br><span class="line">Vlan: enabled, not loaded</span><br><span class="line">Bridges: enabled, loaded</span><br><span class="line">Advanced netfilter: enabled, loaded</span><br><span 
class="line">CONFIG_IP_NF_TARGET_MASQUERADE: enabled, not loaded</span><br><span class="line">CONFIG_IP6_NF_TARGET_MASQUERADE: enabled, not loaded</span><br><span class="line">CONFIG_NETFILTER_XT_TARGET_CHECKSUM: enabled, not loaded</span><br><span class="line">CONFIG_NETFILTER_XT_MATCH_COMMENT: enabled, not loaded</span><br><span class="line">FUSE (<span class="keyword">for</span> use with lxcfs): enabled, not loaded</span><br><span class="line"></span><br><span class="line">...</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Next, create a container with <code>lxc-create</code>; here we use the <code>busybox</code> container template:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment"># managing containers requires root privileges</span></span><br><span class="line">sudo lxc-create -n busybox-lxc -t busybox</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>After the container is created, <code>lxc-info</code> shows its state; a container that has not been started yet is <code>STOPPED</code>:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">~$ sudo lxc-info -n busybox-lxc</span><br><span class="line">Name:           busybox-lxc</span><br><span class="line">State:          STOPPED</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Then start the container with <code>lxc-start</code> so that it is running; checking the state again now shows <code>RUNNING</code>:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span 
class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment"># start the container</span></span><br><span class="line">~$ sudo lxc-start -n busybox-lxc</span><br><span class="line"></span><br><span class="line">~$ sudo lxc-info -n busybox-lxc</span><br><span class="line">Name:           busybox-lxc</span><br><span class="line">State:          RUNNING</span><br><span class="line">PID:            83579</span><br><span class="line">Link:           vethlkXyJr</span><br><span class="line"> TX bytes:      612 bytes</span><br><span class="line"> RX bytes:      5.92 KiB</span><br><span class="line"> Total bytes:   6.52 KiB</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Besides checking the state with <code>lxc-info</code>, you can use <code>lxc-attach</code> to open a terminal session inside the container and inspect the running system. Once attached, a new prompt <code>/#</code> appears:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">sudo lxc-attach -n busybox-lxc</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">BusyBox v1.30.1 (Ubuntu 1:1.30.1-7ubuntu3) built-in shell (ash)</span><br><span class="line">Enter <span class="string">&#x27;help&#x27;</span> <span class="keyword">for</span> a list of built-in commands.</span><br><span class="line"></span><br><span 
class="line">/ <span class="comment"># </span></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Try running <code>top</code> in that terminal to see the state of the processes. The container has only a handful of them:</p><ul><li>Two processes named <code>init</code> with different <code>pid</code>s. Why are there two <code>init</code> processes? We will come back to this in the next article.</li><li>A <code>syslogd</code> process that collects system logs</li><li>A <code>udhcpc</code> process, the <code>DHCP</code> client that obtains and renews the container's IP address</li><li><code>getty</code>, a daemon that handles interaction on the pseudo terminal</li></ul><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">Mem: 40201368K used, 549680K free, 1329880K shrd, 738404K buff, 19484428K cached</span><br><span class="line">CPU:   5% usr   0% sys   0% nic  94% idle   0% io   0% irq   0% sirq</span><br><span class="line">Load average: 0.91 1.13 1.47 2/2955 21</span><br><span class="line">  PID  PPID USER     STAT   VSZ %VSZ %CPU COMMAND</span><br><span class="line">    1     0 root     S     2456   0%   0% init</span><br><span class="line">    4     1 root     S     2456   0%   0% /bin/syslogd</span><br><span class="line">   14     1 root     S     2456   0%   0% /bin/udhcpc</span><br><span class="line">   15     1 root     S     2456   0%   0% /bin/getty -L tty1 115200 vt100</span><br><span class="line">   16     1 root     S     2456   0%   0% init</span><br><span class="line">   17     0 root     S     2456   0%   0% /bin/sh</span><br><span class="line">   21    17 root     R     2456   0%   0% top</span><br><span 
class="line"></span><br></pre></td></tr></table></figure><p>At this point we have created and started a container. To stop or destroy a container, use <code>lxc-stop</code>&#x2F;<code>lxc-destroy</code>. Beyond these, <code>LXC</code> provides several other commonly used commands for managing containers:</p><ul><li><code>lxc-execute</code> runs a specific program inside a container</li><li><code>lxc-freeze/lxc-unfreeze</code> freeze and thaw all processes of a container</li><li><code>lxc-cgroup</code> manages the <code>cgroup</code> configuration of a container</li><li><code>lxc-monitor</code> monitors the state of containers</li><li><code>lxc-device</code> manages the devices in a container</li><li><code>lxc-snapshot</code> saves container snapshots</li></ul><p>The next article looks at how <code>LXC</code> containers are actually implemented, from the source code.</p><h2 id="参考文献"><a href="#参考文献" class="headerlink" title="参考文献"></a><strong>References</strong></h2><ul><li><a href="https://github.com/lxc/lxc">https://github.com/lxc/lxc</a></li><li><a href="https://www.linuxjournal.com/content/exploring-lxc-containerization-ubuntu-servers">https://www.linuxjournal.com/content/exploring-lxc-containerization-ubuntu-servers</a></li><li><a href="https://linuxcontainers.org/lxc/downloads/">https://linuxcontainers.org/lxc/downloads/</a></li><li><a href="https://docs.kernel.org/admin-guide/cgroup-v1/cgroups.html">https://docs.kernel.org/admin-guide/cgroup-v1/cgroups.html</a></li><li><a href="https://man7.org/linux/man-pages/man7/cgroups.7.html">https://man7.org/linux/man-pages/man7/cgroups.7.html</a></li><li><a href="https://github.com/Friz-zy/awesome-linux-containers">https://github.com/Friz-zy/awesome-linux-containers</a></li></ul>]]></content>
    
    
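    <summary type="html">&lt;p&gt;&lt;code&gt;Containers&lt;/code&gt; are a technique for creating lightweight, &lt;code&gt;virtual&lt;/code&gt; execution environments for applications. With containers we can easily build multiple isolated, virtual runtime environments on a single operating system. Unlike virtualization based on a &lt;code&gt;hypervisor&lt;/code&gt;, which isolates at the hardware level, containers rely on namespaces (&lt;code&gt;Namespace&lt;/code&gt;) and control groups (&lt;code&gt;Cgroups&lt;/code&gt;) in the &lt;code&gt;Linux&lt;/code&gt; kernel to isolate and manage process-level resources such as CPU, memory, I/O, and network. Common container solutions today include &lt;a href=&quot;https://linuxcontainers.org/&quot;&gt;&lt;code&gt;Linux Containers(LXC)&lt;/code&gt;&lt;/a&gt; and &lt;a href=&quot;https://en.wikipedia.org/wiki/Docker_(software)&quot;&gt;&lt;code&gt;Docker&lt;/code&gt;&lt;/a&gt;. &lt;code&gt;LXC&lt;/code&gt; can run a single process or boot a system image (a complete system environment that includes a &lt;code&gt;rootfs&lt;/code&gt;), whereas &lt;code&gt;Docker&lt;/code&gt; is generally used to package and run applications in cloud computing.&lt;/p&gt;
&lt;p&gt;This series on Linux containers is planned in two parts. The first covers the basic principles behind &lt;code&gt;LXC&lt;/code&gt; containers and how to create your own container on &lt;code&gt;ubuntu&lt;/code&gt;; the second analyzes how &lt;code&gt;LXC&lt;/code&gt; is implemented, from the source code. This article focuses on how &lt;code&gt;LXC&lt;/code&gt; works and covers two topics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;First, the basic principles of &lt;code&gt;LXC&lt;/code&gt;, explained through two core concepts, &lt;code&gt;namespace&lt;/code&gt; and &lt;code&gt;cgroups&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Then, setting up and starting a complete &lt;code&gt;LXC&lt;/code&gt; container on &lt;code&gt;Ubuntu&lt;/code&gt;&lt;/li&gt;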
&lt;/ul&gt;</summary>
    
    
    
    <category term="Linux" scheme="https://sniffer.site/categories/Linux/"/>
    
    
    <category term="虚拟化" scheme="https://sniffer.site/tags/%E8%99%9A%E6%8B%9F%E5%8C%96/"/>
    
    <category term="LXC" scheme="https://sniffer.site/tags/LXC/"/>
    
    <category term="容器" scheme="https://sniffer.site/tags/%E5%AE%B9%E5%99%A8/"/>
    
  </entry>
  
  <entry>
    <title>A Look at VLAN</title>
    <link href="https://sniffer.site/2025/05/14/%E8%AF%B4%E4%B8%80%E8%AF%B4VLAN/"/>
    <id>https://sniffer.site/2025/05/14/%E8%AF%B4%E4%B8%80%E8%AF%B4VLAN/</id>
    <published>2025-05-14T07:58:59.000Z</published>
    <updated>2025-09-09T03:36:26.861Z</updated>
    
    <content type="html"><![CDATA[<p><code>VLAN(Virtual Local-Area-Network)</code>, a virtual LAN, logically splits one physical local area network (<code>LAN</code>) into multiple independent virtual broadcast domains. Each <code>VLAN</code> is one broadcast domain whose hosts can talk to each other directly, while hosts in different <code>VLAN</code>s cannot, so broadcast frames stay confined to a single <code>VLAN</code>. <code>VLAN</code> works at the data link layer (<code>L2</code>) of the network stack: an extra <code>VLAN</code> tag is added to each frame so that traffic on one physical LAN can be separated as if it ran on several physical LANs. The priority field in the <code>VLAN</code> tag can also be used to give high-priority traffic lower latency on the LAN, which improves overall transmission quality. This article covers <code>VLAN</code> from two angles:</p><ul><li>First, how to create and configure a <code>VLAN</code></li><li>Then, starting from the frame format, how <code>VLAN</code> is implemented in the Linux kernel</li></ul><span id="more"></span><h2 id="配置VLAN"><a href="#配置VLAN" class="headerlink" title="配置VLAN"></a><strong>Configuring VLAN</strong></h2><p>In practice a <code>VLAN</code> can be created on top of a physical NIC or a bridge interface (<code>bridge</code>). On Linux, the <code>ip link</code> command creates it:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">ip <span class="built_in">link</span> add <span class="built_in">link</span> eth0 name vlan100 <span class="built_in">type</span> vlan <span class="built_in">id</span> 100</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Here we create a VLAN interface with <code>ID</code> <code>100</code> on the physical interface <code>eth0</code>; its name is <code>vlan100</code> (any name can be chosen). To bring the <code>VLAN</code> interface into service, configure it just like a physical NIC: assign it an <code>IP</code> address and set it to the running state (<code>UP</code>):</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"></span><br><span class="line">ip addr add 192.168.100.1/24 brd 192.168.100.255 dev vlan100</span><br><span class="line">ip <span class="built_in">link</span> <span 
class="built_in">set</span> dev vlan100 up</span><br><span class="line"></span><br></pre></td></tr></table></figure><p><code>ifconfig vlan100</code> shows the state of the configured <code>VLAN</code> interface:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"></span><br><span class="line">vlan100: flags=4163&lt;UP,BROADCAST,RUNNING,MULTICAST&gt;  mtu 1500</span><br><span class="line">        inet 192.168.100.2  netmask 255.255.255.0  broadcast 192.168.100.255</span><br><span class="line">        inet6 fe80::a6f9:33ff:fe8b:932c  prefixlen 64  scopeid 0x20&lt;<span class="built_in">link</span>&gt;</span><br><span class="line">        ether a4:f9:33:8b:93:2c  txqueuelen 1000  (Ethernet)</span><br><span class="line">        RX packets 0  bytes 0 (0.0 B)</span><br><span class="line">        RX errors 0  dropped 0  overruns 0  frame 0</span><br><span class="line">        TX packets 101  bytes 16276 (16.2 KB)</span><br><span class="line">        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><h2 id="VLAN的实现原理"><a href="#VLAN的实现原理" class="headerlink" title="VLAN的实现原理"></a><strong>How VLAN works</strong></h2><p>For a switch to tell frames of different <code>VLAN</code>s apart, a <code>4 bytes</code> <code>VLAN</code> tag is inserted into the original Ethernet frame to carry the <code>VLAN</code> information. As the figure below shows, a <code>VLAN</code> frame adds a <code>tag</code> field to the original Ethernet frame: 2 bytes hold the <code>TPID</code>, which sits in the position of the protocol type field and marks the frame as tagged, and the other 2 bytes carry three pieces of information (<code>PRI</code>, <code>CFI/DEI</code>, and <code>VID</code>):</p><p><img src="https://md-files.oss-cn-shenzhen.aliyuncs.com/vlan_header.png" alt="VLAN 
tag"></p><ul><li><code>TPID(Tag Protocol Identifier, 2bytes)</code>: defined by the <code>IEEE 802.1Q</code> standard and normally set to <code>0x8100</code></li><li><code>PRI(Priority, 3bits)</code>: 0~7, the priority of the traffic</li><li><code>CFI(Canonical Format Indicator, 1bit)/DEI(Drop Eligible Indicator, 1bit)</code>: <code>CFI</code> indicates whether the <code>MAC</code> address is in canonical form; <code>0</code> means canonical (least significant bit transmitted first) and <code>1</code> means non-canonical (most significant bit transmitted first). <code>DEI</code> works together with the priority <code>PRI</code> to select frames that may be dropped when the network is congested. In practice you can choose which of the two flags to use</li><li><code>VID(VLAN Identifier, 12bits)</code>: ranges over <code>0-4095</code> and identifies the <code>VLAN</code> the frame belongs to; <code>0/4095</code> are reserved values, so the usable values lie in <code>1-4094</code></li></ul><p>Let's look at how the <code>VLAN</code> data is laid out in a frame captured with <code>tcpdump</code>. The frame starts with the destination and source <code>MAC</code> addresses, 12 bytes in total, followed by the 4-byte <code>VLAN</code> tag:</p><ul><li><code>Type(0x8100)</code>: the <code>TPID</code>, which shows that the frame follows <code>IEEE 802.1Q</code></li><li><code>Priority(0)</code>: the <code>PRI</code> field; the default value is <code>0</code></li><li><code>DEI(0)</code>: the drop eligible indicator; <code>0</code> means the frame is not marked as eligible to be dropped</li><li><code>ID(200)</code>: the <code>VLAN ID</code> is 200</li></ul><p><img src="https://md-files.oss-cn-shenzhen.aliyuncs.com/vlan-frame-examples.png" alt="VLAN frame examples"></p><p>Next, let's look at the rough implementation of <code>VLAN</code> from the Linux source code.</p><h3 id="Linux如何实现VLAN"><a href="#Linux如何实现VLAN" class="headerlink" title="Linux如何实现VLAN"></a><strong>How Linux implements VLAN</strong></h3><p>As shown at the start of this article, a <code>VLAN</code> is configured with the <code>ip link</code> command. The <code>ip</code> commands on <code>Linux</code> are implemented in the open-source <code>iproute2</code> package. In its source file <code>iproute2/ip/iplink.c</code> we can see that <code>ip link</code> ultimately goes through the <code>iplink_modify</code> function, which sends a <a href="https://man7.org/linux/man-pages/man7/rtnetlink.7.html"><code>netlink</code></a> socket message to the kernel to create the new <code>VLAN</code> interface:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span 
class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span 
class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="type">static</span> <span class="type">int</span> <span class="title function_">iplink_modify</span><span class="params">(<span class="type">int</span> cmd, <span class="type">unsigned</span> <span class="type">int</span> flags, <span class="type">int</span> argc, <span class="type">char</span> **argv)</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="type">char</span> *dev = <span class="literal">NULL</span>;</span><br><span class="line"><span class="type">char</span> *name = <span class="literal">NULL</span>;</span><br><span class="line"><span class="type">char</span> *link = <span class="literal">NULL</span>;</span><br><span class="line"><span class="type">char</span> *type = <span class="literal">NULL</span>;</span><br><span class="line"><span class="type">int</span> index = <span class="number">-1</span>;</span><br><span class="line"><span class="type">int</span> group;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">link_util</span> *<span class="title">lu</span> =</span> <span class="literal">NULL</span>;</span><br><span class="line"><span class="class"><span 
class="keyword">struct</span> <span class="title">iplink_req</span> <span class="title">req</span> =</span> &#123;</span><br><span class="line">.n.nlmsg_len = NLMSG_LENGTH(<span class="keyword">sizeof</span>(<span class="keyword">struct</span> ifinfomsg)),</span><br><span class="line">.n.nlmsg_flags = NLM_F_REQUEST | flags,</span><br><span class="line">.n.nlmsg_type = cmd,</span><br><span class="line">.i.ifi_family = preferred_family,</span><br><span class="line">&#125;;</span><br><span class="line"><span class="type">int</span> ret;</span><br><span class="line"></span><br><span class="line">ret = iplink_parse(argc, argv,</span><br><span class="line">   &amp;req, &amp;name, &amp;type, &amp;link, &amp;dev, &amp;group, &amp;index);</span><br><span class="line"><span class="keyword">if</span> (ret &lt; <span class="number">0</span>)</span><br><span class="line"><span class="keyword">return</span> ret;</span><br><span class="line"></span><br><span class="line">argc -= ret;</span><br><span class="line">argv += ret;</span><br><span class="line">...</span><br><span class="line"><span class="keyword">if</span> (!(flags &amp; NLM_F_CREATE)) &#123;</span><br><span class="line">...</span><br><span class="line">&#125; <span class="keyword">else</span> &#123;</span><br><span class="line"><span class="comment">/* Allow &quot;ip link add dev&quot; and &quot;ip link add name&quot; */</span></span><br><span class="line"><span class="keyword">if</span> (!name)</span><br><span class="line">name = dev;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (link) &#123;</span><br><span class="line"><span class="type">int</span> ifindex;</span><br><span class="line"></span><br><span class="line">ifindex = ll_name_to_index(link);</span><br><span class="line"><span class="keyword">if</span> (ifindex == <span class="number">0</span>) &#123;</span><br><span class="line"><span class="built_in">fprintf</span>(<span class="built_in">stderr</span>, <span 
class="string">&quot;Cannot find device \&quot;%s\&quot;\n&quot;</span>,</span><br><span class="line">link);</span><br><span class="line"><span class="keyword">return</span> <span class="number">-1</span>;</span><br><span class="line">&#125;</span><br><span class="line">addattr_l(&amp;req.n, <span class="keyword">sizeof</span>(req), IFLA_LINK, &amp;ifindex, <span class="number">4</span>);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (index == <span class="number">-1</span>)</span><br><span class="line">req.i.ifi_index = <span class="number">0</span>;</span><br><span class="line"><span class="keyword">else</span></span><br><span class="line">req.i.ifi_index = index;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (name) &#123;</span><br><span class="line">addattr_l(&amp;req.n, <span class="keyword">sizeof</span>(req),</span><br><span class="line">  IFLA_IFNAME, name, <span class="built_in">strlen</span>(name) + <span class="number">1</span>);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (type) &#123;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">rtattr</span> *<span class="title">linkinfo</span>;</span></span><br><span class="line"><span class="type">char</span> *ulinep = <span class="built_in">strchr</span>(type, <span class="string">&#x27;_&#x27;</span>);</span><br><span class="line"><span class="type">int</span> iflatype;</span><br><span class="line"></span><br><span class="line">linkinfo = addattr_nest(&amp;req.n, <span class="keyword">sizeof</span>(req), IFLA_LINKINFO);</span><br><span class="line">addattr_l(&amp;req.n, <span class="keyword">sizeof</span>(req), IFLA_INFO_KIND, type,</span><br><span class="line"> <span 
class="built_in">strlen</span>(type));</span><br><span class="line"></span><br><span class="line">lu = get_link_kind(type);</span><br><span class="line"><span class="keyword">if</span> (ulinep &amp;&amp; !<span class="built_in">strcmp</span>(ulinep, <span class="string">&quot;_slave&quot;</span>))</span><br><span class="line">iflatype = IFLA_INFO_SLAVE_DATA;</span><br><span class="line"><span class="keyword">else</span></span><br><span class="line">iflatype = IFLA_INFO_DATA;</span><br><span class="line"><span class="keyword">if</span> (lu &amp;&amp; argc) &#123;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">rtattr</span> *<span class="title">data</span></span></span><br><span class="line"><span class="class">=</span> addattr_nest(&amp;req.n,</span><br><span class="line">       <span class="keyword">sizeof</span>(req), iflatype);</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (lu-&gt;parse_opt &amp;&amp;</span><br><span class="line">    lu-&gt;parse_opt(lu, argc, argv, &amp;req.n))</span><br><span class="line"><span class="keyword">return</span> <span class="number">-1</span>;</span><br><span class="line"></span><br><span class="line">addattr_nest_end(&amp;req.n, data);</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line">...</span><br><span class="line"></span><br><span class="line"><span class="comment">// send the netlink message to the kernel</span></span><br><span class="line"><span class="keyword">if</span> (rtnl_talk(&amp;rth, &amp;req.n, <span class="literal">NULL</span>, <span class="number">0</span>) &lt; <span class="number">0</span>)</span><br><span class="line"><span class="keyword">return</span> <span class="number">-2</span>;</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line">&#125;</span><br><span 
class="line"></span><br></pre></td></tr></table></figure><p>In the Linux kernel, the <code>rtnetlink</code> handling is implemented in <code>net/core/rtnetlink.c</code>. During initialization, <code>rtnetlink_init</code> registers handler functions for many <code>netlink</code> message types, so that applications can send requests to the kernel over a <code>netlink</code> socket to configure routes, create VLANs, configure ARP, and so on:</p><blockquote><p>The analysis in this article is based on Linux v5.10</p></blockquote><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="type">void</span> __init <span class="title function_">rtnetlink_init</span><span class="params">(<span class="type">void</span>)</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="keyword">if</span> (register_pernet_subsys(&amp;rtnetlink_net_ops))</span><br><span class="line">panic(<span class="string">&quot;rtnetlink_init: cannot initialize 
rtnetlink\n&quot;</span>);</span><br><span class="line"></span><br><span class="line">register_netdevice_notifier(&amp;rtnetlink_dev_notifier);</span><br><span class="line"></span><br><span class="line">rtnl_register(PF_UNSPEC, RTM_GETLINK, rtnl_getlink,</span><br><span class="line">      rtnl_dump_ifinfo, <span class="number">0</span>);</span><br><span class="line">rtnl_register(PF_UNSPEC, RTM_SETLINK, rtnl_setlink, <span class="literal">NULL</span>, <span class="number">0</span>);</span><br><span class="line"><span class="comment">// netlink handler that creates a new link</span></span><br><span class="line">rtnl_register(PF_UNSPEC, RTM_NEWLINK, rtnl_newlink, <span class="literal">NULL</span>, <span class="number">0</span>);</span><br><span class="line">rtnl_register(PF_UNSPEC, RTM_DELLINK, rtnl_dellink, <span class="literal">NULL</span>, <span class="number">0</span>);</span><br><span class="line"></span><br><span class="line">rtnl_register(PF_UNSPEC, RTM_GETADDR, <span class="literal">NULL</span>, rtnl_dump_all, <span class="number">0</span>);</span><br><span class="line">rtnl_register(PF_UNSPEC, RTM_GETROUTE, <span class="literal">NULL</span>, rtnl_dump_all, <span class="number">0</span>);</span><br><span class="line">rtnl_register(PF_UNSPEC, RTM_GETNETCONF, <span class="literal">NULL</span>, rtnl_dump_all, <span class="number">0</span>);</span><br><span class="line"></span><br><span class="line">rtnl_register(PF_UNSPEC, RTM_NEWLINKPROP, rtnl_newlinkprop, <span class="literal">NULL</span>, <span class="number">0</span>);</span><br><span class="line">rtnl_register(PF_UNSPEC, RTM_DELLINKPROP, rtnl_dellinkprop, <span class="literal">NULL</span>, <span class="number">0</span>);</span><br><span class="line"></span><br><span class="line">rtnl_register(PF_BRIDGE, RTM_NEWNEIGH, rtnl_fdb_add, <span class="literal">NULL</span>, <span class="number">0</span>);</span><br><span class="line">rtnl_register(PF_BRIDGE, RTM_DELNEIGH, rtnl_fdb_del, <span class="literal">NULL</span>, <span 
class="number">0</span>);</span><br><span class="line">rtnl_register(PF_BRIDGE, RTM_GETNEIGH, rtnl_fdb_get, rtnl_fdb_dump, <span class="number">0</span>);</span><br><span class="line"></span><br><span class="line">rtnl_register(PF_BRIDGE, RTM_GETLINK, <span class="literal">NULL</span>, rtnl_bridge_getlink, <span class="number">0</span>);</span><br><span class="line">rtnl_register(PF_BRIDGE, RTM_DELLINK, rtnl_bridge_dellink, <span class="literal">NULL</span>, <span class="number">0</span>);</span><br><span class="line">rtnl_register(PF_BRIDGE, RTM_SETLINK, rtnl_bridge_setlink, <span class="literal">NULL</span>, <span class="number">0</span>);</span><br><span class="line">...</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Let's focus on how a <code>link</code> is created. The core work of <code>rtnl_newlink</code> is done by <code>__rtnl_newlink</code>, which performs the following main steps:</p><ul><li>First, <code>nlmsg_parse_deprecated</code> parses the <code>netlink</code> message and fills an array of <code>struct nlattr</code> pointers</li><li>Next, <code>validate_linkmsg</code> checks that the attributes meet the requirements, for example that a supplied link-layer address is not shorter than the device's address length</li><li>The nested <code>IFLA_LINKINFO</code> attribute is parsed to obtain the <code>link</code> information, in particular the <code>link</code> type in <code>IFLA_INFO_KIND</code></li><li><code>rtnl_create_link</code> is called to create the network device for the <code>VLAN</code></li><li>Finally, the <code>newlink</code> callback in <code>struct rtnl_link_ops</code> completes the creation of the <code>VLAN</code> link<blockquote><p>If you are interested in how the netlink data is exchanged, you can use strace to trace the underlying system calls</p></blockquote></li></ul><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span 
class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span 
class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br><span class="line">101</span><br><span class="line">102</span><br><span class="line">103</span><br><span class="line">104</span><br><span class="line">105</span><br><span class="line">106</span><br><span class="line">107</span><br><span class="line">108</span><br><span class="line">109</span><br><span class="line">110</span><br><span class="line">111</span><br><span class="line">112</span><br><span class="line">113</span><br><span class="line">114</span><br><span class="line">115</span><br><span class="line">116</span><br><span class="line">117</span><br><span class="line">118</span><br><span class="line">119</span><br><span class="line">120</span><br><span class="line">121</span><br><span class="line">122</span><br><span class="line">123</span><br><span class="line">124</span><br><span class="line">125</span><br><span class="line">126</span><br><span class="line">127</span><br><span class="line">128</span><br><span class="line">129</span><br><span class="line">130</span><br><span class="line">131</span><br><span class="line">132</span><br><span class="line">133</span><br><span class="line">134</span><br><span class="line">135</span><br><span class="line">136</span><br><span 
class="line">137</span><br><span class="line">138</span><br><span class="line">139</span><br><span class="line">140</span><br><span class="line">141</span><br><span class="line">142</span><br><span class="line">143</span><br><span class="line">144</span><br><span class="line">145</span><br><span class="line">146</span><br><span class="line">147</span><br><span class="line">148</span><br><span class="line">149</span><br><span class="line">150</span><br><span class="line">151</span><br><span class="line">152</span><br><span class="line">153</span><br><span class="line">154</span><br><span class="line">155</span><br><span class="line">156</span><br><span class="line">157</span><br><span class="line">158</span><br><span class="line">159</span><br><span class="line">160</span><br><span class="line">161</span><br><span class="line">162</span><br><span class="line">163</span><br><span class="line">164</span><br><span class="line">165</span><br><span class="line">166</span><br><span class="line">167</span><br><span class="line">168</span><br><span class="line">169</span><br><span class="line">170</span><br><span class="line">171</span><br><span class="line">172</span><br><span class="line">173</span><br><span class="line">174</span><br><span class="line">175</span><br><span class="line">176</span><br><span class="line">177</span><br><span class="line">178</span><br><span class="line">179</span><br><span class="line">180</span><br><span class="line">181</span><br><span class="line">182</span><br><span class="line">183</span><br><span class="line">184</span><br><span class="line">185</span><br><span class="line">186</span><br><span class="line">187</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="type">static</span> <span class="type">int</span> __rtnl_newlink(<span class="keyword">struct</span> sk_buff *skb, <span class="keyword">struct</span> nlmsghdr *nlh,</span><br><span class="line">  
<span class="keyword">struct</span> nlattr **attr, <span class="keyword">struct</span> netlink_ext_ack *extack)</span><br><span class="line">&#123;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">nlattr</span> *<span class="title">slave_attr</span>[<span class="title">RTNL_SLAVE_MAX_TYPE</span> + 1];</span></span><br><span class="line"><span class="type">unsigned</span> <span class="type">char</span> name_assign_type = NET_NAME_USER;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">nlattr</span> *<span class="title">linkinfo</span>[<span class="title">IFLA_INFO_MAX</span> + 1];</span></span><br><span class="line"><span class="type">const</span> <span class="class"><span class="keyword">struct</span> <span class="title">rtnl_link_ops</span> *<span class="title">m_ops</span> =</span> <span class="literal">NULL</span>;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">net_device</span> *<span class="title">master_dev</span> =</span> <span class="literal">NULL</span>;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">net</span> *<span class="title">net</span> =</span> sock_net(skb-&gt;sk);</span><br><span class="line"><span class="type">const</span> <span class="class"><span class="keyword">struct</span> <span class="title">rtnl_link_ops</span> *<span class="title">ops</span>;</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">nlattr</span> *<span class="title">tb</span>[<span class="title">IFLA_MAX</span> + 1];</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">net</span> *<span class="title">dest_net</span>, *<span class="title">link_net</span>;</span></span><br><span class="line"><span class="class"><span 
class="keyword">struct</span> <span class="title">nlattr</span> **<span class="title">slave_data</span>;</span></span><br><span class="line"><span class="type">char</span> kind[MODULE_NAME_LEN];</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">net_device</span> *<span class="title">dev</span>;</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">ifinfomsg</span> *<span class="title">ifm</span>;</span></span><br><span class="line"><span class="type">char</span> ifname[IFNAMSIZ];</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">nlattr</span> **<span class="title">data</span>;</span></span><br><span class="line"><span class="type">int</span> err;</span><br><span class="line"></span><br><span class="line">err = nlmsg_parse_deprecated(nlh, <span class="keyword">sizeof</span>(*ifm), tb, IFLA_MAX,</span><br><span class="line">     ifla_policy, extack);</span><br><span class="line"><span class="keyword">if</span> (err &lt; <span class="number">0</span>)</span><br><span class="line"><span class="keyword">return</span> err;</span><br><span class="line">...</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (tb[IFLA_IFNAME])</span><br><span class="line">nla_strlcpy(ifname, tb[IFLA_IFNAME], IFNAMSIZ);</span><br><span class="line"><span class="keyword">else</span></span><br><span class="line">ifname[<span class="number">0</span>] = <span class="string">&#x27;\0&#x27;</span>;</span><br><span class="line"></span><br><span class="line">ifm = nlmsg_data(nlh);</span><br><span class="line"><span class="keyword">if</span> (ifm-&gt;ifi_index &gt; <span class="number">0</span>)</span><br><span class="line">dev = __dev_get_by_index(net, ifm-&gt;ifi_index);</span><br><span class="line"><span class="keyword">else</span> <span class="keyword">if</span> (tb[IFLA_IFNAME] || 
tb[IFLA_ALT_IFNAME])</span><br><span class="line">dev = rtnl_dev_get(net, <span class="literal">NULL</span>, tb[IFLA_ALT_IFNAME], ifname);</span><br><span class="line"><span class="keyword">else</span></span><br><span class="line">dev = <span class="literal">NULL</span>;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (dev) &#123;</span><br><span class="line">master_dev = netdev_master_upper_dev_get(dev);</span><br><span class="line"><span class="keyword">if</span> (master_dev)</span><br><span class="line">m_ops = master_dev-&gt;rtnl_link_ops;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">err = validate_linkmsg(dev, tb);</span><br><span class="line"><span class="keyword">if</span> (err &lt; <span class="number">0</span>)</span><br><span class="line"><span class="keyword">return</span> err;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (tb[IFLA_LINKINFO]) &#123;</span><br><span class="line">err = nla_parse_nested_deprecated(linkinfo, IFLA_INFO_MAX,</span><br><span class="line">  tb[IFLA_LINKINFO],</span><br><span class="line">  ifla_info_policy, <span class="literal">NULL</span>);</span><br><span class="line"><span class="keyword">if</span> (err &lt; <span class="number">0</span>)</span><br><span class="line"><span class="keyword">return</span> err;</span><br><span class="line">&#125; <span class="keyword">else</span></span><br><span class="line"><span class="built_in">memset</span>(linkinfo, <span class="number">0</span>, <span class="keyword">sizeof</span>(linkinfo));</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (linkinfo[IFLA_INFO_KIND]) &#123;</span><br><span class="line">nla_strlcpy(kind, linkinfo[IFLA_INFO_KIND], <span class="keyword">sizeof</span>(kind));</span><br><span class="line">ops = rtnl_link_ops_get(kind);</span><br><span class="line">&#125; <span 
class="keyword">else</span> &#123;</span><br><span class="line">kind[<span class="number">0</span>] = <span class="string">&#x27;\0&#x27;</span>;</span><br><span class="line">ops = <span class="literal">NULL</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">data = <span class="literal">NULL</span>;</span><br><span class="line"><span class="keyword">if</span> (ops) &#123;</span><br><span class="line"><span class="keyword">if</span> (ops-&gt;maxtype &gt; RTNL_MAX_TYPE)</span><br><span class="line"><span class="keyword">return</span> -EINVAL;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (ops-&gt;maxtype &amp;&amp; linkinfo[IFLA_INFO_DATA]) &#123;</span><br><span class="line">err = nla_parse_nested_deprecated(attr, ops-&gt;maxtype,</span><br><span class="line">  linkinfo[IFLA_INFO_DATA],</span><br><span class="line">  ops-&gt;policy, extack);</span><br><span class="line"><span class="keyword">if</span> (err &lt; <span class="number">0</span>)</span><br><span class="line"><span class="keyword">return</span> err;</span><br><span class="line">data = attr;</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">if</span> (ops-&gt;validate) &#123;</span><br><span class="line">err = ops-&gt;validate(tb, data, extack);</span><br><span class="line"><span class="keyword">if</span> (err &lt; <span class="number">0</span>)</span><br><span class="line"><span class="keyword">return</span> err;</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">slave_data = <span class="literal">NULL</span>;</span><br><span class="line"><span class="keyword">if</span> (m_ops) &#123;</span><br><span class="line"><span class="keyword">if</span> (m_ops-&gt;slave_maxtype &gt; RTNL_SLAVE_MAX_TYPE)</span><br><span class="line"><span class="keyword">return</span> -EINVAL;</span><br><span 
class="line"></span><br><span class="line"><span class="keyword">if</span> (m_ops-&gt;slave_maxtype &amp;&amp;</span><br><span class="line">    linkinfo[IFLA_INFO_SLAVE_DATA]) &#123;</span><br><span class="line">err = nla_parse_nested_deprecated(slave_attr,</span><br><span class="line">  m_ops-&gt;slave_maxtype,</span><br><span class="line">  linkinfo[IFLA_INFO_SLAVE_DATA],</span><br><span class="line">  m_ops-&gt;slave_policy,</span><br><span class="line">  extack);</span><br><span class="line"><span class="keyword">if</span> (err &lt; <span class="number">0</span>)</span><br><span class="line"><span class="keyword">return</span> err;</span><br><span class="line">slave_data = slave_attr;</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">...</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!(nlh-&gt;nlmsg_flags &amp; NLM_F_CREATE)) &#123;</span><br><span class="line"><span class="keyword">if</span> (ifm-&gt;ifi_index == <span class="number">0</span> &amp;&amp; tb[IFLA_GROUP])</span><br><span class="line"><span class="keyword">return</span> rtnl_group_changelink(skb, net,</span><br><span class="line">nla_get_u32(tb[IFLA_GROUP]),</span><br><span class="line">ifm, extack, tb);</span><br><span class="line"><span class="keyword">return</span> -ENODEV;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">...</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!ops) &#123;</span><br><span class="line"><span class="meta">#<span class="keyword">ifdef</span> CONFIG_MODULES</span></span><br><span class="line"><span class="keyword">if</span> (kind[<span class="number">0</span>]) &#123;</span><br><span class="line">__rtnl_unlock();</span><br><span class="line">request_module(<span class="string">&quot;rtnl-link-%s&quot;</span>, kind);</span><br><span 
class="line">rtnl_lock();</span><br><span class="line">ops = rtnl_link_ops_get(kind);</span><br><span class="line"><span class="keyword">if</span> (ops)</span><br><span class="line"><span class="keyword">goto</span> replay;</span><br><span class="line">&#125;</span><br><span class="line"><span class="meta">#<span class="keyword">endif</span></span></span><br><span class="line">NL_SET_ERR_MSG(extack, <span class="string">&quot;Unknown device type&quot;</span>);</span><br><span class="line"><span class="keyword">return</span> -EOPNOTSUPP;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!ops-&gt;setup)</span><br><span class="line"><span class="keyword">return</span> -EOPNOTSUPP;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!ifname[<span class="number">0</span>]) &#123;</span><br><span class="line"><span class="built_in">snprintf</span>(ifname, IFNAMSIZ, <span class="string">&quot;%s%%d&quot;</span>, ops-&gt;kind);</span><br><span class="line">name_assign_type = NET_NAME_ENUM;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">dest_net = rtnl_link_get_net_capable(skb, net, tb, CAP_NET_ADMIN);</span><br><span class="line"><span class="keyword">if</span> (IS_ERR(dest_net))</span><br><span class="line"><span class="keyword">return</span> PTR_ERR(dest_net);</span><br><span class="line">...</span><br><span class="line"></span><br><span class="line">dev = rtnl_create_link(link_net ? 
: dest_net, ifname,</span><br><span class="line">       name_assign_type, ops, tb, extack);</span><br><span class="line"><span class="keyword">if</span> (IS_ERR(dev)) &#123;</span><br><span class="line">err = PTR_ERR(dev);</span><br><span class="line"><span class="keyword">goto</span> out;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">dev-&gt;ifindex = ifm-&gt;ifi_index;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (ops-&gt;newlink) &#123;</span><br><span class="line">err = ops-&gt;newlink(link_net ? : net, dev, tb, data, extack);</span><br><span class="line"><span class="comment">/* Drivers should call free_netdev() in -&gt;destructor</span></span><br><span class="line"><span class="comment"> * and unregister it on failure after registration</span></span><br><span class="line"><span class="comment"> * so that device could be finally freed in rtnl_unlock.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="keyword">if</span> (err &lt; <span class="number">0</span>) &#123;</span><br><span class="line"><span class="comment">/* If device is not registered at all, free it now */</span></span><br><span class="line"><span class="keyword">if</span> (dev-&gt;reg_state == NETREG_UNINITIALIZED ||</span><br><span class="line">    dev-&gt;reg_state == NETREG_UNREGISTERED)</span><br><span class="line">free_netdev(dev);</span><br><span class="line"><span class="keyword">goto</span> out;</span><br><span class="line">&#125;</span><br><span class="line">&#125; <span class="keyword">else</span> &#123;</span><br><span class="line">err = register_netdevice(dev);</span><br><span class="line"><span class="keyword">if</span> (err &lt; <span class="number">0</span>) &#123;</span><br><span class="line">free_netdev(dev);</span><br><span class="line"><span class="keyword">goto</span> out;</span><br><span class="line">&#125;</span><br><span 
class="line">&#125;</span><br><span class="line">err = rtnl_configure_link(dev, ifm);</span><br><span class="line"><span class="keyword">if</span> (err &lt; <span class="number">0</span>)</span><br><span class="line"><span class="keyword">goto</span> out_unregister;</span><br><span class="line"><span class="keyword">if</span> (link_net) &#123;</span><br><span class="line">err = dev_change_net_namespace(dev, dest_net, ifname);</span><br><span class="line"><span class="keyword">if</span> (err &lt; <span class="number">0</span>)</span><br><span class="line"><span class="keyword">goto</span> out_unregister;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">...</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>The <code>struct rtnl_link_ops</code> structure is the interface used to configure a network device. Other kernel modules register an instance of it through <code>rtnl_link_register</code> during initialization, so that user space can configure every type of network interface through the same <code>netlink</code> interface.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span 
class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">rtnl_link_ops</span> &#123;</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">list_head</span><span class="title">list</span>;</span></span><br><span class="line"></span><br><span class="line"><span class="type">const</span> <span class="type">char</span>*kind;</span><br><span class="line"></span><br><span class="line"><span class="type">size_t</span>priv_size;</span><br><span class="line"><span class="type">void</span>(*setup)(<span class="keyword">struct</span> net_device *dev);</span><br><span class="line"></span><br><span class="line"><span class="type">unsigned</span> <span class="type">int</span>maxtype;</span><br><span class="line"><span class="type">const</span> <span class="class"><span class="keyword">struct</span> <span class="title">nla_policy</span>*<span class="title">policy</span>;</span></span><br><span class="line"><span 
class="type">int</span>(*validate)(<span class="keyword">struct</span> nlattr *tb[],</span><br><span class="line">    <span class="keyword">struct</span> nlattr *data[],</span><br><span class="line">    <span class="keyword">struct</span> netlink_ext_ack *extack);</span><br><span class="line"></span><br><span class="line"><span class="type">int</span>(*newlink)(<span class="keyword">struct</span> net *src_net,</span><br><span class="line">   <span class="keyword">struct</span> net_device *dev,</span><br><span class="line">   <span class="keyword">struct</span> nlattr *tb[],</span><br><span class="line">   <span class="keyword">struct</span> nlattr *data[],</span><br><span class="line">   <span class="keyword">struct</span> netlink_ext_ack *extack);</span><br><span class="line"><span class="type">int</span>(*changelink)(<span class="keyword">struct</span> net_device *dev,</span><br><span class="line">      <span class="keyword">struct</span> nlattr *tb[],</span><br><span class="line">      <span class="keyword">struct</span> nlattr *data[],</span><br><span class="line">      <span class="keyword">struct</span> netlink_ext_ack *extack);</span><br><span class="line"><span class="type">void</span>(*dellink)(<span class="keyword">struct</span> net_device *dev,</span><br><span class="line">   <span class="keyword">struct</span> list_head *head);</span><br><span class="line"></span><br><span class="line"><span class="type">size_t</span>(*get_size)(<span class="type">const</span> <span class="keyword">struct</span> net_device *dev);</span><br><span class="line"><span class="type">int</span>(*fill_info)(<span class="keyword">struct</span> sk_buff *skb,</span><br><span class="line">     <span class="type">const</span> <span class="keyword">struct</span> net_device *dev);</span><br><span class="line"></span><br><span class="line"><span class="type">size_t</span>(*get_xstats_size)(<span class="type">const</span> <span class="keyword">struct</span> net_device 
*dev);</span><br><span class="line"><span class="type">int</span>(*fill_xstats)(<span class="keyword">struct</span> sk_buff *skb,</span><br><span class="line">       <span class="type">const</span> <span class="keyword">struct</span> net_device *dev);</span><br><span class="line"><span class="type">unsigned</span> <span class="title function_">int</span><span class="params">(*get_num_tx_queues)</span><span class="params">(<span class="type">void</span>)</span>;</span><br><span class="line"><span class="type">unsigned</span> <span class="title function_">int</span><span class="params">(*get_num_rx_queues)</span><span class="params">(<span class="type">void</span>)</span>;</span><br><span class="line"></span><br><span class="line"><span class="type">unsigned</span> <span class="type">int</span>slave_maxtype;</span><br><span class="line"><span class="type">const</span> <span class="class"><span class="keyword">struct</span> <span class="title">nla_policy</span>*<span class="title">slave_policy</span>;</span></span><br><span class="line"><span class="type">int</span>(*slave_changelink)(<span class="keyword">struct</span> net_device *dev,</span><br><span class="line">    <span class="keyword">struct</span> net_device *slave_dev,</span><br><span class="line">    <span class="keyword">struct</span> nlattr *tb[],</span><br><span class="line">    <span class="keyword">struct</span> nlattr *data[],</span><br><span class="line">    <span class="keyword">struct</span> netlink_ext_ack *extack);</span><br><span class="line"><span class="type">size_t</span>(*get_slave_size)(<span class="type">const</span> <span class="keyword">struct</span> net_device *dev,</span><br><span class="line">  <span class="type">const</span> <span class="keyword">struct</span> net_device *slave_dev);</span><br><span class="line"><span class="type">int</span>(*fill_slave_info)(<span class="keyword">struct</span> sk_buff *skb,</span><br><span class="line">   <span class="type">const</span> <span 
class="keyword">struct</span> net_device *dev,</span><br><span class="line">   <span class="type">const</span> <span class="keyword">struct</span> net_device *slave_dev);</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">net</span>*(*<span class="title">get_link_net</span>)(<span class="title">const</span> <span class="keyword">struct</span> <span class="title">net_device</span> *<span class="title">dev</span>);</span></span><br><span class="line"><span class="type">size_t</span>(*get_linkxstats_size)(<span class="type">const</span> <span class="keyword">struct</span> net_device *dev,</span><br><span class="line">       <span class="type">int</span> attr);</span><br><span class="line"><span class="type">int</span>(*fill_linkxstats)(<span class="keyword">struct</span> sk_buff *skb,</span><br><span class="line">   <span class="type">const</span> <span class="keyword">struct</span> net_device *dev,</span><br><span class="line">   <span class="type">int</span> *prividx, <span class="type">int</span> attr);</span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>During initialization, the <code>VLAN</code> kernel module (<code>net/8021q/vlan_netlink.c</code>) registers its own <code>struct rtnl_link_ops vlan_link_ops</code> object:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span 
class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">rtnl_link_ops</span> <span class="title">vlan_link_ops</span> __<span class="title">read_mostly</span> =</span> &#123;</span><br><span class="line">.kind= <span class="string">&quot;vlan&quot;</span>,</span><br><span class="line">.maxtype= IFLA_VLAN_MAX,</span><br><span class="line">.policy= vlan_policy,</span><br><span class="line">.priv_size= <span class="keyword">sizeof</span>(<span class="keyword">struct</span> vlan_dev_priv),</span><br><span class="line">.setup= vlan_setup,</span><br><span class="line">.validate= vlan_validate,</span><br><span class="line">.newlink= vlan_newlink,</span><br><span class="line">.changelink= vlan_changelink,</span><br><span class="line">.dellink= unregister_vlan_dev,</span><br><span class="line">.get_size= vlan_get_size,</span><br><span class="line">.fill_info= vlan_fill_info,</span><br><span class="line">.get_link_net= vlan_get_link_net,</span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line"><span class="type">int</span> __init <span class="title function_">vlan_netlink_init</span><span class="params">(<span class="type">void</span>)</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="keyword">return</span> rtnl_link_register(&amp;vlan_link_ops);</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>In other words, creating a <code>VLAN</code> interface comes down to calling <code>vlan_setup</code>, <code>vlan_newlink</code> and related functions to set up and register the device. To the rest of the system, a <code>VLAN</code> interface is therefore essentially no different from a physical NIC: every packet routed to it is given the <code>VLAN</code> tag and then sent out through the real physical NIC, while packets arriving from other nodes with the same <code>VLAN</code> tag are handed to this interface for processing. If you are interested in the implementation details, see the kernel source directory <code>net/8021q</code>.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span 
class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="type">static</span> <span 
class="type">int</span> <span class="title function_">vlan_newlink</span><span class="params">(<span class="keyword">struct</span> net *src_net, <span class="keyword">struct</span> net_device *dev,</span></span><br><span class="line"><span class="params"><span class="keyword">struct</span> nlattr *tb[], <span class="keyword">struct</span> nlattr *data[],</span></span><br><span class="line"><span class="params"><span class="keyword">struct</span> netlink_ext_ack *extack)</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">vlan_dev_priv</span> *<span class="title">vlan</span> =</span> vlan_dev_priv(dev);</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">net_device</span> *<span class="title">real_dev</span>;</span></span><br><span class="line"><span class="type">unsigned</span> <span class="type">int</span> max_mtu;</span><br><span class="line">__be16 proto;</span><br><span class="line"><span class="type">int</span> err;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!data[IFLA_VLAN_ID]) &#123;</span><br><span class="line">NL_SET_ERR_MSG_MOD(extack, <span class="string">&quot;VLAN id not specified&quot;</span>);</span><br><span class="line"><span class="keyword">return</span> -EINVAL;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!tb[IFLA_LINK]) &#123;</span><br><span class="line">NL_SET_ERR_MSG_MOD(extack, <span class="string">&quot;link not specified&quot;</span>);</span><br><span class="line"><span class="keyword">return</span> -EINVAL;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">real_dev = __dev_get_by_index(src_net, nla_get_u32(tb[IFLA_LINK]));</span><br><span class="line"><span class="keyword">if</span> (!real_dev) &#123;</span><br><span 
class="line">NL_SET_ERR_MSG_MOD(extack, <span class="string">&quot;link does not exist&quot;</span>);</span><br><span class="line"><span class="keyword">return</span> -ENODEV;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (data[IFLA_VLAN_PROTOCOL])</span><br><span class="line">proto = nla_get_be16(data[IFLA_VLAN_PROTOCOL]);</span><br><span class="line"><span class="keyword">else</span></span><br><span class="line">proto = htons(ETH_P_8021Q);</span><br><span class="line"></span><br><span class="line">vlan-&gt;vlan_proto = proto;</span><br><span class="line">vlan-&gt;vlan_id = nla_get_u16(data[IFLA_VLAN_ID]);</span><br><span class="line">vlan-&gt;real_dev = real_dev;</span><br><span class="line">dev-&gt;priv_flags |= (real_dev-&gt;priv_flags &amp; IFF_XMIT_DST_RELEASE);</span><br><span class="line">vlan-&gt;flags = VLAN_FLAG_REORDER_HDR;</span><br><span class="line"></span><br><span class="line">err = vlan_check_real_dev(real_dev, vlan-&gt;vlan_proto, vlan-&gt;vlan_id,</span><br><span class="line">  extack);</span><br><span class="line"><span class="keyword">if</span> (err &lt; <span class="number">0</span>)</span><br><span class="line"><span class="keyword">return</span> err;</span><br><span class="line"></span><br><span class="line">max_mtu = netif_reduces_vlan_mtu(real_dev) ? 
real_dev-&gt;mtu - VLAN_HLEN :</span><br><span class="line">     real_dev-&gt;mtu;</span><br><span class="line"><span class="keyword">if</span> (!tb[IFLA_MTU])</span><br><span class="line">dev-&gt;mtu = max_mtu;</span><br><span class="line"><span class="keyword">else</span> <span class="keyword">if</span> (dev-&gt;mtu &gt; max_mtu)</span><br><span class="line"><span class="keyword">return</span> -EINVAL;</span><br><span class="line"></span><br><span class="line">err = vlan_changelink(dev, tb, data, extack);</span><br><span class="line"><span class="keyword">if</span> (!err)</span><br><span class="line">err = register_vlan_dev(dev, extack);</span><br><span class="line"><span class="keyword">if</span> (err)</span><br><span class="line">vlan_dev_uninit(dev);</span><br><span class="line"><span class="keyword">return</span> err;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Finally, let us look at how a <code>VLAN</code> interface handles data arriving from the physical device. Packets received by the physical NIC are processed by the kernel's softirq handling and eventually reach the core function <code>__netif_receive_skb_core</code> (for the receive path, see <a href="https://sniffer.site/2020/05/12/%E4%BB%8Enapi%E8%AF%B4%E4%B8%80%E8%AF%B4linux%E5%86%85%E6%A0%B8%E6%95%B0%E6%8D%AE%E7%9A%84%E6%8E%A5%E6%94%B6%E6%B5%81%E7%A8%8B/">从NAPI说一说Linux内核数据的接收流程</a>). This function checks whether the protocol type of the Ethernet frame is <code>VLAN</code>; if the frame carries a <code>VLAN</code> tag, the tag is parsed first and the packet is then handed to <code>vlan_do_receive</code>:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span 
class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span 
class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="type">static</span> <span class="type">int</span> __netif_receive_skb_core(<span class="keyword">struct</span> sk_buff **pskb, <span class="type">bool</span> pfmemalloc,</span><br><span class="line">    <span class="keyword">struct</span> packet_type **ppt_prev)</span><br><span class="line">&#123;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">packet_type</span> *<span class="title">ptype</span>, *<span class="title">pt_prev</span>;</span></span><br><span class="line"><span class="type">rx_handler_func_t</span> *rx_handler;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">sk_buff</span> *<span class="title">skb</span> =</span> *pskb;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">net_device</span> *<span class="title">orig_dev</span>;</span></span><br><span class="line"><span class="type">bool</span> deliver_exact = <span class="literal">false</span>;</span><br><span class="line"><span class="type">int</span> ret = NET_RX_DROP;</span><br><span class="line">__be16 type;</span><br><span class="line"></span><br><span class="line">net_timestamp_check(!netdev_tstamp_prequeue, 
skb);</span><br><span class="line"></span><br><span class="line">trace_netif_receive_skb(skb);</span><br><span class="line"></span><br><span class="line">orig_dev = skb-&gt;dev;</span><br><span class="line"></span><br><span class="line">skb_reset_network_header(skb);</span><br><span class="line"><span class="keyword">if</span> (!skb_transport_header_was_set(skb))</span><br><span class="line">skb_reset_transport_header(skb);</span><br><span class="line">skb_reset_mac_len(skb);</span><br><span class="line"></span><br><span class="line">pt_prev = <span class="literal">NULL</span>;</span><br><span class="line"></span><br><span class="line">another_round:</span><br><span class="line">skb-&gt;skb_iif = skb-&gt;dev-&gt;ifindex;</span><br><span class="line"></span><br><span class="line">__this_cpu_inc(softnet_data.processed);</span><br><span class="line"></span><br><span class="line">list_for_each_entry_rcu(ptype, &amp;ptype_all, <span class="built_in">list</span>) &#123;</span><br><span class="line"><span class="keyword">if</span> (pt_prev)</span><br><span class="line">ret = deliver_skb(skb, pt_prev, orig_dev);</span><br><span class="line">pt_prev = ptype;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">list_for_each_entry_rcu(ptype, &amp;skb-&gt;dev-&gt;ptype_all, <span class="built_in">list</span>) &#123;</span><br><span class="line"><span class="keyword">if</span> (pt_prev)</span><br><span class="line">ret = deliver_skb(skb, pt_prev, orig_dev);</span><br><span class="line">pt_prev = ptype;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">// VLAN-tagged frame (802.1Q or 802.1ad)</span></span><br><span class="line"><span class="keyword">if</span> (skb-&gt;protocol == cpu_to_be16(ETH_P_8021Q) ||</span><br><span class="line">    skb-&gt;protocol == cpu_to_be16(ETH_P_8021AD)) &#123;</span><br><span class="line"><span class="comment">// Parse the VLAN tag and move it out of the frame into the skb</span></span><br><span class="line">skb = 
skb_vlan_untag(skb);</span><br><span class="line"><span class="keyword">if</span> (unlikely(!skb))</span><br><span class="line"><span class="keyword">goto</span> out;</span><br><span class="line">&#125;</span><br><span class="line">...</span><br><span class="line"></span><br><span class="line"><span class="comment">// The skb carries a VLAN tag: hand it to the matching VLAN interface</span></span><br><span class="line"><span class="keyword">if</span> (skb_vlan_tag_present(skb)) &#123;</span><br><span class="line"><span class="keyword">if</span> (pt_prev) &#123;</span><br><span class="line">ret = deliver_skb(skb, pt_prev, orig_dev);</span><br><span class="line">pt_prev = <span class="literal">NULL</span>;</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">if</span> (vlan_do_receive(&amp;skb))</span><br><span class="line"><span class="keyword">goto</span> another_round;</span><br><span class="line"><span class="keyword">else</span> <span class="keyword">if</span> (unlikely(!skb))</span><br><span class="line"><span class="keyword">goto</span> out;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">rx_handler = rcu_dereference(skb-&gt;dev-&gt;rx_handler);</span><br><span class="line"><span class="keyword">if</span> (rx_handler) &#123;</span><br><span class="line"><span class="keyword">if</span> (pt_prev) &#123;</span><br><span class="line">ret = deliver_skb(skb, pt_prev, orig_dev);</span><br><span class="line">pt_prev = <span class="literal">NULL</span>;</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">switch</span> (rx_handler(&amp;skb)) &#123;</span><br><span class="line"><span class="keyword">case</span> RX_HANDLER_CONSUMED:</span><br><span class="line">ret = NET_RX_SUCCESS;</span><br><span class="line"><span class="keyword">goto</span> out;</span><br><span class="line"><span class="keyword">case</span> RX_HANDLER_ANOTHER:</span><br><span class="line"><span class="keyword">goto</span> 
another_round;</span><br><span class="line"><span class="keyword">case</span> RX_HANDLER_EXACT:</span><br><span class="line">deliver_exact = <span class="literal">true</span>;</span><br><span class="line"><span class="keyword">case</span> RX_HANDLER_PASS:</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line"><span class="keyword">default</span>:</span><br><span class="line">BUG();</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line">...</span><br><span class="line"></span><br><span class="line">out:</span><br><span class="line"><span class="comment">/* The invariant here is that if *ppt_prev is not NULL</span></span><br><span class="line"><span class="comment"> * then skb should also be non-NULL.</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> * Apparently *ppt_prev assignment above holds this invariant due to</span></span><br><span class="line"><span class="comment"> * skb dereferencing near it.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line">*pskb = skb;</span><br><span class="line"><span class="keyword">return</span> ret;</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>The function <code>vlan_do_receive</code> (<code>net/8021q/vlan_core.c</code>) first looks up the matching <code>VLAN</code> interface by protocol type and <code>VLAN ID</code> and assigns it to the <code>dev</code> field of the <code>struct sk_buff</code>. It then calls <code>__vlan_hwaccel_clear_tag</code> to clear the <code>VLAN</code> tag and updates the receive statistics of the <code>VLAN</code> device:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span 
class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="type">bool</span> <span class="title function_">vlan_do_receive</span><span class="params">(<span 
class="keyword">struct</span> sk_buff **skbp)</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">sk_buff</span> *<span class="title">skb</span> =</span> *skbp;</span><br><span class="line">__be16 vlan_proto = skb-&gt;vlan_proto;</span><br><span class="line">u16 vlan_id = skb_vlan_tag_get_id(skb);</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">net_device</span> *<span class="title">vlan_dev</span>;</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">vlan_pcpu_stats</span> *<span class="title">rx_stats</span>;</span></span><br><span class="line"></span><br><span class="line">vlan_dev = vlan_find_dev(skb-&gt;dev, vlan_proto, vlan_id);</span><br><span class="line"><span class="keyword">if</span> (!vlan_dev)</span><br><span class="line"><span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line"></span><br><span class="line">skb = *skbp = skb_share_check(skb, GFP_ATOMIC);</span><br><span class="line"><span class="keyword">if</span> (unlikely(!skb))</span><br><span class="line"><span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (unlikely(!(vlan_dev-&gt;flags &amp; IFF_UP))) &#123;</span><br><span class="line">kfree_skb(skb);</span><br><span class="line">*skbp = <span class="literal">NULL</span>;</span><br><span class="line"><span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">skb-&gt;dev = vlan_dev;</span><br><span class="line"><span class="keyword">if</span> (unlikely(skb-&gt;pkt_type == PACKET_OTHERHOST)) &#123;</span><br><span class="line"><span class="comment">/* Our lower layer thinks this is 
not local, let&#x27;s make sure.</span></span><br><span class="line"><span class="comment"> * This allows the VLAN to have a different MAC than the</span></span><br><span class="line"><span class="comment"> * underlying device, and still route correctly. */</span></span><br><span class="line"><span class="keyword">if</span> (ether_addr_equal_64bits(eth_hdr(skb)-&gt;h_dest, vlan_dev-&gt;dev_addr))</span><br><span class="line">skb-&gt;pkt_type = PACKET_HOST;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!(vlan_dev_priv(vlan_dev)-&gt;flags &amp; VLAN_FLAG_REORDER_HDR) &amp;&amp;</span><br><span class="line">    !netif_is_macvlan_port(vlan_dev) &amp;&amp;</span><br><span class="line">    !netif_is_bridge_port(vlan_dev)) &#123;</span><br><span class="line"><span class="type">unsigned</span> <span class="type">int</span> offset = skb-&gt;data - skb_mac_header(skb);</span><br><span class="line"></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * vlan_insert_tag expect skb-&gt;data pointing to mac header.</span></span><br><span class="line"><span class="comment"> * So change skb-&gt;data before calling it and change back to</span></span><br><span class="line"><span class="comment"> * original position later</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line">skb_push(skb, offset);</span><br><span class="line">skb = *skbp = vlan_insert_inner_tag(skb, skb-&gt;vlan_proto,</span><br><span class="line">    skb-&gt;vlan_tci, skb-&gt;mac_len);</span><br><span class="line"><span class="keyword">if</span> (!skb)</span><br><span class="line"><span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">skb_pull(skb, offset + VLAN_HLEN);</span><br><span class="line">skb_reset_mac_len(skb);</span><br><span class="line">&#125;</span><br><span 
class="line"></span><br><span class="line">skb-&gt;priority = vlan_get_ingress_priority(vlan_dev, skb-&gt;vlan_tci);</span><br><span class="line">__vlan_hwaccel_clear_tag(skb);</span><br><span class="line"></span><br><span class="line">rx_stats = this_cpu_ptr(vlan_dev_priv(vlan_dev)-&gt;vlan_pcpu_stats);</span><br><span class="line"></span><br><span class="line">u64_stats_update_begin(&amp;rx_stats-&gt;syncp);</span><br><span class="line">rx_stats-&gt;rx_packets++;</span><br><span class="line">rx_stats-&gt;rx_bytes += skb-&gt;len;</span><br><span class="line"><span class="keyword">if</span> (skb-&gt;pkt_type == PACKET_MULTICAST)</span><br><span class="line">rx_stats-&gt;rx_multicast++;</span><br><span class="line">u64_stats_update_end(&amp;rx_stats-&gt;syncp);</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> <span class="literal">true</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><h2 id="总结"><a href="#总结" class="headerlink" title="总结"></a><strong>总结</strong></h2><p>虚拟局域网<code>VLAN</code>可以不用更改现有的物理网络拓扑结构，实现网络的虚拟分割，从而方便的实现网络流量的隔离；<code>VLAN</code>可以用来划分公司的内部网络，也可以用来对局域网的流量进行优先级控制。但<code>VLAN</code>最多只能有<code>4096</code>个划分，比较适合小型的局域网络，为了解决<code>VLAN</code>的这一问题，后来又出现了<a href="https://info.support.huawei.com/info-finder/encyclopedia/zh/VXLAN.html"><code>VXLAN(Virtual Extensible LAN)</code></a>可以支持更大型的网络扩展与分割。</p><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a><strong>参考资料</strong></h2><ul><li><a href="https://dev.jmgilman.com/networking/concepts/switching/vlan/">https://dev.jmgilman.com/networking/concepts/switching/vlan/</a></li><li><a href="https://wiki.archlinux.org/title/VLAN">https://wiki.archlinux.org/title/VLAN</a></li><li><a href="https://www.redhat.com/en/blog/vlans-configuration">https://www.redhat.com/en/blog/vlans-configuration</a></li><li><a 
href="https://info.support.huawei.com/info-finder/encyclopedia/zh/VLAN.html">https://info.support.huawei.com/info-finder/encyclopedia/zh/VLAN.html</a></li><li><a href="https://support.huawei.com/enterprise/en/doc/EDOC1100174721/75653f56/vlan-frame-format">https://support.huawei.com/enterprise/en/doc/EDOC1100174721/75653f56/vlan-frame-format</a></li></ul>]]></content>
    
    
    <summary type="html">&lt;p&gt;&lt;code&gt;VLAN(Virtual Local-Area-Network)&lt;/code&gt;虚拟局域网，用于将一个物理局域网(&lt;code&gt;LAN&lt;/code&gt;)在逻辑上分割为多个独立虚拟的广播域；每个&lt;code&gt;VLAN&lt;/code&gt;都对应一个广播域，可以直接通讯，而不同&lt;code&gt;VLAN&lt;/code&gt;的主机则无法直接互通，这样广播报文就限定在一个固定的&lt;code&gt;VLAN&lt;/code&gt;内。&lt;code&gt;VLAN&lt;/code&gt;工作在网络协议栈的数据链路层(&lt;code&gt;L2&lt;/code&gt;)，通过在网络数据报文中增加一个额外的&lt;code&gt;VLAN&lt;/code&gt;标签，从而让同一个物理局域网的流量可以像多个物理局域网一样分隔开来；另外，我们也可以利用&lt;code&gt;VLAN&lt;/code&gt;中的优先级标签来保证局域网中的高优先级流量可以更低延迟的进行传输，从而提升整个网络的传输质量。这篇文章，主要从两个方面介绍下&lt;code&gt;VLAN&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;首先介绍下如何创建、配置&lt;code&gt;VLAN&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;其次基于数据报文分析下&lt;code&gt;VLAN&lt;/code&gt;是如何在Linux内核中实现的&lt;/li&gt;
&lt;/ul&gt;</summary>
    
    
    
    <category term="网络协议" scheme="https://sniffer.site/categories/%E7%BD%91%E7%BB%9C%E5%8D%8F%E8%AE%AE/"/>
    
    
    <category term="VLAN" scheme="https://sniffer.site/tags/VLAN/"/>
    
  </entry>
  
  <entry>
    <title>A Case of High CPU steal-time</title>
    <link href="https://sniffer.site/2025/05/07/%E4%B8%80%E4%B8%AACPU-steal-time%E9%AB%98%E7%9A%84%E9%97%AE%E9%A2%98/"/>
    <id>https://sniffer.site/2025/05/07/%E4%B8%80%E4%B8%AACPU-steal-time%E9%AB%98%E7%9A%84%E9%97%AE%E9%A2%98/</id>
    <published>2025-05-07T03:51:49.000Z</published>
    <updated>2025-05-20T05:46:51.600Z</updated>
    
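    <content type="html"><![CDATA[<p>A couple of days ago I ran into a problem on a project built on a <code>QNX</code>-based virtualization platform. A colleague reported that the system was very sluggish: taps on the UI responded with a noticeable delay and the stutter was severe. Checking the <code>Android</code> system load with <code>top</code> showed roughly <code>20%</code> idle. User, kernel, and interrupt time all looked normal, but one field, <code>%host</code>, was unusually high, peaking above <code>60%</code>. What does this <code>host</code> figure mean? Starting from this issue, this post analyzes high <code>host</code> usage on a virtualization platform and how <code>host</code> usage is computed on the <code>KVM</code> virtualization platform.  <span id="more"></span></p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">800%cpu  45%user   4%<span class="built_in">nice</span>  126%sys 141%idle   3%iow  41%irq   16%sirq   425%host</span><br><span class="line"></span><br></pre></td></tr></table></figure><h2 id="问题定位与排查"><a href="#问题定位与排查" class="headerlink" title="问题定位与排查"></a><strong>Locating and Troubleshooting the Problem</strong></h2><p>Entering the device through <code>adb shell</code> and running <code>top</code> to check the overall state showed that plenty of memory was still free, while CPU idle was below <code>30%</code>. The bulk of the CPU time went to <code>%host</code>.</p><p><img src="https://md-files.oss-cn-shenzhen.aliyuncs.com/android-steal-time-high-0.png" alt="top-steal-time-hight"></p><p>I first asked <code>DeepSeek</code> what the <code>host</code> figure in <code>top</code> actually means. <code>DeepSeek</code> answered quickly:</p><blockquote><p>In the Linux top command, "host CPU usage" (usually the %host or st field) is a performance metric specific to virtualized environments. It reflects the percentage of CPU time during which the virtual machine was preempted by the host (hypervisor).</p></blockquote><p>In other words, a high <code>host</code> value may point to a problem on the host side of the virtualization platform, such as heavy load or an unreasonable CPU configuration (too few resources given to the guest), so that the <code>Android</code> guest cannot get onto a CPU and stays suspended, waiting. How should this kind of problem be investigated? I then asked <code>DeepSeek</code> how to troubleshoot high steal CPU usage, and it offered several possible explanations:</p><p><img src="https://md-files.oss-cn-shenzhen.aliyuncs.com/deep-seek-steal-time-explanation.png" alt="deep-seek CPU steal-time reason"></p><p>Following this lead, I checked the overall load on the device with <code>mpstat -P ALL</code> and saw the same pattern: the <code>%steal</code> column shows that a large share of <code>Android</code>'s CPU time is spent waiting on the host. Summing <code>%steal</code> across the cores gives the <code>%host</code> value reported by <code>top</code>:</p><p><img 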
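src="https://md-files.oss-cn-shenzhen.aliyuncs.com/android-steal-time-high-1.png" alt="cpu steal-time high of mpstat"></p><p>Based on this data, we suspected a problem on the <code>QNX</code> host side (earlier builds had no such issue). After syncing with the developers, we confirmed that the <code>CPU</code> allocation for the virtual machine was fine (the system has a single <code>Android</code> guest, which can use all physical CPUs), so insufficient resources could largely be ruled out. Had a recent change introduced the problem?</p><p>The developers reported only two changes: an update to the vehicle-control signals, and a new <code>PPS</code> node added on <code>QNX</code>. Updating the vehicle-control signal matrix should not affect system load much. The existing data also showed that no <code>Android</code> process had a high load, and the test bench carries few vehicle-control signals anyway, so that change could be excluded. The new <code>PPS</code> node was therefore the likely cause: it pushed the <code>QNX</code> load too high, which left <code>Android</code> unable to get CPU time. In the end the developers checked the related processes and found that one <code>PPS</code> node was misconfigured. A process was writing data at a high frequency, driving the <code>QNX</code> load up (<code>QNX</code> <code>idle</code> was close to 0), so the <code>Android</code> guest could not obtain enough CPU resources.</p><blockquote><p>PPS (Persistent Publish-Subscribe) is a mechanism QNX provides for inter-process communication</p></blockquote><p>That closed the issue itself. To understand this class of problem better, and to be able to locate and analyze similar issues quickly in the future, I decided to dig into the code and see how <code>steal time</code> is actually computed on a virtualization platform.</p><h2 id="虚拟化中的CPU-steal-time"><a href="#虚拟化中的CPU-steal-time" class="headerlink" title="虚拟化中的CPU steal-time"></a><strong>CPU steal-time in Virtualization</strong></h2><p>As the name "stolen time" (<code>steal-time</code>) suggests, this is not CPU time consumed by the guest executing its own instructions. It is time the guest spends waiting for the host to give it a CPU, while the host is either serving other guests or busy with its own work.</p><blockquote><p>Steal time is the percentage of time a virtual CPU waits for a real CPU while the hypervisor is servicing another virtual processor (or host itself).</p></blockquote><p>Both the <code>%host</code> shown by <code>top</code> and the <code>%steal</code> shown by <code>mpstat</code> come from the CPU usage data the kernel exposes in <code>/proc/stat</code>. In the <code>Android</code> source file <code>external/toybox/toys/posix/ps.c</code>, <code>top</code> reads the first eight values of the <code>cpu</code> line, and the eighth one is the <code>steal-time</code> (the kernel prints <code>guest</code> and <code>guest_nice</code> after it). Let's continue with the kernel code.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span 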
class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="type">static</span> <span class="type">void</span> <span class="title function_">top_common</span><span class="params">(</span></span><br><span class="line"><span class="params">  <span class="type">int</span> (*filter)(<span class="type">long</span> <span class="type">long</span> *oslot, <span class="type">long</span> <span class="type">long</span> *nslot, <span class="type">int</span> milis))</span></span><br><span class="line">&#123;</span><br><span class="line">  <span class="type">long</span> <span class="type">long</span> timeout = <span class="number">0</span>, now, stats[<span class="number">16</span>];</span><br><span class="line">  <span class="class"><span class="keyword">struct</span> <span class="title">proclist</span> &#123;</span></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">procpid</span> **<span class="title">tb</span>;</span></span><br><span class="line">    <span class="type">int</span> count;</span><br><span 
class="line">    <span class="type">long</span> <span class="type">long</span> whence;</span><br><span class="line">  &#125; plist[<span class="number">2</span>], *plold, *plnew, old, new, mix;</span><br><span class="line">  <span class="type">char</span> scratch[<span class="number">16</span>], *pos, *cpufields[] = &#123;<span class="string">&quot;user&quot;</span>, <span class="string">&quot;nice&quot;</span>, <span class="string">&quot;sys&quot;</span>, <span class="string">&quot;idle&quot;</span>,</span><br><span class="line">    <span class="string">&quot;iow&quot;</span>, <span class="string">&quot;irq&quot;</span>, <span class="string">&quot;sirq&quot;</span>, <span class="string">&quot;host&quot;</span>&#125;; <span class="comment">// labels used when printing CPU usage</span></span><br><span class="line">  ....</span><br><span class="line"></span><br><span class="line">  <span class="built_in">memset</span>(plist, <span class="number">0</span>, <span class="keyword">sizeof</span>(plist));</span><br><span class="line">  <span class="built_in">memset</span>(stats, <span class="number">0</span>, <span class="keyword">sizeof</span>(stats));</span><br><span class="line"></span><br><span class="line">  <span class="keyword">do</span> &#123;</span><br><span class="line">    </span><br><span class="line">    <span class="comment">// read the CPU usage data; the eighth value (shown as host) is steal-time</span></span><br><span class="line">    <span class="keyword">if</span> (readfile(<span class="string">&quot;/proc/stat&quot;</span>, pos = toybuf, <span class="keyword">sizeof</span>(toybuf))) &#123;</span><br><span class="line">      <span class="type">long</span> <span class="type">long</span> *st = stats+<span class="number">8</span>*(tock&amp;<span class="number">1</span>);</span><br><span class="line"></span><br><span class="line">      <span class="comment">// user nice system idle iowait irq softirq host</span></span><br><span class="line">      <span class="built_in">sscanf</span>(pos, <span class="string">&quot;cpu %lld %lld 
%lld %lld %lld %lld %lld %lld&quot;</span>,</span><br><span class="line">        st, st+<span class="number">1</span>, st+<span class="number">2</span>, st+<span class="number">3</span>, st+<span class="number">4</span>, st+<span class="number">5</span>, st+<span class="number">6</span>, st+<span class="number">7</span>);</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    ...</span><br><span class="line">  &#125; <span class="keyword">while</span> (!done);</span><br><span class="line"></span><br><span class="line">  <span class="keyword">if</span> (!FLAG(b)) tty_reset();</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>In the kernel, the contents of <code>/proc/stat</code> are produced in <code>fs/proc/stat.c</code>. As the code shows, <code>steal-time</code> is read from the <code>CPUTIME_STEAL</code> entry of the <code>cpustat</code> array inside the <code>per-cpu</code> structure <code>kernel_cpustat</code>:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span 
class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span 
class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br><span class="line">101</span><br><span class="line">102</span><br><span class="line">103</span><br><span class="line">104</span><br><span class="line">105</span><br><span class="line">106</span><br><span class="line">107</span><br><span class="line">108</span><br><span class="line">109</span><br><span class="line">110</span><br><span class="line">111</span><br><span class="line">112</span><br><span class="line">113</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="type">static</span> <span class="type">int</span> <span class="title function_">show_stat</span><span class="params">(<span class="keyword">struct</span> seq_file *p, <span class="type">void</span> *v)</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="type">int</span> i, j;</span><br><span class="line">u64 user, nice, system, idle, iowait, irq, softirq, steal;</span><br><span class="line">u64 guest, guest_nice;</span><br><span class="line">u64 sum = <span class="number">0</span>;</span><br><span class="line">u64 sum_softirq = <span class="number">0</span>;</span><br><span class="line"><span class="type">unsigned</span> <span class="type">int</span> per_softirq_sums[NR_SOFTIRQS] = &#123;<span class="number">0</span>&#125;;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">timespec64</span> <span class="title">boottime</span>;</span></span><br><span class="line"></span><br><span class="line">user = nice = system = idle = iowait =</span><br><span class="line">irq = softirq = steal = <span class="number">0</span>;</span><br><span class="line">guest = guest_nice = <span 
class="number">0</span>;</span><br><span class="line">getboottime64(&amp;boottime);</span><br><span class="line"></span><br><span class="line">for_each_possible_cpu(i) &#123;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">kernel_cpustat</span> <span class="title">kcpustat</span>;</span></span><br><span class="line">u64 *cpustat = kcpustat.cpustat;</span><br><span class="line"></span><br><span class="line">kcpustat_cpu_fetch(&amp;kcpustat, i);</span><br><span class="line"></span><br><span class="line">user+= cpustat[CPUTIME_USER];</span><br><span class="line">nice+= cpustat[CPUTIME_NICE];</span><br><span class="line">system+= cpustat[CPUTIME_SYSTEM];</span><br><span class="line">idle+= get_idle_time(&amp;kcpustat, i);</span><br><span class="line">iowait+= get_iowait_time(&amp;kcpustat, i);</span><br><span class="line">irq+= cpustat[CPUTIME_IRQ];</span><br><span class="line">softirq+= cpustat[CPUTIME_SOFTIRQ];</span><br><span class="line"><span class="comment">// accumulate steal-time across CPUs</span></span><br><span class="line">steal+= cpustat[CPUTIME_STEAL];</span><br><span class="line">guest+= cpustat[CPUTIME_GUEST];</span><br><span class="line">guest_nice+= cpustat[CPUTIME_GUEST_NICE];</span><br><span class="line">sum+= kstat_cpu_irqs_sum(i);</span><br><span class="line">sum+= arch_irq_stat_cpu(i);</span><br><span class="line"></span><br><span class="line"><span class="keyword">for</span> (j = <span class="number">0</span>; j &lt; NR_SOFTIRQS; j++) &#123;</span><br><span class="line"><span class="type">unsigned</span> <span class="type">int</span> softirq_stat = kstat_softirqs_cpu(j, i);</span><br><span class="line"></span><br><span class="line">per_softirq_sums[j] += softirq_stat;</span><br><span class="line">sum_softirq += softirq_stat;</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line">sum += arch_irq_stat();</span><br><span class="line"></span><br><span 
class="line">seq_put_decimal_ull(p, <span class="string">&quot;cpu  &quot;</span>, <span class="type">nsec_to_clock_t</span>(user));</span><br><span class="line">seq_put_decimal_ull(p, <span class="string">&quot; &quot;</span>, <span class="type">nsec_to_clock_t</span>(nice));</span><br><span class="line">seq_put_decimal_ull(p, <span class="string">&quot; &quot;</span>, <span class="type">nsec_to_clock_t</span>(system));</span><br><span class="line">seq_put_decimal_ull(p, <span class="string">&quot; &quot;</span>, <span class="type">nsec_to_clock_t</span>(idle));</span><br><span class="line">seq_put_decimal_ull(p, <span class="string">&quot; &quot;</span>, <span class="type">nsec_to_clock_t</span>(iowait));</span><br><span class="line">seq_put_decimal_ull(p, <span class="string">&quot; &quot;</span>, <span class="type">nsec_to_clock_t</span>(irq));</span><br><span class="line">seq_put_decimal_ull(p, <span class="string">&quot; &quot;</span>, <span class="type">nsec_to_clock_t</span>(softirq));</span><br><span class="line">seq_put_decimal_ull(p, <span class="string">&quot; &quot;</span>, <span class="type">nsec_to_clock_t</span>(steal));</span><br><span class="line">seq_put_decimal_ull(p, <span class="string">&quot; &quot;</span>, <span class="type">nsec_to_clock_t</span>(guest));</span><br><span class="line">seq_put_decimal_ull(p, <span class="string">&quot; &quot;</span>, <span class="type">nsec_to_clock_t</span>(guest_nice));</span><br><span class="line">seq_putc(p, <span class="string">&#x27;\n&#x27;</span>);</span><br><span class="line"></span><br><span class="line">for_each_online_cpu(i) &#123;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">kernel_cpustat</span> <span class="title">kcpustat</span>;</span></span><br><span class="line">u64 *cpustat = kcpustat.cpustat;</span><br><span class="line"></span><br><span class="line">kcpustat_cpu_fetch(&amp;kcpustat, i);</span><br><span 
class="line"></span><br><span class="line"><span class="comment">/* Copy values here to work around gcc-2.95.3, gcc-2.96 */</span></span><br><span class="line">user= cpustat[CPUTIME_USER];</span><br><span class="line">nice= cpustat[CPUTIME_NICE];</span><br><span class="line">system= cpustat[CPUTIME_SYSTEM];</span><br><span class="line">idle= get_idle_time(&amp;kcpustat, i);</span><br><span class="line">iowait= get_iowait_time(&amp;kcpustat, i);</span><br><span class="line">irq= cpustat[CPUTIME_IRQ];</span><br><span class="line">softirq= cpustat[CPUTIME_SOFTIRQ];</span><br><span class="line">steal= cpustat[CPUTIME_STEAL];</span><br><span class="line">guest= cpustat[CPUTIME_GUEST];</span><br><span class="line">guest_nice= cpustat[CPUTIME_GUEST_NICE];</span><br><span class="line">seq_printf(p, <span class="string">&quot;cpu%d&quot;</span>, i);</span><br><span class="line">seq_put_decimal_ull(p, <span class="string">&quot; &quot;</span>, <span class="type">nsec_to_clock_t</span>(user));</span><br><span class="line">seq_put_decimal_ull(p, <span class="string">&quot; &quot;</span>, <span class="type">nsec_to_clock_t</span>(nice));</span><br><span class="line">seq_put_decimal_ull(p, <span class="string">&quot; &quot;</span>, <span class="type">nsec_to_clock_t</span>(system));</span><br><span class="line">seq_put_decimal_ull(p, <span class="string">&quot; &quot;</span>, <span class="type">nsec_to_clock_t</span>(idle));</span><br><span class="line">seq_put_decimal_ull(p, <span class="string">&quot; &quot;</span>, <span class="type">nsec_to_clock_t</span>(iowait));</span><br><span class="line">seq_put_decimal_ull(p, <span class="string">&quot; &quot;</span>, <span class="type">nsec_to_clock_t</span>(irq));</span><br><span class="line">seq_put_decimal_ull(p, <span class="string">&quot; &quot;</span>, <span class="type">nsec_to_clock_t</span>(softirq));</span><br><span class="line">seq_put_decimal_ull(p, <span class="string">&quot; &quot;</span>, <span 
class="type">nsec_to_clock_t</span>(steal));</span><br><span class="line">seq_put_decimal_ull(p, <span class="string">&quot; &quot;</span>, <span class="type">nsec_to_clock_t</span>(guest));</span><br><span class="line">seq_put_decimal_ull(p, <span class="string">&quot; &quot;</span>, <span class="type">nsec_to_clock_t</span>(guest_nice));</span><br><span class="line">seq_putc(p, <span class="string">&#x27;\n&#x27;</span>);</span><br><span class="line">&#125;</span><br><span class="line">seq_put_decimal_ull(p, <span class="string">&quot;intr &quot;</span>, (<span class="type">unsigned</span> <span class="type">long</span> <span class="type">long</span>)sum);</span><br><span class="line"></span><br><span class="line">show_all_irqs(p);</span><br><span class="line"></span><br><span class="line">seq_printf(p,</span><br><span class="line"><span class="string">&quot;\nctxt %llu\n&quot;</span></span><br><span class="line"><span class="string">&quot;btime %llu\n&quot;</span></span><br><span class="line"><span class="string">&quot;processes %lu\n&quot;</span></span><br><span class="line"><span class="string">&quot;procs_running %lu\n&quot;</span></span><br><span class="line"><span class="string">&quot;procs_blocked %lu\n&quot;</span>,</span><br><span class="line">nr_context_switches(),</span><br><span class="line">(<span class="type">unsigned</span> <span class="type">long</span> <span class="type">long</span>)boottime.tv_sec,</span><br><span class="line">total_forks,</span><br><span class="line">nr_running(),</span><br><span class="line">nr_iowait());</span><br><span class="line"></span><br><span class="line">seq_put_decimal_ull(p, <span class="string">&quot;softirq &quot;</span>, (<span class="type">unsigned</span> <span class="type">long</span> <span class="type">long</span>)sum_softirq);</span><br><span class="line"></span><br><span class="line"><span class="keyword">for</span> (i = <span class="number">0</span>; i &lt; NR_SOFTIRQS; i++)</span><br><span 
class="line">seq_put_decimal_ull(p, <span class="string">&quot; &quot;</span>, per_softirq_sums[i]);</span><br><span class="line">seq_putc(p, <span class="string">&#x27;\n&#x27;</span>);</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>内核中统计各个维度的CPU占用数据在<code>kernel/sched/cputime.c</code>统一实现了相关的接口，只有开启了内核配置<code>CONFIG_PARAVIRT</code>的虚拟化平台中才会计算<code>steam-time</code>时间，其他的则直接返回<code>0</code>：通过函数<code>paravirt_steal_clock</code>获取对应CPU的客户机系统的<code>steal</code>时间。</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span 
class="comment">/*</span></span><br><span class="line"><span class="comment"> * When a guest is interrupted for a longer amount of time, missed clock</span></span><br><span class="line"><span class="comment"> * ticks are not redelivered later. Due to that, this function may on</span></span><br><span class="line"><span class="comment"> * occasion account more time than the calling functions think elapsed.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="type">static</span> __always_inline u64 <span class="title function_">steal_account_process_time</span><span class="params">(u64 maxtime)</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="meta">#<span class="keyword">ifdef</span> CONFIG_PARAVIRT</span></span><br><span class="line"><span class="keyword">if</span> (static_key_false(&amp;paravirt_steal_enabled)) &#123;</span><br><span class="line">u64 steal;</span><br><span class="line"></span><br><span class="line">steal = paravirt_steal_clock(smp_processor_id());</span><br><span class="line">steal -= this_rq()-&gt;prev_steal_time;</span><br><span class="line">steal = min(steal, maxtime);</span><br><span class="line">account_steal_time(steal);</span><br><span class="line">this_rq()-&gt;prev_steal_time += steal;</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> steal;</span><br><span class="line">&#125;</span><br><span class="line"><span class="meta">#<span class="keyword">endif</span></span></span><br><span class="line"><span class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * Account for involuntary wait time.</span></span><br><span class="line"><span class="comment"> * @cputime: the CPU time spent in involuntary 
wait</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="type">void</span> <span class="title function_">account_steal_time</span><span class="params">(u64 cputime)</span></span><br><span class="line">&#123;</span><br><span class="line">u64 *cpustat = kcpustat_this_cpu-&gt;cpustat;</span><br><span class="line"></span><br><span class="line">cpustat[CPUTIME_STEAL] += cputime;</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>The function <code>paravirt_steal_clock</code> is defined in <code>arch/arm64/include/asm/paravirt.h</code>. It ultimately obtains the guest's <code>steal</code> time by calling the <code>steal_clock</code> function pointer in the structure <code>pv_time_ops</code>.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="meta">#<span class="keyword">ifdef</span> CONFIG_PARAVIRT</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">pv_time_ops</span> &#123;</span></span><br><span class="line"><span class="type">unsigned</span> <span class="type">long</span> <span class="title 
function_">long</span> <span class="params">(*steal_clock)</span><span class="params">(<span class="type">int</span> cpu)</span>;  <span class="comment">// the function that actually fetches the steal time</span></span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">paravirt_patch_template</span> &#123;</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">pv_time_ops</span> <span class="title">time</span>;</span></span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line"><span class="keyword">extern</span> <span class="class"><span class="keyword">struct</span> <span class="title">paravirt_patch_template</span> <span class="title">pv_ops</span>;</span></span><br><span class="line"></span><br><span class="line"><span class="type">static</span> <span class="keyword">inline</span> u64 <span class="title function_">paravirt_steal_clock</span><span class="params">(<span class="type">int</span> cpu)</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="keyword">return</span> pv_ops.time.steal_clock(cpu);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="type">int</span> __init <span class="title function_">pv_time_init</span><span class="params">(<span class="type">void</span>)</span>;</span><br><span class="line"></span><br><span class="line"><span class="meta">#<span class="keyword">else</span></span></span><br><span class="line"></span><br><span class="line"><span class="meta">#<span class="keyword">define</span> pv_time_init() do &#123;&#125; while (0)</span></span><br><span class="line"></span><br><span class="line"><span class="meta">#<span class="keyword">endif</span> <span class="comment">// CONFIG_PARAVIRT</span></span></span><br><span 
class="line"></span><br></pre></td></tr></table></figure><p>On a virtualized platform, the kernel sets up <code>pv_ops</code> during initialization through <code>pv_time_init</code> (<code>arch/arm64/kernel/paravirt.c</code>):</p><ul><li><code>pv_time_init_stolen_time</code>: sets up the memory region that holds the <code>steal</code>-time data (shared between the host <code>hypervisor</code> and the guest)</li><li><code>pv_ops.time.steal_clock</code>: assigns the function pointer through which the rest of the kernel reads <code>steal-time</code></li></ul><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="type">int</span> __init <span class="title function_">pv_time_init</span><span class="params">(<span class="type">void</span>)</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="type">int</span> ret;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!has_pv_steal_clock())</span><br><span class="line"><span class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line"></span><br><span class="line">ret = pv_time_init_stolen_time();</span><br><span class="line"><span class="keyword">if</span> (ret)</span><br><span class="line"><span class="keyword">return</span> ret;</span><br><span 
class="line"></span><br><span class="line">pv_ops.time.steal_clock = pv_steal_clock;</span><br><span class="line"></span><br><span class="line">static_key_slow_inc(&amp;paravirt_steal_enabled);</span><br><span class="line"><span class="keyword">if</span> (steal_acc)</span><br><span class="line">static_key_slow_inc(&amp;paravirt_steal_rq_enabled);</span><br><span class="line"></span><br><span class="line">pr_info(<span class="string">&quot;using stolen time PV\n&quot;</span>);</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>The function <code>pv_time_init_stolen_time</code> registers a <code>CPU</code> hotplug callback; when a <code>CPU</code> comes <code>online</code>, the callback <code>stolen_time_cpu_online</code> runs and initializes the <code>steal-time</code> state for that CPU:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="type">static</span> <span class="type">int</span> __init <span class="title function_">pv_time_init_stolen_time</span><span class="params">(<span class="type">void</span>)</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="type">int</span> ret;</span><br><span class="line"></span><br><span class="line">ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,</span><br><span class="line"><span 
class="string">&quot;hypervisor/arm/pvtime:online&quot;</span>,</span><br><span class="line">stolen_time_cpu_online,</span><br><span class="line">stolen_time_cpu_down_prepare);</span><br><span class="line"><span class="keyword">if</span> (ret &lt; <span class="number">0</span>)</span><br><span class="line"><span class="keyword">return</span> ret;</span><br><span class="line"><span class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>The callback <code>stolen_time_cpu_online</code> lets the guest and the <code>hypervisor</code> agree on a fixed memory region through which <code>steal-time</code> is exchanged. It works in two steps:</p><ul><li>First, it issues an <code>HVC</code> trap instruction (used to switch between Exception Levels, EL), taking the guest from EL1 (kernel mode) into EL2 (hypervisor mode) to obtain the address where the host stores <code>steal-time</code>; the address is returned to the guest in <code>arm_smccc_res</code></li><li>The kernel then maps the returned address into its own address space, so that kernel code can access this region and read <code>steal-time</code> from it</li></ul><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br></pre></td><td 
class="code"><pre><span class="line"></span><br><span class="line"><span class="type">static</span> <span class="type">int</span> <span class="title function_">stolen_time_cpu_online</span><span class="params">(<span class="type">unsigned</span> <span class="type">int</span> cpu)</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">pv_time_stolen_time_region</span> *<span class="title">reg</span>;</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">arm_smccc_res</span> <span class="title">res</span>;</span></span><br><span class="line"></span><br><span class="line">reg = this_cpu_ptr(&amp;stolen_time_region);</span><br><span class="line"></span><br><span class="line">arm_smccc_1_1_invoke(ARM_SMCCC_HV_PV_TIME_ST, &amp;res);</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (res.a0 == SMCCC_RET_NOT_SUPPORTED)</span><br><span class="line"><span class="keyword">return</span> -EINVAL;</span><br><span class="line"></span><br><span class="line">reg-&gt;kaddr = memremap(res.a0,</span><br><span class="line">      <span class="keyword">sizeof</span>(<span class="keyword">struct</span> pvclock_vcpu_stolen_time),</span><br><span class="line">      MEMREMAP_WB);</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!reg-&gt;kaddr) &#123;</span><br><span class="line">pr_warn(<span class="string">&quot;Failed to map stolen time data structure\n&quot;</span>);</span><br><span class="line"><span class="keyword">return</span> -ENOMEM;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (le32_to_cpu(reg-&gt;kaddr-&gt;revision) != <span class="number">0</span> ||</span><br><span class="line">    le32_to_cpu(reg-&gt;kaddr-&gt;attributes) != <span class="number">0</span>) 
&#123;</span><br><span class="line">pr_warn_once(<span class="string">&quot;Unexpected revision or attributes in stolen time data\n&quot;</span>);</span><br><span class="line"><span class="keyword">return</span> -ENXIO;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Taking Linux's virtualization solution <code>KVM (Kernel-based Virtual Machine)</code> as an example, let's look at how the hypervisor side computes <code>steal-time</code> and hands it to the guest. When <code>KVM</code> receives the <code>HVC</code> call <code>ARM_SMCCC_HV_PV_TIME_ST</code>, it eventually invokes the corresponding handler <code>kvm_hvc_call_handler</code> (<code>arch/arm64/kvm/hypercalls.c</code>):</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span 
class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="type">int</span> <span class="title function_">kvm_hvc_call_handler</span><span class="params">(<span class="keyword">struct</span> kvm_vcpu *vcpu)</span></span><br><span class="line">&#123;</span><br><span class="line">u32 func_id = smccc_get_function(vcpu);</span><br><span class="line"><span class="type">long</span> val = SMCCC_RET_NOT_SUPPORTED;</span><br><span class="line">u32 feature;</span><br><span class="line"><span class="type">gpa_t</span> gpa;</span><br><span class="line"></span><br><span class="line"><span class="keyword">switch</span> (func_id) &#123;</span><br><span class="line"><span class="keyword">case</span> ARM_SMCCC_VERSION_FUNC_ID:</span><br><span class="line">val = ARM_SMCCC_VERSION_1_1;</span><br><span class="line"><span 
class="keyword">break</span>;</span><br><span class="line"><span class="keyword">case</span> ARM_SMCCC_ARCH_FEATURES_FUNC_ID:</span><br><span class="line">feature = smccc_get_arg1(vcpu);</span><br><span class="line"><span class="keyword">switch</span> (feature) &#123;</span><br><span class="line"><span class="keyword">case</span> ARM_SMCCC_ARCH_WORKAROUND_1:</span><br><span class="line"><span class="keyword">switch</span> (arm64_get_spectre_v2_state()) &#123;</span><br><span class="line"><span class="keyword">case</span> SPECTRE_VULNERABLE:</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line"><span class="keyword">case</span> SPECTRE_MITIGATED:</span><br><span class="line">val = SMCCC_RET_SUCCESS;</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line"><span class="keyword">case</span> SPECTRE_UNAFFECTED:</span><br><span class="line">val = SMCCC_ARCH_WORKAROUND_RET_UNAFFECTED;</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line"><span class="keyword">case</span> ARM_SMCCC_ARCH_WORKAROUND_2:</span><br><span class="line"><span class="keyword">switch</span> (arm64_get_spectre_v4_state()) &#123;</span><br><span class="line"><span class="keyword">case</span> SPECTRE_VULNERABLE:</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line"><span class="keyword">case</span> SPECTRE_MITIGATED:</span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * SSBS everywhere: Indicate no firmware</span></span><br><span class="line"><span class="comment"> * support, as the SSBS support will be</span></span><br><span class="line"><span class="comment"> * indicated to the guest and the default is</span></span><br><span class="line"><span class="comment"> * 
safe.</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> * Otherwise, expose a permanent mitigation</span></span><br><span class="line"><span class="comment"> * to the guest, and hide SSBS so that the</span></span><br><span class="line"><span class="comment"> * guest stays protected.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="keyword">if</span> (cpus_have_final_cap(ARM64_SSBS))</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line">fallthrough;</span><br><span class="line"><span class="keyword">case</span> SPECTRE_UNAFFECTED:</span><br><span class="line">val = SMCCC_RET_NOT_REQUIRED;</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line"><span class="keyword">case</span> ARM_SMCCC_HV_PV_TIME_FEATURES:</span><br><span class="line">val = SMCCC_RET_SUCCESS;</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line"><span class="keyword">case</span> ARM_SMCCC_HV_PV_TIME_FEATURES:</span><br><span class="line">val = kvm_hypercall_pv_features(vcpu);</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line"><span class="keyword">case</span> ARM_SMCCC_HV_PV_TIME_ST:</span><br><span class="line">gpa = kvm_init_stolen_time(vcpu);</span><br><span class="line"><span class="keyword">if</span> (gpa != GPA_INVALID)</span><br><span class="line">val = gpa;</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line"><span class="keyword">default</span>:</span><br><span class="line"><span class="keyword">return</span> kvm_psci_call(vcpu);</span><br><span 
class="line">&#125;</span><br><span class="line"></span><br><span class="line">smccc_set_retval(vcpu, val, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>);</span><br><span class="line"><span class="keyword">return</span> <span class="number">1</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>As the handler shows, for <code>ARM_SMCCC_HV_PV_TIME_ST</code> the <code>hypervisor</code> calls <code>kvm_init_stolen_time</code> to initialize the <code>steal-time</code> state:</p><ul><li>It reads the base address <code>base</code> of the <code>steal</code> area (a guest physical address) from the guest's <code>vcpu</code> structure <code>kvm_vcpu</code></li><li>It uses <code>kvm_write_guest</code> to write an initial <code>pvclock_vcpu_stolen_time</code> value to <code>base</code>; this structure is where the guest's <code>steal-time</code> is stored</li></ul><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="type">gpa_t</span> <span class="title function_">kvm_init_stolen_time</span><span class="params">(<span class="keyword">struct</span> kvm_vcpu *vcpu)</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="class"><span 
class="keyword">struct</span> <span class="title">pvclock_vcpu_stolen_time</span> <span class="title">init_values</span> =</span> &#123;&#125;;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">kvm</span> *<span class="title">kvm</span> =</span> vcpu-&gt;kvm;</span><br><span class="line">u64 base = vcpu-&gt;arch.steal.base;</span><br><span class="line"><span class="type">int</span> idx;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (base == GPA_INVALID)</span><br><span class="line"><span class="keyword">return</span> base;</span><br><span class="line"></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * Start counting stolen time from the time the guest requests</span></span><br><span class="line"><span class="comment"> * the feature enabled.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line">vcpu-&gt;arch.steal.last_steal = current-&gt;sched_info.run_delay;</span><br><span class="line"></span><br><span class="line">idx = srcu_read_lock(&amp;kvm-&gt;srcu);</span><br><span class="line">kvm_write_guest(kvm, base, &amp;init_values, <span class="keyword">sizeof</span>(init_values));</span><br><span class="line">srcu_read_unlock(&amp;kvm-&gt;srcu, idx);</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> base;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Finally, while it is running, <code>KVM</code> listens for task context-switch events and calls <code>kvm_update_stolen_time</code> whenever the <code>vcpu</code> task is switched, updating <code>steal-time</code>. The code also shows that the guest's <code>steal-time</code> is really the scheduler statistic <code>run_delay</code>: the scheduling delay of the <code>vcpu</code>, that is, the time it spent waiting on the run queue.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span 
class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="type">void</span> <span class="title function_">kvm_update_stolen_time</span><span class="params">(<span class="keyword">struct</span> kvm_vcpu *vcpu)</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">kvm</span> *<span class="title">kvm</span> =</span> vcpu-&gt;kvm;</span><br><span class="line">u64 base = vcpu-&gt;arch.steal.base;</span><br><span class="line">u64 last_steal = vcpu-&gt;arch.steal.last_steal;</span><br><span class="line">u64 offset = offsetof(<span class="keyword">struct</span> pvclock_vcpu_stolen_time, stolen_time);</span><br><span class="line">u64 steal = <span class="number">0</span>;</span><br><span class="line"><span class="type">int</span> idx;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (base == GPA_INVALID)</span><br><span class="line"><span class="keyword">return</span>;</span><br><span class="line"></span><br><span class="line">idx = srcu_read_lock(&amp;kvm-&gt;srcu);</span><br><span class="line"><span class="keyword">if</span> (!kvm_get_guest(kvm, base + offset, steal)) &#123;</span><br><span class="line">steal = le64_to_cpu(steal);</span><br><span 
class="line">vcpu-&gt;arch.steal.last_steal = READ_ONCE(current-&gt;sched_info.run_delay);</span><br><span class="line">steal += vcpu-&gt;arch.steal.last_steal - last_steal;</span><br><span class="line">kvm_put_guest(kvm, base + offset, cpu_to_le64(steal));</span><br><span class="line">&#125;</span><br><span class="line">srcu_read_unlock(&amp;kvm-&gt;srcu, idx);</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><h2 id="总结"><a href="#总结" class="headerlink" title="总结"></a><strong>Summary</strong></h2><p>Starting from a high-CPU-usage problem in a virtualization project, this article walked through the troubleshooting approach; using <code>KVM</code> as an example, it then traced how <code>CPU steal-time</code> is computed and updated on a virtualized platform, which should make similar problems easier to handle in the future. In day-to-day project work, when a hard problem exposes a gap in your knowledge, taking the time to dig into the underlying principles and mechanisms pays off far more than studying theory alone.</p><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a><strong>References</strong></h2><ul><li><a href="https://developer.arm.com/documentation/102412/0103/Exception-types/Synchronous-exceptions">https://developer.arm.com/documentation/102412/0103/Exception-types/Synchronous-exceptions</a></li><li><a href="https://www.redhat.com/en/topics/virtualization/what-is-KVM">https://www.redhat.com/en/topics/virtualization/what-is-KVM</a></li><li><a href="https://ubuntu.com/blog/kvm-hyphervisor">https://ubuntu.com/blog/kvm-hyphervisor</a></li></ul>]]></content>
    
    
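    <summary type="html">&lt;p&gt;A couple of days ago I ran into a project issue on a &lt;code&gt;QNX&lt;/code&gt;-based virtualization platform: a colleague reported that the system was very sluggish, with obvious delay and severe stutter when tapping the UI. &lt;code&gt;top&lt;/code&gt; showed that the &lt;code&gt;Android&lt;/code&gt; system still had about &lt;code&gt;20%&lt;/code&gt; idle, and user, kernel, and interrupt usage all looked normal; only &lt;code&gt;%host&lt;/code&gt; was unusually high, peaking above &lt;code&gt;60%&lt;/code&gt;. What does this &lt;code&gt;host&lt;/code&gt; usage mean? In this article we use this problem to analyze high &lt;code&gt;host&lt;/code&gt; usage on virtualization platforms and how &lt;code&gt;KVM&lt;/code&gt; computes &lt;code&gt;host&lt;/code&gt; usage.</summary>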
    
    
    
    <category term="虚拟化" scheme="https://sniffer.site/categories/%E8%99%9A%E6%8B%9F%E5%8C%96/"/>
    
    
    <category term="Linux" scheme="https://sniffer.site/tags/Linux/"/>
    
    <category term="Hypervisor" scheme="https://sniffer.site/tags/Hypervisor/"/>
    
    <category term="CPU steal-time" scheme="https://sniffer.site/tags/CPU-steal-time/"/>
    
  </entry>
  
  <entry>
    <title>Linux Kernel Module Signing Explained</title>
    <link href="https://sniffer.site/2025/01/24/Linux%E5%86%85%E6%A0%B8%E6%A8%A1%E5%9D%97%E7%AD%BE%E5%90%8D%E9%82%A3%E4%BA%9B%E4%BA%8B/"/>
    <id>https://sniffer.site/2025/01/24/Linux%E5%86%85%E6%A0%B8%E6%A8%A1%E5%9D%97%E7%AD%BE%E5%90%8D%E9%82%A3%E4%BA%9B%E4%BA%8B/</id>
    <published>2025-01-24T06:16:03.000Z</published>
    <updated>2025-08-05T10:19:38.811Z</updated>
    
    <content type="html"><![CDATA[<p>A colleague recently reported a boot failure whose root cause was that the system's driver modules failed to load, so <code>system_server</code> could not start. <code>lsmod</code> showed that no modules were loaded, and running <code>insmod /vendor/lib/modules/cnss2.ko</code> printed:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">insmod: failed to to load cnss2.ko : Key was rejected by service</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>This means the module's signature does not match the kernel, so loading fails. Let's look at how kernel modules are signed, how the signature is verified, and how to check a module's signature with tools.</p><span id="more"></span><h2 id="内核模块签名"><a href="#内核模块签名" class="headerlink" title="内核模块签名"></a><strong>Kernel Module Signing</strong></h2><p>To improve security, the kernel generates a signature at build time and appends it to the end of the module. When a module is loaded, the kernel verifies this signature and refuses to load the module if it does not match. This keeps unsigned or malicious modules out of the kernel and protects its normal operation.</p><p>Kernel module signatures follow the <code>PKCS7 (Public Key Cryptography Standards)</code> format; the hash used for signing can be one of several algorithms such as <code>SHA1</code>&#x2F;<code>SHA224</code>&#x2F;<code>SHA256</code>, selected through the kernel option <code>CONFIG_MODULE_SIG_HASH</code>. Linux has supported module signature verification since version 3.7. To enable it, turn on the following options:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">CONFIG_MODULE_SIG=y</span><br><span class="line">CONFIG_MODULE_SIG_FORCE=y</span><br><span class="line">CONFIG_MODULE_SIG_ALL=y</span><br><span class="line">CONFIG_MODULE_SIG_SHA1=y</span><br><span class="line"># choose the hash algorithm as needed</span><br><span class="line">#CONFIG_MODULE_SIG_SHA224=y</span><br><span class="line">#CONFIG_MODULE_SIG_SHA256=y</span><br><span 
class="line"></span><br></pre></td></tr></table></figure><p>The <code>Makefile</code> in the kernel source root shows that when both <code>CONFIG_MODULE_SIG_ALL</code> and <code>CONFIG_MODULE_SIG</code> are enabled, all external modules in the system are signed with the certificate generated by the kernel build:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">ifdef CONFIG_MODULE_SIG_ALL</span><br><span class="line">$(<span class="built_in">eval</span> $(call config_filename,MODULE_SIG_KEY))</span><br><span class="line"></span><br><span class="line">mod_sign_cmd = scripts/sign-file $(CONFIG_MODULE_SIG_HASH) $(MODULE_SIG_KEY_SRCPREFIX)$(CONFIG_MODULE_SIG_KEY) certs/signing_key.x509</span><br><span class="line"><span class="keyword">else</span></span><br><span class="line">mod_sign_cmd = <span class="literal">true</span></span><br><span class="line">endif</span><br><span class="line"><span class="built_in">export</span> mod_sign_cmd</span><br><span class="line"></span><br><span class="line">ifeq ($(CONFIG_MODULE_SIG), y)</span><br><span class="line">PHONY += modules_sign</span><br><span class="line">modules_sign:</span><br><span class="line">$(Q)$(MAKE) -f $(srctree)/scripts/Makefile.modsign</span><br><span class="line">endif</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>The signing itself is implemented in <code>scripts/sign-file.c</code>; once signing completes, the magic string <code>~Module signature appended~\n</code> is also appended to the end of the module. We can check whether a module is signed as follows:</p><figure class="highlight bash"><table><tr><td 
class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">strings gvm_spf_machine_dlkm.ko |<span class="built_in">tail</span> -n 1</span><br><span class="line"></span><br></pre></td></tr></table></figure><h2 id="模块签名校验"><a href="#模块签名校验" class="headerlink" title="模块签名校验"></a><strong>Module Signature Verification</strong></h2><p>When a kernel module is loaded with <code>insmod</code>, the core of the operation is the system call <code>__NR_finit_module</code>, which attempts to install the module; the call path is shown below (<code>kernel/module.c</code>):</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">int rc = syscall(__NR_finit_module, fd.get(), options.c_str(), flags)</span><br><span class="line"><span class="comment"># kernel call path</span></span><br><span class="line">-&gt; finit_module</span><br><span class="line">-&gt; load_module</span><br><span class="line">    -&gt; module_sig_check</span><br><span class="line">        -&gt; mod_verify_sig</span><br><span class="line">            -&gt; verify_pkcs7_signature</span><br><span class="line">    -&gt; do_init_module</span><br><span class="line">        -&gt; do_mod_ctors</span><br><span class="line">        -&gt; do_one_initcall</span><br></pre></td></tr></table></figure><p>If the signature check <code>mod_verify_sig</code> returns an error, the kernel returns the <code>EKEYREJECTED</code> error code, and loading the driver then prints the error message <code>Key was rejected by service</code>.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span 
class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="meta">#<span class="keyword">ifdef</span> CONFIG_MODULE_SIG</span></span><br><span class="line"><span class="type">static</span> <span class="type">int</span> <span class="title function_">module_sig_check</span><span 
class="params">(<span class="keyword">struct</span> load_info *info, <span class="type">int</span> flags)</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="type">int</span> err = -ENODATA;</span><br><span class="line"><span class="type">const</span> <span class="type">unsigned</span> <span class="type">long</span> markerlen = <span class="keyword">sizeof</span>(MODULE_SIG_STRING) - <span class="number">1</span>;</span><br><span class="line"><span class="type">const</span> <span class="type">char</span> *reason;</span><br><span class="line"><span class="type">const</span> <span class="type">void</span> *mod = info-&gt;hdr;</span><br><span class="line"></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * Require flags == 0, as a module with version information</span></span><br><span class="line"><span class="comment"> * removed is no longer the module that was signed</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="keyword">if</span> (flags == <span class="number">0</span> &amp;&amp;</span><br><span class="line">    info-&gt;len &gt; markerlen &amp;&amp;</span><br><span class="line">    <span class="built_in">memcmp</span>(mod + info-&gt;len - markerlen, MODULE_SIG_STRING, markerlen) == <span class="number">0</span>) &#123;</span><br><span class="line"><span class="comment">/* We truncate the module to discard the signature */</span></span><br><span class="line">info-&gt;len -= markerlen;</span><br><span class="line">err = mod_verify_sig(mod, info);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">switch</span> (err) &#123;</span><br><span class="line"><span class="keyword">case</span> <span class="number">0</span>:</span><br><span class="line">info-&gt;sig_ok = <span class="literal">true</span>;</span><br><span class="line"><span 
class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line"></span><br><span class="line"><span class="comment">/* We don&#x27;t permit modules to be loaded into trusted kernels</span></span><br><span class="line"><span class="comment"> * without a valid signature on them, but if we&#x27;re not</span></span><br><span class="line"><span class="comment"> * enforcing, certain errors are non-fatal.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="keyword">case</span> -ENODATA:</span><br><span class="line">reason = <span class="string">&quot;unsigned module&quot;</span>;</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line"><span class="keyword">case</span> -ENOPKG:</span><br><span class="line">reason = <span class="string">&quot;module with unsupported crypto&quot;</span>;</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line"><span class="keyword">case</span> -ENOKEY:</span><br><span class="line">reason = <span class="string">&quot;module with unavailable key&quot;</span>;</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line"></span><br><span class="line"><span class="comment">/* All other errors are fatal, including nomem, unparseable</span></span><br><span class="line"><span class="comment"> * signatures and signature check failures - even if signatures</span></span><br><span class="line"><span class="comment"> * aren&#x27;t required.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="keyword">default</span>:</span><br><span class="line"><span class="keyword">return</span> err;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (is_module_sig_enforced()) &#123;</span><br><span class="line">pr_notice(<span 
class="string">&quot;Loading of %s is rejected\n&quot;</span>, reason);</span><br><span class="line"><span class="keyword">return</span> -EKEYREJECTED;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> security_locked_down(LOCKDOWN_MODULE_SIGNATURE);</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><h2 id="如何检查模块的签名"><a href="#如何检查模块的签名" class="headerlink" title="如何检查模块的签名"></a><strong>How to Check a Module's Signature</strong></h2><p>During real project development we may need to inspect a module's signature status and verify that its signature is valid. <code>modinfo</code> shows the signature status of a module:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment"># modinfo cnss2.ko</span></span><br><span class="line">filename:       out/target/product/msmnile_gvmq/dlkm/lib/modules/cnss2.ko</span><br><span class="line">license:        GPL 
v2</span><br><span class="line">description:    CNSS2 Platform Driver</span><br><span class="line">vermagic:       5.4.219-g328405dec0e5-dirty SMP preempt mod_unload modversions aarch64</span><br><span class="line">name:           cnss2</span><br><span class="line">intree:         Y</span><br><span class="line">depends:        pci-msm-drv</span><br><span class="line"><span class="built_in">alias</span>:          of:N*T*Cqcom,cnss</span><br><span class="line"><span class="built_in">alias</span>:          of:N*T*Cqcom,cnssC*</span><br><span class="line"><span class="built_in">alias</span>:          of:N*T*Cqcom,cnss-qca6290</span><br><span class="line"><span class="built_in">alias</span>:          of:N*T*Cqcom,cnss-qca6290C*</span><br><span class="line"><span class="built_in">alias</span>:          of:N*T*Cqcom,cnss-qca6390</span><br><span class="line"><span class="built_in">alias</span>:          of:N*T*Cqcom,cnss-qca6390C*</span><br><span class="line"><span class="built_in">alias</span>:          of:N*T*Cqcom,cnss-qca6490</span><br><span class="line"><span class="built_in">alias</span>:          of:N*T*Cqcom,cnss-qca6490C*</span><br><span class="line"><span class="built_in">alias</span>:          of:N*T*Cqcom,cnss-kiwi</span><br><span class="line"><span class="built_in">alias</span>:          of:N*T*Cqcom,cnss-kiwiC*</span><br><span class="line"><span class="built_in">alias</span>:          of:N*T*Cqcom,cnss-qca-converged</span><br><span class="line"><span class="built_in">alias</span>:          of:N*T*Cqcom,cnss-qca-convergedC*</span><br><span class="line"><span class="built_in">alias</span>:          pci:v0000168Cd0000003Esv*sd*bc*sc*i*</span><br><span class="line"><span class="built_in">alias</span>:          pci:v000017CBd00001100sv*sd*bc*sc*i*</span><br><span class="line"><span class="built_in">alias</span>:          pci:v000017CBd00001101sv*sd*bc*sc*i*</span><br><span class="line"><span class="built_in">alias</span>:          
pci:v000017CBd00001102sv*sd*bc*sc*i*</span><br><span class="line"><span class="built_in">alias</span>:          pci:v000017CBd00001103sv*sd*bc*sc*i*</span><br><span class="line"><span class="built_in">alias</span>:          pci:v000017CBd00001107sv*sd*bc*sc*i*</span><br><span class="line">sig_id:         PKCS<span class="comment">#7</span></span><br><span class="line">signer:         Build time autogenerated kernel key</span><br><span class="line">sig_key:        AD:D1:68:DF:11:5E:02:AE</span><br><span class="line">sig_hashalgo:   sha1</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>To verify whether a driver's signature matches the currently running kernel, we can use the <code>perl</code> script <a href="https://github.com/runningforlife/CodingExamples/shell/check_mod_sig.pl"><code>check_mod_sig.pl</code></a>:</p><ul><li>First, extract the certificate from <code>vmlinux</code> with the kernel's <code>scripts/extract-sys-certs.pl</code> script (it can also be taken from the build output)</li><li>Then verify the module's signature against that certificate; if the result is <code>OK</code>, the module's signature is consistent with the kernel image</li></ul><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">perl extract-sys-certs.pl vmlinux ./vmlinux.x509</span><br><span class="line"></span><br><span class="line">perl check_mod_sig.pl ./vmlinux.x509 cnss.ko</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>To remove the signature from a kernel module, use <code>strip</code> (on <code>Android</code>, use the cross toolchain's binary for the <code>arm</code> target, <code>prebuilts/gcc/linux-x86/aarch64/aarch64-linux-android-4.9/aarch64-linux-android/bin/strip</code>), and then re-sign the module with the <code>sign-file</code> tool:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span 
class="line"></span><br><span class="line"><span class="comment"># --strip-debug is required; otherwise some symbols in the .ko may be removed and the module will fail to load</span></span><br><span class="line">strip --strip-debug --keep-file-symbols cnss2.ko</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Following the kernel <code>Makefile</code>, sign the module manually. After signing succeeds, the string <code>~Module signature appended~</code> appears at the end of the file:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"># mod_sign_cmd = scripts/sign-file $(CONFIG_MODULE_SIG_HASH) $(MODULE_SIG_KEY_SRCPREFIX)$(CONFIG_MODULE_SIG_KEY) certs/signing_key.x509</span><br><span class="line"></span><br><span class="line">./sign-file sha1 certs/signing_key.pem certs/signing_key.x509 cnss2.ko</span><br><span class="line"></span><br></pre></td></tr></table></figure><h2 id="参考文献"><a href="#参考文献" class="headerlink" title="参考文献"></a><strong>References</strong></h2><ul><li><a href="https://sysprog21.github.io/lkmpg/">https://sysprog21.github.io/lkmpg/</a></li><li><a href="https://docs.kernel.org/kbuild/modules.html">https://docs.kernel.org/kbuild/modules.html</a></li><li><a href="https://unix.stackexchange.com/questions/493170/how-to-verify-a-kernel-module-signature">https://unix.stackexchange.com/questions/493170/how-to-verify-a-kernel-module-signature</a></li></ul>]]></content>
    
    
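    <summary type="html">&lt;p&gt;A colleague recently reported a boot failure whose root cause was that the system's driver modules failed to load, which kept &lt;code&gt;system_server&lt;/code&gt; from starting. &lt;code&gt;lsmod&lt;/code&gt; showed no drivers loaded at all, and running &lt;code&gt;insmod /vendor/lib/modules/cnss2.ko&lt;/code&gt; produced the following error:&lt;/p&gt;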
&lt;figure class=&quot;highlight plaintext&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre&gt;&lt;span class=&quot;line&quot;&gt;1&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;2&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;3&lt;/span&gt;&lt;br&gt;&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;insmod: failed to to load cnss2.ko : Key was rejected by service&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/figure&gt;

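&lt;p&gt;This means the module's signature does not match the kernel, so the installation failed. Here we look at how kernel modules are actually signed, how the module signature is verified, and how to check a module's signature with tools.&lt;/p&gt;</summary>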
    
    
    
    <category term="Linux" scheme="https://sniffer.site/categories/Linux/"/>
    
    
    <category term="Linux" scheme="https://sniffer.site/tags/Linux/"/>
    
    <category term="内核" scheme="https://sniffer.site/tags/%E5%86%85%E6%A0%B8/"/>
    
    <category term="模块签名" scheme="https://sniffer.site/tags/%E6%A8%A1%E5%9D%97%E7%AD%BE%E5%90%8D/"/>
    
  </entry>
  
  <entry>
    <title>Year of the Dragon: Year-End Review</title>
    <link href="https://sniffer.site/2024/12/31/%E9%BE%99%E5%B9%B4%E5%B9%B4%E7%BB%88%E6%80%BB%E7%BB%93/"/>
    <id>https://sniffer.site/2024/12/31/%E9%BE%99%E5%B9%B4%E5%B9%B4%E7%BB%88%E6%80%BB%E7%BB%93/</id>
    <published>2024-12-31T09:01:05.000Z</published>
    <updated>2025-01-05T09:54:04.857Z</updated>
    
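    <content type="html"><![CDATA[<blockquote><p>When humanity becomes fatally conceited about its own reason, it sets out on the road to serfdom</p><pre><code>von Hayek</code></pre></blockquote><p><img src="https://unsplash.com/photos/Y2pzmNYinu0/download?ixid=M3wxMjA3fDB8MXxhbGx8fHx8fHx8fHwxNzM1NzM4MjQzfA&auto=format&fit=crop&w=1920&q=80" alt="new year-2025"></p><span id="more"></span><p>Only when I sat down to write this year-end review did I realize that 2024 was the Year of the Dragon, my own zodiac year, and that in the blink of an eye ten years have passed since I finished graduate school. Over that decade I went from being on my own to having both parents and children to look after, and I have been through a good deal. Yet many things I never thought through clearly; they remain vague, without a coherent line of reasoning. Financial planning, how to handle relationships with family and friends, the end of life: I have never considered any of these in depth and have coped purely on my original intentions and instinct, so when real problems arrive I find myself stretched thin. This year, facing my own confusion and uncertainty, I read a number of books systematically and gained something from them. But the more I read, the more I realized that many of my earlier ideas do not hold up under scrutiny and are full of gaps. Put simply, I had read too little, there was too much I did not know, I had not stayed curious enough, and I had become somewhat closed-minded. As a result many of my views went unrefreshed for years, stuck where they started and never really put into practice. I thought I understood, when in fact I was still seeing through a haze.</p><p>Real understanding is the unity of knowing and doing: turning ideas and beliefs into standards for practice and behavior in everyday life, and continuing to explore from there. Standing in the present and looking ahead, here are a few points that I hope will guide the years to come.</p><h2 id="保持好奇心"><a href="#保持好奇心" class="headerlink" title="保持好奇心"></a><strong>Stay Curious</strong></h2><p>Watching my five-year-old daughter, I see all sorts of odd and wonderful ideas pop into her head, and her way of expressing herself is clearly different from an adult's. Next to her, many of my own ideas look thoroughly conventional. I have long since grown used to the things around me and take them for granted; both my thinking and my behavior follow fixed, predictable patterns, and it is hard to break out of them. The root cause is that over years of socialization the brain's patterns of thought (the physical structure of its neural network) have gradually set. Having seen so much, my cognition has slowly hardened, and with the daily business of making a living and the need to keep up an authoritative image in whatever role I hold, I seldom review these existing beliefs and can no longer be bothered to question or think about them.</p><p>Gradually the brain builds a distinct cognitive boundary. Whatever it finds uninteresting or cannot understand it rejects naturally and without thinking, much as the body's immune system mounts a natural response to foreign cells it cannot recognize: the brain resists ideas that fall outside its <code>cognitive</code> range. This creates a sense of cognitive safety that feels comfortable and easy, but in reality it only builds a sturdy cage around one's own mind. Inside that cage, faced with a flood of incoming information, we merely receive it passively and cannot truly absorb it critically.</p><p>What to do? The mind needs to be emptied. Clear out the entrenched beliefs, go over the ones that feel natural and unexamined, reflect, and challenge yourself with new ideas. When you meet a view that differs from yours, that makes you uncomfortable and that you want to resist, ask why the other person thinks that way. Toward ideas you cannot understand and viewpoints that are new, stay curious and open-minded, trace them back to their roots, and study and think about them systematically. If your head holds few ideas and few fresh viewpoints, keep reading, and see how the best people in every field think and learn. If one book makes no sense, never mind; keep going, keep reading, keep taking things in. The ideas in your mind will be washed over, and the entrenched, stale ones will be refreshed again and again, the way the sea keeps washing a beach, until new ideas finally emerge. Emergence is the very capability that ChatGPT showed after absorbing and distilling vast amounts of data, and the human brain can likewise be reshaped by training on a large volume of useful input.</p><h2 id="健康第一"><a href="#健康第一" class="headerlink" 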
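title="健康第一"></a><strong>Health First</strong></h2><p>A while ago, chatting with a former colleague, I learned that the head of my old department had died of an incurable serious illness in less than half a year, which left me shaken and sad. I still remember him a few years ago at the company: high-spirited, full of drive, and gentle and courteous with everyone. I never expected that within just a few years he would be gone. I hear he left two children behind, which is saddening and leaves one feeling helpless. Birth, aging, sickness, and death spare no one, and once a life is over, who remembers what came before? The memories that were once etched so deeply in us and so hard to let go vanish like the wind, without a trace or a sound. Still, in my heart I hope that every life can escape unexpected disaster and, when it does end, will have bloomed more brilliantly and more fully.</p><p>In the IT industry, competitive pressure is constant, and overtime and 996 schedules are the norm. Many people neglect their health because work is too busy and leave no time for exercise. In the long run this is killing the goose that lays the golden eggs, and it puts the body at great risk. Many companies also fail entirely to see how much their employees' health matters to the company's own development. In the end, for individuals and companies alike, market competition is a protracted campaign that is unlikely to be won through short-term effort. We need to build a sustainable path, and a healthy body is the foundation of sustainable competitiveness.</p><p>Staying healthy is more than a slogan; it takes self-respect and self-discipline. Keep exercising, and make time for it however busy you are; get regular checkups, take any physical discomfort seriously, and seek a professional doctor's advice.</p><h2 id="不要试图改变一个成年人"><a href="#不要试图改变一个成年人" class="headerlink" title="不要试图改变一个成年人"></a><strong>Do Not Try to Change an Adult</strong></h2><p>This year, in a new position with leadership responsibilities, I ran into difficulties and conflicts in communicating with people. At first I kept trying to correct and change some of the other person's views and to prove that their thinking was flawed. We were not approaching the problem from the same position or angle at all; each of us talked past the other and neither could persuade the other. The outcome was predictable: both sides ended up exhausted and agreement was hard to reach. I found it distressing. The other person's idea was plainly unreliable and untenable, so why did he hold on to it so firmly? Even after events proved his view flawed and wrong, he carried on as if nothing had happened, calm and unruffled. That puzzled me. But thinking it over carefully, my own approach was at fault: I had made a serious mistake, which was to try to change a person, and an adult at that.</p><p>I once heard it said that <code>adults cannot be influenced or changed, only selected</code>. So if you think someone's view is flawed, or your values do not align, the best course is not to try to change or influence them but to do nothing beyond simply expressing yourself and stating the facts and your view. If the other person has good self-awareness and can reflect on and review their own beliefs, they will come to understand on their own. If, on the other hand, they lack the capacity for introspection, cannot observe themselves, and have no inner drive, then even a sermon from Sakyamuni or Jesus would have no effect.</p><p>An adult who lacks the capacity for introspection and reflection probably has a mindset that set long ago. The brain's neural circuits are hard to retune, and outside forces can do little to refresh his beliefs. So do not try to change him, still less argue or quarrel with him; smile it off and let time prove everything.</p><p>Happy New Year. In 2025, set out from the heart.</p>]]></content>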
    
    
    <summary type="html">&lt;blockquote&gt;&lt;p&gt;When humanity becomes fatally conceited about its own reason, it sets out on the road to serfdom&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;von Hayek
&lt;/code&gt;&lt;/pre&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;img src=&quot;https://unsplash.com/photos/Y2pzmNYinu0/download?ixid=M3wxMjA3fDB8MXxhbGx8fHx8fHx8fHwxNzM1NzM4MjQzfA&amp;auto=format&amp;fit=crop&amp;w=1920&amp;q=80&quot; alt=&quot;new year-2025&quot;&gt;&lt;/p&gt;</summary>
    
    
    
    <category term="思考" scheme="https://sniffer.site/categories/%E6%80%9D%E8%80%83/"/>
    
    
    <category term="成长" scheme="https://sniffer.site/tags/%E6%88%90%E9%95%BF/"/>
    
    <category term="探索" scheme="https://sniffer.site/tags/%E6%8E%A2%E7%B4%A2/"/>
    
  </entry>
  
  <entry>
    <title>Understanding Android Process Freezing in Depth</title>
    <link href="https://sniffer.site/2024/12/24/%E6%B7%B1%E5%85%A5%E7%90%86%E8%A7%A3Android%E8%BF%9B%E7%A8%8B%E5%86%BB%E7%BB%93/"/>
    <id>https://sniffer.site/2024/12/24/%E6%B7%B1%E5%85%A5%E7%90%86%E8%A7%A3Android%E8%BF%9B%E7%A8%8B%E5%86%BB%E7%BB%93/</id>
    <published>2024-12-24T11:02:29.000Z</published>
    <updated>2025-06-05T06:26:50.637Z</updated>
    
    <content type="html"><![CDATA[<p>Starting with <code>Android 11</code>, <code>Google</code> supports app freezing: tasks that have sat in the background for a long time without running are suspended, and cached background apps are frozen by migrating their processes into a dedicated <code>cgroup</code>. This reduces the use of resources such as CPU and memory, limits misbehavior by apps in the background, and keeps power consumption as low as possible. This article describes in detail how <code>Android</code> process freezing is implemented and what freezing policies it applies, aiming to explain the relevant policies and mechanisms clearly. It is organized into the following parts:</p><ul><li>Overall framework of <code>Android</code> process freezing: the general architecture and approach</li><li>Implementation of <code>Android</code> process freezing: how <code>Android</code> implements process freezing</li><li>Freezing policy of <code>Android</code> process freezing: the concrete policies applied</li></ul><span id="more"></span><h2 id="Android进程冻结整体框架"><a href="#Android进程冻结整体框架" class="headerlink" title="Android进程冻结整体框架"></a><strong>Overall Framework of Android Process Freezing</strong></h2><p>Every app in <code>Android</code> has an <code>oom_adj (out of memory adjustment)</code> value that marks its priority state. Events such as app creation, foreground/background switches, broadcast reception, service binding, and process crashes (see the adjustment reasons listed below) <a href="https://gityuan.com/2018/05/19/android-process-adj/">trigger a change in <code>oom_adj</code></a>. A change in <code>oom_adj</code> causes the <code>Android</code> system to apply specific policies, such as moving the process to a different <code>cgroup</code>, reclaiming app or system memory, or freezing the process, in order to reduce CPU and memory usage.</p><figure class="highlight java"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">// OomAdjuster.java: reasons for OOM_ADJ adjustments</span><br><span class="line"><span class="keyword">static</span> <span class="keyword">final</span> <span class="type">String</span> <span class="variable">OOM_ADJ_REASON_METHOD</span> <span class="operator">=</span> <span 
class="string">&quot;updateOomAdj&quot;</span>;</span><br><span class="line"><span class="keyword">static</span> <span class="keyword">final</span> <span class="type">String</span> <span class="variable">OOM_ADJ_REASON_NONE</span> <span class="operator">=</span> OOM_ADJ_REASON_METHOD + <span class="string">&quot;_meh&quot;</span>;</span><br><span class="line"><span class="keyword">static</span> <span class="keyword">final</span> <span class="type">String</span> <span class="variable">OOM_ADJ_REASON_ACTIVITY</span> <span class="operator">=</span> OOM_ADJ_REASON_METHOD + <span class="string">&quot;_activityChange&quot;</span>;</span><br><span class="line"><span class="keyword">static</span> <span class="keyword">final</span> <span class="type">String</span> <span class="variable">OOM_ADJ_REASON_FINISH_RECEIVER</span> <span class="operator">=</span> OOM_ADJ_REASON_METHOD + <span class="string">&quot;_finishReceiver&quot;</span>;</span><br><span class="line"><span class="keyword">static</span> <span class="keyword">final</span> <span class="type">String</span> <span class="variable">OOM_ADJ_REASON_START_RECEIVER</span> <span class="operator">=</span> OOM_ADJ_REASON_METHOD + <span class="string">&quot;_startReceiver&quot;</span>;</span><br><span class="line"><span class="keyword">static</span> <span class="keyword">final</span> <span class="type">String</span> <span class="variable">OOM_ADJ_REASON_BIND_SERVICE</span> <span class="operator">=</span> OOM_ADJ_REASON_METHOD + <span class="string">&quot;_bindService&quot;</span>;</span><br><span class="line"><span class="keyword">static</span> <span class="keyword">final</span> <span class="type">String</span> <span class="variable">OOM_ADJ_REASON_UNBIND_SERVICE</span> <span class="operator">=</span> OOM_ADJ_REASON_METHOD + <span class="string">&quot;_unbindService&quot;</span>;</span><br><span class="line"><span class="keyword">static</span> <span class="keyword">final</span> <span class="type">String</span> <span 
class="variable">OOM_ADJ_REASON_START_SERVICE</span> <span class="operator">=</span> OOM_ADJ_REASON_METHOD + <span class="string">&quot;_startService&quot;</span>;</span><br><span class="line"><span class="keyword">static</span> <span class="keyword">final</span> <span class="type">String</span> <span class="variable">OOM_ADJ_REASON_GET_PROVIDER</span> <span class="operator">=</span> OOM_ADJ_REASON_METHOD + <span class="string">&quot;_getProvider&quot;</span>;</span><br><span class="line"><span class="keyword">static</span> <span class="keyword">final</span> <span class="type">String</span> <span class="variable">OOM_ADJ_REASON_REMOVE_PROVIDER</span> <span class="operator">=</span> OOM_ADJ_REASON_METHOD + <span class="string">&quot;_removeProvider&quot;</span>;</span><br><span class="line"><span class="keyword">static</span> <span class="keyword">final</span> <span class="type">String</span> <span class="variable">OOM_ADJ_REASON_UI_VISIBILITY</span> <span class="operator">=</span> OOM_ADJ_REASON_METHOD + <span class="string">&quot;_uiVisibility&quot;</span>;</span><br><span class="line"><span class="keyword">static</span> <span class="keyword">final</span> <span class="type">String</span> <span class="variable">OOM_ADJ_REASON_ALLOWLIST</span> <span class="operator">=</span> OOM_ADJ_REASON_METHOD + <span class="string">&quot;_allowlistChange&quot;</span>;</span><br><span class="line"><span class="keyword">static</span> <span class="keyword">final</span> <span class="type">String</span> <span class="variable">OOM_ADJ_REASON_PROCESS_BEGIN</span> <span class="operator">=</span> OOM_ADJ_REASON_METHOD + <span class="string">&quot;_processBegin&quot;</span>;</span><br><span class="line"><span class="keyword">static</span> <span class="keyword">final</span> <span class="type">String</span> <span class="variable">OOM_ADJ_REASON_PROCESS_END</span> <span class="operator">=</span> OOM_ADJ_REASON_METHOD + <span class="string">&quot;_processEnd&quot;</span>;</span><br><span 
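class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Freezing of <code>Android</code> processes is implemented mainly through the kernel's <code>cgroup</code> <code>freezer</code> subsystem, shown on the right side of the diagram below. If the process being frozen exposes <code>binder</code> interfaces, the serving process must first be marked as frozen through the <code>binder</code> interface, so that when a client calls one of those interfaces an error is returned immediately instead of blocking the client process.</p><ul><li><code>ActivityManagerService(AMS)</code>: a core system service responsible for creating apps and managing their state; <code>AMS</code> adjusts process priority through the <code>OomAdjuster</code> interfaces</li><li><code>OomAdjuster</code>: computes and adjusts process state and priority, providing the basis for memory reclaim and process freezing</li><li><code>CachedAppOptimizer</code>: provides memory reclaim and process freezing, optimizing apps that have been in the background for a long time</li><li><code>Process</code>: manages app processes, providing interfaces for process creation, priority adjustment, process grouping, and so on</li></ul><p>Process freezing actually consists of two concrete steps:</p><ul><li>First, <code>freezeBinder</code> sends a command to the <code>binder</code> driver to try to freeze the server process; the <code>binder</code> driver freezes the service with the given <code>pid</code>, and subsequent requests immediately return an error</li><li>Once the <code>binder</code> service is frozen, the freeze itself is carried out through the <code>cgroup</code> freezer subsystem; after freezing completes, the process state becomes <code>S</code> and its execution path blocks in <code>do_freezer_trap</code></li></ul><p><img src="https://md-files.oss-cn-shenzhen.aliyuncs.com/android-process-freezer-flow.png" alt="Android process freezing flow"></p><h2 id="Android进程冻结实现原理"><a href="#Android进程冻结实现原理" class="headerlink" title="Android进程冻结实现原理"></a><strong>How Android Process Freezing Is Implemented</strong></h2><h3 id="进程冻结分组挂载"><a href="#进程冻结分组挂载" class="headerlink" title="进程冻结分组挂载"></a>Mounting the Freezer Cgroup</h3><p>The core of <code>Android</code> freezing is the freezer subsystem of <code>cgroup</code>, which freezes and thaws tasks. <code>cgroup</code> was originally introduced by <code>Google</code> engineers and is a very effective kernel mechanism for controlling resources such as <code>CPU</code>, memory, and <code>IO</code>. During <code>Android</code> initialization, the system's <code>cgroups.json</code> file is parsed and the commonly used cgroups are mounted:</p><ul><li>The process freezer cgroup <code>freezer</code> is mounted at <code>/sys/fs/cgroup</code></li><li>Two cgroups relate to the <code>cpu</code>: <code>/dev/cpuctl</code>, used mainly to control <code>CPU</code> scheduling, and <code>/dev/cpuset</code>, used mainly to control <code>CPU</code> affinity and binding to big or little cores</li><li><code>memory</code> corresponds to <code>/dev/memcg</code>, used mainly to control <code>memory</code> allocation</li><li><code>io</code> corresponds to <code>/dev/blkio</code>, used mainly to control <code>IO</code> scheduling</li></ul><figure class="highlight c"><table><tr><td 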
class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span 
class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="type">int</span> <span class="title function_">SecondStageMain</span><span class="params">(<span class="type">int</span> argc, <span class="type">char</span>** argv)</span> &#123;</span><br><span class="line">    <span class="keyword">if</span> (REBOOT_BOOTLOADER_ON_PANIC) &#123;</span><br><span class="line">        InstallRebootSignalHandlers();</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    boot_clock::time_point start_time = boot_clock::now();</span><br><span class="line"></span><br><span class="line">    trigger_shutdown = [](<span class="type">const</span> <span class="built_in">std</span>::<span class="built_in">string</span>&amp; command) &#123; shutdown_state.TriggerShutdown(command); &#125;;</span><br><span class="line"></span><br><span class="line">    SetStdioToDevNull(argv);</span><br><span class="line">    
InitKernelLogging(argv);</span><br><span class="line">    LOG(INFO) &lt;&lt; <span class="string">&quot;init second stage started!&quot;</span>;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Update $PATH in the case the second stage init is newer than first stage init, where it is</span></span><br><span class="line">    <span class="comment">// first set.</span></span><br><span class="line">    <span class="keyword">if</span> (setenv(<span class="string">&quot;PATH&quot;</span>, _PATH_DEFPATH, <span class="number">1</span>) != <span class="number">0</span>) &#123;</span><br><span class="line">        PLOG(FATAL) &lt;&lt; <span class="string">&quot;Could not set $PATH to &#x27;&quot;</span> &lt;&lt; _PATH_DEFPATH &lt;&lt; <span class="string">&quot;&#x27; in second stage&quot;</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Init should not crash because of a dependence on any other process, therefore we ignore</span></span><br><span class="line">    <span class="comment">// SIGPIPE and handle EPIPE at the call site directly.  Note that setting a signal to SIG_IGN</span></span><br><span class="line">    <span class="comment">// is inherited across exec, but custom signal handlers are not.  
Since we do not want to</span></span><br><span class="line">    <span class="comment">// ignore SIGPIPE for child processes, we set a no-op function for the signal handler instead.</span></span><br><span class="line">    &#123;</span><br><span class="line">        <span class="class"><span class="keyword">struct</span> <span class="title">sigaction</span> <span class="title">action</span> =</span> &#123;.sa_flags = SA_RESTART&#125;;</span><br><span class="line">        action.sa_handler = [](<span class="type">int</span>) &#123;&#125;;</span><br><span class="line">        sigaction(SIGPIPE, &amp;action, nullptr);</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Set init and its forked children&#x27;s oom_adj.</span></span><br><span class="line">    <span class="keyword">if</span> (<span class="keyword">auto</span> result =</span><br><span class="line">                WriteFile(<span class="string">&quot;/proc/1/oom_score_adj&quot;</span>, StringPrintf(<span class="string">&quot;%d&quot;</span>, DEFAULT_OOM_SCORE_ADJUST));</span><br><span class="line">        !result.ok()) &#123;</span><br><span class="line">        LOG(ERROR) &lt;&lt; <span class="string">&quot;Unable to write &quot;</span> &lt;&lt; DEFAULT_OOM_SCORE_ADJUST</span><br><span class="line">                   &lt;&lt; <span class="string">&quot; to /proc/1/oom_score_adj: &quot;</span> &lt;&lt; result.error();</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Set up a session keyring that all processes will have access to. It</span></span><br><span class="line">    <span class="comment">// will hold things like FBE encryption keys. 
No process should override</span></span><br><span class="line">    <span class="comment">// its session keyring.</span></span><br><span class="line">    keyctl_get_keyring_ID(KEY_SPEC_SESSION_KEYRING, <span class="number">1</span>);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Indicate that booting is in progress to background fw loaders, etc.</span></span><br><span class="line">    close(open(<span class="string">&quot;/dev/.booting&quot;</span>, O_WRONLY | O_CREAT | O_CLOEXEC, <span class="number">0000</span>));</span><br><span class="line"></span><br><span class="line">    <span class="comment">// See if need to load debug props to allow adb root, when the device is unlocked.</span></span><br><span class="line">    <span class="type">const</span> <span class="type">char</span>* force_debuggable_env = getenv(<span class="string">&quot;INIT_FORCE_DEBUGGABLE&quot;</span>);</span><br><span class="line">    <span class="type">bool</span> load_debug_prop = <span class="literal">false</span>;</span><br><span class="line">    <span class="keyword">if</span> (force_debuggable_env &amp;&amp; AvbHandle::IsDeviceUnlocked()) &#123;</span><br><span class="line">        load_debug_prop = <span class="string">&quot;true&quot;</span>s == force_debuggable_env;</span><br><span class="line">    &#125;</span><br><span class="line">    unsetenv(<span class="string">&quot;INIT_FORCE_DEBUGGABLE&quot;</span>);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Umount the debug ramdisk so property service doesn&#x27;t read .prop files from there, when it</span></span><br><span class="line">    <span class="comment">// is not meant to.</span></span><br><span class="line">    <span class="keyword">if</span> (!load_debug_prop) &#123;</span><br><span class="line">        UmountDebugRamdisk();</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    
PropertyInit();</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Umount second stage resources after property service has read the .prop files.</span></span><br><span class="line">    UmountSecondStageRes();</span><br><span class="line"></span><br><span class="line">    ...</span><br><span class="line">    <span class="comment">// Queue SetupCgroupsAction, which initializes the cgroups</span></span><br><span class="line">    am.QueueBuiltinAction(SetupCgroupsAction, <span class="string">&quot;SetupCgroups&quot;</span>);</span><br><span class="line">    am.QueueBuiltinAction(SetKptrRestrictAction, <span class="string">&quot;SetKptrRestrict&quot;</span>);</span><br><span class="line">    am.QueueBuiltinAction(TestPerfEventSelinuxAction, <span class="string">&quot;TestPerfEventSelinux&quot;</span>);</span><br><span class="line">    am.QueueEventTrigger(<span class="string">&quot;early-init&quot;</span>);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Queue an action that waits for coldboot done so we know ueventd has set up all of /dev...</span></span><br><span class="line">    am.QueueBuiltinAction(wait_for_coldboot_done_action, <span class="string">&quot;wait_for_coldboot_done&quot;</span>);</span><br><span class="line">    ...</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Trigger all the boot actions to get us started.</span></span><br><span class="line">    am.QueueEventTrigger(<span class="string">&quot;init&quot;</span>);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Don&#x27;t mount filesystems or start core system services in charger mode.</span></span><br><span class="line">    <span class="built_in">std</span>::<span class="built_in">string</span> bootmode = GetProperty(<span class="string">&quot;ro.bootmode&quot;</span>, <span class="string">&quot;&quot;</span>);</span><br><span class="line">    <span 
class="keyword">if</span> (bootmode == <span class="string">&quot;charger&quot;</span>) &#123;</span><br><span class="line">        am.QueueEventTrigger(<span class="string">&quot;charger&quot;</span>);</span><br><span class="line">    &#125; <span class="keyword">else</span> &#123;</span><br><span class="line">        am.QueueEventTrigger(<span class="string">&quot;late-init&quot;</span>);</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    ...</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>On <code>Android</code>, the <code>cgroups.json</code> file is located at <code>/system/etc/cgroups.json</code>. Its contents are as follows:</p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span 
class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="punctuation">&#123;</span></span><br><span class="line">  <span class="attr">&quot;Cgroups&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Controller&quot;</span><span class="punctuation">:</span> <span class="string">&quot;blkio&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Path&quot;</span><span class="punctuation">:</span> <span class="string">&quot;/dev/blkio&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Mode&quot;</span><span class="punctuation">:</span> <span class="string">&quot;0755&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;UID&quot;</span><span class="punctuation">:</span> <span class="string">&quot;system&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;GID&quot;</span><span class="punctuation">:</span> <span class="string">&quot;system&quot;</span></span><br><span class="line">    <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">    <span 
class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Controller&quot;</span><span class="punctuation">:</span> <span class="string">&quot;cpu&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Path&quot;</span><span class="punctuation">:</span> <span class="string">&quot;/dev/cpuctl&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Mode&quot;</span><span class="punctuation">:</span> <span class="string">&quot;0755&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;UID&quot;</span><span class="punctuation">:</span> <span class="string">&quot;system&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;GID&quot;</span><span class="punctuation">:</span> <span class="string">&quot;system&quot;</span></span><br><span class="line">    <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Controller&quot;</span><span class="punctuation">:</span> <span class="string">&quot;cpuset&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Path&quot;</span><span class="punctuation">:</span> <span class="string">&quot;/dev/cpuset&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Mode&quot;</span><span class="punctuation">:</span> <span class="string">&quot;0755&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;UID&quot;</span><span class="punctuation">:</span> <span class="string">&quot;system&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span 
class="attr">&quot;GID&quot;</span><span class="punctuation">:</span> <span class="string">&quot;system&quot;</span></span><br><span class="line">    <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Controller&quot;</span><span class="punctuation">:</span> <span class="string">&quot;memory&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Path&quot;</span><span class="punctuation">:</span> <span class="string">&quot;/dev/memcg&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Mode&quot;</span><span class="punctuation">:</span> <span class="string">&quot;0700&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;UID&quot;</span><span class="punctuation">:</span> <span class="string">&quot;root&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;GID&quot;</span><span class="punctuation">:</span> <span class="string">&quot;system&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Optional&quot;</span><span class="punctuation">:</span> <span class="keyword">true</span></span><br><span class="line">    <span class="punctuation">&#125;</span></span><br><span class="line">  <span class="punctuation">]</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;Cgroups2&quot;</span><span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">    <span class="attr">&quot;Path&quot;</span><span class="punctuation">:</span> <span class="string">&quot;/sys/fs/cgroup&quot;</span><span class="punctuation">,</span></span><br><span class="line">    <span 
class="attr">&quot;Mode&quot;</span><span class="punctuation">:</span> <span class="string">&quot;0755&quot;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="attr">&quot;UID&quot;</span><span class="punctuation">:</span> <span class="string">&quot;system&quot;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="attr">&quot;GID&quot;</span><span class="punctuation">:</span> <span class="string">&quot;system&quot;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="attr">&quot;Controllers&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">      </span><br><span class="line">      <span class="punctuation">&#123;</span></span><br><span class="line">        <span class="attr">&quot;Controller&quot;</span><span class="punctuation">:</span> <span class="string">&quot;freezer&quot;</span><span class="punctuation">,</span></span><br><span class="line">        <span class="attr">&quot;Path&quot;</span><span class="punctuation">:</span> <span class="string">&quot;.&quot;</span><span class="punctuation">,</span></span><br><span class="line">        <span class="attr">&quot;Mode&quot;</span><span class="punctuation">:</span> <span class="string">&quot;0755&quot;</span><span class="punctuation">,</span></span><br><span class="line">        <span class="attr">&quot;UID&quot;</span><span class="punctuation">:</span> <span class="string">&quot;system&quot;</span><span class="punctuation">,</span></span><br><span class="line">        <span class="attr">&quot;GID&quot;</span><span class="punctuation">:</span> <span class="string">&quot;system&quot;</span></span><br><span class="line">      <span class="punctuation">&#125;</span></span><br><span class="line">    <span class="punctuation">]</span></span><br><span class="line">  <span class="punctuation">&#125;</span></span><br><span class="line"><span 
class="punctuation">&#125;</span></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Once the <code>cgroup</code> hierarchies are mounted, running <code>mount</code> in an <code>adb</code> shell shows the mounted <code>cgroup</code> file systems:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment"># mount -t cgroup</span></span><br><span class="line">none on /dev/blkio <span class="built_in">type</span> cgroup (rw,nosuid,nodev,noexec,relatime,blkio)</span><br><span class="line">none on /sys/fs/cgroup <span class="built_in">type</span> cgroup2 (rw,nosuid,nodev,noexec,relatime)</span><br><span class="line">none on /dev/cpuctl <span class="built_in">type</span> cgroup (rw,nosuid,nodev,noexec,relatime,cpu)</span><br><span class="line">none on /dev/cpuset <span class="built_in">type</span> cgroup (rw,nosuid,nodev,noexec,relatime,cpuset,noprefix,release_agent=/sbin/cpuset_release_agent)</span><br><span class="line">none on /dev/memcg <span class="built_in">type</span> cgroup (rw,nosuid,nodev,noexec,relatime,memory)</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Later, when an app is launched and its process is created, <code>AMS</code> calls <code>ProcessList.startProcess</code>, which uses <code>Process.createProcessGroup</code> to create the freezer <code>cgroup</code> group for the app&#x27;s <code>UID</code>. As the snippet shows, this call is made only when the process was not started by the regular zygote, because the webview and app zygotes lack permission to create the cgroup nodes:</p><figure class="highlight java"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span 
class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span 
class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br><span class="line">101</span><br><span class="line">102</span><br><span class="line">103</span><br><span class="line">104</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="keyword">private</span> Process.ProcessStartResult <span class="title function_">startProcess</span><span class="params">(HostingRecord hostingRecord, String entryPoint,</span></span><br><span class="line"><span class="params">        ProcessRecord app, <span class="type">int</span> uid, <span class="type">int</span>[] gids, <span class="type">int</span> runtimeFlags, <span class="type">int</span> zygotePolicyFlags,</span></span><br><span class="line"><span class="params">        <span class="type">int</span> mountExternal, String seInfo, String requiredAbi, String instructionSet,</span></span><br><span class="line"><span class="params">        String invokeWith, <span class="type">long</span> startTime)</span> &#123;</span><br><span class="line">    <span class="keyword">try</span> &#123;</span><br><span class="line">        
Trace.traceBegin(Trace.TRACE_TAG_ACTIVITY_MANAGER, <span class="string">&quot;Start proc: &quot;</span> +</span><br><span class="line">                app.processName);</span><br><span class="line">        checkSlow(startTime, <span class="string">&quot;startProcess: asking zygote to start proc&quot;</span>);</span><br><span class="line">        <span class="keyword">final</span> <span class="type">boolean</span> <span class="variable">isTopApp</span> <span class="operator">=</span> hostingRecord.isTopApp();</span><br><span class="line">        <span class="keyword">if</span> (isTopApp) &#123;</span><br><span class="line">            <span class="comment">// Use has-foreground-activities as a temporary hint so the current scheduling</span></span><br><span class="line">            <span class="comment">// group won&#x27;t be lost when the process is attaching. The actual state will be</span></span><br><span class="line">            <span class="comment">// refreshed when computing oom-adj.</span></span><br><span class="line">            app.mState.setHasForegroundActivities(<span class="literal">true</span>);</span><br><span class="line">        &#125;</span><br><span class="line"></span><br><span class="line">        Map&lt;String, Pair&lt;String, Long&gt;&gt; pkgDataInfoMap;</span><br><span class="line">        Map&lt;String, Pair&lt;String, Long&gt;&gt; allowlistedAppDataInfoMap;</span><br><span class="line">        <span class="type">boolean</span> <span class="variable">bindMountAppStorageDirs</span> <span class="operator">=</span> <span class="literal">false</span>;</span><br><span class="line">        <span class="type">boolean</span> <span class="variable">bindMountAppsData</span> <span class="operator">=</span> mAppDataIsolationEnabled</span><br><span class="line">                &amp;&amp; (UserHandle.isApp(app.uid) || UserHandle.isIsolated(app.uid))</span><br><span class="line">                &amp;&amp; 
mPlatformCompat.isChangeEnabled(APP_DATA_DIRECTORY_ISOLATION, app.info);</span><br><span class="line"></span><br><span class="line">        <span class="comment">// Get all packages belongs to the same shared uid. sharedPackages is empty array</span></span><br><span class="line">        <span class="comment">// if it doesn&#x27;t have shared uid.</span></span><br><span class="line">        <span class="keyword">final</span> <span class="type">PackageManagerInternal</span> <span class="variable">pmInt</span> <span class="operator">=</span> mService.getPackageManagerInternal();</span><br><span class="line">        <span class="keyword">final</span> String[] sharedPackages = pmInt.getSharedUserPackagesForPackage(</span><br><span class="line">                app.info.packageName, app.userId);</span><br><span class="line">        <span class="keyword">final</span> String[] targetPackagesList = sharedPackages.length == <span class="number">0</span></span><br><span class="line">                ? 
<span class="keyword">new</span> <span class="title class_">String</span>[]&#123;app.info.packageName&#125; : sharedPackages;</span><br><span class="line"></span><br><span class="line">        pkgDataInfoMap = getPackageAppDataInfoMap(pmInt, targetPackagesList, uid);</span><br><span class="line">        <span class="keyword">if</span> (pkgDataInfoMap == <span class="literal">null</span>) &#123;</span><br><span class="line">            <span class="comment">// TODO(b/152760674): Handle inode == 0 case properly, now we just give it a</span></span><br><span class="line">            <span class="comment">// tmp free pass.</span></span><br><span class="line">            bindMountAppsData = <span class="literal">false</span>;</span><br><span class="line">        &#125;</span><br><span class="line"></span><br><span class="line">        ...</span><br><span class="line"></span><br><span class="line">        <span class="comment">// If it&#x27;s an isolated process, it should not even mount its own app data directories,</span></span><br><span class="line">        <span class="comment">// since it has no access to them anyway.</span></span><br><span class="line">        <span class="keyword">if</span> (app.isolated) &#123;</span><br><span class="line">            pkgDataInfoMap = <span class="literal">null</span>;</span><br><span class="line">            allowlistedAppDataInfoMap = <span class="literal">null</span>;</span><br><span class="line">        &#125;</span><br><span class="line"></span><br><span class="line">        <span class="keyword">final</span> Process.ProcessStartResult startResult;</span><br><span class="line">        <span class="type">boolean</span> <span class="variable">regularZygote</span> <span class="operator">=</span> <span class="literal">false</span>;</span><br><span class="line">        <span class="keyword">if</span> (hostingRecord.usesWebviewZygote()) &#123;</span><br><span class="line">            startResult = 
startWebView(entryPoint,</span><br><span class="line">                    app.processName, uid, uid, gids, runtimeFlags, mountExternal,</span><br><span class="line">                    app.info.targetSdkVersion, seInfo, requiredAbi, instructionSet,</span><br><span class="line">                    app.info.dataDir, <span class="literal">null</span>, app.info.packageName,</span><br><span class="line">                    app.getDisabledCompatChanges(),</span><br><span class="line">                    <span class="keyword">new</span> <span class="title class_">String</span>[]&#123;PROC_START_SEQ_IDENT + app.getStartSeq()&#125;);</span><br><span class="line">        &#125; <span class="keyword">else</span> <span class="keyword">if</span> (hostingRecord.usesAppZygote()) &#123;</span><br><span class="line">            <span class="keyword">final</span> <span class="type">AppZygote</span> <span class="variable">appZygote</span> <span class="operator">=</span> createAppZygoteForProcessIfNeeded(app);</span><br><span class="line"></span><br><span class="line">            <span class="comment">// We can&#x27;t isolate app data and storage data as parent zygote already did that.</span></span><br><span class="line">            startResult = appZygote.getProcess().start(entryPoint,</span><br><span class="line">                    app.processName, uid, uid, gids, runtimeFlags, mountExternal,</span><br><span class="line">                    app.info.targetSdkVersion, seInfo, requiredAbi, instructionSet,</span><br><span class="line">                    app.info.dataDir, <span class="literal">null</span>, app.info.packageName,</span><br><span class="line">                    <span class="comment">/*zygotePolicyFlags=*/</span> ZYGOTE_POLICY_FLAG_EMPTY, isTopApp,</span><br><span class="line">                    app.getDisabledCompatChanges(), pkgDataInfoMap, allowlistedAppDataInfoMap,</span><br><span class="line">                    <span class="literal">false</span>, <span 
class="literal">false</span>,</span><br><span class="line">                    <span class="keyword">new</span> <span class="title class_">String</span>[]&#123;PROC_START_SEQ_IDENT + app.getStartSeq()&#125;);</span><br><span class="line">        &#125; <span class="keyword">else</span> &#123;</span><br><span class="line">            regularZygote = <span class="literal">true</span>;</span><br><span class="line">            startResult = Process.start(entryPoint,</span><br><span class="line">                    app.processName, uid, uid, gids, runtimeFlags, mountExternal,</span><br><span class="line">                    app.info.targetSdkVersion, seInfo, requiredAbi, instructionSet,</span><br><span class="line">                    app.info.dataDir, invokeWith, app.info.packageName, zygotePolicyFlags,</span><br><span class="line">                    isTopApp, app.getDisabledCompatChanges(), pkgDataInfoMap,</span><br><span class="line">                    allowlistedAppDataInfoMap, bindMountAppsData, bindMountAppStorageDirs,</span><br><span class="line">                    <span class="keyword">new</span> <span class="title class_">String</span>[]&#123;PROC_START_SEQ_IDENT + app.getStartSeq()&#125;);</span><br><span class="line">        &#125;</span><br><span class="line"></span><br><span class="line">        <span class="keyword">if</span> (!regularZygote) &#123;</span><br><span class="line">            <span class="comment">// Create the process group</span></span><br><span class="line">            <span class="comment">// webview and app zygote don&#x27;t have the permission to create the nodes</span></span><br><span class="line">            <span class="keyword">if</span> (Process.createProcessGroup(uid, startResult.pid) &lt; <span class="number">0</span>) &#123;</span><br><span class="line">                Slog.e(ActivityManagerService.TAG, <span class="string">&quot;Unable to create process group for &quot;</span></span><br><span class="line">                        + 
app.processName + <span class="string">&quot; (&quot;</span> + startResult.pid + <span class="string">&quot;)&quot;</span>);</span><br><span class="line">            &#125;</span><br><span class="line">        &#125;</span><br><span class="line"></span><br><span class="line">        <span class="comment">// This runs after Process.start() as this method may block app process starting time</span></span><br><span class="line">        <span class="comment">// if dir is not cached. Running this method after Process.start() can make it</span></span><br><span class="line">        <span class="comment">// cache the dir asynchronously, so zygote can use it without waiting for it.</span></span><br><span class="line">        <span class="keyword">if</span> (bindMountAppStorageDirs) &#123;</span><br><span class="line">            storageManagerInternal.prepareStorageDirs(userId, pkgDataInfoMap.keySet(),</span><br><span class="line">                    app.processName);</span><br><span class="line">        &#125;</span><br><span class="line">        checkSlow(startTime, <span class="string">&quot;startProcess: returned from zygote!&quot;</span>);</span><br><span class="line"></span><br><span class="line">        <span class="keyword">return</span> startResult;</span><br><span class="line">    &#125; <span class="keyword">finally</span> &#123;</span><br><span class="line">        Trace.traceEnd(Trace.TRACE_TAG_ACTIVITY_MANAGER);</span><br><span class="line">    &#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p><code>Process.createProcessGroup</code> is actually a <code>native</code> method. Its JNI implementation, <code>android_os_Process_createProcessGroup</code>, ends up calling <code>createProcessGroupInternal</code> in <code>processgroup.cpp</code>, which does two things:</p><ul><li>Creates the <code>cgroup</code> group for the process under <code>/sys/fs/cgroup/</code>, based on its <code>uid</code> and <code>pid</code></li><li>Writes the process <code>pid</code> into the group&#x27;s <code>cgroup.procs</code> file</li></ul><figure 
class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="type">static</span> <span class="type">int</span> <span class="title function_">createProcessGroupInternal</span><span class="params">(<span class="type">uid_t</span> uid, <span class="type">int</span> initialPid, <span class="built_in">std</span>::<span class="built_in">string</span> cgroup)</span> &#123;</span><br><span class="line">    <span class="keyword">auto</span> uid_path = ConvertUidToPath(cgroup.c_str(), uid);</span><br><span class="line"></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">stat</span> 
<span class="title">cgroup_stat</span>;</span></span><br><span class="line">    <span class="type">mode_t</span> cgroup_mode = <span class="number">0750</span>;</span><br><span class="line">    <span class="type">uid_t</span> cgroup_uid = AID_SYSTEM;</span><br><span class="line">    <span class="type">gid_t</span> cgroup_gid = AID_SYSTEM;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (stat(cgroup.c_str(), &amp;cgroup_stat) &lt; <span class="number">0</span>) &#123;</span><br><span class="line">        PLOG(ERROR) &lt;&lt; <span class="string">&quot;Failed to get stats for &quot;</span> &lt;&lt; cgroup;</span><br><span class="line">    &#125; <span class="keyword">else</span> &#123;</span><br><span class="line">        cgroup_mode = cgroup_stat.st_mode;</span><br><span class="line">        cgroup_uid = cgroup_stat.st_uid;</span><br><span class="line">        cgroup_gid = cgroup_stat.st_gid;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (!MkdirAndChown(uid_path, cgroup_mode, cgroup_uid, cgroup_gid)) &#123;</span><br><span class="line">        PLOG(ERROR) &lt;&lt; <span class="string">&quot;Failed to make and chown &quot;</span> &lt;&lt; uid_path;</span><br><span class="line">        <span class="keyword">return</span> -errno;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">auto</span> uid_pid_path = ConvertUidPidToPath(cgroup.c_str(), uid, initialPid);</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (!MkdirAndChown(uid_pid_path, cgroup_mode, cgroup_uid, cgroup_gid)) &#123;</span><br><span class="line">        PLOG(ERROR) &lt;&lt; <span class="string">&quot;Failed to make and chown &quot;</span> &lt;&lt; uid_pid_path;</span><br><span class="line">        <span class="keyword">return</span> -errno;</span><br><span 
class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">auto</span> uid_pid_procs_file = uid_pid_path + PROCESSGROUP_CGROUP_PROCS_FILE;</span><br><span class="line"></span><br><span class="line">    <span class="type">int</span> ret = <span class="number">0</span>;</span><br><span class="line">    <span class="keyword">if</span> (!WriteStringToFile(<span class="built_in">std</span>::to_string(initialPid), uid_pid_procs_file)) &#123;</span><br><span class="line">        ret = -errno;</span><br><span class="line">        PLOG(ERROR) &lt;&lt; <span class="string">&quot;Failed to write &#x27;&quot;</span> &lt;&lt; initialPid &lt;&lt; <span class="string">&quot;&#x27; to &quot;</span> &lt;&lt; uid_pid_procs_file;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> ret;</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>等系统正常启动完成后，我们可以到<code>/sys/fs/cgroup/</code>目录下查看对应的<code>cgroup</code>分组状态：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span 
class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">rk3588m_car:/sys/fs/cgroup <span class="comment"># ls -al</span></span><br><span class="line">total 0</span><br><span class="line">drwxr-xr-x 47 system system 0 2024-12-16 19:11 .</span><br><span class="line">drwxr-xr-x 11 root   root   0 1970-01-01 08:00 ..</span><br><span class="line">-rwxr-xr-x  1 system system 0 1970-01-01 08:00 cgroup.controllers</span><br><span class="line">-rwxr-xr-x  1 system system 0 1970-01-01 08:00 cgroup.max.depth</span><br><span class="line">-rwxr-xr-x  1 system system 0 1970-01-01 08:00 cgroup.max.descendants</span><br><span class="line">-rwxr-xr-x  1 system system 0 1970-01-01 08:00 cgroup.procs</span><br><span class="line">-rwxr-xr-x  1 system system 0 1970-01-01 08:00 cgroup.stat</span><br><span class="line">-rwxr-xr-x  1 system system 0 1970-01-01 08:00 cgroup.subtree_control</span><br><span class="line">-rwxr-xr-x  1 system system 0 1970-01-01 08:00 cgroup.threads</span><br><span class="line">-rwxr-xr-x  1 system system 0 1970-01-01 08:00 cpu.pressure</span><br><span class="line">-rwxr-xr-x  1 system system 0 1970-01-01 08:00 cpu.stat</span><br><span class="line">-rwxr-xr-x  1 system system 0 1970-01-01 08:00 io.pressure</span><br><span class="line">-rwxr-xr-x  1 system system 0 1970-01-01 08:00 memory.pressure</span><br><span class="line">drwxr-xr-x 29 system system 0 2024-12-16 19:31 uid_0</span><br><span 
class="line">drwxr-xr-x 98 system system 0 2024-12-16 19:11 uid_1000</span><br><span class="line">drwxr-xr-x  3 system system 0 2024-12-16 19:10 uid_10004</span><br><span class="line">drwxr-xr-x  3 system system 0 2024-12-16 19:11 uid_10005</span><br><span class="line">drwxr-xr-x  3 system system 0 2024-12-16 19:11 uid_10007</span><br><span class="line">drwxr-xr-x  3 system system 0 2024-12-16 19:11 uid_10009</span><br><span class="line">drwxr-xr-x  3 system system 0 2024-12-16 19:11 uid_10010</span><br><span class="line">drwxr-xr-x  3 system system 0 2024-12-16 19:11 uid_10011</span><br><span class="line">drwxr-xr-x  3 system system 0 2024-12-16 19:11 uid_10012</span><br><span class="line">drwxr-xr-x  3 system system 0 2024-12-16 19:11 uid_1002</span><br><span class="line">drwxr-xr-x  3 system system 0 2024-12-16 19:11 uid_10020</span><br><span class="line">...</span><br><span class="line">drwxr-xr-x  3 system system 0 2024-12-16 19:11 uid_10029</span><br><span class="line">drwxr-xr-x  2 system system 0 2024-12-16 19:10 uid_1003</span><br><span class="line">drwxr-xr-x  3 system system 0 2024-12-16 19:10 uid_10033</span><br><span class="line">drwxr-xr-x  3 system system 0 2024-12-16 19:11 uid_10037</span><br><span class="line">drwxr-xr-x  3 system system 0 2024-12-16 19:11 uid_10038</span><br><span class="line">drwxr-xr-x  4 system system 0 2024-12-16 19:10 uid_1010</span><br><span class="line">drwxr-xr-x  3 system system 0 2024-12-16 19:10 uid_1020</span><br><span class="line">drwxr-xr-x  3 system system 0 2024-12-16 19:11 uid_1036</span><br><span class="line">drwxr-xr-x  2 system system 0 2024-12-16 19:10 uid_1037</span><br><span class="line">drwxr-xr-x  3 system system 0 2024-12-16 19:10 uid_1040</span><br><span class="line">drwxr-xr-x  6 system system 0 2024-12-16 19:10 uid_1041</span><br><span class="line">drwxr-xr-x  7 system system 0 2024-12-16 19:10 uid_1046</span><br><span class="line">drwxr-xr-x  3 system system 0 2024-12-16 19:10 uid_1047</span><br><span 
class="line"></span><br></pre></td></tr></table></figure><h3 id="进程冻结实现原理"><a href="#进程冻结实现原理" class="headerlink" title="进程冻结实现原理"></a><strong>How Process Freezing Works</strong></h3><p>As mentioned at the beginning of this article, <code>Android</code> process freezing is built on the <code>cgroup</code> freezer subsystem, which freezes and thaws tasks. Concretely, freezing an <code>Android</code> process takes two steps:</p><ul><li>First, <code>IPCThreadState.freeze</code> sends a command to the <code>binder</code> driver asking it to freeze the server-side process; the driver marks the service with the given <code>pid</code> as frozen, and subsequent requests to it fail immediately with an error</li></ul><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="type">status_t</span> <span class="title function_">IPCThreadState::freeze</span><span class="params">(<span class="type">pid_t</span> pid, <span class="type">bool</span> enable, <span class="type">uint32_t</span> timeout_ms)</span> &#123;</span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">binder_freeze_info</span> <span class="title">info</span>;</span></span><br><span class="line">    <span class="type">int</span> ret = <span class="number">0</span>;</span><br><span class="line"></span><br><span class="line">    info.pid = pid;</span><br><span class="line">    info.enable = enable;</span><br><span class="line">    info.timeout_ms = 
timeout_ms;</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="meta">#<span class="keyword">if</span> defined(__ANDROID__)</span></span><br><span class="line">    <span class="keyword">if</span> (ioctl(self()-&gt;mProcess-&gt;mDriverFD, BINDER_FREEZE, &amp;info) &lt; <span class="number">0</span>)</span><br><span class="line">        ret = -errno;</span><br><span class="line"><span class="meta">#<span class="keyword">endif</span></span></span><br><span class="line"></span><br><span class="line">    <span class="comment">//</span></span><br><span class="line">    <span class="comment">// ret==-EAGAIN indicates that transactions have not drained.</span></span><br><span class="line">    <span class="comment">// Call again to poll for completion.</span></span><br><span class="line">    <span class="comment">//</span></span><br><span class="line">    <span class="keyword">return</span> ret;</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>When the <code>binder</code> driver receives the <code>BINDER_FREEZE</code> command, it marks the target <code>binder</code> service process as <code>frozen</code>; subsequent requests are rejected immediately with <code>BR_FROZEN_REPLY</code>, which tells the caller that the service is frozen. If <code>timeout_ms</code> is set, the call waits up to that long for outstanding transactions to drain before returning. If transactions are still pending afterwards, it returns <code>-EAGAIN</code> and rolls back the <code>frozen</code> state, so the caller should retry.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span 
class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="type">static</span> <span class="type">int</span> <span class="title function_">binder_ioctl_freeze</span><span class="params">(<span class="keyword">struct</span> binder_freeze_info *info,</span></span><br><span class="line"><span class="params">       <span class="keyword">struct</span> binder_proc *target_proc)</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="type">int</span> ret = <span class="number">0</span>;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!info-&gt;enable) &#123;</span><br><span class="line">binder_inner_proc_lock(target_proc);</span><br><span class="line">target_proc-&gt;sync_recv = <span class="literal">false</span>;</span><br><span class="line">target_proc-&gt;async_recv = <span class="literal">false</span>;</span><br><span class="line">target_proc-&gt;is_frozen = <span class="literal">false</span>;</span><br><span 
class="line">binder_inner_proc_unlock(target_proc);</span><br><span class="line"><span class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * Freezing the target. Prevent new transactions by</span></span><br><span class="line"><span class="comment"> * setting frozen state. If timeout specified, wait</span></span><br><span class="line"><span class="comment"> * for transactions to drain.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line">binder_inner_proc_lock(target_proc);</span><br><span class="line">target_proc-&gt;sync_recv = <span class="literal">false</span>;</span><br><span class="line">target_proc-&gt;async_recv = <span class="literal">false</span>;</span><br><span class="line">target_proc-&gt;is_frozen = <span class="literal">true</span>;</span><br><span class="line">binder_inner_proc_unlock(target_proc);</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (info-&gt;timeout_ms &gt; <span class="number">0</span>)</span><br><span class="line">ret = wait_event_interruptible_timeout(</span><br><span class="line">target_proc-&gt;freeze_wait,</span><br><span class="line">(!target_proc-&gt;outstanding_txns),</span><br><span class="line">msecs_to_jiffies(info-&gt;timeout_ms));</span><br><span class="line"></span><br><span class="line"><span class="comment">/* Check pending transactions that wait for reply */</span></span><br><span class="line"><span class="keyword">if</span> (ret &gt;= <span class="number">0</span>) &#123;</span><br><span class="line">binder_inner_proc_lock(target_proc);</span><br><span class="line"><span class="keyword">if</span> (binder_txns_pending_ilocked(target_proc))</span><br><span class="line">ret = -EAGAIN;</span><br><span 
class="line">binder_inner_proc_unlock(target_proc);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (ret &lt; <span class="number">0</span>) &#123;</span><br><span class="line">binder_inner_proc_lock(target_proc);</span><br><span class="line">target_proc-&gt;is_frozen = <span class="literal">false</span>;</span><br><span class="line">binder_inner_proc_unlock(target_proc);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> ret;</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><ul><li><code>binder</code>服务冻结后，需要通过<code>android_os_Process_setProcessFrozen</code>接口通过<code>cgroup</code>冻结子系统执行冻结；进程冻结完成后，进程状态变为<code>S</code>，执行的路径会阻塞在<code>do_freezer_trap</code></li></ul><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment">// android_os_Process_setProcessFrozen</span></span><br><span class="line"><span class="type">void</span> <span class="title function_">android_os_Process_setProcessFrozen</span><span class="params">(</span></span><br><span class="line"><span class="params">        JNIEnv *env, jobject clazz, jint pid, jint 
uid, jboolean freeze)</span></span><br><span class="line">&#123;</span><br><span class="line">    <span class="type">bool</span> success = <span class="literal">true</span>;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (freeze) &#123;</span><br><span class="line">        success = SetProcessProfiles(uid, pid, &#123;<span class="string">&quot;Frozen&quot;</span>&#125;);</span><br><span class="line">    &#125; <span class="keyword">else</span> &#123;</span><br><span class="line">        success = SetProcessProfiles(uid, pid, &#123;<span class="string">&quot;Unfrozen&quot;</span>&#125;);</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (!success) &#123;</span><br><span class="line">        signalExceptionForGroupError(env, EINVAL, pid);</span><br><span class="line">    &#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p><code>Android</code> has two <code>cgroup</code>-related configuration files: <code>cgroups.json</code>, which describes the <code>controller</code>s, and <code>task_profiles.json</code>, which describes the <code>profiles</code>. In <code>task_profiles.json</code>, the <code>Frozen</code> and <code>Unfrozen</code> <code>profiles</code> set <code>FreezerState</code> to <code>1</code> and <code>0</code> respectively, and <code>FreezerState</code> maps to the <code>cgroup.freeze</code> file of the <code>freezer</code> controller.</p><blockquote><p>For a detailed introduction to <code>cgroup</code>, see <a href="https://sniffer.site/2024/04/15/%E5%A6%82%E4%BD%95%E5%88%A9%E7%94%A8cgroup%E4%BC%98%E5%8C%96android%E7%B3%BB%E7%BB%9F%E6%80%A7%E8%83%BD">How to Use cgroups to Optimize Android System Performance</a></p></blockquote><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span 
class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span 
class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment">//task_profiles.json</span></span><br><span class="line"><span class="punctuation">&#123;</span></span><br><span class="line">  <span class="attr">&quot;Attributes&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;LowCapacityCPUs&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Controller&quot;</span><span class="punctuation">:</span> <span class="string">&quot;cpuset&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;File&quot;</span><span class="punctuation">:</span> <span class="string">&quot;background/cpus&quot;</span></span><br><span class="line">    <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">    ...</span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;FreezerState&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Controller&quot;</span><span class="punctuation">:</span> <span class="string">&quot;freezer&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span 
class="attr">&quot;File&quot;</span><span class="punctuation">:</span> <span class="string">&quot;cgroup.freeze&quot;</span></span><br><span class="line">    <span class="punctuation">&#125;</span></span><br><span class="line">  <span class="punctuation">]</span><span class="punctuation">,</span></span><br><span class="line"></span><br><span class="line">  <span class="attr">&quot;Profiles&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;HighEnergySaving&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Actions&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">        <span class="punctuation">&#123;</span></span><br><span class="line">          <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;JoinCgroup&quot;</span><span class="punctuation">,</span></span><br><span class="line">          <span class="attr">&quot;Params&quot;</span><span class="punctuation">:</span></span><br><span class="line">          <span class="punctuation">&#123;</span></span><br><span class="line">            <span class="attr">&quot;Controller&quot;</span><span class="punctuation">:</span> <span class="string">&quot;cpu&quot;</span><span class="punctuation">,</span></span><br><span class="line">            <span class="attr">&quot;Path&quot;</span><span class="punctuation">:</span> <span class="string">&quot;background&quot;</span></span><br><span class="line">          <span class="punctuation">&#125;</span></span><br><span class="line">        <span class="punctuation">&#125;</span></span><br><span class="line">      <span class="punctuation">]</span></span><br><span 
class="line">    <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;Frozen&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Actions&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">        <span class="punctuation">&#123;</span></span><br><span class="line">          <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;SetAttribute&quot;</span><span class="punctuation">,</span></span><br><span class="line">          <span class="attr">&quot;Params&quot;</span><span class="punctuation">:</span></span><br><span class="line">          <span class="punctuation">&#123;</span></span><br><span class="line">            <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;FreezerState&quot;</span><span class="punctuation">,</span></span><br><span class="line">            <span class="attr">&quot;Value&quot;</span><span class="punctuation">:</span> <span class="string">&quot;1&quot;</span></span><br><span class="line">          <span class="punctuation">&#125;</span></span><br><span class="line">        <span class="punctuation">&#125;</span></span><br><span class="line">      <span class="punctuation">]</span></span><br><span class="line">    <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;Unfrozen&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span 
class="attr">&quot;Actions&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">        <span class="punctuation">&#123;</span></span><br><span class="line">          <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;SetAttribute&quot;</span><span class="punctuation">,</span></span><br><span class="line">          <span class="attr">&quot;Params&quot;</span><span class="punctuation">:</span></span><br><span class="line">          <span class="punctuation">&#123;</span></span><br><span class="line">            <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;FreezerState&quot;</span><span class="punctuation">,</span></span><br><span class="line">            <span class="attr">&quot;Value&quot;</span><span class="punctuation">:</span> <span class="string">&quot;0&quot;</span></span><br><span class="line">          <span class="punctuation">&#125;</span></span><br><span class="line">        <span class="punctuation">&#125;</span></span><br><span class="line">      <span class="punctuation">]</span></span><br><span class="line">    <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">    ...</span><br><span class="line">  <span class="punctuation">]</span><span class="punctuation">,</span></span><br><span class="line"></span><br><span class="line">  <span class="attr">&quot;AggregateProfiles&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;SCHED_SP_BACKGROUND&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Profiles&quot;</span><span 
class="punctuation">:</span> <span class="punctuation">[</span> <span class="string">&quot;HighEnergySaving&quot;</span><span class="punctuation">,</span> <span class="string">&quot;LowIoPriority&quot;</span><span class="punctuation">,</span> <span class="string">&quot;TimerSlackHigh&quot;</span> <span class="punctuation">]</span></span><br><span class="line">    <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;SCHED_SP_FOREGROUND&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Profiles&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span> <span class="string">&quot;HighPerformance&quot;</span><span class="punctuation">,</span> <span class="string">&quot;HighIoPriority&quot;</span><span class="punctuation">,</span> <span class="string">&quot;TimerSlackNormal&quot;</span> <span class="punctuation">]</span></span><br><span class="line">    <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;SCHED_SP_TOP_APP&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Profiles&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span> <span class="string">&quot;MaxPerformance&quot;</span><span class="punctuation">,</span> <span class="string">&quot;MaxIoPriority&quot;</span><span class="punctuation">,</span> <span class="string">&quot;TimerSlackNormal&quot;</span> <span class="punctuation">]</span></span><br><span class="line">    <span 
class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">    ...</span><br><span class="line">  <span class="punctuation">]</span></span><br><span class="line"><span class="punctuation">&#125;</span></span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p><code>SetProcessProfiles</code> calls <code>TaskProfiles.SetProcessProfiles</code> to freeze the process. That function iterates over the requested <code>profiles</code>, looks each one up by name (here, the <code>profile</code> named <code>Frozen</code>), and then calls <code>TaskProfile.ExecuteForProcess</code> to apply it, which performs the actual freeze.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment">//processgroup.cpp</span></span><br><span class="line"><span class="type">bool</span> <span class="title function_">SetProcessProfiles</span><span class="params">(<span class="type">uid_t</span> uid, <span class="type">pid_t</span> pid, <span class="type">const</span> <span class="built_in">std</span>::<span class="built_in">vector</span>&lt;<span class="built_in">std</span>::<span class="built_in">string</span>&gt;&amp; 
profiles)</span> &#123;</span><br><span class="line">    <span class="keyword">return</span> TaskProfiles::GetInstance().SetProcessProfiles(uid, pid, profiles, <span class="literal">false</span>);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">//task_profiles.cpp</span></span><br><span class="line"><span class="type">bool</span> <span class="title function_">TaskProfiles::SetProcessProfiles</span><span class="params">(<span class="type">uid_t</span> uid, <span class="type">pid_t</span> pid,</span></span><br><span class="line"><span class="params">                                      <span class="type">const</span> <span class="built_in">std</span>::<span class="built_in">vector</span>&lt;<span class="built_in">std</span>::<span class="built_in">string</span>&gt;&amp; profiles, <span class="type">bool</span> use_fd_cache)</span> &#123;</span><br><span class="line">    <span class="keyword">for</span> (<span class="type">const</span> <span class="keyword">auto</span>&amp; name : profiles) &#123;</span><br><span class="line">        TaskProfile* profile = GetProfile(name);</span><br><span class="line">        <span class="keyword">if</span> (profile != nullptr) &#123;</span><br><span class="line">            <span class="keyword">if</span> (use_fd_cache) &#123;</span><br><span class="line">                profile-&gt;EnableResourceCaching(ProfileAction::RCT_PROCESS);</span><br><span class="line">            &#125;</span><br><span class="line">            <span class="keyword">if</span> (!profile-&gt;ExecuteForProcess(uid, pid)) &#123;</span><br><span class="line">                PLOG(WARNING) &lt;&lt; <span class="string">&quot;Failed to apply &quot;</span> &lt;&lt; name &lt;&lt; <span class="string">&quot; process profile&quot;</span>;</span><br><span class="line">            &#125;</span><br><span class="line">        &#125; <span class="keyword">else</span> &#123;</span><br><span class="line">   
         PLOG(WARNING) &lt;&lt; <span class="string">&quot;Failed to find &quot;</span> &lt;&lt; name &lt;&lt; <span class="string">&quot;process profile&quot;</span>;</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="keyword">return</span> <span class="literal">true</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>For the <code>SetAttribute</code> action of the profile, the work ends up in <code>ExecuteForTask</code>, which first obtains the corresponding <code>cgroup</code> path through its <code>ProfileAttribute</code> and then uses <code>WriteStringToFile</code> to write the <code>FreezerState</code> value into the corresponding <code>cgroup.freeze</code> file:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment">//task_profiles.cpp</span></span><br><span class="line"><span class="type">bool</span> <span class="title function_">SetAttributeAction::ExecuteForTask</span><span class="params">(<span class="type">int</span> tid)</span> <span class="type">const</span> &#123;</span><br><span class="line">    <span class="built_in">std</span>::<span class="built_in">string</span> path;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (!attribute_-&gt;GetPathForTask(tid, &amp;path)) &#123;</span><br><span class="line">        LOG(ERROR) &lt;&lt; <span 
class="string">&quot;Failed to find cgroup for tid &quot;</span> &lt;&lt; tid;</span><br><span class="line">        <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (!WriteStringToFile(value_, path)) &#123;</span><br><span class="line">        PLOG(ERROR) &lt;&lt; <span class="string">&quot;Failed to write &#x27;&quot;</span> &lt;&lt; value_ &lt;&lt; <span class="string">&quot;&#x27; to &quot;</span> &lt;&lt; path;</span><br><span class="line">        <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> <span class="literal">true</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p><code>GetPathForTask</code> obtains the task's <code>cgroup</code> sub-path via <code>controller()-&gt;GetTaskGroup</code>, then uses <code>StringPrintf</code> to join the controller path, that sub-path, and the <code>cgroup.freeze</code> file name. The resulting path has the form <code>/sys/fs/cgroup/&lt;uid&gt;/&lt;pid&gt;/cgroup.freeze</code>: writing <code>1</code> to this file freezes the process, and writing <code>0</code> unfreezes it.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span 
class="line">21</span><br><span class="line">22</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment">//task_profiles.cpp</span></span><br><span class="line"><span class="type">bool</span> <span class="title function_">ProfileAttribute::GetPathForTask</span><span class="params">(<span class="type">int</span> tid, <span class="built_in">std</span>::<span class="built_in">string</span>* path)</span> <span class="type">const</span> &#123;</span><br><span class="line">    <span class="built_in">std</span>::<span class="built_in">string</span> subgroup;</span><br><span class="line">    <span class="keyword">if</span> (!controller()-&gt;GetTaskGroup(tid, &amp;subgroup)) &#123;</span><br><span class="line">        <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (path == nullptr) &#123;</span><br><span class="line">        <span class="keyword">return</span> <span class="literal">true</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (subgroup.empty()) &#123;</span><br><span class="line">        *path = StringPrintf(<span class="string">&quot;%s/%s&quot;</span>, controller()-&gt;path(), file_name_.c_str());</span><br><span class="line">    &#125; <span class="keyword">else</span> &#123;</span><br><span class="line">        *path = StringPrintf(<span class="string">&quot;%s/%s/%s&quot;</span>, controller()-&gt;path(), subgroup.c_str(),</span><br><span class="line">                             file_name_.c_str());</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="keyword">return</span> <span class="literal">true</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span 
class="line"></span><br></pre></td></tr></table></figure><p><code>GetTaskGroup</code> first uses the process <code>pid</code> to look up the group information of the <code>cgroup</code> it belongs to, read from <code>/proc/&lt;pid&gt;/cgroup</code>. The freezer group is a special case: it lives in the cgroup v2 hierarchy, so its entry starts with <code>0::</code>, whereas entries for the other (v1) controllers start with a hierarchy id followed by the controller name, in the form <code>1:</code>.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="type">bool</span> <span class="title function_">CgroupController::GetTaskGroup</span><span class="params">(<span class="type">int</span> tid, <span class="built_in">std</span>::<span class="built_in">string</span>* group)</span> <span class="type">const</span> &#123;</span><br><span class="line">    <span class="built_in">std</span>::<span class="built_in">string</span> file_name = 
StringPrintf(<span class="string">&quot;/proc/%d/cgroup&quot;</span>, tid);</span><br><span class="line">    <span class="built_in">std</span>::<span class="built_in">string</span> content;</span><br><span class="line">    <span class="keyword">if</span> (!android::base::ReadFileToString(file_name, &amp;content)) &#123;</span><br><span class="line">        PLOG(ERROR) &lt;&lt; <span class="string">&quot;Failed to read &quot;</span> &lt;&lt; file_name;</span><br><span class="line">        <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// if group is null and tid exists return early because</span></span><br><span class="line">    <span class="comment">// user is not interested in cgroup membership</span></span><br><span class="line">    <span class="keyword">if</span> (group == nullptr) &#123;</span><br><span class="line">        <span class="keyword">return</span> <span class="literal">true</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="built_in">std</span>::<span class="built_in">string</span> cg_tag;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (version() == <span class="number">2</span>) &#123;</span><br><span class="line">        cg_tag = <span class="string">&quot;0::&quot;</span>;</span><br><span class="line">    &#125; <span class="keyword">else</span> &#123;</span><br><span class="line">        cg_tag = StringPrintf(<span class="string">&quot;:%s:&quot;</span>, name());</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="type">size_t</span> start_pos = content.find(cg_tag);</span><br><span class="line">    <span class="keyword">if</span> (start_pos == <span class="built_in">std</span>::<span class="built_in">string</span>::npos) &#123;</span><br><span 
class="line">        <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    start_pos += cg_tag.length() + <span class="number">1</span>;  <span class="comment">// skip &#x27;/&#x27;</span></span><br><span class="line">    <span class="type">size_t</span> end_pos = content.find(<span class="string">&#x27;\n&#x27;</span>, start_pos);</span><br><span class="line">    <span class="keyword">if</span> (end_pos == <span class="built_in">std</span>::<span class="built_in">string</span>::npos) &#123;</span><br><span class="line">        *group = content.substr(start_pos, <span class="built_in">std</span>::<span class="built_in">string</span>::npos);</span><br><span class="line">    &#125; <span class="keyword">else</span> &#123;</span><br><span class="line">        *group = content.substr(start_pos, end_pos - start_pos);</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> <span class="literal">true</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Writing to the <code>cgroup.freeze</code> file lands in the kernel function <code>cgroup_freeze_write</code>, which calls <code>cgroup_freeze</code> to put every task in this group and in all of its descendant groups into the <code>FROZEN</code> state:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span 
class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="type">static</span> <span class="type">ssize_t</span> <span class="title function_">cgroup_freeze_write</span><span class="params">(<span class="keyword">struct</span> kernfs_open_file *of,</span></span><br><span class="line"><span class="params">   <span class="type">char</span> *buf, <span class="type">size_t</span> nbytes, <span class="type">loff_t</span> off)</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">cgroup</span> *<span class="title">cgrp</span>;</span></span><br><span class="line"><span class="type">ssize_t</span> ret;</span><br><span class="line"><span class="type">int</span> freeze;</span><br><span class="line"></span><br><span class="line">ret = kstrtoint(strstrip(buf), <span class="number">0</span>, &amp;freeze);</span><br><span class="line"><span class="keyword">if</span> (ret)</span><br><span class="line"><span class="keyword">return</span> ret;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (freeze &lt; <span class="number">0</span> || freeze &gt; <span class="number">1</span>)</span><br><span class="line"><span class="keyword">return</span> -ERANGE;</span><br><span class="line"></span><br><span class="line">cgrp = cgroup_kn_lock_live(of-&gt;kn, <span class="literal">false</span>);</span><br><span class="line"><span class="keyword">if</span> (!cgrp)</span><br><span class="line"><span class="keyword">return</span> -ENOENT;</span><br><span class="line"></span><br><span class="line">cgroup_freeze(cgrp, 
freeze);</span><br><span class="line"></span><br><span class="line">cgroup_kn_unlock(of-&gt;kn);</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> nbytes;</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Each individual task is frozen by <code>cgroup_freeze_task</code>: it sets the <code>JOBCTL_TRAP_FREEZE</code> bit in <code>task-&gt;jobctl</code> to freeze the task and clears that bit to unfreeze it. Note that the kernel does not freeze a task by sending it a signal directly. Instead it first sets the <code>JOBCTL_TRAP_FREEZE</code> bit, then calls <code>signal_wake_up</code>, which marks the task as having signal work pending (via <code>set_tsk_thread_flag</code>) and wakes it up. Once woken, the task heads back toward user space, handles its pending signals on the return path, and eventually reaches <code>get_signal</code>, where the freeze actually takes place.</p><blockquote><p>For a detailed walkthrough of the kernel freeze path, see <a href="https://kernel.meizu.com/2024/07/12/sub-system-cgroup-freezer-in-Linux-kernel/">深入探究 Linux 内核中的 cgroup freezer 子系统</a> (an in-depth look at the cgroup freezer subsystem in the Linux kernel).</p></blockquote><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment">//kernel/cgroup/freezer.c</span></span><br><span class="line"><span class="comment">/*</span></span><br><span 
class="line"><span class="comment"> * Freeze or unfreeze the task by setting or clearing the JOBCTL_TRAP_FREEZE</span></span><br><span class="line"><span class="comment"> * jobctl bit.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="type">static</span> <span class="type">void</span> <span class="title function_">cgroup_freeze_task</span><span class="params">(<span class="keyword">struct</span> task_struct *task, <span class="type">bool</span> freeze)</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="type">unsigned</span> <span class="type">long</span> flags;</span><br><span class="line"></span><br><span class="line"><span class="comment">/* If the task is about to die, don&#x27;t bother with freezing it. */</span></span><br><span class="line"><span class="keyword">if</span> (!lock_task_sighand(task, &amp;flags))</span><br><span class="line"><span class="keyword">return</span>;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (freeze) &#123;</span><br><span class="line">task-&gt;jobctl |= JOBCTL_TRAP_FREEZE;</span><br><span class="line">signal_wake_up(task, <span class="literal">false</span>);</span><br><span class="line">&#125; <span class="keyword">else</span> &#123;</span><br><span class="line">task-&gt;jobctl &amp;= ~JOBCTL_TRAP_FREEZE;</span><br><span class="line">wake_up_process(task);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">unlock_task_sighand(task, &amp;flags);</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p><code>get_signal</code> checks whether the current task has signals to handle and also checks the <code>JOBCTL_TRAP_FREEZE</code> flag. If the flag is set, it calls <code>do_freezer_trap</code> to freeze the task. This is the last function a frozen task executes, which can be confirmed by inspecting the task's stack once it is frozen.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span 
class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="type">bool</span> <span class="title function_">get_signal</span><span class="params">(<span class="keyword">struct</span> ksignal *ksig)</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">sighand_struct</span> *<span class="title">sighand</span> =</span> current-&gt;sighand;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">signal_struct</span> *<span 
class="title">signal</span> =</span> current-&gt;signal;</span><br><span class="line"><span class="type">int</span> signr;</span><br><span class="line"></span><br><span class="line">    ...</span><br><span class="line"><span class="keyword">for</span> (;;) &#123;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">k_sigaction</span> *<span class="title">ka</span>;</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (unlikely(current-&gt;jobctl &amp; JOBCTL_STOP_PENDING) &amp;&amp;</span><br><span class="line">    do_signal_stop(<span class="number">0</span>))</span><br><span class="line"><span class="keyword">goto</span> relock;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (unlikely(current-&gt;jobctl &amp;</span><br><span class="line">     (JOBCTL_TRAP_MASK | JOBCTL_TRAP_FREEZE))) &#123;</span><br><span class="line"><span class="keyword">if</span> (current-&gt;jobctl &amp; JOBCTL_TRAP_MASK) &#123;</span><br><span class="line">do_jobctl_trap();</span><br><span class="line">spin_unlock_irq(&amp;sighand-&gt;siglock);</span><br><span class="line">            <span class="comment">//the else branch below calls do_freezer_trap() to freeze the task</span></span><br><span class="line">&#125; <span class="keyword">else</span> <span class="keyword">if</span> (current-&gt;jobctl &amp; JOBCTL_TRAP_FREEZE)</span><br><span class="line">do_freezer_trap();</span><br><span class="line"></span><br><span class="line"><span class="keyword">goto</span> relock;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * If the task is leaving the frozen state, let&#x27;s update</span></span><br><span class="line"><span class="comment"> * cgroup counters and reset the frozen bit.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span 
class="line"><span class="keyword">if</span> (unlikely(cgroup_task_frozen(current))) &#123;</span><br><span class="line">spin_unlock_irq(&amp;sighand-&gt;siglock);</span><br><span class="line">cgroup_leave_frozen(<span class="literal">false</span>);</span><br><span class="line"><span class="keyword">goto</span> relock;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">        ...</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> ksig-&gt;sig &gt; <span class="number">0</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p><code>do_freezer_trap</code> does just three things:</p><ul><li>Sets the current task's state to <code>TASK_INTERRUPTIBLE</code> and clears the <code>TIF_SIGPENDING</code> flag</li><li>Calls <code>cgroup_enter_frozen</code> to mark the current task as <code>FROZEN</code> and update the state of its group</li><li>Calls <code>freezable_schedule</code> to invoke the scheduler: the frozen task is taken off the run queue and sleeps while the CPU switches to another task</li></ul><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment">/**</span></span><br><span class="line"><span class="comment"> * do_freezer_trap - handle the 
freezer jobctl trap</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="type">static</span> <span class="type">void</span> <span class="title function_">do_freezer_trap</span><span class="params">(<span class="type">void</span>)</span></span><br><span class="line">__<span class="title function_">releases</span><span class="params">(&amp;current-&gt;sighand-&gt;siglock)</span></span><br><span class="line">&#123;</span><br><span class="line"></span><br><span class="line">    ...</span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * Now we&#x27;re sure that there is no pending fatal signal and no</span></span><br><span class="line"><span class="comment"> * pending traps. Clear TIF_SIGPENDING to not get out of schedule()</span></span><br><span class="line"><span class="comment"> * immediately (if there is a non-fatal signal pending), and</span></span><br><span class="line"><span class="comment"> * put the task into sleep.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line">__set_current_state(TASK_INTERRUPTIBLE);</span><br><span class="line">clear_thread_flag(TIF_SIGPENDING);</span><br><span class="line">spin_unlock_irq(&amp;current-&gt;sighand-&gt;siglock);</span><br><span class="line">cgroup_enter_frozen();</span><br><span class="line">freezable_schedule();</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Once the process is fully frozen, <code>ps -A</code> shows it in state <code>S</code> with <code>do_freezer_trap</code> as its wait channel (<code>wait channel</code>). Its kernel stack confirms that it entered the frozen state through the signal-handling path.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span 
class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment">#ps -A|grep -i rknn</span></span><br><span class="line">root           873     1 10972640  3448 do_freezer_trap     0 S rknn_server</span><br><span class="line"></span><br><span class="line"><span class="comment"># cat /proc/873/stack</span></span><br><span class="line">[&lt;0&gt;] __switch_to+0x118/0x148</span><br><span class="line">[&lt;0&gt;] do_freezer_trap+0x64/0xbc</span><br><span class="line">[&lt;0&gt;] get_signal+0x370/0x77c</span><br><span class="line">[&lt;0&gt;] do_signal+0xa0/0x298</span><br><span class="line">[&lt;0&gt;] do_notify_resume+0xac/0x218</span><br><span class="line">[&lt;0&gt;] work_pending+0xc/0x76c</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><h2 id="Android进程冻结策略"><a href="#Android进程冻结策略" class="headerlink" title="Android进程冻结策略"></a><strong>Android Process Freezing Policy</strong></h2><p><code>Android</code> proactively updates the <code>adj</code> value of every app in the system on events such as process start, service binding, an app moving between foreground and background, and sending&#x2F;receiving broadcasts. The smaller the <code>adj</code> value, the higher the process priority, the longer the process survives, and the less likely the system is to kill it. When an app sits in the background with no activity for a long time, the system adjusts its <code>adj</code> value; when resources run short (for example, low memory), the system reclaims (freezes or kills) processes with large <code>adj</code> values (<code>CACHED_APP_MIN_ADJ(900)&lt;=adj&lt;=CACHED_APP_MAX_ADJ(999)</code>).</p><p>The core logic for adjusting an app's <code>adj</code> value lives in the <code>OomAdjuster</code> class. After all <code>adj</code> values have been updated, if a process's <code>adj</code> is greater than or equal to <code>CACHED_APP_MIN_ADJ</code>, the system tries to freeze it by calling <code>CachedAppOptimizer.freezeAppAsyncLSP</code>. The call chain is roughly as follows:</p><figure class="highlight java"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span 
class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment">//OomAdjuster.java</span></span><br><span class="line">updateOomAdjLocked -&gt; updateOomAdjLSP -&gt; performUpdateOomAdjLSP </span><br><span class="line">-&gt; updateOomAdjInnerLSP -&gt; updateAndTrimProcessLSP -&gt; applyOomAdjLSP</span><br><span class="line">-&gt; updateAppFreezeStateLSP</span><br><span class="line"><span class="comment">//CachedAppOptimizer.java</span></span><br><span class="line">-&gt; freezeAppAsyncLSP</span><br><span class="line"></span><br></pre></td></tr></table></figure><p><code>updateAppFreezeStateLSP</code> first checks whether the process freezer is enabled. It is enabled by default and can be switched on or off through two settings (the global settings database takes precedence):</p><ul><li>Global settings <code>Settings.Global.CACHED_APPS_FREEZER_ENABLED</code>: a switch stored in the system settings database, for example <code>adb shell settings put global cached_apps_freezer enabled</code></li><li>The <code>use_freezer</code> flag in <code>DeviceConfig</code>, for example <code>adb shell device_config put activity_manager_native_boot use_freezer true</code></li></ul><p>If neither setting is enabled, the freezer is not in use and the function returns immediately. Otherwise, if the process's <code>adj</code> is greater than or equal to <code>CACHED_APP_MIN_ADJ</code> and the process is not already frozen, <code>freezeAppAsyncLSP</code> is called to freeze it.</p><figure class="highlight java"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span 
class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment">//OomAdjuster.java</span></span><br><span class="line"><span class="keyword">private</span> <span class="keyword">void</span> <span class="title function_">updateAppFreezeStateLSP</span><span class="params">(ProcessRecord app)</span> &#123;</span><br><span class="line">    <span class="keyword">if</span> (!mCachedAppOptimizer.useFreezer()) &#123;</span><br><span class="line">        <span class="keyword">return</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (app.mOptRecord.isFreezeExempt()) &#123;</span><br><span class="line">        <span class="keyword">return</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">final</span> <span class="type">ProcessCachedOptimizerRecord</span> <span class="variable">opt</span> <span class="operator">=</span> app.mOptRecord;</span><br><span class="line">    <span class="comment">// if an app is already frozen and shouldNotFreeze becomes true, immediately unfreeze</span></span><br><span class="line">    <span class="keyword">if</span> (opt.isFrozen() &amp;&amp; opt.shouldNotFreeze()) &#123;</span><br><span class="line">        mCachedAppOptimizer.unfreezeAppLSP(app);</span><br><span class="line">        <span class="keyword">return</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">final</span> <span class="type">ProcessStateRecord</span> <span class="variable">state</span> <span 
class="operator">=</span> app.mState;</span><br><span class="line">    <span class="comment">// Use current adjustment when freezing, set adjustment when unfreezing.</span></span><br><span class="line">    <span class="keyword">if</span> (state.getCurAdj() &gt;= ProcessList.CACHED_APP_MIN_ADJ &amp;&amp; !opt.isFrozen()</span><br><span class="line">            &amp;&amp; !opt.shouldNotFreeze()) &#123;</span><br><span class="line">        mCachedAppOptimizer.freezeAppAsyncLSP(app);</span><br><span class="line">    &#125; <span class="keyword">else</span> <span class="keyword">if</span> (state.getSetAdj() &lt; ProcessList.CACHED_APP_MIN_ADJ) &#123;</span><br><span class="line">        mCachedAppOptimizer.unfreezeAppLSP(app);</span><br><span class="line">    &#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p><code>freezeAppAsyncLSP</code> does not freeze the process right away. Instead, it uses <code>mFreezeHandler</code> to post a <code>SET_FROZEN_PROCESS_MSG</code> message with a delay of <code>10</code> minutes; if the process's <code>adj</code> does not drop back below <code>CACHED_APP_MIN_ADJ</code> during that window, the freeze is carried out.</p><figure class="highlight java"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment">//CachedAppOptimizer.java   </span></span><br><span class="line"><span 
class="meta">@GuardedBy(&#123;&quot;mAm&quot;, &quot;mProcLock&quot;&#125;)</span></span><br><span class="line"><span class="keyword">void</span> <span class="title function_">freezeAppAsyncLSP</span><span class="params">(ProcessRecord app)</span> &#123;</span><br><span class="line">    <span class="keyword">final</span> <span class="type">ProcessCachedOptimizerRecord</span> <span class="variable">opt</span> <span class="operator">=</span> app.mOptRecord;</span><br><span class="line">    <span class="keyword">if</span> (opt.isPendingFreeze()) &#123;</span><br><span class="line">        <span class="comment">// Skip redundant DO_FREEZE message</span></span><br><span class="line">        <span class="keyword">return</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    mFreezeHandler.sendMessageDelayed(</span><br><span class="line">            mFreezeHandler.obtainMessage(</span><br><span class="line">                SET_FROZEN_PROCESS_MSG, DO_FREEZE, <span class="number">0</span>, app),</span><br><span class="line">            mFreezerDebounceTimeout);</span><br><span class="line">    opt.setPendingFreeze(<span class="literal">true</span>);</span><br><span class="line">    <span class="keyword">if</span> (DEBUG_FREEZER) &#123;</span><br><span class="line">        Slog.d(TAG_AM, <span class="string">&quot;Async freezing &quot;</span> + app.getPid() + <span class="string">&quot; &quot;</span> + app.processName);</span><br><span class="line">    &#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><h2 id="总结"><a href="#总结" class="headerlink" title="总结"></a><strong>Summary</strong></h2><p>The core goal of process freezing is to proactively freeze background apps that have been inactive for a long time when <code>Android</code> is short on memory, releasing resources, which saves power and improves system performance. For now, though, <code>Android</code>'s implementation of process freezing is not complete, and there is still room for improvement, for example:</p><ul><li>Freezing decisions only consider memory; the usage of other system resources such as CPU and I/O is not taken into account</li><li>Freezing currently only covers <code>Java</code>-layer apps; <code>Native</code> processes cannot be frozen</li></ul><h2 
id="参考文献"><a href="#参考文献" class="headerlink" title="参考文献"></a><strong>References</strong></h2><ul><li><a href="https://gityuan.com/2018/05/19/android-process-adj/">https://gityuan.com/2018/05/19/android-process-adj/</a></li><li><a href="https://android.googlesource.com/platform/frameworks/base/+/master/services/core/java/com/android/server/am/OomAdjuster.md">https://android.googlesource.com/platform/frameworks/base/+/master/services/core/java/com/android/server/am/OomAdjuster.md</a></li><li><a href="https://sniffer.site/2024/04/15/%E5%A6%82%E4%BD%95%E5%88%A9%E7%94%A8cgroup%E4%BC%98%E5%8C%96android%E7%B3%BB%E7%BB%9F%E6%80%A7%E8%83%BD">https://sniffer.site/2024/04/15/%E5%A6%82%E4%BD%95%E5%88%A9%E7%94%A8cgroup%E4%BC%98%E5%8C%96android%E7%B3%BB%E7%BB%9F%E6%80%A7%E8%83%BD</a></li><li><a href="https://kernel.meizu.com/2024/07/12/sub-system-cgroup-freezer-in-Linux-kernel/">https://kernel.meizu.com/2024/07/12/sub-system-cgroup-freezer-in-Linux-kernel/</a></li></ul>]]></content>
    
    
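    <summary type="html">&lt;p&gt;Starting with &lt;code&gt;Android 11&lt;/code&gt;, &lt;code&gt;Google&lt;/code&gt; supports app freezing, which suspends background tasks that have not run for a long time. Cached background apps are frozen by migrating their processes into a dedicated &lt;code&gt;cgroup&lt;/code&gt;, which cuts CPU and memory usage, curbs misbehavior by background work, and reduces power consumption as much as possible. This article describes in detail how &lt;code&gt;Android&lt;/code&gt; implements process freezing and what its freezing policy is, in the following parts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Overall framework of &lt;code&gt;Android&lt;/code&gt; process freezing: the general architecture and approach&lt;/li&gt;
&lt;li&gt;How &lt;code&gt;Android&lt;/code&gt; process freezing works: how &lt;code&gt;Android&lt;/code&gt; implements process freezing&lt;/li&gt;
&lt;li&gt;The freezing policy of &lt;code&gt;Android&lt;/code&gt; process freezing: the concrete rules for when a process is frozen&lt;/li&gt;
&lt;/ul&gt;</summary>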
    
    
    
    <category term="Android" scheme="https://sniffer.site/categories/Android/"/>
    
    
    <category term="Android" scheme="https://sniffer.site/tags/Android/"/>
    
    <category term="进程冻结" scheme="https://sniffer.site/tags/%E8%BF%9B%E7%A8%8B%E5%86%BB%E7%BB%93/"/>
    
    <category term="cgroup" scheme="https://sniffer.site/tags/cgroup/"/>
    
  </entry>
  
  <entry>
    <title>Pitfalls We Hit with Linux Real-Time Scheduling</title>
    <link href="https://sniffer.site/2024/11/25/Linux%E5%AE%9E%E6%97%B6%E8%B0%83%E5%BA%A6%E8%B8%A9%E5%9D%91%E7%BB%8F%E9%AA%8C%E4%B9%8B%E8%B0%88/"/>
    <id>https://sniffer.site/2024/11/25/Linux%E5%AE%9E%E6%97%B6%E8%B0%83%E5%BA%A6%E8%B8%A9%E5%9D%91%E7%BB%8F%E9%AA%8C%E4%B9%8B%E8%B0%88/</id>
    <published>2024-11-25T11:03:31.000Z</published>
    <updated>2025-07-09T11:32:02.719Z</updated>
    
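    <content type="html"><![CDATA[<p>Early Linux kernel scheduling focused mainly on fairness and throughput and offered little support for real-time workloads. To improve response time and reduce scheduling latency for real-time tasks in certain scenarios, the kernel has supported real-time scheduling and preemption since version <code>2.6</code>. Developers also set up a dedicated <a href="https://wiki.linuxfoundation.org/realtime/start">Real-Time Linux website</a> that documents the history and patches of the real-time kernel. Real-time scheduling is essential for highly time-sensitive work such as audio/video and UI rendering. On <code>Android</code>, for example, some core audio and rendering tasks use a real-time scheduling policy to reduce the delay caused by scheduling latency and preemption. The Linux kernel has two main real-time scheduling policies:</p><ul><li><code>SCHED_FIFO</code>: first in, first out. The highest-priority task runs first and is not preempted by tasks of equal or lower priority until it blocks or voluntarily yields the CPU</li><li><code>SCHED_RR</code>: round-robin. Tasks of the same priority take turns running for equal time slices; when a slice is used up, another task of that priority is scheduled</li></ul><blockquote><p>This article is based on Linux kernel 5.10</p></blockquote><span id="more"></span><p>We can inspect the real-time tasks on a system with <code>top -H</code>; threads whose <code>PR</code> column shows <code>RT</code> use real-time scheduling.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">1216 audioserver  RT   0 119M  62M  10M S  4.6   0.5   0:08.70 DSP00Task0      android.hardware.audio.service</span><br><span class="line">226 root         RT   0    0    0    0 S  1.3   0.0   0:03.86 irq/135-asm330l [irq/135-asm330l]</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Although real-time scheduling suits time-sensitive tasks well, on a multitasking system with many processes and high load it can have side effects. Two kinds of problems are common:</p><ul><li>A real-time process takes the CPU from non-real-time tasks and holds it for a long time, lowering system throughput and causing performance problems</li><li>Because of badly chosen priorities, a higher-priority real-time task preempts lower-priority real-time tasks, delaying some of their work</li></ul><p>Next, let's look at how these two kinds of problems show up and how to avoid them in practice. Before that, here is a quick look at how the Linux kernel implements its real-time scheduling policies.</p><h2 id="Linux内核的实时调度"><a href="#Linux内核的实时调度" class="headerlink" title="Linux内核的实时调度"></a><strong>Real-Time Scheduling in the Linux Kernel</strong></h2><p>Besides the regular fair scheduler <code>CFS (Completely Fair Scheduler)</code>, the Linux kernel supports two real-time scheduling types:</p><ul><li>Round-robin scheduling (<code>Round-robin</code>, <code>SCHED_RR</code>): real-time tasks under this policy have a fixed time slice (<code>100ms</code> by default). The slice shrinks as the task runs; once it is used up, the task is switched out and placed at the tail of the run queue to wait for its next turn, so that tasks of the same priority run in turn</li><li>First-in, first-out scheduling (<code>First-In, 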
First-Out</code>, <code>SCHED_FIFO</code>): this policy has no time slice; once scheduled, a task keeps the CPU until it blocks, yields, or is preempted by a higher-priority task. If a bug keeps the task from ever giving up the CPU (a busy loop, for example), the CPU can be held for a long time without the task being switched out.</li></ul><p>The kernel source shows that when picking a task, the kernel walks the scheduling classes from the highest priority to the lowest, and the real-time class <code>rt_sched_class</code> ranks above the fair class <code>fair_sched_class</code> (<code>SCHED_NORMAL</code>):</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment">// include/asm-generic/vmlinux.lds.h</span></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * The order of the sched class addresses are important, as they are</span></span><br><span class="line"><span class="comment"> * used to determine the order of the priority of each sched class in</span></span><br><span class="line"><span class="comment"> * relation to each other.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="meta">#<span class="keyword">define</span> SCHED_DATA\</span></span><br><span class="line"><span class="meta">STRUCT_ALIGN();\</span></span><br><span class="line"><span class="meta">__begin_sched_classes = .;\</span></span><br><span class="line"><span 
class="meta">*(__idle_sched_class)\</span></span><br><span class="line"><span class="meta">*(__fair_sched_class)\</span></span><br><span class="line"><span class="meta">*(__rt_sched_class)\</span></span><br><span class="line"><span class="meta">*(__dl_sched_class)\</span></span><br><span class="line"><span class="meta">*(__stop_sched_class)\</span></span><br><span class="line"><span class="meta">__end_sched_classes = .;</span></span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="meta">#<span class="keyword">define</span> sched_class_highest (__end_sched_classes - 1)</span></span><br><span class="line"><span class="meta">#<span class="keyword">define</span> sched_class_lowest  (__begin_sched_classes - 1)</span></span><br><span class="line"></span><br><span class="line"><span class="meta">#<span class="keyword">define</span> for_each_class(class) \</span></span><br><span class="line"><span class="meta">for_class_range(class, sched_class_highest, sched_class_lowest)</span></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Linux内核调度为每个调度类型都提供了一个关键的调度类，实时调度类如下：</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span 
class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">DEFINE_SCHED_CLASS(rt) = &#123;</span><br><span class="line"></span><br><span class="line">.enqueue_task= enqueue_task_rt,</span><br><span class="line">.dequeue_task= dequeue_task_rt,</span><br><span class="line">.yield_task= yield_task_rt,</span><br><span class="line"></span><br><span class="line">.check_preempt_curr= check_preempt_curr_rt,</span><br><span class="line"></span><br><span class="line">.pick_next_task= pick_next_task_rt,</span><br><span class="line">.put_prev_task= put_prev_task_rt,</span><br><span class="line">.set_next_task          = set_next_task_rt,</span><br><span class="line"></span><br><span class="line"><span class="meta">#<span class="keyword">ifdef</span> CONFIG_SMP</span></span><br><span class="line">.balance= balance_rt,</span><br><span class="line">.select_task_rq= select_task_rq_rt,</span><br><span class="line">.set_cpus_allowed       = set_cpus_allowed_common,</span><br><span class="line">.rq_online              = rq_online_rt,</span><br><span class="line">.rq_offline             = rq_offline_rt,</span><br><span class="line">.task_woken= task_woken_rt,</span><br><span class="line">.switched_from= switched_from_rt,</span><br><span class="line">.find_lock_rq= find_lock_lowest_rq,</span><br><span class="line"><span class="meta">#<span class="keyword">endif</span></span></span><br><span class="line"></span><br><span class="line">.task_tick= task_tick_rt,</span><br><span class="line"></span><br><span 
class="line">.get_rr_interval= get_rr_interval_rt,</span><br><span class="line"></span><br><span class="line">.prio_changed= prio_changed_rt,</span><br><span class="line">.switched_to= switched_to_rt,</span><br><span class="line"></span><br><span class="line">.update_curr= update_curr_rt,</span><br><span class="line"></span><br><span class="line"><span class="meta">#<span class="keyword">ifdef</span> CONFIG_UCLAMP_TASK</span></span><br><span class="line">.uclamp_enabled= <span class="number">1</span>,</span><br><span class="line"><span class="meta">#<span class="keyword">endif</span></span></span><br><span class="line">&#125;;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>The real-time run queue is an array of doubly linked lists, one per priority: all tasks with the same priority go into the list <code>active.queue[prio]</code> (there are <code>MAX_RT_PRIO</code> priority levels), and <code>active.bitmap</code> records which priority lists currently hold tasks. Readers interested in how real-time scheduling is implemented can study the kernel source in <code>kernel/sched/rt.c</code>.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">rt_prio_array</span> &#123;</span></span><br><span class="line">  DECLARE_BITMAP(bitmap, MAX_RT_PRIO+<span class="number">1</span>); <span class="comment">/* include 1 bit for delimiter */</span></span><br><span class="line">  <span class="class"><span class="keyword">struct</span> <span class="title">list_head</span> <span class="title">queue</span>[<span class="title">MAX_RT_PRIO</span>];</span></span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line"><span class="class"><span 
class="keyword">struct</span> <span class="title">rt_rq</span> &#123;</span></span><br><span class="line">  <span class="class"><span class="keyword">struct</span> <span class="title">rt_prio_array</span> <span class="title">active</span>;</span></span><br><span class="line">&#125;;</span><br><span class="line"></span><br></pre></td></tr></table></figure><h2 id="实时调度坑之一-实时任务长时间占据CPU"><a href="#实时调度坑之一-实时任务长时间占据CPU" class="headerlink" title="实时调度坑之一-实时任务长时间占据CPU"></a><strong>Pitfall 1: A Real-Time Task Hogs the CPU</strong></h2><p>In a past project we ran into this problem: a camera-related real-time task occupied <code>CPU0</code> for a long time, running continuously for <code>100+ms</code>. The audio-related softirq happened to be handled on <code>CPU0</code> as well (hardware interrupts are bound to <code>CPU0</code> by default, and a softirq is handled on the same CPU as the hardware interrupt that raised it). The audio softirq could not get the <code>CPU</code>, its response was delayed, and the audio stuttered.</p><p><img src="https://md-files.oss-cn-shenzhen.aliyuncs.com/linux-rt-process-long-run.jpg" alt="RT thread runs for too long"></p><p>So why did the softirq lose to a user-space real-time task? The root cause is that the kernel's softirq thread <code>ksoftirqd</code> is created with the <code>SCHED_NORMAL</code> fair scheduling policy by default, so it ranks below real-time (<code>RT</code>) tasks. That explains why the softirq could not get the <code>CPU</code> and the audio stuttered.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span 
class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment">// kernel/kthread.c</span></span><br><span class="line"><span class="type">static</span> <span class="type">int</span> <span class="title function_">kthread</span><span class="params">(<span class="type">void</span> *_create)</span></span><br><span class="line">&#123;</span><br><span class="line">    <span class="type">static</span> <span class="type">const</span> <span class="class"><span class="keyword">struct</span> <span class="title">sched_param</span> <span class="title">param</span> =</span> &#123; .sched_priority = <span class="number">0</span> &#125;;</span><br><span class="line">    <span class="comment">/* Copy data: it&#x27;s on kthread&#x27;s stack */</span></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">kthread_create_info</span> *<span class="title">create</span> =</span> _create;</span><br><span class="line">    <span class="type">int</span> (*threadfn)(<span class="type">void</span> *data) = create-&gt;threadfn;</span><br><span class="line">    <span class="type">void</span> *data = create-&gt;data;</span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">completion</span> *<span 
class="title">done</span>;</span></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">kthread</span> *<span class="title">self</span>;</span></span><br><span class="line">    <span class="type">int</span> ret;</span><br><span class="line"></span><br><span class="line">    self = to_kthread(current);</span><br><span class="line"></span><br><span class="line">    <span class="comment">/* Release the structure when caller killed by a fatal signal. */</span></span><br><span class="line">    done = xchg(&amp;create-&gt;done, <span class="literal">NULL</span>);</span><br><span class="line">    <span class="keyword">if</span> (!done) &#123;</span><br><span class="line">        kfree(create-&gt;full_name);</span><br><span class="line">        kfree(create);</span><br><span class="line">        kthread_exit(-EINTR);</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    self-&gt;full_name = create-&gt;full_name;</span><br><span class="line">    self-&gt;threadfn = threadfn;</span><br><span class="line">    self-&gt;data = data;</span><br><span class="line"></span><br><span class="line">    <span class="comment">/*</span></span><br><span class="line"><span class="comment">     * The new thread inherited kthreadd&#x27;s priority and CPU mask. 
Reset</span></span><br><span class="line"><span class="comment">     * back to default in case they have been changed.</span></span><br><span class="line"><span class="comment">     */</span></span><br><span class="line">    sched_setscheduler_nocheck(current, SCHED_NORMAL, &amp;param);</span><br><span class="line">    set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_TYPE_KTHREAD));</span><br><span class="line"></span><br><span class="line">    <span class="comment">/* OK, tell user we&#x27;re spawned, wait for stop or wakeup */</span></span><br><span class="line">    __set_current_state(TASK_UNINTERRUPTIBLE);</span><br><span class="line">    create-&gt;result = current;</span><br><span class="line">    <span class="comment">/*</span></span><br><span class="line"><span class="comment">     * Thread is going to call schedule(), do not preempt it,</span></span><br><span class="line"><span class="comment">     * or the creator may spend more time in wait_task_inactive().</span></span><br><span class="line"><span class="comment">     */</span></span><br><span class="line">    preempt_disable();</span><br><span class="line">    complete(done);</span><br><span class="line">    schedule_preempt_disabled();</span><br><span class="line">    preempt_enable();</span><br><span class="line">    ...</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>There are several ways to address this kind of problem. One is to switch the user-space thread back to the normal fair scheduling policy. Another is to configure <code>CPU</code> affinity for the hardware interrupt so that softirq handling is not pinned to one particular <code>CPU</code> and no longer collides with the real-time task. A third is to give the softirq thread a real-time scheduling policy, but that has wide-reaching effects and is not recommended.</p><h2 id="实时调度坑之二-优先级设置不当"><a href="#实时调度坑之二-优先级设置不当" class="headerlink" 
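title="实时调度坑之二-优先级设置不当"></a><strong>Pitfall 2: Badly Chosen Priorities</strong></h2><p>Unlike the first problem, the second is audio stutter caused by two real-time tasks competing with each other (most real-time tasks on <code>Android</code> are audio-related). After an app comes to the foreground, audio playback occasionally produces noise. (On Android the foreground process is in <code>top-app</code>, bound to <code>CPU0~3</code>; we had enabled an Android feature, <code>sys.use_fifo_ui</code>, which makes an app's <code>UI</code> thread use a real-time scheduling policy.) In the <code>trace</code> captured while reproducing the issue, the kernel audio thread (<code>297</code>) sleeps for long stretches (more than <code>13ms</code>) in several places, while the main thread <code>6600</code> of the foreground app runs on the same <code>CPU0</code>. The kernel audio thread <code>297</code> is woken only after the main thread finishes and releases <code>CPU0</code>; by then audio frames may already have been dropped, which produces the noise.</p><p><img src="https://md-files.oss-cn-shenzhen.aliyuncs.com/rt-task-cpu-contention.jpg" alt="RT tasks competing for the CPU"></p><p>So why can't the kernel audio thread, which is also a real-time task, preempt the <code>CPU</code>? From how real-time scheduling works, one would guess that its priority is lower than that of the foreground app's main thread. Checking on the device showed that the two priorities were in fact exactly equal, both <code>98</code> (the highest real-time priority available is <code>99</code>; <code>MAX_RT_PRIO</code> is <code>100</code>), and a real-time task cannot preempt another one of the same priority.</p><p>As with the previous problem, there are two options. One is to turn off <code>Android</code>'s main-thread optimization by setting <code>sys.use_fifo_ui</code> to <code>0</code>, which avoids the competition with the kernel audio thread. The other is to raise the priority of the kernel audio thread when it is created (for a real-time thread this means a higher <code>sched_priority</code>, since the <code>nice</code> value only affects <code>SCHED_NORMAL</code> tasks). Testing showed that option two works, and that is the fix we finally adopted.</p><h2 id="如何限制实时任务的执行时间"><a href="#如何限制实时任务的执行时间" class="headerlink" title="如何限制实时任务的执行时间"></a><strong>How to Limit the Run Time of Real-Time Tasks</strong></h2><p>Given how the real-time policies work, a bug in a real-time process can easily keep a CPU occupied for a long time and hang the system. To guard against this, the kernel limits how long real-time tasks may run. Two parameters under <code>/proc/sys/kernel</code> control the limit:</p><ul><li><code>sched_rt_period_us</code>: the scheduling period (think of it as 100% of the CPU bandwidth); the default is 1s</li><li><code>sched_rt_runtime_us</code>: the maximum time real-time tasks may run within one period, ranging from <code>-1</code> (no limit) to <code>INT_MAX-1</code>; the default is 0.95s, meaning real-time tasks may use 0.95s of CPU time and processes in other scheduling classes get the remaining 0.05s</li></ul><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment"># cat /proc/sys/kernel/sched_rt_period_us</span></span><br><span class="line">1000000</span><br><span class="line"><span 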
class="comment"># cat /proc/sys/kernel/sched_rt_runtime_us</span></span><br><span class="line">950000</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>By tuning these two parameters we can cap the CPU time given to real-time tasks and make sure other tasks get to run. In addition, with the kernel option <code>CONFIG_RT_GROUP_SCHED</code> enabled, the <code>cgroup</code> cpu controller can limit the real-time CPU time of individual groups. With the option on, the corresponding cgroup directory exposes the following settings:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">cpu.rt_period_us</span><br><span class="line">cpu.rt_runtime_us</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Setting these per group lets different groups follow different real-time budgets, so that real-time tasks in, say, a background group cannot hold the CPU for long and starve other tasks. Also, to keep real-time latency low, the kernel has a scheduler feature, <code>RT_RUNTIME_SHARE</code>, that lets CPUs share real-time runtime: when the real-time run queue on one CPU has used up its budget, it can borrow runtime from other CPUs, so the real-time task keeps running instead of being throttled in favor of regular tasks.</p><h2 id="总结"><a href="#总结" class="headerlink" title="总结"></a><strong>Summary</strong></h2><p>For a general-purpose operating system like <code>Linux</code>, process scheduling has to weigh many factors. It must serve low-latency work such as audio and UI rendering with short response times, and it must also run many tasks concurrently while keeping system throughput high. These two goals usually conflict and call for careful trade-offs. When using real-time scheduling policies, proceed with care to avoid problems caused by real-time tasks competing for the <code>CPU</code>.</p><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a><strong>References</strong></h2><ul><li><a href="https://wiki.linuxfoundation.org/realtime/documentation/technical_basics/sched_rt_throttling">https://wiki.linuxfoundation.org/realtime/documentation/technical_basics/sched_rt_throttling</a></li><li><a href="https://www.kernel.org/doc/Documentation/scheduler/sched-rt-group.txt">https://www.kernel.org/doc/Documentation/scheduler/sched-rt-group.txt</a></li><li><a href="https://wiki.linuxfoundation.org/realtime/start">https://wiki.linuxfoundation.org/realtime/start</a></li></ul>]]></content>
    
    
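    <summary type="html">&lt;p&gt;Early Linux kernel scheduling focused mainly on fairness and throughput and offered little support for real-time workloads. To improve response time and reduce scheduling latency for real-time tasks in certain scenarios, the kernel has supported real-time scheduling and preemption since version &lt;code&gt;2.6&lt;/code&gt;. Developers also set up a dedicated &lt;a href=&quot;https://wiki.linuxfoundation.org/realtime/start&quot;&gt;Real-Time Linux website&lt;/a&gt; that documents the history and patches of the real-time kernel. Real-time scheduling is essential for highly time-sensitive work such as audio/video and UI rendering. On &lt;code&gt;Android&lt;/code&gt;, for example, some core audio and rendering tasks use a real-time scheduling policy to reduce the delay caused by scheduling latency and preemption. The Linux kernel has two main real-time scheduling policies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;SCHED_FIFO&lt;/code&gt;: first in, first out. The highest-priority task runs first and is not preempted by tasks of equal or lower priority until it blocks or voluntarily yields the CPU&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SCHED_RR&lt;/code&gt;: round-robin. Tasks of the same priority take turns running for equal time slices; when a slice is used up, another task of that priority is scheduled&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is based on Linux kernel 5.10&lt;/p&gt;
&lt;/blockquote&gt;</summary>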
    
    
    
    <category term="Linux" scheme="https://sniffer.site/categories/Linux/"/>
    
    
    <category term="Kernel" scheme="https://sniffer.site/tags/Kernel/"/>
    
    <category term="实时调度" scheme="https://sniffer.site/tags/%E5%AE%9E%E6%97%B6%E8%B0%83%E5%BA%A6/"/>
    
    <category term="进程优先级" scheme="https://sniffer.site/tags/%E8%BF%9B%E7%A8%8B%E4%BC%98%E5%85%88%E7%BA%A7/"/>
    
  </entry>
  
  <entry>
    <title>An Introduction to DCVS on the Qualcomm QNX Platform</title>
    <link href="https://sniffer.site/2024/07/24/%E9%AB%98%E9%80%9AQNX%E5%B9%B3%E5%8F%B0DCVS%E4%BB%8B%E7%BB%8D/"/>
    <id>https://sniffer.site/2024/07/24/%E9%AB%98%E9%80%9AQNX%E5%B9%B3%E5%8F%B0DCVS%E4%BB%8B%E7%BB%8D/</id>
    <published>2024-07-24T11:17:15.000Z</published>
    <updated>2025-06-20T08:30:45.511Z</updated>
    
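    <content type="html"><![CDATA[<p>In today's smart-cockpit solutions, Qualcomm's two platforms <code>8155</code>/<code>8295</code> hold most of the market. Both hardware platforms are built on a <code>QNX</code>-based virtualization scheme, which means the center-console domain and the instrument-cluster domain run on one system: the center console is usually Android, which is actually a virtual machine on top of QNX, while the instrument cluster usually runs on the QNX side. Compared with a traditional Android-only system, the <code>QNX</code> virtualization platform changes a lot. Many physical drivers and system services run on QNX, and Android sees only a virtual device or nothing at all. Dynamic clock and voltage scaling, <code>DCVS (Dynamic Clock and Voltage Scaling)</code>, the subject of this article, is one example: it is no longer present on Android, and all frequency and voltage scaling is done on QNX.</p><blockquote><p><code>DCVS</code> is also known as <code>DVFS (Dynamic Voltage and Frequency Scaling)</code>; both adjust the operating frequency and voltage of the CPU/GPU/DDR and so on according to system load in order to reduce power consumption</p></blockquote><p>Next, let's look at how <code>DCVS</code> is implemented on the Qualcomm QNX platform and how to check the frequencies of the <code>CPU, GPU, UFS, DDR</code> on QNX.</p> <span id="more"></span><h2 id="QNX的DCVS实现原理"><a href="#QNX的DCVS实现原理" class="headerlink" title="QNX的DCVS实现原理"></a><strong>How DCVS Works on QNX</strong></h2><h3 id="CPU-DCVS"><a href="#CPU-DCVS" class="headerlink" title="CPU DCVS"></a><strong>CPU DCVS</strong></h3><p><code>DCVS</code> is a power-management strategy that dynamically adjusts core frequencies according to system load in order to reduce power consumption; it is most widely used on mobile platforms such as phones. The figure below is a block diagram of <code>CPU DCVS</code>. Its core functionality lives in a background service, <code>dcvs_service</code>, which interacts with other modules such as the <code>kernel</code>, <code>qcore</code>, and <code>io_service</code>:</p><ul><li>The algorithm that picks a frequency for the current system load is implemented in the QNX kernel; once it decides on a frequency, the kernel sends a request to the <code>qcore</code> process to apply it</li><li>If high system load triggers thermal protection, frequency control is handed over entirely to the <code>LMH(Limit Management Hardware)</code> module; once the temperature drops back to the configured threshold, <code>DCVS</code> is returned to the kernel. <code>dcvs_service</code> registers event callbacks with the thermal module (<code>Thermal LMH</code>); in the window between the start and stop callbacks, kernel dynamic frequency scaling is turned off and the <code>LMH</code> hardware manages the <code>CPU</code> frequency</li><li>The <code>QNX</code> kernel contains a <code>governor</code> responsible for the frequency-scaling policy. It monitors system load; once the load of a CPU cluster (<code>cluster</code>; the 8295 has two clusters, one for the big cores and one for the little cores) stays beyond a configured threshold for a given time, either high load (<code>overflow</code>) or low load (<code>underflow</code>), the governor raises an event. On receiving that event, the <code>DCVS</code> service sets the frequency to the one the kernel recommends</li></ul><p><img src="https://md-files.oss-cn-shenzhen.aliyuncs.com/qcom-qnx-dcvs.png" alt="qnx-dcvs"></p><blockquote><ul><li>qcore is the core power-management service on QNX</li><li>DLPMP(Dynamic power performance levels and 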
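multiprocessor control): dynamically adjusts the performance mode</li><li>LMH(Limit Management Hardware): protects the system against high and low temperatures</li></ul></blockquote><h3 id="如何进行DCVS参数设定"><a href="#如何进行DCVS参数设定" class="headerlink" title="如何进行DCVS参数设定"></a><strong>How to Configure DCVS Parameters</strong></h3><p>Taking the <code>8295</code> platform as an example, there are two <code>CPU</code> clusters (4 big cores and 4 little cores respectively), and each cluster has 10 frequency levels (see the table below):</p><table><thead><tr><th>SA8295 clusters</th><th>Levels</th></tr></thead><tbody><tr><td>cluster0</td><td>10 (1017 MHz ~ 2131 MHz)</td></tr><tr><td>cluster1</td><td>10 (1280 MHz ~ 2380 MHz)</td></tr></tbody></table><p><code>QNX</code> provides the <code>pdbg</code> interface for configuring <code>DCVS</code> parameters. Each frequency point has the following 4 important parameters:</p><ul><li><code>up_pct_thr</code>: the upper CPU-utilization threshold at the current frequency; for <code>freq10</code> the default is <code>90</code></li><li><code>up_time_thr_ms</code>: the (sustained) time threshold for the upper CPU-utilization threshold at the current frequency; for <code>freq10</code> the default is <code>100(ms)</code></li><li><code>down_pct_thr</code>: the lower CPU-utilization threshold at the current frequency; for <code>freq10</code> the default is <code>10</code></li><li><code>down_time_thr_ms</code>: the (sustained) time threshold for the lower CPU-utilization threshold at the current frequency; for <code>freq10</code> the default is <code>200(ms)</code></li></ul><p>For better performance, lower <code>up_pct_thr</code> a little so that the frequency is raised quickly once the load exceeds the threshold; conversely, if saving power matters more, raise <code>down_pct_thr</code> a little so that a drop in load triggers a frequency reduction sooner and reduces energy use. For example, in automotive use performance matters more, so we can lower the upper threshold for <code>CPU</code> frequency scaling somewhat (here from the default <code>90</code> to <code>80</code>):</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="built_in">echo</span> 80 &gt; /dev/pdbg/qcore/dcvs/kdcvs/cluster0/freq10/up_pct_thr</span><br><span class="line"><span class="built_in">echo</span> 80 &gt; /dev/pdbg/qcore/dcvs/kdcvs/cluster1/freq10/up_pct_thr</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>To turn off <code>DCVS</code> and pin the <code>CPU</code> at the highest or lowest frequency point, use the following interfaces:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span 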
class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment"># pin to the highest frequency</span></span><br><span class="line"><span class="built_in">echo</span> 1 &gt; /dev/pdbg/qcore/dcvs/force_max_freqency</span><br><span class="line"></span><br><span class="line"><span class="comment"># pin to the lowest frequency</span></span><br><span class="line"><span class="built_in">echo</span> 1 &gt; /dev/pdbg/qcore/dcvs/force_min_freqency</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>In addition, when analyzing a problem, you can view the <code>DCVS</code> logs with <code>slog2info</code>:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># slog2info -w |grep dcvs</span></span><br><span class="line">Jan 01 08:00:26.274                frpc_lib.233538             frpc_lib  16135  cdsp_service[fastrpc_farf.c:409]: CDSP:fastrpc_kpower.c:1440:0xdc:6: fastrpc_kpower_set: Request = HAP_power_set_DCVS_v2, Client ID = 15, Client ID <span class="keyword">for</span> adsp dcvs = 7</span><br><span class="line">Jan 01 08:00:26.279                frpc_lib.233538             frpc_lib  16135  cdsp_service[fastrpc_farf.c:409]: CDSP:fastrpc_kpower.c:1440:0xdc:6: fastrpc_kpower_set: Request = HAP_power_set_DCVS_v2, Client ID = 14, Client ID <span class="keyword">for</span> adsp dcvs = 6</span><br><span class="line">Jan 01 08:00:26.279                frpc_lib.233538             frpc_lib  16135  cdsp_service[fastrpc_farf.c:409]: CDSP:fastrpc_kpower.c:1440:0xdc:6: fastrpc_kpower_set: Request = 
HAP_power_set_DCVS_v2, Client ID = 15, Client ID <span class="keyword">for</span> adsp dcvs = 7</span><br><span class="line">Jan 01 08:00:26.298                frpc_lib.233538             frpc_lib  16135  cdsp_service[fastrpc_farf.c:409]: CDSP:fastrpc_kpower.c:1440:0xdc:6: fastrpc_kpower_set: Request = HAP_power_set_DCVS_v2, Client ID = 14, Client ID <span class="keyword">for</span> adsp dcvs = 6</span><br><span class="line">Jan 01 08:00:26.298                frpc_lib.233538             frpc_lib  16135  cdsp_service[fastrpc_farf.c:409]: CDSP:fastrpc_kpower.c:1440:0xdc:6: fastrpc_kpower_set: Request = HAP_power_set_DCVS_v2, Client ID = 15, Client ID <span class="keyword">for</span> adsp dcvs = 7</span><br><span class="line">Jan 01 08:00:26.322                frpc_lib.233538             frpc_lib  16135  cdsp_service[fastrpc_farf.c:409]: CDSP:fastrpc_kpower.c:1440:0xdc:6: fastrpc_kpower_set: Request = HAP_power_set_DCVS_v2, Client ID = 14, Client ID <span class="keyword">for</span> adsp dcvs = 6</span><br><span class="line">Jan 01 08:00:26.322                frpc_lib.233538             frpc_lib  16135  cdsp_service[fastrpc_farf.c:409]: CDSP:fastrpc_kpower.c:1440:0xdc:6: fastrpc_kpower_set: Request = HAP_power_set_DCVS_v2, Client ID = 15, Client ID <span class="keyword">for</span> adsp dcvs = 7</span><br><span class="line"></span><br></pre></td></tr></table></figure><h3 id="GPU-DCVS"><a href="#GPU-DCVS" class="headerlink" title="GPU DCVS"></a><strong>GPU DCVS</strong></h3><p><code>GPU</code> <code>DCVS</code> mainly relies on a work-queue thread that collects <code>GPU</code> load information and uses it to scale dynamically, adjusting the GPU core frequency according to the current load. Like the <code>CPU</code> frequency points, the <code>GPU</code> operating frequency is divided into several levels; for example, the <code>GPU</code> on the <code>SA8155</code> platform has 7 levels:</p><p><img src="https://md-files.oss-cn-shenzhen.aliyuncs.com/8155-gpu-freq-levels.jpeg" alt="8155 gpu frequency levels"></p><p>For <code>GPU</code> <code>DCVS</code>, the scaling policy is carried out entirely by an internal algorithm and needs no tuning; however, <code>QNX</code> does provide an interface to switch <code>DCVS</code> on and off:</p><figure class="highlight 
bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment"># enable performance governor</span></span><br><span class="line"><span class="built_in">echo</span> gpu_perf_governor 1 &gt; /dev/kgsl-control</span><br><span class="line"></span><br><span class="line"><span class="comment"># disable performance governor</span></span><br><span class="line"><span class="built_in">echo</span> gpu_perf_governor 0 &gt; /dev/kgsl-control</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>You can also pin the <code>GPU</code> frequency to a specific level:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment"># set the GPU frequency to 500 MHz (level 3)</span></span><br><span class="line"><span class="built_in">echo</span> gpu_perf_governor 1 &gt; /dev/kgsl-control</span><br><span class="line"><span class="built_in">echo</span> gpu_gfx_core_clock_level 3 &gt; /dev/kgsl-control</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>To watch the current <code>GPU</code> frequency and load in real time, set the log level and then check with <code>slog2info</code>:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span 
class="line">12</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment"># set the GPU log level</span></span><br><span class="line"><span class="built_in">echo</span> gpu_set_log_level 4 &gt; /dev/kgsl-control</span><br><span class="line"><span class="built_in">echo</span> gpubusystats 100 &gt; /dev/kgsl-control</span><br><span class="line"></span><br><span class="line"><span class="comment"># show per-process GPU usage</span></span><br><span class="line"><span class="built_in">echo</span> gpu_per_process_busy 1000 &gt;/dev/kgsl-control</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="comment"># show real-time GPU load</span></span><br><span class="line">slog2info -b KGSL -w |grep -i percentage</span><br><span class="line"></span><br></pre></td></tr></table></figure><h2 id="QNX中查看系统工作频率"><a href="#QNX中查看系统工作频率" class="headerlink" title="QNX中查看系统工作频率"></a><strong>Checking System Operating Frequencies on QNX</strong></h2><p>The <code>QNX</code> system on Qualcomm platforms ships a <code>clock.sh</code> script that can read and set the operating frequencies of the system's core components, such as <code>CPU</code>, <code>GPU</code>, <code>DDR</code>, and <code>UFS</code>:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span 
class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment"># clock.sh -h</span></span><br><span class="line">clock debug tool</span><br><span class="line">usage: clock.sh &lt;<span class="built_in">command</span>&gt; &lt;[arg]&gt; [options]</span><br><span class="line"></span><br><span class="line">Provides an interface to the qcore clock driver. Results are stored</span><br><span class="line"><span class="keyword">in</span> /tmp/clockdebug_log, <span class="built_in">which</span> is <span class="keyword">then</span> printed to stdout.</span><br><span class="line"></span><br><span class="line">global options:</span><br><span class="line">  -b,--batch   <span class="built_in">enable</span> batch commands</span><br><span class="line"></span><br><span class="line">commands:</span><br><span class="line">  <span class="built_in">enable</span>       &lt;clock || powerdomain || dcvs || avs&gt;</span><br><span class="line">  <span class="built_in">disable</span>      &lt;clock || powerdomain || dcvs&gt; [--force]</span><br><span class="line">  getfreq      &lt;clock&gt; </span><br><span class="line">  setfreq      &lt;clock&gt; &lt;frequency (min KHz)&gt;</span><br><span class="line">  setdiv       &lt;clock&gt; &lt;div&gt;</span><br><span class="line">  setflags     &lt;clock || powerdomain || top&gt; &lt;mask&gt;</span><br><span class="line">  setlimit     &lt;clock&gt; [--min, --max (default)] &lt;frequency (KHz)&gt;</span><br><span class="line">  config       &lt;clock&gt; &lt;val&gt;</span><br><span class="line">  reset        &lt;clock&gt; [--assert, --deassert, --pulse (default)]</span><br><span class="line">  info         &lt;clock || powerdomain || top || list&gt; [--enabled, --on, --ref, --xovote, --cached]</span><br><span class="line"> 
 freqplan     &lt;clock&gt; </span><br><span class="line">  maxperf      &lt;cluster_name&gt; </span><br><span class="line">  minperf      &lt;cluster_name&gt; </span><br><span class="line">  perfinfo     &lt;cluster_name&gt; </span><br><span class="line">  gpio         [--off]</span><br><span class="line">  debugmux     &lt;clocck_name&gt;</span><br><span class="line">  getrefcount  &lt;clock_name&gt;</span><br><span class="line">  <span class="built_in">log</span></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>The commands below show how to check the frequency of each subsystem; together they cover the frequency information of all the core domains:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment"># cpu</span></span><br><span class="line">clock.sh info|grep apcs</span><br><span class="line"></span><br><span class="line"><span class="comment"># gpu</span></span><br><span class="line">clock.sh info|grep gpu</span><br><span class="line"></span><br><span class="line"><span class="comment"># ufs</span></span><br><span class="line">clock.sh info|grep ufs</span><br><span class="line"></span><br><span class="line"><span class="comment"># ddr</span></span><br><span class="line">clock.sh info|grep ddr</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>To check the frequency information of one specific system clock, use a command like the following:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span 
class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment"># clock.sh info gpu_cc_gx_gfx3d_clk</span></span><br><span class="line">Clock               State      Freq (Hz)    EN  RST Flags              VDD_CX/MMCX                      Sources</span><br><span class="line">--------------------------------------------------------------------</span><br><span class="line">gpu_cc_gx_gfx3d_clk ON (0)     730995848    1   n/a 0x0                OFF                              /pmic/client/xo</span><br><span class="line"></span><br></pre></td></tr></table></figure>]]></content>
    
    
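    <summary type="html">&lt;p&gt;In today's smart-cockpit market, Qualcomm's two platforms, &lt;code&gt;8155&lt;/code&gt;&amp;#x2F;&lt;code&gt;8295&lt;/code&gt;, hold most of the market share. Both hardware platforms are built on a &lt;code&gt;QNX&lt;/code&gt;-based virtualization solution, which means the center-console domain and the instrument-cluster domain run on a single system: the cockpit side is usually Android, which actually runs as a virtual machine on top of QNX, while the instrument cluster usually runs on the QNX side. Compared with a traditional standalone Android system, the &lt;code&gt;QNX&lt;/code&gt; virtualization platform changes a lot: many physical drivers and system services run on QNX, and Android sees only a virtual device, or the feature is removed altogether. The dynamic frequency and voltage scaling feature covered in this article, &lt;code&gt;DCVS(Dynamic Clock and Voltage Scaling)&lt;/code&gt;, is one example: it no longer exists on Android, and all frequency and voltage scaling is implemented on QNX.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;DCVS&lt;/code&gt; is also known as &lt;code&gt;DVFS(Dynamic Voltage and Frequency Scaling)&lt;/code&gt;. Both terms mean dynamically adjusting the operating frequency and voltage of the CPU&amp;#x2F;GPU&amp;#x2F;DDR and similar components according to system load, thereby reducing power consumption&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Next, let's look at how &lt;code&gt;DCVS&lt;/code&gt; is implemented on the Qualcomm QNX platform, and how to check the &lt;code&gt;CPU, GPU, UFS, DDR&lt;/code&gt; frequencies on QNX.&lt;/p&gt;</summary>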
    
    
    
    <category term="虚拟化" scheme="https://sniffer.site/categories/%E8%99%9A%E6%8B%9F%E5%8C%96/"/>
    
    
    <category term="QNX" scheme="https://sniffer.site/tags/QNX/"/>
    
    <category term="DCVS" scheme="https://sniffer.site/tags/DCVS/"/>
    
    <category term="高通" scheme="https://sniffer.site/tags/%E9%AB%98%E9%80%9A/"/>
    
  </entry>
  
  <entry>
    <title>Life and the Universe: Reflections on Reading A Brief History of Time</title>
    <link href="https://sniffer.site/2024/07/18/%E7%94%9F%E5%91%BD-%E6%97%B6%E9%97%B4%E4%B8%8E%E5%AE%87%E5%AE%99/"/>
    <id>https://sniffer.site/2024/07/18/%E7%94%9F%E5%91%BD-%E6%97%B6%E9%97%B4%E4%B8%8E%E5%AE%87%E5%AE%99/</id>
    <published>2024-07-18T11:42:51.000Z</published>
    <updated>2024-08-20T01:37:54.190Z</updated>
    
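    <content type="html"><![CDATA[<blockquote><p>We find ourselves in a bewildering world. We want to make sense of what we see around us and to ask: What is the nature of the universe? What is our place in it and where did it and we come from? Why is it the way it is?</p><pre><code>    Stephen Hawking</code></pre></blockquote><p> <img src="https://md-files.oss-cn-shenzhen.aliyuncs.com/Corona-black-hole.webp" alt="black-hole"></p> <span id="more"></span><p>Back in college I became fascinated by the mysteries of the brain, the origin of the universe, and the emergence of humankind, and I have kept trying to read books on these subjects, though my understanding is still shallow. Recently I finished Stephen Hawking's <em>A Brief History of Time</em> in fits and starts and felt a kind of awe I had not felt for a long time; the last time was years ago, reading <em>In Search of Memory</em>. Hawking lays out the origin and history of the universe in plain, concise language, and it is captivating to read. Parts of the book are hard going, yet it feels as if a window has been opened onto an entirely new world.</p><p>Time, the universe, humanity: none of these can be seen clearly. Sometimes I feel I understand a little, and then they turn utterly unfamiliar again, always carrying a hazy, indistinct beauty. Having finished the book, I can only marvel at how small humanity is against the boundless universe. Yet humans carry something miraculous in themselves; we are a miracle-like existence, an incarnation of God born of chance, with an animal nature and also a radiant humanity. We must open our eyes to the universe, and also to ourselves. Humans strive to uncover the secrets deep in the universe, to find the truth within a brief lifetime, to throw off their own shackles and taste freedom; it is, in the end, too hard. The universe itself is simply an existence, a wordless existence, while humans insist on exploring the value and meaning of that existence, on probing the nature of the universe and discovering the truth of being.</p><p>It really is too hard. Only these great figures truly have the courage and perseverance to pursue such heart-searing mysteries, while the rest of us just muddle along, trying to find a little fun and happiness in worldly life and then assigning them value and meaning. We are too good at consoling ourselves, so most of the time we sink into daily life and cannot pull ourselves out: desire, pleasure, power, status, wealth, looks, age, clothes. These things consume too much of our time and energy and leave us in a state of self-intoxication and infatuation. For several hundred years human society has developed at high speed, and the wheel of history has never turned as fast as it does now. Humans, as kings of the Earth, have long forgotten where they came from and have become wandering ghosts without a homeland. In the world of digital networks we struggle to find a little comfort for the soul, trying to cover the anxiety deep inside with sensory pleasure. There are too many desires; they have hemmed us in completely. We rush about all day, exhausted, not knowing where we are going. We try to stimulate our bodies through food and sex to gain a fleeting taste of freedom, and we relish it, when in fact it is only one more shackle.</p><p>Can we attain freedom? It is hard. Humans have never been free since they came into being. We have always been driven by desire; we are merely its slaves. That so-called freedom is only an illusion we have constructed. But if we had no desire, would true freedom arrive? As long as we have a body we will have desire; unless the body dies completely, desire will always be there. We are destined to spend our lives with the desires within us; this is the human fate, and there is no escaping it. Even so, we can still choose: we can try to befriend desire and get to know it rather than let it go to our heads. Only by truly grasping the desire hidden in the body and the longing hidden in the heart can we gain some measure of freedom. It may not be true freedom, but it can be called a state that approaches freedom without limit.</p><p>Do not run from desire; do not be afraid. Do not let desire become a shackle; let it become a booster for exploring inner freedom, a wellspring of wisdom. Exercise self-restraint, and know reverence. Restraint means learning to keep desire in proportion, not letting any one desire become a shackle that controls the self or a demon that destroys the heart. Reverence means knowing humility: no self-indulgence, no fearlessness, no recklessness. Since we carry something divine and miraculous, we should be full of gratitude for everything we have been given. Whether a person can live out a life in peace, healthy, happy, and with enough, depends first on recognizing the desires within and seeing clearly what one truly wants. Only then do we have the drive to truly find inner peace.</p><p>But having desire without the ability to fulfill it probably only adds to one's troubles, and may even bring undeserved misfortune. So beyond desire, what matters more is personal ability, the ability to prove one's own worth. If you lack exceptional talent and gifts, then honestly read, study, and seek wisdom. Through long accumulation, bit-by-bit growth, and continuous self-improvement, we may still not reach an expert's level in five years, but perhaps in ten years we will be transformed, and then we can deliver greater value and have more influence. Even if we still cannot truly reach the goals we set, at least we can shake off inner anxiety and confusion and become someone who is truly calm and at peace inside, someone with the wisdom to face life's difficulties and challenges. That is enough.</p><p>In the end, happiness requires enough inner wisdom to balance our desires against our abilities, so that we can return to our original intention and find what life originally looks like: not anxious, not lost, not slack. If, when our ability falls short, we learn to let go of excessive desire and give up part of what we crave, we may be lighter, calmer, and more at ease; if we do have the ability, we should know restraint and keep things in proportion, rather than being swallowed by desire and letting it become a self-made shackle.</p>]]></content>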
    
    
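    <summary type="html">&lt;blockquote&gt;&lt;p&gt;We find ourselves in a bewildering world. We want to make sense of what we see around us and to ask: What is the nature of the universe? What is our place in it and where did it and we come from? Why is it the way it is?&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;    Stephen Hawking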
&lt;/code&gt;&lt;/pre&gt;
&lt;/blockquote&gt;

&lt;p&gt; &lt;img src=&quot;https://md-files.oss-cn-shenzhen.aliyuncs.com/Corona-black-hole.webp&quot; alt=&quot;black-hole&quot;&gt;&lt;/p&gt;</summary>
    
    
    
    <category term="读书笔记" scheme="https://sniffer.site/categories/%E8%AF%BB%E4%B9%A6%E7%AC%94%E8%AE%B0/"/>
    
    
    <category term="生命" scheme="https://sniffer.site/tags/%E7%94%9F%E5%91%BD/"/>
    
    <category term="宇宙" scheme="https://sniffer.site/tags/%E5%AE%87%E5%AE%99/"/>
    
    <category term="时间" scheme="https://sniffer.site/tags/%E6%97%B6%E9%97%B4/"/>
    
  </entry>
  
  <entry>
    <title>How Android Uses PSI to Manage Memory</title>
    <link href="https://sniffer.site/2024/06/25/Android%E5%A6%82%E4%BD%95%E4%BD%BF%E7%94%A8PSI%E7%AE%A1%E7%90%86%E5%86%85%E5%AD%98/"/>
    <id>https://sniffer.site/2024/06/25/Android%E5%A6%82%E4%BD%95%E4%BD%BF%E7%94%A8PSI%E7%AE%A1%E7%90%86%E5%86%85%E5%AD%98/</id>
    <published>2024-06-25T09:46:33.000Z</published>
    <updated>2024-07-19T07:08:46.693Z</updated>
    
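    <content type="html"><![CDATA[<p>Originally, <code>Android</code> used the in-kernel <code>lowmemorykiller</code> driver to monitor system memory; when memory ran short it killed non-critical processes or apps to relieve memory pressure. Since kernel version <code>4.12</code>, <code>lowmemorykiller</code> has been removed from the kernel, so <code>Android</code> added <code>lmkd(Low Memory Killer Daemon)</code> to replace the kernel driver: it monitors the memory state and, when the system is under high memory pressure, proactively frees some memory to keep the memory level acceptable. So how does <code>LMKD</code> learn the system's memory pressure state? That brings us to the kernel module <code>PSI(Pressure stall information)</code>.</p><span id="more"></span><p>Starting with <code>Android10</code>, <code>LMKD</code> uses <code>PSI</code> (pressure stall information) to detect memory pressure. Put simply, <code>PSI</code> measures the task delays caused by memory shortage, and these delays serve as an indicator of memory pressure; it also provides interfaces through which user-space processes can read this state. Earlier <code>Android</code> versions used the <code>vmpressure</code> module to obtain the memory pressure state.</p><p>This article looks at how <code>Android</code> uses <code>PSI</code> to manage memory and how it frees memory when pressure appears.</p><h2 id="PSI简介"><a href="#PSI简介" class="headerlink" title="PSI简介"></a>Introduction to PSI</h2><p><code>PSI(Pressure stall information)</code> is a kernel module that monitors pressure on the system's CPU, memory, and IO resources in order to gauge overall resource health. When the system is under resource pressure (tasks are delayed because CPU, memory, or IO is insufficient), it runs less efficiently. Through the <code>PSI</code> interfaces we can read the pressure state of each resource and choose a matching resource-management policy to tune the system and improve its efficiency.</p><p>The <code>PSI</code> state can be read from the files under <code>/proc/pressure</code>, which expose three interfaces:</p><ul><li><code>/proc/pressure/cpu</code>: shows the system's CPU pressure state</li><li><code>/proc/pressure/memory</code>: shows the system's memory pressure state</li><li><code>/proc/pressure/io</code>: shows the system's IO pressure state</li></ul><p>The output uses the same format for all three. The first line, <code>some</code>, means that one or more tasks in the system are currently stalled on the resource; the second line, <code>full</code>, means that all (non-idle) tasks are stalled on the resource at the same time. <code>avg10</code> is the average over 10s, <code>avg60</code> the average over 60s, <code>avg300</code> the average over 300s, and <code>total</code> is the accumulated stall time (in microseconds).</p><blockquote><p>For CPU there is no <code>full</code> state, because the system always has some runnable task</p></blockquote><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">some avg10=0.77 avg60=0.23 avg300=0.14 total=117592248</span><br><span class="line">full avg10=0.00 avg60=0.00 avg300=0.00 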
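total=0</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Take the figure below as an example. Two tasks are running; because of memory pressure, task <code>B</code> waits for 30s while task <code>A</code> runs normally without delay, so <code>some</code> is 50% (0.5), that is, 30s out of a 60s window. <code>some</code> therefore indicates, to a degree, the delay that resource pressure adds to the system.</p><p><img src="https://md-files.oss-cn-shenzhen.aliyuncs.com/psi-some.png" alt="some crop"></p><p>Similarly, in the next figure both tasks <code>A</code> and <code>B</code> are delayed by memory pressure: <code>A</code> waits <code>10s</code> and <code>B</code> waits <code>30s</code>. Now <code>some</code> is <code>50%</code> and <code>full</code> is <code>16.67%</code> (the 10s during which both tasks were stalled, out of 60s). A high <code>full</code> value indicates total throughput lost to resource pressure.</p><p><img src="https://md-files.oss-cn-shenzhen.aliyuncs.com/psi-full.png" alt="full crop"></p><blockquote><p>Enabling <code>PSI</code> requires the three kernel config options <code>CONFIG_PSI</code>, <code>CONFIG_PSI_LMKD</code>, and <code>CONFIG_PSI_SYSCALL</code>, all of which are on by default.</p></blockquote><p>The kernel normally notifies the <code>PSI</code> module when a task switch happens or when a memory allocation triggers reclaim, so that the current resource pressure can be accounted. The corresponding kernel interfaces can all be found in <code>linux/psi.h</code>; the main ones are:</p><ul><li><code>psi_init</code>: initializes the <code>PSI</code> module</li><li><code>psi_task_change</code>: notifies the <code>PSI</code> module when a task's state changes, updating the statistics</li><li><code>psi_task_switch</code>: notifies the <code>PSI</code> module on a task switch, updating the statistics</li><li><code>psi_memstall_tick</code>: notifies the <code>PSI</code> module on a timer interrupt, updating the memory pressure statistics</li><li><code>psi_memstall_enter</code>: notifies the <code>PSI</code> module when a memory-allocation stall begins, updating the memory pressure statistics</li><li><code>psi_memstall_leave</code>: notifies the <code>PSI</code> module when a memory-allocation stall ends, updating the memory pressure statistics</li></ul><p>Taking memory as an example, the kernel calls <code>psi_memstall_enter</code>&#x2F;<code>psi_memstall_leave</code> on the main memory paths, such as reclaim and compaction, to account memory pressure. For instance, if compaction is started on the allocation path because memory is fragmented, the <code>PSI</code> module is told so that the delay caused by compaction is accounted:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span 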
class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment">// page_alloc.c</span></span><br><span class="line"></span><br><span class="line">/* Try memory compaction <span class="keyword">for</span> high-order allocations before reclaim */</span><br><span 
class="line"><span class="type">static</span> <span class="class"><span class="keyword">struct</span> <span class="title">page</span> *</span></span><br><span class="line"><span class="class">__<span class="title">alloc_pages_direct_compact</span>(<span class="title">gfp_t</span> <span class="title">gfp_mask</span>, <span class="title">unsigned</span> <span class="title">int</span> <span class="title">order</span>,</span></span><br><span class="line"><span class="class"><span class="title">unsigned</span> <span class="title">int</span> <span class="title">alloc_flags</span>, <span class="title">const</span> <span class="keyword">struct</span> <span class="title">alloc_context</span> *<span class="title">ac</span>,</span></span><br><span class="line"><span class="class"><span class="title">enum</span> <span class="title">compact_priority</span> <span class="title">prio</span>, <span class="title">enum</span> <span class="title">compact_result</span> *<span class="title">compact_result</span>)</span></span><br><span class="line"><span class="class">&#123;</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">page</span> *<span class="title">page</span> =</span> <span class="literal">NULL</span>;</span><br><span class="line"><span class="type">unsigned</span> <span class="type">long</span> pflags;</span><br><span class="line"><span class="type">unsigned</span> <span class="type">int</span> noreclaim_flag;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (!order)</span><br><span class="line"><span class="keyword">return</span> <span class="literal">NULL</span>;</span><br><span class="line"></span><br><span class="line">  <span class="comment">// 开始内存整理</span></span><br><span class="line">psi_memstall_enter(&amp;pflags);</span><br><span class="line">noreclaim_flag = memalloc_noreclaim_save();</span><br><span class="line"></span><br><span class="line">*compact_result = 
try_to_compact_pages(gfp_mask, order, alloc_flags, ac,</span><br><span class="line">prio, &amp;page);</span><br><span class="line"></span><br><span class="line">memalloc_noreclaim_restore(noreclaim_flag);</span><br><span class="line"></span><br><span class="line">  <span class="comment">// 结束内存整理</span></span><br><span class="line">psi_memstall_leave(&amp;pflags);</span><br><span class="line"></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * At least in one zone compaction wasn&#x27;t deferred or skipped, so let&#x27;s</span></span><br><span class="line"><span class="comment"> * count a compaction stall</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line">count_vm_event(COMPACTSTALL);</span><br><span class="line"></span><br><span class="line"><span class="comment">/* Prep a captured page if available */</span></span><br><span class="line"><span class="keyword">if</span> (page)</span><br><span class="line">prep_new_page(page, order, gfp_mask, alloc_flags);</span><br><span class="line"></span><br><span class="line"><span class="comment">/* Try get a page from the freelist if available */</span></span><br><span class="line"><span class="keyword">if</span> (!page)</span><br><span class="line">page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (page) &#123;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">zone</span> *<span class="title">zone</span> =</span> page_zone(page);</span><br><span class="line"></span><br><span class="line">zone-&gt;compact_blockskip_flush = <span class="literal">false</span>;</span><br><span class="line">compaction_defer_reset(zone, order, <span class="literal">true</span>);</span><br><span class="line">count_vm_event(COMPACTSUCCESS);</span><br><span class="line"><span 
class="keyword">return</span> page;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * It&#x27;s bad if compaction run occurs and fails. The most likely reason</span></span><br><span class="line"><span class="comment"> * is that pages exist, but not enough to satisfy watermarks.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line">count_vm_event(COMPACTFAIL);</span><br><span class="line"></span><br><span class="line">cond_resched();</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> <span class="literal">NULL</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>For more technical detail, see the code in <code>kernel/sched/stats.h</code> (CPU scheduling) and under <code>mm/</code> (memory management). Next, let's look at how <code>Android</code> uses <code>PSI</code> to manage system memory and how it reclaims memory in a low-memory state.</p><h2 id="LMKD如何使用PSI监控内存状态"><a href="#LMKD如何使用PSI监控内存状态" class="headerlink" title="LMKD如何使用PSI监控内存状态"></a>How LMKD Uses PSI to Monitor Memory State</h2><p>The figure below shows the architecture of the <code>LMKD</code> memory-management service. <code>LMKD</code> mainly addresses memory reclaim when the system is low on memory: after it learns the memory pressure state from <code>PSI</code>, whenever pressure exceeds a certain level <code>LMKD</code> proactively picks a low-priority task (based on the process's <code>oom_score</code> value) and kills it, freeing some memory so that system memory returns to a normal level.</p><p><img src="https://md-files.oss-cn-shenzhen.aliyuncs.com/lmkd-psi-architecture.png" alt="lmkd-psi-architecture"></p><p>When the <code>LMKD</code> low-memory management service starts, it registers a kernel event to listen for the <code>PSI</code> memory pressure state (<code>PSI</code> is used by default from <code>Android10</code> onward):</p><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span 
class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="function"><span class="type">static</span> <span class="type">bool</span> <span class="title">init_monitors</span><span class="params">()</span> </span>&#123;</span><br><span class="line">    <span class="comment">/* Try to use psi monitor first if kernel has it */</span></span><br><span class="line">    use_psi_monitors = <span class="built_in">GET_LMK_PROPERTY</span>(<span class="type">bool</span>, <span class="string">&quot;use_psi&quot;</span>, <span class="literal">true</span>) &amp;&amp;</span><br><span class="line">        <span class="built_in">init_psi_monitors</span>();</span><br><span class="line">    <span class="comment">/* Fall back to vmpressure */</span></span><br><span class="line">    <span class="keyword">if</span> (!use_psi_monitors &amp;&amp;</span><br><span class="line">        (!<span class="built_in">init_mp_common</span>(VMPRESS_LEVEL_LOW) ||</span><br><span class="line">        !<span class="built_in">init_mp_common</span>(VMPRESS_LEVEL_MEDIUM) ||</span><br><span class="line">        !<span class="built_in">init_mp_common</span>(VMPRESS_LEVEL_CRITICAL))) &#123;</span><br><span class="line">        <span class="built_in">ALOGE</span>(<span class="string">&quot;Kernel does not support memory pressure events or in-kernel low memory killer&quot;</span>);</span><br><span class="line">        <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="keyword">if</span> (use_psi_monitors) &#123;</span><br><span 
class="line">        <span class="built_in">ALOGI</span>(<span class="string">&quot;Using psi monitors for memory pressure detection&quot;</span>);</span><br><span class="line">    &#125; <span class="keyword">else</span> &#123;</span><br><span class="line">        <span class="built_in">ALOGI</span>(<span class="string">&quot;Using vmpressure for memory pressure detection&quot;</span>);</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="keyword">return</span> <span class="literal">true</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>The <code>PSI</code> monitors are initialized from the memory pressure thresholds. Memory pressure has three levels, <code>low</code>, <code>medium</code> and <code>critical</code>, but with the new kill strategy the <code>low</code> threshold is set to 0 and no monitor is registered for it. In practice only <code>VMPRESS_LEVEL_MEDIUM</code> and <code>VMPRESS_LEVEL_CRITICAL</code> are used, corresponding to the <code>some</code> and <code>full</code> states in <code>PSI</code>.</p><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br></pre></td><td 
class="code"><pre><span class="line"></span><br><span class="line"><span class="function"><span class="type">static</span> <span class="type">bool</span> <span class="title">init_psi_monitors</span><span class="params">()</span> </span>&#123;</span><br><span class="line">    <span class="comment">/*</span></span><br><span class="line"><span class="comment">     * When PSI is used on low-ram devices or on high-end devices without memfree levels</span></span><br><span class="line"><span class="comment">     * use new kill strategy based on zone watermarks, free swap and thrashing stats</span></span><br><span class="line"><span class="comment">     */</span></span><br><span class="line">    <span class="type">bool</span> use_new_strategy =</span><br><span class="line">        <span class="built_in">GET_LMK_PROPERTY</span>(<span class="type">bool</span>, <span class="string">&quot;use_new_strategy&quot;</span>, low_ram_device || !use_minfree_levels);</span><br><span class="line"></span><br><span class="line">    <span class="comment">/* In default PSI mode override stall amounts using system properties */</span></span><br><span class="line">    <span class="keyword">if</span> (use_new_strategy) &#123;</span><br><span class="line">        <span class="comment">/* Do not use low pressure level */</span></span><br><span class="line">        psi_thresholds[VMPRESS_LEVEL_LOW].threshold_ms = <span class="number">0</span>;</span><br><span class="line">        psi_thresholds[VMPRESS_LEVEL_MEDIUM].threshold_ms = psi_partial_stall_ms;</span><br><span class="line">        psi_thresholds[VMPRESS_LEVEL_CRITICAL].threshold_ms = psi_complete_stall_ms;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (!<span class="built_in">init_mp_psi</span>(VMPRESS_LEVEL_LOW, use_new_strategy)) &#123;</span><br><span class="line">        <span class="keyword">return</span> <span 
class="literal">false</span>;</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="keyword">if</span> (!<span class="built_in">init_mp_psi</span>(VMPRESS_LEVEL_MEDIUM, use_new_strategy)) &#123;</span><br><span class="line">        <span class="built_in">destroy_mp_psi</span>(VMPRESS_LEVEL_LOW);</span><br><span class="line">        <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="keyword">if</span> (!<span class="built_in">init_mp_psi</span>(VMPRESS_LEVEL_CRITICAL, use_new_strategy)) &#123;</span><br><span class="line">        <span class="built_in">destroy_mp_psi</span>(VMPRESS_LEVEL_MEDIUM);</span><br><span class="line">        <span class="built_in">destroy_mp_psi</span>(VMPRESS_LEVEL_LOW);</span><br><span class="line">        <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="keyword">return</span> <span class="literal">true</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>The function <code>init_mp_psi</code> ultimately uses the <code>PSI</code> interface to monitor memory pressure (the <code>PSI</code> interface can be watched with generic mechanisms such as <code>poll</code>&#x2F;<code>epoll</code>) and registers a handler. When the new strategy is enabled, that handler is <code>mp_event_psi</code>, which is called whenever the pressure state changes.</p><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span 
class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="function"><span class="type">static</span> <span class="type">bool</span> <span class="title">init_mp_psi</span><span class="params">(<span class="keyword">enum</span> vmpressure_level level, <span class="type">bool</span> use_new_strategy)</span> </span>&#123;</span><br><span class="line">    <span class="type">int</span> fd;</span><br><span class="line"></span><br><span class="line">    <span class="comment">/* Do not register a handler if threshold_ms is not set */</span></span><br><span class="line">    <span class="keyword">if</span> (!psi_thresholds[level].threshold_ms) &#123;</span><br><span class="line">        <span class="keyword">return</span> <span class="literal">true</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    fd = <span class="built_in">init_psi_monitor</span>(psi_thresholds[level].stall_type,</span><br><span class="line">        psi_thresholds[level].threshold_ms * US_PER_MS,</span><br><span class="line">        PSI_WINDOW_SIZE_MS * US_PER_MS);</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (fd &lt; <span class="number">0</span>) &#123;</span><br><span class="line">        <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    
vmpressure_hinfo[level].handler = use_new_strategy ? mp_event_psi : mp_event_common;</span><br><span class="line">    vmpressure_hinfo[level].data = level;</span><br><span class="line">    <span class="keyword">if</span> (<span class="built_in">register_psi_monitor</span>(epollfd, fd, &amp;vmpressure_hinfo[level]) &lt; <span class="number">0</span>) &#123;</span><br><span class="line">        <span class="built_in">destroy_psi_monitor</span>(fd);</span><br><span class="line">        <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">    &#125;</span><br><span class="line">    maxevents++;</span><br><span class="line">    mpevfd[level] = fd;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> <span class="literal">true</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>The core function for registering with the <code>psi</code> interface is <code>init_psi_monitor</code>. It opens <code>/proc/pressure/memory</code>, writes a trigger string in the standard format (stall type, stall threshold and time window, the latter two in microseconds), and returns the file descriptor to be monitored.</p><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span 
class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment">/* Trigger format: &lt;some|full&gt; &lt;stall amount in us&gt; &lt;time window in us&gt; */</span></span><br><span class="line"></span><br><span class="line"><span class="function"><span class="type">int</span> <span class="title">init_psi_monitor</span><span class="params">(<span class="keyword">enum</span> psi_stall_type stall_type,</span></span></span><br><span class="line"><span class="params"><span class="function">             <span class="type">int</span> threshold_us, <span class="type">int</span> window_us)</span> </span>&#123;</span><br><span class="line">    <span class="type">int</span> fd;</span><br><span class="line">    <span class="type">int</span> res;</span><br><span class="line">    <span class="type">char</span> buf[<span class="number">256</span>];</span><br><span class="line"></span><br><span class="line">    fd = <span class="built_in">TEMP_FAILURE_RETRY</span>(<span class="built_in">open</span>(PSI_PATH_MEMORY, O_WRONLY | O_CLOEXEC));</span><br><span class="line">    <span class="keyword">if</span> (fd &lt; <span class="number">0</span>) &#123;</span><br><span class="line">        <span class="built_in">ALOGE</span>(<span class="string">&quot;No kernel psi monitor 
support (errno=%d)&quot;</span>, errno);</span><br><span class="line">        <span class="keyword">return</span> <span class="number">-1</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">switch</span> (stall_type) &#123;</span><br><span class="line">    <span class="built_in">case</span> (PSI_SOME):</span><br><span class="line">    <span class="built_in">case</span> (PSI_FULL):</span><br><span class="line">        res = <span class="built_in">snprintf</span>(buf, <span class="built_in">sizeof</span>(buf), <span class="string">&quot;%s %d %d&quot;</span>,</span><br><span class="line">            stall_type_name[stall_type], threshold_us, window_us);</span><br><span class="line">        <span class="keyword">break</span>;</span><br><span class="line">    <span class="keyword">default</span>:</span><br><span class="line">        <span class="built_in">ALOGE</span>(<span class="string">&quot;Invalid psi stall type: %d&quot;</span>, stall_type);</span><br><span class="line">        errno = EINVAL;</span><br><span class="line">        <span class="keyword">goto</span> err;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (res &gt;= (<span class="type">ssize_t</span>)<span class="built_in">sizeof</span>(buf)) &#123;</span><br><span class="line">        <span class="built_in">ALOGE</span>(<span class="string">&quot;%s line overflow for psi stall type &#x27;%s&#x27;&quot;</span>,</span><br><span class="line">            PSI_PATH_MEMORY, stall_type_name[stall_type]);</span><br><span class="line">        errno = EINVAL;</span><br><span class="line">        <span class="keyword">goto</span> err;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    res = <span class="built_in">TEMP_FAILURE_RETRY</span>(<span class="built_in">write</span>(fd, buf, <span 
class="built_in">strlen</span>(buf) + <span class="number">1</span>));</span><br><span class="line">    <span class="keyword">if</span> (res &lt; <span class="number">0</span>) &#123;</span><br><span class="line">        <span class="built_in">ALOGE</span>(<span class="string">&quot;%s write failed for psi stall type &#x27;%s&#x27;; errno=%d&quot;</span>,</span><br><span class="line">            PSI_PATH_MEMORY, stall_type_name[stall_type], errno);</span><br><span class="line">        <span class="keyword">goto</span> err;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> fd;</span><br><span class="line"></span><br><span class="line">err:</span><br><span class="line">    <span class="built_in">close</span>(fd);</span><br><span class="line">    <span class="keyword">return</span> <span class="number">-1</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Once the kernel raises a memory pressure event, the registered handler <code>mp_event_psi</code> processes it. It first checks, against the current memory watermarks, whether the system is low on memory (little free memory, or not enough free swap such as zram). If so, it looks for low-priority background processes and kills one of them to free memory.</p><blockquote><p>The watermarks are derived from the per-zone settings in <code>/proc/zoneinfo</code>. There are three: <code>WMARK_MIN</code>, <code>WMARK_LOW</code> and <code>WMARK_HIGH</code>. Each acts as a memory pressure threshold; the lower the watermark that free memory falls below, the higher the pressure.</p></blockquote><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span 
class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span 
class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br><span class="line">101</span><br><span class="line">102</span><br><span class="line">103</span><br><span class="line">104</span><br><span class="line">105</span><br><span class="line">106</span><br><span class="line">107</span><br><span class="line">108</span><br><span class="line">109</span><br><span class="line">110</span><br><span class="line">111</span><br><span class="line">112</span><br><span class="line">113</span><br><span class="line">114</span><br><span class="line">115</span><br><span class="line">116</span><br><span class="line">117</span><br><span class="line">118</span><br><span class="line">119</span><br><span class="line">120</span><br><span class="line">121</span><br><span class="line">122</span><br><span class="line">123</span><br><span class="line">124</span><br><span class="line">125</span><br><span class="line">126</span><br><span class="line">127</span><br><span class="line">128</span><br><span class="line">129</span><br><span class="line">130</span><br><span class="line">131</span><br><span class="line">132</span><br><span class="line">133</span><br><span class="line">134</span><br><span class="line">135</span><br><span class="line">136</span><br><span class="line">137</span><br><span class="line">138</span><br><span class="line">139</span><br><span class="line">140</span><br><span class="line">141</span><br><span 
class="line">142</span><br><span class="line">143</span><br><span class="line">144</span><br><span class="line">145</span><br><span class="line">146</span><br><span class="line">147</span><br><span class="line">148</span><br><span class="line">149</span><br><span class="line">150</span><br><span class="line">151</span><br><span class="line">152</span><br><span class="line">153</span><br><span class="line">154</span><br><span class="line">155</span><br><span class="line">156</span><br><span class="line">157</span><br><span class="line">158</span><br><span class="line">159</span><br><span class="line">160</span><br><span class="line">161</span><br><span class="line">162</span><br><span class="line">163</span><br><span class="line">164</span><br><span class="line">165</span><br><span class="line">166</span><br><span class="line">167</span><br><span class="line">168</span><br><span class="line">169</span><br><span class="line">170</span><br><span class="line">171</span><br><span class="line">172</span><br><span class="line">173</span><br><span class="line">174</span><br><span class="line">175</span><br><span class="line">176</span><br><span class="line">177</span><br><span class="line">178</span><br><span class="line">179</span><br><span class="line">180</span><br><span class="line">181</span><br><span class="line">182</span><br><span class="line">183</span><br><span class="line">184</span><br><span class="line">185</span><br><span class="line">186</span><br><span class="line">187</span><br><span class="line">188</span><br><span class="line">189</span><br><span class="line">190</span><br><span class="line">191</span><br><span class="line">192</span><br><span class="line">193</span><br><span class="line">194</span><br><span class="line">195</span><br><span class="line">196</span><br><span class="line">197</span><br><span class="line">198</span><br><span class="line">199</span><br><span class="line">200</span><br><span class="line">201</span><br><span 
class="line">202</span><br><span class="line">203</span><br><span class="line">204</span><br><span class="line">205</span><br><span class="line">206</span><br><span class="line">207</span><br><span class="line">208</span><br><span class="line">209</span><br><span class="line">210</span><br><span class="line">211</span><br><span class="line">212</span><br><span class="line">213</span><br><span class="line">214</span><br><span class="line">215</span><br><span class="line">216</span><br><span class="line">217</span><br><span class="line">218</span><br><span class="line">219</span><br><span class="line">220</span><br><span class="line">221</span><br><span class="line">222</span><br><span class="line">223</span><br><span class="line">224</span><br><span class="line">225</span><br><span class="line">226</span><br><span class="line">227</span><br><span class="line">228</span><br><span class="line">229</span><br><span class="line">230</span><br><span class="line">231</span><br><span class="line">232</span><br><span class="line">233</span><br><span class="line">234</span><br><span class="line">235</span><br><span class="line">236</span><br><span class="line">237</span><br><span class="line">238</span><br><span class="line">239</span><br><span class="line">240</span><br><span class="line">241</span><br><span class="line">242</span><br><span class="line">243</span><br><span class="line">244</span><br><span class="line">245</span><br><span class="line">246</span><br><span class="line">247</span><br><span class="line">248</span><br><span class="line">249</span><br><span class="line">250</span><br><span class="line">251</span><br><span class="line">252</span><br><span class="line">253</span><br><span class="line">254</span><br><span class="line">255</span><br><span class="line">256</span><br><span class="line">257</span><br><span class="line">258</span><br><span class="line">259</span><br><span class="line">260</span><br><span class="line">261</span><br><span 
class="line">262</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="function"><span class="type">static</span> <span class="type">void</span> <span class="title">mp_event_psi</span><span class="params">(<span class="type">int</span> data, <span class="type">uint32_t</span> events, <span class="keyword">struct</span> polling_params *poll_params)</span> </span>&#123;</span><br><span class="line">    <span class="keyword">enum</span> <span class="title class_">reclaim_state</span> &#123;</span><br><span class="line">        NO_RECLAIM = <span class="number">0</span>,</span><br><span class="line">        KSWAPD_RECLAIM,</span><br><span class="line">        DIRECT_RECLAIM,</span><br><span class="line">    &#125;;</span><br><span class="line">    ...</span><br><span class="line">    <span class="keyword">if</span> (<span class="built_in">clock_gettime</span>(CLOCK_MONOTONIC_COARSE, &amp;curr_tm) != <span class="number">0</span>) &#123;</span><br><span class="line">        <span class="built_in">ALOGE</span>(<span class="string">&quot;Failed to get current time&quot;</span>);</span><br><span class="line">        <span class="keyword">return</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="built_in">record_wakeup_time</span>(&amp;curr_tm, events ? 
Event : Polling, &amp;wi);</span><br><span class="line"></span><br><span class="line">    <span class="type">bool</span> kill_pending = <span class="built_in">is_kill_pending</span>();</span><br><span class="line">    <span class="keyword">if</span> (kill_pending &amp;&amp; (kill_timeout_ms == <span class="number">0</span> ||</span><br><span class="line">        <span class="built_in">get_time_diff_ms</span>(&amp;last_kill_tm, &amp;curr_tm) &lt; <span class="built_in">static_cast</span>&lt;<span class="type">long</span>&gt;(kill_timeout_ms))) &#123;</span><br><span class="line">        <span class="comment">/* Skip while still killing a process */</span></span><br><span class="line">        wi.skipped_wakeups++;</span><br><span class="line">        <span class="keyword">goto</span> no_kill;</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="comment">/*</span></span><br><span class="line"><span class="comment">     * Process is dead or kill timeout is over, stop waiting. 
This has no effect if pidfds are</span></span><br><span class="line"><span class="comment">     * supported and death notification already caused waiting to stop.</span></span><br><span class="line"><span class="comment">     */</span></span><br><span class="line">    <span class="built_in">stop_wait_for_proc_kill</span>(!kill_pending);</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (<span class="built_in">vmstat_parse</span>(&amp;vs) &lt; <span class="number">0</span>) &#123;</span><br><span class="line">        <span class="built_in">ALOGE</span>(<span class="string">&quot;Failed to parse vmstat!&quot;</span>);</span><br><span class="line">        <span class="keyword">return</span>;</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="comment">/* Starting 5.9 kernel workingset_refault vmstat field was renamed workingset_refault_file */</span></span><br><span class="line">    workingset_refault_file = vs.field.workingset_refault ? 
: vs.field.workingset_refault_file;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (<span class="built_in">meminfo_parse</span>(&amp;mi) &lt; <span class="number">0</span>) &#123;</span><br><span class="line">        <span class="built_in">ALOGE</span>(<span class="string">&quot;Failed to parse meminfo!&quot;</span>);</span><br><span class="line">        <span class="keyword">return</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">/* Reset states after process got killed */</span></span><br><span class="line">    <span class="keyword">if</span> (killing) &#123;</span><br><span class="line">        killing = <span class="literal">false</span>;</span><br><span class="line">        cycle_after_kill = <span class="literal">true</span>;</span><br><span class="line">        <span class="comment">/* Reset file-backed pagecache size and refault amounts after a kill */</span></span><br><span class="line">        base_file_lru = vs.field.nr_inactive_file + vs.field.nr_active_file;</span><br><span class="line">        init_ws_refault = workingset_refault_file;</span><br><span class="line">        thrashing_reset_tm = curr_tm;</span><br><span class="line">        prev_thrash_growth = <span class="number">0</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">/* Check free swap levels */</span></span><br><span class="line">    <span class="keyword">if</span> (swap_free_low_percentage) &#123;</span><br><span class="line">        <span class="keyword">if</span> (!swap_low_threshold) &#123;</span><br><span class="line">            swap_low_threshold = mi.field.total_swap * swap_free_low_percentage / <span class="number">100</span>;</span><br><span class="line">        &#125;</span><br><span class="line">        swap_is_low = mi.field.free_swap &lt; 
swap_low_threshold;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">/* Identify reclaim state */</span></span><br><span class="line">    <span class="keyword">if</span> (vs.field.pgscan_direct &gt; init_pgscan_direct) &#123;</span><br><span class="line">        init_pgscan_direct = vs.field.pgscan_direct;</span><br><span class="line">        init_pgscan_kswapd = vs.field.pgscan_kswapd;</span><br><span class="line">        reclaim = DIRECT_RECLAIM;</span><br><span class="line">    &#125; <span class="keyword">else</span> <span class="keyword">if</span> (vs.field.pgscan_kswapd &gt; init_pgscan_kswapd) &#123;</span><br><span class="line">        init_pgscan_kswapd = vs.field.pgscan_kswapd;</span><br><span class="line">        reclaim = KSWAPD_RECLAIM;</span><br><span class="line">    &#125; <span class="keyword">else</span> <span class="keyword">if</span> (workingset_refault_file == prev_workingset_refault) &#123;</span><br><span class="line">        <span class="comment">/*</span></span><br><span class="line"><span class="comment">         * Device is not thrashing and not reclaiming, bail out early until we see these stats</span></span><br><span class="line"><span class="comment">         * changing</span></span><br><span class="line"><span class="comment">         */</span></span><br><span class="line">        <span class="keyword">goto</span> no_kill;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    prev_workingset_refault = workingset_refault_file;</span><br><span class="line"></span><br><span class="line">     <span class="comment">/*</span></span><br><span class="line"><span class="comment">     * It&#x27;s possible we fail to find an eligible process to kill (ex. no process is</span></span><br><span class="line"><span class="comment">     * above oom_adj_min). 
When this happens, we should retry to find a new process</span></span><br><span class="line"><span class="comment">     * for a kill whenever a new eligible process is available. This is especially</span></span><br><span class="line"><span class="comment">     * important for a slow growing refault case. While retrying, we should keep</span></span><br><span class="line"><span class="comment">     * monitoring new thrashing counter as someone could release the memory to mitigate</span></span><br><span class="line"><span class="comment">     * the thrashing. Thus, when thrashing reset window comes, we decay the prev thrashing</span></span><br><span class="line"><span class="comment">     * counter by window counts. If the counter is still greater than thrashing limit,</span></span><br><span class="line"><span class="comment">     * we preserve the current prev_thrash counter so we will retry kill again. Otherwise,</span></span><br><span class="line"><span class="comment">     * we reset the prev_thrash counter so we will stop retrying.</span></span><br><span class="line"><span class="comment">     */</span></span><br><span class="line">    since_thrashing_reset_ms = <span class="built_in">get_time_diff_ms</span>(&amp;thrashing_reset_tm, &amp;curr_tm);</span><br><span class="line">    <span class="keyword">if</span> (since_thrashing_reset_ms &gt; THRASHING_RESET_INTERVAL_MS) &#123;</span><br><span class="line">        <span class="type">long</span> windows_passed;</span><br><span class="line">        <span class="comment">/* Calculate prev_thrash_growth if we crossed THRASHING_RESET_INTERVAL_MS */</span></span><br><span class="line">        prev_thrash_growth = (workingset_refault_file - init_ws_refault) * <span class="number">100</span></span><br><span class="line">                            / (base_file_lru + <span class="number">1</span>);</span><br><span class="line">        windows_passed = (since_thrashing_reset_ms / 
THRASHING_RESET_INTERVAL_MS);</span><br><span class="line">        <span class="comment">/*</span></span><br><span class="line"><span class="comment">         * Decay prev_thrashing unless over-the-limit thrashing was registered in the window we</span></span><br><span class="line"><span class="comment">         * just crossed, which means there were no eligible processes to kill. We preserve the</span></span><br><span class="line"><span class="comment">         * counter in that case to ensure a kill if a new eligible process appears.</span></span><br><span class="line"><span class="comment">         */</span></span><br><span class="line">        <span class="keyword">if</span> (windows_passed &gt; <span class="number">1</span> || prev_thrash_growth &lt; thrashing_limit) &#123;</span><br><span class="line">            prev_thrash_growth &gt;&gt;= windows_passed;</span><br><span class="line">        &#125;</span><br><span class="line"></span><br><span class="line">        <span class="comment">/* Record file-backed pagecache size when crossing THRASHING_RESET_INTERVAL_MS */</span></span><br><span class="line">        base_file_lru = vs.field.nr_inactive_file + vs.field.nr_active_file;</span><br><span class="line">        init_ws_refault = workingset_refault_file;</span><br><span class="line">        thrashing_reset_tm = curr_tm;</span><br><span class="line">        thrashing_limit = thrashing_limit_pct;</span><br><span class="line">    &#125; <span class="keyword">else</span> &#123;</span><br><span class="line">        <span class="comment">/* Calculate what % of the file-backed pagecache refaulted so far */</span></span><br><span class="line">        thrashing = (workingset_refault_file - init_ws_refault) * <span class="number">100</span> / (base_file_lru + <span class="number">1</span>);</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="comment">/* Add previous cycle&#x27;s decayed thrashing amount */</span></span><br><span 
class="line">    thrashing += prev_thrash_growth;</span><br><span class="line">    <span class="keyword">if</span> (max_thrashing &lt; thrashing) &#123;</span><br><span class="line">        max_thrashing = thrashing;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">/*</span></span><br><span class="line"><span class="comment">     * Refresh watermarks once per min in case user updated one of the margins.</span></span><br><span class="line"><span class="comment">     * <span class="doctag">TODO:</span> b/140521024 replace this periodic update with an API for AMS to notify LMKD</span></span><br><span class="line"><span class="comment">     * that zone watermarks were changed by the system software.</span></span><br><span class="line"><span class="comment">     */</span></span><br><span class="line">    <span class="keyword">if</span> (watermarks.high_wmark == <span class="number">0</span> || <span class="built_in">get_time_diff_ms</span>(&amp;wmark_update_tm, &amp;curr_tm) &gt; <span class="number">60000</span>) &#123;</span><br><span class="line">        <span class="keyword">struct</span> <span class="title class_">zoneinfo</span> zi;</span><br><span class="line"></span><br><span class="line">        <span class="keyword">if</span> (<span class="built_in">zoneinfo_parse</span>(&amp;zi) &lt; <span class="number">0</span>) &#123;</span><br><span class="line">            <span class="built_in">ALOGE</span>(<span class="string">&quot;Failed to parse zoneinfo!&quot;</span>);</span><br><span class="line">            <span class="keyword">return</span>;</span><br><span class="line">        &#125;</span><br><span class="line"></span><br><span class="line">        <span class="built_in">calc_zone_watermarks</span>(&amp;zi, &amp;watermarks);</span><br><span class="line">        wmark_update_tm = curr_tm;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span 
class="line">    <span class="comment">/* Find out which watermark is breached if any */</span></span><br><span class="line">    wmark = <span class="built_in">get_lowest_watermark</span>(&amp;mi, &amp;watermarks);</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (!<span class="built_in">psi_parse_mem</span>(&amp;psi_data)) &#123;</span><br><span class="line">        critical_stall = psi_data.mem_stats[PSI_FULL].avg10 &gt; (<span class="type">float</span>)stall_limit_critical;</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="comment">/*</span></span><br><span class="line"><span class="comment">     * <span class="doctag">TODO:</span> move this logic into a separate function</span></span><br><span class="line"><span class="comment">     * Decide if killing a process is necessary and record the reason</span></span><br><span class="line"><span class="comment">     */</span></span><br><span class="line">    <span class="keyword">if</span> (cycle_after_kill &amp;&amp; wmark &lt; WMARK_LOW) &#123;</span><br><span class="line">        <span class="comment">/*</span></span><br><span class="line"><span class="comment">         * Prevent kills not freeing enough memory which might lead to OOM kill.</span></span><br><span class="line"><span class="comment">         * This might happen when a process is consuming memory faster than reclaim can</span></span><br><span class="line"><span class="comment">         * free even after a kill. 
Mostly happens when running memory stress tests.</span></span><br><span class="line"><span class="comment">         */</span></span><br><span class="line">        kill_reason = PRESSURE_AFTER_KILL;</span><br><span class="line">        <span class="built_in">strncpy</span>(kill_desc, <span class="string">&quot;min watermark is breached even after kill&quot;</span>, <span class="built_in">sizeof</span>(kill_desc));</span><br><span class="line">    &#125; <span class="keyword">else</span> <span class="keyword">if</span> (level == VMPRESS_LEVEL_CRITICAL &amp;&amp; events != <span class="number">0</span>) &#123;</span><br><span class="line">        <span class="comment">/*</span></span><br><span class="line"><span class="comment">         * Device is too busy reclaiming memory which might lead to ANR.</span></span><br><span class="line"><span class="comment">         * Critical level is triggered when PSI complete stall (all tasks are blocked because</span></span><br><span class="line"><span class="comment">         * of the memory congestion) breaches the configured threshold.</span></span><br><span class="line"><span class="comment">         */</span></span><br><span class="line">        kill_reason = NOT_RESPONDING;</span><br><span class="line">        <span class="built_in">strncpy</span>(kill_desc, <span class="string">&quot;device is not responding&quot;</span>, <span class="built_in">sizeof</span>(kill_desc));</span><br><span class="line">    &#125; <span class="keyword">else</span> <span class="keyword">if</span> (swap_is_low &amp;&amp; thrashing &gt; thrashing_limit_pct) &#123;</span><br><span class="line">        <span class="comment">/* Page cache is thrashing while swap is low */</span></span><br><span class="line">        kill_reason = LOW_SWAP_AND_THRASHING;</span><br><span class="line">        <span class="built_in">snprintf</span>(kill_desc, <span class="built_in">sizeof</span>(kill_desc), <span class="string">&quot;device is low on swap (%&quot;</span> 
PRId64</span><br><span class="line">            <span class="string">&quot;kB &lt; %&quot;</span> PRId64 <span class="string">&quot;kB) and thrashing (%&quot;</span> PRId64 <span class="string">&quot;%%)&quot;</span>,</span><br><span class="line">            mi.field.free_swap * page_k, swap_low_threshold * page_k, thrashing);</span><br><span class="line">        <span class="comment">/* Do not kill perceptible apps unless below min watermark or heavily thrashing */</span></span><br><span class="line">        <span class="keyword">if</span> (wmark &gt; WMARK_MIN &amp;&amp; thrashing &lt; thrashing_critical_pct) &#123;</span><br><span class="line">            min_score_adj = PERCEPTIBLE_APP_ADJ + <span class="number">1</span>;</span><br><span class="line">        &#125;</span><br><span class="line">        check_filecache = <span class="literal">true</span>;</span><br><span class="line">    &#125; <span class="keyword">else</span> <span class="keyword">if</span> (swap_is_low &amp;&amp; wmark &lt; WMARK_HIGH) &#123;</span><br><span class="line">        <span class="comment">/* Both free memory and swap are low */</span></span><br><span class="line">        kill_reason = LOW_MEM_AND_SWAP;</span><br><span class="line">        <span class="built_in">snprintf</span>(kill_desc, <span class="built_in">sizeof</span>(kill_desc), <span class="string">&quot;%s watermark is breached and swap is low (%&quot;</span></span><br><span class="line">            PRId64 <span class="string">&quot;kB &lt; %&quot;</span> PRId64 <span class="string">&quot;kB)&quot;</span>, wmark &lt; WMARK_LOW ? 
<span class="string">&quot;min&quot;</span> : <span class="string">&quot;low&quot;</span>,</span><br><span class="line">            mi.field.free_swap * page_k, swap_low_threshold * page_k);</span><br><span class="line">        <span class="comment">/* Do not kill perceptible apps unless below min watermark or heavily thrashing */</span></span><br><span class="line">        <span class="keyword">if</span> (wmark &gt; WMARK_MIN &amp;&amp; thrashing &lt; thrashing_critical_pct) &#123;</span><br><span class="line">            min_score_adj = PERCEPTIBLE_APP_ADJ + <span class="number">1</span>;</span><br><span class="line">        &#125;</span><br><span class="line">    &#125; <span class="keyword">else</span> <span class="keyword">if</span> (wmark &lt; WMARK_HIGH &amp;&amp; swap_util_max &lt; <span class="number">100</span> &amp;&amp;</span><br><span class="line">               (swap_util = <span class="built_in">calc_swap_utilization</span>(&amp;mi)) &gt; swap_util_max) &#123;</span><br><span class="line">        <span class="comment">/*</span></span><br><span class="line"><span class="comment">         * Too much anon memory is swapped out but swap is not low.</span></span><br><span class="line"><span class="comment">         * Non-swappable allocations created memory pressure.</span></span><br><span class="line"><span class="comment">         */</span></span><br><span class="line">        kill_reason = LOW_MEM_AND_SWAP_UTIL;</span><br><span class="line">        <span class="built_in">snprintf</span>(kill_desc, <span class="built_in">sizeof</span>(kill_desc), <span class="string">&quot;%s watermark is breached and swap utilization&quot;</span></span><br><span class="line">            <span class="string">&quot; is high (%d%% &gt; %d%%)&quot;</span>, wmark &lt; WMARK_LOW ? 
<span class="string">&quot;min&quot;</span> : <span class="string">&quot;low&quot;</span>,</span><br><span class="line">            swap_util, swap_util_max);</span><br><span class="line">    &#125; <span class="keyword">else</span> <span class="keyword">if</span> (wmark &lt; WMARK_HIGH &amp;&amp; thrashing &gt; thrashing_limit) &#123;</span><br><span class="line">        <span class="comment">/* Page cache is thrashing while memory is low */</span></span><br><span class="line">        kill_reason = LOW_MEM_AND_THRASHING;</span><br><span class="line">        <span class="built_in">snprintf</span>(kill_desc, <span class="built_in">sizeof</span>(kill_desc), <span class="string">&quot;%s watermark is breached and thrashing (%&quot;</span></span><br><span class="line">            PRId64 <span class="string">&quot;%%)&quot;</span>, wmark &lt; WMARK_LOW ? <span class="string">&quot;min&quot;</span> : <span class="string">&quot;low&quot;</span>, thrashing);</span><br><span class="line">        cut_thrashing_limit = <span class="literal">true</span>;</span><br><span class="line">        <span class="comment">/* Do not kill perceptible apps unless thrashing at critical levels */</span></span><br><span class="line">        <span class="keyword">if</span> (thrashing &lt; thrashing_critical_pct) &#123;</span><br><span class="line">            min_score_adj = PERCEPTIBLE_APP_ADJ + <span class="number">1</span>;</span><br><span class="line">        &#125;</span><br><span class="line">        check_filecache = <span class="literal">true</span>;</span><br><span class="line">    &#125; <span class="keyword">else</span> <span class="keyword">if</span> (reclaim == DIRECT_RECLAIM &amp;&amp; thrashing &gt; thrashing_limit) &#123;</span><br><span class="line">        <span class="comment">/* Page cache is thrashing while in direct reclaim (mostly happens on lowram devices) */</span></span><br><span class="line">        kill_reason = DIRECT_RECL_AND_THRASHING;</span><br><span 
class="line">        <span class="built_in">snprintf</span>(kill_desc, <span class="built_in">sizeof</span>(kill_desc), <span class="string">&quot;device is in direct reclaim and thrashing (%&quot;</span></span><br><span class="line">            PRId64 <span class="string">&quot;%%)&quot;</span>, thrashing);</span><br><span class="line">        cut_thrashing_limit = <span class="literal">true</span>;</span><br><span class="line">        <span class="comment">/* Do not kill perceptible apps unless thrashing at critical levels */</span></span><br><span class="line">        <span class="keyword">if</span> (thrashing &lt; thrashing_critical_pct) &#123;</span><br><span class="line">            min_score_adj = PERCEPTIBLE_APP_ADJ + <span class="number">1</span>;</span><br><span class="line">        &#125;</span><br><span class="line">        check_filecache = <span class="literal">true</span>;</span><br><span class="line">    &#125; <span class="keyword">else</span> <span class="keyword">if</span> (check_filecache) &#123;</span><br><span class="line">        <span class="type">int64_t</span> file_lru_kb = (vs.field.nr_inactive_file + vs.field.nr_active_file) * page_k;</span><br><span class="line"></span><br><span class="line">        <span class="keyword">if</span> (file_lru_kb &lt; filecache_min_kb) &#123;</span><br><span class="line">            <span class="comment">/* File cache is too low after thrashing, keep killing background processes */</span></span><br><span class="line">            kill_reason = LOW_FILECACHE_AFTER_THRASHING;</span><br><span class="line">            <span class="built_in">snprintf</span>(kill_desc, <span class="built_in">sizeof</span>(kill_desc),</span><br><span class="line">                <span class="string">&quot;filecache is low (%&quot;</span> PRId64 <span class="string">&quot;kB &lt; %&quot;</span> PRId64 <span class="string">&quot;kB) after thrashing&quot;</span>,</span><br><span class="line">                file_lru_kb, 
filecache_min_kb);</span><br><span class="line">            min_score_adj = PERCEPTIBLE_APP_ADJ + <span class="number">1</span>;</span><br><span class="line">        &#125; <span class="keyword">else</span> &#123;</span><br><span class="line">            <span class="comment">/* File cache is big enough, stop checking */</span></span><br><span class="line">            check_filecache = <span class="literal">false</span>;</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">/* Kill a process if necessary */</span></span><br><span class="line">    <span class="keyword">if</span> (kill_reason != NONE) &#123;</span><br><span class="line">        <span class="keyword">struct</span> <span class="title class_">kill_info</span> ki = &#123;</span><br><span class="line">            .kill_reason = kill_reason,</span><br><span class="line">            .kill_desc = kill_desc,</span><br><span class="line">            .thrashing = (<span class="type">int</span>)thrashing,</span><br><span class="line">            .max_thrashing = max_thrashing,</span><br><span class="line">        &#125;;</span><br><span class="line"></span><br><span class="line">        <span class="comment">/* Allow killing perceptible apps if the system is stalled */</span></span><br><span class="line">        <span class="keyword">if</span> (critical_stall) &#123;</span><br><span class="line">            min_score_adj = <span class="number">0</span>;</span><br><span class="line">        &#125;</span><br><span class="line">        <span class="built_in">psi_parse_io</span>(&amp;psi_data);</span><br><span class="line">        <span class="built_in">psi_parse_cpu</span>(&amp;psi_data);</span><br><span class="line">        <span class="type">int</span> pages_freed = <span class="built_in">find_and_kill_process</span>(min_score_adj, &amp;ki, &amp;mi, &amp;wi, &amp;curr_tm, 
&amp;psi_data);</span><br><span class="line">        <span class="keyword">if</span> (pages_freed &gt; <span class="number">0</span>) &#123;</span><br><span class="line">            killing = <span class="literal">true</span>;</span><br><span class="line">            max_thrashing = <span class="number">0</span>;</span><br><span class="line">            <span class="keyword">if</span> (cut_thrashing_limit) &#123;</span><br><span class="line">                <span class="comment">/*</span></span><br><span class="line"><span class="comment">                 * Cut thrashing limit by thrashing_limit_decay_pct percentage of the current</span></span><br><span class="line"><span class="comment">                 * thrashing limit until the system stops thrashing.</span></span><br><span class="line"><span class="comment">                 */</span></span><br><span class="line">                thrashing_limit = (thrashing_limit * (<span class="number">100</span> - thrashing_limit_decay_pct)) / <span class="number">100</span>;</span><br><span class="line">            &#125;</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">    ...</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="References"></a>References</h2><ul><li><a href="https://source.android.com/docs/core/perf/lmkd?hl=zh-cn">https://source.android.com/docs/core/perf/lmkd?hl=zh-cn</a></li><li><a href="https://docs.kernel.org/accounting/psi.html">https://docs.kernel.org/accounting/psi.html</a></li></ul>]]></content>
    
    
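    <summary type="html">&lt;p&gt;Earlier versions of &lt;code&gt;Android&lt;/code&gt; relied on the in-kernel &lt;code&gt;lowmemorykiller&lt;/code&gt; driver to monitor system memory and to kill non-critical processes or apps when memory ran short, relieving memory pressure. The &lt;code&gt;lowmemorykiller&lt;/code&gt; driver was removed from the kernel as of version &lt;code&gt;4.12&lt;/code&gt;, so &lt;code&gt;Android&lt;/code&gt; added &lt;code&gt;lmkd(Low Memory Killer Daemon)&lt;/code&gt; to replace it: the daemon monitors the memory state and, when the system is under high memory pressure, proactively frees memory to keep the watermarks at an acceptable level. How, then, does &lt;code&gt;LMKD&lt;/code&gt; learn about the system's memory pressure? That is where the kernel's &lt;code&gt;PSI(Pressure stall information)&lt;/code&gt; module comes in.&lt;/p&gt;</summary>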
    
    
    
    <category term="Android" scheme="https://sniffer.site/categories/Android/"/>
    
    
    <category term="PSI" scheme="https://sniffer.site/tags/PSI/"/>
    
    <category term="Memory Management" scheme="https://sniffer.site/tags/%E5%86%85%E5%AD%98%E7%AE%A1%E7%90%86/"/>
    
    <category term="LMKD" scheme="https://sniffer.site/tags/LMKD/"/>
    
  </entry>
  
  <entry>
    <title>How to Optimize Android System Performance with cgroups</title>
    <link href="https://sniffer.site/2024/04/15/%E5%A6%82%E4%BD%95%E5%88%A9%E7%94%A8cgroup%E4%BC%98%E5%8C%96Android%E7%B3%BB%E7%BB%9F%E6%80%A7%E8%83%BD/"/>
    <id>https://sniffer.site/2024/04/15/%E5%A6%82%E4%BD%95%E5%88%A9%E7%94%A8cgroup%E4%BC%98%E5%8C%96Android%E7%B3%BB%E7%BB%9F%E6%80%A7%E8%83%BD/</id>
    <published>2024-04-15T08:45:56.000Z</published>
    <updated>2025-06-05T06:27:31.753Z</updated>
    
    <content type="html"><![CDATA[<p><code>cgroups(Control Groups)</code> is a <code>Linux</code> mechanism for grouping processes and controlling their access to resources. It partitions the processes in the system into groups arranged in a tree-like hierarchy. Through these groups, each process's use of system resources such as CPU, IO, memory, and network can be prioritized, so that high-priority processes get more resources when the system is under contention. In short, <code>cgroups</code> give finer-grained control over resource allocation, access priority, access limits, management, and monitoring, which improves system performance. This article describes how <code>Android</code> uses <code>cgroups</code> to improve system performance, in the following parts:</p><ul><li>A brief introduction to how <code>cgroup</code> works</li><li>Android's <code>cgroup</code> grouping policy</li><li>How to use <code>cgroup</code> to optimize Android system performance</li></ul><span id="more"></span><h2 id="cgroup的实现原理"><a href="#cgroup的实现原理" class="headerlink" title="How cgroup Works"></a><strong>How cgroup Works</strong></h2><p>During initialization, the <code>Linux</code> kernel sets up the <code>cgroup</code> configuration, creates a root <code>cgroup</code>, and registers a virtual filesystem that is mounted at <code>/sys/fs/cgroup</code>. Once user space has run <code>mount</code>, it can operate on cgroups through these interfaces.</p><figure class="highlight c"><table><tr><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment">/**</span></span><br><span class="line"><span class="comment"> * cgroup_init_early - cgroup initialization at system boot</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> * Initialize cgroups at system boot, and initialize any</span></span><br><span class="line"><span class="comment"> * subsystems that request early init.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="type">int</span> __init <span class="title function_">cgroup_init_early</span><span class="params">(<span class="type">void</span>)</span></span><br><span class="line">&#123;</span><br><span class="line"><span class="type">static</span> <span class="class"><span class="keyword">struct</span> <span class="title">cgroup_fs_context</span> __<span class="title">initdata</span> <span class="title">ctx</span>;</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">cgroup_subsys</span> *<span class="title">ss</span>;</span></span><br><span class="line"><span class="type">int</span> i;</span><br><span class="line"></span><br><span class="line">ctx.root = &amp;cgrp_dfl_root;</span><br><span class="line">init_cgroup_root(&amp;ctx);</span><br><span class="line">cgrp_dfl_root.cgrp.self.flags |= CSS_NO_REF;</span><br><span class="line"></span><br><span class="line">RCU_INIT_POINTER(init_task.cgroups, &amp;init_css_set);</span><br><span class="line"></span><br><span class="line">for_each_subsys(ss, i) &#123;</span><br><span class="line">WARN(!ss-&gt;css_alloc || !ss-&gt;css_free || ss-&gt;name || ss-&gt;id,</span><br><span class="line">     <span
class="string">&quot;invalid cgroup_subsys %d:%s css_alloc=%p css_free=%p id:name=%d:%s\n&quot;</span>,</span><br><span class="line">     i, cgroup_subsys_name[i], ss-&gt;css_alloc, ss-&gt;css_free,</span><br><span class="line">     ss-&gt;id, ss-&gt;name);</span><br><span class="line">WARN(<span class="built_in">strlen</span>(cgroup_subsys_name[i]) &gt; MAX_CGROUP_TYPE_NAMELEN,</span><br><span class="line">     <span class="string">&quot;cgroup_subsys_name %s too long\n&quot;</span>, cgroup_subsys_name[i]);</span><br><span class="line"></span><br><span class="line">ss-&gt;id = i;</span><br><span class="line">ss-&gt;name = cgroup_subsys_name[i];</span><br><span class="line"><span class="keyword">if</span> (!ss-&gt;legacy_name)</span><br><span class="line">ss-&gt;legacy_name = cgroup_subsys_name[i];</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (ss-&gt;early_init)</span><br><span class="line">cgroup_init_subsys(ss, <span class="literal">true</span>);</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>目前<code>Linux</code>内核中常用的<code>cgroup</code>有如下几种：</p><ul><li><code>cpuset</code>: 控制<code>CPU</code>核的分组，可以将指定的<code>CPU</code>核心分配到某个<code>cgroup</code>中，从而控制系统中的<code>CPU</code>资源的使用；要使用<code>cpuset</code>需要开启内核配置<code>CONFIG_CPUSETS</code></li><li><code>cpu</code>: 控制<code>CPU</code>分组调度，用于控制不同分组的调度时间片分配，确保高优先级的任务可以得到更多的时间片，对应的内核配置为<code>CONFIG_CGROUP_SCHED</code></li><li><code>cpuacct</code>: 用于控制不同分组的<code>CPU</code>使用状态统计，可以看到各个分组的<code>CPU</code>调度与使用状态数据，要使用<code>cpuacct</code>需要开启内核配置<code>CONFIG_CGROUP_CPUACCT</code></li><li><code>blkio</code>: 
controls each group's use of disk <code>IO</code> resources, for example guaranteeing <code>IO</code> priority and bandwidth for foreground apps and reducing how much system <code>IO</code> background apps can take. Using <code>blkio</code> requires the kernel option <code>CONFIG_BLK_CGROUP</code></li><li><code>memcg</code>: controls memory allocation and usage per group, for example capping the memory used by certain processes, or capping a guest's total memory usage in virtualization scenarios. The corresponding kernel option for <code>memcg</code> is <code>CONFIG_MEMCG</code></li><li><code>freezer</code>: the freezer subsystem, typically used to freeze processes. For example, when system resources are tight, some background tasks can be frozen proactively to reduce resource pressure. Using <code>freezer</code> requires the kernel option <code>CONFIG_CGROUP_FREEZER</code></li></ul><p><code>Android</code> keeps its <code>cgroup</code> configuration in the descriptor file <code>cgroups.json</code> (under <code>/system/core/libprocessgroup/profiles/</code>). When the <code>init</code> process starts, it reads this file and mounts each controller at its own node under <code>/dev/xxx</code>: the <code>cpu</code> controller at <code>/dev/cpuctl</code>, <code>cpuset</code> at <code>/dev/cpuset</code>, and <code>memory</code> at <code>/dev/memcg</code>.</p><figure class="highlight c"><table><tr><td class="code"><pre><span class="line"></span><br><span class="line"><span class="type">static</span> <span class="type">bool</span> <span class="title function_">SetupCgroup</span><span class="params">(<span class="type">const</span> CgroupDescriptor&amp; descriptor)</span> &#123;</span><br><span class="line">    <span class="type">const</span> format::CgroupController* controller = descriptor.controller();</span><br><span class="line"></span><br><span class="line">    <span class="type">int</span> result;</span><br><span class="line">    <span class="keyword">if</span> (controller-&gt;version() == <span
class="number">2</span>) &#123;</span><br><span class="line">        result = <span class="number">0</span>;</span><br><span class="line">        <span class="keyword">if</span> (!<span class="built_in">strcmp</span>(controller-&gt;name(), CGROUPV2_CONTROLLER_NAME)) &#123;</span><br><span class="line">            <span class="comment">// /sys/fs/cgroup is created by cgroup2 with specific selinux permissions,</span></span><br><span class="line">            <span class="comment">// try to create again in case the mount point is changed</span></span><br><span class="line">            <span class="keyword">if</span> (!Mkdir(controller-&gt;path(), <span class="number">0</span>, <span class="string">&quot;&quot;</span>, <span class="string">&quot;&quot;</span>)) &#123;</span><br><span class="line">                LOG(ERROR) &lt;&lt; <span class="string">&quot;Failed to create directory for &quot;</span> &lt;&lt; controller-&gt;name() &lt;&lt; <span class="string">&quot; cgroup&quot;</span>;</span><br><span class="line">                <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">            &#125;</span><br><span class="line"></span><br><span class="line">            result = mount(<span class="string">&quot;none&quot;</span>, controller-&gt;path(), <span class="string">&quot;cgroup2&quot;</span>, MS_NODEV | MS_NOEXEC | MS_NOSUID,</span><br><span class="line">                           nullptr);</span><br><span class="line"></span><br><span class="line">            <span class="comment">// selinux permissions change after mounting, so it&#x27;s ok to change mode and owner now</span></span><br><span class="line">            <span class="keyword">if</span> (!ChangeDirModeAndOwner(controller-&gt;path(), descriptor.mode(), descriptor.uid(),</span><br><span class="line">                                       descriptor.gid())) &#123;</span><br><span class="line">                LOG(ERROR) &lt;&lt; <span 
class="string">&quot;Failed to create directory for &quot;</span> &lt;&lt; controller-&gt;name() &lt;&lt; <span class="string">&quot; cgroup&quot;</span>;</span><br><span class="line">                result = <span class="number">-1</span>;</span><br><span class="line">            &#125;</span><br><span class="line">        &#125; <span class="keyword">else</span> &#123;</span><br><span class="line">            <span class="keyword">if</span> (!Mkdir(controller-&gt;path(), descriptor.mode(), descriptor.uid(), descriptor.gid())) &#123;</span><br><span class="line">                LOG(ERROR) &lt;&lt; <span class="string">&quot;Failed to create directory for &quot;</span> &lt;&lt; controller-&gt;name() &lt;&lt; <span class="string">&quot; cgroup&quot;</span>;</span><br><span class="line">                <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">            &#125;</span><br><span class="line"></span><br><span class="line">            <span class="keyword">if</span> (controller-&gt;flags() &amp; CGROUPRC_CONTROLLER_FLAG_NEEDS_ACTIVATION) &#123;</span><br><span class="line">                <span class="built_in">std</span>::<span class="built_in">string</span> str = <span class="built_in">std</span>::<span class="built_in">string</span>(<span class="string">&quot;+&quot;</span>) + controller-&gt;name();</span><br><span class="line">                <span class="built_in">std</span>::<span class="built_in">string</span> path = <span class="built_in">std</span>::<span class="built_in">string</span>(controller-&gt;path()) + <span class="string">&quot;/cgroup.subtree_control&quot;</span>;</span><br><span class="line"></span><br><span class="line">                <span class="keyword">if</span> (!base::WriteStringToFile(str, path)) &#123;</span><br><span class="line">                    LOG(ERROR) &lt;&lt; <span class="string">&quot;Failed to activate controller &quot;</span> &lt;&lt; 
controller-&gt;name();</span><br><span class="line">                    <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">                &#125;</span><br><span class="line">            &#125;</span><br><span class="line">        &#125;</span><br><span class="line">    &#125; <span class="keyword">else</span> &#123;</span><br><span class="line">        <span class="comment">// mkdir &lt;path&gt; [mode] [owner] [group]</span></span><br><span class="line">        <span class="keyword">if</span> (!Mkdir(controller-&gt;path(), descriptor.mode(), descriptor.uid(), descriptor.gid())) &#123;</span><br><span class="line">            LOG(ERROR) &lt;&lt; <span class="string">&quot;Failed to create directory for &quot;</span> &lt;&lt; controller-&gt;name() &lt;&lt; <span class="string">&quot; cgroup&quot;</span>;</span><br><span class="line">            <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">        &#125;</span><br><span class="line"></span><br><span class="line">        <span class="comment">// Unfortunately historically cpuset controller was mounted using a mount command</span></span><br><span class="line">        <span class="comment">// different from all other controllers. This results in controller attributes not</span></span><br><span class="line">        <span class="comment">// to be prepended with controller name. 
For example this way instead of</span></span><br><span class="line">        <span class="comment">// /dev/cpuset/cpuset.cpus the attribute becomes /dev/cpuset/cpus which is what</span></span><br><span class="line">        <span class="comment">// the system currently expects.</span></span><br><span class="line">        <span class="keyword">if</span> (!<span class="built_in">strcmp</span>(controller-&gt;name(), <span class="string">&quot;cpuset&quot;</span>)) &#123;</span><br><span class="line">            <span class="comment">// mount cpuset none /dev/cpuset nodev noexec nosuid</span></span><br><span class="line">            result = mount(<span class="string">&quot;none&quot;</span>, controller-&gt;path(), controller-&gt;name(),</span><br><span class="line">                           MS_NODEV | MS_NOEXEC | MS_NOSUID, nullptr);</span><br><span class="line">        &#125; <span class="keyword">else</span> &#123;</span><br><span class="line">            <span class="comment">// mount cgroup none &lt;path&gt; nodev noexec nosuid &lt;controller&gt;</span></span><br><span class="line">            result = mount(<span class="string">&quot;none&quot;</span>, controller-&gt;path(), <span class="string">&quot;cgroup&quot;</span>, MS_NODEV | MS_NOEXEC | MS_NOSUID,</span><br><span class="line">                           controller-&gt;name());</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (result &lt; <span class="number">0</span>) &#123;</span><br><span class="line">        <span class="type">bool</span> optional = controller-&gt;flags() &amp; CGROUPRC_CONTROLLER_FLAG_OPTIONAL;</span><br><span class="line"></span><br><span class="line">        <span class="keyword">if</span> (optional &amp;&amp; errno == EINVAL) &#123;</span><br><span class="line">            <span class="comment">// Optional controllers are allowed to fail to mount if kernel 
does not support them</span></span><br><span class="line">            LOG(INFO) &lt;&lt; <span class="string">&quot;Optional &quot;</span> &lt;&lt; controller-&gt;name() &lt;&lt; <span class="string">&quot; cgroup controller is not mounted&quot;</span>;</span><br><span class="line">        &#125; <span class="keyword">else</span> &#123;</span><br><span class="line">            PLOG(ERROR) &lt;&lt; <span class="string">&quot;Failed to mount &quot;</span> &lt;&lt; controller-&gt;name() &lt;&lt; <span class="string">&quot; cgroup&quot;</span>;</span><br><span class="line">            <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> <span class="literal">true</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>In addition, during <code>init</code> initialization <code>Android</code> creates a number of cgroup groups, and then assigns each process to a group either through <code>rc</code> configuration or through <code>Framework</code> interfaces. The next section looks at how <code>Android</code> manages these cgroup groups.</p><h2 id="Android中的cgroup分组管理策略"><a href="#Android中的cgroup分组管理策略" class="headerlink" title="Android中的cgroup分组管理策略"></a><strong>cgroup Group Management Strategy in Android</strong></h2><p>To guarantee resources for foreground apps, limit resource contention from background apps, and ensure that critical tasks are scheduled and run, <code>Android</code> defines several process groups:</p><ul><li><code>foreground</code>: the foreground process group. Most apps belong to this group, including system services, apps, the launcher, and the system UI.</li><li><code>background</code>: the background process group. Long-running background processes such as <code>logd</code> can be placed in this group.</li><li><code>system-background</code>: the system service process group. Some key Android system services, such as <code>update_engine</code>, <code>traced_perf</code>, and <code>system_server</code>, are placed in this group.</li><li><code>top-app</code>: the interactive process group. Visible apps that the user is currently interacting with are placed in this group, so that the foreground interactive app gets resource priority.</li><li><code>camera-daemon</code>: the camera process group. Core camera services are placed in this group, so that processes using the camera get priority in resource allocation.</li></ul><p><code>Android</code> provides the task profile file <code>task_profiles.json</code> (under <code>/system/core/libprocessgroup/profiles/</code>), which describes the specific actions to apply to a process or thread. Each set of actions is associated with a profile name and can be applied through the functions <code>SetTaskProfiles</code>&#x2F;<code>SetProcessProfiles</code>.</p><p>For example, the stock <code>task_profiles.json</code> file looks like this:</p><figure class="highlight json"><table><tr><td class="code"><pre><span class="line"></span><br><span class="line"><span class="punctuation">&#123;</span></span><br><span class="line">  <span class="attr">&quot;Attributes&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;MaxCapacityCPUs&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Controller&quot;</span><span class="punctuation">:</span> <span class="string">&quot;cpuset&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;File&quot;</span><span class="punctuation">:</span> <span class="string">&quot;top-app/cpus&quot;</span></span><br><span class="line">    <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;UClampLatencySensitive&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Controller&quot;</span><span class="punctuation">:</span> <span class="string">&quot;cpu&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span 
class="attr">&quot;File&quot;</span><span class="punctuation">:</span> <span class="string">&quot;cpu.uclamp.latency_sensitive&quot;</span></span><br><span class="line">    <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;FreezerState&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Controller&quot;</span><span class="punctuation">:</span> <span class="string">&quot;freezer&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;File&quot;</span><span class="punctuation">:</span> <span class="string">&quot;cgroup.freeze&quot;</span></span><br><span class="line">    <span class="punctuation">&#125;</span></span><br><span class="line">  <span class="punctuation">]</span><span class="punctuation">,</span></span><br><span class="line"></span><br><span class="line">  <span class="attr">&quot;Profiles&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;Frozen&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Actions&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">        <span class="punctuation">&#123;</span></span><br><span class="line">          <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;SetAttribute&quot;</span><span class="punctuation">,</span></span><br><span class="line">          <span 
class="attr">&quot;Params&quot;</span><span class="punctuation">:</span></span><br><span class="line">          <span class="punctuation">&#123;</span></span><br><span class="line">            <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;FreezerState&quot;</span><span class="punctuation">,</span></span><br><span class="line">            <span class="attr">&quot;Value&quot;</span><span class="punctuation">:</span> <span class="string">&quot;1&quot;</span></span><br><span class="line">          <span class="punctuation">&#125;</span></span><br><span class="line">        <span class="punctuation">&#125;</span></span><br><span class="line">      <span class="punctuation">]</span></span><br><span class="line">    <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;HighPerformance&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Actions&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">        <span class="punctuation">&#123;</span></span><br><span class="line">          <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;JoinCgroup&quot;</span><span class="punctuation">,</span></span><br><span class="line">          <span class="attr">&quot;Params&quot;</span><span class="punctuation">:</span></span><br><span class="line">          <span class="punctuation">&#123;</span></span><br><span class="line">            <span class="attr">&quot;Controller&quot;</span><span class="punctuation">:</span> <span class="string">&quot;cpu&quot;</span><span class="punctuation">,</span></span><br><span class="line">            <span 
class="attr">&quot;Path&quot;</span><span class="punctuation">:</span> <span class="string">&quot;foreground&quot;</span></span><br><span class="line">          <span class="punctuation">&#125;</span></span><br><span class="line">        <span class="punctuation">&#125;</span></span><br><span class="line">      <span class="punctuation">]</span></span><br><span class="line">    <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;MaxPerformance&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Actions&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">        <span class="punctuation">&#123;</span></span><br><span class="line">          <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;JoinCgroup&quot;</span><span class="punctuation">,</span></span><br><span class="line">          <span class="attr">&quot;Params&quot;</span><span class="punctuation">:</span></span><br><span class="line">          <span class="punctuation">&#123;</span></span><br><span class="line">            <span class="attr">&quot;Controller&quot;</span><span class="punctuation">:</span> <span class="string">&quot;cpu&quot;</span><span class="punctuation">,</span></span><br><span class="line">            <span class="attr">&quot;Path&quot;</span><span class="punctuation">:</span> <span class="string">&quot;top-app&quot;</span></span><br><span class="line">          <span class="punctuation">&#125;</span></span><br><span class="line">        <span class="punctuation">&#125;</span></span><br><span class="line">      <span class="punctuation">]</span></span><br><span class="line">    <span 
class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;CameraServicePerformance&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Actions&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">        <span class="punctuation">&#123;</span></span><br><span class="line">          <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;JoinCgroup&quot;</span><span class="punctuation">,</span></span><br><span class="line">          <span class="attr">&quot;Params&quot;</span><span class="punctuation">:</span></span><br><span class="line">          <span class="punctuation">&#123;</span></span><br><span class="line">            <span class="attr">&quot;Controller&quot;</span><span class="punctuation">:</span> <span class="string">&quot;cpu&quot;</span><span class="punctuation">,</span></span><br><span class="line">            <span class="attr">&quot;Path&quot;</span><span class="punctuation">:</span> <span class="string">&quot;camera-daemon&quot;</span></span><br><span class="line">          <span class="punctuation">&#125;</span></span><br><span class="line">        <span class="punctuation">&#125;</span></span><br><span class="line">      <span class="punctuation">]</span></span><br><span class="line">    <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line"></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;ProcessCapacityLow&quot;</span><span 
class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Actions&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">        <span class="punctuation">&#123;</span></span><br><span class="line">          <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;JoinCgroup&quot;</span><span class="punctuation">,</span></span><br><span class="line">          <span class="attr">&quot;Params&quot;</span><span class="punctuation">:</span></span><br><span class="line">          <span class="punctuation">&#123;</span></span><br><span class="line">            <span class="attr">&quot;Controller&quot;</span><span class="punctuation">:</span> <span class="string">&quot;cpuset&quot;</span><span class="punctuation">,</span></span><br><span class="line">            <span class="attr">&quot;Path&quot;</span><span class="punctuation">:</span> <span class="string">&quot;background&quot;</span></span><br><span class="line">          <span class="punctuation">&#125;</span></span><br><span class="line">        <span class="punctuation">&#125;</span></span><br><span class="line">      <span class="punctuation">]</span></span><br><span class="line">    <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;ProcessCapacityHigh&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Actions&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">        <span class="punctuation">&#123;</span></span><br><span class="line">          <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span 
class="string">&quot;JoinCgroup&quot;</span><span class="punctuation">,</span></span><br><span class="line">          <span class="attr">&quot;Params&quot;</span><span class="punctuation">:</span></span><br><span class="line">          <span class="punctuation">&#123;</span></span><br><span class="line">            <span class="attr">&quot;Controller&quot;</span><span class="punctuation">:</span> <span class="string">&quot;cpuset&quot;</span><span class="punctuation">,</span></span><br><span class="line">            <span class="attr">&quot;Path&quot;</span><span class="punctuation">:</span> <span class="string">&quot;foreground&quot;</span></span><br><span class="line">          <span class="punctuation">&#125;</span></span><br><span class="line">        <span class="punctuation">&#125;</span></span><br><span class="line">      <span class="punctuation">]</span></span><br><span class="line">    <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line"></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;HighIoPriority&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Actions&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">        <span class="punctuation">&#123;</span></span><br><span class="line">          <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;JoinCgroup&quot;</span><span class="punctuation">,</span></span><br><span class="line">          <span class="attr">&quot;Params&quot;</span><span class="punctuation">:</span></span><br><span class="line">          <span class="punctuation">&#123;</span></span><br><span class="line">            <span 
class="attr">&quot;Controller&quot;</span><span class="punctuation">:</span> <span class="string">&quot;blkio&quot;</span><span class="punctuation">,</span></span><br><span class="line">            <span class="attr">&quot;Path&quot;</span><span class="punctuation">:</span> <span class="string">&quot;&quot;</span></span><br><span class="line">          <span class="punctuation">&#125;</span></span><br><span class="line">        <span class="punctuation">&#125;</span></span><br><span class="line">      <span class="punctuation">]</span></span><br><span class="line">    <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line"></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;SFMainPolicy&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Actions&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">        <span class="punctuation">&#123;</span></span><br><span class="line">          <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;JoinCgroup&quot;</span><span class="punctuation">,</span></span><br><span class="line">          <span class="attr">&quot;Params&quot;</span><span class="punctuation">:</span></span><br><span class="line">          <span class="punctuation">&#123;</span></span><br><span class="line">            <span class="attr">&quot;Controller&quot;</span><span class="punctuation">:</span> <span class="string">&quot;cpuset&quot;</span><span class="punctuation">,</span></span><br><span class="line">            <span class="attr">&quot;Path&quot;</span><span class="punctuation">:</span> <span class="string">&quot;system-background&quot;</span></span><br><span class="line">          
<span class="punctuation">&#125;</span></span><br><span class="line">        <span class="punctuation">&#125;</span></span><br><span class="line">      <span class="punctuation">]</span></span><br><span class="line">    <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;PerfBoost&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Actions&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">        <span class="punctuation">&#123;</span></span><br><span class="line">          <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;SetClamps&quot;</span><span class="punctuation">,</span></span><br><span class="line">          <span class="attr">&quot;Params&quot;</span><span class="punctuation">:</span></span><br><span class="line">          <span class="punctuation">&#123;</span></span><br><span class="line">            <span class="attr">&quot;Boost&quot;</span><span class="punctuation">:</span> <span class="string">&quot;50%&quot;</span><span class="punctuation">,</span></span><br><span class="line">            <span class="attr">&quot;Clamp&quot;</span><span class="punctuation">:</span> <span class="string">&quot;0&quot;</span></span><br><span class="line">          <span class="punctuation">&#125;</span></span><br><span class="line">        <span class="punctuation">&#125;</span></span><br><span class="line">      <span class="punctuation">]</span></span><br><span class="line">    <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line"></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span 
class="line">      <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;LowMemoryUsage&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Actions&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">        <span class="punctuation">&#123;</span></span><br><span class="line">          <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;SetAttribute&quot;</span><span class="punctuation">,</span></span><br><span class="line">          <span class="attr">&quot;Params&quot;</span><span class="punctuation">:</span></span><br><span class="line">          <span class="punctuation">&#123;</span></span><br><span class="line">            <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;MemSoftLimit&quot;</span><span class="punctuation">,</span></span><br><span class="line">            <span class="attr">&quot;Value&quot;</span><span class="punctuation">:</span> <span class="string">&quot;16MB&quot;</span></span><br><span class="line">          <span class="punctuation">&#125;</span></span><br><span class="line">        <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">        <span class="punctuation">&#123;</span></span><br><span class="line">          <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;SetAttribute&quot;</span><span class="punctuation">,</span></span><br><span class="line">          <span class="attr">&quot;Params&quot;</span><span class="punctuation">:</span></span><br><span class="line">          <span class="punctuation">&#123;</span></span><br><span class="line">            <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span 
class="string">&quot;MemSwappiness&quot;</span><span class="punctuation">,</span></span><br><span class="line">            <span class="attr">&quot;Value&quot;</span><span class="punctuation">:</span> <span class="string">&quot;150&quot;</span></span><br><span class="line"></span><br><span class="line">          <span class="punctuation">&#125;</span></span><br><span class="line">        <span class="punctuation">&#125;</span></span><br><span class="line">      <span class="punctuation">]</span></span><br><span class="line">    <span class="punctuation">&#125;</span></span><br><span class="line">  <span class="punctuation">]</span><span class="punctuation">,</span></span><br><span class="line"></span><br><span class="line">  <span class="attr">&quot;AggregateProfiles&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;SCHED_SP_BACKGROUND&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Profiles&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span> <span class="string">&quot;HighEnergySaving&quot;</span><span class="punctuation">,</span> <span class="string">&quot;LowIoPriority&quot;</span><span class="punctuation">,</span> <span class="string">&quot;TimerSlackHigh&quot;</span> <span class="punctuation">]</span></span><br><span class="line">    <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;SCHED_SP_FOREGROUND&quot;</span><span class="punctuation">,</span></span><br><span 
class="line">      <span class="attr">&quot;Profiles&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span> <span class="string">&quot;HighPerformance&quot;</span><span class="punctuation">,</span> <span class="string">&quot;HighIoPriority&quot;</span><span class="punctuation">,</span> <span class="string">&quot;TimerSlackNormal&quot;</span> <span class="punctuation">]</span></span><br><span class="line">    <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;SCHED_SP_TOP_APP&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Profiles&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span> <span class="string">&quot;MaxPerformance&quot;</span><span class="punctuation">,</span> <span class="string">&quot;MaxIoPriority&quot;</span><span class="punctuation">,</span> <span class="string">&quot;TimerSlackNormal&quot;</span> <span class="punctuation">]</span></span><br><span class="line">    <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;Name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;SCHED_SP_SYSTEM&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;Profiles&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span> <span class="string">&quot;ServicePerformance&quot;</span><span class="punctuation">,</span> <span class="string">&quot;LowIoPriority&quot;</span><span class="punctuation">,</span> <span class="string">&quot;TimerSlackNormal&quot;</span> <span 
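class="punctuation">]</span></span><br><span class="line">    <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">  <span class="punctuation">]</span></span><br><span class="line"><span class="punctuation">&#125;</span></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>The task profiles file has two main sections, <code>Attributes</code> and <code>Profiles</code>. The <code>Attributes</code> section describes the cgroup files that can be configured, and each entry contains:</p><ul><li><code>Name</code> field: specifies the name of the <code>Attribute</code></li><li><code>Controller</code> field: references a <code>cgroup</code> controller from the <code>cgroups.json</code> file by name</li><li><code>File</code> field: names a specific file under that controller</li></ul><p><code>Attributes</code> are references used inside task profile definitions. Outside of task profiles, use attributes only when the framework needs direct access to those files and the access cannot be abstracted with task profiles. In all other cases, use task profiles; they provide better decoupling between the required behavior and its implementation details. The <code>Profiles</code> section contains task profile definitions with the following fields:</p><ul><li><code>Name</code> field: defines the name of the profile</li><li><code>Actions</code> section: lists the set of actions performed when the profile is applied. Each action has:<ul><li><code>Name</code> field: specifies the action</li><li><code>Params</code> field: specifies a set of parameters for the action</li></ul></li></ul><p>The commonly used supported actions are listed below:</p><table><thead><tr><th>Action</th><th>Parameters</th><th>Description</th></tr></thead><tbody><tr><td>SetTimerSlack</td><td>Slack</td><td>Timer slack in ns</td></tr><tr><td>SetAttribute</td><td>Name&#x2F;Value</td><td>Name of an attribute from the Attributes section and the value to write to it</td></tr><tr><td>WriteFile</td><td>FilePath&#x2F;Value</td><td>Path of the file and the value to write to it</td></tr><tr><td>JoinCgroup</td><td>Controller&#x2F;Path</td><td>Name of the cgroup controller and the path within the controller hierarchy</td></tr></tbody></table><p>Android 12 and higher add an <code>AggregateProfiles</code> section that contains aggregate profiles; each aggregate profile is an alias for a set of one or more profiles. An aggregate profile definition has two fields:</p><ul><li><code>Name</code> field: defines the name of the aggregate profile</li><li><code>Profiles</code> field: lists the names of the profiles included in the aggregate profile</li></ul><h2 id="利用cgroup优化Android系统性能"><a href="#利用cgroup优化Android系统性能" class="headerlink" 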
title="利用cgroup优化Android系统性能"></a><strong>Using cgroup to Optimize Android System Performance</strong></h2><p>Android offers two ways to control which <code>cgroup</code> a process belongs to and the priority state that goes with it: one is through the <code>init</code> scripting language, the other is through the <code>libprocessgroup</code> API:</p><ul><li>The <code>init</code> scripting language provides a <code>task_profiles</code> option that sets the <code>cgroup</code> state of a service. Its format is:</li></ul><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">task_profiles MaxPerformance</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>On versions below Android 12, <code>writepid</code> can also be used to write the process ID into the corresponding <code>cgroup</code> directory:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">writepid /dev/cpuctl/top-app/tasks</span><br><span class="line"></span><br></pre></td></tr></table></figure><ul><li>Setting the <code>cgroup</code> state of a process through the <code>API</code>: for compatibility, Android 10 and higher keep the <code>cutils/sched_policy.h</code> functions <code>set_cpuset_policy</code>, <code>set_sched_policy</code>, and <code>get_sched_policy</code>, but from Android 10 onward these functions have all moved into <code>libprocessgroup</code>, so the <code>processgroup/sched_policy.h</code> header is the recommended one to use.</li></ul><p>So what kinds of performance optimization can we build on the <code>cgroup</code> mechanism that Android provides? At its core, <code>cgroup</code> is about allocating and controlling resources so that high-priority tasks get more of them. From that angle, there are roughly the following directions:</p><ul><li>Use <code>cpu</code>&#x2F;<code>cpuset</code> cgroup management to distribute work sensibly across big and little cores, which matters especially on the heterogeneous big.LITTLE architectures found in mobile devices. For example, binding groups such as <code>top-app</code> and 
<code>camera-daemon</code> to the big cores, and groups such as <code>background</code>&#x2F;<code>system-background</code> to the little cores, gives critical tasks more resources and keeps system response latency low</li><li>To reduce rendering latency, bind the threads and services related to <code>SurfaceFlinger</code> to the big cores where possible, which raises the rendering frame rate and reduces jank and dropped frames</li><li>Use the <code>blkio</code> group to control <code>I/O</code> usage of foreground and background work; limiting background <code>I/O</code> when the system is heavily loaded lowers <code>I/O</code> latency for foreground apps</li></ul><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a><strong>References</strong></h2><ul><li><a href="https://source.android.com/docs/core/perf/cgroups?hl=zh-cn">https://source.android.com/docs/core/perf/cgroups?hl=zh-cn</a></li><li><a href="https://en.wikipedia.org/wiki/ARM_big.LITTLE">https://en.wikipedia.org/wiki/ARM_big.LITTLE</a></li></ul>]]></content>
    
    
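    <summary type="html">&lt;p&gt;&lt;code&gt;cgroups(Control Groups)&lt;/code&gt; is a &lt;code&gt;Linux&lt;/code&gt; mechanism for grouping processes and controlling their access to resources. It partitions the processes in a system into groups organized as a tree-like hierarchy, and these groups are used to prioritize each process's use of system resources such as CPU, I/O, memory, and network, so that high-priority processes get more resources when the system is under pressure. In short, &lt;code&gt;cgroups&lt;/code&gt; give finer control over resource allocation, access priority, access limits, management, and monitoring, which improves system performance. This article describes how &lt;code&gt;Android&lt;/code&gt; uses &lt;code&gt;cgroups&lt;/code&gt; to improve system performance, in the following parts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A brief introduction to how &lt;code&gt;cgroup&lt;/code&gt; is implemented&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;cgroup&lt;/code&gt; group management strategy in Android&lt;/li&gt;
&lt;li&gt;How to use &lt;code&gt;cgroup&lt;/code&gt; to optimize Android system performance&lt;/li&gt;
&lt;/ul&gt;</summary>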
    
    
    
    <category term="Android" scheme="https://sniffer.site/categories/Android/"/>
    
    
    <category term="性能优化" scheme="https://sniffer.site/tags/%E6%80%A7%E8%83%BD%E4%BC%98%E5%8C%96/"/>
    
    <category term="cgroups" scheme="https://sniffer.site/tags/cgroups/"/>
    
  </entry>
  
  <entry>
    <title>How to Port fio to the Android Platform</title>
    <link href="https://sniffer.site/2024/03/28/%E5%A6%82%E4%BD%95%E7%A7%BB%E6%A4%8Dfio%E5%88%B0Android%E5%B9%B3%E5%8F%B0/"/>
    <id>https://sniffer.site/2024/03/28/%E5%A6%82%E4%BD%95%E7%A7%BB%E6%A4%8Dfio%E5%88%B0Android%E5%B9%B3%E5%8F%B0/</id>
    <published>2024-03-28T12:01:20.000Z</published>
    <updated>2024-03-31T10:06:45.067Z</updated>
    
    <content type="html"><![CDATA[<p>fio is a widely used and powerful disk benchmarking tool. It can measure raw disk performance and can also replay recorded <code>I/O</code> to simulate real user requests. Its main features are:</p><ul><li>Works with many file systems, including NTFS, ext4, btrfs, and xfs</li><li>Supports many I/O patterns, including read, write, randread, randwrite, randrw, and trim<span id="more"></span></li><li>Covers different kinds of workloads: sequential and random reads, sequential and random writes, and mixed reads and writes</li><li><code>fio</code> also supports I&#x2F;O throttling, such as capping bandwidth or IOPS and setting a latency target</li></ul><p>This article describes how to port <code>fio</code> to the Android platform and shows some common usage. Let us start with cross-compiling <code>fio</code>.</p><h2 id="编译准备"><a href="#编译准备" class="headerlink" title="编译准备"></a><strong>Build Preparation</strong></h2><p>First, download the source from the official <code>fio</code> repository at <code>https://github.com/axboe/fio</code>:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">git <span class="built_in">clone</span> https://github.com/axboe/fio</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>If the NDK is not installed locally, download it from <a href="https://developer.android.google.cn/ndk/downloads?hl=zh-cn">https://developer.android.google.cn/ndk/downloads?hl=zh-cn</a>, then set the NDK environment variable:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="built_in">export</span> NDK_HOME=/path/to/ndk</span><br><span class="line"></span><br></pre></td></tr></table></figure><h2 id="编译代码"><a href="#编译代码" class="headerlink" title="编译代码"></a><strong>Building the Code</strong></h2><p>Before building, run <code>./configure --help</code> to see which build options are available:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span 
class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br></pre></td><td class="code"><pre><span class="line">./configure --<span class="built_in">help</span></span><br><span class="line">--prefix=               Use this directory as installation prefix</span><br><span class="line">--cpu=                  Specify target CPU <span class="keyword">if</span> auto-detect fails</span><br><span class="line">--cc=                   Specify compiler to use</span><br><span class="line">--extra-cflags=         Specify extra CFLAGS to pass to compiler</span><br><span class="line">--build-32bit-win       Enable 32-bit build on Windows</span><br><span class="line">--target-win-ver=       Minimum version of Windows to target (only accepts 7)</span><br><span class="line">--enable-pdb            Enable Windows PDB symbols 
generation (needs clang/lld)</span><br><span class="line">--build-static          Build a static fio</span><br><span class="line">--esx                   Configure build options <span class="keyword">for</span> esx</span><br><span class="line">--enable-gfio           Enable building of gtk gfio</span><br><span class="line">--disable-numa          Disable libnuma even <span class="keyword">if</span> found</span><br><span class="line">--disable-rdma          Disable RDMA support even <span class="keyword">if</span> found</span><br><span class="line">--disable-rados         Disable Rados support even <span class="keyword">if</span> found</span><br><span class="line">--disable-rbd           Disable Rados Block Device even <span class="keyword">if</span> found</span><br><span class="line">--disable-http          Disable HTTP support even <span class="keyword">if</span> found</span><br><span class="line">--disable-gfapi         Disable gfapi</span><br><span class="line">--enable-libhdfs        Enable hdfs support</span><br><span class="line">--enable-libnfs         Enable nfs support</span><br><span class="line">--disable-libnfs        Disable nfs support</span><br><span class="line">--disable-lex           Disable use of lex/yacc <span class="keyword">for</span> math</span><br><span class="line">--disable-pmem          Disable pmem based engines even <span class="keyword">if</span> found</span><br><span class="line">--enable-lex            Enable use of lex/yacc <span class="keyword">for</span> math</span><br><span class="line">--disable-shm           Disable SHM support</span><br><span class="line">--disable-optimizations Don<span class="string">&#x27;t enable compiler optimizations</span></span><br><span class="line"><span class="string">--enable-cuda           Enable GPUDirect RDMA support</span></span><br><span class="line"><span class="string">--enable-libcufile      Enable GPUDirect Storage cuFile support</span></span><br><span class="line"><span 
class="string">--disable-native        Don&#x27;</span>t build <span class="keyword">for</span> native host</span><br><span class="line">--with-ime=             Install path <span class="keyword">for</span> DDN<span class="string">&#x27;s Infinite Memory Engine</span></span><br><span class="line"><span class="string">--enable-libiscsi       Enable iscsi support</span></span><br><span class="line"><span class="string">--enable-libnbd         Enable libnbd (NBD engine) support</span></span><br><span class="line"><span class="string">--disable-xnvme         Disable xnvme support even if found</span></span><br><span class="line"><span class="string">--disable-isal          Disable isal support even if found</span></span><br><span class="line"><span class="string">--disable-libblkio      Disable libblkio support even if found</span></span><br><span class="line"><span class="string">--disable-libzbc        Disable libzbc even if found</span></span><br><span class="line"><span class="string">--disable-tcmalloc      Disable tcmalloc support</span></span><br><span class="line"><span class="string">--dynamic-libengines    Lib-based ioengines as dynamic libraries</span></span><br><span class="line"><span class="string">--disable-dfs           Disable DAOS File System support even if found</span></span><br><span class="line"><span class="string">--enable-asan           Enable address sanitizer</span></span><br><span class="line"><span class="string">--seed-buckets=         Number of seed buckets for the refill-buffer</span></span><br><span class="line"><span class="string">--disable-tls           Disable __thread local storage</span></span><br><span class="line"><span class="string"></span></span><br></pre></td></tr></table></figure><p>Which options to use depends on what the build needs. To make building easier, we write a simple build script:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span 
class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#!/bin/bash</span></span><br><span class="line"></span><br><span class="line">UNAME=Android</span><br><span class="line">ARCH=arm64</span><br><span class="line">CPU=aarch64</span><br><span class="line">API=26</span><br><span class="line">PREFIX=$(<span class="built_in">pwd</span>)/Android/<span class="variable">$CPU</span></span><br><span class="line">CROSS_COMPILE=<span class="variable">$NDK_HOME</span>/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android<span class="variable">$API</span>-</span><br><span class="line">CC=<span class="variable">$NDK_HOME</span>/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android<span class="variable">$API</span>-clang</span><br><span class="line"><span class="comment">#CROSS_PREFIX=$NDK_HOME/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android$API-</span></span><br><span class="line"></span><br><span class="line">./configure --prefix=<span class="variable">$PREFIX</span> --cpu=<span class="variable">$CPU</span> --cc=<span class="variable">$CC</span> --build-static --disable-numa</span><br><span class="line"></span><br><span class="line">make clean</span><br><span class="line">make V=1 UNAME=<span class="variable">$UNAME</span> CROSS_COMPILE=<span class="variable">$CROSS_COMPILE</span></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Running this script fails at the configure stage: the configure script looks for a <code>gcc</code> cross compiler, whereas the NDK toolchain only ships <code>clang</code>, so the configuration has to be adjusted:</p><ul><li>Because current NDKs use <code>clang</code> while the <code>fio</code> configure script defaults to a <code>gcc</code> compiler, modify the 
<code>configure</code> file:</li></ul><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">diff --git a/configure b/configure</span><br><span class="line">index 420d97db..c245bcd9 100755</span><br><span class="line">--- a/configure</span><br><span class="line">+++ b/configure</span><br><span class="line">@@ -350,7 +350,7 @@ <span class="keyword">if</span> <span class="built_in">test</span> -z <span class="string">&quot;<span class="variable">$&#123;CC&#125;</span><span class="variable">$&#123;cross_prefix&#125;</span>&quot;</span>; <span class="keyword">then</span></span><br><span class="line">     cc=clang</span><br><span class="line">   <span class="keyword">fi</span></span><br><span class="line"> <span class="keyword">else</span></span><br><span class="line">-  cc=<span class="string">&quot;<span class="variable">$&#123;CC-<span class="variable">$&#123;cross_prefix&#125;</span>gcc&#125;</span>&quot;</span></span><br><span class="line">+  cc=<span class="string">&quot;<span class="variable">$&#123;CC-<span class="variable">$&#123;cross_prefix&#125;</span>clang&#125;</span>&quot;</span></span><br><span class="line"> <span class="keyword">fi</span></span><br></pre></td></tr></table></figure><p>After this change, the build still reports an error:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span 
class="line">9</span><br></pre></td><td class="code"><pre><span class="line">In file included from engines/io_uring.c:29:</span><br><span class="line">engines/nvme.h:18:8: error: redefinition of <span class="string">&#x27;nvme_uring_cmd&#x27;</span></span><br><span class="line">struct nvme_uring_cmd &#123;</span><br><span class="line">       ^</span><br><span class="line">/home/jason/Android/Sdk/ndk/26.1.10909125/toolchains/llvm/prebuilt/linux-x86_64/bin/../sysroot/usr/include/linux/nvme_ioctl.h:80:8: note: previous definition is here</span><br><span class="line">struct nvme_uring_cmd &#123;</span><br><span class="line">       ^</span><br><span class="line">1 error generated.</span><br><span class="line">make: *** [Makefile:526: engines/io_uring.o] Error 1</span><br></pre></td></tr></table></figure><p>The error is a duplicate definition: <code>engines/nvme.h</code> in the fio source defines <code>struct nvme_uring_cmd</code> again even though the NDK header <code>linux/nvme_ioctl.h</code> already provides it. The local definition is guarded by <code>CONFIG_NVME_URING_CMD</code>, so the fix is once more in <code>configure</code>, which has to handle the Android platform as well. The guarded definition looks like this:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br></pre></td><td class="code"><pre><span class="line">/*</span><br><span class="line"> * If the uapi headers installed on the system lacks nvme uring <span class="built_in">command</span></span><br><span class="line"> * support, use the 
<span class="built_in">local</span> version to prevent compilation issues.</span><br><span class="line"> */</span><br><span class="line"><span class="comment">#ifndef CONFIG_NVME_URING_CMD</span></span><br><span class="line">struct nvme_uring_cmd &#123;</span><br><span class="line">    __u8    opcode;</span><br><span class="line">    __u8    flags;</span><br><span class="line">    __u16   rsvd1;</span><br><span class="line">    __u32   nsid;</span><br><span class="line">    __u32   cdw2;</span><br><span class="line">    __u32   cdw3;</span><br><span class="line">    __u64   metadata;</span><br><span class="line">    __u64   addr;</span><br><span class="line">    __u32   metadata_len;</span><br><span class="line">    __u32   data_len;</span><br><span class="line">    __u32   cdw10;</span><br><span class="line">    __u32   cdw11;</span><br><span class="line">    __u32   cdw12;</span><br><span class="line">    __u32   cdw13;</span><br><span class="line">    __u32   cdw14;</span><br><span class="line">    __u32   cdw15;</span><br><span class="line">    __u32   timeout_ms;</span><br><span class="line">    __u32   rsvd2;</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure><p>Find the place in <code>configure</code> that generates <code>CONFIG_NVME_URING_CMD</code> and extend the check so that it also runs when the target OS is <code>Android</code>:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">@@ -2656,7 +2656,7 @@ <span class="keyword">if</span> <span class="built_in">test</span> <span class="string">&quot;<span class="variable">$libzbc</span>&quot;</span> != <span class="string">&quot;no&quot;</span> ; <span class="keyword">then</span></span><br><span class="line"> <span class="keyword">fi</span></span><br><span class="line"> print_config <span class="string">&quot;libzbc 
engine&quot;</span> <span class="string">&quot;<span class="variable">$libzbc</span>&quot;</span></span><br><span class="line"> </span><br><span class="line">-<span class="keyword">if</span> <span class="built_in">test</span> <span class="string">&quot;<span class="variable">$targetos</span>&quot;</span> = <span class="string">&quot;Linux&quot;</span> ; <span class="keyword">then</span></span><br><span class="line">+<span class="keyword">if</span> <span class="built_in">test</span> <span class="string">&quot;<span class="variable">$targetos</span>&quot;</span> = <span class="string">&quot;Linux&quot;</span> || <span class="built_in">test</span> <span class="string">&quot;<span class="variable">$targetos</span>&quot;</span> = <span class="string">&quot;Android&quot;</span>; <span class="keyword">then</span></span><br><span class="line"> <span class="comment">##########################################</span></span><br><span class="line"> <span class="comment"># Check NVME_URING_CMD support</span></span><br><span class="line"> <span class="built_in">cat</span> &gt; <span class="variable">$TMPC</span> &lt;&lt; <span class="string">EOF</span></span><br><span class="line"><span class="string"> </span></span><br></pre></td></tr></table></figure><p>This time the build succeeds and produces the <code>fio</code> executable. After pushing it to an Android device, it runs normally.</p><h2 id="如何使用fio"><a href="#如何使用fio" class="headerlink" title="如何使用fio"></a>How to Use <code>fio</code></h2><p>A <code>fio</code> command line has two parts: options, and job files that describe the test.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">fio [options] [jobfile] ...</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>The options configure <code>fio</code> itself and its debug output. What actually drives a run is the <code>jobfile</code>, which specifies the test parameters, including:</p><ul><li>I&#x2F;O type: the read&#x2F;write pattern, such as sequential or random reads, and whether the I&#x2F;O is buffered or direct (<code>direct 
I/O</code>)</li><li>Block size: the size of each read or write, such as 4k or 64k</li><li>I&#x2F;O size: the total amount of data to read or write, such as 1G or 10G</li><li>I&#x2F;O engine: how the I&#x2F;O is issued, for example by memory-mapping the file, with regular read&#x2F;write calls, or with asynchronous I&#x2F;O</li><li>I&#x2F;O depth: for asynchronous I&#x2F;O engines, the queue depth to maintain</li><li>Target file or device: the file or device the test runs against</li><li>Threads or processes: how many threads or processes run the test</li></ul><p>For example, to measure the performance of a disk while bypassing the page cache, use direct I&#x2F;O with a command like this:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">fio -direct=1 -iodepth=1 -thread -rw=randwrite -ioengine=psync -bs=16k -size=10G -numjobs=10 -runtime=1000 -group_reporting -name=fio-test</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>When the test finishes, <code>fio</code> prints a report that includes IOPS, <code>I/O</code> bandwidth, and <code>I/O</code> latency:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">Jobs: 10 (f=10): 
[w(10)][85.7%][w=1829MiB/s][w=117k IOPS][eta 00m:01s]</span><br><span class="line">fio-test: (groupid=0, <span class="built_in">jobs</span>=10): err= 0: pid=16991: Sun Mar 31 18:03:08 2024</span><br><span class="line">  write: IOPS=95.3k, BW=1489MiB/s (1562MB/s)(10.0GiB/6875msec); 0 zone resets</span><br><span class="line">    clat (usec): min=14, max=545565, avg=101.30, stdev=3043.89</span><br><span class="line">     lat (usec): min=14, max=545569, avg=103.13, stdev=3043.92</span><br><span class="line">    clat percentiles (usec):</span><br><span class="line">     |  1.00th=[   24],  5.00th=[   28], 10.00th=[   32], 20.00th=[   38],</span><br><span class="line">     | 30.00th=[   45], 40.00th=[   53], 50.00th=[   58], 60.00th=[   61],</span><br><span class="line">     | 70.00th=[   64], 80.00th=[   68], 90.00th=[   74], 95.00th=[   86],</span><br><span class="line">     | 99.00th=[  668], 99.50th=[ 2278], 99.90th=[ 4817], 99.95th=[ 5342],</span><br><span class="line">     | 99.99th=[ 6587]</span><br><span class="line">   bw (  MiB/s): min=    0, max= 1947, per=98.94%, avg=1473.64, stdev=65.83, samples=130</span><br><span class="line">   iops        : min=   20, max=124616, avg=94312.77, stdev=4213.24, samples=130</span><br><span class="line">  lat (usec)   : 20=0.12%, 50=35.96%, 100=60.59%, 250=2.20%, 500=0.09%</span><br><span class="line">  lat (usec)   : 750=0.06%, 1000=0.16%</span><br><span class="line">  lat (msec)   : 2=0.20%, 4=0.40%, 10=0.22%, 20=0.01%, 250=0.01%</span><br><span class="line">  lat (msec)   : 750=0.01%</span><br><span class="line">  cpu          : usr=4.98%, sys=37.83%, ctx=664271, majf=0, minf=0</span><br><span class="line">  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, &gt;=64=0.0%</span><br><span class="line">     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, &gt;=64=0.0%</span><br><span class="line">     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, &gt;=64=0.0%</span><br><span 
class="line">     issued rwts: total=0,655360,0,0 short=0,0,0,0 dropped=0,0,0,0</span><br><span class="line">     latency   : target=0, window=0, percentile=100.00%, depth=1</span><br><span class="line"></span><br><span class="line">Run status group 0 (all <span class="built_in">jobs</span>):</span><br><span class="line">  WRITE: bw=1489MiB/s (1562MB/s), 1489MiB/s-1489MiB/s (1562MB/s-1562MB/s), io=10.0GiB (10.7GB), run=6875-6875msec</span><br><span class="line"></span><br><span class="line">Disk stats (<span class="built_in">read</span>/write):</span><br><span class="line">  nvme0n1: ios=0/628636, sectors=0/20133840, merge=0/2245, ticks=0/40603, in_queue=41130, util=96.51%</span><br><span class="line"></span><br></pre></td></tr></table></figure><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a><strong>References</strong></h2><ul><li><a href="https://fio.readthedocs.io/en/latest/fio_doc.html">https://fio.readthedocs.io/en/latest/fio_doc.html</a></li></ul>]]></content>
    
    
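    <summary type="html">&lt;p&gt;fio is a widely used and powerful disk benchmarking tool. It can measure raw disk performance and can also replay recorded &lt;code&gt;I/O&lt;/code&gt; to simulate real user requests. Its main features are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Works with many file systems, including NTFS, ext4, btrfs, and xfs&lt;/li&gt;
&lt;li&gt;Supports many I/O patterns, including read, write, randread, randwrite, randrw, and trim&lt;/li&gt;
&lt;/ul&gt;</summary>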
    
    
    
    <category term="Android" scheme="https://sniffer.site/categories/Android/"/>
    
    
    <category term="fio" scheme="https://sniffer.site/tags/fio/"/>
    
    <category term="文件系统" scheme="https://sniffer.site/tags/%E6%96%87%E4%BB%B6%E7%B3%BB%E7%BB%9F/"/>
    
    <category term="性能工具" scheme="https://sniffer.site/tags/%E6%80%A7%E8%83%BD%E5%B7%A5%E5%85%B7/"/>
    
  </entry>
  
  <entry>
    <title>Hello, 2024</title>
    <link href="https://sniffer.site/2024/01/30/%E4%BD%A0%E5%A5%BD,2024/"/>
    <id>https://sniffer.site/2024/01/30/%E4%BD%A0%E5%A5%BD,2024/</id>
    <published>2024-01-29T23:08:50.000Z</published>
    <updated>2024-02-26T10:10:42.901Z</updated>
    
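    <content type="html"><![CDATA[<blockquote><p>Nothing is easier than self-deceit. For what each man wishes, that he also believes to be true.</p><p>  Demosthenes (ancient Greece)</p></blockquote><p><img src="https://md-files.oss-cn-shenzhen.aliyuncs.com/new-year-fireworks.jpg" alt="new year fireworks"></p><span id="more"></span><p>In the blink of an eye, the first month of 2024 is already gone. I kept meaning to write a summary of 2023, but one thing after another kept me from starting. So much happened in 2023 that I want to say something, yet hardly know where to begin. Looking back, though, it all remains vivid:</p><ul><li>My mother began suffering from persistent headaches this year. During the National Day holiday I went back to my hometown and took her to the hospital. Fortunately the examination showed a benign hemangioma, nothing serious, and with rest and care she should be fine. This made me truly feel the pressure and responsibility of having aging parents above and a young child below.</li><li>My little one started kindergarten, a brand-new stage. Overall she has adapted well, and many of my earlier worries proved unnecessary. While accompanying her as she grows, I have gradually realized that many problems that show up in children do not come from the child herself. At root they come from the family, from the parents, from mistaken beliefs. If I try more often to think from my daughter's point of view and re-examine things, many so-called problems simply vanish. The root of the conflict between parents and children is an imbalance of power: parents want to use their so-called authority and physical strength to suppress certain behavior, which only makes both sides more miserable. Real love is first acceptance, then understanding and tolerance. I love my daughter, so the first thing I need to do is accept her personality and her so-called <code>faults</code>, then understand her and accompany her growth with genuine feeling. I need to enter her life and her way of thinking, forget the identities stamped on me, and be a guide rather than a director.</li><li>In the latter part of last year, 左耳朵耗子, a well-known technologist in our industry, died of a sudden heart attack. It was hard to believe. A few days before his death I was still reading his blog posts, and his love of technology came through in every line. I never expected him to be gone so suddenly, and it was heartbreaking. Human life is fragile. In middle age, after years of work and pressure, the body enters a period of high risk for disease. It is a wake-up call for those of us in technology: earning a living matters, but a healthy body matters more.</li><li>After thinking it over for a long time, I left the company where I had worked for many years. I witnessed its winding startup journey, from the launch of its first product to its public listing, which was a rare experience. There is much I am reluctant to leave behind, but for the sake of my career I decided to step out of my technical comfort zone and look for a bigger breakthrough. I wish the company every success in the years ahead.</li><li>I read a good deal more than before, but looking back, the quality of my reading did not improve much. The books that left a deeper impression were &lt;稻盛和夫的哲学&gt;, &lt;哲学之树-通往自我认知的哲学课&gt;, &lt;Thinking, Fast and Slow&gt;, &lt;Steve Jobs&gt;, and &lt;Seven and a Half Lessons About the Brain&gt;;
the rest left only a vague impression. Reading is largely a conversation between the minds of author and reader, and when the two think very differently it can be hard to truly grasp the ideas in a book. To improve the quality of my reading, I need better technique on the one hand, such as taking notes and recording what I think as I read. More importantly, I need to improve my own way of thinking and change my own assumptions before I can really gain anything.</li></ul><p>Life twists and turns. In 2023 I thought mostly about the meaning of being alive and the value of being human. A while ago I read about a teacher in Liaoning named 柏剑 who, for more than ten years, has used physical training to raise a group of troubled youths, and I was deeply moved. Around the same time I saw news of a corrupt official somewhere hoarding huge amounts of cash. Putting the two together, one cannot help asking why people differ so much. At root, these choices reflect a person's values and worldview. In this dazzling world, where do the meaning and value of a life actually lie?</p><p>Seen at the scale of the whole universe, life has no meaning. Every meaning eventually points to emptiness, that is, to death, and after death there is only nothingness. But while we live we have hormones, and with them come desire, curiosity, and the urge to explore. These chemical hormones drive our behavior and shape our individual personalities. Coming back to life itself: if we are merely driven by hormones, by our own consciousness and notions, then in essence we are prisoners of our own thoughts and desires, trapped in the cage of natural evolution. Desire is a shackle. Wealth, money, status, fame, and bodily satisfaction won through effort are all invisible shackles. If we do not reflect, and simply accept reality without thinking, we are only deceiving ourselves. Corrupt officials who grab money at any cost, blinded by greed, end up trapped by money and fame, guards who steal what they were meant to protect. Streamers who chase money through traffic and sales patter, once they lose their inner moral bottom line, only end up locking themselves in a cage.</p><p>Whether we speak of life or of living, what is truly hard is balance, the wisdom of balance: an inner balance in how we perceive reality, neither deceiving ourselves nor giving up on ourselves. Only then can we really step toward wisdom and live happily, lightly, calmly, and peacefully. Facing 2024, I hope that I too can avoid self-deceit and self-abandonment, live each day steadily, and become a better father, a better husband, and a better person. The following points are reminders to myself, shared with everyone.</p><ul><li>Learn to love: not only the people around me but also myself. Learn to feel the love of others and everything that nature gives. Cultivate the capacity to love, stay reflective and disciplined, and be a person of wisdom.</li><li>Discover value, create value: work is not only a way to earn an income but also a place to realize one's own worth. In doing anything, do not look only at short-term gain. Think hard about the value behind it and see the longer-term return. Keep an open view, accumulate bit by bit, and let it pay off in time.</li><li>Read and think: reading brings calm. In a media age flooded with short videos, time is cut into fragments and it has become hard to concentrate on deep thinking. Read more, keep thinking, and stay composed.</li><li>Discover the value of the body: having poured so much energy and time into thinking and cognition, the body is too easily neglected. Only when symptoms of illness appear do we realize what it is worth. In middle age, having seen more of birth, aging, sickness, and death, I have slowly found that the value of the body may be far greater than I imagined. We should pay attention to its signals and reduce what wears it down: staying up late, all-nighters, no exercise, irregular meals. Learn to be a friend to the body and to care for it.</li></ul><p>Goodbye, 2023; hello, 2024!</p>]]></content>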
    
    
    <summary type="html">&lt;blockquote&gt;&lt;p&gt;Nothing is easier than self-deceit. For what each man wishes, that he also believes to be true(人们善于自欺，人们想得到什么，就会相信什么)&lt;/p&gt;
&lt;p&gt;  德摩斯梯尼（Demosthenes，古希腊）&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;img src=&quot;https://md-files.oss-cn-shenzhen.aliyuncs.com/new-year-fireworks.jpg&quot; alt=&quot;new year fireworks&quot;&gt;&lt;/p&gt;</summary>
    
    
    
    <category term="思考" scheme="https://sniffer.site/categories/%E6%80%9D%E8%80%83/"/>
    
    
    <category term="成长" scheme="https://sniffer.site/tags/%E6%88%90%E9%95%BF/"/>
    
    <category term="价值" scheme="https://sniffer.site/tags/%E4%BB%B7%E5%80%BC/"/>
    
    <category term="探索" scheme="https://sniffer.site/tags/%E6%8E%A2%E7%B4%A2/"/>
    
  </entry>
  
  <entry>
    <title>Getting to Know Your Brain</title>
    <link href="https://sniffer.site/2023/11/26/%E8%AE%A4%E8%AF%86%E4%BD%A0%E7%9A%84%E5%A4%A7%E8%84%91/"/>
    <id>https://sniffer.site/2023/11/26/%E8%AE%A4%E8%AF%86%E4%BD%A0%E7%9A%84%E5%A4%A7%E8%84%91/</id>
    <published>2023-11-26T04:37:04.000Z</published>
    <updated>2023-11-28T01:17:22.658Z</updated>
    
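    <content type="html"><![CDATA[<blockquote><p>If we spoke a different language, we would perceive a somewhat different world.</p><pre><code>Ludwig Wittgenstein</code></pre></blockquote><p> <img src="https://md-files.oss-cn-shenzhen.aliyuncs.com/high-tech-brain.jpg" alt="Your high tech brain"></p> <span id="more"></span><p> I have long been interested in how the brain is built and how it works, and wanted to spend some time on the mysteries of human thought and consciousness. A while ago I borrowed from the library a fairly slim book about the brain by <code>Lisa Feldman Barrett</code>, <code>Seven and a Half Lessons About the Brain</code> (Chinese edition: 认识大脑-关于大脑的7 1/2堂课). It is accessible and cleared up some of my misconceptions and doubts. Based on the book, combined with my own thinking and experience over the past few years, I have some thoughts that I write down here. Much of it is rather shallow; where I am wrong, please correct me.</p><blockquote><p>Every thought we have, every feeling of happiness, anger, or awe, every hug we give or receive, every kindness we extend, and every insult we bear is like a deposit into or a withdrawal from our body budget. We are usually unaware of it, yet it is happening all the time. Recognizing this is essential to understanding how the brain works, and how to stay healthy and live a longer, more meaningful life.</p></blockquote><p>The brain did not arise for thinking; it is a complex computing system that evolved for survival. Several hundred million years ago there were no creatures with brains on Earth. The oceans of the time held amphioxus (lancelets), invertebrates with a very simple nervous system, just a small clump of cells that cannot really be called a brain. With no organs of taste or smell, amphioxus has survived on the sea floor for nearly 550 million years on simple sensing alone. So why did creatures with brains appear? Evolutionary theory attributes it to competition for survival and natural selection. In fierce competition, organisms with faster reactions, stronger senses, and higher energy efficiency were better equipped to survive and had a greater chance of doing so; the brain is precisely the computing center that regulates an organism's internal energy efficiency. The same holds for humans: the brain is first of all an organ evolved for survival. The human brain budgets the energy each action will consume and reserves water, salt, glucose, and other resources for it: every action you take is a choice with an economic aspect, as the brain predicts when to spend resources and when to save them. All animals, humans included, have learned from past behavior how to predict the next action. If a past behavior brought a benefit, such as a successful escape or a good meal, they will repeat it; the brain somehow recalls the past experience and prepares the body for action.</p><blockquote><p>Plato held that the human mind is a constant battle among three forces: the first is survival, representing hunger and bodily desire; the second is emotion, such as joy, anger, and fear; the third is rational thought. Later, scientists built on this kind of view to develop the triune brain theory, which holds that the brain has three layers: one for survival (the brainstem), one for feeling (the limbic system), and one for thinking (the neocortex).</p></blockquote><p><img src="https://md-files.oss-cn-shenzhen.aliyuncs.com/three-brain-theory.png" alt="Triune brain structure"></p><p>In sheer size, the human brain is indeed much larger than that of most animals such as rats, fish, lizards, and macaques, and its structure also seems somewhat different. But recent research has found that the brains of all mammals develop according to a single plan (and those of reptiles and other vertebrates may follow a similar pattern): the order in which mammalian brain neurons form is predictable. The difference is only that different animals spend different amounts of time in each stage of construction, so the parts of the brain end up different sizes. That is why humans have a large neocortex while a rat's is relatively small. From this perspective the human brain has no newly added parts: the neurons in our brains can be found not only in other mammals but also in other vertebrates. There is nothing special about the human brain; humans are simply an interesting animal with particular adaptations that help us survive and compete in particular environments. Other creatures in nature do no worse than humans: birds can fly, and bacteria can survive in harsh environments.</p><p>So what exactly is the rationality we so often talk about? Traditionally, rational behavior means behavior undisturbed by emotion: thinking is considered rational and emotion irrational. But that is not the case. Thinking is not always rational. For example, after scrolling Douyin for hours on end, you console yourself that you discovered many interesting things. The author therefore defines rationality in terms of body budgeting, the management of the water, salt, glucose, and other bodily resources we need each day: being rational means making a good body-budget investment in a given situation.</p><blockquote><p>The brain is a complex network of as many as 128 billion neurons, which connect to one another through dendrites and pass signals across synapses. This intricate network gives the brain a high degree of complexity, along with greater fault tolerance and resistance to damage; it is also the source of human creativity. Unlike many other animals, a newborn human has about twice as many neurons as an adult brain, but only part of the wiring is complete. Over as long as 25 years afterwards, the connections are continually refined and built: new connections are created and tuned, and surplus ones are <code>pruned</code> away.</p></blockquote><p>The abilities that emerge from the complexity of the brain's neural network remind me of <code>chatGPT</code>, the recent sensation in artificial intelligence: the potential shown by a neuron-like network once its data and parameters grow large enough is astonishing. Originally, humans might have needed to analyze the whole construction of the brain in the forward direction to unlock the secret of intelligence; now, with large-model technology like <code>chatGPT</code>, we may be able to crack the nature of intelligence from another angle. But until true artificial general intelligence appears, we should still think hard about what gives a person their uniqueness. Put differently, what should we do so that a baby can grow up smarter, more creative, more responsible, and more moral? A baby's brain wiring is unfinished at birth, so in theory the possibilities are unlimited. Each of us should take the care and protection of children seriously, and society as a whole should work to create an environment in which children grow up healthy. The cost is small and the return far larger: a healthy, wise, and responsible citizen brings dependable value to the future society, whereas a child raised in misfortune and poverty may grow into a cynical, idle adult with no sense of responsibility or morality, which only raises society's costs.</p><blockquote><p>The brain is a predictive organ: it uses past experience to judge what it currently perceives and so makes predictions; before we perform any action, the brain has in fact already predicted it. What shapes the wiring of our neurons? Besides innate genes, family, education, and social culture all subtly shape the circuits of a person's brain. It is the diversity of neural connections that has produced the richness of human civilization, given each individual a distinct personality, and endowed us with an ability other animals lack: creating social reality. Through creativity, communication, imitation, cooperation, and compression (summarizing information and removing redundancy), the brain builds the social culture unique to humans and makes us a social species.</p></blockquote><p>What shapes our brain (our mind)? What forces lead a person to form ideas and views about the self, others, and society? Faced with the same event, each person has their own way of thinking; where does the drive behind our daily behavior come from? If the brain's wiring has already been shaped by past experience and we cannot change the past, does that mean we cannot change the present either, and lack the will and power to change ourselves? In fact, no: we can change ourselves. We cannot change our family, the people around us, or social reality, but through effort we can change how the brain is wired and how it predicts its responses to the outside world. For instance, you may be nervous speaking in public; with enough practice and feedback you can acquire the skill and adapt, until you can speak before a crowd at ease. This is the brain's potential. As the owners of our brains, if we learn to use this power of control and take responsibility for change, then with enough effort and learning we can reach the goals we set. The precondition for change is to let go of the cognition and thinking patterns produced by the brain's existing wiring, look at present reality from a different angle, and thereby adjust the brain's predictions about habitual behavior, change ourselves, and shape new neural connections.</p><p>But in a social reality shaped by brains wired in every kind of way, we must face not only ourselves but also other people and social barriers of every form. We long to change ourselves, yet we are constrained by our social relationships; to change ourselves we have to cooperate with others and build collaborative relationships. As the saying goes, <code>other people are heaven, and other people are hell</code>: building good, healthy relationships with others (family, colleagues, friends) helps us reshape ourselves, gives us the drive to grow, and lets us live happily and contentedly; conversely, others can become a hell that devours the self and drags us into a swamp of struggle and entanglement.</p>]]></content>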
    
    
    <summary type="html">&lt;blockquote&gt;&lt;p&gt;If we spoke a different language, we would perceive a somewhat differenet world.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Ludwig Wittgenstein
&lt;/code&gt;&lt;/pre&gt;
&lt;/blockquote&gt;

&lt;p&gt; &lt;img src=&quot;https://md-files.oss-cn-shenzhen.aliyuncs.com/high-tech-brain.jpg&quot; alt=&quot;Your high tech brain&quot;&gt;&lt;/p&gt;</summary>
    
    
    
    <category term="社会万象" scheme="https://sniffer.site/categories/%E7%A4%BE%E4%BC%9A%E4%B8%87%E8%B1%A1/"/>
    
    
    <category term="哲学" scheme="https://sniffer.site/tags/%E5%93%B2%E5%AD%A6/"/>
    
    <category term="大脑" scheme="https://sniffer.site/tags/%E5%A4%A7%E8%84%91/"/>
    
    <category term="人生" scheme="https://sniffer.site/tags/%E4%BA%BA%E7%94%9F/"/>
    
  </entry>
  
  <entry>
    <title>Common Android Performance Analysis Tools</title>
    <link href="https://sniffer.site/2023/10/20/Android%E5%B8%B8%E7%94%A8%E7%9A%84%E6%80%A7%E8%83%BD%E5%88%86%E6%9E%90%E5%B7%A5%E5%85%B7%E4%BD%BF%E7%94%A8%E7%AE%80%E4%BB%8B/"/>
    <id>https://sniffer.site/2023/10/20/Android%E5%B8%B8%E7%94%A8%E7%9A%84%E6%80%A7%E8%83%BD%E5%88%86%E6%9E%90%E5%B7%A5%E5%85%B7%E4%BD%BF%E7%94%A8%E7%AE%80%E4%BB%8B/</id>
    <published>2023-10-20T03:45:04.000Z</published>
    <updated>2025-06-26T01:40:38.524Z</updated>
    
    <content type="html"><![CDATA[<p>During Android system development we often run into performance problems such as high CPU usage, memory leaks, and high memory consumption. In these cases we usually need to capture a system <code>trace</code> to inspect per-process CPU usage, memory allocation, and so on, which calls for the system's performance analysis tools. Using <code>Android S</code> as the example, this article explains how to use several performance tools common in Android development:</p><span id="more"></span><ul><li>atrace</li><li>systrace</li><li>dumpsys</li><li>simpleperf</li><li>perfetto</li><li>Android Profiler</li></ul><h2 id="atrace"><a href="#atrace" class="headerlink" title="atrace"></a><strong>atrace</strong></h2><p><code>atrace</code> is a tool built into <code>Android</code> for capturing <code>systrace</code> data. It can capture the state of system services such as <code>input</code>, <code>SurfaceFlinger</code>, and <code>Window Manager</code>, as well as kernel <code>trace</code> events such as <code>CPU</code> scheduling, <code>irq</code> interrupts, and memory. To see which <code>trace</code> categories are supported, run <code>adb shell atrace --list_categories</code>.</p><blockquote><p>If you are interested in how <code>atrace</code> is implemented, see the code under <code>frameworks/native/cmds/atrace</code>.</p></blockquote><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">usage: atrace [options] [categories...]</span><br><span class="line">options 
include:</span><br><span class="line">-a appname      <span class="built_in">enable</span> app-level tracing <span class="keyword">for</span> a comma separated list of cmdlines; * is a wildcard matching any process</span><br><span class="line">-b N            use a trace buffer size of N KB</span><br><span class="line">-c              trace into a circular buffer</span><br><span class="line">-f filename     use the categories written <span class="keyword">in</span> a file as space-separated</span><br><span class="line">                    values <span class="keyword">in</span> a line</span><br><span class="line">-k fname,...    trace the listed kernel <span class="built_in">functions</span></span><br><span class="line">-n              ignore signals</span><br><span class="line">-s N            <span class="built_in">sleep</span> <span class="keyword">for</span> N seconds before tracing [default 0]</span><br><span class="line">-t N            trace <span class="keyword">for</span> N seconds [default 5]</span><br><span class="line">-z              compress the trace dump</span><br><span class="line">--async_start   start circular trace and <span class="built_in">return</span> immediately</span><br><span class="line">--async_dump    dump the current contents of circular trace buffer</span><br><span class="line">--async_stop    stop tracing and dump the current contents of circular</span><br><span class="line">                    trace buffer</span><br><span class="line">--stream        stream trace to stdout as it enters the trace buffer</span><br><span class="line">                    Note: this can take significant CPU time, and is best</span><br><span class="line">                    used <span class="keyword">for</span> measuring things that are not affected by</span><br><span class="line">                    CPU performance, like pagecache usage.</span><br><span class="line">--list_categories</span><br><span class="line">                list the available tracing 
categories</span><br><span class="line">-o filename      write the trace to the specified file instead</span><br><span class="line">                    of stdout.</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>For example, to look at <code>SurfaceFlinger</code> (category <code>gfx</code>), run the following on the device (by default only 5 seconds are recorded):</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">adb shell</span><br><span class="line"></span><br><span class="line">atrace gfx irq <span class="built_in">sched</span> &gt; /data/gfx.trace</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Pull the captured <code>gfx.trace</code> file to your local machine and open it with <a href="https://ui.perfetto.dev/"><code>perfetto</code></a> to view it.</p><h2 id="systrace"><a href="#systrace" class="headerlink" title="systrace"></a><strong>systrace</strong></h2><p>Like <code>atrace</code>, <code>systrace</code> is a tool for capturing system <code>trace</code> data (it works on Android 4.3 and later). The difference is that <code>systrace</code> uses a <code>python</code> script to convert the captured data into a visual <code>html</code> report, which you can browse by entering <code>chrome://tracing/</code> in the <code>chrome</code> browser and loading the <code>html</code> file.</p><blockquote><p><code>systrace.py</code> can be downloaded from <code>google</code>'s website or obtained from the <code>SDK</code>; in the <code>SDK</code>, the script is under <code>platform-tools/systrace</code>.</p></blockquote><p>Run <code>python2.7 systrace.py -h</code> to see the command usage:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span 
class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">Usage: systrace.py [options] [category1 [category2 ...]]</span><br><span class="line"></span><br><span class="line">Example: systrace.py -b 32768 -t 15 gfx input view <span class="built_in">sched</span> freq</span><br><span class="line"></span><br><span class="line">Options:</span><br><span class="line">  -h, --<span class="built_in">help</span>            show this <span class="built_in">help</span> message and <span class="built_in">exit</span></span><br><span class="line">  -o FILE               write HTML to FILE</span><br><span class="line">  -t N, --time=N        trace <span class="keyword">for</span> N seconds</span><br><span class="line">  -b N, --buf-size=N    use a trace buffer size of N KB</span><br><span class="line">  -k KFUNCS, --ktrace=KFUNCS</span><br><span class="line">                        specify a comma-separated list of kernel <span class="built_in">functions</span> to</span><br><span class="line">                        trace</span><br><span class="line">  -l, --list-categories</span><br><span class="line">                        list the available categories and <span class="built_in">exit</span></span><br><span class="line">  -a APP_NAME, --app=APP_NAME</span><br><span class="line">                        <span class="built_in">enable</span> application-level tracing <span class="keyword">for</span> comma-separated</span><br><span 
class="line">                        list of app cmdlines</span><br><span class="line">  --link-assets         <span class="built_in">link</span> to original CSS or JS resources instead of</span><br><span class="line">                        embedding them</span><br><span class="line">  --from-file=FROM_FILE</span><br><span class="line">                        <span class="built_in">read</span> the trace from a file (compressed) rather than</span><br><span class="line">                        running a live trace</span><br><span class="line">  --asset-dir=ASSET_DIR</span><br><span class="line">  -e DEVICE_SERIAL, --serial=DEVICE_SERIAL</span><br><span class="line">                        adb device serial number</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>For example, the following command captures graphics, input, and scheduling-related <code>systrace</code> data:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">python2.7 systrace.py -b 32768 -t 15 gfx input view sched freq</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>With <code>-t 15</code>, this captures 15 seconds of trace and saves it to an HTML file named <code>trace.html</code>. You can view this file by entering <code>chrome://tracing/</code> in the <code>Chrome</code> browser, or browse and analyze it with <a href="https://ui.perfetto.dev/"><code>perfetto</code></a>.</p><p>For more usage details, see <a href="https://stuff.mit.edu/afs/sipb/project/android/docs/tools/help/systrace.html">systrace usage</a>.</p><h2 id="dumpsys"><a href="#dumpsys" class="headerlink" title="dumpsys"></a><strong>dumpsys</strong></h2><p><code>dumpsys</code> is a tool built into <code>Android</code>. Besides <code>dump</code>ing the state of the services registered in the system, it can also report per-process CPU usage, memory allocation, and more, which can help when analyzing system issues:</p><figure class="highlight 
bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">usage: dumpsys</span><br><span class="line">         To dump all services.</span><br><span class="line">or:</span><br><span class="line">       dumpsys [-t TIMEOUT] [--priority LEVEL] [--pid] [--thread] [--<span class="built_in">help</span> | -l | --skip SERVICES | SERVICE [ARGS]]</span><br><span class="line">         --<span class="built_in">help</span>: shows this <span class="built_in">help</span></span><br><span class="line">         -l: only list services, <span class="keyword">do</span> not dump them</span><br><span class="line">         -t TIMEOUT_SEC: TIMEOUT to use <span class="keyword">in</span> seconds instead of default 10 seconds</span><br><span class="line">         -T TIMEOUT_MS: TIMEOUT to use <span class="keyword">in</span> milliseconds instead of default 10 seconds</span><br><span class="line">         --pid: dump PID instead of usual dump</span><br><span class="line">         --thread: dump thread usage instead of usual dump</span><br><span class="line">         --proto: filter services that support dumping data <span class="keyword">in</span> proto format. 
Dumps</span><br><span class="line">               will be <span class="keyword">in</span> proto format.</span><br><span class="line">         --priority LEVEL: filter services based on specified priority</span><br><span class="line">               LEVEL must be one of CRITICAL | HIGH | NORMAL</span><br><span class="line">         --skip SERVICES: dumps all services but SERVICES (comma-separated list)</span><br><span class="line">         SERVICE [ARGS]: dumps only service SERVICE, optionally passing ARGS to it</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>If you only need to <code>dump</code> some services, first run <code>dumpsys -l</code> to list the current services, then save the <code>dump</code> output of the ones you want:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">dumpsys SurfaceFlinger</span><br><span class="line"></span><br><span class="line">dumpsys cpuinfo</span><br><span class="line"></span><br><span class="line">dumpsys meminfo</span><br><span class="line"></span><br></pre></td></tr></table></figure><h2 id="simpleperf"><a href="#simpleperf" class="headerlink" title="simpleperf"></a><strong>simpleperf</strong></h2><p><code>simpleperf</code> is Android's native profiling tool for collecting performance data from apps or the system. It is really a collection of tools: the <code>simpleperf</code> executable that runs directly on the Android device, plus a set of scripts that run on a PC, including <code>inferno.py</code> for generating flame graphs and <code>app_profile.py</code> for profiling apps (including native processes). All of these, with their source code, live in the AOSP tree under <code>system/extras/simpleperf</code>.</p><p>First, let us see how to capture profiling data with <code>simpleperf</code> on the Android device itself (if the device has no <code>simpleperf</code>, take a prebuilt binary from the <code>simpleperf</code> directory or build one yourself and push it to the device). Run <code>simpleperf -h</code> to see the usage:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span 
class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">Usage: simpleperf [common options] subcommand [args_for_subcommand]</span><br><span class="line">common options:</span><br><span class="line">    -h/--help     Print this <span class="built_in">help</span> information.</span><br><span class="line">    --<span class="built_in">log</span> &lt;severity&gt; Set the minimum severity of logging. 
Possible severities</span><br><span class="line">                     include verbose, debug, warning, info, error, fatal.</span><br><span class="line">                     Default is info.</span><br><span class="line">    --log-to-android-buffer  Write <span class="built_in">log</span> to android <span class="built_in">log</span> buffer instead of stderr.</span><br><span class="line">    --version     Print version of simpleperf.</span><br><span class="line">subcommands:</span><br><span class="line">    api-collect         Collect recording data generated by app api</span><br><span class="line">    api-prepare         Prepare recording via app api</span><br><span class="line">    debug-unwind        Debug/test offline unwinding.</span><br><span class="line">    dump                dump perf record file</span><br><span class="line">    <span class="built_in">help</span>                <span class="built_in">print</span> <span class="built_in">help</span> information <span class="keyword">for</span> simpleperf</span><br><span class="line">    inject              parse etm instruction tracing data</span><br><span class="line">    kmem                collect kernel memory allocation information</span><br><span class="line">    list                list available event types</span><br><span class="line">    merge               merge multiple perf.data into one</span><br><span class="line">    monitor             monitor events and <span class="built_in">print</span> their textual representations to stdout</span><br><span class="line">    record              record sampling info <span class="keyword">in</span> perf.data</span><br><span class="line">    report              report sampling information <span class="keyword">in</span> perf.data</span><br><span class="line">    report-sample       report raw sample information <span class="keyword">in</span> perf.data</span><br><span class="line">    <span class="built_in">stat</span>                gather performance counter 
information</span><br><span class="line">    trace-sched         Trace system-wide process runtime events.</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>The two key commands here are <code>simpleperf record</code> and <code>simpleperf report</code>: <code>record</code> captures a process's <code>perf</code> data, and <code>report</code> displays the captured <code>perf.data</code>. For example, the following commands profile a process:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment"># perf.data will be saved in this directory</span></span><br><span class="line"><span class="built_in">cd</span> /data</span><br><span class="line"></span><br><span class="line">simpleperf record -p 1047 --duration 10</span><br><span class="line"></span><br><span class="line">simpleperf report</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Besides viewing <code>perf.data</code> on the device, you can copy it to a PC and view the results as a web page with the <code>report_html.py</code> script under <code>scripts</code>. Just run <code>./report_html.py -i perf.data</code>; the script parses the data and opens an HTML report in the browser.</p><p><code>simpleperf</code> also ships a handy tool, <code>inferno.sh</code>, which generates a flame graph with a single command:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">./inferno.sh --pid 2481 --title system</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>It automatically generates a <a href="https://www.brendangregg.com/flamegraphs.html">flame graph</a> of the process's call chains, and you can then inspect each thread's call stacks in the browser.</p><p><img src="https://sniffer-site.oss-cn-shenzhen.aliyuncs.com/simpleperf-flamegraph.png" 
alt="flamegraph"></p><p>We can also use <code>simpleperf</code> to collect CPU usage and instruction-execution statistics for a process:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment"># Stat using default events (cpu-cycles,instructions,...), and monitor process 7394 for 10 seconds.</span></span><br><span class="line">$ simpleperf <span class="built_in">stat</span> -p 7394 --duration 10</span><br><span class="line">Performance counter statistics:</span><br><span class="line"></span><br><span class="line"><span class="comment">#         count  event_name                # count / runtime</span></span><br><span class="line">     16,513,564  cpu-cycles                <span class="comment"># 1.612904 GHz</span></span><br><span class="line">      4,564,133  stalled-cycles-frontend   <span class="comment"># 341.490 M/sec</span></span><br><span class="line">      6,520,383  stalled-cycles-backend    <span class="comment"># 591.666 M/sec</span></span><br><span class="line">      4,900,403  instructions              <span class="comment"># 612.859 M/sec</span></span><br><span class="line">         47,821  branch-misses             <span class="comment"># 6.085 M/sec</span></span><br><span class="line">  25.274251(ms)  task-clock                <span class="comment"># 0.002520 cpus used</span></span><br><span class="line">              4  context-switches          
<span class="comment"># 158.264 /sec</span></span><br><span class="line">            466  page-faults               <span class="comment"># 18.438 K/sec</span></span><br><span class="line"></span><br><span class="line">Total <span class="built_in">test</span> time: 10.027923 seconds.</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>For more details, see <a href="https://android.googlesource.com/platform/system/extras/+/master/simpleperf/doc/executable_commands_reference.md">simpleperf usage</a> or the <code>system/extras/simpleperf</code> directory in AOSP.</p><h2 id="perfetto"><a href="#perfetto" class="headerlink" title="perfetto"></a><strong>perfetto</strong></h2><p><a href="https://perfetto.dev/"><code>perfetto</code></a> is Google's open-source tool for system performance analysis and trace capture, a toolchain that combines trace recording, analysis, and a UI. Its data comes mainly from <code>ftrace</code> (kernel information), <code>atrace</code> (trace events from services and apps), and <code>heapprofd</code> (app memory usage). Android 10 and later ship a <code>perfetto</code> executable by default for capturing traces:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span 
class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">Usage: perfetto</span><br><span class="line">  --background     -d      : Exits immediately and continues tracing <span class="keyword">in</span></span><br><span class="line">                            background</span><br><span class="line">  --config         -c      : /path/to/trace/config/file or - <span class="keyword">for</span> stdin</span><br><span class="line">  --out            -o      : /path/to/out/trace/file or - <span class="keyword">for</span> stdout</span><br><span class="line">  --upload                 : Upload field trace (Android only)</span><br><span class="line">  --dropbox        TAG     : DEPRECATED: Use --upload instead</span><br><span class="line">                            TAG should always be <span class="built_in">set</span> to <span class="string">&#x27;perfetto&#x27;</span></span><br><span class="line">  --no-guardrails          : Ignore guardrails triggered when using --upload</span><br><span class="line">                            (<span class="keyword">for</span> testing).</span><br><span class="line">  --txt                    : Parse config as pbtxt. 
Not <span class="keyword">for</span> production use.</span><br><span class="line">                            Not a stable API.</span><br><span class="line">  --reset-guardrails       : Resets the state of the guardails and exits</span><br><span class="line">                            (<span class="keyword">for</span> testing).</span><br><span class="line">  --query                  : Queries the service state and prints it as</span><br><span class="line">                            human-readable text.</span><br><span class="line">  --query-raw              : Like --query, but prints raw proto-encoded bytes</span><br><span class="line">                            of tracing_service_state.proto.</span><br><span class="line">  --save-for-bugreport     : If a trace with bugreport_score &gt; 0 is running, it</span><br><span class="line">                            saves it into a file. Outputs the path when <span class="keyword">done</span>.</span><br><span class="line">  --<span class="built_in">help</span>           -h</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">light configuration flags: (only when NOT using -c/--config)</span><br><span class="line">  --time           -t      : Trace duration N[s,m,h] (default: 10s)</span><br><span class="line">  --buffer         -b      : Ring buffer size N[mb,gb] (default: 32mb)</span><br><span class="line">  --size           -s      : Max file size N[mb,gb] (default: in-memory ring-buffer only)</span><br><span class="line">  --app            -a      : Android (atrace) app name</span><br><span class="line">  ATRACE_CAT               : Record ATRACE_CAT (e.g. wm)</span><br><span class="line">  FTRACE_GROUP/FTRACE_NAME : Record ftrace event (e.g. 
<span class="built_in">sched</span>/sched_switch)</span><br><span class="line"></span><br><span class="line">statsd-specific flags:</span><br><span class="line">  --alert-id           : ID of the alert that triggered this trace.</span><br><span class="line">  --config-id          : ID of the triggering config.</span><br><span class="line">  --config-uid         : UID of app <span class="built_in">which</span> registered the config.</span><br><span class="line">  --subscription-id    : ID of the subscription that triggered this trace.</span><br><span class="line"></span><br><span class="line">Detach mode. DISCOURAGED, <span class="built_in">read</span> https://perfetto.dev/docs/concepts/detached-mode :</span><br><span class="line">  --detach=key          : Detach from the tracing session with the given key.</span><br><span class="line">  --attach=key [--stop] : Re-attach to the session (optionally stop tracing once reattached).</span><br><span class="line">  --is_detached=key     : Check <span class="keyword">if</span> the session can be re-attached (0:Yes, 2:No, 1:Error).</span><br><span class="line"></span><br></pre></td></tr></table></figure><p><code>perfetto</code> has two modes, which determine the data sources used to record a trace:</p><ul><li>Light mode: only a subset of data sources can be selected, namely <code>atrace</code> and <code>ftrace</code>, but it offers an interface similar to <code>systrace</code>.</li><li>Normal mode: reads its configuration from a protocol buffer and lets you use more of <code>perfetto</code> by enabling data sources beyond <code>atrace</code> and <code>ftrace</code>.</li></ul><p>Capturing a <code>Trace</code> in light mode works much like <code>systrace</code>: you only need to specify the <code>APP</code> name:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">perfetto --app &lt;app_name&gt; --time 15s -o /data/ss.trace</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>For normal mode, first write a config file based on the standard <a 
href="https://perfetto.dev/docs/reference/trace-config-proto"><code>TraceConfig</code></a> reference. You can then pass the config to <code>perfetto</code> either in <code>PBTX(ProtoBuf TeXtual representation)</code> form (not recommended in production) or as a <code>Binary</code> <code>protobuf</code> file produced with the <a href="https://github.com/protocolbuffers/protobuf/releases"><code>protoc</code></a> tool:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment"># Push the local sys_stats.cfg to /data/misc/perfetto-configs first</span></span><br><span class="line">perfetto -c /data/misc/perfetto-configs/sys_stats.cfg --txt -o /data/sys-stats.trace</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>The AOSP source directory <code>external/perfetto/test/configs</code> contains many <code>perfetto</code> config files for reference. Besides the <code>perfetto</code> binary built into <code>Android</code>, you can also capture traces with the <a href="https://ui.perfetto.dev/">web-based <code>perfetto</code> UI</a>: click <code>Record new trace</code> and a page opens where you can choose to record <code>CPU</code>, <code>GPU</code>, <code>Memory</code>, and app and service traces:</p><p><img src="https://sniffer-site.oss-cn-shenzhen.aliyuncs.com/perfetto-ui-record-new-traces.png" alt="perfetto-ui usage"></p><h2 id="Android-Profiler"><a href="#Android-Profiler" class="headerlink" title="Android Profiler"></a><strong>Android Profiler</strong></h2><p>In addition to the tools above, <code>Android Studio(AS)</code> provides a <a href="https://developer.android.com/studio/profile?hl=zh-cn">graphical <code>Profiler</code> tool</a> for analyzing the performance of an app or of processes on a device, covering common metrics such as CPU usage, memory allocation, and network usage.</p><p>After opening <code>AS</code>, click the gauge-shaped <code>Profiler</code> icon at the bottom of the window, then click the <code>+</code> button in the <code>SESSIONS</code> pane on the left of the panel that opens. Once you select the process to profile, a view appears showing the process's CPU, memory, and network usage:</p><p><img src="https://sniffer-site.oss-cn-shenzhen.aliyuncs.com/android-profiler-ui.png" alt="android profiler 
ui"></p><p>Clicking <code>CPU</code> in the view shows the CPU usage of each thread in the process. To inspect a thread's call stack, use the <code>Record</code> feature to capture a <code>Trace</code> of <code>Java</code> or <code>C/C++</code> code; when recording finishes, a flame graph is generated for the <code>Trace</code>, which helps analyze how long each thread spends executing. Similarly, the <code>MEMORY</code> tool shows memory allocation in hot code paths and helps diagnose memory leaks and unreasonable memory usage.</p><p>For more on using the Profiler, see the <a href="https://developer.android.com/studio/profile?hl=zh-cn">official Android Profiler documentation</a>.</p><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a>References</h2><ul><li><a href="https://developer.android.com/topic/performance/tracing?hl=zh-cn">https://developer.android.com/topic/performance/tracing?hl=zh-cn</a></li><li><a href="https://developer.android.com/studio/profile?hl=zh-cn">https://developer.android.com/studio/profile?hl=zh-cn</a></li><li><a href="https://cs.android.com/">https://cs.android.com/</a></li><li><a href="https://facebookmicrosites.github.io/psi/docs/overview">https://facebookmicrosites.github.io/psi/docs/overview</a></li></ul>]]></content>
    
    
    <summary type="html">&lt;p&gt;在Android系统开发过程中，经常碰到CPU占用率高、内存泄露、内存占用高等性能相关的问题，这时通常需要抓取系统的&lt;code&gt;trace&lt;/code&gt;日志，用以查看进程的CPU占用，内存分配等情况。怎么抓取系统trace， 这时一般需要用到系统性能相关的分析工具。这篇文章就以&lt;code&gt;Android S&lt;/code&gt;为例，说明Android开发中常用的一些性能优化工具的使用方法，主要包括如下几个工具:&lt;/p&gt;</summary>
    
    
    
    <category term="Android" scheme="https://sniffer.site/categories/Android/"/>
    
    
    <category term="Android" scheme="https://sniffer.site/tags/Android/"/>
    
    <category term="性能分析" scheme="https://sniffer.site/tags/%E6%80%A7%E8%83%BD%E5%88%86%E6%9E%90/"/>
    
    <category term="systrace" scheme="https://sniffer.site/tags/systrace/"/>
    
    <category term="perfetto" scheme="https://sniffer.site/tags/perfetto/"/>
    
  </entry>
  
</feed>
