Using move_pages() to move hugepages?

This question is about:

  1. Kernel 3.10.0-1062.4.3.el7.x86_64
  2. Non-transparent hugepages allocated via boot parameters, which may or may not be mapped to a file (e.g. a mounted hugetlbfs)
  3. x86_64

According to this kernel source, move_pages() will call do_pages_move() to move a page, but I don't see how it indirectly calls migrate_huge_page().

So my questions are:

  1. Can move_pages() move hugepages? If yes, should the page boundary be 4KB or 2MB when passing an array of page addresses? It seems there was a patch for supporting hugepage migration 5 years ago.
  2. If move_pages() cannot move hugepages, how can I move them?
  3. After moving hugepages, can I query the NUMA IDs of hugepages the same way I query regular pages?

With the code below I seem to be able to move hugepages via move_pages() with page size = 2MB, but is this the right way?:

#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <numaif.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>
#include <string.h>
#include <limits>

int main(int argc, char** argv) {
        const int32_t dst_node = strtoul(argv[1], nullptr, 10);
        const constexpr uint64_t size = 4lu * 1024 * 1024;
        const constexpr uint64_t pageSize = 2lu * 1024 * 1024;
        const constexpr uint32_t nPages = size / pageSize;
        int32_t status[nPages];
        std::fill_n(status, nPages, std::numeric_limits<int32_t>::min());
        void* pages[nPages];
        int32_t dst_nodes[nPages];
        void* ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE | MAP_HUGETLB, -1, 0);

        if (ptr == MAP_FAILED) {
                throw "failed to map hugepages";
        }
        memset(ptr, 0x41, nPages*pageSize);
        for (uint32_t i = 0; i < nPages; i++) {
                pages[i] = &((char*)ptr)[i*pageSize];
                dst_nodes[i] = dst_node;
        }

        std::cout << "Before moving" << std::endl;

        if (0 != move_pages(0, nPages, pages, nullptr, status, 0)) {
            std::cout << "failed to query pages because " << strerror(errno) << std::endl;
        }
        else {
                for (uint32_t i = 0; i < nPages; i++) {
                        std::cout << "page # " << i << " locates at numa node " << status[i] << std::endl;
                }
        }

        // real move
        if (0 != move_pages(0, nPages, pages, dst_nodes, status, MPOL_MF_MOVE_ALL)) {
                std::cout << "failed to move pages because " << strerror(errno) << std::endl;
                exit(-1);
        }

        const constexpr uint64_t smallPageSize = 4lu * 1024;
        const constexpr uint32_t nSmallPages = size / smallPageSize;
        void* smallPages[nSmallPages];
        int32_t smallStatus[nSmallPages];
        std::fill_n(smallStatus, nSmallPages, std::numeric_limits<int32_t>::min());
        for (uint32_t i = 0; i < nSmallPages; i++) {
                smallPages[i] = &((char*)ptr)[i*smallPageSize];
        }


        std::cout << "after moving" << std::endl;
        if (0 != move_pages(0, nSmallPages, smallPages, nullptr, smallStatus, 0)) {
            std::cout << "failed to query pages because " << strerror(errno) << std::endl;
        }
        else {
                for (uint32_t i = 0; i < nSmallPages; i++) {
                        std::cout << "page # " << i << " locates at numa node " << smallStatus[i] << std::endl;
                }
        }

}

Should I query NUMA IDs at 4KB page granularity (as in the code above), or at 2MB?

For the vanilla 3.10 Linux kernel (without the Red Hat patches, as I have no LXR for the rhel kernels), the move_pages syscall will force a split of huge pages (2MB; both THP and hugetlbfs style) into small pages (4KB). move_pages works in chunks that are too short (around 0.5MB if I calculated correctly), and the function graph is:

move_pages .. -> migrate_pages -> unmap_and_move ->

static int unmap_and_move(new_page_t get_new_page, unsigned long private,
            struct page *page, int force, enum migrate_mode mode)
{
    struct page *newpage = get_new_page(page, private, &result);
    ....
    if (unlikely(PageTransHuge(page)))
        if (unlikely(split_huge_page(page)))
            goto out;

PageTransHuge returns true for both kinds of hugepages (THP and libhugetlbfs): https://elixir.bootlin.com/linux/v3.10/source/include/linux/page-flags.h#L411

PageTransHuge() returns true for both transparent huge and hugetlbfs pages, but not normal pages.

split_huge_page will call split_huge_page_to_list which:

Split a hugepage into normal pages. This doesn't change the position of head page.

A split will also increment a vm_event counter of kind THP_SPLIT. The counters are exported in /proc/vmstat ("this file displays various virtual memory statistics"). You can check this counter with the UUOC command cat /proc/vmstat | grep thp_split before and after the test.

In version 3.10 there is some code for hugepage migration, the unmap_and_move_huge_page function, but it is not called from move_pages. Its only usage in 3.10 was in migrate_huge_page, which is called only from the memory failure handler soft_offline_huge_page (__soft_offline_page) (added in 2010):

Soft offline a page, by migration or invalidation, without killing anything. This is for the case when a page is not corrupted yet (so it's still valid to access), but has had a number of corrected errors and is better taken out.

Answers:

can move_pages() move hugepages? if yes, should the page boundary be 4KB or 2MB when passing an array of addresses of pages? It seems like there was a patch for supporting moving hugepages 5 years ago.

The vanilla 3.10 kernel has move_pages, which will accept an array "pages" of 4KB page pointers, and it will break (split) huge pages into 512 small pages and then migrate the small pages. The chances of them being merged back by THP are very low, because move_pages issues separate requests for physical memory pages and they will almost never be contiguous.

Do not point at "2MB" boundaries: that will still split every hugepage mentioned, but will migrate only the first 4KB small page of that memory.
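So on this kernel, the "pages" array should cover the whole range at 4KB stride, exactly as the second query in the question's code does. A small sketch of building such an array (smallPagePointers is a hypothetical helper, not a libnuma API):

```cpp
#include <cstdint>
#include <vector>

// Build a move_pages()-style pages array covering [base, base + size)
// at 4KB granularity, one pointer per small page.
std::vector<void*> smallPagePointers(void* base, uint64_t size) {
        const uint64_t kSmallPage = 4096;
        std::vector<void*> pages;
        pages.reserve(size / kSmallPage);
        for (uint64_t off = 0; off < size; off += kSmallPage)
                pages.push_back(static_cast<char*>(base) + off);
        return pages;
}
```

The resulting vector's data() can be passed as the "pages" argument, with an equally sized nodes/status array.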

The 2013 patch was not added to the vanilla 3.10 kernel.

The patch seems to have been accepted in September 2013: https://github.com/torvalds/linux/search?q=+extend+hugepage+migration&type=Commits

if move_pages() cannot move hugepages, how can I move hugepages?

move_pages will move the data of a hugepage as small pages. You can either: allocate a hugepage on the correct numa node manually and copy your data (copying twice if you want to keep the virtual address); or update the kernel to some version with the patch and use the methods and tests of the patch author, Naoya Horiguchi (JP). There is a copy of his tests: https://github.com/srikanth007m/test_hugepage_migration_extension (https://github.com/Naoya-Horiguchi/test_core is required)

https://github.com/srikanth007m/test_hugepage_migration_extension/blob/master/test_move_pages.c

Right now I am not sure how to start the tests and how to check that they work correctly. With ./test_move_pages -v -m private -h 2048 run on a recent kernel, it does not increment the THP_SPLIT counter.

His tests look very similar to ours: mmap, memset to fault in the pages, fill a pages array with pointers to small pages, numa_move_pages.

after moving hugepages, can I query the NUMA IDs of hugepages the same way I query regular pages like this answer?

You can query the status of any memory by providing the correct "pages" array to the move_pages syscall in query mode (with null nodes). The array should list every small page of the memory region you want to check.

If you know any reliable method to check whether the memory is mapped to a hugepage, you can query the NUMA ID of any small page of the hugepage. I think there can be a probabilistic method if you can export the physical address out of the kernel to user-space (with some LKM module, for example): for a hugepage, the virtual and physical addresses will always share the same low 21 bits, while for small pages the bits will coincide only for about one test in a million. Or just write an LKM to export the PMD directory.
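On kernels that expose /proc/self/pagemap you may not even need a custom LKM for the 21-bit comparison: with sufficient privileges (root / CAP_SYS_ADMIN on modern kernels) you can read the PFN of a page and compare the low 21 bits of the physical and virtual addresses. A sketch, assuming the documented pagemap layout (64-bit entries, bits 0-54 = PFN, bit 63 = present); sameLowBits and physAddr are hypothetical helpers:

```cpp
#include <cstdint>
#include <fcntl.h>
#include <unistd.h>

// True when the low `bits` bits of the two addresses coincide; inside
// a 2MB hugepage, virt and phys share the low 21 bits.
bool sameLowBits(uint64_t virt, uint64_t phys, int bits) {
        uint64_t mask = (1ul << bits) - 1;
        return (virt & mask) == (phys & mask);
}

// Read the physical address backing `virt` from /proc/self/pagemap;
// returns 0 on failure or when the page is not present.
uint64_t physAddr(uint64_t virt) {
        int fd = open("/proc/self/pagemap", O_RDONLY);
        if (fd < 0)
                return 0;
        uint64_t entry = 0;
        off_t off = (virt / 4096) * sizeof(entry);
        if (pread(fd, &entry, sizeof(entry), off) != (ssize_t)sizeof(entry))
                entry = 0;
        close(fd);
        if (!(entry & (1ul << 63)))          // bit 63: page present
                return 0;
        uint64_t pfn = entry & ((1ul << 55) - 1);  // bits 0-54: PFN
        return pfn * 4096 + virt % 4096;
}
```

For an address p inside a suspected hugepage, sameLowBits((uint64_t)p, physAddr((uint64_t)p), 21) returning true is strong (though probabilistic) evidence of a 2MB mapping; a false positive from a small page requires the low 21 bits to coincide by chance.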