"dd" for nvme 会使用 mmio 还是 dma?

will "dd" for nvme use mmio or dma?

Recently I was trying to debug an nvme timeout issue:

# dd if=/dev/urandom of=/dev/nvme0n1 bs=4k count=1024000 
nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x2010
nvme nvme0: Shutdown timeout set to 8 seconds
nvme nvme0: 1/0/0 default/read/poll queues 
nvme nvme0: I/O 388 QID 1 timeout, disable controller
blk_update_request: I/O error, dev nvme0n1, sector 64008 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
...

After some digging, I found the root cause was the ranges DTS property of the pcie-controller, which is used for the PIO/outbound mapping:

<0x02000000 0x00 0x08000000 0x20 0x04000000 0x00 0x04000000>; dd timeout
<0x02000000 0x00 0x04000000 0x20 0x04000000 0x00 0x04000000>; dd ok
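
For reference, here is a minimal sketch (not the actual host controller driver) of how the kernel can walk such a ranges property with the generic OF helpers; the cell layout and flag decoding in the comment follow the standard PCI binding (3 PCI address cells, 2 CPU address cells, 2 size cells), not anything specific to this board:

/*
 * Sketch only: walk a PCIe controller's "ranges" property.
 * Layout per entry:
 *
 *   <0x02000000  0x00 0x08000000  0x20 0x04000000  0x00 0x04000000>
 *     flags      PCI (bus) addr   CPU addr         size
 *
 * In the flags cell, bits [25:24] = 0b10 select 32-bit memory space
 * and bit 30 marks prefetchable, so 0x02000000 is non-prefetchable
 * MMIO and 0x42000000 would be prefetchable MMIO.
 */
#include <linux/of_address.h>
#include <linux/ioport.h>
#include <linux/printk.h>

static void dump_outbound_windows(struct device_node *node)
{
	struct of_pci_range_parser parser;
	struct of_pci_range range;

	if (of_pci_range_parser_init(&parser, node))
		return;

	for_each_of_pci_range(&parser, &range)
		pr_info("outbound: pci %#llx cpu %#llx size %#llx %s%s\n",
			range.pci_addr, range.cpu_addr, range.size,
			(range.flags & IORESOURCE_MEM) ? "MEM" : "IO",
			(range.flags & IORESOURCE_PREFETCH) ? " PREFETCH" : "");
}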

Regardless of the root cause, the timeout here seems to be related to MMIO, since 0x02000000 denotes non-prefetchable MMIO. Is that true? Or is it possible that dd triggers DMA, with the nvme controller acting as the bus master?

It uses DMA, not MMIO.

Here is Keith Busch's answer:

Generally speaking, an nvme driver notifies the controller of new commands via a MMIO write to a specific nvme register. The nvme controller fetches those commands from host memory with a DMA.

One exception to that description is if the nvme controller supports CMB with SQEs, but they're not very common. If you had such a controller, the driver will use MMIO to write commands directly into controller memory instead of letting the controller DMA them from host memory. Do you know if you have such a controller?

The data transfers associated with your 'dd' command will always use DMA.
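
To make that split concrete, here is a simplified sketch of the submission path, loosely modeled on the Linux nvme PCI driver (names and structures simplified; CMB/SGL cases and error handling omitted). The command is written into a host-memory ring with a plain memory store; the only MMIO is the doorbell write, after which the controller fetches the SQE, and later the data, by DMA on its own:

#include <linux/io.h>
#include <linux/nvme.h>
#include <linux/string.h>

/* Illustration only, not the driver's real queue layout */
struct sq_sketch {
	struct nvme_command *sqes; /* SQ ring in host DRAM (DMA-coherent) */
	u16 tail;
	u16 depth;
	u32 __iomem *doorbell;     /* register in BAR0: the only MMIO here */
};

static void sq_submit(struct sq_sketch *sq, struct nvme_command *cmd)
{
	/* plain host-memory write: the SQE does not cross the PCI bus here */
	memcpy(&sq->sqes[sq->tail], cmd, sizeof(*cmd));
	if (++sq->tail == sq->depth)
		sq->tail = 0;

	/* MMIO doorbell write: tells the controller a new entry exists;
	 * the controller then DMAs the SQE from host memory by itself */
	writel(sq->tail, sq->doorbell);
}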

Below is the ftrace output:

The call stack before nvme_map_data (note the submitter is a kworker doing writeback, since dd to a block device goes through the page cache):

# entries-in-buffer/entries-written: 376/376   #P:2
#
#                                          _-----=> irqs-off
#                                         / _----=> need-resched
#                                        | / _---=> hardirq/softirq
#                                        || / _--=> preempt-depth
#                                        ||| /     delay
#           TASK-PID       TGID    CPU#  ||||   TIMESTAMP  FUNCTION
#              | |           |       |   ||||      |         |
    kworker/u4:0-379     (-------) [000] ...1  3712.711523: nvme_map_data <-nvme_queue_rq
    kworker/u4:0-379     (-------) [000] ...1  3712.711533: <stack trace>
 => nvme_map_data
 => nvme_queue_rq
 => blk_mq_dispatch_rq_list
 => __blk_mq_do_dispatch_sched
 => __blk_mq_sched_dispatch_requests
 => blk_mq_sched_dispatch_requests
 => __blk_mq_run_hw_queue
 => __blk_mq_delay_run_hw_queue
 => blk_mq_run_hw_queue
 => blk_mq_sched_insert_requests
 => blk_mq_flush_plug_list
 => blk_flush_plug_list
 => blk_mq_submit_bio
 => __submit_bio_noacct_mq
 => submit_bio_noacct
 => submit_bio
 => submit_bh_wbc.constprop.0
 => __block_write_full_page
 => block_write_full_page
 => blkdev_writepage
 => __writepage
 => write_cache_pages
 => generic_writepages
 => blkdev_writepages
 => do_writepages
 => __writeback_single_inode
 => writeback_sb_inodes
 => __writeback_inodes_wb
 => wb_writeback
 => wb_do_writeback
 => wb_workfn
 => process_one_work
 => worker_thread
 => kthread
 => ret_from_fork

The call graph of nvme_map_data:

# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 0)               |  nvme_map_data [nvme]() {
 0)               |    __blk_rq_map_sg() {
 0) + 15.600 us   |      __blk_bios_map_sg();
 0) + 19.760 us   |    }
 0)               |    dma_map_sg_attrs() {
 0) + 62.620 us   |      dma_direct_map_sg();
 0) + 66.520 us   |    }
 0)               |    nvme_pci_setup_prps [nvme]() {
 0)               |      dma_pool_alloc() {
 0)               |        _raw_spin_lock_irqsave() {
 0)   1.880 us    |          preempt_count_add();
 0)   5.520 us    |        }
 0)               |        _raw_spin_unlock_irqrestore() {
 0)   1.820 us    |          preempt_count_sub();
 0)   5.260 us    |        }
 0) + 16.400 us   |      }
 0) + 23.500 us   |    }
 0) ! 150.100 us  |  }
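
Reading the graph back into code, the sequence is roughly the following. This is a hedged outline, not the actual nvme_map_data; setup_prps_sketch is a hypothetical stand-in for the real nvme_pci_setup_prps:

#include <linux/blkdev.h>
#include <linux/dma-mapping.h>
#include <linux/nvme.h>

/* hypothetical stand-in for nvme_pci_setup_prps */
static blk_status_t setup_prps_sketch(struct nvme_command *cmd,
				      struct scatterlist *sg, int nents)
{
	/* the real code walks sg and writes DMA addresses into a PRP list */
	return BLK_STS_OK;
}

static blk_status_t map_data_sketch(struct device *dev, struct request *req,
				    struct scatterlist *sg,
				    struct nvme_command *cmd)
{
	int nents, mapped;

	/* __blk_rq_map_sg: flatten the request's bios into a scatterlist */
	nents = blk_rq_map_sg(req->q, req, sg);

	/* dma_map_sg_attrs -> dma_direct_map_sg: produce bus addresses the
	 * controller can issue; no data moves yet, only address setup */
	mapped = dma_map_sg(dev, sg, nents, rq_dma_dir(req));
	if (!mapped)
		return BLK_STS_RESOURCE;

	/* nvme_pci_setup_prps: record those DMA addresses in a PRP list
	 * (dma_pool_alloc in the graph allocates the PRP-list page); the
	 * transfer itself is performed later by the controller */
	return setup_prps_sketch(cmd, sg, mapped);
}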

nvme_pci_setup_prps is one of the ways the nvme driver sets up DMA:

NVMe devices transfer data to and from system memory using Direct Memory Access (DMA). Specifically, they send messages across the PCI bus requesting data transfers. In the absence of an IOMMU, these messages contain physical memory addresses. These data transfers happen without involving the CPU, and the MMU is responsible for making access to memory coherent.

NVMe devices also may place additional requirements on the physical layout of memory for these transfers. The NVMe 1.0 specification requires all physical memory to be describable by what is called a PRP list. To be described by a PRP list, memory must have the following properties:

- The memory is broken into physical 4KiB pages, which we'll call device pages.
- The first device page can be a partial page starting at any 4-byte aligned address. It may extend up to the end of the current physical page, but not beyond.
- If there is more than one device page, the first device page must end on a physical 4KiB page boundary.
- The last device page begins on a physical 4KiB page boundary, but is not required to end on a physical 4KiB page boundary.

https://spdk.io/doc/memory.html
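
As a worked example of those layout rules, here is a small userspace toy (build_prps is a hypothetical helper, and real drivers store the first entry in PRP1 and chain the rest through PRP2 or a PRP list) that splits a transfer into 4KiB device-page entries; the first entry keeps its in-page offset, and every later entry starts on a page boundary:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <inttypes.h>

#define DEV_PAGE_SIZE 4096ULL

static size_t build_prps(uint64_t addr, uint64_t len,
			 uint64_t *prps, size_t max)
{
	size_t n = 0;

	while (len && n < max) {
		/* bytes remaining in the current 4KiB device page */
		uint64_t chunk = DEV_PAGE_SIZE - (addr & (DEV_PAGE_SIZE - 1));

		if (chunk > len)
			chunk = len;

		prps[n++] = addr; /* entries after the first are page-aligned */
		addr += chunk;
		len -= chunk;
	}
	return n;
}

int main(void)
{
	uint64_t prps[8];
	/* 9000 bytes starting 0x100 into a device page -> 3 entries */
	size_t n = build_prps(0x80000100ULL, 9000, prps, 8);

	for (size_t i = 0; i < n; i++)
		printf("prp[%zu] = 0x%" PRIx64 "\n", i, prps[i]);
	return 0;
}

For 9000 bytes starting 0x100 into a page this yields three entries, 0x80000100, 0x80001000, and 0x80002000: the first is a partial page ending on a 4KiB boundary, and the rest begin on 4KiB boundaries, matching the rules quoted above.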