Dask: read CSV files recursively from directories

Given the following directory structure:

Folder
  Sub-Folder1
           File1.csv
           File2.csv
           File3.csv
           File4.csv
  Sub-Folder2
           File1.csv
           File2.csv
  Sub-Folder3
           File1.csv
           File2.csv

How can I use Dask's read_csv to read all the CSV files in these folders, with each file placed in its own partition?

IIUC, you can use:

import dask.dataframe as dd

dfs = dd.read_csv('Folder/**/*.csv')

Output:

>>> dfs
Dask DataFrame Structure:
                   A      B      C
npartitions=8                     
               int64  int64  int64
                 ...    ...    ...
...              ...    ...    ...
                 ...    ...    ...
                 ...    ...    ...
Dask Name: read-csv, 8 tasks
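
One caveat on the "one file per partition" part: with its default blocksize, read_csv may split a large file into several partitions, so the count of 8 above is only guaranteed for small files. A minimal sketch that lists the files explicitly and passes blocksize=None (which tells Dask to use a single block per file) could look like the following. The helper names list_csv_files and read_one_partition_per_file are hypothetical, made up for this illustration, not part of Dask.

```python
from pathlib import Path


def list_csv_files(root):
    """Return every .csv file under root (at any depth), sorted for a stable order."""
    return sorted(str(p) for p in Path(root).rglob("*.csv"))


def read_one_partition_per_file(root):
    """Read all CSV files under root into a Dask DataFrame, one partition per file."""
    # Imported lazily so that list_csv_files stays usable without Dask installed.
    import dask.dataframe as dd

    files = list_csv_files(root)
    # blocksize=None disables byte-based splitting: each file becomes one partition.
    return dd.read_csv(files, blocksize=None)
```

With the layout from the question, read_one_partition_per_file('Folder').npartitions should then match len(list_csv_files('Folder')). Note that rglob also picks up CSV files sitting directly in the root folder, which may or may not be what you want.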