Is it possible to automatically chunk a csv when processing with pandas?

Question

I have a csv file in which I validate every cell against rules defined per column.

df.drop(df[~validator].index, inplace=True)

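The validator here can be any of several functions that check, for example, whether a cell looks like an integer or whether the string in a cell is shorter than 10 characters. So a single cell carries all the information needed to validate it, without any other cell from the same row or column.

I have this: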

    bad_dfs = []
    for validator, error in people_csv_validators:
        bad_dfs.append(df.loc[~validator])
        df.drop(df[~validator].index, inplace=True)
    bad_df = pd.concat(bad_dfs)
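
For readers who want to run the loop above, here is a minimal sketch of what `people_csv_validators` might look like. The column names `age` and `name`, the sample rows, and the extra `error` column on the rejected rows are assumptions made for illustration, not part of the original code. The sketch also reassigns `df` instead of dropping in place, and reindexes each mask so it lines up with the rows that are still present.

```python
import pandas as pd

# Hypothetical sample data; the real file has up to 20 columns.
df = pd.DataFrame({
    "age": ["31", "x7", "45"],
    "name": ["Ann", "Bob", "Bartholomew-Smith"],
})

# Each entry pairs a boolean mask (True = valid cell) with an error label.
people_csv_validators = [
    (df["age"].str.fullmatch(r"\d+"), "age is not an integer"),
    (df["name"].str.len() < 10, "name has 10 or more characters"),
]

bad_dfs = []
for validator, error in people_csv_validators:
    # Align the precomputed mask with the rows that have not been rejected yet.
    mask = validator.reindex(df.index)
    # Keep the rejected rows, tagged with the reason they failed.
    bad_dfs.append(df.loc[~mask].assign(error=error))
    # Continue with the rows that passed this validator.
    df = df.loc[mask]
bad_df = pd.concat(bad_dfs)
```

Because each mask depends only on the cell it inspects, the same logic can be applied to any subset of rows independently, which is what makes chunking possible.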

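Previously the dataframe had fewer than 1M rows and 20 columns or fewer. The number of columns has not changed, but the number of rows has grown a lot, and I want to be able to process the file with a fixed amount of memory. So I thought I would chunk it, since the validation doesn't depend on anything outside the cell.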

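Now, I know I can pass the chunksize parameter to the read_csv call I already have and then write to a csv file chunk by chunk with mode="a", but given dask and the other libraries that ship their own dataframe classes, I thought there might be some other way to do this.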
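
The manual approach mentioned above (reading with `chunksize` and appending with `mode="a"`) could look roughly like the sketch below. The function name `validate_in_chunks`, its parameters, and the choice to read every column as text with `dtype=str` are assumptions for illustration. Each validator here is a function that takes a chunk and returns a boolean Series (True = valid), so it can be re-evaluated on every chunk.

```python
import pandas as pd

def validate_in_chunks(path, good_path, bad_path, validators, chunksize=10**6):
    """Validate a csv chunk by chunk, so only one chunk is in memory at a time.

    validators: list of functions mapping a DataFrame chunk to a boolean Series.
    """
    first = True
    # dtype=str keeps type inference from differing between chunks.
    for chunk in pd.read_csv(path, chunksize=chunksize, dtype=str):
        keep = pd.Series(True, index=chunk.index)
        for make_mask in validators:
            keep &= make_mask(chunk)
        # Overwrite on the first chunk, append afterwards; write the header once.
        mode = "w" if first else "a"
        chunk[keep].to_csv(good_path, mode=mode, header=first, index=False)
        chunk[~keep].to_csv(bad_path, mode=mode, header=first, index=False)
        first = False
```

Peak memory is then bounded by `chunksize` rather than by the size of the file, at the cost of managing the output files by hand.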

Is there any standard way to do this, something like

df = pd.read_csv(path, chunk_in_the_background_and_write_to_this_file=output_path, chunk_count=10**6)

some_row_based_operations(df)

# It automatically reads the first 10^6 rows and processes them,
# then writes them to `output_path` and then reads the next 10^6 rows and so on

Again, this is a fairly simple thing to do by hand, but I would like to know whether there is a canonical way.

Answer 1

Rough code for doing this with dask would be:

import dask.dataframe as dd

# let's use ddf for dask df
ddf = dd.read_csv(path) # can also provide list of files

def some_row_based_operations(df):
    # a function that accepts and returns pandas df
    # implementing required logic
    return df

# the line below is fine only if the function is row-based
# (no dependencies across different rows)
modified_ddf = ddf.map_partitions(some_row_based_operations)

# single_file kwarg is only if you want one file at the end
modified_ddf.to_csv(output_path, single_file=True)

One caveat: with the approach above, df should not be modified in place inside some_row_based_operations, but hopefully a change like the one below is feasible:

# change this: df.drop(df[~validator].index, inplace=True)
# also note, that this logic should be part of `some_row_based_operations`
df = df.drop(df[~validator].index)
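
To make the caveat concrete, here is a minimal sketch of a partition function that filters without mutating its input. The `RULES` mapping and the column names `age` and `name` are hypothetical. Only pandas is used inside the function, since each dask partition arrives as an ordinary pandas dataframe.

```python
import pandas as pd

# Hypothetical rules: column name -> function returning a boolean Series (True = valid).
RULES = {
    "age": lambda s: s.astype(str).str.fullmatch(r"\d+"),
    "name": lambda s: s.astype(str).str.len() < 10,
}

def some_row_based_operations(df):
    # Start by keeping every row, then narrow down rule by rule.
    keep = pd.Series(True, index=df.index)
    for column, rule in RULES.items():
        keep &= rule(df[column])
    # Boolean indexing returns a new frame; the input partition is left untouched.
    return df[keep]
```

A function of this shape can be passed to `ddf.map_partitions(some_row_based_operations)` as in the code above, because it accepts and returns a pandas dataframe and looks only at the rows it is given.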

