Load CSV file into Pandas from s3 using chunksize
I am trying to read a very large file from s3 using...
import pandas as pd
import s3fs
df = pd.read_csv('s3://bucket-name/filename', chunksize=100000)
But even after specifying a chunksize, the call takes forever. Does the chunksize
option actually work when reading from s3? If not, is there a better way to load large files from s3?
The pandas documentation at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
clearly states:
filepath_or_buffer : str, path object or file-like object. Any valid
string path is acceptable. The string could be a URL. Valid URL
schemes include http, ftp, s3, gs, and file. For file URLs, a host is
expected. A local file could be: file://localhost/path/to/table.csv.
If you want to pass in a path object, pandas accepts any os.PathLike.
By file-like object, we refer to objects with a read() method, such as
a file handle (e.g. via builtin open function) or StringIO.
When reading in chunks, pandas returns an iterator object, and you need to iterate over it,
something like this:
for df in pd.read_csv('s3://<<bucket-name>>/<<filename>>', chunksize=100000):
    # process the df chunk here
    ...
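If you want a single DataFrame at the end, a common pattern is to reduce each chunk first and then concatenate the results. This is not from the original answer, just a minimal sketch; the value column and the > 0 filter are hypothetical placeholders:
import pandas as pd

# Minimal sketch: filter each chunk before concatenating, so the full file
# never has to fit in memory at once. 'value' is a hypothetical column name.
chunks = pd.read_csv('s3://<<bucket-name>>/<<filename>>', chunksize=100000)
filtered = pd.concat(chunk[chunk['value'] > 0] for chunk in chunks)
print(filtered.shape)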
If you suspect the problem is that the chunksize is too large, you can first try reading only the first chunk with a smaller chunksize, like this:
for df in pd.read_csv('s3://<<bucket-name>>/<<filename>>', chunksize=1000):
    print(df.head())
    break
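Another option, since the docs you quoted also accept file-like objects: open the object with s3fs yourself and pass the handle to read_csv. Again, this is not part of the original answer, just a minimal sketch that assumes AWS credentials are available in the environment:
import pandas as pd
import s3fs

# Minimal sketch: open the S3 object explicitly and hand the file handle to
# read_csv; chunksize works the same way with a file-like object.
fs = s3fs.S3FileSystem()  # assumes credentials from the environment or instance role
with fs.open('s3://<<bucket-name>>/<<filename>>', 'rb') as f:
    for df in pd.read_csv(f, chunksize=100000):
        print(df.shape)
        break  # drop the break once the first chunk reads quickly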