Load CSV file into Pandas from s3 using chunksize
I am trying to read a very large file from s3 using...
import pandas as pd
import s3fs
df = pd.read_csv('s3://bucket-name/filename', chunksize=100000)
But even after specifying a chunksize, the call takes forever. Does the chunksize
option actually work when reading from s3? If not, is there a better way to load large files from s3?
The pandas documentation at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
clearly states:
filepath_or_buffer : str, path object or file-like object. Any valid
string path is acceptable. The string could be a URL. Valid URL
schemes include http, ftp, s3, gs, and file. For file URLs, a host is
expected. A local file could be: file://localhost/path/to/table.csv.
If you want to pass in a path object, pandas accepts any os.PathLike.
By file-like object, we refer to objects with a read() method, such as
a file handle (e.g. via builtin open function) or StringIO.
When reading in chunks, pandas returns an iterator object, and you need to iterate over it,
something like this:
for df in pd.read_csv('s3://<<bucket-name>>/<<filename>>', chunksize=100000):
    # process the df chunk here
    ...
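If you want a single DataFrame at the end, a common pattern is to reduce each chunk first and then concatenate the results. This is not from the original answer, just a minimal sketch; the value column and the > 0 filter are hypothetical placeholders:
import pandas as pd

# Minimal sketch: filter each chunk before concatenating, so the full file
# never has to fit in memory at once. 'value' is a hypothetical column name.
chunks = pd.read_csv('s3://<<bucket-name>>/<<filename>>', chunksize=100000)
filtered = pd.concat(chunk[chunk['value'] > 0] for chunk in chunks)
print(filtered.shape)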
If you suspect the problem is that the chunksize is too large, you can first try reading only the first chunk with a smaller chunksize, like this:
for df in pd.read_csv('s3://<<bucket-name>>/<<filename>>', chunksize=1000):
    print(df.head())
    break
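Another option, since the docs you quoted also accept file-like objects: open the object with s3fs yourself and pass the handle to read_csv. Again, this is not part of the original answer, just a minimal sketch that assumes AWS credentials are available in the environment:
import pandas as pd
import s3fs

# Minimal sketch: open the S3 object explicitly and hand the file handle to
# read_csv; chunksize works the same way with a file-like object.
fs = s3fs.S3FileSystem()  # assumes credentials from the environment or instance role
with fs.open('s3://<<bucket-name>>/<<filename>>', 'rb') as f:
    for df in pd.read_csv(f, chunksize=100000):
        print(df.shape)
        break  # drop the break once the first chunk reads quickly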