How to read partitioned parquet files from S3 using pyarrow in python
I am looking for ways to read data from multiple partitioned directories in S3 using python.
data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snappy.parquet
data_folder/serial_number=2/cur_date=27-12-2012/asdsdfsd0324324.snappy.parquet
pyarrow's ParquetDataset class has the capability to read from partitions, so I tried the following code:
>>> import pandas as pd
>>> import pyarrow.parquet as pq
>>> import s3fs
>>> a = "s3://my_bucker/path/to/data_folder/"
>>> dataset = pq.ParquetDataset(a)
It threw the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/my_username/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 502, in __init__
    self.metadata_path) = _make_manifest(path_or_paths, self.fs)
  File "/home/my_username/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 601, in _make_manifest
    .format(path))
OSError: Passed non-file path: s3://my_bucker/path/to/data_folder/
Based on the pyarrow documentation, I tried using s3fs as the filesystem, i.e.:
>>> dataset = pq.ParquetDataset(a,filesystem=s3fs)
which throws the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/my_username/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 502, in __init__
    self.metadata_path) = _make_manifest(path_or_paths, self.fs)
  File "/home/my_username/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, in _make_manifest
    if is_string(path_or_paths) and fs.isdir(path_or_paths):
AttributeError: module 's3fs' has no attribute 'isdir'
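I am limited to using an ECS cluster, so spark/pyspark is not an option.
Is there a way to easily read parquet files in python from such partitioned directories in S3? I feel that listing all the directories and then reading them one by one is not good practice. I need to convert the data that is read into a pandas dataframe for further processing, so I would prefer options related to fastparquet or pyarrow. I am open to other options in python as well.
I managed to get this working with the latest releases of fastparquet and s3fs. Below is the code for the same: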
import s3fs
import fastparquet as fp
s3 = s3fs.S3FileSystem()
fs = s3fs.core.S3FileSystem()
#mybucket/data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snappy.parquet
s3_path = "mybucket/data_folder/*/*/*.parquet"
all_paths_from_s3 = fs.glob(path=s3_path)
myopen = s3.open
#use s3fs as the filesystem
fp_obj = fp.ParquetFile(all_paths_from_s3,open_with=myopen)
#convert to pandas dataframe
df = fp_obj.to_pandas()
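Thanks to Martin for pointing me in the right direction.
NB: This would be slower than using pyarrow, based on the benchmark. I will update my answer once s3fs support is implemented in pyarrow via ARROW-1213.
I did a quick benchmark on individual iterations with pyarrow, and with the list of files sent as a glob to fastparquet. fastparquet with s3fs is faster than pyarrow plus my hackish code. But I reckon pyarrow + s3fs will be faster once implemented.
The code and benchmarks are below: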
>>> def test_pq():
...     for current_file in list_parquet_files:
...         f = fs.open(current_file)
...         df = pq.read_table(f).to_pandas()
...         # following code is to extract the serial_number & cur_date values so that we can add them to the dataframe
...         # probably not the best way to split :)
...         elements_list = current_file.split('/')
...         for item in elements_list:
...             if item.find(date_partition) != -1:
...                 current_date = item.split('=')[1]
...             elif item.find(dma_partition) != -1:
...                 current_dma = item.split('=')[1]
...         df['serial_number'] = current_dma
...         df['cur_date'] = current_date
...         list_.append(df)
...     frame = pd.concat(list_)
...
>>> timeit.timeit('test_pq()', number=10, globals=globals())
12.078817503992468
>>> def test_fp():
...     fp_obj = fp.ParquetFile(all_paths_from_s3, open_with=myopen)
...     df = fp_obj.to_pandas()
...
>>> timeit.timeit('test_fp()', number=10, globals=globals())
2.961556333000317
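The path-splitting step inside test_pq could also be written as a small standalone helper. This is only a sketch: parse_hive_partitions is a hypothetical name (not a pyarrow or fastparquet function), and it assumes the key=value directory layout shown in the question.

```python
def parse_hive_partitions(path):
    """Return {column: value} for every key=value directory in a path.

    Hypothetical helper, not part of pyarrow or fastparquet. Values are
    returned as strings, exactly as they appear in the path.
    """
    partitions = {}
    # Skip the last component: it is the file name, not a partition directory.
    for part in path.strip("/").split("/")[:-1]:
        key, sep, value = part.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


example = "mybucket/data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snappy.parquet"
print(parse_hive_partitions(example))
# {'serial_number': '1', 'cur_date': '20-12-2012'}
```

The resulting dictionary can then be assigned to dataframe columns, for example with df.assign(**parse_hive_partitions(current_file)).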
2019 update
After all the PRs, issues such as Arrow-2038 and Fast Parquet - PR#182 have been resolved.
Read parquet files using Pyarrow
# pip install pyarrow
# pip install s3fs
>>> import s3fs
>>> import pyarrow.parquet as pq
>>> fs = s3fs.S3FileSystem()
>>> bucket = 'your-bucket-name'
>>> path = 'directory_name'  # if it's a directory, omit the trailing /
>>> bucket_uri = f's3://{bucket}/{path}'
>>> bucket_uri
's3://your-bucket-name/directory_name'
>>> dataset = pq.ParquetDataset(bucket_uri, filesystem=fs)
>>> table = dataset.read()
>>> df = table.to_pandas()
Read parquet files using fastparquet
# pip install s3fs
# pip install fastparquet
>>> import s3fs
>>> import fastparquet as fp
>>> fs = s3fs.S3FileSystem()
>>> myopen = fs.open
>>> bucket = 'your-bucket-name'
>>> path = 'directory_name'
>>> root_dir_path = f'{bucket}/{path}'
# the first two wildcards stand for the 1st and 2nd partition columns of your data, and so forth
>>> s3_path = f"{root_dir_path}/*/*/*.parquet"
>>> all_paths_from_s3 = fs.glob(path=s3_path)
>>> fp_obj = fp.ParquetFile(all_paths_from_s3, open_with=myopen, root=root_dir_path)
>>> df = fp_obj.to_pandas()
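Since the glob needs one wildcard per partition level, the pattern can be generated instead of typed by hand. A minimal sketch, assuming a hypothetical helper named partition_glob (it is not part of s3fs or fastparquet):

```python
def partition_glob(root_dir_path, partition_levels, suffix="*.parquet"):
    """Build a glob with one '*' per partition directory level.

    Hypothetical helper for illustration; pass the result to fs.glob().
    """
    parts = [root_dir_path.rstrip("/")]
    parts.extend(["*"] * partition_levels)
    parts.append(suffix)
    return "/".join(parts)


# Two partition columns (e.g. serial_number and cur_date) need two wildcards.
s3_path = partition_glob("your-bucket-name/directory_name", 2)
print(s3_path)
# your-bucket-name/directory_name/*/*/*.parquet
```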
Quick benchmarks
This is probably not the best way to benchmark it. Please read the blog post for a thorough benchmark.
# pyarrow
>>> import timeit
>>> def test_pq():
...     dataset = pq.ParquetDataset(bucket_uri, filesystem=fs)
...     table = dataset.read()
...     df = table.to_pandas()
...
>>> timeit.timeit('test_pq()', number=10, globals=globals())
1.2677053569998407
# fastparquet
>>> def test_fp():
...     fp_obj = fp.ParquetFile(all_paths_from_s3, open_with=myopen, root=root_dir_path)
...     df = fp_obj.to_pandas()
...
>>> timeit.timeit('test_fp()', number=10, globals=globals())
2.931876824000028
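Further reading regarding Pyarrow's speed
References:
- fastparquet
- s3fs
- pyarrow
- The pyarrow code is based on the discussion and also the documentation
- The fastparquet code is based on the discussions in PR-182 and also the documentation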
This issue was resolved in this pull request in 2017.
For those of you who want to read parquet from S3 using only pyarrow, here is an example:
import s3fs
import pyarrow.parquet as pq
fs = s3fs.S3FileSystem()
bucket = "your-bucket"
path = "your-path"
# Python 3.6 or later
p_dataset = pq.ParquetDataset(
    f"s3://{bucket}/{path}",
    filesystem=fs
)
df = p_dataset.read().to_pandas()
# Pre-python 3.6
p_dataset = pq.ParquetDataset(
    "s3://{0}/{1}".format(bucket, path),
    filesystem=fs
)
df = p_dataset.read().to_pandas()
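For those of you who want to read in only parts of a partitioned parquet file, pyarrow accepts a list of keys as well as just the partial directory path to read in all parts of the partition. This method is especially useful for organizations that have partitioned their parquet datasets in a meaningful way, for example by year or country, allowing users to specify which parts of the file they need. This will reduce costs in the long run, as AWS charges per byte when reading in datasets.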
# Read in user-specified partitions of a partitioned parquet file
import s3fs
import pyarrow.parquet as pq
s3 = s3fs.S3FileSystem()
keys = [
    'keyname/blah_blah/part-00000-cc2c2113-3985-46ac-9b50-987e9463390e-c000.snappy.parquet',
    'keyname/blah_blah/part-00001-cc2c2113-3985-46ac-9b50-987e9463390e-c000.snappy.parquet',
    'keyname/blah_blah/part-00002-cc2c2113-3985-46ac-9b50-987e9463390e-c000.snappy.parquet',
    'keyname/blah_blah/part-00003-cc2c2113-3985-46ac-9b50-987e9463390e-c000.snappy.parquet',
]
bucket = 'bucket_yada_yada_yada'
# Add the s3 prefix and bucket name to all keys in the list
parq_list = []
for key in keys:
    parq_list.append('s3://' + bucket + '/' + key)
# Create your dataframe
df = pq.ParquetDataset(parq_list, filesystem=s3).read_pandas(columns=['Var1', 'Var2', 'Var3']).to_pandas()
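When the keys follow a key=value directory layout, the list handed to ParquetDataset can be narrowed down before anything is read. The sketch below is illustrative only: select_keys and to_s3_uris are hypothetical helpers, and the example keys (year=/country= directories) are made up to show the idea.

```python
def select_keys(keys, **wanted):
    """Keep only keys whose directories include every wanted key=value pair."""
    selected = []
    for key in keys:
        dirs = key.split("/")[:-1]  # drop the file name
        if all(f"{col}={val}" in dirs for col, val in wanted.items()):
            selected.append(key)
    return selected


def to_s3_uris(bucket, keys):
    """Prefix each key with s3://<bucket>/."""
    return [f"s3://{bucket}/{key}" for key in keys]


# Made-up keys, assuming hive-style partition directories.
all_keys = [
    "keyname/year=2018/country=US/part-00000.snappy.parquet",
    "keyname/year=2019/country=US/part-00000.snappy.parquet",
    "keyname/year=2019/country=DE/part-00000.snappy.parquet",
]

parq_list = to_s3_uris("bucket_yada_yada_yada", select_keys(all_keys, year=2019, country="US"))
print(parq_list)
# ['s3://bucket_yada_yada_yada/keyname/year=2019/country=US/part-00000.snappy.parquet']
```

Only the files that survive the selection are fetched, which is where the per-byte saving mentioned above comes from.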
For python 3.6+, AWS has a library called aws-data-wrangler that helps with the integration between Pandas/S3/Parquet.
To install, do:
pip install awswrangler
To read partitioned parquet from s3 using awswrangler 1.x.x and above, do:
import awswrangler as wr
df = wr.s3.read_parquet(path="s3://my_bucket/path/to/data_folder/", dataset=True)
By setting dataset=True, awswrangler expects partitioned parquet files. It will read all the individual parquet files from the partitions below the s3 key you specify in path.
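To load only some partitions with awswrangler, read_parquet also accepts a partition_filter callback when dataset=True. This is a sketch under assumptions: it presumes an awswrangler release that supports partition_filter, and the function names only_serial_number_1 and read_serial_number_1 are made up for illustration. The callback receives a dict of partition names to values, with the values as strings.

```python
def only_serial_number_1(partitions):
    """partition_filter callback: keep partitions where serial_number is '1'.

    Partition values arrive as strings taken from the S3 path.
    """
    return partitions.get("serial_number") == "1"


def read_serial_number_1(path):
    # Imported lazily so the predicate above can be tried without
    # awswrangler installed or AWS credentials configured.
    import awswrangler as wr

    return wr.s3.read_parquet(
        path=path,
        dataset=True,
        partition_filter=only_serial_number_1,
    )


# Usage (needs AWS credentials and a real bucket):
# df = read_serial_number_1("s3://my_bucket/path/to/data_folder/")
```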
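PyArrow 7.0.0 has some improvements to a new module, pyarrow.dataset, that is meant to abstract away the dataset concept from the previous, Parquet-specific pyarrow.parquet.ParquetDataset.
Assuming you are fine with the dataset schema being inferred from the first file, the example from the documentation for reading a partitioned dataset should just work.
Here is a more complete example, assuming you want to use data from S3: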
import pyarrow.dataset as ds
from pyarrow import fs
s3 = fs.S3FileSystem()
dataset = ds.dataset(
    "my-bucket-name/my-path-to-dataset-partitions",
    format="parquet",
    filesystem=s3,
    partitioning="hive"
)
# Assuming your data is partitioned like year=2022/month=4/day=29
# this will only have to read the files for that day
expression = ((ds.field("year") == 2022) & (ds.field("month") == 4) & (ds.field("day") == 29))
pyarrow_table_2022_04_29 = dataset.to_table(filter=expression)
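Be careful if you define the dataset schema yourself. The example above, which relies on inference via the partitioning parameter, automatically adds the partitions to your dataset schema.
If you want the partitions to work properly with a manually defined dataset schema, you must make sure to add the partitions to the schema: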
import pyarrow as pa
my_manual_schema = pa.schema([])  # Some pyarrow.Schema instance for your dataset
# Be sure to add the partitions even though they're not in the dataset files.
# Schema.append returns a new schema, so reassign the result each time.
my_manual_schema = my_manual_schema.append(pa.field("year", pa.int16()))
my_manual_schema = my_manual_schema.append(pa.field("month", pa.int8()))
my_manual_schema = my_manual_schema.append(pa.field("day", pa.int8()))
dataset = ds.dataset(
    "my-bucket-name/my-path-to-dataset-partitions",
    format="parquet",
    filesystem=s3,
    schema=my_manual_schema,
    partitioning="hive"
)