Using a regex pattern to read files from directories
I have directories with the following names:
s3://bucket/elig_date=2020-06-01/
s3://bucket/elig_date=2020-06-02/
....
s3://bucket/elig_date=2020-09-30/
s3://bucket/elig_date=2020-10-01/
...
s3://bucket/elig_date=2020-12-31/
When I want to read all files in all directories from 2020-06-01 through 2020-09-30, I use the following command, and it works:
import dask.dataframe as dd
all_data = dd.read_parquet("s3://bucket/elig_date=2020-0[6-9]-*/*")
However, I want to extend this through the 2020-12-31 directory. I am trying the following, but it does not work:
all_data = dd.read_parquet("s3://bucket/elig_date=2020-0[6-9]|1[0-2]-*/*")
This raises the following error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-61-60da829cf51e> in <module>
----> 1 all_data = dd.read_parquet("s3://bucket/elig_date=2020-0[6-9]|1[0-2]-*/*")
~/anaconda3/envs/3.8.1/lib/python3.9/site-packages/dask/dataframe/io/parquet/core.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, gather_statistics, split_row_groups, read_from_paths, chunksize, **kwargs)
333 index = [index]
334
--> 335 meta, statistics, parts, index = engine.read_metadata(
336 fs,
337 paths,
~/anaconda3/envs/3.8.1/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py in read_metadata(cls, fs, paths, categories, index, gather_statistics, filters, split_row_groups, read_from_paths, engine, **kwargs)
497 split_row_groups,
498 gather_statistics,
--> 499 ) = cls._gather_metadata(
500 paths,
501 fs,
~/anaconda3/envs/3.8.1/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py in _gather_metadata(cls, paths, fs, split_row_groups, gather_statistics, filters, index, read_from_paths, dataset_kwargs)
1647
1648 # Step 1: Create a ParquetDataset object
-> 1649 dataset, base, fns = _get_dataset_object(paths, fs, filters, dataset_kwargs)
1650 if fns == [None]:
1651 # This is a single file. No danger in gathering statistics
~/anaconda3/envs/3.8.1/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py in _get_dataset_object(paths, fs, filters, dataset_kwargs)
1600 if proxy_metadata:
1601 dataset.metadata = proxy_metadata
-> 1602 elif fs.isdir(paths[0]):
1603 # This is a directory. We can let pyarrow do its thing.
1604 # Note: In the future, it may be best to avoid listing the
IndexError: list index out of range
I have only tested this on RegExr, since I don't have your files, but it works there:
s3://bucket/elig_date=2020-(0[6-9])|(1[0-2])-*/*
Same as yours, just with parentheses.
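One caveat worth checking: `dd.read_parquet` expands wildcard paths with glob-style matching (via fsspec), not full regular expressions, so a pattern that works on RegExr may not work here. Regex alternation with `|` is not glob syntax, which would explain why the pattern matches no paths at all (hence the empty `paths` list and the `IndexError`). A minimal local sketch with Python's `fnmatch`, which uses the same character-class syntax as globbing, over hypothetical directory names mirroring the question's layout:

```python
import fnmatch

# Hypothetical directory names, one per month, mirroring the question's layout
names = [f"elig_date=2020-{m:02d}-01" for m in range(1, 13)]

# A character class covers a single digit position, so June-September works:
june_to_sep = fnmatch.filter(names, "elig_date=2020-0[6-9]-*")
print(june_to_sep)  # months 06 through 09 match

# But "|" is a literal character in glob syntax, not alternation,
# so this pattern matches nothing:
broken = fnmatch.filter(names, "elig_date=2020-0[6-9]|1[0-2]-*")
print(broken)  # []

# One portable workaround: build an explicit pattern per month
patterns = [f"elig_date=2020-{m:02d}-*" for m in range(6, 13)]
matched = sorted({n for p in patterns for n in fnmatch.filter(names, p)})
print(len(matched))  # 7 months, June through December
```

Since `dd.read_parquet` accepts a list of paths, one way to cover June through December is to pass the per-month patterns directly, e.g. `dd.read_parquet([f"s3://bucket/elig_date=2020-{m:02d}-*/*" for m in range(6, 13)])`.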