如何通过read_parquet() in pandas过滤一些数据?
How to filter some data by read_parquet() in pandas?
我想通过过滤一些 gid 来减少加载内存的使用
reg_df = pd.read_parquet('/data/2010r.pq',
columns=['timestamp', 'gid', 'uid', 'flag'])
但是在文档中没有显示 kwargs 。
例如:
gid=[100,101,102,103,104,105]
gid_i_want_load = [100,103,105]
那么,我怎样才能只加载我想计算的 gid?
将 **kwargs
引入 pandas 库已记录在案 here。看起来最初的意图是实际将 columns
传递到限制 IO volumn 的请求中。贡献者采取了下一步并为 **kwargs
添加了一个通用通行证。
pandas/io/parquet.py
以下是 read_parquet
:
def read_parquet(path, engine='auto', columns=None, **kwargs):
"""
Load a parquet object from the file path, returning a DataFrame.
.. versionadded 0.21.0
Parameters
----------
path : string
File path
columns: list, default=None
If not None, only these columns will be read from the file.
.. versionadded 0.21.1
engine : {'auto', 'pyarrow', 'fastparquet'}, default 'auto'
Parquet library to use. If 'auto', then the option
``io.parquet.engine`` is used. The default ``io.parquet.engine``
behavior is to try 'pyarrow', falling back to 'fastparquet' if
'pyarrow' is unavailable.
kwargs are passed to the engine
Returns
-------
DataFrame
"""
impl = get_engine(engine)
return impl.read(path, columns=columns, **kwargs)
对于 pandas/io/parquet.py
以下是 pyarrow
引擎上的 read
:
def read(self, path, columns=None, **kwargs):
path, _, _, should_close = get_filepath_or_buffer(path)
if self._pyarrow_lt_070:
result = self.api.parquet.read_pandas(path, columns=columns,
**kwargs).to_pandas()
else:
kwargs['use_pandas_metadata'] = True #<-- only param for kwargs...
result = self.api.parquet.read_table(path, columns=columns,
**kwargs).to_pandas()
if should_close:
try:
path.close()
except: # noqa: flake8
pass
return result
for pyarrow/parquet.py
下面是 read_pandas
:
def read_pandas(self, **kwargs):
"""
Read dataset including pandas metadata, if any. Other arguments passed
through to ParquetDataset.read, see docstring for further details
Returns
-------
pyarrow.Table
Content of the file as a table (of columns)
"""
return self.read(use_pandas_metadata=True, **kwargs) #<-- params being passed
对于 pyarrow/parquet.py
以下是对于 read
:
def read(self, columns=None, nthreads=1, use_pandas_metadata=False): #<-- kwargs param at pyarrow
"""
Read a Table from Parquet format
Parameters
----------
columns: list
If not None, only these columns will be read from the file. A
column name may be a prefix of a nested field, e.g. 'a' will select
'a.b', 'a.c', and 'a.d.e'
nthreads : int, default 1
Number of columns to read in parallel. If > 1, requires that the
underlying file source is threadsafe
use_pandas_metadata : boolean, default False
If True and file has custom pandas schema metadata, ensure that
index columns are also loaded
Returns
-------
pyarrow.table.Table
Content of the file as a table (of columns)
"""
column_indices = self._get_column_indices(
columns, use_pandas_metadata=use_pandas_metadata)
return self.reader.read_all(column_indices=column_indices,
nthreads=nthreads)
所以,如果我理解正确的话,也许你可以访问 nthreads
和 use_pandas_metadata
- 但话又说回来,两者都没有明确分配(??)。我还没有测试过 - 但它可能是一个开始。
我想通过过滤一些 gid 来减少加载内存的使用
reg_df = pd.read_parquet('/data/2010r.pq',
columns=['timestamp', 'gid', 'uid', 'flag'])
但是在文档中没有显示 kwargs 。 例如:
gid=[100,101,102,103,104,105]
gid_i_want_load = [100,103,105]
那么,我怎样才能只加载我想计算的 gid?
将 **kwargs
引入 pandas 库已记录在案 here。看起来最初的意图是实际将 columns
传递到限制 IO volumn 的请求中。贡献者采取了下一步并为 **kwargs
添加了一个通用通行证。
pandas/io/parquet.py
以下是 read_parquet
:
def read_parquet(path, engine='auto', columns=None, **kwargs):
"""
Load a parquet object from the file path, returning a DataFrame.
.. versionadded 0.21.0
Parameters
----------
path : string
File path
columns: list, default=None
If not None, only these columns will be read from the file.
.. versionadded 0.21.1
engine : {'auto', 'pyarrow', 'fastparquet'}, default 'auto'
Parquet library to use. If 'auto', then the option
``io.parquet.engine`` is used. The default ``io.parquet.engine``
behavior is to try 'pyarrow', falling back to 'fastparquet' if
'pyarrow' is unavailable.
kwargs are passed to the engine
Returns
-------
DataFrame
"""
impl = get_engine(engine)
return impl.read(path, columns=columns, **kwargs)
对于 pandas/io/parquet.py
以下是 pyarrow
引擎上的 read
:
def read(self, path, columns=None, **kwargs):
path, _, _, should_close = get_filepath_or_buffer(path)
if self._pyarrow_lt_070:
result = self.api.parquet.read_pandas(path, columns=columns,
**kwargs).to_pandas()
else:
kwargs['use_pandas_metadata'] = True #<-- only param for kwargs...
result = self.api.parquet.read_table(path, columns=columns,
**kwargs).to_pandas()
if should_close:
try:
path.close()
except: # noqa: flake8
pass
return result
for pyarrow/parquet.py
下面是 read_pandas
:
def read_pandas(self, **kwargs):
"""
Read dataset including pandas metadata, if any. Other arguments passed
through to ParquetDataset.read, see docstring for further details
Returns
-------
pyarrow.Table
Content of the file as a table (of columns)
"""
return self.read(use_pandas_metadata=True, **kwargs) #<-- params being passed
对于 pyarrow/parquet.py
以下是对于 read
:
def read(self, columns=None, nthreads=1, use_pandas_metadata=False): #<-- kwargs param at pyarrow
"""
Read a Table from Parquet format
Parameters
----------
columns: list
If not None, only these columns will be read from the file. A
column name may be a prefix of a nested field, e.g. 'a' will select
'a.b', 'a.c', and 'a.d.e'
nthreads : int, default 1
Number of columns to read in parallel. If > 1, requires that the
underlying file source is threadsafe
use_pandas_metadata : boolean, default False
If True and file has custom pandas schema metadata, ensure that
index columns are also loaded
Returns
-------
pyarrow.table.Table
Content of the file as a table (of columns)
"""
column_indices = self._get_column_indices(
columns, use_pandas_metadata=use_pandas_metadata)
return self.reader.read_all(column_indices=column_indices,
nthreads=nthreads)
所以,如果我理解正确的话,也许你可以访问 nthreads
和 use_pandas_metadata
- 但话又说回来,两者都没有明确分配(??)。我还没有测试过 - 但它可能是一个开始。