How to filter some data by read_parquet() in pandas?

Question

I want to reduce memory usage at load time by filtering on some gid values

import pandas as pd

reg_df = pd.read_parquet('/data/2010r.pq',
                         columns=['timestamp', 'gid', 'uid', 'flag'])

But the documentation doesn't show which kwargs are accepted. For example:

gid=[100,101,102,103,104,105]
gid_i_want_load = [100,103,105]

So, how can I load only the gids that I want to compute on?

Answer 1

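The introduction of **kwargs to the pandas library is documented here. It looks like the original intent was to actually pass columns into the request to limit IO volume. The contributors took the next step and added a general pass-through for **kwargs.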

In pandas/io/parquet.py, this is read_parquet:

def read_parquet(path, engine='auto', columns=None, **kwargs):
    """
    Load a parquet object from the file path, returning a DataFrame.
    .. versionadded 0.21.0
    Parameters
    ----------
    path : string
        File path
    columns: list, default=None
        If not None, only these columns will be read from the file.
        .. versionadded 0.21.1
    engine : {'auto', 'pyarrow', 'fastparquet'}, default 'auto'
        Parquet library to use. If 'auto', then the option
        ``io.parquet.engine`` is used. The default ``io.parquet.engine``
        behavior is to try 'pyarrow', falling back to 'fastparquet' if
        'pyarrow' is unavailable.
    kwargs are passed to the engine
    Returns
    -------
    DataFrame
    """

    impl = get_engine(engine)
    return impl.read(path, columns=columns, **kwargs)

In pandas/io/parquet.py, this is read on the pyarrow engine:

def read(self, path, columns=None, **kwargs):
    path, _, _, should_close = get_filepath_or_buffer(path)
    if self._pyarrow_lt_070:
        result = self.api.parquet.read_pandas(path, columns=columns,
                                              **kwargs).to_pandas()
    else:
        kwargs['use_pandas_metadata'] = True    #<-- only param for kwargs...
        result = self.api.parquet.read_table(path, columns=columns,
                                             **kwargs).to_pandas()
    if should_close:
        try:
            path.close()
        except:  # noqa: flake8
            pass

    return result

In pyarrow/parquet.py, this is read_pandas:

def read_pandas(self, **kwargs):
    """
    Read dataset including pandas metadata, if any. Other arguments passed
    through to ParquetDataset.read, see docstring for further details

    Returns
    -------
    pyarrow.Table
        Content of the file as a table (of columns)
    """
    return self.read(use_pandas_metadata=True, **kwargs)  #<-- params being passed

In pyarrow/parquet.py, this is read:

def read(self, columns=None, nthreads=1, use_pandas_metadata=False):  #<-- kwargs param at pyarrow
        """
        Read a Table from Parquet format

        Parameters
        ----------
        columns: list
            If not None, only these columns will be read from the file. A
            column name may be a prefix of a nested field, e.g. 'a' will select
            'a.b', 'a.c', and 'a.d.e'
        nthreads : int, default 1
            Number of columns to read in parallel. If > 1, requires that the
            underlying file source is threadsafe
        use_pandas_metadata : boolean, default False
            If True and file has custom pandas schema metadata, ensure that
            index columns are also loaded

        Returns
        -------
        pyarrow.table.Table
            Content of the file as a table (of columns)
        """
        column_indices = self._get_column_indices(
            columns, use_pandas_metadata=use_pandas_metadata)
        return self.reader.read_all(column_indices=column_indices,
                                    nthreads=nthreads)
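
To make the call chain easier to follow, here is a toy model of it in plain Python. It does not use pandas or pyarrow at all; the names engine_read, engine_read_pandas and toy_read_parquet are made up for illustration and only mimic the signatures quoted above. It shows that a keyword is accepted only if the innermost function names it, so an arbitrary keyword such as gid would be rejected.

```python
def engine_read(columns=None, nthreads=1, use_pandas_metadata=False):
    """Hypothetical stand-in for the innermost read(): named parameters only."""
    return {
        "columns": columns,
        "nthreads": nthreads,
        "use_pandas_metadata": use_pandas_metadata,
    }


def engine_read_pandas(**kwargs):
    """Hypothetical stand-in for read_pandas(): forces the metadata flag on."""
    return engine_read(use_pandas_metadata=True, **kwargs)


def toy_read_parquet(path, columns=None, **kwargs):
    """Hypothetical stand-in for read_parquet(): forwards extra keywords as-is."""
    # `path` is ignored here; the toy model never touches a file.
    return engine_read_pandas(columns=columns, **kwargs)


# A keyword the innermost function knows about passes straight through.
accepted = toy_read_parquet("ignored.pq", columns=["gid"], nthreads=4)

# A keyword nobody declares fails at the bottom of the chain.
try:
    toy_read_parquet("ignored.pq", gid=[100, 103, 105])
    rejected = None
except TypeError as exc:
    rejected = str(exc)

print(accepted)
print(rejected)
```

In this model, whatever the innermost signature accepts is all that the **kwargs pass-through can reach.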

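So, if I understand this correctly, you may be able to access nthreads and use_pandas_metadata through kwargs, but then again, neither is explicitly assigned (??). I haven't tested this, but it might be a start.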
