Trying to drop values by column (I convert these values to NaN, but they could be anything) not working

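I am trying to drop NA values by column in Dask, given a certain threshold, but I get an error when I run the example below.

As far as I can tell this should work. Please advise.

Reproducible example: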

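import numpy as np
import pandas as pd
import dask.dataframe as dd

data = [['tom', 10], ['nick', 15], ['juli', 5]]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age'])

# Turn one value into NaN so there is something to drop
df = df.replace(5, np.nan)

ddf = dd.from_pandas(df, npartitions=2)

# This is the call that fails: Dask's dropna() does not take an axis argument
ddf.dropna(axis='columns')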

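Dask DataFrames currently do not support passing an axis to dropna. You can also print the function's docstring with ddf.dropna? and it will tell you the same thing: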

Signature: ddf.dropna(how='any', subset=None, thresh=None)
Docstring:
Remove missing values.

This docstring was copied from pandas.core.frame.DataFrame.dropna.

Some inconsistencies with the Dask version may exist.

See the :ref:`User Guide <missing_data>` for more on which values are
considered missing, and how to work with missing data.

Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0  (Not supported in Dask)
    Determine if rows or columns which contain missing values are
    removed.

    * 0, or 'index' : Drop rows which contain missing values.
    * 1, or 'columns' : Drop columns which contain missing value.

    .. versionchanged:: 1.0.0

       Pass tuple or list to drop on multiple axes.
       Only a single axis is allowed.

how : {'any', 'all'}, default 'any'
    Determine if row or column is removed from DataFrame, when we have
    at least one NA or all NA.

    * 'any' : If any NA values are present, drop that row or column.
    * 'all' : If all values are NA, drop that row or column.

thresh : int, optional
    Require that many non-NA values.
subset : array-like, optional
    Labels along other axis to consider, e.g. if you are dropping rows
    these would be a list of columns to include.
inplace : bool, default False  (Not supported in Dask)
    If True, do operation inplace and return None.

Returns
-------
DataFrame or None
    DataFrame with NA entries dropped from it or None if ``inplace=True``.

It is worth noting that in many cases like this one, the Dask documentation is copied from pandas. Wherever that is the case, it says so explicitly:

This docstring was copied from pandas.core.frame.DataFrame.drop. Some inconsistencies with the Dask version may exist.

So it is best to check the docstring of any Dask function that is derived from pandas, rather than relying on the pandas documentation.

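The reason Dask does not support this is that it would have to compute the entire DataFrame just to know the shape of the result. That is very different from the row-wise case, where the number of columns and the number of partitions do not change, so the operation can be scheduled without doing any work.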

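Dask leaves out some parts of the pandas API that look like ordinary pandas operations which could be ported over, but which cannot actually be scheduled without triggering a compute on the current frame. You are running into this restriction by design: .dropna(axis=0) works fine as a scheduled operation, but .dropna(axis=1) would have very different implications.

You can do this manually with the following: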

ddf[ddf.columns[~ddf.isna().any(axis=0)]]

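But the filtering operation ddf.columns[~ddf.isna().any(axis=0)] will trigger a compute on the entire DataFrame. If the DataFrame fits into your cluster's memory, it may make sense to persist it before running this.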