Assign conditional values to columns in Dask

Question

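I am trying to conditionally assign values to the rows of one specific column: target. I did some research, and an earlier thread seemed to give the answer.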

Here is a reproduction of what I need. A mock dataset:

import pandas as pd

x = [3, 0, 3, 4, 0, 0, 0, 2, 0, 0, 0, 6, 9]
y = [200, 300, 400, 215, 219, 360, 280, 396, 145, 276, 190, 554, 355]
mock = pd.DataFrame(dict(target = x, speed = y))

mock looks like this:

In [4]: mock.head(7)
Out [4]:
      speed target
    0   200 3
    1   300 0
    2   400 3
    3   215 4
    4   219 0
    5   360 0
    6   280 0

With this Pandas DataFrame in hand, I convert it to a Dask DataFrame:

import dask.dataframe as dd
mock_dask = dd.from_pandas(mock, npartitions = 2)

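I then apply my conditional rule: every value in target greater than 0 must become 1, and every other value must be 0 (binarizing target). Following the thread mentioned above, it should be: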

result = mock_dask.target.where(mock_dask.target > 0, 1)

I looked at the resulting dataset, and it does not work as expected:

In [7]: result.head(7)
Out [7]:
0    3
1    1
2    3
3    4
4    1
5    1
6    1
Name: target, dtype: object

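Comparing the target column in mock and in result, the output is not what I expected. It looks like my code converts every original 0 to 1, instead of converting the values greater than 0 to 1 (the conditional rule).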

Dask newbie here. Thanks in advance for your help.

Answer 1

pandas and Dask give the same result for me:

In [1]: import pandas as pd

In [2]: x = [1, 0, 1, 1, 0, 0, 0, 2, 0, 0, 0, 6, 9]
   ...: y = [200, 300, 400, 215, 219, 360, 280, 396, 145, 276, 190, 554, 355]
   ...: mock = pd.DataFrame(dict(target = x, speed = y))
   ...: 

In [3]: import dask.dataframe as dd

In [4]: mock_dask = dd.from_pandas(mock, npartitions = 2)

In [5]: mock.target.where(mock.target > 0, 1).head(5)
Out[5]: 
0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [6]: mock_dask.target.where(mock_dask.target > 0, 1).head(5)
Out[6]: 
0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

Answer 2

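OK, the Dask DataFrame API documentation is quite clear. Thanks to @MRocklin's feedback, I realized my mistake. In the documentation, the where function (the last one in the list) is described as follows: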

DataFrame.where(cond[, other])      Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.
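
To make that description concrete, here is a minimal sketch in plain pandas (Dask's where mirrors pandas here, as Answer 1 shows). The series holds the first seven target values from the mock data. The mask call is added only for contrast and is not part of the original answer: it is the inverse of where and replaces values where the condition is True.

```python
import pandas as pd

# First seven target values from the mock dataset in the question.
target = pd.Series([3, 0, 3, 4, 0, 0, 0], name="target")

# where(cond, other): keep the original value where cond is True,
# take `other` where cond is False. So the zeros are what get replaced.
kept_positive = target.where(target > 0, 1)

# mask(cond, other) is the inverse: replace the value where cond is True.
binarized = target.mask(target > 0, 1)

print(kept_positive.tolist())  # [3, 1, 3, 4, 1, 1, 1]
print(binarized.tolist())      # [1, 0, 1, 1, 0, 0, 0]
```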

So the correct line of code should be:

result = mock_dask.target.where(mock_dask.target <= 0, 1)

This outputs:

In [7]: result.head(7)
Out [7]:
0    1
1    0
2    1
3    1
4    0
5    0
6    0
Name: target, dtype: int64

This is the expected output.
