In Dask, how does one add a range of integers (auto-increment) to a new column?

Question

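I need to add a column to my Dask dataframe that contains auto-incrementing IDs. I know how to do this in Pandas, since I found a Pandas solution on SO, but I can't work out how to do it in Dask. My best attempt is below; the autoincrement function runs only twice for my 100-line test file, and all the ids are 2.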

def autoincrement(self):
    print('*')
    self.report_line = self.report_line + 1
    return self.report_line

self.df = self.df.map_partitions(
    lambda df: df.assign(raw_report_line=self.autoincrement())
)
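A note on why every id comes out the same: `df.assign(raw_report_line=self.autoincrement())` evaluates `autoincrement()` once per call to the lambda, and pandas then broadcasts that single scalar to every row of the partition. The sketch below reproduces this in plain pandas; the `Counter` class and the three-row frame are hypothetical stand-ins for the asker's object and data. One plausible reading of "runs only twice", not confirmed by the source, is that Dask calls the function once on placeholder data to infer the output schema and once for the single real partition, which would leave the counter at 2.

```python
import pandas as pd


class Counter:
    """Hypothetical stand-in for the asker's object holding report_line."""

    def __init__(self):
        self.report_line = 0

    def autoincrement(self):
        self.report_line = self.report_line + 1
        return self.report_line


counter = Counter()
partition = pd.DataFrame({"value": ["a", "b", "c"]})

# autoincrement() is evaluated once, before assign() runs, so the same
# scalar is broadcast to all rows instead of being called per row.
result = partition.assign(raw_report_line=counter.autoincrement())
print(result)
```

To get a distinct value per row, the new column has to be built from something with one element per row (a range, an array, or a cumulative sum) rather than from a single function call.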

The Pandas way looks like this:

df.insert(0, 'New_ID', range(1, 1 + len(df)))
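The pandas one-liner does not carry over directly, because a Dask dataframe is split into partitions and each partition only knows its own length. The sketch below uses plain pandas, with a list of frames standing in for partitions (an assumption made for illustration, not how Dask stores them), to show that numbering each piece independently restarts at 1, while carrying a running offset keeps the IDs continuous.

```python
import pandas as pd

full = pd.DataFrame({"value": list("abcdefg")})

# Hypothetical stand-ins for three partitions of the same frame.
partitions = [full.iloc[0:3], full.iloc[3:5], full.iloc[5:7]]

# Naive approach: the pandas range applied to each partition on its own.
naive = pd.concat(
    [p.assign(New_ID=range(1, 1 + len(p))) for p in partitions]
)

# Offset approach: start each partition where the previous one ended.
numbered = []
offset = 0
for p in partitions:
    numbered.append(p.assign(New_ID=range(offset + 1, offset + 1 + len(p))))
    offset += len(p)
fixed = pd.concat(numbered)

print(naive["New_ID"].tolist())
print(fixed["New_ID"].tolist())
```

Any Dask solution has to supply that running offset in some form, which is what a cumulative sum over the whole column does.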

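Alternatively, if I could get the line number of each CSV row and add that to a column, that would be great, but at this stage that doesn't seem easy to do.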

Answer 1

You can assign a dummy column of all 1s and take the cumsum:

In [1]: import dask.datasets

In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: df = dask.datasets.timeseries()

In [5]: df
Out[5]:
Dask DataFrame Structure:
                   id    name        x        y
npartitions=30
2000-01-01      int64  object  float64  float64
2000-01-02        ...     ...      ...      ...
...               ...     ...      ...      ...
2000-01-30        ...     ...      ...      ...
2000-01-31        ...     ...      ...      ...
Dask Name: make-timeseries, 30 tasks

In [6]: df['row_number'] = df.assign(partition_count=1).partition_count.cumsum()

In [7]: df.compute()
Out[7]:
                       id      name         x         y  row_number
timestamp
2000-01-01 00:00:00   928     Sarah -0.597784  0.160908           1
2000-01-01 00:00:01  1000     Zelda -0.034756 -0.073912           2
2000-01-01 00:00:02  1028  Patricia -0.962331 -0.458834           3
2000-01-01 00:00:03  1010    Hannah -0.225759 -0.227945           4
2000-01-01 00:00:04   958   Charlie  0.223131 -0.672307           5
...                   ...       ...       ...       ...         ...
2000-01-30 23:59:55  1052     Jerry -0.636159  0.683076     2591996
2000-01-30 23:59:56   973     Quinn -0.575324  0.272144     2591997
2000-01-30 23:59:57  1049     Jerry  0.143286 -0.122490     2591998
2000-01-30 23:59:58   971    Victor -0.866174  0.751534     2591999
2000-01-30 23:59:59   966     Edith -0.718382 -0.333261     2592000

[2592000 rows x 5 columns]

