How can I add a consecutive 'Ident' column to a DataFrame in PySpark, instead of using monotonically_increasing_id()?

Question

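I have a DataFrame 'df' and I want to add a numeric 'Ident' column whose values are consecutive. I tried monotonically_increasing_id(), but the values it produces are not consecutive. As its documentation says: "The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive."

So my question is: how can I do this?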

Answer 1

You can try something like this:

df = df.rdd.zipWithIndex().map(lambda x: [x[1]] + [y for y in x[0]]).toDF(['Ident']+df.columns)

This gives you the identifier as the first column, with consecutive values from 0 to N-1, where N is the total number of records in df.
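To see why one approach leaves gaps and the other does not, here is a small pure-Python model that runs without Spark. It is only a sketch: the helper names monotonic_ids and zip_with_index and the sample partitions are made up for illustration, and the bit layout (partition index in the upper bits, record position in the lower 33 bits) is the one the Spark documentation describes for the current implementation of monotonically_increasing_id().

```python
from itertools import accumulate


def monotonic_ids(partitions):
    """Model of monotonically_increasing_id(): partition index shifted into
    the upper bits, position within the partition in the lower 33 bits."""
    return [
        (p << 33) + i
        for p, part in enumerate(partitions)
        for i, _ in enumerate(part)
    ]


def zip_with_index(partitions):
    """Model of RDD.zipWithIndex(): count every partition first, then number
    rows starting from the running total of the earlier partitions."""
    sizes = [len(part) for part in partitions]
    starts = [0] + list(accumulate(sizes))[:-1]
    return [
        (row, start + i)
        for part, start in zip(partitions, starts)
        for i, row in enumerate(part)
    ]


# Hypothetical data: five rows spread over three partitions.
partitions = [[("a", 1), ("b", 2)], [("c", 3)], [("d", 4), ("e", 5)]]

ids = monotonic_ids(partitions)
print(ids)   # [0, 1, 8589934592, 17179869184, 17179869185]

# Same reshaping as the lambda in the answer: index first, then the row's fields.
rows = [[x[1]] + [y for y in x[0]] for x in zip_with_index(partitions)]
print(rows)  # [[0, 'a', 1], [1, 'b', 2], [2, 'c', 3], [3, 'd', 4], [4, 'e', 5]]
```

The counting pass is the price of consecutive values: on a real RDD with more than one partition, zipWithIndex() has to run an extra Spark job to learn the partition sizes before it can assign indices, and the resulting order follows the partition order of the RDD.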

ident

continuous

dataframe

pyspark

pyspark-sql