How to reset the index and find a specific id?

I have an id column for each person (rows with the same id belong to the same person). I want the following:

  1. Right now the id column is not a running number but a 10-digit value. How can I reset the id to integers, e.g. 1, 2, 3, 4?

For example:

id     col1
12a4   summer
12a4   goest
3b     yes
3b     No
3b     why
4t     Hi

Output:

id   col1
1    summer
1    goest
2    yes
2    No
2    why
3    Hi

  2. How do I get the rows that correspond to id=2?

In the example above:

id   col1
2    yes
2    No
2    why

from pyspark.sql import SparkSession
from pyspark.sql import Window, functions as F

spark = SparkSession.builder.getOrCreate()

data = [
('12a4', 'summer'),
('12a4', 'goest'),
('3b', 'yes'),
('3b', 'No'),
('3b', 'why'),
('4t', 'Hi')
]
df1 = spark.createDataFrame(data, ['id', 'col1'])
df1.show()
#     +----+------+
#     |  id|  col1|
#     +----+------+
#     |12a4|summer|
#     |12a4| goest|
#     |  3b|   yes|
#     |  3b|    No|
#     |  3b|   why|
#     |  4t|    Hi|
#     +----+------+

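# Number the distinct ids 1, 2, 3, ... in sorted order.
# Note: a window with orderBy but no partitionBy moves all rows to a single partition.
df = df1.select('id').distinct()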
df = df.withColumn('new_id', F.row_number().over(Window.orderBy('id')))
df.show()
#     +----+------+
#     |  id|new_id|
#     +----+------+
#     |12a4|     1|
#     |  3b|     2|
#     |  4t|     3|
#     +----+------+

# Attach the new integer id to every original row.
df = df.join(df1, 'id', 'full')
df.show()
#     +----+------+------+
#     |  id|new_id|  col1|
#     +----+------+------+
#     |12a4|     1|summer|
#     |12a4|     1| goest|
#     |  4t|     3|    Hi|
#     |  3b|     2|   yes|
#     |  3b|     2|    No|
#     |  3b|     2|   why|
#     +----+------+------+

# Replace the original string id with the integer id.
df = df.drop('id').withColumnRenamed('new_id', 'id')
df.show()
#     +---+------+
#     | id|  col1|
#     +---+------+
#     |  1|summer|
#     |  1| goest|
#     |  3|    Hi|
#     |  2|   yes|
#     |  2|    No|
#     |  2|   why|
#     +---+------+

# Keep only the rows that belong to id 2.
df = df.filter(F.col('id') == 2)
df.show()
#     +---+----+
#     | id|col1|
#     +---+----+
#     |  2| yes|
#     |  2|  No|
#     |  2| why|
#     +---+----+
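
To check the renumbering logic on a small sample without a Spark session, here is a minimal plain-Python sketch of the same idea: give each distinct id its 1-based position in sorted order, then filter on the new id. The helper name `renumber_ids` and the tuple layout are assumptions made for illustration only. They are not part of the PySpark code above, and this is not meant as a replacement for it on large data.

```python
def renumber_ids(rows):
    """Replace each id with a 1-based integer, numbered in sorted id order."""
    # Map every distinct id to its position in sorted order (1, 2, 3, ...).
    distinct_ids = sorted({old_id for old_id, _ in rows})
    new_id = {old_id: n for n, old_id in enumerate(distinct_ids, start=1)}
    # Rows keep their original order; only the id value changes.
    return [(new_id[old_id], value) for old_id, value in rows]


rows = [
    ('12a4', 'summer'),
    ('12a4', 'goest'),
    ('3b', 'yes'),
    ('3b', 'No'),
    ('3b', 'why'),
    ('4t', 'Hi'),
]

renumbered = renumber_ids(rows)

# Select the rows whose new id is 2.
id_2_rows = [row for row in renumbered if row[0] == 2]
print(id_2_rows)
```

Rows sharing an id get the same integer, which is what a dense rank does. In PySpark, `F.dense_rank().over(Window.orderBy('id'))` applied directly to `df1` should give the same numbering without the distinct/join round trip, though I have not run that variant against this data.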