How to reset index and find specific id?
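I have an id column that identifies each person (rows with the same id belong to one person). I want the following:
- Currently the id column is not a sequential number but a 10-digit value. How can I reset id to integers, e.g. 1, 2, 3, 4?
For example: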
id col1
12a4 summer
12a4 goest
3b yes
3b No
3b why
4t Hi
Output:
id col1
1 summer
1 goest
2 yes
2 No
2 why
3 Hi
- How can I get the rows where id=2?
In the example above:
id col1
2 yes
2 No
2 why
from pyspark.sql import SparkSession
from pyspark.sql import Window, functions as F
spark = SparkSession.builder.getOrCreate()
data = [
('12a4', 'summer'),
('12a4', 'goest'),
('3b', 'yes'),
('3b', 'No'),
('3b', 'why'),
('4t', 'Hi')
]
df1 = spark.createDataFrame(data, ['id', 'col1'])
df1.show()
# +----+------+
# | id| col1|
# +----+------+
# |12a4|summer|
# |12a4| goest|
# | 3b| yes|
# | 3b| No|
# | 3b| why|
# | 4t| Hi|
# +----+------+
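# Build a lookup table: one row per distinct id, numbered 1, 2, 3, ... in sorted id order.
# A window without partitionBy moves all rows to a single partition,
# which is why it is applied to the distinct ids only, not to the full data.
df = df1.select('id').distinct()
df = df.withColumn('new_id', F.row_number().over(Window.orderBy('id')))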
df.show()
# +----+------+
# | id|new_id|
# +----+------+
# |12a4| 1|
# | 3b| 2|
# | 4t| 3|
# +----+------+
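# Attach new_id to every original row. Every id exists on both sides,
# so an inner join would return the same rows as this full join.
df = df.join(df1, 'id', 'full')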
df.show()
# +----+------+------+
# | id|new_id| col1|
# +----+------+------+
# |12a4| 1|summer|
# |12a4| 1| goest|
# | 4t| 3| Hi|
# | 3b| 2| yes|
# | 3b| 2| No|
# | 3b| 2| why|
# +----+------+------+
df = df.drop('id').withColumnRenamed('new_id', 'id')
df.show()
# +---+------+
# | id| col1|
# +---+------+
# | 1|summer|
# | 1| goest|
# | 3| Hi|
# | 2| yes|
# | 2| No|
# | 2| why|
# +---+------+
# Keep only the rows that belong to the person with the new id 2.
df = df.filter(F.col('id') == 2)
df.show()
# +---+----+
# | id|col1|
# +---+----+
# | 2| yes|
# | 2| No|
# | 2| why|
# +---+----+
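To sanity-check the numbering on a small sample without a Spark session, the same logic can be sketched in plain Python. This is only an illustration, not part of the answer above: `renumber_ids` is a hypothetical helper name, and it assumes, as the `Window.orderBy('id')` step does, that new ids are handed out in sorted order of the original string ids.

```python
# Same sample rows as in the PySpark example above.
rows = [
    ('12a4', 'summer'),
    ('12a4', 'goest'),
    ('3b', 'yes'),
    ('3b', 'No'),
    ('3b', 'why'),
    ('4t', 'Hi'),
]


def renumber_ids(rows):
    """Replace each distinct id with 1, 2, 3, ... in sorted id order.

    Hypothetical helper for illustration; mirrors distinct() + row_number().
    """
    distinct_ids = sorted({old_id for old_id, _ in rows})
    mapping = {old_id: new_id for new_id, old_id in enumerate(distinct_ids, start=1)}
    # Original row order is preserved; only the id value changes.
    return [(mapping[old_id], value) for old_id, value in rows]


renumbered = renumber_ids(rows)

# Equivalent of df.filter(F.col('id') == 2).
id_2_rows = [row for row in renumbered if row[0] == 2]

print(renumbered)
print(id_2_rows)
```

In PySpark itself, `F.dense_rank().over(Window.orderBy('id'))` applied directly to `df1` should produce the same numbering without the distinct and join steps, at the cost of running the unpartitioned window over every row.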