pyspark column transformation
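I have two predefined lists, as follows.
East = ["Bengal", "Bihar", "Assam"]
West = ["Bombay", "Gujarat", "Goa"]
I have a pyspark dataframe like the one below. I need to add a third column (State) to the dataframe by looking up the name in the second column (City) in these lists.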
df:
Num City
1 Bengal
2 Goa
3 Bombay
4 Bihar
Expected output:
Num City State
1 Bengal East
2 Goa West
3 Bombay West
4 Bihar East
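Thanks
So far I have only managed to do this in pandas, as shown below. Because the dataset is large, I am trying to convert it to pyspark. Thanks
The pandas code is as follows: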
def map_state(name):
    # Return the region for a listed city; otherwise return the name unchanged.
    East = ["Bengal", "Bihar", "Assam"]
    West = ["Bombay", "Gujarat", "Goa"]
    if name in East:
        return 'East'
    elif name in West:
        return 'West'
    else:
        return name

df['State'] = df['City'].apply(map_state)
You can use the isin function.
East = ["Bengal", "Bihar", "Assam"]
West = ["Bombay", "Gujarat", "Goa"]
from pyspark.sql.functions import when, col
df.withColumn("state", when(col("City").isin(East), "East")\
.when(col("City").isin(West), "West").otherwise(None)).show()
+---+------+-----+
|Num|  City|state|
+---+------+-----+
|  1|Bengal| East|
|  2|   Goa| West|
|  3|Bombay| West|
|  4| Bihar| East|
+---+------+-----+
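One difference worth noting: the answer's otherwise(None) gives null for a city that is in neither list, while the pandas map_state returns the city name itself. Below is a minimal sketch of one way to keep the pandas fallback. The helper name with_state, the regions dictionary, and the default column names are assumptions made for illustration and are not part of the original answer.

```python
# Region lists from the question, keyed by the label each should produce.
East = ["Bengal", "Bihar", "Assam"]
West = ["Bombay", "Gujarat", "Goa"]
regions = {"East": East, "West": West}


def with_state(df, regions, city_col="City", state_col="State"):
    """Hypothetical helper: add state_col, falling back to the city name."""
    # Imported inside the function so the definitions above load without Spark.
    from pyspark.sql import functions as F

    expr = None
    for state, cities in regions.items():
        matches = F.col(city_col).isin(cities)
        # Chain one when(...) per region, in dictionary order.
        expr = F.when(matches, state) if expr is None else expr.when(matches, state)

    if expr is None:
        # No regions given: every row simply keeps its city name.
        return df.withColumn(state_col, F.col(city_col))
    # Cities in no list keep their own name instead of becoming null.
    return df.withColumn(state_col, expr.otherwise(F.col(city_col)))
```

Assuming df is the Spark dataframe from the question, with_state(df, regions).show() should print the same rows as above, with the new column named State. A city outside both lists would show its own name in that column.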