PySpark column transformation

I have two predefined lists, as follows.

East = ["Bengal", "Bihar", "Assam"]
West = ["Bombay", "Gujarat", "Goa"]

I have a PySpark DataFrame, shown below. I need to add a third column (State) to it by looking up the name in the second column (City) in these lists.

df:

Num    City     
1      Bengal   
2      Goa      
3      Bombay   
4      Bihar    

Expected output:

Num    City     State
1      Bengal   East
2      Goa      West
3      Bombay   West
4      Bihar    East

So far I have only managed to do this in pandas. Because the dataset is large, I am trying to convert the code to PySpark. Thanks.

The pandas code is as follows:

def map_state(name):
    East = ["Bengal", "Bihar", "Assam"]
    West = ["Bombay", "Gujarat", "Goa"]
    if name in East:
        return 'East'
    elif name in West:
        return 'West'
    else:
        # Names found in neither list are returned unchanged.
        return name

df['State'] = df['City'].apply(map_state)

You can use the isin function together with when and otherwise.

East = ["Bengal", "Bihar", "Assam"]
West = ["Bombay", "Gujarat", "Goa"]

from pyspark.sql.functions import when, col

df.withColumn(
    "state",
    when(col("City").isin(East), "East")
    .when(col("City").isin(West), "West")
    .otherwise(None),  # cities in neither list become null
).show()

+---+------+-----+
|Num|  City|state|
+---+------+-----+
|  1|Bengal| East|
|  2|   Goa| West|
|  3|Bombay| West|
|  4| Bihar| East|
+---+------+-----+
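
Note that the pandas function in the question returns the city name itself when it is in neither list, whereas otherwise(None) yields null. If there are many regions, or the fallback matters, one alternative is a small lookup table joined onto the DataFrame. The following is only a sketch under assumptions: REGIONS, invert_regions and add_state are hypothetical names that do not appear in the original post, the caller is assumed to pass in an existing SparkSession as spark, and the region lists are assumed to be non-empty.

```python
# Region -> cities, copied from the two lists in the question.
REGIONS = {
    "East": ["Bengal", "Bihar", "Assam"],
    "West": ["Bombay", "Gujarat", "Goa"],
}


def invert_regions(regions):
    """Flatten {region: [cities]} into a {city: region} lookup dict."""
    return {city: region for region, cities in regions.items() for city in cities}


def add_state(spark, df, regions=REGIONS):
    """Hypothetical helper: add a State column to df via a broadcast join.

    Cities that appear in no list keep their own name, as in the pandas code.
    """
    # Imported here so invert_regions stays usable without Spark installed.
    from pyspark.sql import functions as F

    # One (city, region) row per city; assumes at least one city is listed.
    rows = list(invert_regions(regions).items())
    lookup_df = spark.createDataFrame(rows, ["City", "Region"])

    # Left join keeps every row of df; broadcast hints that lookup_df is small.
    joined = df.join(F.broadcast(lookup_df), on="City", how="left")

    # Unmatched cities have a null Region after the left join, so fall back to City.
    return joined.withColumn(
        "State", F.coalesce(F.col("Region"), F.col("City"))
    ).drop("Region")
```

With the DataFrame from the question, add_state(spark, df).show() should give the same East/West labels as the when/isin version. The join may change the column order, so add a select afterwards if the order matters.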