How to add a new column to a PySpark DataFrame with dictionary values?
I am trying to add a new column to my existing DataFrame in PySpark. My DataFrame looks like the one below. I tried this with the help of this post.
Fruit
Orange
Orange
Apple
Banana
Apple
The code I tried is this:
from pyspark.sql import functions as F
from itertools import chain
simple_dict = {'Orange': 'OR', 'Apple': 'AP', 'Banana': 'BN'}
mapping_expr = F.create_map([F.lit(x) for x in F.chain(*simple_dict.items())])
def addCols(data):
    data = (data.withColumn('Fruit_code', mapping_expr[data['Fruit']]))
    return data
Expected output:
Fruit Fruit_code
Orange OR
Orange OR
Apple AP
Banana BN
Apple AP
I am getting the error below. I know it is because of the function F, but I don't know how to fix it. Can someone help me?
  File "/myproject/datasets/derived/opportunity_won.py", line 8, in <module>
    mapping_expr = create_map([lit(x) for x in chain(*simple_dict.items())])
  File "/myproject/datasets/derived/opportunity_won.py", line 8, in <listcomp>
    mapping_expr = create_map([lit(x) for x in chain(*simple_dict.items())])
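I have modified your code snippet so that it works. chain comes from itertools, not from pyspark.sql.functions, so it has to be called as chain(...) rather than F.chain(...).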
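from itertools import chain
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Reuse the active session (e.g. in a notebook or the pyspark shell) or start one.
spark = SparkSession.builder.getOrCreate()

simple_dict = {'Orange': 'OR', 'Apple': 'AP', 'Banana': 'BN'}
# chain is the itertools function; it flattens the (key, value) pairs into
# key1, value1, key2, value2, ... which is the order create_map expects.
mapping_expr = F.create_map([F.lit(x) for x in chain(*simple_dict.items())])

def addCols(data):
    # Look up each row's Fruit value in the map column.
    return data.withColumn('Fruit_code', mapping_expr[data['Fruit']])

data = spark.createDataFrame([("Orange",), ("Apple",), ("Banana",)], ("Fruit",))
new_data = addCols(data)
new_data.show()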
Output:
+------+----------+
| Fruit|Fruit_code|
+------+----------+
|Orange|        OR|
| Apple|        AP|
|Banana|        BN|
+------+----------+
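In case it helps to see why chain is needed at all, here is a small plain-Python sketch (no Spark required) of what chain(*simple_dict.items()) produces. create_map takes its arguments as alternating keys and values, and chain flattens the dictionary's (key, value) pairs into exactly that order. The names flat and flat_alt are only for illustration and are not part of the code above.

```python
from itertools import chain

simple_dict = {'Orange': 'OR', 'Apple': 'AP', 'Banana': 'BN'}

# items() yields (key, value) pairs; chain(*pairs) flattens them into one
# alternating sequence: key1, value1, key2, value2, ...
flat = list(chain(*simple_dict.items()))
print(flat)
# ['Orange', 'OR', 'Apple', 'AP', 'Banana', 'BN']

# chain.from_iterable gives the same result without unpacking the pairs
# as separate arguments.
flat_alt = list(chain.from_iterable(simple_dict.items()))
print(flat_alt == flat)
# True
```

Wrapping each element of that flat list in F.lit and passing the list to F.create_map is what builds the map column used for the lookup.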