Convert string list to binary list in pyspark
I have a dataframe like this:
data = [(("ID1", ['October', 'September', 'August'])), (("ID2", ['August', 'June', 'May'])),
(("ID3", ['October', 'June']))]
df = spark.createDataFrame(data, ["ID", "MonthList"])
df.show(truncate=False)
+---+----------------------------+
|ID |MonthList |
+---+----------------------------+
|ID1|[October, September, August]|
|ID2|[August, June, May] |
|ID3|[October, June] |
+---+----------------------------+
I want to compare each row against a default list, so that each default value gets a 1 if it is present in the row and a 0 otherwise:
default_month_list = ['October', 'September', 'August', 'July', 'June', 'May']
So my expected output looks like this:
+---+----------------------------+------------------+
|ID |MonthList |Binary_MonthList |
+---+----------------------------+------------------+
|ID1|[October, September, August]|[1, 1, 1, 0, 0, 0]|
|ID2|[August, June, May] |[0, 0, 1, 0, 1, 1]|
|ID3|[October, June] |[1, 0, 0, 0, 1, 0]|
+---+----------------------------+------------------+
I can do this in plain Python, but I don't know how to do it in pyspark.
You can try using a udf like this:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, IntegerType
default_month_list = ['October', 'September', 'August', 'July', 'June', 'May']
def_month_list_func = udf(lambda x: [1 if i in x else 0 for i in default_month_list], ArrayType(IntegerType()))
df = df.withColumn("Binary_MonthList", def_month_list_func(col("MonthList")))
df.show()
# output
+---+--------------------+------------------+
| ID| MonthList| Binary_MonthList|
+---+--------------------+------------------+
|ID1|[October, Septemb...|[1, 1, 1, 0, 0, 0]|
|ID2| [August, June, May]|[0, 0, 1, 0, 1, 1]|
|ID3| [October, June]|[1, 0, 0, 0, 1, 0]|
+---+--------------------+------------------+
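The lambda inside the udf is ordinary Python, so its logic can be checked without a Spark session. The sketch below is one way to write it as a named function; `encode_months` is a hypothetical name, and treating a null array as "no months present" is an assumption, not something the question specifies.

```python
DEFAULT_MONTH_LIST = ['October', 'September', 'August', 'July', 'June', 'May']


def encode_months(month_list, default_months=DEFAULT_MONTH_LIST):
    """Return a list of 0/1 flags aligned with default_months."""
    # Assumption: a null array (None on the Python side) counts as empty.
    present = set(month_list or [])
    # One flag per default month, in the order of the default list.
    return [1 if month in present else 0 for month in default_months]


rows = {
    "ID1": ['October', 'September', 'August'],
    "ID2": ['August', 'June', 'May'],
    "ID3": ['October', 'June'],
}
encoded = {row_id: encode_months(months) for row_id, months in rows.items()}
for row_id, flags in encoded.items():
    print(row_id, flags)
```

If this behaves as wanted, the function could be wrapped in place of the lambda, for example udf(encode_months, ArrayType(IntegerType())).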
The answer above is perfectly fine. I am just posting a more general solution that works without a udf and does not require you to know the possible values in advance.
A CountVectorizer does exactly what you want. This algorithm adds all distinct values to its vocabulary as long as they fulfill certain criteria (e.g. minimum or maximum occurrence). You can apply the fitted model to a dataframe and it will return a sparse vector column, encoded as 0/1 flags, that represents the items of the given input column.
from pyspark.ml.feature import CountVectorizer
data = [(("ID1", ['October', 'September', 'August']))
, (("ID2", ['August', 'June', 'May', 'August']))
, (("ID3", ['October', 'June']))]
df = spark.createDataFrame(data, ["ID", "MonthList"])
df.show(truncate=False)
# binary=True only records whether a vocabulary item is present, not how often it occurs
# vocabSize defines the maximum size of the vocabulary
# minDF=1.0 is the minimum number of rows a value must appear in to be added to the vocabulary (1.0 means one row is enough)
cv = CountVectorizer(inputCol="MonthList", outputCol="Binary_MonthList", vocabSize=12, minDF=1.0, binary=True)
cvModel = cv.fit(df)
df = cvModel.transform(df)
df.show(truncate=False)
cvModel.vocabulary
Output:
+---+----------------------------+
|ID | MonthList |
+---+----------------------------+
|ID1|[October, September, August]|
|ID2| [August, June, May, August]|
|ID3| [October, June] |
+---+----------------------------+
+---+----------------------------+-------------------------+
|ID | MonthList | Binary_MonthList |
+---+----------------------------+-------------------------+
|ID1|[October, September, August]|(5,[1,2,3],[1.0,1.0,1.0])|
|ID2|[August, June, May, August] |(5,[0,1,4],[1.0,1.0,1.0])|
|ID3|[October, June] | (5,[0,2],[1.0,1.0]) |
+---+----------------------------+-------------------------+
['June', 'August', 'October', 'September', 'May']
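Note that the sparse vectors above are indexed by the fitted vocabulary, not by default_month_list, and 'July' is missing because it never occurs in the data. If the column order of the question is needed, the indices have to be mapped back through the vocabulary. The following is a minimal plain-Python sketch of that mapping, using the values copied from the output above; `sparse_to_default_order` is a hypothetical helper name, not part of the pyspark API.

```python
default_month_list = ['October', 'September', 'August', 'July', 'June', 'May']

# Vocabulary and sparse vectors (size, indices, values) copied from the output above.
vocabulary = ['June', 'August', 'October', 'September', 'May']
sparse_rows = {
    "ID1": (5, [1, 2, 3], [1.0, 1.0, 1.0]),
    "ID2": (5, [0, 1, 4], [1.0, 1.0, 1.0]),
    "ID3": (5, [0, 2], [1.0, 1.0]),
}


def sparse_to_default_order(indices, vocabulary, default_months):
    """Map sparse-vector indices to 0/1 flags ordered like default_months."""
    # Look up which vocabulary terms the active indices refer to.
    present = {vocabulary[i] for i in indices}
    # Months absent from the vocabulary (here 'July') simply get a 0.
    return [1 if month in present else 0 for month in default_months]


binary_lists = {
    row_id: sparse_to_default_order(indices, vocabulary, default_month_list)
    for row_id, (_size, indices, _values) in sparse_rows.items()
}
for row_id, flags in binary_lists.items():
    print(row_id, flags)
```

This reproduces the lists from the expected output in the question, which suggests the CountVectorizer result carries the same information, just in a different order and encoding.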
How about using array_contains():
from pyspark.sql.functions import array, array_contains
df.withColumn('Binary_MonthList', array([array_contains('MonthList', c).astype('int') for c in default_month_list])).show()
+---+--------------------+------------------+
| ID| MonthList| Binary_MonthList|
+---+--------------------+------------------+
|ID1|[October, Septemb...|[1, 1, 1, 0, 0, 0]|
|ID2| [August, June, May]|[0, 0, 1, 0, 1, 1]|
|ID3| [October, June]|[1, 0, 0, 0, 1, 0]|
+---+--------------------+------------------+