How to extract string type dictionary using python pyspark?

This is my dataframe:

Score   Features
74.5    {description={termFrequency=4.0, similarityScore=37.8539953, uniqueTokenMatches=4.0}, 
         code={termFrequency=4.0, similarityScore=36.7476063, uniqueTokenMatches=4.0}}
77.64   {description={termFrequency=3.0, similarityScore=36.080687, uniqueTokenMatches=3.0}, 
         code={termFrequency=3.0, similarityScore=34.2332495, uniqueTokenMatches=3.0}}

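In the Features column I only want to extract the description dictionary, not the code dictionary. However, the Features column is of string type, and I don't want to use substr() to extract it. How can I do this with Python and PySpark?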

I want an output dataframe like this:

Score   termFrequency       similarityScore     uniqueTokenMatches
74.5    4.0                 37.8539953          4.0
77.64   3.0                 36.080687           3.0

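This can probably be optimized further, but the general idea of this answer is to extract the part of the string that represents the dictionary you need, split it on the delimiter, and clean it up to build an array of structs. That array is then exploded and pivoted to create the new columns.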

Imports:

from pyspark.sql import functions as F

Code:

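out = (df.withColumn("Features",
              F.split(
              F.regexp_replace(
              # capture the {...} that follows "description="
              F.regexp_extract("Features", r"(?:\{description=)(\{.+}),", 1)
              # strip braces and whitespace
              , r"\{|\}|\s+", "")
              , ",")
             )
# turn each "key=value" string into a struct (col1=key, col2=value);
# float is 32-bit, so cast to double instead to keep every digit
.withColumn("Features",F.expr("""transform(
           transform(Features,x-> split(x,'='))
           ,y->struct(y[0],cast(y[1] as float)))"""))
# inline() emits one row per struct; pivot turns the keys into columns
.selectExpr("Score","inline(Features)")
.groupBy("Score").pivot("col1").agg({"col2":'first'})
)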

out.show()

+-----+---------------+-------------+------------------+
|Score|similarityScore|termFrequency|uniqueTokenMatches|
+-----+---------------+-------------+------------------+
| 74.5|      37.853996|          4.0|               4.0|
|77.64|       36.08069|          3.0|               3.0|
+-----+---------------+-------------+------------------+
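
To check the string handling outside Spark, the same extract, clean, and split steps can be sketched in plain Python on a single value. This is only an illustration: parse_description is a hypothetical helper name, the sample string is the first row from the question, and Python's re module is assumed to treat this particular pattern the same way Spark's Java regex engine does.

```python
import re

# Sample value copied from the first row of the question's Features column.
features = (
    "{description={termFrequency=4.0, similarityScore=37.8539953, uniqueTokenMatches=4.0}, "
    "code={termFrequency=4.0, similarityScore=36.7476063, uniqueTokenMatches=4.0}}"
)


def parse_description(s):
    """Hypothetical helper: return the description entries as a dict of floats."""
    # Step 1: capture the {...} that follows "description=" (same pattern as the answer).
    match = re.search(r"(?:\{description=)(\{.+}),", s)
    if match is None:
        return {}
    # Step 2: strip braces and whitespace.
    cleaned = re.sub(r"\{|\}|\s+", "", match.group(1))
    # Step 3: split on commas, then on "=", to get key/value pairs.
    pairs = (item.split("=") for item in cleaned.split(","))
    return {key: float(value) for key, value in pairs}


parsed = parse_description(features)
print(parsed)
# {'termFrequency': 4.0, 'similarityScore': 37.8539953, 'uniqueTokenMatches': 4.0}
```

Python floats are double precision, so similarityScore keeps all of its digits here, whereas the cast to float in the Spark version rounds it to 37.853996.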