How to extract string type dictionary using python pyspark?
This is my dataframe:
Score Features
74.5 {description={termFrequency=4.0, similarityScore=37.8539953, uniqueTokenMatches=4.0},
code={termFrequency=4.0, similarityScore=36.7476063, uniqueTokenMatches=4.0}}
77.64 {description={termFrequency=3.0, similarityScore=36.080687, uniqueTokenMatches=3.0},
code={termFrequency=3.0, similarityScore=34.2332495, uniqueTokenMatches=3.0}}
From the Features column I only want to extract the description dictionary, not the code dictionary. But the Features column is of string type, and I don't want to use substr() to pull it out. How can I do this with Python PySpark?
I want an output dataframe like this:
Score termFrequency similarityScore uniqueTokenMatches
74.5 4.0 37.8539953 4.0
77.64 3.0 36.080687 3.0
This could probably be optimized further, but the general idea of this answer is: extract the part of the string that holds the dictionary you need, split it on delimiters with some cleanup to build an array of structs, then explode and pivot those structs to create the new columns.
Imports:
from pyspark.sql import functions as F
Code:
out = (
    df
    # keep only the "{...}" that follows "description=" (group 1 of the regex)
    .withColumn("Features",
                F.split(
                    F.regexp_replace(
                        F.regexp_extract("Features", r"(?:\{description=)(\{.+}),", 1),
                        r"\{|\}|\s+", ""),  # strip braces and whitespace
                    ","))                   # -> array of "key=value" strings
    # split each "key=value" and build (string, float) structs
    .withColumn("Features", F.expr("""transform(
                    transform(Features, x -> split(x, '=')),
                    y -> struct(y[0], cast(y[1] as float)))"""))
    .selectExpr("Score", "inline(Features)")                # explode structs to rows
    .groupBy("Score").pivot("col1").agg({"col2": "first"})  # keys become columns
)
out.show()
+-----+---------------+-------------+------------------+
|Score|similarityScore|termFrequency|uniqueTokenMatches|
+-----+---------------+-------------+------------------+
| 74.5| 37.853996| 4.0| 4.0|
|77.64| 36.08069| 3.0| 3.0|
+-----+---------------+-------------+------------------+
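An alternative worth considering: the Features string is almost JSON already, so quoting the bare keys (`key=` becomes `"key":`) turns it into valid JSON that `F.from_json` can parse with a `map<string,map<string,float>>` schema, after which only the `description` entry is selected. The plain-Python sketch below just checks that quoting idea locally; the same pattern (with `"$1":` as the replacement) should work in Spark's `regexp_replace`.

```python
import re
import json

# One Features value from the question's dataframe
features = ("{description={termFrequency=4.0, similarityScore=37.8539953, "
            "uniqueTokenMatches=4.0}, code={termFrequency=4.0, "
            "similarityScore=36.7476063, uniqueTokenMatches=4.0}}")

# Quote every bare key ("key=" -> '"key":') so the string becomes valid JSON
as_json = re.sub(r"(\w+)=", r'"\1":', features)

# Parse and keep only the description dictionary, ignoring code entirely
description = json.loads(as_json)["description"]
print(description)
# {'termFrequency': 4.0, 'similarityScore': 37.8539953, 'uniqueTokenMatches': 4.0}
```

In PySpark the equivalent would be something like `F.from_json(F.regexp_replace("Features", r"(\w+)=", '"$1":'), "map<string,map<string,float>>")`, followed by selecting the `description` key's inner fields, which avoids hand-rolling the brace stripping.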