How to subtract two string columns in a PySpark DataFrame?
I want to subtract column1 - column2, that is, remove from column1 every substring that matches something in column2, and put the result in a new column, result.
PySpark DataFrame:
+--+-------------------------+--------------------------+--------------+
|ID| column1 | column2 | result |
+--+-------------------------+--------------------------+--------------+
|1 | Hi how are you fine but | Hi I am fine how about u | are you but |
|2 | javascript python XML | python XML | javascript |
|3 | include all the inform | include inform | all the |
+--+-------------------------+--------------------------+--------------+
You can split both columns on spaces and use array_except to remove from column1 every word that is present in column2:
from pyspark.sql import functions as F

df1 = df.withColumn(
    "result",
    F.array_join(
        F.array_except(F.split("column1", " "), F.split("column2", " ")),
        " "
    )
)

df1.show(truncate=False)
#+---+-----------------------+------------------------+-----------+
#|ID |column1 |column2 |result |
#+---+-----------------------+------------------------+-----------+
#|1 |Hi how are you fine but|Hi I am fine how about u|are you but|
#|2 |javascript python XML |python XML |javascript |
#|3 |include all the inform |include inform |all the |
#+---+-----------------------+------------------------+-----------+
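To make the per-row behaviour easier to reason about, here is a plain-Python sketch of what the split / array_except / array_join chain does to a single row. The helper name subtract_words is hypothetical and is not part of PySpark; it only mirrors the logic under two assumptions: words are separated by a single space, and the result keeps the order of first appearance, as in the output above. It also makes two caveats visible: matching is on whole words rather than arbitrary substrings, and array_except returns its result without duplicates.

```python
def subtract_words(text, remove, sep=" "):
    """Hypothetical helper: mimic split + array_except + array_join for one row."""
    blocked = set(remove.split(sep))  # words from column2
    seen = set()                      # array_except also de-duplicates its result
    kept = []
    for word in text.split(sep):
        # Keep a word only if it is absent from column2 and not already kept.
        if word not in blocked and word not in seen:
            seen.add(word)
            kept.append(word)
    return sep.join(kept)


# The three rows from the question.
rows = [
    ("Hi how are you fine but", "Hi I am fine how about u"),
    ("javascript python XML", "python XML"),
    ("include all the inform", "include inform"),
]
results = [subtract_words(c1, c2) for c1, c2 in rows]
print(results)
```

Two consequences worth checking against your data: subtract_words("include all the information", "include inform") leaves "information" untouched because only whole words are compared, and subtract_words("a b a c", "c") gives "a b" because the repeated "a" is collapsed. If repeated words must be preserved, a different approach would be needed.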