PySpark DataFrame: search and replace multiple values
Below are my DataFrames. I want to find each character/string listed in df1 and replace it in df2 with the corresponding value, using PySpark.
df1.show()
+--------+----------+
|find |replace |
+--------+----------+
| ( | aa |
| ) | bb |
| """" | cc |
| ' | dd |
| , | ee |
| . | ff |
| - | gg |
| — | ii |
| man | manual |
| sunday| holiday |
+--------+----------+
df2.show()
+------------------+
|Name |
+------------------+
| a,b. |
| check) |
| v(alue-1 |
| ra'in |
| human be(ing |
|OP.86-1_0743 test |
+------------------+
Desired output:
df2.show()
+-----------------+---------------+
|Name |Replaced_Name |
+-----------------+---------------+
| a,b. | aeebff |
| check) | checkbb |
| v(alue-1 | vaaaluegg1 |
| ra'in | raddin |
|human be(ing | humanualbeing |
|OP.86-1_0743 test| OP8610743test |
+-----------------+---------------+
Note: in these examples, I renamed the column find to colfind and the column replace to colreplace.
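Method 1
Recommended when df1 is relatively small, since df1 is collected to the driver; it is also the more reliable of the two approaches. We use a UDF to replace the values: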
from pyspark.sql import functions as F
from pyspark.sql import Window  # used in Method 2

# Collect the (small) lookup table to the driver as a plain dict.
replacement_map = {}
for row in df1.collect():
    replacement_map[row.colfind] = row.colreplace

@F.udf()
def find_and_replace(column_value):
    # Apply every find -> replace pair in turn.
    for colfind in replacement_map:
        column_value = column_value.replace(colfind, replacement_map[colfind])
    return column_value

df2.withColumn("Replaced_Name", find_and_replace(F.col("Name"))).show()
Output:
+-----------------+--------------------+
| Name| Replaced_Name|
+-----------------+--------------------+
| ra'in| raddin|
| check)| checkbb|
| human be(ing| humanual beaaing|
|OP.86-1_0743 test|OPff86gg1ii0743 test|
| a,b.| aeebff|
| v(alue-1| vaaaluegg1|
+-----------------+--------------------+
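Because the loop in the UDF applies the replacements one after another, a later rule could in principle match text produced by an earlier one. The following is a minimal pure-Python sketch of a single-pass alternative, assuming the same mapping as df1 (including the `_` row from the debug output). The helper name `build_replacer` is hypothetical and not part of the original answer; the returned function could be wrapped in `F.udf` in the same way as `find_and_replace`.

```python
import re

# Same find -> replace pairs as df1 (assumed; copied from the tables in this post).
replacement_map = {
    "(": "aa", ")": "bb", '""': "cc", "'": "dd", ",": "ee",
    ".": "ff", "-": "gg", "—": "ii", "_": "ii",
    "man": "manual", "sunday": "holiday",
}

def build_replacer(mapping):
    """Return a function that applies all replacements in a single pass."""
    # Longest keys first, so a longer key wins over a shorter one at the same position.
    keys = sorted(mapping, key=len, reverse=True)
    # re.escape makes characters such as "(" and "." match literally.
    pattern = re.compile("|".join(re.escape(k) for k in keys))

    def replace_all(value):
        if value is None:  # null-safe, in case the column contains nulls
            return None
        return pattern.sub(lambda m: mapping[m.group(0)], value)

    return replace_all

replace_all = build_replacer(replacement_map)
print(replace_all("human be(ing"))
print(replace_all("OP.86-1_0743 test"))
```

Since each position in the input is matched at most once, text inserted by one rule is never rescanned by another.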
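Method 2
Alternatively, you can split the Name column into one row per character, join those rows to df1 to look up the replacements, and then reassemble the string, as shown below:
Note: this approach is better suited to single-character replacements.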
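df_replaced = (
    df2.alias("df2").select(
        F.col("Name"),
        # One row per character, together with its position in the string.
        F.posexplode(F.split("Name", ''))
    ).join(
        df1.alias("df1"),
        on=(
            (
                # The character itself is a key in df1 ...
                F.col("col") == F.col("df1.colfind")
            )
            |
            (
                # ... or a key occurs in Name and starts with this character.
                F.col("Name").contains(F.col("df1.colfind"))
                &
                (F.col("df1.colfind").substr(0, 1) == F.col("col"))
            )
        ),
        how="left"
    )
    .select(
        F.col("Name"),
        # Running concatenation of replaced (or original) characters by position.
        F.concat_ws(
            '',
            F.collect_list(
                F.coalesce(
                    F.col("df1.colreplace"),
                    F.col("col")
                )
            ).over(
                Window.partitionBy("Name").orderBy("pos")
            )
        ).alias("Replaced_Name"),
        F.row_number().over(
            Window.partitionBy("Name").orderBy(F.col("pos").desc())
        ).alias("rn")
    )
    # rn = 1 is the last position, i.e. the fully assembled string.
    .where("rn=1")
    .select("Name", "Replaced_Name")
)
df_replaced.show()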
Output:
+-----------------+--------------------+
|Name |Replaced_Name |
+-----------------+--------------------+
|OP.86-1_0743 test|OPff86gg1ii0743 test|
|a,b. |aeebff |
|check) |checkbb |
|human be(ing |humanualan beaaing |
|ra'in |raddin |
|v(alue-1 |vaaaluegg1 |
+-----------------+--------------------+
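The `humanualan` in this output follows from the join condition: a multi-character key such as `man` is matched only on its first character, so `m` is replaced by `manual` while the remaining `a` and `n` are kept. The following is a rough pure-Python simulation of that per-character lookup, assuming the same df1 mapping. The function name `per_char_replace` is hypothetical, and the sketch uses only the first matching rule for each character.

```python
# Same find -> replace pairs as df1 (assumed; copied from the tables in this post).
replacement_map = {
    "(": "aa", ")": "bb", '""': "cc", "'": "dd", ",": "ee",
    ".": "ff", "-": "gg", "—": "ii", "_": "ii",
    "man": "manual", "sunday": "holiday",
}

def per_char_replace(name, mapping):
    """Simulate Method 2: look up a replacement for each character separately."""
    out = []
    for ch in name:
        replacement = None
        for find, repl in mapping.items():
            # Mirrors the join condition: the character equals the key, or the
            # key occurs somewhere in the name and starts with this character.
            if ch == find or (find in name and find[:1] == ch):
                replacement = repl
                break
        # Mirrors coalesce(colreplace, col): keep the character if nothing matched.
        out.append(replacement if replacement is not None else ch)
    return "".join(out)

print(per_char_replace("human be(ing", replacement_map))
print(per_char_replace("v(alue-1", replacement_map))
```

This is why the approach works cleanly for single-character keys but leaves the tail of a longer key in place.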
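Debug output
The outputs below are shared so that they stay consistent with the latest update to the question at the time of this answer (the OP may change the data used in the question).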
df1 :
+-------+----------+
|colfind|colreplace|
+-------+----------+
| ,| ee|
| .| ff|
| —| ii|
| man| manual|
| )| bb|
| -| gg|
| ""| cc|
| '| dd|
| _| ii|
| (| aa|
| sunday| holiday|
+-------+----------+
df2 :
+-----------------+
| Name|
+-----------------+
| ra'in|
| check)|
| human be(ing|
|OP.86-1_0743 test|
| a,b.|
| v(alue-1|
+-----------------+
Output before aggregation
+-----------------+---+---+----------+-------+--------------------+---+
|Name |pos|col|colreplace|colfind|Replaced_Name |rn |
+-----------------+---+---+----------+-------+--------------------+---+
|OP.86-1_0743 test|0 |O |null |null |O |18 |
|OP.86-1_0743 test|1 |P |null |null |OP |17 |
|OP.86-1_0743 test|2 |. |ff |. |OPff |16 |
|OP.86-1_0743 test|3 |8 |null |null |OPff8 |15 |
|OP.86-1_0743 test|4 |6 |null |null |OPff86 |14 |
|OP.86-1_0743 test|5 |- |gg |- |OPff86gg |13 |
|OP.86-1_0743 test|6 |1 |null |null |OPff86gg1 |12 |
|OP.86-1_0743 test|7 |_ |ii |_ |OPff86gg1ii |11 |
|OP.86-1_0743 test|8 |0 |null |null |OPff86gg1ii0 |10 |
|OP.86-1_0743 test|9 |7 |null |null |OPff86gg1ii07 |9 |
|OP.86-1_0743 test|10 |4 |null |null |OPff86gg1ii074 |8 |
|OP.86-1_0743 test|11 |3 |null |null |OPff86gg1ii0743 |7 |
|OP.86-1_0743 test|12 | |null |null |OPff86gg1ii0743 |6 |
|OP.86-1_0743 test|13 |t |null |null |OPff86gg1ii0743 t |5 |
|OP.86-1_0743 test|14 |e |null |null |OPff86gg1ii0743 te |4 |
|OP.86-1_0743 test|15 |s |null |null |OPff86gg1ii0743 tes |3 |
|OP.86-1_0743 test|16 |t |null |null |OPff86gg1ii0743 test|2 |
|OP.86-1_0743 test|17 | |null |null |OPff86gg1ii0743 test|1 |
|a,b. |0 |a |null |null |a |5 |
|a,b. |1 |, |ee |, |aee |4 |
|a,b. |2 |b |null |null |aeeb |3 |
|a,b. |3 |. |ff |. |aeebff |2 |
|a,b. |4 | |null |null |aeebff |1 |
|check) |0 |c |null |null |c |7 |
|check) |1 |h |null |null |ch |6 |
|check) |2 |e |null |null |che |5 |
|check) |3 |c |null |null |chec |4 |
|check) |4 |k |null |null |check |3 |
|check) |5 |) |bb |) |checkbb |2 |
|check) |6 | |null |null |checkbb |1 |
|human be(ing |0 |h |null |null |h |13 |
|human be(ing |1 |u |null |null |hu |12 |
|human be(ing |2 |m |manual |man |humanual |11 |
|human be(ing |3 |a |null |null |humanuala |10 |
|human be(ing |4 |n |null |null |humanualan |9 |
|human be(ing |5 | |null |null |humanualan |8 |
|human be(ing |6 |b |null |null |humanualan b |7 |
|human be(ing |7 |e |null |null |humanualan be |6 |
|human be(ing |8 |( |aa |( |humanualan beaa |5 |
|human be(ing |9 |i |null |null |humanualan beaai |4 |
|human be(ing |10 |n |null |null |humanualan beaain |3 |
|human be(ing |11 |g |null |null |humanualan beaaing |2 |
|human be(ing |12 | |null |null |humanualan beaaing |1 |
|ra'in |0 |r |null |null |r |6 |
|ra'in |1 |a |null |null |ra |5 |
|ra'in |2 |' |dd |' |radd |4 |
|ra'in |3 |i |null |null |raddi |3 |
|ra'in |4 |n |null |null |raddin |2 |
|ra'in |5 | |null |null |raddin |1 |
|v(alue-1 |0 |v |null |null |v |9 |
|v(alue-1 |1 |( |aa |( |vaa |8 |
|v(alue-1 |2 |a |null |null |vaaa |7 |
|v(alue-1 |3 |l |null |null |vaaal |6 |
|v(alue-1 |4 |u |null |null |vaaalu |5 |
|v(alue-1 |5 |e |null |null |vaaalue |4 |
|v(alue-1 |6 |- |gg |- |vaaaluegg |3 |
|v(alue-1 |7 |1 |null |null |vaaaluegg1 |2 |
|v(alue-1 |8 | |null |null |vaaaluegg1 |1 |
+-----------------+---+---+----------+-------+--------------------+---+
Let me know if this works for you.