PySpark DataFrame: search and replace multiple values

Below are my DataFrames. I want to look up each character/string listed in df1 and replace it in the values of df2 using PySpark.

df1.show()

+--------+----------+
|find    |replace   |
+--------+----------+
|  (     |    aa    |
|  )     |    bb    |
|  ""    |    cc    |
|  '     |    dd    |
|  ,     |    ee    |
|  .     |    ff    |
|  -     |    gg    |
|  —     |    ii    |
|  man   |  manual  |
|  sunday| holiday  | 
+--------+----------+

df2.show()

+------------------+
|Name              |
+------------------+
|  a,b.            |
|  check)          |
|  v(alue-1        |
|  ra'in           |
| human be(ing     |
|OP.86-1_0743 test |
+------------------+

Desired output: df2.show()

+-----------------+---------------+
|Name             |Replaced_Name  |
+-----------------+---------------+
|  a,b.           |  aeebff       |
|  check)         |  checkbb      |
|  v(alue-1       |  vaaaluegg1   |
|  ra'in          |  raddin       |
|human be(ing     | humanualbeing |
|OP.86-1_0743 test| OP8610743test |
+-----------------+---------------+

Note: in these examples I renamed the column find to colfind and the column replace to colreplace.

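Approach 1

Recommended when df1 is relatively small, because df1 is collected to the driver; of the two approaches, this one is the more robust. We use a UDF to replace the values: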

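from pyspark.sql import functions as F
from pyspark.sql import Window  # used in the second approach

# Collect the (small) lookup table to the driver as a plain dict
replacement_map = {}
for row in df1.collect():
    replacement_map[row.colfind] = row.colreplace

@F.udf()  # returns a string column by default
def find_and_replace(column_value):
    if column_value is None:  # leave null names untouched
        return None
    # Apply every find/replace pair in turn
    for colfind in replacement_map:
        column_value = column_value.replace(colfind, replacement_map[colfind])
    return column_value

df2.withColumn("Replaced_Name", find_and_replace(F.col("Name"))).show()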

Output:

+-----------------+--------------------+
|             Name|       Replaced_Name|
+-----------------+--------------------+
|            ra'in|              raddin|
|           check)|             checkbb|
|     human be(ing|    humanual beaaing|
|OP.86-1_0743 test|OPff86gg1ii0743 test|
|             a,b.|              aeebff|
|         v(alue-1|          vaaaluegg1|
+-----------------+--------------------+
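
The UDF above applies the pairs one after another, so the result can depend on dictionary order when one replacement produces text that another find string matches. As a hedged alternative (not part of the original answer), here is a minimal pure-Python sketch that does all replacements in a single pass with a regular expression. The map is an illustrative subset of df1 typed in by hand, and the name find_and_replace_once is hypothetical.

```python
import re

# Illustrative subset of the df1 pairs (colfind -> colreplace); an assumption for this sketch.
replacement_map = {
    "(": "aa", ")": "bb", "'": "dd", ",": "ee", ".": "ff",
    "-": "gg", "_": "ii", "man": "manual", "sunday": "holiday",
}

# Skip empty find strings and try longer ones first, so a longer find string
# wins over a shorter one that it starts with.
keys = sorted((k for k in replacement_map if k), key=len, reverse=True)
pattern = re.compile("|".join(re.escape(k) for k in keys))

def find_and_replace_once(value):
    """Replace every find string in one left-to-right pass; replaced text is never rescanned."""
    if value is None:
        return None
    return pattern.sub(lambda m: replacement_map[m.group(0)], value)

print(find_and_replace_once("human be(ing"))
print(find_and_replace_once("OP.86-1_0743 test"))
```

If this behaviour is what you want, the function body could presumably be used inside the UDF in place of the loop, with the map still built from df1.collect().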

Approach 2

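If you split the Name column into rows (one row per character) and join them to your replacement DataFrame, you can accomplish the task as follows: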

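NB. This approach is better suited to single-character replacements. A multi-character find string such as man only has its first character replaced, which is why the output below shows humanualan rather than humanual.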

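df_replaced = (
    df2.alias("df2").select(
        F.col("Name"),
        # One row per character: `pos` is its index, `col` is the character
        F.posexplode(F.split("Name", ''))
    ).join(
        df1.alias("df1"),
        on=(
            # The character itself is a find string ...
            (
                F.col("col") == F.col("df1.colfind")
            )
            |
            # ... or it is the first character of a find string contained in Name
            (
                F.col("Name").contains(F.col("df1.colfind"))
                &
                (F.col("df1.colfind").substr(1, 1) == F.col("col"))  # substr is 1-based
            )
        ),
        how="left"
    )
    .select(
        F.col("Name"),
        # Running concatenation of the replaced characters, in original order
        F.concat_ws(
            '',
            F.collect_list(
                F.coalesce(
                    F.col("df1.colreplace"),  # the replacement, when a match exists
                    F.col("col")              # otherwise the original character
                )
            ).over(
                Window.partitionBy("Name").orderBy("pos")
            )
        ).alias("Replaced_Name"),
        # rn = 1 marks the last character, whose running string is complete
        F.row_number().over(
            Window.partitionBy("Name").orderBy(F.col("pos").desc())
        ).alias("rn")
    )
    .where("rn=1")
    .select("Name", "Replaced_Name")
)

df_replaced.show()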

Output:

+-----------------+--------------------+
|Name             |Replaced_Name       |
+-----------------+--------------------+
|OP.86-1_0743 test|OPff86gg1ii0743 test|
|a,b.             |aeebff              |
|check)           |checkbb             |
|human be(ing     |humanualan beaaing  |
|ra'in            |raddin              |
|v(alue-1         |vaaaluegg1          |
+-----------------+--------------------+
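
To see where humanualan comes from, the join condition can be mimicked in plain Python, one character at a time. This is only a sketch under assumptions: the map is an illustrative subset of df1, simulate_char_join is a hypothetical name, and the loop stops at the first matching pair, whereas the real join would emit one row per matching pair.

```python
# Illustrative subset of the df1 pairs (colfind -> colreplace); an assumption for this sketch.
replacement_map = {
    "(": "aa", ")": "bb", "'": "dd", ",": "ee", ".": "ff",
    "-": "gg", "_": "ii", "man": "manual", "sunday": "holiday",
}

def simulate_char_join(name):
    """Mimic the second approach's join condition for each character of `name`."""
    out = []
    for ch in name:
        replacement = ch  # like coalesce(colreplace, col): keep the character if nothing matches
        for find, repl in replacement_map.items():
            exact = ch == find
            # The find string occurs somewhere in the name and starts with this character
            starts_contained_key = find in name and find[:1] == ch
            if exact or starts_contained_key:
                replacement = repl
                break  # simplification: take the first matching pair only
        out.append(replacement)
    return "".join(out)

print(simulate_char_join("human be(ing"))  # humanualan beaaing
```

The m is swapped for manual, but the following a and n are ordinary characters with no match, so they are kept. The same condition also fires for every m in a name that contains man anywhere, not only for the m that starts the match.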

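Debug output

The outputs below are shared so that this answer stays aligned with the latest update to the question it responds to (i.e. the OP may change the data used in the question).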

df1 :

+-------+----------+
|colfind|colreplace|
+-------+----------+
|      ,|        ee|
|      .|        ff|
|      —|        ii|
|    man|    manual|
|      )|        bb|
|      -|        gg|
|     ""|        cc|
|      '|        dd|
|      _|        ii|
|      (|        aa|
| sunday|   holiday|
+-------+----------+

df2 :

+-----------------+
|             Name|
+-----------------+
|            ra'in|
|           check)|
|     human be(ing|
|OP.86-1_0743 test|
|             a,b.|
|         v(alue-1|
+-----------------+

Output before aggregation (that is, before filtering to rn=1)

+-----------------+---+---+----------+-------+--------------------+---+
|Name             |pos|col|colreplace|colfind|Replaced_Name       |rn |
+-----------------+---+---+----------+-------+--------------------+---+
|OP.86-1_0743 test|0  |O  |null      |null   |O                   |18 |
|OP.86-1_0743 test|1  |P  |null      |null   |OP                  |17 |
|OP.86-1_0743 test|2  |.  |ff        |.      |OPff                |16 |
|OP.86-1_0743 test|3  |8  |null      |null   |OPff8               |15 |
|OP.86-1_0743 test|4  |6  |null      |null   |OPff86              |14 |
|OP.86-1_0743 test|5  |-  |gg        |-      |OPff86gg            |13 |
|OP.86-1_0743 test|6  |1  |null      |null   |OPff86gg1           |12 |
|OP.86-1_0743 test|7  |_  |ii        |_      |OPff86gg1ii         |11 |
|OP.86-1_0743 test|8  |0  |null      |null   |OPff86gg1ii0        |10 |
|OP.86-1_0743 test|9  |7  |null      |null   |OPff86gg1ii07       |9  |
|OP.86-1_0743 test|10 |4  |null      |null   |OPff86gg1ii074      |8  |
|OP.86-1_0743 test|11 |3  |null      |null   |OPff86gg1ii0743     |7  |
|OP.86-1_0743 test|12 |   |null      |null   |OPff86gg1ii0743     |6  |
|OP.86-1_0743 test|13 |t  |null      |null   |OPff86gg1ii0743 t   |5  |
|OP.86-1_0743 test|14 |e  |null      |null   |OPff86gg1ii0743 te  |4  |
|OP.86-1_0743 test|15 |s  |null      |null   |OPff86gg1ii0743 tes |3  |
|OP.86-1_0743 test|16 |t  |null      |null   |OPff86gg1ii0743 test|2  |
|OP.86-1_0743 test|17 |   |null      |null   |OPff86gg1ii0743 test|1  |
|a,b.             |0  |a  |null      |null   |a                   |5  |
|a,b.             |1  |,  |ee        |,      |aee                 |4  |
|a,b.             |2  |b  |null      |null   |aeeb                |3  |
|a,b.             |3  |.  |ff        |.      |aeebff              |2  |
|a,b.             |4  |   |null      |null   |aeebff              |1  |
|check)           |0  |c  |null      |null   |c                   |7  |
|check)           |1  |h  |null      |null   |ch                  |6  |
|check)           |2  |e  |null      |null   |che                 |5  |
|check)           |3  |c  |null      |null   |chec                |4  |
|check)           |4  |k  |null      |null   |check               |3  |
|check)           |5  |)  |bb        |)      |checkbb             |2  |
|check)           |6  |   |null      |null   |checkbb             |1  |
|human be(ing     |0  |h  |null      |null   |h                   |13 |
|human be(ing     |1  |u  |null      |null   |hu                  |12 |
|human be(ing     |2  |m  |manual    |man    |humanual            |11 |
|human be(ing     |3  |a  |null      |null   |humanuala           |10 |
|human be(ing     |4  |n  |null      |null   |humanualan          |9  |
|human be(ing     |5  |   |null      |null   |humanualan          |8  |
|human be(ing     |6  |b  |null      |null   |humanualan b        |7  |
|human be(ing     |7  |e  |null      |null   |humanualan be       |6  |
|human be(ing     |8  |(  |aa        |(      |humanualan beaa     |5  |
|human be(ing     |9  |i  |null      |null   |humanualan beaai    |4  |
|human be(ing     |10 |n  |null      |null   |humanualan beaain   |3  |
|human be(ing     |11 |g  |null      |null   |humanualan beaaing  |2  |
|human be(ing     |12 |   |null      |null   |humanualan beaaing  |1  |
|ra'in            |0  |r  |null      |null   |r                   |6  |
|ra'in            |1  |a  |null      |null   |ra                  |5  |
|ra'in            |2  |'  |dd        |'      |radd                |4  |
|ra'in            |3  |i  |null      |null   |raddi               |3  |
|ra'in            |4  |n  |null      |null   |raddin              |2  |
|ra'in            |5  |   |null      |null   |raddin              |1  |
|v(alue-1         |0  |v  |null      |null   |v                   |9  |
|v(alue-1         |1  |(  |aa        |(      |vaa                 |8  |
|v(alue-1         |2  |a  |null      |null   |vaaa                |7  |
|v(alue-1         |3  |l  |null      |null   |vaaal               |6  |
|v(alue-1         |4  |u  |null      |null   |vaaalu              |5  |
|v(alue-1         |5  |e  |null      |null   |vaaalue             |4  |
|v(alue-1         |6  |-  |gg        |-      |vaaaluegg           |3  |
|v(alue-1         |7  |1  |null      |null   |vaaaluegg1          |2  |
|v(alue-1         |8  |   |null      |null   |vaaaluegg1          |1  |
+-----------------+---+---+----------+-------+--------------------+---+

Let me know if this works for you.