Check if a word is a part of another word

Question

I have two columns containing some text:

text_1	text_2
astro lumen cosm planet	microcosm astronomy planet magnitude

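I need to remove a word from the text_1 column if that word appears in the text_2 column (i.e. it is an exact duplicate) or if it is part of some word in the text_2 column.

Expected output: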

text_1	text_2
lumen	microcosm astronomy planet magnitude

How can I do this in PostgreSQL and/or PySpark?

Answer 1

You can split the first column into an array of words and then filter that array with the filter function, like this:

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("astro lumen cosm planet", "microcosm astronomy planet magnitude")],
    ["text_1", "text_2"]
)

df1 = df.withColumn(
    "text_1",
    F.array_join(
        # Raw string so that \s reaches the regex engine without an invalid-escape warning.
        F.filter(F.split("text_1", r"\s+"), lambda x: ~F.col("text_2").contains(x)),
        " "
    )
)

df1.show(truncate=False)
#+------+------------------------------------+
#|text_1|text_2                              |
#+------+------------------------------------+
#|lumen |microcosm astronomy planet magnitude|
#+------+------------------------------------+

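Note: on Spark versions before 3.1, F.filter is not available in the Python API, so the higher-order function filter has to be written as a SQL expression and passed through expr.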
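
To check the expected result outside Spark, the same split-and-filter logic can be sketched in plain Python. This is only an illustration of the matching rule, not part of the Spark solution; the function name remove_contained_words is hypothetical.

```python
import re


def remove_contained_words(text_1: str, text_2: str) -> str:
    """Drop every word of text_1 that occurs as a substring of text_2."""
    # Split on runs of whitespace, mirroring split("text_1", r"\s+") in the Spark version.
    words = re.split(r"\s+", text_1.strip())
    # A word contains no spaces, so any substring hit in text_2 lies inside
    # a single word of text_2 (an exact duplicate is just the full-word case).
    kept = [w for w in words if w and w not in text_2]
    return " ".join(kept)


print(remove_contained_words(
    "astro lumen cosm planet",
    "microcosm astronomy planet magnitude",
))
# lumen
```

Here astro is dropped because it is part of astronomy, cosm because it is part of microcosm, and planet because it is an exact duplicate.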

Answer 2

Here is one way to do it in SQL:

WITH data AS (
   SELECT 'astro lumen cosm planet' AS needles,
          'microcosm astronomy planet magnitude' AS haystack
)
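-- ORDER BY keeps the surviving words in their original order
SELECT string_agg(needle.n, ' ' ORDER BY needle.ord)
FROM data
   CROSS JOIN LATERAL regexp_split_to_table(data.needles, ' +') WITH ORDINALITY AS needle(n, ord)
-- strpos returns 0 when the needle does not occur in the haystack
WHERE strpos(data.haystack, needle.n) = 0;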

 string_agg 
════════════
 lumen
(1 row)
