Pyspark

Question

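I have two different DataFrames of string type in PySpark. The first DataFrame holds single words, while the second holds strings of words, i.e. sentences. I have to check whether the values in the first DataFrame's column occur in the second DataFrame's column. For example, df2: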

    +---+------+-------+-----------------+
    |age|height|   name|        Sentences|
    +---+------+-------+-----------------+
    | 10|    80|  Alice|   'Grace, Sarah'|
    | 15|  null|    Bob|          'Sarah'|
    | 12|  null|    Tom|'Amy, Sarah, Bob'|
    | 13|  null| Rachel|       'Tom, Bob'|
    +---+------+-------+-----------------+

The other DataFrame, df1:

    +-------+
    |  token|
    +-------+
    |  'Ali'|
    |'Sarah'|
    |  'Bob'|
    |  'Bob'|
    +-------+

So, how can I search for each token of df1 in the Sentences column of df2? I need a count for each word, added as a new column in df1.

I have tried this, but it only works for a single word, i.e. not for a complete DataFrame column.

Answer 1

You can use a PySpark udf to create the new column in df1. The problem is that you cannot access a second DataFrame from inside a udf.

As suggested in the referenced question, you can make the sentences available to the udf as a broadcast variable.

Here is a working example:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    # Reuse the active session (or create one when run as a standalone script)
    spark = SparkSession.builder.getOrCreate()

    # Instantiate df2
    cols = ["age", "height", "name", "Sentences"]
    data = [
        (10, 80, "Alice", "Grace, Sarah"),
        (15, None, "Bob", "Sarah"),
        (12, None, "Tom", "Amy, Sarah, Bob"),
        (13, None, "Rachel", "Tom, Bob"),
    ]

    df2 = spark.createDataFrame(data).toDF(*cols)

    # Instantiate df1
    cols = ["token"]
    data = [
        ("Ali",),
        ("Sarah",),
        ("Bob",),
        ("Bob",),
    ]

    df1 = spark.createDataFrame(data).toDF(*cols)

    # Collect the Sentences column of df2 to the driver and broadcast it
    lstSentences = [row[0] for row in df2.select('Sentences').collect()]
    sentences = spark.sparkContext.broadcast(lstSentences)

    def countWordInSentence(word):
        # Count the sentences that contain the word as a substring,
        # reading the list through the broadcast variable
        return sum(1 for item in sentences.value if word in item)

    func_udf = udf(countWordInSentence, IntegerType())
    df1 = df1.withColumn("COUNT",
                         func_udf(df1["token"]))
    df1.show()
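
As a quick sanity check, the counting logic of the udf can be run in plain Python, without a Spark session. This is only a sketch: `count_word_in_sentences` is a hypothetical helper name, and the two lists simply mirror the sample data of the question. It shows that the `in` test is a substring match, so a token is counted once for every sentence that contains it anywhere.

```python
# Plain-Python sketch of what the udf computes (no Spark required).
# The lists mirror the Sentences column of df2 and the token column of df1.
sentences = ["Grace, Sarah", "Sarah", "Amy, Sarah, Bob", "Tom, Bob"]
tokens = ["Ali", "Sarah", "Bob", "Bob"]


def count_word_in_sentences(word, sentence_list):
    # Count the sentences that contain the word as a substring
    return sum(1 for sentence in sentence_list if word in sentence)


counts = [(token, count_word_in_sentences(token, sentences)) for token in tokens]
for token, count in counts:
    print(token, count)
# Prints:
# Ali 0
# Sarah 3
# Bob 2
# Bob 2
```

Because the match is on substrings, a shorter token such as 'Sar' would also be counted for every sentence that contains 'Sarah'.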

Answer 2

Consider the DataFrames from the previous answer:

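    from pyspark.sql.functions import explode, split, length, trim

    # One row per comma-separated name in Sentences
    df3 = df2.select('Sentences', explode(split('Sentences', ',')).alias('friends'))
    # Strip the spaces left around each name after the split
    df3 = df3.withColumn("friends", trim("friends")).withColumn("length_of_friends", length("friends"))
    df3.show()

    # Join against the distinct tokens so that the duplicate 'Bob' row in df1
    # does not double the count; this matches whole names only, not substrings
    tokens = df1.select('token').distinct()
    df3 = df3.join(tokens, tokens.token == df3.friends, how='inner').groupby('friends').count()
    df3.show()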
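
The split, trim, join and count steps can also be traced in plain Python. The sketch below is an illustration under assumptions, not a replacement for the Spark code: the variable names are hypothetical and the lists mirror the sample data of the question. It shows that this approach matches whole names only, and that an inner join yields one row per matching pair, so a token that appears twice in df1 is counted twice unless the tokens are de-duplicated first.

```python
from collections import Counter

# Sample data mirroring df2.Sentences and df1.token
sentences = ["Grace, Sarah", "Sarah", "Amy, Sarah, Bob", "Tom, Bob"]
tokens = ["Ali", "Sarah", "Bob", "Bob"]

# explode(split(...)) followed by trim: one cleaned name per element
friends = [name.strip() for sentence in sentences for name in sentence.split(",")]

# Inner join on equality, then a count per name:
# one joined row for every (friend, token) pair that is equal
joined_counts = Counter(f for f in friends for t in tokens if f == t)

# The same join against de-duplicated tokens
distinct_counts = Counter(f for f in friends for t in set(tokens) if f == t)

print(dict(joined_counts))    # {'Sarah': 3, 'Bob': 4}
print(dict(distinct_counts))  # {'Sarah': 3, 'Bob': 2}
```

'Ali' is absent from both results because an inner join drops tokens that match no name.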

Pyspark - Find sub-string from a column of data-frame with another data-frame

apache-spark

apache-spark-sql