Python PySpark - Text Analysis / Removing rows if word (value of row) is in a dictionary of stopwords

Question

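I hope someone can help with a simple sentiment analysis in PySpark. I have a PySpark DataFrame in which each row contains a single word. I also have a dictionary of common stopwords (in practice a Python set).

I want to remove the rows where word (the value of the row) is in the stopwords dictionary.

Input: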

+-------+
|  word |
+-------+
|    the|
|   food|
|     is|
|amazing|
|    and|
|  great|
+-------+

stopwords = {'the', 'is', 'and'}

Expected output:

+-------+
|  word |
+-------+
|   food|
|amazing|
|  great|
+-------+

Answer 1

Use a negated isin:

from pyspark.sql import functions as F
df = df.filter(~F.col("word").isin(stop_words))

where stop_words is:

stop_words = {"the", "is", "and"}

Result:

+-------+
|word   |
+-------+
|food   |
|amazing|
|great  |
+-------+

Answer 2

You can create a DataFrame from the set of stopwords, then join it to the input DataFrame with a left_anti join:

stopwords_df = spark.createDataFrame([[w] for w in stopwords], ["word"])

result_df = input_df.join(stopwords_df, ["word"], "left_anti")

result_df.show()
#+-------+
#|   word|
#+-------+
#|amazing|
#|   food|
#|  great|
#+-------+

Tags: python, apache-spark, apache-spark-sql, pyspark