Pyspark - counting particular words in sentences

I have a PySpark dataframe with a column that contains text.

I want to count the number of sentences that contain an exclamation mark "!" together with the word "like" or "want".

For example, a column whose rows contain the following sentences:

I don't like to sing!
I like to go shopping!
I want to go home!
I like fast food. 
you don't want to!
what does he want?

The desired output I am hoping for looks like this (only counting sentences that contain "like" or "want" together with "!"):

+----+-----+
|word|count|
+----+-----+
|like|   2 |
|want|   2 |
+----+-----+

Can someone help me write a UDF that can do this? Here is what I have so far, but I can't seem to get it to work.

from nltk.tokenize import sent_tokenize

def convert_a_sentence(a_string):
    # lower-case and split the text into sentences
    sentences = [s.lower() for s in sent_tokenize(a_string)]
    return sentences

df = df.withColumn('a_sentence', convert_a_sentence(df['text']))

df.select(explode('a_sentence').alias('found')).filter(df['a_sentence'].isin('like', 'want', '!')).groupBy('found').count().collect()

If all you want is uni-grams (i.e. single tokens), you can split each sentence on spaces, then explode, group by, count, and filter for the words you want:

from pyspark.sql import functions as F

(df
    .withColumn('words', F.split('sentence', ' '))  # split each sentence on spaces
    .withColumn('word', F.explode('words'))         # one row per token
    .groupBy('word')
    .agg(
        F.count('*').alias('word_cnt')
    )
    .where(F.col('word').isin(['like', 'want']))
    .show()
)

# Output
# +----+--------+
# |word|word_cnt|
# +----+--------+
# |want|       2|
# |like|       3|
# +----+--------+

Note #1: you can apply a filter before the groupBy, using the contains function, as shown in the sketch below.
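
For example, a minimal sketch of that filter (my addition, assuming the text column is named sentence as above):

from pyspark.sql import functions as F

# Keep only rows whose sentence contains '!', then split, explode, count,
# and restrict to the target words.
(df
    .where(F.col('sentence').contains('!'))
    .withColumn('word', F.explode(F.split('sentence', ' ')))
    .groupBy('word')
    .agg(F.count('*').alias('word_cnt'))
    .where(F.col('word').isin(['like', 'want']))
    .show()
)

# On the example data this should give like = 2 and want = 2.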

Note #2: if you want to do n-grams rather than the "hack" above, you can consider using the Spark ML Tokenizer:

from pyspark.ml.feature import Tokenizer

tokenizer = Tokenizer(inputCol='sentence', outputCol="words")
tokenized = tokenizer.transform(df)

# Output
# +----------------------+----------------------------+
# |sentence              |words                       |
# +----------------------+----------------------------+
# |I don't like to sing! |[i, don't, like, to, sing!] |
# |I like to go shopping!|[i, like, to, go, shopping!]|
# |I want to go home!    |[i, want, to, go, home!]    |
# |I like fast food.     |[i, like, fast, food.]      |
# |you don't want to!    |[you, don't, want, to!]     |
# |what does he want?    |[what, does, he, want?]     |
# +----------------------+----------------------------+

NGram

from pyspark.ml.feature import NGram

ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
ngramed = ngram.transform(tokenized)

# Output
# +----------------------+----------------------------+----------------------------------------+
# |sentence              |words                       |ngrams                                  |
# +----------------------+----------------------------+----------------------------------------+
# |I don't like to sing! |[i, don't, like, to, sing!] |[i don't, don't like, like to, to sing!]|
# |I like to go shopping!|[i, like, to, go, shopping!]|[i like, like to, to go, go shopping!]  |
# |I want to go home!    |[i, want, to, go, home!]    |[i want, want to, to go, go home!]      |
# |I like fast food.     |[i, like, fast, food.]      |[i like, like fast, fast food.]         |
# |you don't want to!    |[you, don't, want, to!]     |[you don't, don't want, want to!]       |
# |what does he want?    |[what, does, he, want?]     |[what does, does he, he want?]          |
# +----------------------+----------------------------+----------------------------------------+
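
From there, one possible continuation (just a sketch, not part of the output above) is to explode the ngrams column and count the bigrams that mention the target words:

from pyspark.sql import functions as F

# One row per bigram, keeping only bigrams that mention 'like' or 'want'
(ngramed
    .withColumn('ngram', F.explode('ngrams'))
    .where(F.col('ngram').rlike('like|want'))
    .groupBy('ngram')
    .agg(F.count('*').alias('ngram_cnt'))
    .show(truncate=False)
)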

I'm not sure about the pandas or pyspark approach, but you can do this easily with the sent_tokenize function from nltk:

from nltk.tokenize import sent_tokenize

t = """
I don't like to sing!
I like to go shopping!
I want to go home!
I like fast food. 
you don't want to!
what does he want?
"""
sentences = [s.lower() for s in sent_tokenize(t)]
for sentence in sentences:
  if "!" in sentence and "like" in sentence:
    print(f"found in {sentence}")

You should be able to figure out how to count it and put it in a table... for example, something like the sketch below.
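
A minimal sketch of that counting step, reusing t from above and a collections.Counter (my addition, not part of the answer):

from collections import Counter
from nltk.tokenize import sent_tokenize

counts = Counter()
for sentence in (s.lower() for s in sent_tokenize(t)):
    if "!" in sentence:                     # only sentences with an exclamation mark
        for word in ("like", "want"):
            if word in sentence:
                counts[word] += 1

print(counts)  # expected on the sample text: Counter({'like': 2, 'want': 2})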