Pyspark - counting particular words in sentences
I have a pyspark dataframe with a column containing text content.
I am counting the number of sentences that contain an exclamation mark "!" together with either of the words "like" and "want".
For example, a column whose rows contain the sentences:
I don't like to sing!
I like to go shopping!
I want to go home!
I like fast food.
you don't want to!
what does he want?
The desired output I am hoping for is this (only counting sentences that contain "like" or "want" and "!"):
+----+-----+
|word|count|
+----+-----+
|like|    2|
|want|    2|
+----+-----+
Can someone help me write a UDF that can do this? Here is what I have written so far, but I can't seem to get it working.
from nltk.tokenize import sent_tokenize

def convert_a_sentence(a_string):
    sentence = lower(nltk.sent_tokenize(a_string))
    return sentence

df = df.withColumn('a_sentence', convert_a_sentence(df['text']))
df.select(explode('a_sentence').alias('found')).filter(df['a_sentence'].isin('like', 'want', '!')).groupBy('found').count().collect()
If all you want are uni-grams (i.e. single tokens), you can split the sentence on spaces, then explode, group, count, and filter the ones you want:
from pyspark.sql import functions as F

(df
    .withColumn('words', F.split('sentence', ' '))
    .withColumn('word', F.explode('words'))
    .groupBy('word')
    .agg(
        F.count('*').alias('word_cnt')
    )
    .where(F.col('word').isin(['like', 'want']))
    .show()
)
# Output
# +----+--------+
# |word|word_cnt|
# +----+--------+
# |want| 2|
# |like| 3|
# +----+--------+
Note #1: you can apply the filter before the groupBy, using the contains function.
Note #2: if you want n-grams instead of the "hack" above, you can look at the Spark ML Tokenizer:
from pyspark.ml.feature import Tokenizer
tokenizer = Tokenizer(inputCol='sentence', outputCol="words")
tokenized = tokenizer.transform(df)
# Output
# +----------------------+----------------------------+
# |sentence |words |
# +----------------------+----------------------------+
# |I don't like to sing! |[i, don't, like, to, sing!] |
# |I like to go shopping!|[i, like, to, go, shopping!]|
# |I want to go home! |[i, want, to, go, home!] |
# |I like fast food. |[i, like, fast, food.] |
# |you don't want to! |[you, don't, want, to!] |
# |what does he want? |[what, does, he, want?] |
# +----------------------+----------------------------+
from pyspark.ml.feature import NGram
ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
ngramed = ngram.transform(tokenized)
# Output
# +----------------------+----------------------------+----------------------------------------+
# |col |words |ngrams |
# +----------------------+----------------------------+----------------------------------------+
# |I don't like to sing! |[i, don't, like, to, sing!] |[i don't, don't like, like to, to sing!]|
# |I like to go shopping!|[i, like, to, go, shopping!]|[i like, like to, to go, go shopping!] |
# |I want to go home! |[i, want, to, go, home!] |[i want, want to, to go, go home!] |
# |I like fast food. |[i, like, fast, food.] |[i like, like fast, fast food.] |
# |you don't want to! |[you, don't, want, to!] |[you don't, don't want, want to!] |
# |what does he want? |[what, does, he, want?] |[what does, does he, he want?] |
# +----------------------+----------------------------+----------------------------------------+
I'm not sure about the pandas or pyspark approach, but you can easily do this with a plain function:
from nltk.tokenize import sent_tokenize

t = """
I don't like to sing!
I like to go shopping!
I want to go home!
I like fast food.
you don't want to!
what does he want?
"""

sentences = [s.lower() for s in sent_tokenize(t)]
for sentence in sentences:
    if "!" in sentence and "like" in sentence:
        print(f"found in {sentence}")
You should be able to figure out how to count it / put it in a table from there...
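To turn that loop into the counts table from the question, a Counter over both keywords works. This sketch uses str.splitlines instead of sent_tokenize so it runs without the nltk data download (each sentence is on its own line here anyway):

```python
from collections import Counter

text = """I don't like to sing!
I like to go shopping!
I want to go home!
I like fast food.
you don't want to!
what does he want?"""

counts = Counter()
for sentence in (s.lower() for s in text.splitlines()):
    if "!" in sentence:                  # sentence must contain "!"
        for word in ("like", "want"):    # count each keyword it contains
            if word in sentence:
                counts[word] += 1

print(counts)  # Counter({'like': 2, 'want': 2})
```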