How to get Bigram/Trigram of word from prelisted unigram from a document corpus / dataframe column
I have a dataframe with one column containing text.
I have a predefined list of keywords, and I need to analyse those keywords and the words associated with them (to later build a word cloud and an occurrence counter) in order to understand the topics/context around them.
Use case:
df.text_column()
keywordlist = [coca , food, soft, aerated, soda]
Suppose one row of the text column has the text: 'coca cola is expanding its business in soft drinks and aerated water'.
Another entry reads: 'lime soda is the best selling item in fast food stores'
My objective is to get bigrams/trigrams like:
'coca_cola', 'coca_cola_expanding', 'soft_drinks', 'aerated_water', 'business_soft_drinks', 'lime_soda', 'food_stores'
Please help me do this [Python only]
First, you can optionally load nltk's stop word list and remove any stop words (such as "is", "its", "in" and "and") from the text. Alternatively, you can define your own stop word list, or even extend nltk's list with additional words. Next, you can use the nltk.bigrams()
and nltk.trigrams()
methods to get the bigrams and trigrams joined with an underscore _
, as you asked. Also, have a look at Collocations.
Edit:
If you haven't already, you need to include the following once in your code in order to download the stop word list.
nltk.download('stopwords')
Code:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# nltk.download('punkt')  # also needed once, for word_tokenize
word_data = "coca cola is expanding its business in soft drinks and aerated water"
#word_data = "lime soda is the best selling item in fast food stores"
# load nltk's stop word list
stop_words = list(stopwords.words('english'))
# extend the stop words list
#stop_words.extend(["best", "selling", "item", "fast"])
# tokenise the string and remove stop words
word_tokens = word_tokenize(word_data)
clean_word_data = [w for w in word_tokens if not w.lower() in stop_words]
# get bigrams
bigrams_list = ["_".join(item) for item in nltk.bigrams(clean_word_data)]
print(bigrams_list)
# get trigrams
trigrams_list = ["_".join(item) for item in nltk.trigrams(clean_word_data)]
print(trigrams_list)
Update
Once you have the lists of bigrams and trigrams, you can check them against the keyword list to keep only the relevant ones.
keywordlist = ['coca' , 'food', 'soft', 'aerated', 'soda']
def find_matches(n_grams_list):
    matches = []
    for k in keywordlist:
        for m in (s for s in n_grams_list if k in s):
            if m not in matches:
                matches.append(m)
    return matches
all_matching_bigrams = find_matches(bigrams_list) # find all matching bigrams
all_matching_trigrams = find_matches(trigrams_list) # find all matching trigrams
# join the two lists
all_matches = all_matching_bigrams + all_matching_trigrams
print(all_matches)
Output:
['coca_cola', 'business_soft', 'soft_drinks', 'drinks_aerated', 'aerated_water', 'coca_cola_expanding', 'expanding_business_soft', 'business_soft_drinks', 'soft_drinks_aerated', 'drinks_aerated_water']