Count of the most popular words and 'two word combinations' in Hebrew in a pandas DataFrame with nltk
I have a csv data file with a 'notes' column containing satisfaction answers in Hebrew.
I want to find the most popular words and the most popular '2 word combinations', how many times each appears, and plot them in a bar chart.
My code so far:
# PYTHONIOENCODING="UTF-8" is set in the shell, not in the script
import pandas as pd

df = pd.read_csv('keep.csv', encoding='utf-8', usecols=['notes'])
words = df.notes.str.split(expand=True).stack().value_counts()
This produces a list of words with counts, but it includes all the Hebrew stopwords and does not produce any '2 word combination' frequencies.
I also tried this code, but it is not what I am looking for:
import nltk

top_N = 30
txt = df.notes.str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(txt)
word_dist = nltk.FreqDist(words)
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
print('=' * 60)
How can I do this with nltk?
Solution for counting bigrams over all values:
import pandas as pd
import nltk

df = pd.DataFrame({'notes': ['aa bb cc', 'cc cc aa aa']})
top_N = 3

txt = df.notes.str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(txt)
bigrm = list(nltk.bigrams(words))
print(bigrm)
[('aa', 'bb'), ('bb', 'cc'), ('cc', 'cc'), ('cc', 'cc'), ('cc', 'aa'), ('aa', 'aa')]
word_dist = nltk.FreqDist([' '.join(x) for x in bigrm])
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
    Word  Frequency
0  cc cc          2
1  aa bb          1
2  bb cc          1
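The question also asks about dropping Hebrew stopwords; a minimal sketch of filtering them out before the bigrams are built, so stopwords never appear in any pair. The `stop_words` set here is a stand-in, since NLTK may not ship a Hebrew stopword list; substitute your own, and note this sketch uses a plain `split()` rather than `word_tokenize` to keep it self-contained:

```python
import nltk
import pandas as pd

df = pd.DataFrame({'notes': ['aa bb cc', 'cc cc aa aa']})
stop_words = {'bb'}  # placeholder: replace with your own Hebrew stopword set

txt = df.notes.str.lower().str.cat(sep=' ')
# Remove stopwords before pairing, so no bigram contains one
words = [w for w in txt.split() if w not in stop_words]
bigrm = list(nltk.bigrams(words))
word_dist = nltk.FreqDist(' '.join(p) for p in bigrm)
print(word_dist.most_common(3))
```

The same filter can be applied to the tokenized `words` list in any of the solutions above.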
Solution for bigrams per each split value of the column:
df = pd.DataFrame({'notes': ['aa bb cc', 'cc cc aa aa']})
top_N = 3

f = lambda x: list(nltk.bigrams(nltk.tokenize.word_tokenize(x)))
b = df.notes.str.lower().str.replace(r'\|', ' ', regex=True).apply(f)
print(b)

word_dist = nltk.FreqDist([' '.join(y) for x in b for y in x])
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
    Word  Frequency
0  aa bb          1
1  bb cc          1
2  cc cc          1
If you need to count unigrams together with bigrams:
top_N = 3
# note: the (1, 2) range belongs to everygrams, not word_tokenize
f = lambda x: list(nltk.everygrams(nltk.tokenize.word_tokenize(x), 1, 2))
b = df.notes.str.lower().str.replace(r'\|', ' ', regex=True).apply(f)
print(b)

word_dist = nltk.FreqDist([' '.join(y) for x in b for y in x])
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
And the final plot with DataFrame.plot.bar:
rslt.plot.bar(x='Word', y='Frequency')
In addition to what jezrael posted, I would like to show another hack to achieve this. Since you are trying to get frequencies of single words as well as two-word combinations, you can also take advantage of the everygrams function.
Given a dataframe:
import pandas as pd
df = pd.DataFrame()
df['notes'] = ['this is sentence one', 'is sentence two this one', 'sentence one was good']
Use everygrams(word_tokenize(x), 1, 2) to get the one-word and two-word forms. To also get three-word combinations, change the 2 to 3, and so on. So in your case it should be:
from nltk import everygrams, word_tokenize
x = df['notes'].apply(lambda x: [' '.join(ng) for ng in everygrams(word_tokenize(x), 1, 2)]).to_frame()
At this point you should see:
notes
0 [this, is, sentence, one, this is, is sentence...
1 [is, sentence, two, this, one, is sentence, se...
2 [sentence, one, was, good, sentence one, one w...
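To illustrate the claim above that raising the upper bound from 2 to 3 adds three-word combinations, here is a small sketch; it uses a plain `split()` in place of `word_tokenize` so it does not depend on the tokenizer data download:

```python
from nltk import everygrams

tokens = 'this is sentence one'.split()
# (1, 3) yields unigrams, bigrams, and trigrams: 4 + 3 + 2 = 9 n-grams
ngrams = [' '.join(ng) for ng in everygrams(tokens, 1, 3)]
print(ngrams)
```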
You can now get the counts by flattening the lists and using value_counts:
import numpy as np
flattenList = pd.Series(np.concatenate(x.notes))
freqDf = flattenList.value_counts().sort_index().rename_axis('notes').reset_index(name = 'frequency')
Final output:
           notes  frequency
0           good          1
1             is          2
2    is sentence          2
3            one          3
4        one was          1
5       sentence          3
6   sentence one          2
7   sentence two          1
8           this          2
9        this is          1
10      this one          1
11           two          1
12      two this          1
13           was          1
14      was good          1
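As a hypothetical alternative to the `np.concatenate` + `value_counts` approach, the same frequency table can be built with `collections.Counter` over the flattened lists; a sketch on toy data, not the original answer's sentences:

```python
from collections import Counter
import pandas as pd

# 'notes' already holds the per-row n-gram lists, as produced by everygrams
x = pd.DataFrame({'notes': [['this', 'is', 'this is'], ['is', 'one']]})

# Counter flattens and counts in one pass, without NumPy
counts = Counter(ng for row in x.notes for ng in row)
freqDf = (pd.Series(counts).sort_index()
            .rename_axis('notes').reset_index(name='frequency'))
print(freqDf)
```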
Plotting the graph is now easy:
import matplotlib.pyplot as plt
plt.figure()
flattenList.value_counts().plot(kind = 'bar', title = 'Count of 1-word and 2-word frequencies')
plt.xlabel('Words')
plt.ylabel('Count')
plt.show()
Output: a bar chart of the 1-word and 2-word counts.