Counting the number of specified words
I want to count the occurrences of 'america' and 'citizen' in the 'inaugural' corpus, in the files whose names start with 1789 and 1793.
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
year = ['1789', '1793']
word = ['america', 'citizen']
cfd.tabulate(conditions=year, samples=word)
It doesn't count the words correctly. What's wrong?
Note: I want 'america' and 'citizen' displayed as columns and the years as rows.
My output:
     america citizen
1789       0       0
1793       0       0
Here's an approach: you can use the count function:
print (mystring.count("specificword"))
A demo:
mystring = "hey hey hi hello hey hello hi"
print (mystring.count("hey"))
>>>
3
>>>
The rest is up to you. Displaying them as a table is basically a matter of formatting them with the print function. Another demo:
mystring = "hey hey hi hello hey hello hi"
a = mystring.count("hey")
b = mystring.count("hi")
c = mystring.count("hello")
obj = """hey: {}
hi: {}
hello: {}"""
print (obj.format(a,b,c))
Output:
>>>
hey: 3
hi: 2
hello: 2
>>>
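For the corpus in the question, the same idea looks roughly like this (a sketch, assuming the NLTK inaugural corpus is downloaded; note that count() matches substrings, so 'america' also counts 'american', much like the startswith() test in the question):

from nltk.corpus import inaugural

# Count raw substring occurrences in each of the two addresses.
for fileid in ['1789-Washington.txt', '1793-Washington.txt']:
    text = inaugural.raw(fileid).lower()
    print(fileid[:4], text.count('america'), text.count('citizen'))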
You can use nltk.word_tokenize to create a list of words, then use collections.Counter to build a dictionary whose keys are the words and whose values are the word frequencies:

from collections import Counter
import nltk

with open(file) as f:
    C = Counter(nltk.word_tokenize(f.read().lower()))
B = ['america', 'citizen']
for i in B:
    print(C[i])
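For this corpus you can also skip manual file handling and feed the corpus reader's word list straight to Counter. A minimal sketch, assuming the inaugural corpus is installed and using its '1789-Washington.txt' fileid:

from collections import Counter
from nltk.corpus import inaugural

# Lowercase each token so 'America' and 'america' are tallied together.
C = Counter(w.lower() for w in inaugural.words('1789-Washington.txt'))
for target in ['america', 'citizen']:
    print(target, C[target])

Unlike the startswith() test in the question, this counts exact tokens only, so 'american' would show up as a separate key.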
Your conditions and samples are in the reverse order: the ConditionalFreqDist constructor takes (condition, sample) pairs, and you gave it (sample, condition). Try:
cfd = nltk.ConditionalFreqDist(
    (fileid[:4], target)
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
A = ['1789', '1793']
B = ['america', 'citizen']
cfd.tabulate(conditions=A, samples=B)
Output:
     america citizen
1789       2       5
1793       1       1
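Since a ConditionalFreqDist is essentially a dict of FreqDist objects keyed by condition, you can also inspect one year directly (a quick check, assuming cfd was built as above; the values are those from the table):

# Each condition ('1789', '1793', ...) maps to its own FreqDist.
print(cfd['1789']['america'])  # 2
print(cfd['1789']['citizen'])  # 5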
In the general case, you'd want to use a stemmer, ending up with something like this:
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')
cfd = nltk.ConditionalFreqDist(
    (fileid[:4], stemmer.stem(word))
    for fileid in inaugural.fileids()
    for word in inaugural.words(fileid))
A = ['2009', '2005']
B = [stemmer.stem(i) for i in ['freedom', 'war']]
cfd.tabulate(conditions=A, samples=B)
resulting in the output:
     freedom war
2009       3   2
2005      27   0
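To see why the stemming step helps, here is a quick illustration (a sketch; the exact stems depend on the Snowball rules) of how related surface forms collapse to a single key:

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')
# Plural and inflected forms reduce to the same stem, so they are
# tallied under a single sample in the ConditionalFreqDist.
print([stemmer.stem(w) for w in ['freedom', 'freedoms', 'war', 'wars']])
# ['freedom', 'freedom', 'war', 'war']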