Filtering out stopwords
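I wrote a simple word-count program, and I am trying to use nltk (see below) to filter common words out of my list.
My question is: how do I apply the "stop" filter to the "frequency" list?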
#Start
from nltk.corpus import stopwords
import csv
import re

frequency = {}
# Read the whole file and lowercase it so counting is case-insensitive
with open('Import.txt', 'r') as document_text:
    text_string = document_text.read().lower()
# Keep only words of 3 to 15 letters
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)
for word in match_pattern:
    count = frequency.get(word, 0)
    frequency[word] = count + 1
# Drop words that occur only once
frequency = {k: v for k, v in frequency.items() if v > 1}
# Stopword list is built here but never applied to frequency
stop = set(stopwords.words('english'))
stop = list(stop)
stop.append(".")
# newline='' avoids blank rows in the CSV on Windows
with open('Export.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for key, value in frequency.items():
        writer.writerow([key, value])
stop = set(stopwords.words('english'))
stop.add(".")
frequency = {k: v for k, v in frequency.items() if v > 1 and k not in stop}
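While stop is still a set, check the keys of the frequency dictionary against it inside the comprehension. You can still turn stop back into a list afterwards if you need one.
The reason I keep it as a set is that membership lookups in a set are more efficient than in a list.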
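For reference, here is a minimal sketch of the same idea that runs without the input file or the NLTK corpus. The count_words helper, the STOP set, and the sample sentence are all made up for illustration; in the real program you would pass set(stopwords.words('english')) instead of STOP. It uses collections.Counter in place of the manual counting loop, which is a substitution on my part rather than something from the original code.

```python
import re
from collections import Counter

# Hypothetical stand-in for set(stopwords.words('english'));
# use the NLTK set in the real program.
STOP = {"the", "and", "for", "that", "with"}


def count_words(text, stop, min_count=2):
    """Count 3-15 letter words, skipping stopwords and rare words."""
    words = re.findall(r'\b[a-z]{3,15}\b', text.lower())
    # Filter against the set while counting; set lookups are fast
    counts = Counter(w for w in words if w not in stop)
    # min_count=2 matches the original "v > 1" condition
    return {w: c for w, c in counts.items() if c >= min_count}


sample = "The cat and the dog and the cat sat with the dog for the cat."
result = count_words(sample, STOP)
print(result)
```

Filtering before counting gives the same result as filtering the finished dictionary, it just avoids storing counts for words you are going to throw away.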