Count no. of tokens after tokenization, stop words removal and stemming
I have the following function:
def preprocessText(data):
    stemmer = nltk.stem.porter.PorterStemmer()
    preprocessed = []
    for each in data:
        tokens = nltk.word_tokenize(each.lower().translate(string.punctuation))
        filtered = [word for word in tokens if word not in nltk.corpus.stopwords.words('english')]
        preprocessed.append([stemmer.stem(item) for item in filtered])
        print(Counter(tokens).most_common(10))
    return np.array(preprocessed)
It is supposed to remove punctuation, tokenize, remove stop words, and stem with the Porter Stemmer. However, it does not work correctly. For example, when I run this code:
s = ["The cow and of.", "and of dog the."]
print(Counter(preprocessText(s)))
it produces this output:
[('and', 1), ('.', 1), ('dog', 1), ('the', 1), ('of', 1)]
The punctuation and stop words are not removed.
Your translate fails to remove the punctuation. Here is some working code. I made a few changes, the most important of which is:
Code:
xlate = {ord(x): y for x, y in
         zip(string.punctuation, ' ' * len(string.punctuation))}
tokens = nltk.word_tokenize(each.lower().translate(xlate))
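To see why this works, here is a minimal stand-alone check (plain stdlib, no nltk required) showing that the dict maps every punctuation character's ordinal to a space, so translate turns punctuation into whitespace before tokenization:

```python
import string

# Python 3: str.translate expects a mapping of Unicode ordinals.
xlate = {ord(x): y for x, y in
         zip(string.punctuation, ' ' * len(string.punctuation))}

print("The cow and of.".lower().translate(xlate))  # -> "the cow and of "
```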
Test code:
from collections import Counter
import nltk
import numpy as np
import string

stopwords = set(nltk.corpus.stopwords.words('english'))

try:
    # python 2
    xlate = string.maketrans(
        string.punctuation, ' ' * len(string.punctuation))
except AttributeError:
    # python 3
    xlate = {ord(x): y for x, y in
             zip(string.punctuation, ' ' * len(string.punctuation))}

def preprocessText(data):
    stemmer = nltk.stem.porter.PorterStemmer()
    preprocessed = []
    for each in data:
        tokens = nltk.word_tokenize(each.lower().translate(xlate))
        filtered = [word for word in tokens if word not in stopwords]
        preprocessed.append([stemmer.stem(item) for item in filtered])
    return np.array(preprocessed)

s = ["The cow and of.", "and of dog the."]
print(Counter(sum([list(x) for x in preprocessText(s)], [])))
Result:
Counter({'dog': 1, 'cow': 1})
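The sum(..., []) idiom in the test line flattens the per-sentence token lists before counting; itertools.chain.from_iterable does the same flattening more idiomatically. A sketch on a hand-written sample shaped like preprocessText's output:

```python
from collections import Counter
from itertools import chain

# Hypothetical sample: one list of stemmed tokens per input sentence.
preprocessed = [['cow'], ['dog']]

# chain.from_iterable yields 'cow', 'dog' without building an intermediate list.
print(Counter(chain.from_iterable(preprocessed)))  # -> Counter({'cow': 1, 'dog': 1})
```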
The problem is that you are misusing translate. To use it correctly, you need to create a mapping table that (as the help string will tell you) maps "Unicode ordinals to Unicode ordinals, strings, or None." For example, like this:
>>> mapping = dict((ord(x), None) for x in string.punctuation)  # `None` means "delete"
>>> "This.and.that".translate(mapping)
'Thisandthat'
But if you do this to the word tokens, you will just replace the punctuation tokens with empty strings. You could add a step to get rid of them, but I recommend simply selecting the tokens you want: i.e., the alphanumeric words.
tokens = [token for token in nltk.word_tokenize(each.lower()) if token.isalnum()]
That is all you need to change in your code.
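A minimal sketch of that filter in isolation, applied to a hand-written token list of the kind nltk.word_tokenize would produce for "and of dog the." (no nltk needed to see the effect):

```python
# Stand-in for nltk.word_tokenize("and of dog the.".lower());
# the tokenizer keeps punctuation as its own token.
tokens = ['and', 'of', 'dog', 'the', '.']

# str.isalnum() is False for pure-punctuation tokens, so they drop out.
kept = [token for token in tokens if token.isalnum()]
print(kept)  # -> ['and', 'of', 'dog', 'the']
```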