Bag of words representation using sklearn plus SnowballStemmer
I have a list of songs, like
list2 = ["first song", "second song", "third song"...]
Here is my code:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
vectorizer = CountVectorizer(stop_words=stopwords.words('english'))
bagOfWords = vectorizer.fit(list2)
bagOfWords = vectorizer.transform(list2)
It works, but I want to stem the words in my list.
I tried doing this:
def tokeni(self, data):
    return [SnowballStemmer("english").stem(word) for word in data.split()]

vectorizer = CountVectorizer(stop_words=stopwords.words('english'),
                             tokenizer=self.tokeni)
but it didn't work. What am I doing wrong?
Update:
With the tokenizer I get words like "oh...", "s-like...", "knees,".
Without the tokenizer I don't get any words with dots, commas, etc.
You can pass a custom preprocessor, which should work equally well but keep the tokenizer's functionality intact:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import SnowballStemmer
list2 = ["rain", "raining", "rainy", "rainful", "rains", "raining!", "rain?"]
def preprocessor(data):
    return " ".join([SnowballStemmer("english").stem(word) for word in data.split()])
vectorizer = CountVectorizer(preprocessor=preprocessor).fit(list2)
print(vectorizer.vocabulary_)
# Should print this:
# {'raining': 2, 'raini': 1, 'rain': 0}
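If you do want to go the tokenizer route the question originally tried, a minimal sketch (assuming NLTK's SnowballStemmer and a regex equivalent to CountVectorizer's default token pattern) is to tokenize first and stem afterwards, so punctuation never reaches the stemmer:

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
# Same pattern CountVectorizer uses by default: words of 2+ characters.
token_re = re.compile(r"(?u)\b\w\w+\b")

def stemming_tokenizer(doc):
    # Extract clean tokens first, then stem each one, so "raining!"
    # becomes "raining" before stemming and reduces to "rain".
    return [stemmer.stem(tok) for tok in token_re.findall(doc)]

list2 = ["rain", "raining", "rainy", "rainful", "rains", "raining!", "rain?"]
vectorizer = CountVectorizer(tokenizer=stemming_tokenizer).fit(list2)
print(vectorizer.vocabulary_)
```

Because punctuation is stripped before stemming, "raining!" collapses into "rain" here instead of surviving as a separate "raining" feature, which is the behavior the update was asking about.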