TfidfVectorizer using my own stopwords dictionary
I would like to ask whether I can use my own stopwords dictionary in place of the one that ships with TfidfVectorizer. I have built a larger stopwords dictionary and would prefer to use it, but I am having trouble including it in the code below (standard code is shown).
import re
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def preprocessing(line):
    line = line.lower()
    line = re.sub(r"[{}]".format(string.punctuation), " ", line)
    return line

tfidf_vectorizer = TfidfVectorizer(preprocessor=preprocessing, stop_words_='english')
tfidf = tfidf_vectorizer.fit_transform(df["0"]['Words'])  # multiple dataframes
kmeans = KMeans(n_clusters=2).fit(tfidf)
But I get the following error:
TypeError: __init__() got an unexpected keyword argument 'stop_words_'
Suppose my dictionary is:

stopwords["a","an", ... "been", "had",...]

How can I include it? Any help would be greatly appreciated.
TfidfVectorizer has no constructor parameter called 'stop_words_'; the parameter is stop_words. In scikit-learn, a trailing underscore marks an attribute that is set during fitting, not an argument you pass in.
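To make the naming concrete, here is a minimal sketch with a made-up two-document corpus, showing that stop_words_ only exists on a fitted vectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical toy corpus, just to illustrate the naming convention
docs = ["the cat sat on the mat", "a dog sat on a log"]

vec = TfidfVectorizer(stop_words='english', min_df=2)  # stop_words is the parameter
vec.fit(docs)

# stop_words_ (trailing underscore) is a fitted attribute: the terms that
# were dropped by the max_df/min_df/max_features cutoffs during fitting
print(vec.stop_words_)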
If you have a custom stop_words list like the following:

smart_stoplist = ['a', 'an', 'the']

use it like this:
tfidf_vectorizer = TfidfVectorizer(preprocessor=preprocessing, stop_words=smart_stoplist)
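Since you mentioned building a larger dictionary of your own, one way to plug it in (a minimal sketch, assuming the dictionary is stored one word per line in a hypothetical file named my_stopwords.txt) is:

# load your own stopword dictionary; my_stopwords.txt is a hypothetical
# file containing one stopword per line
with open("my_stopwords.txt") as f:
    my_stopwords = [w.strip() for w in f if w.strip()]

tfidf_vectorizer = TfidfVectorizer(preprocessor=preprocessing,
                                   stop_words=my_stopwords)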
A better approach for what you are trying to do: note that TfidfVectorizer accepts a tokenizer callable, which can return an already-cleaned list of words. I think this might work for you!
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# word_tokenize needs 'punkt' and WordNetLemmatizer needs 'wordnet',
# so download those along with the stopword list
nltk.download(['stopwords', 'punkt', 'wordnet'])

# here you can add to stop_words any other word that you want,
# or replace it entirely with your own array-like stopwords list
stop_words = stopwords.words('english')

lemmatizer = WordNetLemmatizer()  # build once, not once per word

def preprocessing(line):
    # keep letters only and lowercase, then tokenize, lemmatize,
    # and drop anything that is in the stopword list
    line = re.sub(r"[^a-zA-Z]", " ", line.lower())
    words = word_tokenize(line)
    words_lemmed = [lemmatizer.lemmatize(w) for w in words if w not in stop_words]
    return words_lemmed

tfidf_vectorizer = TfidfVectorizer(tokenizer=preprocessing)
tfidf = tfidf_vectorizer.fit_transform(df['Texts'])
kmeans = KMeans(n_clusters=2).fit(tfidf)
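As a quick sanity check (a sketch, assuming the pipeline above has run and a recent scikit-learn version), you can confirm that none of the stopwords survived into the learned vocabulary, and inspect the cluster assignments:

# none of the stopwords should appear among the learned features
vocab = set(tfidf_vectorizer.get_feature_names_out())
assert not vocab & set(stop_words)

# cluster assignment for each input text
print(kmeans.labels_)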