Why is TfidfVectorizer in scikit-learn showing this behavior?
When creating a TfidfVectorizer object, if I explicitly pass the default value for the token_pattern parameter, it throws an error when I call fit_transform. Here is the error:
ValueError: empty vocabulary; perhaps the documents only contain stop words
I am doing this because I eventually want to pass a different value for token_pattern, so that single-letter tokens are also included in my tfidf matrix.
Here is an example:
from sklearn.feature_extraction.text import TfidfVectorizer
train_set = ("The sky is blue.", "The sun is bright.")
vectorizer1 = TfidfVectorizer(ngram_range=(1, 2), max_df=1.0, min_df=1)
train_set_tfidf = vectorizer1.fit_transform(train_set) #works fine
vectorizer2 = TfidfVectorizer(token_pattern=u'(?u)\b\w\w+\b', ngram_range=(1, 2), max_df=1.0, min_df=1)
train_set_tfidf = vectorizer2.fit_transform(train_set) #throws error
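Put an r in front of the regular expression to make it a raw string. In an ordinary string literal, Python interprets \b as the backspace character rather than as a regex word boundary, so the pattern matches nothing and the vocabulary comes out empty. This should work: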
vectorizer2 = TfidfVectorizer(token_pattern=r'(?u)\b\w\w+\b', ngram_range=(1, 2), max_df=1.0, min_df=1)
train_set_tfidf = vectorizer2.fit_transform(train_set)
This is a known bug in the documentation, but if you look at the source code, they do use a raw string literal.
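To see the difference without scikit-learn, here is a minimal sketch using only the standard re module. The sample text and variable names are made up for illustration, and the single-letter pattern r'(?u)\b\w+\b' is one possible choice rather than anything the library prescribes. The non-raw pattern is written with \\w so that only \b is affected by string escaping.

```python
import re

# Hypothetical sample text, lowercased up front for simplicity.
text = "the sky is blue. i see a sun."

# Non-raw literal: Python converts "\b" to the backspace character (\x08)
# before the regex engine ever sees it. "\\w" is escaped here only to keep
# the rest of the pattern unchanged.
non_raw = '(?u)\b\\w\\w+\b'

# Raw literal: "\b" reaches the regex engine as a word-boundary assertion.
raw = r'(?u)\b\w\w+\b'

# Assumed alternative that also keeps single-character tokens.
single = r'(?u)\b\w+\b'

print('\x08' in non_raw)           # True: the pattern contains a backspace
print(re.findall(non_raw, text))   # [] -> this is what leads to an empty vocabulary
print(re.findall(raw, text))       # tokens of two or more characters
print(re.findall(single, text))    # single-letter tokens such as "i" and "a" included
```

Passing the same raw string as token_pattern to TfidfVectorizer should then keep single-letter tokens in the vocabulary.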