TF-IDF vectorizer to extract n-grams
How can I use the TF-IDF vectorizer from the scikit-learn library to extract unigrams and bigrams from tweets? I want to train a classifier with the output.
This is the code from scikit-learn:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
TfidfVectorizer has an ngram_range parameter that determines the range of n-grams to include as features in the final matrix. In your case, you want (1,2), which goes from unigrams to bigrams:
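import pandas as pd
vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
X = vectorizer.fit_transform(corpus).toarray()    # dense array, for display only
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(X, columns=vectorizer.get_feature_names_out())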
        and  and this  document  document is     first  first document  \
0  0.000000  0.000000  0.314532     0.000000  0.388510        0.388510
1  0.000000  0.000000  0.455513     0.356824  0.000000        0.000000
2  0.357007  0.357007  0.000000     0.000000  0.000000        0.000000
3  0.000000  0.000000  0.282940     0.000000  0.349487        0.349487

         is    is the   is this       one  ...       the  the first  \
0  0.257151  0.314532  0.000000  0.000000  ...  0.257151   0.388510
1  0.186206  0.227756  0.000000  0.000000  ...  0.186206   0.000000
2  0.186301  0.227873  0.000000  0.357007  ...  0.186301   0.000000
3  0.231322  0.000000  0.443279  0.000000  ...  0.231322   0.349487
...
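You specify the n-grams when you initialize TfidfVectorizer: TfidfVectorizer(ngram_range=(min_n, max_n)).
ngram_range gives the lower and upper boundary of the range of n-values for the n-grams to be extracted. An ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.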
The answer is:
vectorizer = TfidfVectorizer(ngram_range=(1,2))