How to extract TF using CountVectorizer?
How do I get the term frequency (TF) of each term in the vocabulary created by sklearn.feature_extraction.text.CountVectorizer and put it into a list or a dictionary?
The values of the keys in the vocabulary all seem to be ints smaller than the max_features I set manually when initializing CountVectorizer, rather than TFs, which should be floats. Can anyone help?
CV = CountVectorizer(ngram_range=(ngram_min_file_opcode, ngram_max_file_opcode),
                     decode_error="ignore", max_features=max_features_file_re,
                     token_pattern=r'\b\w+\b', min_df=1, max_df=1.0)
x = CV.fit_transform(x).toarray()
The integer values you are seeing in vocabulary_ are column indices into the output matrix, not frequencies, which is why they are all ints smaller than max_features.
If you need float values, you are probably looking for TF-IDF. In that case, use either sklearn.feature_extraction.text.TfidfVectorizer, or sklearn.feature_extraction.text.CountVectorizer followed by sklearn.feature_extraction.text.TfidfTransformer.
If you really just want TF, you can still use TfidfVectorizer, or CountVectorizer followed by TfidfTransformer; just make sure to set the use_idf parameter of the TfidfVectorizer/TfidfTransformer to False and the norm (normalization) parameter to 'l1' or 'l2'. This normalizes the TF counts.
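As a minimal sketch of the TF-only route (the toy documents here are just for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['the cat sat', 'the cat sat on the mat']

# use_idf=False skips the IDF weighting; norm='l1' divides each row by the
# total token count of that document, giving relative term frequencies.
tf_vec = TfidfVectorizer(use_idf=False, norm='l1')
tf = tf_vec.fit_transform(docs).toarray()
print(tf)  # float values; each row sums to 1.0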
From the sklearn documentation:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
... 'This is the first document.',
... 'This document is the second document.',
... 'And this is the third one.',
... 'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
[0 2 0 1 0 1 1 0 1]
[1 0 0 1 1 0 1 1 1]
[0 1 1 1 0 0 1 0 1]]
The row [0 1 1 1 0 0 1 0 1] corresponds to the first document. The first element is the number of times 'and' appears in that document, the second the count of 'document', the third 'first', and so on.
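To get these counts into the dictionary you asked about, one option (a sketch building on the documentation example above) is to zip the feature names with a row of the count matrix:

# Map each vocabulary term to its count in the first document.
names = vectorizer.get_feature_names()  # get_feature_names_out() on newer sklearn versions
tf_dict = dict(zip(names, X.toarray()[0].tolist()))
print(tf_dict)
# {'and': 0, 'document': 1, 'first': 1, 'is': 1, 'one': 0, 'second': 0, 'the': 1, 'third': 0, 'this': 1}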