CountVectorizer returning zeros

Question

I have a vocabulary text file in which every line is one word. A few words from the vocabulary look like this:

AccountsAndTransactions_/get/v2/accounts/details_DELETE
AccountsAndTransactions_/get/v2/accounts/details_GET
AccountsAndTransactions_/get/v2/accounts/details_POST
AccountsAndTransactions_/get/v2/accounts/{accountId}/transactions_DELETE
AccountsAndTransactions_/get/v2/accounts/{accountId}/transactions_GET
AccountsAndTransactions_/get/v2/accounts/{accountId}/transactions_POST

Important: AccountsAndTransactions_/get/v2/accounts/details_DELETE is a single word for the purposes of this question.

Reading the vocabulary from the text file:

with open(Path(VOCAB_FILE), "r") as f:
    vocab = f.read().splitlines()

Building doc_paths:

doc_paths = [f for f in listdir(DOC_DIR) if isfile(join(DOC_DIR, f))]
r = re.compile(".*txt")
doc_paths = list(filter(r.match, doc_paths))
doc_paths = [Path(join(DOC_DIR, i)) for i in doc_paths]

I am running CountVectorizer on the files:

tf_vectorizer = CountVectorizer(input='filename', lowercase=False, vocabulary=vocab)
tf = tf_vectorizer.fit_transform(doc_paths)  # doc_paths is a list of pathlib.Path objects
X = tf.toarray()  # comes back as an all-zero matrix

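The problem is that every value in X is zero. (The corpus documents are not empty.)

Can anyone help? I want the term frequency of every vocabulary word for each document.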

Answer 1

I solved this by overriding CountVectorizer's default analyzer:

def analyzer_custom(doc):
    return doc.split()

tf_vectorizer = CountVectorizer(input='filename',
                                lowercase=False,
                                vocabulary=vocab,
                                analyzer=analyzer_custom)
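As far as I understand it, this works because the default word analyzer tokenizes with the pattern `(?u)\b\w\w+\b`, which breaks a vocabulary entry apart at `/`, `{` and `}`, so no token ever equals a full entry. Here is a small sketch using only `re` to show the difference. The sample entry is taken from the vocabulary above; the variable names are my own.

```python
import re

# Default token_pattern of CountVectorizer's built-in word analyzer.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

entry = "AccountsAndTransactions_/get/v2/accounts/{accountId}/transactions_GET"

# What the default analyzer would see: the entry is cut at every
# non-word character such as "/", "{" and "}".
default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, entry)

# What the custom analyzer sees: whitespace is the only separator,
# so the entry survives as a single token.
whitespace_tokens = entry.split()

print(default_tokens)
print(whitespace_tokens)
print(entry in default_tokens)     # False: never matches the vocabulary
print(entry in whitespace_tokens)  # True
```

Since the fixed vocabulary only contains whole entries, none of the fragments produced by the default pattern is ever counted, which would explain the all-zero matrix.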

Thanks to @Chris for explaining the internals of CountVectorizer.
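For anyone who wants to check this without files on disk, here is a minimal sketch. It assumes in-memory strings (the default `input='content'`) instead of `input='filename'`, and the two documents are made up purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

vocab = [
    "AccountsAndTransactions_/get/v2/accounts/details_GET",
    "AccountsAndTransactions_/get/v2/accounts/details_POST",
]

# Hypothetical documents standing in for the .txt files of the question.
docs = [
    vocab[0] + " " + vocab[0],   # GET twice
    vocab[1] + "\n" + vocab[0],  # POST once, GET once
]


def analyzer_custom(doc):
    # Split on whitespace only, so each endpoint stays one token.
    return doc.split()


# Default analyzer: the entries are broken into fragments, nothing matches.
default_vec = CountVectorizer(lowercase=False, vocabulary=vocab)
X_default = default_vec.fit_transform(docs).toarray()

# Custom analyzer: whole entries are matched against the vocabulary.
custom_vec = CountVectorizer(lowercase=False, vocabulary=vocab,
                             analyzer=analyzer_custom)
X_custom = custom_vec.fit_transform(docs).toarray()

print(X_default)
print(X_custom)
```

The columns follow the order of the `vocab` list, so each row of `X_custom` gives the per-document counts for the GET and POST entries in that order.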

