PYTHON: How to pass a tokenizer with keyword arguments to scikit's CountVectorizer?

Question

I have a custom tokenizer function that takes some keyword arguments:

def tokenizer(text, stem=True, lemmatize=False, char_lower_limit=2, char_upper_limit=30):
    # do things...
    return tokens

Now, how can I pass this tokenizer, together with all of its arguments, to CountVectorizer? Nothing I have tried works; this does not work either:

from sklearn.feature_extraction.text import CountVectorizer
args = {"stem": False, "lemmatize": True}
count_vect = CountVectorizer(tokenizer=tokenizer(**args), stop_words='english', strip_accents='ascii', min_df=0, max_df=1., vocabulary=None)

Any help is much appreciated. Thanks in advance.

Answer 1

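The tokenizer argument should be a callable or None. Writing tokenizer=tokenizer(**args) calls the function right away, without a text argument, instead of passing the function itself to CountVectorizer.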

(Is tokenizer=tokenize(**args) a typo? Your function above is named tokenizer.)

You could try this:

count_vect = CountVectorizer(tokenizer=lambda text: tokenizer(text, **args), stop_words='english', strip_accents='ascii', min_df=0, max_df=1., vocabulary=None)
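
An alternative to the lambda is functools.partial, which also yields a callable taking only the text. One practical difference: a partial of a module-level function can be pickled, while a lambda cannot, which may matter if you want to save the fitted vectorizer. Below is a minimal sketch; the tokenizer body (whitespace split plus a length filter) and the two sample documents are placeholders I made up for illustration, not the asker's real logic.

```python
from functools import partial

from sklearn.feature_extraction.text import CountVectorizer


def tokenizer(text, stem=True, lemmatize=False, char_lower_limit=2, char_upper_limit=30):
    # Placeholder body (assumption): split on whitespace and keep tokens
    # whose length is within the limits. stem/lemmatize are ignored here.
    return [t for t in text.split() if char_lower_limit <= len(t) <= char_upper_limit]


args = {"stem": False, "lemmatize": True}

# partial(tokenizer, **args) is a callable that still expects `text`,
# which is what CountVectorizer needs for its tokenizer argument.
count_vect = CountVectorizer(
    tokenizer=partial(tokenizer, **args),
    stop_words="english",
    strip_accents="ascii",
)

# Hypothetical sample documents, only to show the vectorizer runs.
X = count_vect.fit_transform(["the quick brown fox", "a lazy dog sleeps"])
vocab = sorted(count_vect.vocabulary_)
print(vocab)
```

scikit-learn may warn that token_pattern is unused when a custom tokenizer is given; that warning is harmless here.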

Tags: python, tokenize, feature-extraction, keyword-argument, scikit-learn