Token pattern for n-gram in TfidfVectorizer in python

Question

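Does TfidfVectorizer identify n-grams using python regular expressions?

This question arose while reading the documentation for scikit-learn's TfidfVectorizer, where I see that the pattern used to recognize n-grams at the word level is token_pattern=u'(?u)\b\w\w+\b'. I am having trouble seeing how this works. Consider the bigram case. If I do: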

    In [1]: import re
    In [2]: re.findall(u'(?u)\b\w\w+\b',u'this is a sentence! this is another one.')
    Out[2]: []

I do not find any bigrams. Whereas:

    In [3]: re.findall(u'(?u)\w+ \w*',u'this is a sentence! this is another one.')
    Out[3]: [u'this is', u'a sentence', u'this is', u'another one']

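finds some (but not all: for example, u'is a' and every other even-numbered bigram are missing). What am I doing wrong in interpreting how the \b character works?

Note: According to the documentation of the regular expression module, the \b character in re is supposed to: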

\b Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character.

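I have seen questions that address the problem of identifying n-grams in python (see 1,2), so a secondary question is: should I do that and add joined n-grams to my text before feeding it to TfidfVectorizer?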

Answer 1

You should prefix the regular expression with r so that it is a raw string. The following works:

>>> re.findall(r'(?u)\b\w\w+\b',u'this is a sentence! this is another one.')
[u'this', u'is', u'sentence', u'this', u'is', u'another', u'one']

This is a known bug in the documentation, but if you look at the source code they do use raw literals.
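To make the difference concrete, here is a minimal Python 3 sketch (the session above is Python 2). It shows that without the r prefix "\b" is a backspace character, so the pattern from the docs matches nothing, while the raw version yields word tokens. Note also that the pattern only extracts single tokens. The last step, which joins adjacent tokens into bigrams, is my own illustration of what I assume the vectorizer does when its ngram_range parameter asks for bigrams; it is not the library's actual code, and the variable names are made up for the example.

```python
import re

text = "this is a sentence! this is another one."

# In a normal (non-raw) string literal, "\b" is the backspace character.
assert "\b" == "\x08"

# Same string value as the non-raw pattern from the docs. The backslashes
# before "w" are doubled only to avoid invalid-escape warnings in Python 3.
non_raw = "(?u)\b\\w\\w+\b"
no_match = re.findall(non_raw, text)   # pattern looks for literal backspaces

# With a raw string, the regex engine sees \b as a word boundary.
raw = r"(?u)\b\w\w+\b"
tokens = re.findall(raw, text)         # words of two or more characters

# Illustrative assumption: bigrams are formed from consecutive tokens,
# not matched directly by the token pattern.
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]

print(no_match)
print(tokens)
print(bigrams)
```

Under that assumption there should be no need to join n-grams into the text by hand before vectorizing; passing something like ngram_range=(2, 2) to TfidfVectorizer is the usual route.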
