CountVectorizer token_pattern to not catch underscore

Question

The default CountVectorizer token pattern treats the underscore as a letter:

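from sklearn.feature_extraction.text import CountVectorizer

corpus = ['The rain in spain_stays']
vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w\w+\b')
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
print(vectorizer.get_feature_names())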

gives:

['in', 'rain', 'spain_stays', 'the']

This makes sense because, AFAIK, '\w' is equivalent to [a-zA-Z0-9_]. What I need is:

['in', 'rain', 'spain', 'stays', 'the']

So I tried replacing '\w' with [a-zA-Z0-9]:

vectorizer = CountVectorizer(token_pattern=r'(?u)\b[a-zA-Z0-9][a-zA-Z0-9]+\b')
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())

but I get:

['in', 'rain', 'the']

How can I get what I need? Any ideas are welcome.

Answer 1

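There is no word boundary between n and _ in spain_stays, because \w also matches an underscore. With \b on both sides, neither spain nor stays can match, so both tokens are dropped.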
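The missing boundary can be checked with the standard re module alone. This is a minimal sketch, assuming the input has already been lowercased (as CountVectorizer does by default); the variable names are illustrative only.

```python
import re

text = "the rain in spain_stays"

# \b only matches at a transition between a \w and a non-\w character.
# "n" and "_" are both \w characters, so there is no boundary between them.
boundary_between_n_and_underscore = re.search(r"n\b_", text)
print(boundary_between_n_and_underscore)

# The pattern from the question: "spain" cannot end at a \b and
# "stays" cannot start at one, so neither is returned.
pattern = r"(?u)\b[a-zA-Z0-9][a-zA-Z0-9]+\b"
tokens = re.findall(pattern, text)
print(tokens)
```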

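Match 2 or more word characters other than an underscore, allowing only a whitespace boundary or an underscore on the left and on the right: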

(?<![^\s_])[^\W_]{2,}(?![^\s_])

The pattern matches:

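(?<![^\s_]) Negative lookbehind, asserts a whitespace boundary or an underscore to the left
[^\W_]{2,} Matches a word character other than an underscore, 2 or more times
(?![^\s_]) Negative lookahead, asserts a whitespace boundary or an underscore to the right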
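The pattern above can be tried without scikit-learn. The following is a rough sketch that assumes the vectorizer's default behavior amounts to lowercasing each document and calling findall with the token pattern; the tokenize helper and the TOKEN_PATTERN name are hypothetical and not part of any library.

```python
import re

# The pattern from this answer, unchanged.
TOKEN_PATTERN = r"(?<![^\s_])[^\W_]{2,}(?![^\s_])"


def tokenize(text):
    """Hypothetical helper: lowercase the text, then collect all matches."""
    return re.findall(TOKEN_PATTERN, text.lower())


corpus = ["The rain in spain_stays"]

# Build a sorted vocabulary, similar in spirit to the feature names list.
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})
print(vocabulary)
```

The same pattern string should be usable as CountVectorizer(token_pattern=...), since it contains no capturing groups.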

See a regex demo.

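A much broader match would be [^\W_]{2,}, but note that this does not take the boundaries into account. It simply matches runs of word characters other than an underscore.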

See the different number of matches in this regex demo.
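The difference in the number of matches can also be reproduced locally. This is a small sketch on a made-up sample string (not taken from the linked demos) that contains a hyphen and an apostrophe next to word characters.

```python
import re

bounded_pattern = r"(?<![^\s_])[^\W_]{2,}(?![^\s_])"
broad_pattern = r"[^\W_]{2,}"

# Made-up sample: "e-mail" and "it's" contain non-word, non-underscore characters.
sample = "spain_stays e-mail it's"

# The bounded pattern rejects runs that touch "-" or "'".
bounded_tokens = re.findall(bounded_pattern, sample)
# The broad pattern keeps any run of 2+ word characters, wherever it sits.
broad_tokens = re.findall(broad_pattern, sample)

print(bounded_tokens)
print(broad_tokens)
```

Which variant fits depends on whether fragments such as mail from e-mail should count as tokens.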

python

regex

scikit-learn

countvectorizer