CountVectorizer preprocessing related to regex

Question

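I am using CountVectorizer with logistic regression for text processing and comparing the F1 score with and without preprocessing. I want to do the preprocessing with a regex, so I wrote the following code: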

def better_preprocessor(s):
    lower = s.lower()
    lower = re.sub(r'^\w{8,}$', lambda x:x[:7], lower)
    return lower

def a():
    cv = CountVectorizer()
    train = cv.fit_transform(train_data)
    features = cv.get_feature_names()
    cv_dev = CountVectorizer(vocabulary = features)
    dev = cv_dev.fit_transform(dev_data)
    print(features)

    lgr = LogisticRegression(C=0.5, solver="liblinear", multi_class="auto")
    lgr.fit(train, train_labels)
    lgr_pred = lgr.predict(dev)
    score = metrics.f1_score(dev_labels, lgr_pred, average="weighted")
    print('No preprocessing score:', score)

    cv_im = CountVectorizer(preprocessor=better_preprocessor)
    train_im = cv_im.fit_transform(train_data)
    features_im = cv_im.get_feature_names()
    cv_im_dev = CountVectorizer(preprocessor=better_preprocessor, vocabulary = features_im)
    dev_im = cv_im_dev.fit_transform(dev_data)

    lgr.fit(train_im, train_labels)
    lgr_pred_im = lgr.predict(dev_im)
    score_im = metrics.f1_score(dev_labels, lgr_pred_im, average="weighted")
    print('Preprocessing score', score_im)
    print(len(features)-len(features_im))
    print(features_im)

a()

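I am trying to truncate words of length 8 or more to 7 characters, but when I check the vocabulary with get_feature_names, nothing has changed. I don't know where to fix this.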

Answer 1

You do not need a regex to truncate a single string. Use

def better_preprocessor(s):
    if len(s) >= 8:
        return s.lower()[:7]
    else:
        return s.lower()

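The re.sub(r'^\w{8,}$', lambda x:x[:7], lower) call takes the lower string and tries to match ^\w{8,}$:

^ - start of string
\w{8,} - eight or more word characters
$ - end of string.

Because the pattern is anchored at both ends, it only matches when the whole string is a single word of eight or more characters, so a multi-word document is returned unchanged. When it does match, lambda x:x[:7] receives the match (x is a match object) and you end up trying to slice the match object itself. You probably meant x.group()[:7], but that is still overkill here.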
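
To make the point about the anchors concrete, here is a small sketch that uses only the re module. The helper name truncate_match and the sample strings are mine, not from the question.

```python
import re

pattern = r'^\w{8,}$'

def truncate_match(m):
    # m is a match object; m.group() returns the matched text, which can be sliced
    return m.group()[:7]

# The whole string is one long word, so the anchored pattern matches
single_word = re.sub(pattern, truncate_match, "crocodilesaway")

# The string contains a space, so ^\w{8,}$ cannot match and nothing is replaced
document = re.sub(pattern, truncate_match, "hello crocodilesaway")

print(single_word)  # crocodi
print(document)     # hello crocodilesaway
```

Since CountVectorizer hands the preprocessor a whole document rather than one token at a time, the second case is the one the question runs into.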

If you intend to find all the words in the string and truncate them, you need to specify what counts as a word for you and use

import re

def better_preprocessor(s):
    return re.sub(r'\b(\w{7})\w+', r'\1', s.lower())

Pattern details:

\b - a word boundary
(\w{7}) - Group 1 (referenced from the replacement pattern with \1): seven word characters
\w+ - one or more word characters
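
As a quick sanity check of the word-boundary approach, the sketch below applies a pattern with a \1 back-reference to a sample sentence. The function name truncate_words and the sample text are hypothetical, chosen only for illustration.

```python
import re

def truncate_words(text):
    # \b(\w{7}) captures the first seven word characters of a word;
    # \w+ consumes the remainder, and \1 puts back only the captured part
    return re.sub(r'\b(\w{7})\w+', r'\1', text.lower())

long_words = truncate_words("Hello crocodilesaway, how are you doing today")
exactly_seven = truncate_words("crocodi")
exactly_eight = truncate_words("elephant")

print(long_words)     # hello crocodi, how are you doing today
print(exactly_seven)  # crocodi
print(exactly_eight)  # elephan
```

Words of seven characters or fewer are left alone because \w+ needs at least one character beyond the captured seven.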

Answer 2

Here is one way to do it, using the analyzer parameter:

from sklearn.feature_extraction.text import CountVectorizer

def better_preprocessor(s):
    lower = s.lower().split()
    lower = [x[:7] for x in lower]
    for l in lower:
        yield l

lower = ["hello how are you doing today crocodilesaway","hello how are you"]

cv = CountVectorizer(analyzer=better_preprocessor)
cv.fit_transform(lower)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
cv.get_feature_names()

['are', 'crocodi', 'doing', 'hello', 'how', 'today', 'you']
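
If you want to see what a callable analyzer would hand to CountVectorizer without running scikit-learn at all, a rough sketch like the one below works. The names truncating_analyzer and max_len are my own; the sorted set of tokens is only a stand-in for the vocabulary the vectorizer would build.

```python
def truncating_analyzer(doc, max_len=7):
    # Lowercase, split on whitespace, and cut every token to max_len characters
    for token in doc.lower().split():
        yield token[:max_len]

docs = ["hello how are you doing today crocodilesaway", "hello how are you"]

# Collect the distinct tokens across all documents, sorted alphabetically
vocabulary = sorted({tok for d in docs for tok in truncating_analyzer(d)})
print(vocabulary)
# ['are', 'crocodi', 'doing', 'hello', 'how', 'today', 'you']
```

Keep in mind that a callable analyzer replaces the built-in preprocessing and tokenization, so splitting on whitespace like this keeps punctuation attached to words and keeps single-character tokens that the default tokenizer would drop.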
