How to stop Gensim simple_preprocess from removing digits?

Question

I am having trouble preprocessing some data with gensim.utils.simple_preprocess. In short, I noticed that simple_preprocess removes digits from my text, which I don't want. For example, I have this code:

from gensim.utils import simple_preprocess

my_text = ["I am doing activity number 1", "Instead, I am doing the number 2"]

def gen_words(texts):
    # Tokenize each document into a list of lowercase tokens.
    final = []
    for text in texts:
        new = simple_preprocess(text, deacc=True, min_len=1)
        final.append(new)
    return final

solution = gen_words(my_text)

print(solution)

The output is:

[['i', 'am', 'doing', 'activity', 'number'], ['instead', 'i', 'am', 'doing', 'the', 'number']]

I would like to get this instead:

[['i', 'am', 'doing', 'activity', 'number', '1'], ['instead', 'i', 'am', 'doing', 'the', 'number', '2']]

How can I keep the digits from being removed? I also tried setting min_len=0, but that did not help.

Answer 1

The simple_preprocess() function is just one rather simple convenience option for tokenizing the text in a string into a list of tokens.

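It is not especially well tuned for any particular need, and it has no configurable option to retain tokens that don't match its hardcoded pattern (PAT_ALPHABETIC). That pattern does not match digits, so numeric tokens such as "1" and "2" are dropped regardless of min_len.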

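Many projects will want to apply their own tokenization/preprocessing instead, better suited to their data and problem domain. If you need ideas for how to start, you can consult the actual source code of simple_preprocess() (and the other functions it relies on, such as tokenize() and simple_tokenize()) that Gensim uses: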

https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/utils.py
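
As a starting point, here is a minimal sketch of such a custom tokenizer, using only the standard library. The names preprocess_keep_digits, strip_accents and TOKEN_PATTERN are hypothetical and not part of Gensim. The sketch assumes that splitting on runs of \w characters (letters, digits, underscore) is good enough for your data, and it only loosely mimics the deacc, min_len and max_len options of simple_preprocess().

```python
import re
import unicodedata

# Assumption: a token is any run of "word" characters, digits included.
TOKEN_PATTERN = re.compile(r"\w+", re.UNICODE)


def strip_accents(text):
    """Remove combining accent marks, e.g. 'café' -> 'cafe'."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", without_marks)


def preprocess_keep_digits(text, deacc=False, min_len=1, max_len=15):
    """Lowercase and tokenize `text`, keeping numeric tokens.

    Hypothetical helper, not Gensim's API.
    """
    text = text.lower()
    if deacc:
        text = strip_accents(text)
    return [
        tok
        for tok in TOKEN_PATTERN.findall(text)
        # Length filter, plus skipping tokens that start with an underscore.
        if min_len <= len(tok) <= max_len and not tok.startswith("_")
    ]


my_text = ["I am doing activity number 1", "Instead, I am doing the number 2"]
solution = [preprocess_keep_digits(text, deacc=True) for text in my_text]
print(solution)
```

With the two example sentences from the question, this yields the token lists you asked for, with '1' and '2' kept as the last token of each list. If you need different rules (for example, keeping decimals like 3.14 as one token), adjust TOKEN_PATTERN accordingly.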

python

gensim