Vectorized form of cleaning function for NLP

Question

I wrote the following function to clean the text notes in my dataset:

import spacy
nlp = spacy.load("en")
def clean(text):
    """
    Text preprocessing for English text
    """
    # Apply spacy to the text
    doc=nlp(text)
    # Lemmatization and noise removal (stop words, digits and punctuation)
    tokens=[token.lemma_.strip() for token in doc if 
            not token.is_stop and not nlp.vocab[token.lemma_].is_stop # Remove stop words
            and not token.is_punct # Remove punctuation
            and not token.is_digit # Remove digits
           ]
    # Rebuild the text
    text=" ".join(tokens)

    return text.lower()

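The problem is that cleaning the text of my whole dataset takes a very long time. (My dataset has 70k rows, with 100 to 5000 words per row.)

I tried using swifter to run the apply method on multiple threads: data.note_line_comment.swifter.apply(clean)

But it didn't really help, as it still took almost an hour.

I was wondering whether there is any way to write a vectorized form of my function, or some other way to speed up the process. Any ideas?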

Answer 1

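Short answer

This type of problem inherently takes time.

Long answer

Use regular expressions
Change the spacy pipeline

The more information about the strings you need in order to make a decision, the longer it will take.

The good news is that if your text cleaning is relatively simple, a few regular expressions might do the trick.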
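As a rough illustration of the regular-expression route, here is a minimal sketch that uses only the standard library. The function name clean_regex and the tiny STOP_WORDS set are hypothetical placeholders, not part of the question's code. It does no lemmatization, treats anything other than the letters a-z as noise (so "don't" becomes "don t" and accented letters are dropped), and is therefore not equivalent to the spacy-based clean function.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# a real list (such as spacy's) is much longer.
STOP_WORDS = frozenset({"a", "an", "the", "is", "are", "of", "and", "to", "in"})

# Matches runs of anything that is not a lowercase ASCII letter or whitespace,
# which covers digits and punctuation after lowercasing.
NON_LETTERS = re.compile(r"[^a-z\s]+")


def clean_regex(text):
    """Lowercase the text, drop digits, punctuation and stop words."""
    text = NON_LETTERS.sub(" ", text.lower())
    return " ".join(tok for tok in text.split() if tok not in STOP_WORDS)


print(clean_regex("The 3 cats are sitting, in 2 boxes!"))
```

Assuming the column holds plain strings, this could be applied the same way as before, for example data.note_line_comment.apply(clean_regex).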

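Otherwise you are using the spacy pipeline to help remove bits of text, which is costly, since it does many things by default:

Tokenisation
Lemmatisation
Dependency parsing
NER
Chunking

Alternatively, you can try your task again and turn off the aspects of the spacy pipeline you don't want, which may speed it up quite a bit.

For example, maybe turn off named entity recognition, tagging and dependency parsing...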

nlp = spacy.load("en", disable=["parser", "tagger", "ner"])

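Then try again; it should be faster. Since the lemmatizer may rely on the tagger's output, check that the lemmas are still what you expect after disabling it.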

Tags: python, multithreading, pandas, spacy, swifter