Run code on the GPU or parallelize it somehow

Question

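I am running an NLP program in which I preprocess the text before running the main algorithm. The preprocessing is simple: I have a very long array of strings (about 20K words per string, 30K strings in total). I want to tokenize and stem each string with nltk.stem.porter.PorterStemmer: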

from nltk.stem.porter import PorterStemmer
from nltk import word_tokenize
import pandas as pd

def tokenize_item(item):
    tokens = word_tokenize(item)
    stems = []
    for token in tokens:
        stems.append(PorterStemmer().stem(token))
    return stems

def tokenize_text(text):
    return [' '.join(tokenize_item(txt.lower())) for txt in text]

text = pd.read_csv('texts.csv')['input_texts'].to_numpy()
tokenized_text = tokenize_text(text)

I would like to parallelize this process efficiently or (preferably) run it on the GPU. Does anyone know how I could do either (or both)? Thanks.

Answer 1

Where speed is a concern, spaCy is often preferable to NLTK. It offers both batch processing and GPU integration.

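For an iterable of strings, this is the basic procedure for batch processing. Note that there are many options you can tune, such as disabling parts of the pipeline you don't need and setting the batch size, all of which are described in detail in the spaCy documentation.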

import spacy

nlp = spacy.load("en_core_web_trf") # Or another model

docs = nlp.pipe(text) # with text being an iterable of strings
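
The tuning options mentioned above might look roughly like the following. This is only a sketch under assumptions, not a tuned setup: the helper names load_pipeline and pipe_texts are hypothetical, and the model name, the disabled components, the batch size and the process count are placeholder values picked for illustration. spacy.prefer_gpu() returns True only when a GPU could actually be activated, and since combining the GPU with several worker processes is generally not advisable, the sketch falls back to a single process in that case.

```python
def load_pipeline(model_name="en_core_web_sm", use_gpu=False):
    """Load a spaCy pipeline, optionally trying the GPU first (hypothetical helper)."""
    import spacy  # imported lazily so this module loads even without spaCy

    # prefer_gpu() has to be called before the model is loaded.
    on_gpu = spacy.prefer_gpu() if use_gpu else False
    # Assumption: the parser and NER are not needed for lemmatization.
    nlp = spacy.load(model_name, disable=["parser", "ner"])
    return nlp, on_gpu


def pipe_texts(nlp, texts, on_gpu=False, batch_size=32, n_process=4):
    """Stream Doc objects from nlp.pipe with explicit batching settings."""
    if on_gpu:
        # Assumption: keep a single process whenever the GPU is in use.
        n_process = 1
    return nlp.pipe(texts, batch_size=batch_size, n_process=n_process)
```

With these helpers, nlp, on_gpu = load_pipeline(use_gpu=True) followed by docs = pipe_texts(nlp, text, on_gpu) would take the place of the two lines above. When n_process is greater than 1, the calling code may need to sit under an if __name__ == "__main__": guard, depending on the platform.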

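You end up with a generator of Doc objects, which have many useful methods. It looks like you want a string containing only the lowercased lemmas, which you can get like this: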

def get_lemmas(doc):
    return ' '.join(tok.lemma_.lower() for tok in doc)

lemma_docs = (get_lemmas(doc) for doc in docs)
# lemma_docs = list(get_lemmas(doc) for doc in docs) # if you need all texts at once
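
If you would rather keep the NLTK stemmer from the question, the strings are independent of one another, so the work can be spread over CPU processes. As far as I know, PorterStemmer is plain Python with no GPU path, so this is a CPU-only sketch. The names stem_text and parallel_apply are hypothetical, the chunk size is an arbitrary starting value, and word_tokenize assumes the NLTK tokenizer data is already downloaded.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import lru_cache


@lru_cache(maxsize=1)
def _get_stemmer():
    # Imported lazily; each worker process builds its own stemmer once
    # instead of creating a new PorterStemmer for every token.
    from nltk.stem.porter import PorterStemmer
    return PorterStemmer()


def stem_text(text):
    """Lowercase, tokenize and stem one string, as in the question."""
    from nltk import word_tokenize
    stemmer = _get_stemmer()
    return " ".join(stemmer.stem(tok) for tok in word_tokenize(text.lower()))


def parallel_apply(func, items, workers=None, chunksize=16):
    """Apply func to every item, using worker processes unless workers == 1."""
    if workers == 1:
        # Serial path: avoids process start-up cost for small inputs.
        return [func(item) for item in items]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # chunksize sends several items per task to cut inter-process overhead.
        return list(pool.map(func, items, chunksize=chunksize))
```

Calling parallel_apply(stem_text, text) from under an if __name__ == "__main__": guard would then stand in for tokenize_text(text). With workers left as None, the executor chooses the number of processes itself.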

