spaCy: why does nlp() work for a single string while nlp.pipe() works for a list of strings?

Question

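I recently ran into some strange behavior when processing strings with spaCy.

If the input is a single string object, I have to use nlp(string),

whereas for a list of string elements I have to use nlp.pipe(list).

Here is an example: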

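import spacy

# Assumption: the original post does not show how nlp was created; a blank English pipeline is used here.
nlp = spacy.blank('en')

string = 'this is a string to be process by nlp'
doc = ['this', 'is', 'a', 'string', 'list', 'to', 'be', 'processed', 'by', 'spacy']

stringprocess = list(nlp(string))  # iterating a Doc gives its Token objects
listprocess = list(nlp.pipe(doc))  # one Doc per element of the list

listprocess
stringprocess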

Why is that? I think it must be related to the fact that nlp.pipe() behaves as a generator.

What is the reason?

Thanks.

Answer 1

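spaCy does this because generators are more efficient. A generator produces its items lazily and can be consumed only once, so it uses less memory than a list that holds every result at the same time.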
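To make the memory point concrete, here is a minimal sketch in plain Python (no spaCy involved). The function name `squares` and the sizes are made up for illustration; the only point is that the generator object stays small and can be walked through just once.

```python
import sys


def squares(n):
    """Yield squares lazily, one at a time (hypothetical example function)."""
    for i in range(n):
        yield i * i


as_list = [i * i for i in range(100_000)]  # every result is stored up front
as_gen = squares(100_000)                  # nothing is computed yet

list_bytes = sys.getsizeof(as_list)  # size of the list container itself
gen_bytes = sys.getsizeof(as_gen)    # size of the small generator object

first_pass = sum(as_gen)   # consumes the generator
second_pass = sum(as_gen)  # the generator is already exhausted, so this is 0
```

If you need the results more than once, wrap the generator in list(...), which is exactly what the question does with list(nlp.pipe(doc)).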

According to their documentation, nlp.pipe does not apply the nlp pipeline to the texts one by one; it processes them in batches.
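As a rough sketch of the two call styles side by side: nlp(text) takes one string and returns one Doc, while nlp.pipe(texts) takes an iterable of strings and yields one Doc per string. The blank English pipeline below is an assumption so that no model download is needed; a loaded model is called the same way.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only), used purely for illustration.
nlp = spacy.blank("en")

text = "this is a string to be processed by nlp"
texts = ["this is the first text", "this is the second text"]

doc = nlp(text)                         # one string in, one Doc out
tokens = [token.text for token in doc]  # iterating a Doc yields its tokens

docs_stream = nlp.pipe(texts)  # iterable of strings in, lazy stream of Docs out
docs = list(docs_stream)       # consuming the stream gives one Doc per input string
leftover = list(docs_stream)   # the stream is exhausted after one pass
```

So in the question, list(nlp(string)) is a list of tokens, while list(nlp.pipe(doc)) is a list of Doc objects, one for each element of the input list.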

In addition, you can configure the batch size in nlp.pipe to tune performance for your system:

Process the texts as a stream using nlp.pipe and buffer them in batches, instead of one-by-one. This is usually much more efficient.
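For example, the batch size can be passed through the batch_size argument of nlp.pipe. This is only a sketch: the blank pipeline, the generated texts, and the value 100 are assumptions, and the best value depends on your texts and hardware.

```python
import spacy

# Assumption: a blank English pipeline stands in for whatever model you actually use.
nlp = spacy.blank("en")

# Made-up input data for illustration.
texts = [f"document number {i}" for i in range(1000)]

# Texts are buffered and processed 100 at a time instead of one by one.
docs = list(nlp.pipe(texts, batch_size=100))
```

The output is the same regardless of batch size; only throughput and memory use change.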

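If your goal is to process a large stream of data with nlp.pipe, it is more efficient to write a streamer/generator that yields texts from a database/filesystem as they are needed than to load everything into memory and then process it one by one.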
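A minimal sketch of that idea, assuming the texts live in a file with one text per line. The helper name `stream_lines`, the file name, and the blank pipeline are all hypothetical; a database cursor could be wrapped in a generator the same way.

```python
import tempfile
from pathlib import Path

import spacy


def stream_lines(path):
    """Yield one non-empty line at a time instead of reading the whole file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield line


# Assumption: a blank English pipeline, used only so the sketch runs without a model.
nlp = spacy.blank("en")

with tempfile.TemporaryDirectory() as tmp:
    # Small demo file created on the fly; in practice this would be your real data.
    path = Path(tmp) / "texts.txt"
    path.write_text("first text\n\nsecond text here\n", encoding="utf-8")

    # nlp.pipe pulls texts from the generator as it needs them.
    token_counts = [len(doc) for doc in nlp.pipe(stream_lines(path))]
```

Here only a per-document number is kept, so neither the raw texts nor the Doc objects pile up in memory.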

spacy pipe
