How do I find the first token after an arbitrary character offset in a spaCy document?

Question

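I have a spaCy document and an arbitrary character offset n into that document. How do I find the first token boundary following that offset, i.e. the smallest m ≥ n such that m is the start of a token?

Is there some way to do this using the spaCy interface, other than looping through all the tokens?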

Answer 1

Question 1: Token offsets

How do I find the first token boundary following that offset...

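Doc, Span, and Token objects in spaCy all have a .text attribute, so both tokens and whole documents can be related back to the raw text.

In addition, spaCy gives each token two kinds of position:

i: the token's index in the document's list of tokens
idx: the token's character offset within the document's .text

So, for your example, I believe you just need something like the following: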

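>>> import spacy
>>> nlp = spacy.blank("en")  # only the tokenizer is needed here
>>> n = 10
>>> doc = nlp("here is a document with tokens in it")
>>> for token in doc:
...     if token.idx >= n:  # >= so that an offset already on a token start counts
...         m = token.idx
...         break
... 
>>> m
10
>>> doc.text[m]
'd'
>>> token.i
3
>>> token
document
>>>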

Question 2: Lookup without looping

Is there some way to do this ... other than looping ...

Unfortunately, I don't think there is any other interface at the Doc level that looks up a token by character offset.
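
If many offsets have to be resolved against the same document, one possible workaround is to collect the token start offsets once and binary-search that list for each query. This is only a sketch using Python's standard bisect module, not a spaCy feature. The helper name first_token_at_or_after is made up for illustration, and the offsets are hard-coded so the snippet runs without spaCy; with a real document they would come from [token.idx for token in doc].

```python
from bisect import bisect_left


def first_token_at_or_after(token_starts, n):
    """Return the index of the first token whose start offset is >= n.

    token_starts must be sorted in ascending order, which holds for
    token.idx values taken in document order. Returns None when no
    token starts at or after n.
    """
    i = bisect_left(token_starts, n)
    return i if i < len(token_starts) else None


# Start offsets for "here is a document with tokens in it".
# With spaCy this list would be: [token.idx for token in doc]
token_starts = [0, 5, 8, 10, 19, 24, 31, 34]

i = first_token_at_or_after(token_starts, 11)
if i is not None:
    m = token_starts[i]  # character offset of the token start
    print(i, m)
```

Building the list still walks every token once, but each later lookup takes O(log k) time for k tokens instead of a fresh scan. The returned index plays the role of token.i, so doc[i] would give the token itself.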
