spaCy tokenization adds extra whitespace to dates with a hyphen separator when I manually build the Doc

Question

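I have been trying to solve a problem with the spaCy Tokenizer for a while, without success. I am also not sure whether the problem lies in the tokenizer or in some other part of the pipeline.

Any help is welcome!

Description

I have an application that, for reasons outside the scope of this question, creates a spaCy Doc from the spaCy vocab and a list of tokens taken from a string (see the code below). Note that while this is not the simplest or most common way to do it, according to the spaCy docs it can be done.

However, when I create a Doc for a text that contains compound words or dates with a hyphen as the separator, the behavior I get is not what I expected.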

import spacy
from spacy.tokens import Doc

# My current way
doc = Doc(nlp.vocab, words=tokens)  # tokens is a well-defined list of tokens for a certain string

# Standard way
doc = nlp("My text...")

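For example, with the following text, if I create the Doc using the standard procedure, the spaCy Tokenizer recognizes "-" as a token, but the Doc text is the same as the input text. In addition, the spaCy NER model correctly recognizes the DATE entity.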

import spacy

doc = nlp("What time will sunset be on 2022-12-24?")
print(doc.text)

tokens = [str(token) for token in doc]
print(tokens)

# Show entities
print(doc.ents[0].label_)
print(doc.ents[0].text)

Output:

What time will sunset be on 2022-12-24?
['What', 'time', 'will', 'sunset', 'be', 'on', '2022', '-', '12', '-', '24', '?']

DATE
2022-12-24

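On the other hand, if I create the Doc from the model's vocab and the previously computed tokens, the result is different. Note that for simplicity I am using the tokens from doc, so I am sure there are no differences in the tokens. Also note that I manually run each pipeline model on the doc in the correct order, so at the end of this process I should, in theory, get the same results.

However, as you can see in the output below, while the Doc's tokens are the same, the Doc's text is different: there are blank spaces between the digits and the date separators.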

doc2 = Doc(nlp.vocab, words=tokens)

# Run each model in pipeline
for model_name in nlp.pipe_names:
    pipe = nlp.get_pipe(model_name)
    doc2 = pipe(doc2)

# Print text and tokens
print(doc2.text)
tokens = [str(token) for token in doc2]
print(tokens)

# Show entities
print(doc2.ents[0].label_)
print(doc2.ents[0].text)

Output:

what time will sunset be on 2022 - 12 - 24 ? 
['what', 'time', 'will', 'sunset', 'be', 'on', '2022', '-', '12', '-', '24', '?']

DATE
2022 - 12 - 24

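I know it must be something silly that I am missing, but I can't see it.

Could someone explain what I am doing wrong and point me in the right direction?

Thanks a lot!

Edit

Following Talha Tayyab's suggestion, I had to create an array of booleans with the same length as my list of tokens, indicating for each token whether it is followed by a space. I then pass this array to the Doc constructor as follows: doc = Doc(nlp.vocab, words=words, spaces=spaces).

To compute this list of booleans from my original text string and the list of tokens, I implemented the following vanilla function: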

from typing import List

def get_spaces(self, text: str, tokens: List[str]) -> List[bool]:

    # One flag per matched token: True if a space follows it
    spaces = []
    # Lowercased copy of the text, consumed from the left
    t = text.lower()

    # Iterate over tokens
    for token in tokens:

        # Tokens that do not match the start of the remaining text are skipped
        if t.startswith(token.lower()):

            t = t[len(token):]  # Remove token

            # If a space follows the removed token
            if len(t) > 0 and t[0] == " ":
                spaces.append(True)
                t = t[1:]  # Remove space
            else:
                spaces.append(False)

    return spaces

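With these two improvements in my code, the results are as expected. However, I now have the following question:

Is there a more spaCy-like way to compute the whitespace, instead of using my vanilla implementation?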

Answer 1

Please try this:

from spacy.tokens import Doc

# One flag per token: 1 if the token is followed by a space, 0 otherwise
doc2 = Doc(nlp.vocab, words=tokens, spaces=[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

# Run each model in pipeline
for model_name in nlp.pipe_names:
    pipe = nlp.get_pipe(model_name)
    doc2 = pipe(doc2)

# Print text and tokens
print(doc2.text)
tokens = [str(token) for token in doc2]
print(tokens)

# Show entities
print(doc2.ents[0].label_)
print(doc2.ents[0].text)

# You can also replace 0 with False and 1 with True

Here is the full syntax:

doc = Doc(nlp.vocab, words=words, spaces=spaces)

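spaces is a list of booleans indicating whether each word is followed by a space. If specified, it must have the same length as words. It defaults to a sequence of True.

So you can choose which spaces you want to have and which you don't.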
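
To make the effect of the default concrete, here is a minimal plain-Python sketch of how the text is put together from words and spaces. It is only a model of the documented behavior, not spaCy's own code, and the name join_tokens is made up for this illustration.

```python
from typing import List, Optional


def join_tokens(words: List[str], spaces: Optional[List[bool]] = None) -> str:
    """Hypothetical helper: model how a Doc's text follows from words and spaces."""
    if spaces is None:
        # Mirrors the documented default: every token is followed by a space.
        spaces = [True] * len(words)
    if len(spaces) != len(words):
        raise ValueError("spaces must have the same length as words")
    # Each token contributes its text plus one trailing space if its flag is set.
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))


words = ["What", "time", "will", "sunset", "be", "on", "2022", "-", "12", "-", "24", "?"]

# Default flags: a space after every token, including the last one.
print(repr(join_tokens(words)))  # 'What time will sunset be on 2022 - 12 - 24 ? '

# Explicit flags: spaces only after the first six tokens.
spaces = [True] * 6 + [False] * 6
print(repr(join_tokens(words, spaces)))  # 'What time will sunset be on 2022-12-24?'
```

This matches the text printed in the question, trailing space included, when no spaces argument is passed.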

Reference: https://spacy.io/api/doc
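
On the follow-up about computing the spaces list from the original text: I am not aware of a dedicated spaCy helper for an arbitrary external token list, so the following is only a sketch in plain Python. The name compute_spaces is hypothetical. It assumes the tokens appear in the text in order and, like the function in the question, compares case-insensitively. Unlike that function, it raises an error when a token cannot be found, so the result always has exactly one flag per token.

```python
from typing import List


def compute_spaces(text: str, tokens: List[str]) -> List[bool]:
    """Hypothetical helper: one flag per token, True if a space follows it in text."""
    lowered = text.lower()  # assumption: case-insensitive match, as in the question
    spaces = []
    position = 0
    for token in tokens:
        needle = token.lower()
        start = lowered.find(needle, position)
        if start == -1:
            raise ValueError(f"token {token!r} not found after position {position}")
        # Move just past the token and look at the next character.
        position = start + len(needle)
        spaces.append(lowered[position:position + 1] == " ")
    return spaces


text = "What time will sunset be on 2022-12-24?"
tokens = ["what", "time", "will", "sunset", "be", "on", "2022", "-", "12", "-", "24", "?"]
print(compute_spaces(text, tokens))
```

If the tokens come from an existing spaCy Doc rather than an external list, the same flags can be read directly from the tokens with bool(token.whitespace_). Note that this sketch skips any text between tokens that is not itself a token, so runs of several spaces are not preserved.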


python

nlp

tokenize

spacy-3