Tweak spaCy spans

Question

I'm using spaCy in some NLP projects.

My texts contain passages like this:

 text='The car comprises 4 brakes 4.1, 4.2, 4.3 and 4.4 in fig. 5, all include an ESP system. This is shown in Fig. 6. Fig. 5 shows how the motors 56 and 57 are blocked. Besides the doors (44, 45) are painted blue.'

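I would like to treat "4.1, 4.2, 4.3 and 4.4" as a single entity, so that I can extract the noun phrase that precedes it.

spaCy often splits that chunk into several separate tokens.

Suppose I have regular expressions for these spans.

What is the way to define the spans?

Code so far: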

import spacy

nlp = spacy.load('/home/jovyan/shared/public/spacy/en_core_web_sm-3.2.0')

text='The car comprises 4 brakes 4.1, 4.2, 4.3 and 4.4 in fig. 5, all include an ESP system. This is shown in Fig. 6. Fig. 5 shows how the motors 56 and 57 are blocked. Besides the doors (44, 45) are painted blue.'

doc = nlp(text)
print([token.text for token in doc])

How can I define spans based on regular expressions?

Answer 1

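The spaCy docs have a chapter dedicated to rule-based matching. You can use spaCy to match spans based on "regex-like" rules, and you can also extend the pipeline with your own rules, for example to recognize named entities with them.

From the docs: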

Compared to using regular expressions on raw text, spaCy’s rule-based matcher engines and components not only let you find the words and phrases you’re looking for – they also give you access to the tokens within the document and their relationships. This means you can easily access and analyze the surrounding tokens, merge spans into single tokens or add entries to the named entities in doc.ents

As you can see in the following example taken from the docs, it is easy to define rules with spaCy's Matcher class and iterate over the results.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Add match ID "HelloWorld" with no callback and one pattern
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern])

doc = nlp("Hello, world! Hello world!")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)
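Applied to the text from the question, the same Matcher idea could look roughly like the sketch below. This is only an illustration under a few assumptions: the label REF_NUMBERS and the two regexes are names and rules I made up, I use spacy.blank("en") so that no trained model is needed, and I rely on the English tokenizer keeping a decimal such as 4.1 as one token.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only; no trained model is required
matcher = Matcher(nlp.vocab)

# Token-level regexes (hypothetical rules, adjust to your number format)
number = {"TEXT": {"REGEX": r"^\d+\.\d+$"}}
filler = {"TEXT": {"REGEX": r"^(\d+\.\d+|,|and)$"}, "OP": "*"}

# Either a whole enumeration (number ... number) or a single number;
# greedy="LONGEST" keeps only the longest of the overlapping matches.
matcher.add("REF_NUMBERS", [[number, filler, number], [number]], greedy="LONGEST")

doc = nlp("The car comprises 4 brakes 4.1, 4.2, 4.3 and 4.4 in fig. 5.")
ref_spans = [doc[start:end] for _, start, end in matcher(doc)]

for span in ref_spans:
    # The token right before the span, if there is one
    before = doc[span.start - 1].text if span.start > 0 else None
    print(span.text, "| preceded by:", before)
```

With a trained pipeline loaded instead of the blank one, you could look at doc.noun_chunks to get the full noun phrase that ends right before the span.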

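On the other hand, if you want to extend the spaCy pipeline and recognize named entities based on regular-expression-like rules, you can also use the EntityRuler class.

I modified your code to show you more or less what that would look like. Of course, you will have to work on the rules a bit so that they recognize exactly the number format you are interested in.

As you can see, instead of iterating over the tokens of the text, I iterate over the list of entities recognized by the pipeline and keep only those labeled 2_DIGIT, which are the entities of interest.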

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_lg')
text='The car comprises 4 brakes 4.1, 4.2, 4.3 and 4.4 in fig. 5, all include an ESP system. This is shown in Fig. 6. Fig. 5 shows how the motors 56 and 57 are blocked. Besides the doors (44, 45) are painted blue.'

# Add EntityRuler to pipeline
ruler = nlp.add_pipe("entity_ruler", before="ner", config={"validate": True})
patterns = [{"label": "2_DIGIT", "pattern": [{"IS_DIGIT": True}, {"IS_PUNCT": True}, {"IS_DIGIT": True}]}]
ruler.add_patterns(patterns)

# Run the pipeline on the text (the ruler must be added before this call)
doc = nlp(text)

# Print 2-Digit Ents
print([(ent.label_, text[ent.start_char:ent.end_char]) for ent in doc.ents if ent.label_ == "2_DIGIT"])
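One thing to watch for when refining the pattern above: as far as I can tell, the English tokenizer keeps a decimal such as 4.1 as a single token, so a digit/punctuation/digit sequence of three tokens is unlikely to hit it. A token-level regex is one way around that. The sketch below is a minimal illustration, assuming a blank English pipeline (no NER component, hence no before="ner") and a label REF_NUM that I picked for the example.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, so the ruler is the sole component
ruler = nlp.add_pipe("entity_ruler")

# One token whose full text looks like "<digits>.<digits>" (hypothetical rule)
decimal = {"TEXT": {"REGEX": r"^\d+\.\d+$"}}
ruler.add_patterns([{"label": "REF_NUM", "pattern": [decimal]}])

doc = nlp("The car comprises 4 brakes 4.1, 4.2, 4.3 and 4.4 in fig. 5.")
print([(ent.label_, ent.text) for ent in doc.ents])
```

This marks each number as its own entity; grouping the whole enumeration into one entity needs a longer pattern or a regex over the raw text.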

Sorry that I can't give you code that works 100% for your needs, but I think this is a good starting point for getting what you want.
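If you already have a regular expression for the raw text, as the question says, another route is to run it with re and map the character offsets back to tokens with Doc.char_span. The sketch below is a rough illustration rather than a finished solution: the regex, the label REF_NUMBERS and the shortened sample text are my own assumptions, and a blank English pipeline stands in for your model.

```python
import re
import spacy

nlp = spacy.blank("en")  # swap in your own loaded pipeline here
text = ("The car comprises 4 brakes 4.1, 4.2, 4.3 and 4.4 in fig. 5, "
        "all include an ESP system.")
doc = nlp(text)

# Decimals joined by commas or "and" (hypothetical rule, adjust as needed)
enumeration = re.compile(r"\d+\.\d+(?:(?:,\s*|\s+and\s+)\d+\.\d+)*")

spans = []
for m in enumeration.finditer(doc.text):
    # "expand" snaps the character offsets outwards to token boundaries
    span = doc.char_span(m.start(), m.end(), label="REF_NUMBERS",
                         alignment_mode="expand")
    if span is not None:
        spans.append(span)

doc.ents = spans  # register the spans as entities

# Token immediately before each span (read before merging changes indices)
preceding = [doc[s.start - 1].text for s in spans if s.start > 0]

# Optionally merge each span into a single token
with doc.retokenize() as retokenizer:
    for span in spans:
        retokenizer.merge(span)

print([(ent.label_, ent.text) for ent in doc.ents], preceding)
```

Merging is optional. If you only need the entity and its neighbour, setting doc.ents is enough, and you can keep the original tokenization.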

