Correct POS tags for numbers substituted with ## in spacy

Question

The Gigaword dataset is a huge corpus used to train abstractive summarization models. It contains summaries like these:

spain 's colonial posts #.## billion euro loss
taiwan shares close down #.## percent

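I want to process these summaries with spacy and get the correct POS tag for each token. The problem is that all numbers in the dataset were replaced with # signs, which spacy does not classify as numbers (NUM) but gives other tags.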

>>> import spacy
>>> from spacy.tokens import Doc
>>> nlp = spacy.load("en_core_web_sm")
>>> nlp.tokenizer = lambda raw: Doc(nlp.vocab, words=raw.split(' '))
>>> text = "spain 's colonial posts #.## billion euro loss"
>>> doc = nlp(text)
>>> [(token.text, token.pos_) for token in doc]
[('spain', 'PROPN'), ("'s", 'PART'), ('colonial', 'ADJ'), ('posts', 'NOUN'), ('#.##', 'PROPN'), ('billion', 'NUM'), ('euro', 'PROPN'), ('loss', 'NOUN')]

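Is there a way to customize the POS tagger so that it classifies every token consisting only of # signs and dots as a number?

I know you can replace the spacy POS tagger with your own, or fine-tune it for your domain with additional data, but I don't have tagged training data in which all numbers are replaced with #, and I would like to change the tagger as little as possible. I would prefer a regular expression or a fixed list of tokens that are always recognized as numbers.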

Answer 1

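What about replacing # with a digit?

In the first version of this answer I chose the digit 9, because it reminded me of the COBOL numeric field formats I used some 30 years ago... But then I had a look at the dataset and realized that, for proper NLP processing, at least two things should be handled correctly:

ordinal numerals (1st, 2nd, ...)
dates

Ordinal numerals need special handling whatever digit is chosen, but the digit 1 produces reasonable dates, except for the year (of course, 1111 may or may not be interpreted as a valid year, but let's play it safe). 11/11/2020 is clearly better than 99/99/9999...

Here is the code: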

import re

ic = re.IGNORECASE
subs = [
    (re.compile(r'\b1(nd)\b', flags=ic), r'2\1'),  # 1nd -> 2nd
    (re.compile(r'\b1(rd)\b', flags=ic), r'3\1'),  # 1rd -> 3rd
    (re.compile(r'\b1(th)\b', flags=ic), r'4\1'),  # 1th -> 4th
    (re.compile(r'11(st)\b', flags=ic), r'21\1'),  # ...11st -> ...21st
    (re.compile(r'11(nd)\b', flags=ic), r'22\1'),  # ...11nd -> ...22nd
    (re.compile(r'11(rd)\b', flags=ic), r'23\1'),  # ...11rd -> ...23rd
    (re.compile(r'\b1111\b'), '2020')              # 1111 -> 2020
]

text = '''spain 's colonial posts #.## billion euro loss
#nd, #rd, #th, ##st, ##nd, ##RD, ##TH, ###st, ###nd, ###rd, ###th.
ID=#nd#### year=#### OK'''

text = text.replace('#', '1')
for pattern, repl in subs:
    text = re.sub(pattern, repl, text)

print(text)
# spain 's colonial posts 1.11 billion euro loss
# 2nd, 3rd, 4th, 21st, 22nd, 23RD, 11TH, 121st, 122nd, 123rd, 111th.
# ID=1nd1111 year=2020 OK
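
The steps above can be wrapped in a single helper so that the same substitution runs on every summary before it reaches the tagger. This is only a sketch: the name hashes_to_digits is made up for illustration, and the rule table is the one from the answer with the \1 backreferences written out.

```python
import re

IC = re.IGNORECASE
# (pattern, replacement) pairs, applied in order after '#' -> '1'.
# \1 re-inserts the matched suffix, so its original case is kept.
SUBS = [
    (re.compile(r'\b1(nd)\b', flags=IC), r'2\1'),   # 1nd -> 2nd
    (re.compile(r'\b1(rd)\b', flags=IC), r'3\1'),   # 1rd -> 3rd
    (re.compile(r'\b1(th)\b', flags=IC), r'4\1'),   # 1th -> 4th
    (re.compile(r'11(st)\b', flags=IC), r'21\1'),   # ...11st -> ...21st
    (re.compile(r'11(nd)\b', flags=IC), r'22\1'),   # ...11nd -> ...22nd
    (re.compile(r'11(rd)\b', flags=IC), r'23\1'),   # ...11rd -> ...23rd
    (re.compile(r'\b1111\b'), '2020'),              # 1111 -> 2020
]


def hashes_to_digits(text):
    """Hypothetical helper: turn '#' placeholders into plausible digits."""
    text = text.replace('#', '1')
    for pattern, repl in SUBS:
        text = pattern.sub(repl, text)
    return text


print(hashes_to_digits("taiwan shares close down #.## percent"))
# taiwan shares close down 1.11 percent
print(hashes_to_digits("the ##nd summit opened in ####"))
# the 22nd summit opened in 2020
```

The returned string can then be passed to nlp(...) as in the question. Whether the tagger labels every such token as NUM still depends on the model, so it is worth spot-checking a few summaries.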

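If the preprocessing of the corpus converts every digit to # anyway, this transformation loses no information. Some "true" # characters would become a 1, but that is probably a minor problem compared to numbers not being recognized as such. Furthermore, in a visual inspection of about 500000 lines of the dataset, I was not able to find any candidate for a "true" #.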

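N.B.: The \b in the regular expressions above stands for "word boundary", i.e., the boundary between a \w (word) and a \W (non-word) character, where a word character is any alphanumeric character (or underscore). The \1 in the replacement stands for the first group, i.e., the first pair of parentheses (further info: https://www.regular-expressions.info/replacebackref.html). Using \1 preserves the case of the matched text, which would not be possible with a fixed replacement string like 2nd. I later found out that your dataset is normalized to all lower case, but I decided to keep it generic.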
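
A small illustration of both points, using only Python's re module. The sample strings are invented for the demonstration and do not come from the dataset.

```python
import re

pattern = re.compile(r'\b1(rd)\b', flags=re.IGNORECASE)

# With \1 the matched suffix is copied back, so its case survives.
print(pattern.sub(r'3\1', '1rd and 1RD'))   # 3rd and 3RD

# With a fixed replacement string the suffix is always lower case.
print(pattern.sub('3rd', '1rd and 1RD'))    # 3rd and 3rd

# \b keeps the pattern from firing inside a longer token:
# there is no word boundary between 'd' and the following '1'.
print(pattern.sub(r'3\1', 'ID=1rd1111'))    # ID=1rd1111
```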

If you need to get the text with the #s back from the tagged tokens, it is simply:

token.text.replace('0','#').replace('1','#').replace('2','#').replace('3','#').replace('4','#')
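
The chain of replace calls can also be written as a single translation table. The sketch below assumes, as the answer does, that the only digits in the text are the ones the substitution can produce (0 to 4); digits_to_hashes is a hypothetical name used for illustration.

```python
# Map every digit the substitution can produce (0-4) back to '#'.
TO_HASH = str.maketrans('01234', '#####')


def digits_to_hashes(token_text):
    """Hypothetical helper: restore '#' placeholders in a token's text."""
    return token_text.translate(TO_HASH)


print(digits_to_hashes('1.11'))  # #.##
print(digits_to_hashes('23rd'))  # ##rd
print(digits_to_hashes('2020'))  # ####
```

With a spacy token this would be called as digits_to_hashes(token.text).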
