Why does spaCy not preserve intra-word-hyphens during tokenization like Stanford CoreNLP does?
spaCy version: 2.0.11
Python version: 3.6.5
OS: Ubuntu 16.04
My sample sentences:
Marketing-Representative- won't die in car accident.
or
Out-of-box implementation
Expected tokens:
["Marketing-Representative", "-", "wo", "n't", "die", "in", "car", "accident", "."]
["Out-of-box", "implementation"]
spaCy tokens (default tokenizer):
["Marketing", "-", "Representative-", "wo", "n't", "die", "in", "car", "accident", "."]
["Out", "-", "of", "-", "box", "implementation"]
I tried creating a custom tokenizer, but it does not handle all the edge cases that spaCy covers with tokenizer_exceptions (code below):
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex
import re
nlp = spacy.load('en')
prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')
def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=None)

nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp("Marketing-Representative- won't die in car accident.")
for token in doc:
    print(token.text)
Output:
Marketing-Representative-
won
'
t
die
in
car
accident
.
I need someone to guide me toward the proper way to do this.
Either by changing the regexes above or by any other approach. I also tried spaCy's rule-based Matcher, but could not create a rule that handles hyphens between more than two words, e.g. "out-of-box", so that a matcher could be created for use with span.merge().
Either way, I need words containing intra-word hyphens to become single tokens, the way Stanford CoreNLP handles them. See the sketch below for one way the Matcher route could be made to work.
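For reference, here is a minimal sketch (not part of the original question) of how the Matcher + span.merge() route could cope with chains longer than two words: since a single pattern cannot express an arbitrary number of WORD-HYPHEN-WORD repetitions, it matches one triple at a time and merges repeatedly until no standalone hyphen is left. The helper name merge_hyphenated is hypothetical, and the code assumes the spaCy 2.x API (Matcher.add with an on_match callback, Span.merge):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
# one WORD-HYPHEN-WORD triple; longer chains are handled by repeated merging
matcher.add('HYPHENATED', None,
            [{'IS_PUNCT': False}, {'ORTH': '-'}, {'IS_PUNCT': False}])

def merge_hyphenated(doc):
    while True:
        spans = [doc[s:e] for _, s, e in matcher(doc)
                 # skip hyphens surrounded by spaces, e.g. "New - York"
                 if not doc[s].whitespace_ and not doc[s + 1].whitespace_]
        if not spans:
            return doc
        spans[0].merge()  # offsets shift after merging, so re-match next pass

doc = merge_hyphenated(nlp("Out-of-box implementation"))
print([t.text for t in doc])  # expected: ['Out-of-box', 'implementation']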
Although it is not documented on the spaCy usage site, it looks like we just need to add a regex for the *fix we are working with, in this case infix. Also, it appears we can extend nlp.Defaults.prefixes with custom regexes:
infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")
This gives you the desired result. There is no need to set the defaults for prefix and suffix, since we are not using them.
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex
import re
nlp = spacy.load('en')
infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")
infix_re = spacy.util.compile_infix_regex(infixes)
def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)
nlp.tokenizer = custom_tokenizer(nlp)
s1 = "Marketing-Representative- won't die in car accident."
s2 = "Out-of-box implementation"
for s in s1, s2:
    doc = nlp("{}".format(s))
    print([token.text for token in doc])
Result:
$python3 /tmp/nlp.py
['Marketing-Representative-', 'wo', "n't", 'die', 'in', 'car', 'accident', '.']
['Out-of-box', 'implementation']
You may want to tighten the added regexes to make them more robust for other kinds of tokens that resemble the patterns applied here.
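For instance, one illustrative way to do that (my own tweak, not part of this answer) is to constrain each added pattern with lookarounds so it only fires in the intended context; the tuple below could be swapped into the same custom_tokenizer:

infixes = nlp.Defaults.prefixes + (
    # split "." or "/" only between lowercase letters (e.g. "and/or"),
    # so decimals such as "3.5" are not split by this pattern
    r"(?<=[a-z])[./](?=[a-z])",
    # split "-" only when flanked by digits (e.g. "2-3"),
    # leaving "Out-of-box" and "Marketing-Representative" intact
    r"(?<=[0-9])-(?=[0-9])",
)
infix_re = spacy.util.compile_infix_regex(infixes)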
I also wanted to modify spaCy's tokenizer to match CoreNLP's semantics more closely. Pasted below is what I came up with; it addresses the hyphenation issues in this thread (including trailing hyphens) along with some additional fixes. I had to copy the default infix expressions and modify them, but was able to simply append a new suffix expression:
import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
def initializeTokenizer(nlp):
    prefixes = nlp.Defaults.prefixes

    infixes = (
        LIST_ELLIPSES
        + LIST_ICONS
        + [
            r'(?<=[0-9])[+\-\*^](?=[0-9-])',
            r'(?<=[{al}{q}])\.(?=[{au}{q}])'.format(
                al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
            ),
            # REMOVE: commented out regex that splits on hyphens between letters:
            # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
            # EDIT: remove split on slash between letters, and add comma
            # r'(?<=[{a}0-9])[:<>=/](?=[{a}])'.format(a=ALPHA),
            r'(?<=[{a}0-9])[:<>=,](?=[{a}])'.format(a=ALPHA),
            # ADD: ampersand as an infix character except for dual upper FOO&FOO variant
            r'(?<=[{a}0-9])[&](?=[{al}0-9])'.format(a=ALPHA, al=ALPHA_LOWER),
            r'(?<=[{al}0-9])[&](?=[{a}0-9])'.format(a=ALPHA, al=ALPHA_LOWER),
        ]
    )

    # ADD: a suffix to split on a trailing hyphen
    custom_suffixes = [r'[-]']
    suffixes = nlp.Defaults.suffixes
    suffixes = tuple(list(suffixes) + custom_suffixes)

    infix_re = spacy.util.compile_infix_regex(infixes)
    suffix_re = spacy.util.compile_suffix_regex(suffixes)

    nlp.tokenizer.suffix_search = suffix_re.search
    nlp.tokenizer.infix_finditer = infix_re.finditer
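The answer does not show a call site; a hypothetical usage (my addition) on the thread's two examples would be:

nlp = spacy.load('en')
initializeTokenizer(nlp)

for s in ("Marketing-Representative- won't die in car accident.",
          "Out-of-box implementation"):
    print([t.text for t in nlp(s)])
# expected:
# ['Marketing-Representative', '-', 'wo', "n't", 'die', 'in', 'car', 'accident', '.']
# ['Out-of-box', 'implementation']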