Spacy custom tokenizer to include hyphenated words as single tokens using an infix regex
I want to include hyphenated words, e.g. long-term, self-esteem, etc., as single tokens in Spacy. After looking at some similar posts on Stack Overflow, GitHub, its documentation and elsewhere, I also wrote a custom tokenizer, as shown below:
import re
import spacy
from spacy.tokenizer import Tokenizer

prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')

def custom_tokenizer(nlp):
    # Build a tokenizer that uses only the custom prefix/suffix/infix patterns above
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=None)
nlp = spacy.load('en_core_web_lg')
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp(u'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it\'s a male-dominated profession.')
[token.text for token in doc]
So for this sentence:
'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it's a male-dominated profession.'
Now, the tokens after incorporating the custom Spacy Tokenizer are:
'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of',
'“medicine', '”', 'has', 'become', 'a',
'profession', ';', 'and', 'more', 'importantly', ',',
"it's", 'a', 'male-dominated', 'profession', '.'
Earlier, the tokens before this change were:
'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male', '-', 'dominated', 'profession', '.'
And the expected tokens should be:
'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male-dominated', 'profession', '.'
To summarise, it can be seen that...
- the hyphenated word is now kept as a single token, and other punctuation apart from double quotes and apostrophes is also handled as expected...
- ...but now the apostrophes and double quotes no longer show the earlier or the expected behaviour.
- I have tried different permutations and combinations for the infix regex compilation, but have made no progress on this problem.
Using the default prefix_re and suffix_re gives the expected output:
import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex
def custom_tokenizer(nlp):
    # Keep the custom infix pattern, but reuse Spacy's default prefix and suffix rules
    infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=None)
nlp = spacy.load('en')
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp(u'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it\'s a male-dominated profession.')
[token.text for token in doc]
['Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male-dominated', 'profession', '.']
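For context on why this works: Spacy's default prefix and suffix rules already split off quote characters and the 's clitic, which is what restores the earlier behaviour for '“medicine”' and "it's". A minimal check of that claim, assuming Spacy v2.x defaults (a blank English pipeline is used here purely for illustration, so no model download is needed):

from spacy.lang.en import English
from spacy.util import compile_prefix_regex, compile_suffix_regex

nlp = English()  # blank English pipeline; nlp.Defaults carries the default affix rules
prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

# The opening quote should be handled as a prefix, the closing quote and 's as suffixes.
print(prefix_re.search('“medicine”').group())   # expected: “
print(suffix_re.search('“medicine”').group())   # expected: ”
print(suffix_re.search("it's").group())         # expected: 's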
If you want to dig into why your regexes don't work the way Spacy's defaults do, here are links to the relevant source code:
The prefixes and suffixes are defined here:
https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py
with references to the characters (e.g. quotes, hyphens, etc.) defined here:
https://github.com/explosion/spaCy/blob/master/spacy/lang/char_classes.py
and the functions used to compile them (e.g. compile_prefix_regex):
https://github.com/explosion/spaCy/blob/master/spacy/util.py
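As an alternative to replacing the whole tokenizer, a commonly suggested approach built from those same default rule files is to rebuild only the infix patterns with the letter-hyphen-letter rule left out, and assign the result to the existing tokenizer, keeping the default prefixes, suffixes and tokenizer exceptions intact. A sketch under that assumption (the exact default pattern list may vary between Spacy versions):

import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

nlp = spacy.load('en_core_web_lg')

# Default infix patterns, minus the rule that splits on hyphens between letters.
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        # intentionally omitted: the hyphen-splitting rule, roughly
        # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS) in char_classes terms
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)
infix_re = compile_infix_regex(infixes)

# Only infix splitting changes; prefixes, suffixes and exceptions stay default.
nlp.tokenizer.infix_finditer = infix_re.finditer

doc = nlp("Note: it's a male-dominated profession.")
print([t.text for t in doc])
# expected: ['Note', ':', 'it', "'s", 'a', 'male-dominated', 'profession', '.']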