How to modify spacy tokenizer to split URLs into individual words
I want to modify the default tokenizer to split URLs into individual words. This is what I have so far:
import spacy
nlp = spacy.blank('en')
infixes = nlp.Defaults.infixes + [r'\.']
infix_regex = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer
print(list(nlp('www.internet.com')))
# ['www.internet.com']
# want it to be ['www', '.', 'internet', '.', 'com']
I've been looking at the tokenizer's usage examples and source code, but I can't figure out this particular case.
You don't see the result you want because the URL is caught first by URL_MATCH (which has higher precedence):
import spacy
nlp = spacy.blank('en')
txt = 'Check this out www.internet.com'
doc = nlp(txt)
nlp.tokenizer.explain(txt)
[('TOKEN', 'Check'),
('TOKEN', 'this'),
('TOKEN', 'out'),
('URL_MATCH', 'www.internet.com')]
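If you want to confirm which rule is responsible, the default URL pattern can be imported from spaCy's source (import path from spaCy v3; it may differ in other versions):

from spacy.lang.tokenizer_exceptions import URL_MATCH
print(URL_MATCH('www.internet.com') is not None)
# True -- the whole string matches the URL pattern,
# so it is emitted as a single token and the infix rules never run on it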
One possible solution:
nlp.tokenizer.url_match = None            # disable URL matching entirely
infixes = nlp.Defaults.infixes + [r'\.']  # treat '.' as an infix
infix_regex = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer
doc = nlp(txt)
list(doc)
[Check, this, out, www, ., internet, ., com]
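You can re-run explain to verify that the dots are now handled by the infix rule; the output should look something like this (labels are from a spaCy v3 run and may vary slightly across versions):

nlp.tokenizer.explain(txt)
[('TOKEN', 'Check'),
('TOKEN', 'this'),
('TOKEN', 'out'),
('TOKEN', 'www'),
('INFIX', '.'),
('TOKEN', 'internet'),
('INFIX', '.'),
('TOKEN', 'com')]

Keep in mind that setting url_match to None disables URL detection for all input, so full URLs like https://internet.com/page will also be broken up by the prefix/suffix/infix rules.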