How can I match a phone number of type (0)20 111 2222 using the Python spaCy Matcher?
I am trying the following pattern:
pattern = [ {'ORTH': '('}, {'SHAPE': 'd'},
{'ORTH': ')'},
{'SHAPE': 'dd'},
{'ORTH': '-', 'OP': '?'},
{'SHAPE': 'ddd'},
{'ORTH': '-', 'OP': '?'},
{'SHAPE': 'dddd'}]
matcher.add('PHONE_NUMBER_E', None, pattern)
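This works if I add a space after the bracket in the phone number (for example, (0) 20 111 2222), but not otherwise. I have only just started with Python, so I am sure I am missing something simple. Thanks for your help.
The problem with matching strings with spaCy, as opposed to matching them with regular expressions, is that with spaCy you [almost] never know in advance what the tokenizer will do to your string:
With a space: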
doc = nlp("This is my telephone number (0) 20 111 2222")
[tok.text for tok in doc]
['This', 'is', 'my', 'telephone', 'number', '(', '0', ')', '20', '111', '2222']
Without a space:
doc = nlp("This is my telephone number (0)20 111 2222")
[tok.text for tok in doc]
['This', 'is', 'my', 'telephone', 'number', '(', '0)20', '111', '2222']
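To see why the original pattern stops matching once the space is gone, it can help to compare token shapes. What follows is only a minimal sketch: approx_shape is a hypothetical helper that roughly imitates spaCy's shape feature (on a real Doc you would read token.shape_ instead), and the token lists are copied from the two outputs above.

```python
def approx_shape(text: str) -> str:
    """Rough approximation of spaCy's SHAPE: digits become 'd', letters
    'x'/'X', other characters stay as they are, and runs of the same
    shape character are cut off after four."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            s = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            s = "d"
        else:
            s = ch
        run = run + 1 if s == last else 0
        last = s
        if run < 4:
            shape.append(s)
    return "".join(shape)


# Phone-number tokens taken from the tokenizer outputs shown above.
tokens_with_space = ['(', '0', ')', '20', '111', '2222']
tokens_without_space = ['(', '0)20', '111', '2222']

print([approx_shape(t) for t in tokens_with_space])
# ['(', 'd', ')', 'dd', 'ddd', 'dddd']
print([approx_shape(t) for t in tokens_without_space])
# ['(', 'd)dd', 'ddd', 'dddd']
```

In the second case there is no standalone token with shape 'd' followed by a ')' token, so a pattern that expects '(', 'd', ')', 'dd' as four separate tokens has nothing to match.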
With this in mind, you can write a pattern for each of the two formats:
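import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")  # assumed model; any English pipeline should do
doc = nlp("My telephone number is either (0)20 111 2222 or (0) 20 111 2222")
matcher = Matcher(nlp.vocab, validate=True)
# Format with a space: tokens are '(', '0', ')', '20', '111', '2222'
pattern1 = [{'ORTH': '('}, {'SHAPE': 'd'},
            {'ORTH': ')'},
            {'SHAPE': 'dd'},
            {'ORTH': '-', 'OP': '?'},
            {'SHAPE': 'ddd'},
            {'ORTH': '-', 'OP': '?'},
            {'SHAPE': 'dddd'}]
# Format without a space: tokens are '(', '0)20', '111', '2222'
pattern2 = [{'ORTH': '('},
            {'TEXT': {'REGEX': r'\d\)\d*'}},
            {'ORTH': '-', 'OP': '?'},
            {'SHAPE': 'ddd'},
            {'ORTH': '-', 'OP': '?'},
            {'SHAPE': 'dddd'}]
# spaCy v2 signature; in spaCy v3 use matcher.add('PHONE_NUMBER_E', [pattern1, pattern2])
matcher.add('PHONE_NUMBER_E', None, pattern1, pattern2)
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # 'PHONE_NUMBER_E'
    span = doc[start:end]
    print(span)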
(0)20 111 2222
(0) 20 111 2222
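For comparison, a plain regular expression over the raw text does not depend on how the tokenizer splits the string. This is only a sketch under assumptions: PHONE_RE is a hypothetical pattern written for the two formats in this question (one bracketed digit, then groups of 2, 3 and 4 digits separated by an optional space or hyphen), not a general phone-number matcher.

```python
import re

# Hypothetical pattern: "(d)", an optional whitespace character, then groups
# of 2, 3 and 4 digits, each separated by an optional space or hyphen.
PHONE_RE = re.compile(r"\(\d\)\s?\d{2}[ -]?\d{3}[ -]?\d{4}")

text = "My telephone number is either (0)20 111 2222 or (0) 20 111 2222"
found = [m.group(0) for m in PHONE_RE.finditer(text)]
print(found)
# ['(0)20 111 2222', '(0) 20 111 2222']
```

If you still need spaCy spans rather than strings, the character offsets from each match (m.start(), m.end()) can probably be passed to doc.char_span, which returns None when the offsets do not line up with token boundaries.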