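Python make sure address matches specific format
I have been playing around with regular expressions, but have had no luck yet. I need to introduce some address validation. I need to make sure that a user-defined address matches this format: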
"717 N 2ND ST, MANKATO, MN 56001"
It could also be this:
"717 N 2ND ST, MANKATO, MN, 56001"
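Anything else should be thrown out, with a warning to the user that the format is incorrect. I have been looking at the documentation and have tried many regex patterns, and all of them failed. I have tried this (and many variations) without any luck: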
pat = r'\d{1,6}(\w+),\s(w+),\s[A-Za-z]{2}\s{1,6}'
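This one works, but it lets too much junk through, because it only makes sure the string starts with a house number and ends with a zip code (I think):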
pat = r'\d{1,6}( \w+){1,6}'
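The comma placement matters because I split the input string on commas: the first item is the street address, then the city, then the state and zip, which are separated by a space (this is where I would want a second regex, in case there is a comma between state and zip).
Basically I want to do this: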
# check for this format "717 N 2ND ST, MANKATO, MN 56001"
pat_1 = 'regex to match above pattern'
if re.match(pat_1, addr, re.IGNORECASE):
    # extract address

# check for this pattern "717 N 2ND ST, MANKATO, MN, 56001"
pat_2 = 'regex to match above format'
if re.match(pat_2, addr, re.IGNORECASE):
    # extract address

else:
    raise ValueError('"{}" must match this format: "717 N 2ND ST, MANKATO, MN 56001"'.format(addr))

# do stuff with address
If anyone could help me build a regex to make sure the pattern matches, I would appreciate it!
How about this:
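((\w|\s)+),((\w|\s)+),\s*(\w{2})\s*,?\s*(\d{5}).*
You can also use it to extract the street, city, state and zip from \1, \3, \5 and \6 respectively. The inner groups \2 and \4 will only hold the last character of the street and city respectively, but that does not affect validity.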
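As a quick check of the group numbering, here is a minimal sketch that applies the pattern above with `re.match`. The `split_address` helper and the variable names are made up for illustration and are not part of the answer. The `.strip()` calls are there because groups 1 and 3 keep any surrounding whitespace.

```python
import re

# Pattern from the answer above. Groups 2 and 4 are the inner (\w|\s)
# groups, so the useful captures are groups 1, 3, 5 and 6.
PATTERN = re.compile(r'((\w|\s)+),((\w|\s)+),\s*(\w{2})\s*,?\s*(\d{5}).*')


def split_address(addr):
    """Return (street, city, state, zip) or None if addr does not match.

    Hypothetical helper, written only to illustrate the group numbers.
    """
    m = PATTERN.match(addr)
    if m is None:
        return None
    street, city, state, zip_code = m.group(1, 3, 5, 6)
    return street.strip(), city.strip(), state, zip_code


for addr in ('717 N 2ND ST, MANKATO, MN 56001',
             '717 N 2ND ST, MANKATO, MN, 56001',
             '717 N 2ND, Makata, 56001'):
    print(addr, '->', split_address(addr))
```

Note that `\w{2}` also accepts digits and underscores for the state, so this pattern is more permissive than it may look.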
\d{1,6}\s\w+\s\w+\s[A-Za-z]{2},\s([A-Za-z]+),\s[A-Za-z]{2}(,\s\d{1,6}|\s\d{1,6})
You can test the regex at this link: https://regex101.com/r/yN7hU9/1
You can use this:
\d{1,6}(\s\w+)+,(\s\w+)+,\s[A-Z]{2},?\s\d{1,6}
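It matches a string that starts with a house number (1 to 6 digits), followed by one or more words and then a comma. It then looks for a city name made up of at least one word followed by a comma. Next it looks for exactly 2 uppercase letters followed by an optional comma, and then the zip code (matched here as 1 to 6 digits).
This might help. Whenever possible, I prefer verbose regular expressions with embedded comments, for maintainability.
Also note the use of (?P<name>pattern). This helps document the intent of the match, and also provides a useful mechanism to extract the data, if your needs go beyond simple regex validation.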
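To see how that pattern behaves, the sketch below runs it against a few sample strings. The `is_valid` helper and the use of `fullmatch` are assumptions added here, not part of the answer. `fullmatch` is chosen so that trailing junk after the zip code is rejected as well.

```python
import re

# Pattern from the answer above, unchanged.
PATTERN = re.compile(r'\d{1,6}(\s\w+)+,(\s\w+)+,\s[A-Z]{2},?\s\d{1,6}')


def is_valid(addr):
    """Hypothetical helper: True only if the whole string fits the pattern."""
    return PATTERN.fullmatch(addr) is not None


samples = (
    '717 N 2ND ST, MANKATO, MN 56001',        # state and zip separated by a space
    '717 N 2ND ST, MANKATO, MN, 56001',       # state and zip separated by a comma
    '717 N 2ND, Makata, 56001',               # no state
    '717 N 2ND ST, MANKATO, mn 56001',        # lowercase state
    '717 N 2ND ST, MANKATO, MN 56001 extra',  # trailing junk
)
for s in samples:
    print(is_valid(s), s)
```

Because the state is written as `[A-Z]{2}`, lowercase state codes are rejected unless the pattern is compiled with `re.IGNORECASE`.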
import re

# Goal: '717 N 2ND ST, MANKATO, MN 56001',
# Goal: '717 N 2ND ST, MANKATO, MN, 56001',
regex = r'''
    (?P<HouseNumber>\d+)\s+          # Matches '717 '
    (?P<Direction>[news])\s+         # Matches 'N '
    (?P<StreetName>\w+)\s+           # Matches '2ND '
    (?P<StreetDesignator>\w+),\s+    # Matches 'ST, '
    (?P<TownName>.*),\s+             # Matches 'MANKATO, '
    (?P<State>[A-Z]{2}),?\s+         # Matches 'MN ' and 'MN, '
    (?P<ZIP>\d{5})                   # Matches '56001'
'''
# Verbose mode and case-insensitivity are passed as flags to re.compile:
# inline (?x)/(?i) must sit at the very start of a pattern in Python 3.11+.
regex = re.compile(regex, re.VERBOSE | re.IGNORECASE)

for item in (
    '717 N 2ND ST, MANKATO, MN 56001',
    '717 N 2ND ST, MANKATO, MN, 56001',
    '717 N 2ND, Makata, 56001',   # Should reject this one
    '1234 N D AVE, East Boston, MA, 02134',
):
    match = regex.match(item)
    print(item)
    if match:
        print("    House is on {Direction} side of {TownName}".format(**match.groupdict()))
    else:
        print("    invalid entry")
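To make some fields optional, we replace + with *, since + means one-or-more and * means zero-or-more. Here is a version that matches the new requirements in the comments: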
import re

# Goal: '717 N 2ND ST, MANKATO, MN 56001',
# Goal: '717 N 2ND ST, MANKATO, MN, 56001',
# Goal: '717 N 2ND ST NE, MANKATO, MN, 56001',
# Goal: '717 N 2ND, MANKATO, MN, 56001',
regex = r'''
    (?P<HouseNumber>\d+)\s+          # Matches '717 '
    (?P<Direction>[news])\s+         # Matches 'N '
    (?P<StreetName>\w+)\s*           # Matches '2ND ', with optional trailing space
    (?P<StreetDesignator>\w*)\s*     # Optionally Matches 'ST '
    (?P<StreetDirection>[news]*)\s*  # Optionally Matches 'NE'
    ,\s+                             # Force a comma after the street
    (?P<TownName>.*),\s+             # Matches 'MANKATO, '
    (?P<State>[A-Z]{2}),?\s+         # Matches 'MN ' and 'MN, '
    (?P<ZIP>\d{5})                   # Matches '56001'
'''
# Verbose mode and case-insensitivity are passed as flags to re.compile:
# inline (?x)/(?i) must sit at the very start of a pattern in Python 3.11+.
regex = re.compile(regex, re.VERBOSE | re.IGNORECASE)

for item in (
    '717 N 2ND ST, MANKATO, MN 56001',
    '717 N 2ND ST, MANKATO, MN, 56001',
    '717 N 2ND, Makata, 56001',   # Should reject this one
    '1234 N D AVE, East Boston, MA, 02134',
    '717 N 2ND ST NE, MANKATO, MN, 56001',
    '717 N 2ND, MANKATO, MN, 56001',
):
    match = regex.match(item)
    print(item)
    if match:
        print("    House is on {Direction} side of {TownName}".format(**match.groupdict()))
    else:
        print("    invalid entry")
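Next, consider the OR operator | and the non-capturing group operator (?:pattern). Together they can describe complex alternatives in the input format. This version matches the new requirement that some addresses have the direction before the street name and some have it after, but no address has the direction in both places.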
import re

# Goal: '717 N 2ND ST, MANKATO, MN 56001',
# Goal: '717 N 2ND ST, MANKATO, MN, 56001',
# Goal: '717 2ND ST NE, MANKATO, MN, 56001',
# Goal: '717 N 2ND, MANKATO, MN, 56001',
regex = r'''
    (?:                                   # Matches any sort of street address
        (?:                               # Matches '717 N 2ND ST' or '717 N 2ND'
            (?P<HouseNumber>\d+)\s+       # Matches '717 '
            (?P<Direction>[news])\s+      # Matches 'N '
            (?P<StreetName>\w+)\s*        # Matches '2ND ', with optional trailing space
            (?P<StreetDesignator>\w*)\s*  # Optionally Matches 'ST '
        )
        |                                 # OR
        (?:                               # Matches '717 2ND ST NE' or '717 2ND NE'
            (?P<HouseNumber2>\d+)\s+      # Matches '717 '
            (?P<StreetName2>\w+)\s+       # Matches '2ND '
            (?P<StreetDesignator2>\w*)\s* # Optionally Matches 'ST '
            (?P<Direction2>[news]+)       # Matches 'NE'
        )
    )
    ,\s+                                  # Force a comma after the street
    (?P<TownName>.*),\s+                  # Matches 'MANKATO, '
    (?P<State>[A-Z]{2}),?\s+              # Matches 'MN ' and 'MN, '
    (?P<ZIP>\d{5})                        # Matches '56001'
'''
# Verbose mode and case-insensitivity are passed as flags to re.compile:
# inline (?x)/(?i) must sit at the very start of a pattern in Python 3.11+.
regex = re.compile(regex, re.VERBOSE | re.IGNORECASE)

for item in (
    '717 N 2ND ST, MANKATO, MN 56001',
    '717 N 2ND ST, MANKATO, MN, 56001',
    '717 N 2ND, Makata, 56001',   # Should reject this one
    '1234 N D AVE, East Boston, MA, 02134',
    '717 2ND ST NE, MANKATO, MN, 56001',
    '717 N 2ND, MANKATO, MN, 56001',
):
    match = regex.match(item)
    print(item)
    if match:
        d = match.groupdict()
        # Only one of the two alternatives matched; the other group is None.
        print("    House is on {0} side of {1}".format(
            d['Direction'] or d['Direction2'],
            d['TownName']))
    else:
        print("    invalid entry")
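To tie this back to the control flow in the question, here is a minimal sketch of a validating function that raises `ValueError` on bad input. It assumes a condensed pattern that treats everything before the first comma as the street. The `parse_address` function, the `ADDRESS_RE` name and the `Street` group are made up for illustration and do not come from the answers above.

```python
import re

# Condensed, hypothetical variant of the verbose patterns above: the street
# is a house number followed by one or more words, then city, state and zip.
ADDRESS_RE = re.compile(r'''
    (?P<Street>\d+(?:\s+\w+)+),\s+   # Matches '717 N 2ND ST, '
    (?P<TownName>[^,]+),\s+          # Matches 'MANKATO, '
    (?P<State>[A-Z]{2}),?\s+         # Matches 'MN ' and 'MN, '
    (?P<ZIP>\d{5})                   # Matches '56001'
''', re.VERBOSE | re.IGNORECASE)


def parse_address(addr):
    """Return the named groups as a dict, or raise ValueError."""
    # fullmatch also rejects anything left over after the zip code.
    match = ADDRESS_RE.fullmatch(addr.strip())
    if match is None:
        raise ValueError(
            '"{}" must match this format: '
            '"717 N 2ND ST, MANKATO, MN 56001"'.format(addr))
    return match.groupdict()


for item in ('717 N 2ND ST, MANKATO, MN, 56001',
             '717 N 2ND, Makata, 56001'):
    try:
        print(parse_address(item))
    except ValueError as exc:
        print('rejected:', exc)
```

A single pattern with an optional comma (`,?`) covers both accepted formats, so the separate pat_1 and pat_2 checks from the question are not needed in this sketch.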