Extract sub-string between multiple certain words using regex in Python
regex, substring
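I want to extract the phone, fax and mobile numbers from a string. If one of them is missing, an empty string can be returned for it. For any given text string I want three lists (phone, fax, mobile), as in the examples below.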
ex1 = "miramar road margie shoop san diego ca 12793 manager phone 6035550160 fax 6035550161 mobile 6035550178 marsgies travel wwwmarpiestravelcom"
ex2 = "david packard electrical engineering 350 serra mall room 170 phone 650 7259327 stanford university fax 650 723 1882 stanford california 943059505 ulateecestanfordedu"
ex3 = "stanford electrical engineering vijay chandrasekhar electrical engineering 17 comstock circle apt 101 stanford ca 94305 phone 9162210411"
This is possible with a regex like this:
phone_regex = re.match(".*phone(.*)fax(.*)mobile(.*)",ex1)
phone = [re.sub("[^0-9]","",x) for x in phone_regex.groups()][0]
mobile = [re.sub("[^0-9]","",x) for x in phone_regex.groups()][2]
fax = [re.sub("[^0-9]","",x) for x in phone_regex.groups()][1]
Result from ex1:
phone = 6035550160
fax = 6035550161
mobile = 6035550178
ex2 has no mobile entry, so re.match returns None and I get:
Traceback (most recent call last):
phone = [re.sub("[^0-9]", "", x) for x in phone_regex.groups()][0]
AttributeError: 'NoneType' object has no attribute 'groups'
Question
I need a better regex solution, since I am new to regex, or a solution that catches the AttributeError and assigns an empty string.
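I think I understand what you want. It comes down to getting exactly the first match after the keyword. What you need in this case is the question mark ?:
"'?' is also a quantifier, shorthand for {0,1}, meaning "Match zero or one of the group preceding this question mark." It can also be read as: the part before the question mark is optional."
Note that when ? directly follows another quantifier, as in the .*? used here, it makes that quantifier lazy (match as few characters as possible) rather than marking something as optional.
Here is some code that should work, in case the definition is not enough: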
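import re

res_dict = {}
list_keywords = ['phone', 'mobile', 'fax']  # the sample text uses 'mobile', not 'cell'
for i_key in list_keywords:
    # Lazy match: capture as little as possible up to the next " <letter>".
    # A keyword at the very end of the string (as in ex3) is not matched.
    temp_res = re.findall(i_key + '(.*?) [a-zA-Z]', ex1)
    res_dict[i_key] = temp_res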
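To make the effect of the ? in (.*?) concrete, here is a small sketch comparing the greedy and the lazy form on a shortened version of ex1. The variable names greedy and lazy are only for illustration.

```python
import re

text = "manager phone 6035550160 fax 6035550161 mobile 6035550178 marsgies travel"

# Greedy: .* runs as far as it can, so the capture swallows the later fields.
greedy = re.search(r"phone(.*) [a-zA-Z]", text).group(1)

# Lazy: .*? stops at the first " <letter>" that lets the pattern match.
lazy = re.search(r"phone(.*?) [a-zA-Z]", text).group(1)

print(repr(greedy))
print(repr(lazy))
```

The lazy version captures only the number (with its leading space), which is why the ? matters here.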
Use re.search.
Demo:
import re

ex1 = "miramar road margie shoop san diego ca 12793 manager phone 6035550160 fax 6035550161 mobile 6035550178 marsgies travel wwwmarpiestravelcom"
ex2 = "david packard electrical engineering 350 serra mall room 170 phone 650 7259327 stanford university fax 650 723 1882 stanford california 943059505 ulateecestanfordedu"
ex3 = "stanford electrical engineering vijay chandrasekhar electrical engineering 17 comstock circle apt 101 stanford ca 94305 phone 9162210411"

for i in [ex1, ex2, ex3]:
    # The lookbehind anchors on the keyword; the lazy .*? stops at the
    # next lowercase letter or at the end of the string.
    phone = re.search(r"(?P<phone>(?<=\bphone\b).*?(?=([a-z]|$)))", i)
    if phone:
        print("Phone:", phone.group("phone").strip())
    fax = re.search(r"(?P<fax>(?<=\bfax\b).*?(?=([a-z]|$)))", i)
    if fax:
        print("Fax:", fax.group("fax").strip())
    mob = re.search(r"(?P<mob>(?<=\bmobile\b).*?(?=([a-z]|$)))", i)
    if mob:
        print("mob:", mob.group("mob").strip())
    print("-----")
Output:
Phone: 6035550160
Fax: 6035550161
mob: 6035550178
-----
Phone: 650 7259327
Fax: 650 723 1882
-----
Phone: 9162210411
-----
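The question asks for an empty string when a field is missing, which the demo above does not return (it simply prints nothing). A minimal sketch of one way to get that, using the same lookbehind idea in a loop: extract_fields and its keywords parameter are hypothetical names, not part of the original answer.

```python
import re

def extract_fields(text, keywords=("phone", "fax", "mobile")):
    """Return {keyword: number text}, using "" when a keyword is absent."""
    result = {}
    for key in keywords:
        # Text after the keyword, up to the next lowercase letter or the end.
        pattern = r"(?<=\b" + re.escape(key) + r"\b).*?(?=[a-z]|$)"
        match = re.search(pattern, text)
        result[key] = match.group(0).strip() if match else ""
    return result

ex2 = "david packard electrical engineering 350 serra mall room 170 phone 650 7259327 stanford university fax 650 723 1882 stanford california 943059505 ulateecestanfordedu"
ex3 = "stanford electrical engineering vijay chandrasekhar electrical engineering 17 comstock circle apt 101 stanford ca 94305 phone 9162210411"

print(extract_fields(ex2))
print(extract_fields(ex3))
```

Checking the match object before calling group() avoids the AttributeError from the question without needing a try/except.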
You can use a simple re.findall like this:
dict(re.findall(r'\b({})\s*(\d+)'.format("|".join(keys)), ex))
The regex looks like
\b(phone|fax|mobile)\s*(\d+)
Pattern details
\b - a word boundary
(phone|fax|mobile) - Group 1: one of the listed words
\s* - 0+ whitespace characters
(\d+) - Group 2: one or more digits
See the Python demo:
import re

exs = ["miramar road margie shoop san diego ca 12793 manager phone 6035550160 fax 6035550161 mobile 6035550178 marsgies travel wwwmarpiestravelcom",
       "david packard electrical engineering 350 serra mall room 170 phone 650 7259327 stanford university fax 650 723 1882 stanford california 943059505 ulateecestanfordedu",
       "stanford electrical engineering vijay chandrasekhar electrical engineering 17 comstock circle apt 101 stanford ca 94305 phone 9162210411"]
keys = ['phone', 'fax', 'mobile']
for ex in exs:
    res = dict(re.findall(r'\b({})\s*(\d+)'.format("|".join(keys)), ex))
    print(res)
Output:
{'fax': '6035550161', 'phone': '6035550160', 'mobile': '6035550178'}
{'fax': '650', 'phone': '650'}
{'phone': '9162210411'}
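Note that for the second example (\d+) captures only the first run of digits ('650'), because the number is written with spaces. A possible variation, offered as a sketch rather than as part of the original answer, lets group 2 continue across whitespace-separated digit runs and fills in "" for missing keys. The function name extract is hypothetical, and the pattern assumes no unrelated number directly follows the phone number.

```python
import re

keys = ["phone", "fax", "mobile"]
# Group 2 now allows further digit runs separated by whitespace.
pattern = re.compile(r"\b({})\s*(\d+(?:\s+\d+)*)".format("|".join(keys)))

def extract(text):
    # Start every key at "" so missing entries come back as empty strings.
    res = dict.fromkeys(keys, "")
    for key, digits in pattern.findall(text):
        res[key] = re.sub(r"\s+", "", digits)  # drop the inner spaces
    return res

ex2 = "david packard electrical engineering 350 serra mall room 170 phone 650 7259327 stanford university fax 650 723 1882 stanford california 943059505 ulateecestanfordedu"
print(extract(ex2))
```

If a keyword appears more than once in the text, the last occurrence wins in this sketch.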
I think the following regex should work fine:
mobile = re.findall('mobile([0-9]*)', ex1.replace(" ",""))[0]
fax = re.findall('fax([0-9]*)', ex1.replace(" ",""))[0]
phone = re.findall('phone([0-9]*)', ex1.replace(" ",""))[0]
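One caveat with the lines above: indexing with [0] raises an IndexError when the keyword is absent, for example mobile in ex2. A minimal sketch of a guarded version follows; first_number is a hypothetical helper name. It keeps the same space-stripping approach, so it shares the assumption that no unrelated digits directly follow the number.

```python
import re

def first_number(keyword, text):
    # Remove spaces first so "650 723 1882" becomes one run of digits.
    found = re.findall(keyword + "([0-9]*)", text.replace(" ", ""))
    # findall returns an empty list when the keyword is missing.
    return found[0] if found else ""

ex2 = "david packard electrical engineering 350 serra mall room 170 phone 650 7259327 stanford university fax 650 723 1882 stanford california 943059505 ulateecestanfordedu"

phone = first_number("phone", ex2)
fax = first_number("fax", ex2)
mobile = first_number("mobile", ex2)
print(phone, fax, repr(mobile))
```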