Regular expression not working in particular cases
I am unable to extract a particular field when using the regular expression re.search method.
The error shown is:
raw_add = re.search(search_add.decode('utf-8'),i.decode('utf-8')).group()
AttributeError: 'NoneType' object has no attribute 'group'
My code is as follows:
import urllib2
import re
from json import dump

dumped_data = []
url = 'http://levi.in/store-finder/content/cityAddress.xml'
data = urllib2.urlopen(url).read()

class theAddress():
    city = ""
    state = ""
    lat = ""
    lng = ""
    area = ""
    addr = ""

# Each piece is the attribute text of one marker element
broken_pieces = re.compile('(?<=marker ).+?(?="\/>)')
all_broken_pieces = re.findall(broken_pieces,data)
search_add = '(?<=html=").+?(?=Tel|<\/p>)'

for i in all_broken_pieces:
    obj = theAddress()
    obj.city = re.search('(?<=city=").+?(?=")',i).group()
    obj.state = re.search('(?<=state=").+?(?=")',i).group()
    obj.lat = re.search('(?<=lat=").+?(?=")',i).group()
    obj.lng = re.search('(?<=lng=").+?(?=")',i).group()
    obj.area = re.search('(?<=label=").+?(?=")',i).group()
    # This is the line that raises the AttributeError
    raw_add = re.search(search_add.decode('utf-8'),i.decode('utf-8')).group()
    try:
        process1 = re.sub('<h5>','',raw_add)
        process2 = re.sub('</h5>',' ',process1)
        process3 = re.sub('<p>','',process2)
        process4 = re.sub('<br />',' ',process3)
        process5 = re.sub('</p>','',process4)
        process6 = re.sub('&amp;','&',process5)
        obj.addr = process6
    except:
        pass
    dumped_data.append(obj.__dict__)

f = open('levis_address1111.json','w')
dump(dumped_data, f, indent = 1)
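The problem here is that whenever the address matched by the regular expression ends with 'Tel', the data is extracted, but when it ends with '</p>', the error pops up.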
I just debugged your code, and it seems the html string is escaped, so you should change your regex to:
search_add = '(?<=html=").+?(?=Tel|&lt;\/p&gt;)'
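As beerbajay already suggested, if you want to get around the error, check whether there is a match before trying to extract the group (as the error says, .group() does not work on NoneType, which is what you get when the regex does not match).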
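The check described above could look roughly like the following. This is only a minimal sketch, written for Python 3 rather than the Python 2 used in the question. The function name extract_raw_address and the three sample strings are hypothetical, and the pattern assumes the closing tag appears in entity-escaped form, as this answer suggests.

```python
import re

# Assumption: the markup inside the html attribute is entity-escaped,
# so the closing tag is matched as "&lt;/p&gt;" rather than "</p>".
SEARCH_ADD = re.compile(r'(?<=html=").+?(?=Tel|&lt;/p&gt;)')


def extract_raw_address(piece):
    """Return the raw address text, or None when the pattern does not match."""
    match = SEARCH_ADD.search(piece)
    if match is None:
        # re.search returns None on failure; calling .group() on it is
        # what raises the AttributeError from the question.
        return None
    return match.group()


# Hypothetical samples, shortened from the kind of data in this thread.
with_tel = 'city="Amravati" html="&lt;p&gt;Near HDFC Bank, Tel: 0721-561396&lt;/p&gt;" label="Amravati"'
escaped_close = 'city="Bangalore" html="&lt;p&gt;Forum mall, Bangalore.&lt;/p&gt;" label="Forum mall"'
no_match = 'city="Nowhere" label="No html attribute"'

for sample in (with_tel, escaped_close, no_match):
    print(repr(extract_raw_address(sample)))
```

In the loop from the question, a None result could then be skipped or logged instead of aborting the whole run.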
Trying your example and printing some debug information, I found:
debug i: city="Amravati" state="Maharashtra" lat="20.930138" lng="77.754321" html="&lt;h5&gt;Tri Star Retail Pvt. Ltd(OLS):&lt;/h5&gt; &lt;p&gt;Near HDFC Bank,&lt;br /&gt;Main Market Road, &lt;br /&gt;Jaystambh Chowk Road,&lt;br /&gt;Amravati-440601. &lt;br /&gt;Tel: 0721-561396&lt;/p&gt;" label="Amravati" icontype="Levi\'s" category="&lt;h5&gt;Levi\'s Showroom:&lt;/h5&gt; &lt;p&gt;Near HDFC Bank,&lt;br /&gt;Main Market Road, &lt;br /&gt;Jaystambh Chowk Road,&lt;br /&gt;Amravati-440601.&lt;/p&gt;
raw_add: &lt;h5&gt;Tri Star Retail Pvt. Ltd(OLS):&lt;/h5&gt; &lt;p&gt;Near HDFC Bank,&lt;br /&gt;Main Market Road, &lt;br /&gt;Jaystambh Chowk Road,&lt;br /&gt;Amravati-440601. &lt;br /&gt;
debug i: city="Bangalore" state="Karnataka" lat="12.935816" lng="77.610294" html="&lt;img src=\'../Images/FindUs/LoopProgram.gif\' style=\'float:right; padding-left:5px;\' alt=\'Levi\xe2\x80\x99s\xc2\xae Loop Program\' /&gt;&lt;h5&gt;Prakruthi Apparels(OLS):&lt;/h5&gt; &lt;p&gt;Housur road, Forum mall,&lt;br /&gt; Bangalore.&lt;/p&gt;" label="Forum mall" icontype="Levi\'s" category="&lt;img src=\'../Images/FindUs/LoopProgramW.gif\' style=\'float:right; padding-right:5px;\' alt=\'Levi\xe2\x80\x99s\xc2\xae Loop Program\' /&gt;&lt;h5&gt;Levi\'s Showroom:&lt;/h5&gt;&lt;p&gt;Housur road,&lt;br /&gt;Forum mall,&lt;br /&gt; Bangalore.&lt;/p&gt;
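The first 'debug i' is a string that does contain "Tel", so it matches. In the second one, I don't see any </p>, so your regex does not match. You probably need to do some more debugging of your regex and include some more possible scenarios.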
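One way to do that kind of debugging is to run the same pattern against a piece both as it arrives and after decoding its entities. The sketch below assumes, as the other answer suggests, that the markup in the html attribute is entity-escaped. The sample string and the helper first_match are hypothetical, and html.unescape is Python 3 standard library (the question's code targets Python 2).

```python
import re
from html import unescape

# The pattern from the question, looking for a literal closing tag.
ORIGINAL_PATTERN = r'(?<=html=").+?(?=Tel|</p>)'


def first_match(pattern, text):
    """Return the matched text, or None if the pattern does not match."""
    match = re.search(pattern, text)
    return match.group() if match else None


# Hypothetical piece without "Tel", with entity-escaped markup (assumption).
sample = 'html="&lt;p&gt;Housur road, Forum mall, Bangalore.&lt;/p&gt;" label="Forum mall"'

# As received: neither "Tel" nor a literal "</p>" is present.
print(repr(first_match(ORIGINAL_PATTERN, sample)))

# After decoding the entities the literal closing tag exists, so it matches.
print(repr(first_match(ORIGINAL_PATTERN, unescape(sample))))
```

Printing both results side by side for a few failing pieces should show quickly whether escaping, rather than the pattern logic itself, is what prevents the match.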
Indeed; it is usually best not to use regular expressions for HTML/XML parsing.
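For illustration, a parser-based version might look like the sketch below. It is not the asker's code: it uses only the Python 3 standard library, and the sample document, its markers root element, and the helper names TextOnly, strip_tags and parse_markers are all assumptions, since the real feed's layout is not shown in this thread. The XML parser decodes the entity-escaped attribute values by itself, so no lookaround patterns are needed.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes and ignore all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(fragment):
    """Return the text of an HTML fragment with whitespace collapsed."""
    parser = TextOnly()
    parser.feed(fragment)
    parser.close()
    return ' '.join(' '.join(parser.parts).split())


# Hypothetical document; the root element name is an assumption.
SAMPLE_XML = '''<markers>
  <marker city="Bangalore" state="Karnataka" lat="12.935816" lng="77.610294"
          html="&lt;h5&gt;Prakruthi Apparels(OLS):&lt;/h5&gt; &lt;p&gt;Housur road, Forum mall,&lt;br /&gt; Bangalore.&lt;/p&gt;"
          label="Forum mall"/>
</markers>'''


def parse_markers(xml_text):
    """Build one dict per marker element, reading attributes directly."""
    root = ET.fromstring(xml_text)
    records = []
    for marker in root.iter('marker'):
        records.append({
            'city': marker.get('city', ''),
            'state': marker.get('state', ''),
            'lat': marker.get('lat', ''),
            'lng': marker.get('lng', ''),
            'area': marker.get('label', ''),
            # Attribute values come back already unescaped.
            'addr': strip_tags(marker.get('html', '')),
        })
    return records


records = parse_markers(SAMPLE_XML)
print(records[0]['addr'])
```

Missing attributes fall back to an empty string here, so a marker without an html attribute yields an empty address instead of an exception.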