Working around per-statement try-except by wrapping many single-line statements in a separate function
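I am scraping dictionary data from the https://www.dictionary.com/ website. The aim is to remove unwanted elements from the dictionary pages and save the pages offline for further processing. Because the pages are somewhat unstructured, the elements to be removed in the code below may or may not be present, and a missing element causes an exception (in snippet 2). The actual code has many such elements, so wrapping each statement in its own try-except would increase the number of lines drastically.
I am therefore working around this by creating a separate function for the try-except (in snippet 3), an idea I took from another answer. However, I cannot get the code in snippet 3 to work: commands such as soup.find_all('style') return None, where they should return the list of all style tags, as in snippet 2. I cannot apply the referenced solution directly, because sometimes I have to reach the element to remove indirectly, through its parent or sibling, for example soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent
Snippet 1 sets up the environment for code execution.
It would be great if you could suggest how to get snippet 3 working.
Snippet 1 (setting up the environment for code execution):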
import urllib.request
import requests
from bs4 import BeautifulSoup
import re
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',}
folder = "dictionary_com"
Snippet 2 (working):
def makedefinition(url):
    success = False
    while success==False:
        try:
            request=urllib.request.Request(url,headers=headers)
            final_url = urllib.request.urlopen(request, timeout=5).geturl()
            r = requests.get(final_url, headers=headers, timeout=5)
            success=True
        except:
            success=False
    soup = BeautifulSoup(r.text, 'lxml')
    soup = soup.find("section",{'class':'css-1f2po4u e1hj943x0'})
    # there are many more elements to remove. mentioned only 2 for shortness
    remove = soup.find_all("style") # style tags
    remove.extend(safe_execute(soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent)) # related content in the page
    for x in remove: x.decompose()
    return(soup)
# testing code on multiple urls
#url = "https://www.dictionary.com/browse/a"
#url = "https://www.dictionary.com/browse/a--christmas--carol"
#url = "https://www.dictionary.com/brdivowse/affection"
#url = "https://www.dictionary.com/browse/hot"
#url = "https://www.dictionary.com/browse/move--on"
url = "https://www.dictionary.com/browse/cuckold"
#url = "https://www.dictionary.com/browse/fear"
maggi = makedefinition(url)
with open(folder+"/demo.html", "w") as file:
    file.write(str(maggi))
Snippet 3 (not working):
soup = None
def safe_execute(command):
    global soup
    try:
        print(soup) # correct soup is printed
        print(exec(command)) # this should print the list of style tags but printing None, and for related content this should throw some exception
        return exec(command) # None is being returned for style
    except Exception:
        print(Exception.with_traceback())
        return []
def makedefinition(url):
    global soup
    success = False
    while success==False:
        try:
            request=urllib.request.Request(url,headers=headers)
            final_url = urllib.request.urlopen(request, timeout=5).geturl()
            r = requests.get(final_url, headers=headers, timeout=5)
            success=True
        except:
            success=False
    soup = BeautifulSoup(r.text, 'lxml')
    soup = soup.find("section",{'class':'css-1f2po4u e1hj943x0'})
    # there are many more elements to remove. mentioned only 2 for shortness
    remove = safe_execute("soup.find_all('style')") # style tags
    remove.extend(safe_execute("soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent")) # related content in the page
    for x in remove: x.decompose()
    return(soup)
# testing code on multiple urls
#url = "https://www.dictionary.com/browse/a"
#url = "https://www.dictionary.com/browse/a--christmas--carol"
#url = "https://www.dictionary.com/brdivowse/affection"
#url = "https://www.dictionary.com/browse/hot"
#url = "https://www.dictionary.com/browse/move--on"
url = "https://www.dictionary.com/browse/cuckold"
#url = "https://www.dictionary.com/browse/fear"
maggi = makedefinition(url)
with open(folder+"/demo.html", "w") as file:
    file.write(str(maggi))
In the code of snippet 3 you use exec, a built-in that returns None regardless of what it does with its argument. For details, see this SO thread.
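A minimal illustration of this point, using plain arithmetic instead of BeautifulSoup so that it runs anywhere: exec runs statements and returns None, while eval evaluates a single expression and returns its value. The variable names are only for illustration.

```python
# exec() runs statements and always returns None.
exec_result = exec("1 + 1")

# eval() evaluates one expression and returns its value.
eval_result = eval("1 + 1")

print(exec_result)  # None
print(eval_result)  # 2
```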
Remedy:
Use exec to assign the result to a variable and return that variable, instead of returning the output of exec itself.
def safe_execute(command):
    d = {}  # namespace in which the command is executed
    try:
        exec(command, d)
        return d['output']
    except Exception as exc:
        print(repr(exc))  # report the error instead of raising
        return []
Then call it like this:
remove = safe_execute("output = soup.find_all('style')")
Edit:
After running this code, None was returned again. While debugging, print(soup) inside the try block printed the correct soup value, yet exec(command,d) raised NameError: name 'soup' is not defined.
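The NameError is consistent with how exec resolves names: a dictionary passed as the second argument is used as the global namespace of the executed code, so module-level names such as soup are not visible unless they are placed in that dictionary. A small sketch of this behaviour, with a plain list standing in for soup (data, namespace and the other names are illustrative assumptions, not part of the original code):

```python
data = [3, 1, 2]  # stand-in for the module-level `soup` object

# An empty dict as globals: `data` is not visible to the executed code.
namespace = {}
try:
    exec("output = sorted(data)", namespace)
    error_name = None
except NameError as exc:
    error_name = type(exc).__name__

# Passing the needed name explicitly makes it visible.
namespace = {"data": data}
exec("output = sorted(data)", namespace)
result = namespace["output"]

print(error_name)  # NameError
print(result)      # [1, 2, 3]
```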
Using eval() instead of exec() resolved the problem. The function is now defined as:
def safe_execute(command):
    global soup
    try:
        output = eval(command)
        return(output)
    except Exception:
        return []
The calls look like:
remove = safe_execute("soup.find_all('style')")
remove.extend(safe_execute("soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent"))
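As a possible alternative that avoids evaluating strings altogether, the lookup can be passed as a callable (for example a lambda) and executed inside the try block. The sketch below is not from the original answer: safe_call, Node and the found lookup table are hypothetical names, and a tiny stand-in class replaces BeautifulSoup so the example runs without any dependencies. It also wraps a single result in a list, so that list.extend adds the element itself rather than iterating over it.

```python
def safe_call(lookup):
    """Run lookup(); return its result as a list, or [] if it raises."""
    try:
        result = lookup()
    except Exception:
        return []
    if result is None:
        return []
    # Wrap a single element so the caller can always use list.extend().
    return list(result) if isinstance(result, (list, tuple)) else [result]


class Node:
    """Tiny stand-in for a parsed tag; only `parent` is modelled."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


section = Node("section")
heading = Node("h2", parent=section)
found = {"h2": heading}  # pretend lookup table; "h3" is missing

remove = []
remove.extend(safe_call(lambda: [heading]))               # a list of elements
remove.extend(safe_call(lambda: found.get("h2").parent))  # present: the parent
remove.extend(safe_call(lambda: found.get("h3").parent))  # missing: AttributeError, so []
names = [node.name for node in remove]
print(names)  # ['h2', 'section']
```

With the code from the question, a call would take the form safe_call(lambda: soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent). Because the lambda closes over soup, no global declaration or namespace dictionary is needed.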