Silent error handling in Python?
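I was given a CSV file containing a large number of URLs. For convenience I read it into a pandas DataFrame, since I need to do some statistics later and pandas is handy for that. It looks roughly like this: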
import pandas as pd
csv = [{"URLs" : "www.mercedes-benz.de", "electric" : 1}, {"URLs" : "www.audi.de", "electric" : 0}, {"URLs" : "ww.audo.e", "electric" : 0}, {"URLs" : "NaN", "electric" : 0}]
df = pd.DataFrame(csv)
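My task is to check whether each website contains certain strings and to add an extra column that is 1 if it does and 0 otherwise. For example, I want to check whether www.mercedes-benz.de contains the string car. I do the following: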
for i, row in df.iterrows():
    page_content = requests.get(row['URLs'])
    if "car" in page_content.text:
        df.loc[i, 'car'] = '1'
    else:
        df.loc[i, 'car'] = '0'
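The problem: sometimes a URL is wrong or missing, and my little script then raises an error.
How can I handle/suppress the error when a URL is wrong or missing? And how could I set df.loc[i, 'url_wrong'] = '1' in those cases to flag the URL as wrong/missing?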
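If I understand correctly, 'NaN' is the "wrong/missing" URL. In that case you can simply check for it. There are countless ways to flag a missing URL; I prefer a missing value in the car column. Try this: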
import pandas as pd
csv = [{"URLs" : "www.mercedes-benz.de", "electric" : 1}, {"URLs" : "www.audi.de", "electric" : 0}, {"URLs" : "ww.audo-car.e", "electric" : 0}, {"URLs" : "NaN", "electric" : 0}]
df = pd.DataFrame(csv)
print(df)
for i, row in df.iterrows():
    page_content = row['URLs']
    # Compare strings with ==; `is` tests object identity and is unreliable here
    if page_content is None or page_content == "NaN":
        df.loc[i, 'car'] = None
    elif "car" in page_content:
        df.loc[i, 'car'] = True
    else:
        df.loc[i, 'car'] = False
    print(df.loc[i, 'car'])
print(df)
I changed a few things in your code because they did not work. For example, in the line page_content = requests.get(row['URLs']), requests is not defined. I guessed you meant row.
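As a loop-free variant of the same check, here is a minimal sketch. It assumes, as the answer above does, that the literal string "NaN" marks a missing URL and that the keyword is searched in the URL text itself, not in the page content. The url_wrong column name comes from the question; the sample rows are only illustrative.

```python
import pandas as pd

# Illustrative rows: one good URL, one containing "car", the "NaN" string, and a real None
df = pd.DataFrame({"URLs": ["www.mercedes-benz.de", "ww.audo-car.e", "NaN", None]})

# Replace the literal string "NaN" with a real missing value
urls = df["URLs"].mask(df["URLs"] == "NaN")

# Flag rows whose URL is missing
df["url_wrong"] = urls.isna()

# Plain substring search; missing URLs count as "not found" (na=False)
df["car"] = urls.str.contains("car", regex=False, na=False)

print(df)
```

If the data comes from pd.read_csv with its default settings, cells that are empty or contain NaN are already parsed as missing values, so the mask step may be unnecessary.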
Try defining a function that does the "car" check first, then use the .apply method of a pandas Series to get 1, 0, or Wrong/Missing URL. The following should help:
import pandas as pd
import requests

data = [{"URLs" : "https://www.mercedes-benz.de", "electric" : 1},
        {"URLs" : "https://www.audi.de", "electric" : 0},
        {"URLs" : "https://ww.audo.e", "electric" : 0},
        {"URLs" : "NaN", "electric" : 0}]

def contains_car(link):
    try:
        return int('car' in requests.get(link).text)
    except requests.exceptions.RequestException:
        # Catch only request failures (bad URL, DNS error, ...) rather than everything
        return "Wrong/Missing URL"

df = pd.DataFrame(data)
df['extra_column'] = df.URLs.apply(contains_car)

#                            URLs  electric       extra_column
# 0  https://www.mercedes-benz.de         1                  1
# 1           https://www.audi.de         0                  1
# 2             https://ww.audo.e         0  Wrong/Missing URL
# 3                           NaN         0  Wrong/Missing URL
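The question also asked for a separate url_wrong column instead of mixing a text label into the result column. A rough sketch of that variant follows. The names check_url and add_keyword_columns, the get parameter (there so the HTTP call can be swapped out for testing), and the 10-second timeout are my own assumptions, not part of the original code.

```python
import pandas as pd
import requests


def check_url(url, keyword="car", get=requests.get, timeout=10):
    """Return (found, url_wrong): found is '1'/'0', or None if the request failed."""
    try:
        response = get(url, timeout=timeout)
    except requests.exceptions.RequestException:
        # Raised for a missing scheme (e.g. "NaN"), DNS failures, timeouts, and similar
        return None, "1"
    return ("1" if keyword in response.text else "0"), "0"


def add_keyword_columns(df, keyword="car", get=requests.get):
    """Return a copy of df with a keyword column and a url_wrong column."""
    results = [check_url(url, keyword, get=get) for url in df["URLs"]]
    out = df.copy()
    out[keyword] = [found for found, _ in results]
    out["url_wrong"] = [wrong for _, wrong in results]
    return out
```

Calling add_keyword_columns(df) on the DataFrame above would leave car empty and set url_wrong to '1' for the rows whose request fails. Note that an HTTP error status such as 404 does not raise an exception here, so such pages are still searched for the keyword.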
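Edit:
You can search the text returned by the HTTP request for several keywords. Depending on the condition you want, this can be done with the built-in function any or the built-in function all. With any, finding any one of the keywords returns 1; with all, every keyword must match to return 1. In the example below I use any with the keywords 'car', 'automobile', and 'vehicle':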
import pandas as pd
import requests

data = [{"URLs" : "https://www.mercedes-benz.de", "electric" : 1},
        {"URLs" : "https://www.audi.de", "electric" : 0},
        {"URLs" : "https://ww.audo.e", "electric" : 0},
        {"URLs" : "NaN", "electric" : 0}]

def contains_keywords(link, keywords):
    try:
        output = requests.get(link).text
        # 1 if at least one keyword occurs in the page text, else 0
        return int(any(x in output for x in keywords))
    except requests.exceptions.RequestException:
        # Catch only request failures (bad URL, DNS error, ...) rather than everything
        return "Wrong/Missing URL"

df = pd.DataFrame(data)
mykeywords = ('car', 'vehicle', 'automobile')
df['extra_column'] = df.URLs.apply(lambda l: contains_keywords(l, mykeywords))
This should produce:
#                            URLs  electric       extra_column
# 0  https://www.mercedes-benz.de         1                  1
# 1           https://www.audi.de         0                  1
# 2             https://ww.audo.e         0  Wrong/Missing URL
# 3                           NaN         0  Wrong/Missing URL
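To make the difference between any and all concrete without touching the network, here is a small sketch on a fixed string. The helper match_keywords and its require_all flag are hypothetical names for illustration only.

```python
def match_keywords(text, keywords, require_all=False):
    """Return 1 if the keywords match the text, else 0.

    With require_all=False any single keyword is enough (any);
    with require_all=True every keyword must occur (all).
    """
    combine = all if require_all else any
    return int(combine(word in text for word in keywords))


sample = "A vehicle such as a car needs regular service."
keywords = ("car", "vehicle", "automobile")

print(match_keywords(sample, keywords))                    # 1: "car" and "vehicle" occur
print(match_keywords(sample, keywords, require_all=True))  # 0: "automobile" is absent
```

Keep in mind that in does plain substring matching, so 'car' also matches words such as 'career'.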
Hope this helps.