Handle Key-Error whilst scraping
I am currently writing a script to scrape data from ClinicalTrials.gov. To that end, I wrote the following:
import requests
from bs4 import BeautifulSoup

def clinicalTrialsGov(id):
    url = "https://clinicaltrials.gov/ct2/show/" + id + "?displayxml=true"
    data = BeautifulSoup(requests.get(url).text, "lxml")
    studyType = data.study_type.text
    if studyType == 'Interventional':
        allocation = data.allocation.text
        interventionModel = data.intervention_model.text
        primaryPurpose = data.primary_purpose.text
        masking = data.masking.text
    enrollment = data.enrollment.text
    officialTitle = data.official_title.text
    condition = data.condition.text
    minAge = data.eligibility.minimum_age.text
    maxAge = data.eligibility.maximum_age.text
    gender = data.eligibility.gender.text
    healthyVolunteers = data.eligibility.healthy_volunteers.text
    armType = []
    intType = []
    for each in data.findAll('intervention'):
        intType.append(each.intervention_type.text)
    for each in data.findAll('arm_group'):
        armType.append(each.arm_group_type.text)
    citedPMID = data.results_reference.PMID
    print(citedPMID)
    return officialTitle, studyType, allocation, interventionModel, primaryPurpose, masking, enrollment, condition, minAge, maxAge, gender, healthyVolunteers, armType, intType
However, the above script does not always work, because not all studies contain every item (i.e. a KeyError occurs). To work around this, I could simply wrap each statement in a try-except block, like so:
try:
    studyType = data.study_type.text
except:
    studyType = ""
But this seems like a poor way of implementing it. What would be a better/cleaner solution?
Good question. Before I address it, let me say that you should consider changing the second argument to the BeautifulSoup (BS) constructor from lxml to xml. Otherwise, BS does not flag the parsed markup as XML (to verify this for yourself, inspect the is_xml attribute of the data variable in your code).
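The difference is easy to check directly. The sketch below assumes bs4 and lxml are installed and uses a small hypothetical XML snippet rather than a real ClinicalTrials.gov response:

```python
from bs4 import BeautifulSoup  # assumes the bs4 and lxml packages are installed

# A minimal stand-in for a ClinicalTrials.gov XML document
doc = "<clinical_study><study_type>Interventional</study_type></clinical_study>"

soup_lxml = BeautifulSoup(doc, "lxml")  # HTML tree builder: treats input as HTML
soup_xml = BeautifulSoup(doc, "xml")    # XML tree builder (lxml-xml)

print(soup_lxml.is_xml)  # False
print(soup_xml.is_xml)   # True
```

With the XML builder, BS also stops lowercasing tag names and applying HTML-specific fix-ups, which matters for documents like these.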
You can avoid raising an error when trying to access a non-existent element by passing a list of the desired element names to the find_all() method:
subset = ['results_reference', 'allocation', 'intervention_model', 'primary_purpose',
          'masking', 'enrollment', 'eligibility', 'official_title', 'arm_group', 'condition']
tag_matches = data.find_all(subset)
Then, if you want to pull out a specific element from the list of tags without looping over it, you can convert the list into a dictionary keyed by tag name:
# Note: if several tags share a name (e.g. arm_group), only the last one is kept
tag_dict = {tag.name: tag for tag in tag_matches}
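Put together, missing elements then simply come back as None instead of raising an exception. A minimal sketch, again assuming bs4 and lxml are installed and using a hypothetical document in which masking is absent:

```python
from bs4 import BeautifulSoup  # assumes the bs4 and lxml packages are installed

doc = """<clinical_study>
  <official_title>Example Study</official_title>
  <allocation>Randomized</allocation>
</clinical_study>"""

data = BeautifulSoup(doc, "xml")
subset = ['official_title', 'allocation', 'masking']
tag_dict = {tag.name: tag for tag in data.find_all(subset)}

# dict.get returns None for tags missing from the document, no KeyError
allocation = tag_dict['allocation'].text
masking = tag_dict.get('masking')

print(allocation)  # Randomized
print(masking)     # None
```

From there you can decide once, in one place, what a missing value should become (None, an empty string, etc.), instead of repeating a try-except per field.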