Handle KeyError whilst scraping

I am currently writing a script to scrape data from ClinicalTrials.gov. To that end, I wrote the following function:

import requests
from bs4 import BeautifulSoup

def clinicalTrialsGov(id):
    url = "https://clinicaltrials.gov/ct2/show/" + id + "?displayxml=true"
    data = BeautifulSoup(requests.get(url).text, "lxml")
    studyType = data.study_type.text
    if studyType == 'Interventional':
        allocation = data.allocation.text
        interventionModel = data.intervention_model.text
        primaryPurpose = data.primary_purpose.text
        masking = data.masking.text
        enrollment = data.enrollment.text
    officialTitle = data.official_title.text
    condition = data.condition.text
    minAge = data.eligibility.minimum_age.text
    maxAge = data.eligibility.maximum_age.text
    gender = data.eligibility.gender.text
    healthyVolunteers = data.eligibility.healthy_volunteers.text
    armType = []
    intType = []
    for each in data.findAll('intervention'):
        intType.append(each.intervention_type.text)
    for each in data.findAll('arm_group'):
        armType.append(each.arm_group_type.text)
    citedPMID = data.results_reference.PMID
    return officialTitle, studyType, allocation, interventionModel, primaryPurpose, masking, enrollment, condition, minAge, maxAge, gender, healthyVolunteers, armType, intType

However, the above script does not always work, because not every study contains every element (i.e. a KeyError occurs). To work around this, I could simply wrap each statement in a try-except block, like so:

try:
  studyType = data.study_type.text
except:
  studyType = ""

But this seems like a poor way to implement it. What would be a better/cleaner solution?

Good question. Before I address it, let me say that you should consider changing the second parameter of the BeautifulSoup (BS) constructor from lxml to xml. Otherwise, BS does not flag the parsed tree as XML (to verify this yourself, inspect the is_xml attribute of the data variable in your code).
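As a quick check (a minimal sketch, assuming bs4 and lxml are installed), the is_xml flag shows the difference between the two parsers:

```python
from bs4 import BeautifulSoup

snippet = "<study_type>Interventional</study_type>"

# 'lxml' parses the markup as HTML, so the tree is not flagged as XML
html_soup = BeautifulSoup(snippet, "lxml")

# 'xml' uses lxml's XML parser, which sets is_xml on the tree
xml_soup = BeautifulSoup(snippet, "xml")
```

The XML parser also preserves tag case, which matters for elements such as PMID; the HTML parser lowercases tag names.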

You can avoid errors when accessing elements that may not exist by passing a list of the desired element names to the find_all() method:

subset = ['results_reference', 'allocation', 'intervention_model',
          'primary_purpose', 'masking', 'enrollment', 'eligibility',
          'official_title', 'arm_group', 'condition']

tag_matches = data.find_all(subset)

Then, if you want to pick out specific elements from the tag list without iterating over it, you can convert it into a dictionary keyed by tag name:

tag_dict = {tag.name: tag for tag in tag_matches}
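Putting the pieces together (a minimal sketch using a made-up XML snippet; the tag names mirror those in your question), a missing element can then be handled with dict.get() instead of one try/except per field:

```python
from bs4 import BeautifulSoup

# Stand-in for the ClinicalTrials.gov response; 'study_type' is
# deliberately missing to show the failure mode.
xml = """<clinical_study>
  <official_title>Example Study</official_title>
  <condition>Healthy</condition>
</clinical_study>"""

data = BeautifulSoup(xml, "xml")
subset = ["official_title", "condition", "study_type"]
tag_dict = {tag.name: tag for tag in data.find_all(subset)}

official_title = tag_dict["official_title"].text
study_type = tag_dict.get("study_type")              # None, no exception
study_type_text = study_type.text if study_type else ""
```

One caveat: if an element occurs more than once (e.g. arm_group or intervention), later matches overwrite earlier ones in the dictionary, so repeated tags are still better collected with a loop as in your original code.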