Python / BeautifulSoup - Extracting XML data from clinicaltrials.gov, only able to extract studies that don't have missing data

I'm using the clinicaltrials.gov API to fetch a list of clinical trials as XML, then parsing the data so I can eventually export it to an Excel dataset.

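The URL in my code returns 9 results, but my code only extracts 5 of the 9. I realized this is because one of the fields (detaileddescription) is present for only some of the trials. When I drop detaileddescription and use just the other two fields (nctid and briefsummary), I get all 9. Short of building a separate data frame for detaileddescription and merging it in, what else can I do?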

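Bottom line: I'm extracting 3 fields from an XML file containing 9 clinical trials: nctid, briefsummary, and detaileddescription, but my output only includes 5 of the 9 trials. How can I get all 9 without removing the detaileddescription field from my output?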

import requests
from bs4 import BeautifulSoup
import pandas as pd

out = []
url = 'https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=50&fmt=xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
nctids = soup.find_all("field", {"name" : "NCTId"})
briefsummaries = soup.find_all("field", {"name" : "BriefSummary"}) if soup.find_all("field", {"name" : "BriefSummary"}) is not None else 'nothing'
detaileddescriptions = soup.find_all("field", {"name" : "DetailedDescription"}) if soup.find_all("field", {"name" : "DetailedDescription"}) is not None else 'nothing'

for nctid, briefsummary, detaileddescription in zip(nctids, briefsummaries, detaileddescriptions):
    
    data = {'nctid': nctid, 'briefsummary': briefsummary, 'detaileddescription': detaileddescription}
    out.append(data)
df = pd.DataFrame(out)

df.to_excel('clinicaltrialstresults.xlsx')

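You can loop over the list of studies instead, with a small change to your code. Calling find_all on the whole document gives you three independent lists of different lengths, and zip stops at the shortest one, which is why you only get 5 rows. Note also that find_all returns an empty list rather than None when nothing matches, so the "is not None" check on it never falls through to 'nothing'. Looking up each field inside its own study avoids both problems: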

import requests
from bs4 import BeautifulSoup
import pandas as pd


def field_text(study, name, default='nothing'):
    # Look up one named field inside a single study; fall back to a default if it is absent.
    field = study.find("field", {"name": name})
    return field.get_text(strip=True) if field is not None else default


out = []
url = 'https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=50&fmt=xml'
response = requests.get(url)
# The 'lxml' parser lowercases tag names, hence "fullstudy" and "field" below.
soup = BeautifulSoup(response.content, 'lxml')
study_list = soup.find_all("fullstudy")

# One row per study, so a missing field cannot shift or drop other studies' data.
for study in study_list:
    data = {
        'nctid': field_text(study, "NCTId"),
        'briefsummary': field_text(study, "BriefSummary"),
        'detaileddescription': field_text(study, "DetailedDescription"),
    }
    out.append(data)

df = pd.DataFrame(out)
df.to_excel('clinicaltrialstresults.xlsx', index=False)
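
To see the difference without hitting the network, here is a minimal sketch on a made-up three-study sample. It uses the standard library's xml.etree.ElementTree rather than BeautifulSoup so it runs with no extra installs. The sample XML is heavily simplified and the NCT ids, summaries, and the field_text helper are invented for illustration; the real response nests fields more deeply. It shows that zipping document-wide lists not only drops rows but can also pair a description with the wrong study, while a per-study lookup keeps every row.

```python
import xml.etree.ElementTree as ET

# Made-up sample (assumption): the second study has no DetailedDescription.
SAMPLE_XML = """
<FullStudiesResponse>
  <FullStudy>
    <Field Name="NCTId">NCT00000001</Field>
    <Field Name="BriefSummary">Summary A</Field>
    <Field Name="DetailedDescription">Details A</Field>
  </FullStudy>
  <FullStudy>
    <Field Name="NCTId">NCT00000002</Field>
    <Field Name="BriefSummary">Summary B</Field>
  </FullStudy>
  <FullStudy>
    <Field Name="NCTId">NCT00000003</Field>
    <Field Name="BriefSummary">Summary C</Field>
    <Field Name="DetailedDescription">Details C</Field>
  </FullStudy>
</FullStudiesResponse>
""".strip()


def field_text(parent, name, default="nothing"):
    """Hypothetical helper: text of the first Field with this Name under parent."""
    node = parent.find(f".//Field[@Name='{name}']")
    return node.text if node is not None else default


root = ET.fromstring(SAMPLE_XML)

# Document-wide lists, as in the question: the lengths differ (3 ids, 2 descriptions).
ids = [f.text for f in root.iter("Field") if f.get("Name") == "NCTId"]
details = [f.text for f in root.iter("Field") if f.get("Name") == "DetailedDescription"]

# zip stops at the shortest list and pairs items by position only,
# so the second id ends up next to the third study's description.
zipped = list(zip(ids, details))

# Per-study lookup, as in the answer: one row per study, gaps filled with a default.
rows = [
    {
        "nctid": field_text(study, "NCTId"),
        "briefsummary": field_text(study, "BriefSummary"),
        "detaileddescription": field_text(study, "DetailedDescription"),
    }
    for study in root.iter("FullStudy")
]

print(zipped)
for row in rows:
    print(row)
```

If your 5 extracted rows came from the zip version, it is worth checking them against the source, since the descriptions may be attached to the wrong nctid in the same way.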