Efficiently getting IDs from PubMed

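I'm currently looking for direct links between citations on PubMed/MEDLINE and clinical trial registries. Specifically, given a PMID, I want to find every ID that the citation has in any clinical trial registry (for example, see PMID 29593018, which has the ID ACTRN12616000470493).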

At the moment, I only search for links to ClinicalTrials.gov (ID format: NCT followed by 8 digits, e.g. NCT01435343), using the following regular expression:

attributes = {'mdTitle': 'High-dose versus standard-dose amoxicillin/clavulanate for clinically-diagnosed acute bacterial sinusitis: A randomized clinical trial.', 'mdAbstract': 'BACKGROUND: The recommended treatment for acute bacterial sinusitis in adults, amoxicillin with clavulanate, provides only modest benefit. OBJECTIVE: To see if a higher dose of amoxicillin will lead to more rapid improvement. DESIGN, SETTING, AND PARTICIPANTS: Double-blind randomized trial in which, from November 2014 through February 2017, we enrolled 315 adult outpatients diagnosed with acute sinusitis in accordance with Infectious Disease Society of America guidelines. INTERVENTIONS: Standard-dose (SD) immediate-release (IR) amoxicillin/clavulanate 875 /125 mg (n = 159) vs. high-dose (HD) (n = 156). The original HD formulation, 2000 mg of extended-release (ER) amoxicillin with 125 mg of IR clavulanate twice a day, became unavailable half way through the study. The IRB then approved a revised protocol after patient 180 to provide 1750 mg of IR amoxicillin twice a day in the HD formulation and to compare Time Period 1 (ER) with Time Period 2 (IR). MAIN MEASURE: The primary outcome was the percentage in each group reporting a major improvement-defined as a global assessment of sinusitis symptoms as "a lot better" or "no symptoms"-after 3 days of treatment. KEY RESULTS: Major improvement after 3 days was reported during Period 1 by 38.8% of ER HD versus 37.9% of SD patients (P = 0.91) and during Period 2 by 52.4% of IR HD versus 34.4% of SD patients, an effect size of 18% (95% CI 0.75 to 35%, P = 0.04). No significant differences in efficacy were seen at Day 10. The major side effect, severe diarrhea at Day 3, was reported during Period 1 by 7.4% of HD and 5.7% of SD patients (P = 0.66) and during Period 2 by 15.8% of HD and 4.8% of SD patients (P = 0.048). CONCLUSIONS: Adults with clinically diagnosed acute bacterial sinusitis were more likely to improve rapidly when treated with IR HD than with SD but not when treated with ER HD. They were also more likely to suffer severe diarrhea. Further study is needed to confirm these findings. TRIAL REGISTRATION: ClinicalTrials.gov Identifier: NCT02340000.', 'mdMesh': '', 'mdPMID': '29738561', 'mdPublicationType': ['Journal Article'], 'mdAuthor': ['Matho A', 'Mulqueen M', 'Tanino M', 'Quidort A', 'Cheung J', 'Pollard J', 'Rodriguez J', 'Swamy S', 'Tayler B', 'Garrison G', 'Ata A', 'Sorum P'], 'mdDataPublished': '2018', 'mdPMC': '', 'mdSI': ['ClinicalTrials.gov/NCT02340000'], 'mdAID': ['10.1371/journal.pone.0196734 [doi]', 'PONE-D-17-43190 [pii]'], 'mdDOI': ['10.1371/journal.pone.0196734 [doi]', 'PONE-D-17-43190 [pii]'], 'mdSO': 'PLoS One. 2018 May 8;13(5):e0196734. doi: 10.1371/journal.pone.0196734. eCollection 2018.', 'mdLanguage': ['English']}

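import re

# Flatten the whole record into one string, then scan it token by token.
dictString = ', '.join("{!s}={!r}".format(key, val) for (key, val) in attributes.items())
for each in dictString.split(' '):
    # re.match anchors at the start of the token, so only tokens that begin with an NCT ID are printed.
    if re.match(r'NCT\d{8}', each):
        print(each.strip('.\','))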

However, PubMed/MEDLINE also contains 40 other types of clinical trial registration IDs, and I would like to get those as well. How can I do this more efficiently than by writing 40 more regex statements?

Note: to clarify, I need to identify each ID and the registry that issued it (i.e. ClinicalTrials.gov for NCT01435343, or the Australian New Zealand Clinical Trials Registry for ACTRN12616000470493).

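I haven't looked at a bunch of records to see whether the same pattern holds, but if the IDs always follow the text "TRIAL REGISTRATION NUMBER:" inside an HTML <h4> tag, you could parse the actual HTML document for the <h4> tag that contains this term and then pull the text from the <p> tag that follows it. BeautifulSoup makes this relatively simple.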
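To make the idea concrete, here is a minimal sketch of the "find the heading, take the next paragraph" step. It uses the standard-library `html.parser` as a stand-in for BeautifulSoup so that it runs without extra installs. The class name, the function name, and the sample markup are all hypothetical, and the sample only assumes the page layout described above rather than reflecting real PubMed HTML.

```python
from html.parser import HTMLParser


class TrialRegistrationParser(HTMLParser):
    """Collect the text of the first <p> that follows a matching <h4>."""

    def __init__(self, heading="TRIAL REGISTRATION"):
        super().__init__()
        self.heading = heading.upper()
        self.in_h4 = False          # currently inside an <h4>
        self.heading_seen = False   # a matching <h4> has been closed
        self.in_target_p = False    # inside the <p> right after that <h4>
        self.done = False           # target <p> already collected
        self.h4_text = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "h4":
            self.in_h4 = True
            self.h4_text = []
        elif tag == "p" and self.heading_seen and not self.done:
            self.in_target_p = True

    def handle_endtag(self, tag):
        if tag == "h4" and self.in_h4:
            self.in_h4 = False
            if self.heading in "".join(self.h4_text).upper():
                self.heading_seen = True
        elif tag == "p" and self.in_target_p:
            self.in_target_p = False
            self.done = True

    def handle_data(self, data):
        if self.in_h4:
            self.h4_text.append(data)
        elif self.in_target_p:
            self.chunks.append(data)


def registration_text(html):
    """Return the paragraph text after the registration heading, or ''."""
    parser = TrialRegistrationParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# Made-up markup that reuses the two IDs quoted in the question.
sample = (
    "<div><h4>CONCLUSIONS:</h4><p>Further study is needed.</p>"
    "<h4>TRIAL REGISTRATION NUMBER:</h4>"
    "<p>NCT02340000; ACTRN12616000470493</p></div>"
)
print(registration_text(sample))
```

If the heading is missing, the function returns an empty string, which makes it easy to spot records that do not follow the assumed layout.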

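But again, you've only shown one example, so I don't know whether it always follows this pattern. From there, the IDs appear to be semicolon-separated, which makes them easy to split.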
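As a rough illustration of the splitting step, the sketch below breaks the registration text on semicolons and pairs each ID with a registry by looking it up in a small pattern table. The table, the function name, and the sample string are hypothetical. Only the two ID formats quoted in the question are covered, and the ACTRN pattern (14 digits) is inferred from that single example, so a real table would need to be checked against each registry's own format.

```python
import re

# Hypothetical lookup table: registry name -> ID pattern.
# Only the two formats quoted in the question; extend as needed.
REGISTRY_PATTERNS = {
    "ClinicalTrials.gov": r"NCT\d{8}",
    "Australian New Zealand Clinical Trials Registry": r"ACTRN\d{14}",
}


def label_registrations(text):
    """Split on ';' and return (registry, id) pairs found in each part."""
    results = []
    for part in text.split(";"):
        for registry, pattern in REGISTRY_PATTERNS.items():
            # Word boundaries keep the match from starting or ending mid-token.
            for match in re.finditer(r"\b" + pattern + r"\b", part):
                results.append((registry, match.group(0)))
    return results


# Made-up input that reuses the two IDs quoted in the question.
text = "ClinicalTrials.gov Identifier: NCT02340000; ACTRN12616000470493."
for registry, trial_id in label_registrations(text):
    print(registry, trial_id)
```

Keeping the patterns in one table means that supporting another registry is a one-line addition rather than a new block of matching code.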