解析非标准 HL7 XML w/ Python

Parsing Non-Standard HL7 XML w/ Python

背景:

我有点熟悉通过 DOM 将 XML 解析为 Java。

我想做什么:

我正在尝试从 NLM Daily Med 网站解析 HL7 / XML 结构化产品标签。我要解析的示例 url 是:Atenolol SPL

到目前为止我尝试过的:

我试过 DOM、ElementTree、lxml 和 minidom。我能做到的最接近的一直是使用此代码:

#!/usr/bin/python3

import xml.sax
from xml.dom.minidom import parse
import xml.dom.minidom



# ------Using SAX Parser---------------
class MovieHandler(xml.sax.ContentHandler):
def __init__(self):
    self.CurrentData = ""
    self.type = ""
    self.title = ""
    self.text = ""
    self.description = ""
    self.displayName = ""

# Call when an element starts
def startElement(self, tag, attributes):
    self.CurrentData = tag
    if tag == "code":
        print ("*****Section*****")
        code = attributes["code"]
        #displayName = attributes["displayName"]
        print ("Code:", code)
        #print("Display Name:", displayName)

# Call when an elements ends
def endElement(self, tag):
    if self.CurrentData == "type":
        print ("Type:", self.type)
    elif self.CurrentData == "displayName":
        print("Display Name:", self.displayName)
    elif self.CurrentData == "title":
        print ("Title:", self.CurrentData.title())
    elif self.CurrentData == "text":
        print ("Text:", self.text)
    elif self.CurrentData == "description":
        print ("Description:", self.description)
    self.CurrentData = ""

# Call when a character is read
def characters(self, content):
    if self.CurrentData == "type":
        self.type = content
    elif self.CurrentData == "format":
        self.format = content
    elif self.CurrentData == "year":
        self.year = content
    elif self.CurrentData == "rating":
        self.rating = content
    elif self.CurrentData == "stars":
        self.stars = content
    elif self.CurrentData == "description":
        self.description = content


if (__name__ == "__main__"):
# create an XMLReader
parser = xml.sax.make_parser()
# turn off namepsaces
parser.setFeature(xml.sax.handler.feature_namespaces, 0)

# override the default ContextHandler
Handler = MovieHandler()
parser.setContentHandler(Handler)

parser.parse(saved_file_path)

控制台中的结果是:

 *****Section***** Code: 34391-3 Title: Title
*****Section***** Code: 57664-264
*****Section***** Code: 50VV3VW0TI
*****Section***** Code: 50VV3VW0TI
*****Section***** Code: 368GB5141J
*****Section***** Code: 70097M6I30
*****Section***** Code: 57664-264-88
*****Section***** Code: 57664-264-13
*****Section***** Code: 57664-264-18
*****Section***** Code: SPLCOLOR
*****Section***** Code: SPLSHAPE
*****Section***** Code: SPLSCORE
*****Section***** Code: SPLSIZE
*****Section***** Code: SPLIMPRINT
*****Section***** Code: SPLCOATING
*****Section***** Code: SPLSYMBOL
*****Section***** Code: 57664-265
*****Section***** Code: 50VV3VW0TI
*****Section***** Code: 50VV3VW0TI
*****Section***** Code: 368GB5141J
*****Section***** Code: 70097M6I30
*****Section***** Code: 57664-265-88
*****Section***** Code: 57664-265-13
*****Section***** Code: 57664-265-18
*****Section***** Code: SPLCOLOR
*****Section***** Code: SPLSHAPE
*****Section***** Code: SPLSCORE
*****Section***** Code: SPLSIZE
*****Section***** Code: SPLIMPRINT
*****Section***** Code: SPLCOATING
*****Section***** Code: SPLSYMBOL
*****Section***** Code: 57664-266
*****Section***** Code: 50VV3VW0TI
*****Section***** Code: 50VV3VW0TI
*****Section***** Code: 368GB5141J
*****Section***** Code: 70097M6I30
*****Section***** Code: 57664-266-88
*****Section***** Code: 57664-266-13
*****Section***** Code: 57664-266-18
*****Section***** Code: SPLCOLOR
*****Section***** Code: SPLSHAPE
*****Section***** Code: SPLSCORE
*****Section***** Code: SPLSIZE
*****Section***** Code: SPLIMPRINT
*****Section***** Code: SPLCOATING
*****Section***** Code: SPLSYMBOL
*****Section***** Code: 34066-1 Title: Title Title: Title
*****Section***** Code: 34089-3 Title: Title
*****Section***** Code: 34090-1 Title: Title Title: Title Title: Title Title: Title
*****Section***** Code: 34067-9 Title: Title Title: Title Title: Title Title: Title
*****Section***** Code: 34070-3 Title: Title
*****Section***** Code: 34071-1 Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title
*****Section***** Code: 42232-9 Title: Title
*****Section***** Code: 34072-9 Title: Title
*****Section***** Code: 34073-7 Title: Title
*****Section***** Code: 34083-6 Title: Title
*****Section***** Code: 34091-9 Title: Title
*****Section***** Code: 42228-7 Title: Title
*****Section***** Code: 34080-2 Title: Title
*****Section***** Code: 34081-0 Title: Title
*****Section***** Code: 34082-8 Title: Title Title: Title Title: Title
*****Section***** Code: 34084-4 Title: Title Title: Title Title: Title Text: 
*****Section***** Code: 34088-5 Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title Text: 
*****Section***** Code: 34068-7 Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title
*****Section***** Code: 34069-5 Title: Title

Process finished with exit code 0

问题/什么不工作:

我真的不需要包含 "Code: XXXXX-X" 的部分之前的部分 对于这些部分中的每一个,我都想获取该部分及其所有子部分的 <title><text><paragraph> 标记的值。

虽然我已经能够使用 DOM、ElementTree、lxml 和 minidom 的教程,但目标 XML 是非标准的并且在单个标记中包含多个属性,对于示例:

<code code="34090-1" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="Clinical Pharmacology section" />

有些 nodes/elements 将包含一个快捷方式结束标记(如上所示),而其他一些将包含完整的传统结束标记。

难怪医疗这么复杂!

那么我如何获取标签的内容并遍历子部分以执行相同的操作?

我希望我回答对了你的问题,此代码通过 requests 模块加载 XML,然后提取每个 <code> 以及随后的 <title><paragraph> 里面 <text>:

import requests
from bs4 import BeautifulSoup

url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/f36d4ed3-dcbb-4465-9fa6-1da811f555e6.xml'

soup = BeautifulSoup( requests.get(url).text, 'html.parser' )

for section in soup.select('section:has(> code[code]):has(> title)'):
    print('Code   = ', section.select_one('code')['code'])

    for title in section.select('title'):
        print()
        print('Title  = ', title.text)
        print('*' * 80)
        txt = title.find_next_sibling('text')

        if not txt:
            continue

        for paragraph in txt.select('paragraph'):
            for tag in paragraph.select('br'):
                tag.replace_with("\n")
            print()
            lines = '\n'.join(line.strip() for line in paragraph.get_text().splitlines() if line.strip())
            print(lines)

    print('-' * 120 + '\n')

打印:

Code   =  34066-1

Title  =  BOXED WARNING
********************************************************************************

Title  =  Cessation of Therapy with Atenolol
********************************************************************************

Patients with coronary artery disease, who are being treated with atenolol, should be advised against abrupt discontinuation of therapy. Severe exacerbation of angina and the occurrence of myocardial infarction 
and ventricular arrhythmias have been reported in angina patients following the abrupt discontinuation of therapy with beta-blockers. The last two complications may occur with or without preceding exacerbation o
f the angina pectoris. As with other beta-blockers, when discontinuation of atenolol tablet, USP, is planned, the patients should be carefully observed and advised to limit physical activity to a minimum. If the
 angina worsens or acute coronary insufficiency develops, it is recommended that atenolol tablet, USP be promptly reinstituted, at least temporarily. Because coronary artery disease is common and may be unrecogn
ized, it may be prudent not to discontinue atenolol tablet, USP, therapy abruptly even in patients treated only for hypertension. (See DOSAGE AND ADMINISTRATION.)
------------------------------------------------------------------------------------------------------------------------

Code   =  34089-3

Title  =  DESCRIPTION
********************************************************************************

Atenolol, USP, a synthetic, beta1-selective (cardioselective) adrenoreceptor blocking agent, may be chemically described as benzeneacetamide, 4 -[2'-hydroxy- 3'-[(1- methylethyl) amino] propoxy]-. The molecular 
and structural formulas are:

Atenolol (free base) has a molecular weight of 266.34. It is a relatively polar hydrophilic compound with a water solubility of 26.5 mg/mL at 37°C and a log partition coefficient (octanol/water) of 0.23. It is f
reely soluble in 1N HCl (300 mg/mL at 25°C) and less soluble in chloroform (3 mg/mL at 25°C).

Atenolol is available as 25, 50 and 100 mg tablets for oral administration.

Each tablet contains the labeled amount of atenolol, USP and the following inactive ingredients: povidone, microcrystalline cellulose, corn starch, sodium lauryl sulfate, croscarmellose sodium, colloidal silicon
 dioxide, sodium stearyl fumarate and magnesium stearate.
------------------------------------------------------------------------------------------------------------------------

...and so on.

我来晚了。

import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc 

url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/f36d4ed3-dcbb-4465-9fa6-1da811f555e6.xml'
doc = SimplifiedDoc(requests.get(url).text).getElementByTag('document')
id = doc.id # doc.getElementByTag('id') # get node by tag
print (id) # {'tag': 'id', 'root': '703B8B58-E0F2-9A0B-3443-E5F84ED5BF47'}

id = doc.structuredBody.id # The shorter the path, the better the performance
print (id) # {'tag': 'id', 'root': '11E3A4BE-274B-0B13-8006-59D6FFE10481'}

lst = doc.getChildren() # get child nodes
for l in lst:
  print (l.tag)