Parse duplicate name elements from RSS feed

Question

I am parsing this RSS feed -> https://gh.bmj.com/rss/recent.xml. Each <item> block has two elements named <dc:identifier>:

<item rdf:about="http://gh.bmj.com/cgi/content/short/4/4/e001065?rss=1">
<title>
<![CDATA[
Use of routinely collected electronic healthcare data for postlicensure vaccine safety signal detection: a systematic review
]]>
</title>
<link>
http://gh.bmj.com/cgi/content/short/4/4/e001065?rss=1
</link>
<description>
<![CDATA[
<sec><st>Background</st> <p>Concerns regarding adverse events following vaccination (AEFIs) are a key challenge for public confidence in vaccination. Robust postlicensure vaccine safety monitoring remains critical to detect adverse events, including those not identified in prelicensure studies, and to ensure public safety and public confidence in vaccination. We summarise the literature examined AEFI signal detection using electronic healthcare data, regarding data sources, methodological approach and statistical analysis techniques used.</p> </sec> <sec><st>Methods</st> <p>We performed a systematic review using the Preferred Reporting Items for Systematic Reviews and Meta-analyses guidelines. Five databases (PubMed/Medline, EMBASE, CINAHL, the Cochrane Library and Web of Science) were searched for studies on AEFIs monitoring published up to 25 September 2017. Studies were appraised for methodological quality, and results were synthesised narratively.</p> </sec> <sec><st>Result</st> <p>We included 47 articles describing AEFI signal detection using electronic healthcare data. All studies involved linked diagnostic healthcare data, from the emergency department, inpatient and outpatient setting and immunisation records. Statistical analysis methodologies used included non-sequential analysis in 33 studies, group sequential analysis in two studies and 12 studies used continuous sequential analysis. Partially elapsed risk window and data accrual lags were the most cited barriers to monitor AEFIs in near real-time.</p> </sec> <sec><st>Conclusion</st> <p>Routinely collected electronic healthcare data are increasingly used to detect AEFI signals in near real-time. Further research is required to check the utility of non-coded complaints and encounters, such as telephone medical helpline calls, to enhance AEFI signal detection.</p> </sec> <sec><st>Trial registration number</st> <p>CRD42017072741</p> </sec>
]]>
</description>
<dc:creator>
<![CDATA[ Mesfin, Y. M., Cheng, A., Lawrie, J., Buttery, J. ]]>
</dc:creator>
<dc:date>2019-07-08T21:52:19-07:00</dc:date>
<dc:identifier>info:doi/10.1136/bmjgh-2018-001065</dc:identifier>
<dc:identifier>hwp:master-id:bmjgh;bmjgh-2018-001065</dc:identifier>
<dc:publisher>BMJ Publishing Group Ltd</dc:publisher>
<dc:subject>
<![CDATA[ Open access ]]>
</dc:subject>
<dc:title>
<![CDATA[
Use of routinely collected electronic healthcare data for postlicensure vaccine safety signal detection: a systematic review
]]>
</dc:title>
<prism:publicationDate>2019-07-08</prism:publicationDate>
<prism:section>Research</prism:section>
<prism:volume>4</prism:volume>
<prism:number>4</prism:number>
<prism:startingPage>e001065</prism:startingPage>
<prism:endingPage>e001065</prism:endingPage>
</item>

Of these two elements:

<dc:identifier>info:doi/10.1136/bmjgh-2018-001065</dc:identifier>
<dc:identifier>hwp:master-id:bmjgh;bmjgh-2018-001065</dc:identifier>

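I want the one containing the DOI, info:doi/10.1136/bmjgh-2018-001065, but when I use Python feedparser (https://pythonhosted.org/feedparser/) I only get the second one. My assumption is that it reads the value of the first element but overwrites it when it encounters the second element with the same name. Is there any way to prevent or work around this?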

Answer 1

You can download the RSS file from the URL with urllib.request.urlretrieve, then use minidom to remove the unwanted dc:identifier elements first. After that, you can use feedparser to access the value you want.

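from urllib import request
from xml.dom import minidom

import feedparser

# Download the feed to a local file.
request.urlretrieve("https://gh.bmj.com/rss/recent.xml", "recent.xml")
xmldoc = minidom.parse("recent.xml")

# Drop every dc:identifier whose value starts with "hwp:", leaving only the DOI one.
for node in xmldoc.getElementsByTagName("dc:identifier"):
    # Skip empty elements, which have no text child.
    if node.firstChild is not None and node.firstChild.nodeValue.startswith("hwp:"):
        node.parentNode.removeChild(node)

# Write the cleaned document back out; UTF-8 avoids platform-dependent encoding errors.
with open("recent_modified.xml", "w", encoding="utf-8") as file_handle:
    xmldoc.writexml(file_handle)

# feedparser now sees a single dc:identifier per item.
d = feedparser.parse("recent_modified.xml")

for item in d.entries:
    print(item.dc_identifier)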
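
If the DOI is the only field you need, a possible alternative is to skip the intermediate file and read every dc:identifier of each item directly with the standard library's xml.etree.ElementTree. The following is only a sketch: the function name doi_identifiers and the trimmed-down sample document are made up for illustration, and it assumes the feed declares the usual Dublin Core namespace URI (http://purl.org/dc/elements/1.1/) for the dc prefix.

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core namespace; assumed to be what the feed binds to "dc".
NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def doi_identifiers(xml_data):
    """Return the first info:doi identifier of each <item>, in document order."""
    root = ET.fromstring(xml_data)
    dois = []
    for element in root.iter():
        # Match <item> by local name so the RSS default namespace does not matter.
        if element.tag.rsplit("}", 1)[-1] != "item":
            continue
        for ident in element.findall("dc:identifier", NS):
            value = (ident.text or "").strip()
            if value.startswith("info:doi/"):
                dois.append(value)
                break
    return dois


# Hypothetical, shortened stand-in for the real feed.
sample = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item rdf:about="http://gh.bmj.com/cgi/content/short/4/4/e001065?rss=1">
    <dc:identifier>info:doi/10.1136/bmjgh-2018-001065</dc:identifier>
    <dc:identifier>hwp:master-id:bmjgh;bmjgh-2018-001065</dc:identifier>
  </item>
</rdf:RDF>"""

print(doi_identifiers(sample))  # ['info:doi/10.1136/bmjgh-2018-001065']
```

For the live feed you would pass the downloaded bytes (for example from urllib.request.urlopen(...).read()) rather than a str, since ElementTree rejects a str that carries an XML encoding declaration.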

Answer 2

In this case, a simple regular expression does the job well.

In [1]: text = '''<item rdf:about="http://gh.bmj.com/cgi/content/short/4/4/e001065?rss=1">
   ...: <title>
   ...: <![CDATA[
   ...: Use of routinely collected electronic healthcare data for postlicensure vaccine safety signal detection: a systematic review
   ...: ]]>
   ...: </title>
   ...: <link>...'''

In [2]: import re

In [3]: re.findall('<dc:identifier>(info:doi.*?)</dc:identifier>', text)
Out[3]: ['info:doi/10.1136/bmjgh-2018-001065']

If the text contains newlines inside the tags, you can remove them first:

text = text.replace('\n', '')

But that does not seem necessary in this case.
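
Another way to cope with newlines, without editing the text at all, is to let the pattern itself allow whitespace around the value. This is a sketch rather than a tested recipe for the real feed: the name DOI_RE and the sample string (where the first identifier is deliberately wrapped across lines) are invented for illustration.

```python
import re

# \s* tolerates newlines or spaces around the value; [^<\s]+ stops at
# whitespace or at the next tag, so the match cannot run into another element.
DOI_RE = re.compile(r"<dc:identifier>\s*(info:doi/[^<\s]+)\s*</dc:identifier>")

# Hypothetical fragment with the DOI identifier split across lines.
sample = """
<dc:date>2019-07-08T21:52:19-07:00</dc:date>
<dc:identifier>
info:doi/10.1136/bmjgh-2018-001065
</dc:identifier>
<dc:identifier>hwp:master-id:bmjgh;bmjgh-2018-001065</dc:identifier>
"""

dois = DOI_RE.findall(sample)
print(dois)  # ['info:doi/10.1136/bmjgh-2018-001065']
```

Unlike text.replace('\n', ''), this leaves the rest of the document untouched, at the cost of a slightly longer pattern.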

python

feedparser

xml-parsing