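Python Parsing weird XML?
I have a weird XML document that I am trying to parse, and even after reading through this post I am still having trouble.
I am trying to parse the NIST CVE database, which is only published as XML. Here is a sample of it.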
<?xml version='1.0' encoding='UTF-8'?>
<nvd xmlns:scap-core="http://scap.nist.gov/schema/scap-core/0.1" xmlns:cvss="http://scap.nist.gov/schema/cvss-v2/0.2" xmlns:vuln="http://scap.nist.gov/schema/vulnerability/0.4" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:patch="http://scap.nist.gov/schema/patch/0.1" xmlns="http://scap.nist.gov/schema/feed/vulnerability/2.0" xmlns:cpe-lang="http://cpe.mitre.org/language/2.0" nvd_xml_version="2.0" pub_date="2017-04-12T18:00:08" xsi:schemaLocation="http://scap.nist.gov/schema/patch/0.1 https://scap.nist.gov/schema/nvd/patch_0.1.xsd http://scap.nist.gov/schema/feed/vulnerability/2.0 https://scap.nist.gov/schema/nvd/nvd-cve-feed_2.0.xsd http://scap.nist.gov/schema/scap-core/0.1 https://scap.nist.gov/schema/nvd/scap-core_0.1.xsd">
<entry id="CVE-2013-7450">
<vuln:vulnerable-configuration id="http://nvd.nist.gov/">
<cpe-lang:logical-test operator="OR" negate="false">
<cpe-lang:fact-ref name="cpe:/a:pulp_project:pulp:2.2.1-1"/>
</cpe-lang:logical-test>
</vuln:vulnerable-configuration>
<vuln:vulnerable-software-list>
<vuln:product>cpe:/a:pulp_project:pulp:2.2.1-1</vuln:product>
</vuln:vulnerable-software-list>
<vuln:cve-id>CVE-2013-7450</vuln:cve-id>
<vuln:published-datetime>2017-04-03T11:59:00.143-04:00</vuln:published-datetime>
<vuln:last-modified-datetime>2017-04-11T10:01:04.323-04:00</vuln:last-modified-datetime>
<vuln:cvss>
<cvss:base_metrics>
<cvss:score>5.0</cvss:score>
<cvss:access-vector>NETWORK</cvss:access-vector>
<cvss:access-complexity>LOW</cvss:access-complexity>
<cvss:authentication>NONE</cvss:authentication>
<cvss:confidentiality-impact>NONE</cvss:confidentiality-impact>
<cvss:integrity-impact>PARTIAL</cvss:integrity-impact>
<cvss:availability-impact>NONE</cvss:availability-impact>
<cvss:source>http://nvd.nist.gov</cvss:source>
<cvss:generated-on-datetime>2017-04-11T09:43:13.623-04:00</cvss:generated-on-datetime>
</cvss:base_metrics>
</vuln:cvss>
<vuln:cwe id="CWE-295"/>
<vuln:references xml:lang="en" reference_type="VENDOR_ADVISORY">
<vuln:source>MLIST</vuln:source>
<vuln:reference href="http://www.openwall.com/lists/oss-security/2016/04/18/11" xml:lang="en">[oss-security] 20160418 CVE-2013-7450: Pulp &lt; 2.3.0 distributed the same CA key to all users</vuln:reference>
</vuln:references>
<vuln:references xml:lang="en" reference_type="VENDOR_ADVISORY">
<vuln:source>MLIST</vuln:source>
<vuln:reference href="http://www.openwall.com/lists/oss-security/2016/04/18/5" xml:lang="en">[oss-security] 20160418 Re: CVE request - Pulp &lt; 2.3.0 shipped the same authentication CA key/cert to all users</vuln:reference>
</vuln:references>
<vuln:references xml:lang="en" reference_type="VENDOR_ADVISORY">
<vuln:source>MLIST</vuln:source>
<vuln:reference href="http://www.openwall.com/lists/oss-security/2016/05/20/1" xml:lang="en">[oss-security] 20160519 Pulp 2.8.3 Released to address multiple CVEs</vuln:reference>
</vuln:references>
<vuln:references xml:lang="en" reference_type="PATCH">
<vuln:source>CONFIRM</vuln:source>
<vuln:reference href="https://bugzilla.redhat.com/show_bug.cgi?id=1003326" xml:lang="en">https://bugzilla.redhat.com/show_bug.cgi?id=1003326</vuln:reference>
</vuln:references>
<vuln:references xml:lang="en" reference_type="PATCH">
<vuln:source>CONFIRM</vuln:source>
<vuln:reference href="https://bugzilla.redhat.com/show_bug.cgi?id=1328345" xml:lang="en">https://bugzilla.redhat.com/show_bug.cgi?id=1328345</vuln:reference>
</vuln:references>
<vuln:references xml:lang="en" reference_type="VENDOR_ADVISORY">
<vuln:source>CONFIRM</vuln:source>
<vuln:reference href="https://github.com/pulp/pulp/pull/627" xml:lang="en">https://github.com/pulp/pulp/pull/627</vuln:reference>
</vuln:references>
<vuln:summary>Pulp before 2.3.0 uses the same the same certificate authority key and certificate for all installations.</vuln:summary>
</entry>
</nvd>
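I tried to parse it with ElementTree, but I get some strange output...
For example, when I use this,
from xml.etree import ElementTree

with open('/tmp/nvdcve-2.0-modified 2.xml', 'rt') as f:
    tree = ElementTree.parse(f)
root = tree.getroot()
for child in root:
    print(child.tag, child.attrib)
my output looks like this...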
{http://scap.nist.gov/schema/feed/vulnerability/2.0}entry {'id': 'CVE-2007-6759'}
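What confuses me is that if I want to iterate over it, I apparently need to do this:
for child in root.iter('{http://scap.nist.gov/schema/feed/vulnerability/2.0}entry'):
If I do it that way, I have no idea what the children's children are, or anything else.
For example, I am trying to pull out vuln:cve-id, each individual cvss:base_metrics value (score, access-vector), vuln:summary, and vuln:product.
Basically, I am trying to download the "xml stream" from the NIST website every hour and update a local MySQL database with it, so that I have a local place I can query when performing vulnerability assessments in my environment. Figuring out how to iterate over this XML is really confusing. I thought about converting it to JSON, but that seems like an unnecessary extra step that could cause problems, since there is no 1:1 XML/JSON conversion.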
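This is a namespaced XML document. Therefore, you need to address the nodes using their respective namespaces.
The namespaces used in the document are defined at the top of the document and mapped to so-called namespace prefixes: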
xmlns="http://scap.nist.gov/schema/feed/vulnerability/2.0"
xmlns:cvss="http://scap.nist.gov/schema/cvss-v2/0.2"
xmlns:vuln="http://scap.nist.gov/schema/vulnerability/0.4"
...
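So the prefix vuln, for example, maps to "http://scap.nist.gov/schema/vulnerability/0.4".
The one without a prefix is called the default namespace; it applies to all nodes that do not use an explicit namespace prefix (such as the root node nvd and the entry nodes).
So you either need to use the fully qualified namespaces, or use appropriate namespace prefixes (which you can map differently in your code than they are mapped in the parsed document) to address them.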
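To make the two options concrete, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree on a cut-down inline copy of the feed. The SAMPLE string and the prefixes feed and v are made up for this illustration; the point is that the prefixes in your code do not have to match the ones in the document, only the URIs do.

```python
import xml.etree.ElementTree as ET

# Cut-down, hypothetical sample that reuses two namespaces from the NVD feed.
SAMPLE = """<nvd xmlns="http://scap.nist.gov/schema/feed/vulnerability/2.0"
     xmlns:vuln="http://scap.nist.gov/schema/vulnerability/0.4">
  <entry id="CVE-2013-7450">
    <vuln:cve-id>CVE-2013-7450</vuln:cve-id>
  </entry>
</nvd>"""

FEED = "http://scap.nist.gov/schema/feed/vulnerability/2.0"
VULN = "http://scap.nist.gov/schema/vulnerability/0.4"

root = ET.fromstring(SAMPLE)

# Option 1: fully qualified names in '{uri}localname' form.
entry = root.find("{%s}entry" % FEED)
qualified = entry.find("{%s}cve-id" % VULN).text

# Option 2: our own prefixes mapped to the same URIs.
# 'feed' and 'v' are arbitrary names chosen for this sketch.
ns = {"feed": FEED, "v": VULN}
prefixed = root.find("feed:entry/v:cve-id", ns).text

print(qualified, prefixed)
```

Both lookups reach the same element, because the parser only ever compares namespace URIs, never prefixes.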
Here is an example that does this using lxml (and XPath expressions):
from lxml import etree

NSMAP = {
    'n': 'http://scap.nist.gov/schema/feed/vulnerability/2.0',
    'cpe-lang': 'http://cpe.mitre.org/language/2.0',
    'cvss': 'http://scap.nist.gov/schema/cvss-v2/0.2',
    'patch': 'http://scap.nist.gov/schema/patch/0.1',
    'scap-core': 'http://scap.nist.gov/schema/scap-core/0.1',
    'vuln': 'http://scap.nist.gov/schema/vulnerability/0.4',
    'xsi': 'http://www.w3.org/2001/XMLSchema-instance',
}


def normalized_tag(node):
    # Strip the '{namespace-uri}' part and return the bare local name.
    return node.tag.replace('{%s}' % node.nsmap[node.prefix], '')


root = etree.parse('nvdcve.xml').getroot()
entries = root.xpath('//n:nvd/n:entry', namespaces=NSMAP)
for entry in entries:
    print("Entry: %r" % entry.attrib['id'])

    # CVE ID
    cve_id = entry.xpath('./vuln:cve-id/text()', namespaces=NSMAP)[0]
    print("  CVE ID: %r" % cve_id)

    # Base Metrics
    metrics = entry.xpath('./vuln:cvss/cvss:base_metrics/*', namespaces=NSMAP)
    print("  Base Metrics:")
    for metric in metrics:
        metric_name = normalized_tag(metric)
        metric_value = metric.text
        print("    %s: %s" % (metric_name, metric_value))

    # Summary
    summary = entry.xpath('./vuln:summary/text()', namespaces=NSMAP)[0]
    print("  Summary: %s" % summary)

    # Products
    products = entry.xpath('./vuln:vulnerable-software-list/vuln:product',
                           namespaces=NSMAP)
    for product in products:
        print("  Product: %s" % product.text)
Output:
Entry: 'CVE-2013-7450'
  CVE ID: 'CVE-2013-7450'
  Base Metrics:
    score: 5.0
    access-vector: NETWORK
    access-complexity: LOW
    authentication: NONE
    confidentiality-impact: NONE
    integrity-impact: PARTIAL
    availability-impact: NONE
    source: http://nvd.nist.gov
    generated-on-datetime: 2017-04-11T09:43:13.623-04:00
  Summary: Pulp before 2.3.0 uses the same the same certificate authority key and certificate for all installations.
  Product: cpe:/a:pulp_project:pulp:2.2.1-1
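For more information on XML namespaces, see the Namespaces section in the lxml tutorial and the Wikipedia article on XML Namespaces.
For details on XPath syntax, see for example the XPath Syntax page in the W3Schools XPath Tutorial.
To get started with XPath, it also helps to fiddle with your document in one of the many XPath testers. In addition, the Firebug plugin for Firefox or the Google Chrome inspector lets you display the (or rather, one of many) XPath for the selected element.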
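If you would rather explore the structure from Python than in a browser-based tester, a small helper can print the element tree with readable prefixes, which answers the "what are the children's children" part of the question. This is only a sketch using the standard library: the outline and short_name helpers, the PREFIX_BY_URI mapping (with n standing in for the default namespace), and the cut-down SAMPLE are all assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical prefixes for this sketch; 'n' stands in for the default namespace.
PREFIX_BY_URI = {
    "http://scap.nist.gov/schema/feed/vulnerability/2.0": "n",
    "http://scap.nist.gov/schema/vulnerability/0.4": "vuln",
    "http://scap.nist.gov/schema/cvss-v2/0.2": "cvss",
}

# Cut-down sample with the same nesting as the NVD feed.
SAMPLE = """<nvd xmlns="http://scap.nist.gov/schema/feed/vulnerability/2.0"
     xmlns:vuln="http://scap.nist.gov/schema/vulnerability/0.4"
     xmlns:cvss="http://scap.nist.gov/schema/cvss-v2/0.2">
  <entry id="CVE-2013-7450">
    <vuln:cve-id>CVE-2013-7450</vuln:cve-id>
    <vuln:cvss>
      <cvss:base_metrics>
        <cvss:score>5.0</cvss:score>
      </cvss:base_metrics>
    </vuln:cvss>
  </entry>
</nvd>"""


def short_name(tag, prefix_by_uri):
    """Turn '{uri}local' into 'prefix:local'; unknown URIs get the prefix '?'."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return "%s:%s" % (prefix_by_uri.get(uri, "?"), local)
    return tag


def outline(elem, prefix_by_uri, depth=0):
    """Yield one indented line per element, parents before children."""
    yield "  " * depth + short_name(elem.tag, prefix_by_uri)
    for child in elem:
        yield from outline(child, prefix_by_uri, depth + 1)


lines = list(outline(ET.fromstring(SAMPLE), PREFIX_BY_URI))
print("\n".join(lines))
```

Reading the printed outline from top to bottom gives the steps of a path expression, for example n:entry/vuln:cvss/cvss:base_metrics.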
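Yes, XML with namespaces has to be handled a little differently. Here is another solution that continues to use the ElementTree API.
To use namespaces with this library: wherever you see vuln:summary, you need to look up the vuln namespace in the xmlns:vuln attribute of the root element and then refer to the element as {http://scap.nist.gov/schema/vulnerability/0.4}summary.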
import xml.etree.ElementTree as ET

tree = ET.parse('nvdcve-2.0-Modified.xml')
root = tree.getroot()

# default namespace is given by xmlns attribute of root element, still must be provided
for entry in root.findall('{http://scap.nist.gov/schema/feed/vulnerability/2.0}entry'):
    product_list = []
    metric_list = []

    # just use the element's id attribute
    id = entry.get('id')
    summary = entry.find('{http://scap.nist.gov/schema/vulnerability/0.4}summary').text

    software = entry.find('{http://scap.nist.gov/schema/vulnerability/0.4}vulnerable-software-list')
    if software is not None:
        for sw in software.findall('{http://scap.nist.gov/schema/vulnerability/0.4}product'):
            product_list.append(sw.text)

    metrics = entry.find('{http://scap.nist.gov/schema/vulnerability/0.4}cvss')
    if metrics is not None:
        for metric in metrics.find('{http://scap.nist.gov/schema/cvss-v2/0.2}base_metrics').findall('*'):
            # we don't know the element name, but can get it with the tag property
            metric_list.append(metric.tag.replace('{http://scap.nist.gov/schema/cvss-v2/0.2}', '') + ': ' + metric.text)

    print(id, summary, product_list, metric_list)
    # save to database!
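The long {uri} strings can be shortened. ElementTree's find(), findall() and findtext() accept a prefix-to-URI dictionary, so the same extraction might be sketched as below. The inline SAMPLE, the prefixes in NS, the parse_entries helper, and the empty CVE-0000-0000 entry (added only to show how missing children behave) are assumptions for this illustration, not part of the feed or of the code above.

```python
import xml.etree.ElementTree as ET

# Prefixes are our own choice; only the URIs must match the document.
NS = {
    "n": "http://scap.nist.gov/schema/feed/vulnerability/2.0",
    "vuln": "http://scap.nist.gov/schema/vulnerability/0.4",
    "cvss": "http://scap.nist.gov/schema/cvss-v2/0.2",
}

# Cut-down sample; the second entry is made up and deliberately empty.
SAMPLE = """<nvd xmlns="http://scap.nist.gov/schema/feed/vulnerability/2.0"
     xmlns:vuln="http://scap.nist.gov/schema/vulnerability/0.4"
     xmlns:cvss="http://scap.nist.gov/schema/cvss-v2/0.2">
  <entry id="CVE-2013-7450">
    <vuln:vulnerable-software-list>
      <vuln:product>cpe:/a:pulp_project:pulp:2.2.1-1</vuln:product>
    </vuln:vulnerable-software-list>
    <vuln:cvss>
      <cvss:base_metrics>
        <cvss:score>5.0</cvss:score>
        <cvss:access-vector>NETWORK</cvss:access-vector>
      </cvss:base_metrics>
    </vuln:cvss>
    <vuln:summary>Pulp before 2.3.0 uses the same the same certificate authority key and certificate for all installations.</vuln:summary>
  </entry>
  <entry id="CVE-0000-0000"/>
</nvd>"""


def parse_entries(root):
    """Return one dict per <entry>; missing parts become None or empty."""
    records = []
    for entry in root.findall("n:entry", NS):
        metrics = {}
        for metric in entry.findall("vuln:cvss/cvss:base_metrics/*", NS):
            # Drop the '{uri}' part of the tag to get the bare metric name.
            metrics[metric.tag.split("}", 1)[-1]] = metric.text
        records.append({
            "id": entry.get("id"),
            # findtext returns the default instead of raising when absent.
            "summary": entry.findtext("vuln:summary", default=None, namespaces=NS),
            "products": [p.text for p in entry.findall(
                "vuln:vulnerable-software-list/vuln:product", NS)],
            "metrics": metrics,
        })
    return records


records = parse_entries(ET.fromstring(SAMPLE))
for record in records:
    print(record)
```

Because findall() returns an empty list and findtext() returns its default when nothing matches, an entry without a summary or CVSS block does not raise, which matters if some feed entries lack those children. Each resulting dict is a plain Python structure that could then be handed to whatever database layer you use.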