How to parse an XML document with RDF using ElementTree in Python?
My XML looks like this:
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel rdf:about="https://pracujwit.pl/rss/all/">
<description>Najnowsze oferty</description>
<link>https://pracujwit.pl/</link>
<title>Pracuj w IT</title>
<dc:date>05-02-2020</dc:date>
<items>
<rdf:Seq>
<rdf:li rdf:resource="https://pracujwit.pl/job/192829/bi-consultant-at-primaris/"/>
<rdf:li rdf:resource="https://pracujwit.pl/job/192827/senior-python-developer-100-zdalnie-at-newperspective/"/>
<rdf:li rdf:resource="https://pracujwit.pl/job/192826/kierownik-projektu-it-at-comarch-sa/"/>
</rdf:Seq>
</items>
</channel>
<item rdf:about="https://pracujwit.pl/job/192829/bi-consultant-at-primaris/">
<description><![CDATA[<strong>Lokalizacja:</strong> Warszawa<br /><strong>Firma:</strong> Primaris<br /><strong>Oferta:</strong><br /><br /><br /><a href="https://pracujwit.pl/job/192829//">Aplikuj online</a><br />]]></description>
<link>https://pracujwit.pl/job/192829/bi-consultant-at-primaris/</link>
<title><![CDATA[BI Consultant]]></title>
<company><![CDATA[Primaris]]></company>
<location><![CDATA[Warszawa]]></location>
<dc:date>2020-02-04 15:12:32</dc:date>
</item>
<item rdf:about="https://pracujwit.pl/job/192827/senior-python-developer-100-zdalnie-at-newperspective/">
<description><![CDATA[<strong>Lokalizacja:</strong> <br /><strong>Firma:</strong> NewPerspective <br /><strong>Oferta:</strong><br /><br /><br /><a href="https://pracujwit.pl/job/192827//">Aplikuj online</a><br />]]></description>
<link>https://pracujwit.pl/job/192827/senior-python-developer-100-zdalnie-at-newperspective/</link>
<title><![CDATA[Senior Python Developer / 100% zdalnie]]></title>
<company><![CDATA[NewPerspective ]]></company>
<location><![CDATA[]]></location>
<dc:date>2020-02-04 11:45:34</dc:date>
</item>
<item rdf:about="https://pracujwit.pl/job/192826/kierownik-projektu-it-at-comarch-sa/">
<description><![CDATA[<strong>Lokalizacja:</strong> Kraków<br /><strong>Firma:</strong> Comarch SA<br /><strong>Oferta:</strong><br /><br /><br /><a href="https://pracujwit.pl/job/192826//">Aplikuj online</a><br />]]></description>
<link>https://pracujwit.pl/job/192826/kierownik-projektu-it-at-comarch-sa/</link>
<title><![CDATA[Kierownik Projektu IT]]></title>
<company><![CDATA[Comarch SA]]></company>
<location><![CDATA[Kraków]]></location>
<dc:date>2020-02-04 09:33:05</dc:date>
</item>
</rdf:RDF>
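I saved it to the file 'xml_rdf.txt'. I usually write my XML parsing code like this:
import xml.etree.ElementTree as ET
path = 'path/to/xml_rdf.txt'
# Pass the path variable; the original line used the bare, unquoted path.
xml_tree = ET.parse(path)
for item in xml_tree.iter('item'):
    print(item)
But in this case I don't get any items. I know that namespaces can be specified for the XML parser, but I'm having trouble with that here. I tried something like:
ns = {"dcterms": "http://purl.org/rss/1.0/"}
for item in xml_tree.iter('dcterms:item'):
    print(item)
But it's the same story: no items.
Does anyone know how to handle this?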
With iter(), you have to use the namespace URI in the tag name:
for item in xml_tree.iter('{http://purl.org/rss/1.0/}item'):
    print(item)
Output:
<Element '{http://purl.org/rss/1.0/}item' at 0x7f6ff8d5ad90>
<Element '{http://purl.org/rss/1.0/}item' at 0x7f6ff8d5af50>
<Element '{http://purl.org/rss/1.0/}item' at 0x7f6ff8d64150>
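The reason this works is that ElementTree stores every tag in the expanded '{uri}local' form, so a bare 'item' never matches an element that lives in the default namespace. Here is a minimal sketch of that, assuming a trimmed-down feed with made-up example.com URLs; the qname() helper is my own illustration, not part of ElementTree:

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample in the same shape as the feed (hypothetical URLs).
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item rdf:about="https://example.com/job/1/">
    <title>First</title>
  </item>
  <item rdf:about="https://example.com/job/2/">
    <title>Second</title>
  </item>
</rdf:RDF>"""

RSS_NS = "http://purl.org/rss/1.0/"


def qname(uri, local):
    """Build the '{uri}local' form that ElementTree uses for tags (hypothetical helper)."""
    return f"{{{uri}}}{local}"


root = ET.fromstring(SAMPLE)

# A bare tag name does not match namespaced elements.
plain = list(root.iter("item"))
# The fully qualified name does.
qualified = list(root.iter(qname(RSS_NS, "item")))

print(len(plain), len(qualified))
print(qualified[0].tag)
```

Printing elem.tag for a few elements is usually the quickest way to see which namespace URI a document really uses.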
With findall(), you can use a prefix, as long as you pass the prefix-to-URI mapping:
ns = {"dcterms": "http://purl.org/rss/1.0/"}
for item in xml_tree.findall('dcterms:item', ns):
    print(item)
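Two details that may be worth knowing here: the prefix in the mapping is just a local label (only the URI has to match the document), and a path without './/' only looks at direct children of the element you call it on. A rough sketch, again assuming a small made-up sample rather than the real feed:

```python
import xml.etree.ElementTree as ET

# Small hypothetical sample: one channel and one item, both with a title.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="https://example.com/rss/">
    <title>Feed</title>
  </channel>
  <item rdf:about="https://example.com/job/1/">
    <title>First</title>
  </item>
</rdf:RDF>"""

root = ET.fromstring(SAMPLE)

# Two mappings with different prefixes but the same URI.
ns_a = {"dcterms": "http://purl.org/rss/1.0/"}
ns_b = {"rss": "http://purl.org/rss/1.0/"}

items_a = root.findall("dcterms:item", ns_a)
items_b = root.findall("rss:item", ns_b)
print(items_a[0] is items_b[0])  # both lookups return the same element

# No title is a direct child of the root, so this list is empty ...
direct_titles = root.findall("rss:title", ns_b)
# ... while './/' searches all descendants.
all_titles = root.findall(".//rss:title", ns_b)
print(len(direct_titles), len(all_titles))
```

So 'dcterms' works in the snippet above, but a name like 'rss' would probably be less confusing to a later reader, since dcterms normally refers to a Dublin Core namespace.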
Thanks to @mzjn for the help. In the end I got the items and their data this way:
namespaces = {'xml_root': 'http://purl.org/rss/1.0/',
              'xml_root_dc': 'http://purl.org/dc/elements/1.1/'}
for offer in xml_tree.findall('./xml_root:item', namespaces):
    url = offer.find('./xml_root:link', namespaces).text
    date_publication = offer.find('./xml_root_dc:date', namespaces).text
This topic can be closed.
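One caveat with find(...).text is that it raises AttributeError when an element is missing. A possible variation, sketched under the assumption that every dc:date inside an item uses the 'YYYY-MM-DD HH:MM:SS' layout seen in the feed: use findtext() with a default and collect the fields into dicts. The sample data, the extract_offers() name and the dict keys are my own choices for illustration:

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical one-item sample shaped like the feed in the question.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item rdf:about="https://example.com/job/1/">
    <link>https://example.com/job/1/</link>
    <title><![CDATA[BI Consultant]]></title>
    <location><![CDATA[]]></location>
    <dc:date>2020-02-04 15:12:32</dc:date>
  </item>
</rdf:RDF>"""

NAMESPACES = {
    "rss": "http://purl.org/rss/1.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}
# Namespaced attributes are stored in the same '{uri}local' form as tags.
RDF_ABOUT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about"


def extract_offers(root):
    """Return one dict per rss:item; missing or empty fields become ''."""
    offers = []
    for item in root.findall("rss:item", NAMESPACES):
        raw_date = item.findtext("dc:date", default="", namespaces=NAMESPACES)
        offers.append({
            "about": item.get(RDF_ABOUT),
            "url": item.findtext("rss:link", default="", namespaces=NAMESPACES),
            "title": item.findtext("rss:title", default="", namespaces=NAMESPACES),
            "location": item.findtext("rss:location", default="", namespaces=NAMESPACES),
            # Assumed date layout; None when the element is absent or empty.
            "published": (datetime.strptime(raw_date, "%Y-%m-%d %H:%M:%S")
                          if raw_date else None),
        })
    return offers


offers = extract_offers(ET.fromstring(SAMPLE))
for offer in offers:
    print(offer["title"], offer["url"], offer["published"])
```

With a parsed file, the same function could be called as extract_offers(xml_tree.getroot()).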