How can I get some values from XML in Python?
I have this sitemap in XML. How can I get every <loc>?
<?xml version="1.0" encoding="UTF-8"?>
<urlset
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<!-- created with Free Online Sitemap Generator www.xml-sitemaps.com -->
<url>
<loc>https://www.nsnam.org/wiki/Main_Page</loc>
<lastmod>2018-10-24T03:03:05+00:00</lastmod>
<priority>1.00</priority>
</url>
<url>
<loc>https://www.nsnam.org/wiki/Current_Development</loc>
<lastmod>2018-10-24T03:03:05+00:00</lastmod>
<priority>0.80</priority>
</url>
<url>
<loc>https://www.nsnam.org/wiki/Developer_FAQ</loc>
<lastmod>2018-10-24T03:03:05+00:00</lastmod>
<priority>0.80</priority>
</url>
The program looks like this.
import os.path
import xml.etree.ElementTree
import requests
from subprocess import call

def creatingListOfBrokenLinks():
    if (os.path.isfile('sitemap.xml')):
        e = xml.etree.ElementTree.parse('sitemap.xml').getroot()
        file = open("all_broken_links.txt", "w")
        for atype in e.findall('url'):
            r = requests.get(atype.find('loc').text)
            print(atype)
            if (r.status_code == 404):
                file.write(atype)
        file.close()

if __name__ == "__main__":
    creatingListOfBrokenLinks()
I suggest using the ElementTree package from the standard library:
from xml.etree import ElementTree as ET
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<!-- created with Free Online Sitemap Generator www.xml-sitemaps.com -->
...
...
</urlset>"""
urlset = ET.fromstring(SITEMAP)
loc_elements = urlset.iter("{http://www.sitemaps.org/schemas/sitemap/0.9}loc")
for loc_element in loc_elements:
    print(loc_element.text)
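Documentation link:
Update:
- Where your code goes wrong is XML namespace handling: the sitemap declares a default namespace, so the bare tag names 'url' and 'loc' match nothing.
- Also, my example uses .iter() rather than .findall() / .find() to get the loc elements directly. Depending on the structure of your XML and your use case, this may or may not be suitable.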
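The namespace point above can also be handled with a prefix map passed to .findall() and .findtext(), so the full URI is not repeated in every path. This is a minimal sketch on a shortened two-entry sitemap; the prefix "sm" and the names SITEMAP_NS, SAMPLE and get_locs are made up for illustration and do not come from the original code.

```python
from xml.etree import ElementTree as ET

# Prefix map: "sm" is an arbitrary local alias for the sitemap namespace
# (an assumption for this sketch, not something the sitemap itself defines).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Shortened sample; the real sitemap has more <url> entries.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.nsnam.org/wiki/Main_Page</loc>
    <priority>1.00</priority>
  </url>
  <url>
    <loc>https://www.nsnam.org/wiki/Developer_FAQ</loc>
    <priority>0.80</priority>
  </url>
</urlset>"""


def get_locs(xml_text):
    """Return the text of every <loc> that sits directly inside a <url>."""
    root = ET.fromstring(xml_text)
    return [
        url.findtext("sm:loc", namespaces=SITEMAP_NS)
        for url in root.findall("sm:url", SITEMAP_NS)
    ]


print(get_locs(SAMPLE))
```

Compared with .iter(), this only picks up loc elements that are direct children of a url element, which may matter if the document nests them elsewhere.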
Your code works fine on my side. All you have to do is add {http://www.sitemaps.org/schemas/sitemap/0.9} before url and loc.
Here:
import os.path
import xml.etree.ElementTree
import requests

def creatingListOfBrokenLinks():
    if os.path.isfile('sitemap.xml'):
        e = xml.etree.ElementTree.parse('sitemap.xml').getroot()
        # "with" closes the file even if a request raises
        with open("all_broken_links.txt", "w") as file:
            for atype in e.findall('{http://www.sitemaps.org/schemas/sitemap/0.9}url'):
                url = atype.find('{http://www.sitemaps.org/schemas/sitemap/0.9}loc').text
                r = requests.get(url)
                print(url)
                if r.status_code == 404:
                    # write the URL text; file.write() raises TypeError for an Element
                    file.write(url + "\n")

if __name__ == "__main__":
    creatingListOfBrokenLinks()
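To try the namespace-qualified lookup without hitting the network, the parsing can be separated from the HTTP call. The following is only a sketch under stated assumptions: find_broken_links, fetch_status, FAKE_STATUSES and the example.com URLs are hypothetical names and data introduced here, and the status lookup is a plain dictionary standing in for a real request.

```python
from xml.etree import ElementTree as ET

# Namespace in the "{uri}" form that ElementTree expects in tag names.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def find_broken_links(xml_text, fetch_status):
    """Return the <loc> URLs for which fetch_status(url) reports 404.

    fetch_status is any callable mapping a URL to an HTTP status code
    (hypothetical parameter, injected so the parsing can run offline).
    """
    root = ET.fromstring(xml_text)
    broken = []
    for loc in root.iter(NS + "loc"):
        url = (loc.text or "").strip()
        if url and fetch_status(url) == 404:
            broken.append(url)
    return broken


# Stand-in for a real HTTP call: made-up URLs and statuses.
FAKE_STATUSES = {
    "https://example.com/ok": 200,
    "https://example.com/missing": 404,
}

SAMPLE = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/ok</loc></url>"
    "<url><loc>https://example.com/missing</loc></url>"
    "</urlset>"
)

print(find_broken_links(SAMPLE, FAKE_STATUSES.get))
```

With the requests library, the callable could be something like lambda url: requests.get(url).status_code, which matches what the code above does inline.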