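Extract data from XML file if arguments are of certain values
I want to iterate over a Wikipedia dump in XML format and, for each revision, save the timestamp and the comment if the revision was made by a certain username. Is this possible? I am trying to get familiar with lxml.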
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
<siteinfo>
<sitename>Wikipedia</sitename>
<dbname>enwiki</dbname>
<base>https://en.wikipedia.org/wiki/Main_Page</base>
<generator>MediaWiki 1.27.0-wmf.18</generator>
<case>first-letter</case>
<namespaces>...</namespaces>
</siteinfo>
<page>
<title>Zhuangzi</title>
<ns>0</ns>
<id>42870472</id>
<revision>
<id>610251969</id>
<timestamp>2014-05-26T20:08:14Z</timestamp>
<contributor>
<username>White whirlwind</username>
<id>8761551</id>
</contributor>
<comment>...</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve" bytes="41">#REDIRECT [[Zhuang Zhou]] {{R from move}}</text>
<sha1>9l31fcd4fp0cfxgearifr7jrs3240xl</sha1>
</revision>
<revision>...</revision>
<revision>...</revision>
<revision>...</revision>
<revision>...</revision>
<revision>...</revision>
</page>
<page>...</page>
</mediawiki>
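Yes, this can be done with lxml.
You know which node you are looking for (start with the revision's username), so write code that selects that node and compares its value with the known name you are looking for.
Once that part works, saving the timestamp and the comment should be straightforward.
You will find what you need in the lxml documentation (http://lxml.de/); see the section on "XPath" for how to select the nodes you want (it also includes a snippet for loading the XML into your script).
You may also want to consult the ElementTree tutorial linked from lxml (http://effbot.org/zone/element.htm) to learn how to work with the XML elements you find via XPath or other methods. This is useful for getting values out of elements.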
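The steps described above (select the username node, compare it, then read the timestamp and comment) could be sketched roughly as follows. This is only an illustration, not the answerer's code: the function name `revisions_by_user`, the trimmed `SAMPLE` document, and the comment text "moved page" are made up for the example. It uses the ElementTree-style `iterfind`/`findtext` calls, which lxml shares with the standard library, so the import falls back to `xml.etree.ElementTree` if lxml is not installed.

```python
try:
    from lxml import etree  # preferred, as in the question
except ImportError:  # the standard library offers the same find* API
    import xml.etree.ElementTree as etree

# Prefix-to-URI mapping; the dump declares this URI as its default namespace.
NS = {"wiki": "http://www.mediawiki.org/xml/export-0.10/"}

# Trimmed, hypothetical sample in the same shape as the dump.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Zhuangzi</title>
    <revision>
      <timestamp>2014-05-26T20:08:14Z</timestamp>
      <contributor><username>White whirlwind</username></contributor>
      <comment>moved page</comment>
    </revision>
    <revision>
      <timestamp>2014-07-26T20:08:14Z</timestamp>
      <contributor><ip>127.0.0.1</ip></contributor>
    </revision>
  </page>
</mediawiki>"""


def revisions_by_user(root, username):
    """Return (timestamp, comment) pairs for revisions made by `username`."""
    results = []
    for rev in root.iterfind(".//wiki:revision", NS):
        # findtext returns None when there is no <username> (anonymous edits)
        name = rev.findtext("wiki:contributor/wiki:username", namespaces=NS)
        if name == username:
            results.append((
                rev.findtext("wiki:timestamp", namespaces=NS),
                rev.findtext("wiki:comment", default="", namespaces=NS),
            ))
    return results


root = etree.fromstring(SAMPLE)
print(revisions_by_user(root, "White whirlwind"))
```

For a real file, `etree.parse(path).getroot()` would take the place of `etree.fromstring(SAMPLE)`.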
import xmltodict
xml_input = """
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
<siteinfo>
<sitename>Wikipedia</sitename>
<dbname>enwiki</dbname>
<base>https://en.wikipedia.org/wiki/Main_Page</base>
<generator>MediaWiki 1.27.0-wmf.18</generator>
<case>first-letter</case>
<namespaces>...</namespaces>
</siteinfo>
<page>
<title>Zhuangzi</title>
<ns>0</ns>
<id>42870472</id>
<revision>
<id>610251969</id>
<timestamp>2014-05-25T20:08:14Z</timestamp>
<contributor>
<username>Patric</username>
<id>8761551</id>
</contributor>
</revision>
<revision>
<id>610251969</id>
<timestamp>2014-05-26T20:08:14Z</timestamp>
<contributor>
<username>Don</username>
<id>8761551</id>
</contributor>
</revision>
<revision>
<id>610251969</id>
<timestamp>2014-05-27T20:08:14Z</timestamp>
<contributor>
<username>Patric</username>
<id>8761551</id>
</contributor>
</revision>
</page>
</mediawiki>
"""
dic_xml = xmltodict.parse(xml_input)
# 'revision' is a list here because the page contains several <revision> elements
for rev in dic_xml['mediawiki']['page']['revision']:
    if rev['contributor']['username'] == 'Patric':
        print(rev['id'])
        print(rev['timestamp'])
For your file:
import xmltodict

with open('/home/jurkij/Downloads/testarticles.xml') as xml_file:
    dic_xml = xmltodict.parse(xml_file.read())

for page in dic_xml['mediawiki']['page']:
    for rev in page['revision']:
        # anonymous edits have no <username>, so check before comparing
        if 'username' in rev['contributor'] and rev['contributor']['username'] == 'Aristophanes68':
            print(rev['timestamp'])
            print(rev['id'])
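One thing to watch with the xmltodict approach: an element that occurs once is parsed into a single dict, while a repeated element becomes a list, so a dump with one page or a page with one revision would break the loops above. A possible way around that is sketched below. The helper names `as_list` and `collect_revisions` are hypothetical, and the `parsed` dict is hand-written to mimic the shape of xmltodict's output so that the example runs without any input file.

```python
def as_list(value):
    """Wrap a single item in a list; pass lists through; treat None as empty."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def collect_revisions(parsed, username):
    """Collect (timestamp, comment) pairs from an xmltodict-style mapping."""
    found = []
    for page in as_list(parsed["mediawiki"].get("page")):
        for rev in as_list(page.get("revision")):
            # an empty <contributor/> is parsed as None, hence the `or {}`
            contributor = rev.get("contributor") or {}
            if contributor.get("username") == username:
                found.append((rev["timestamp"], rev.get("comment")))
    return found


# Hand-written stand-in for the result of xmltodict.parse(...): one page with
# one revision, so both "page" and "revision" are dicts rather than lists.
parsed = {
    "mediawiki": {
        "page": {
            "title": "Zhuangzi",
            "revision": {
                "timestamp": "2014-05-25T20:08:14Z",
                "contributor": {"username": "Patric", "id": "8761551"},
            },
        }
    }
}

print(collect_revisions(parsed, "Patric"))
```

With real data, `parsed` would simply be the `dic_xml` value from the snippet above.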
Following on from your last question, you can do this easily with lxml and an XPath expression:
from lxml.etree import parse
tree = parse("test.xml")
ns = {"wiki": "http://www.mediawiki.org/xml/export-0.10/"}
revs = tree.xpath("//wiki:revision[.//wiki:username='White whirlwind']", namespaces=ns)
print([(rev.xpath(".//wiki:timestamp//text()", namespaces=ns)[0],
        rev.xpath(".//wiki:username//text()", namespaces=ns)[0])
       for rev in revs])
With the following XML:
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
<siteinfo>
<sitename>Wikipedia</sitename>
<dbname>enwiki</dbname>
<base>https://en.wikipedia.org/wiki/Main_Page</base>
<generator>MediaWiki 1.27.0-wmf.18</generator>
<case>first-letter</case>
<namespaces>...</namespaces>
</siteinfo>
<page>
<title>Zhuangzi</title>
<ns>0</ns>
<id>42870472</id>
<revision>
<id>610251969</id>
<timestamp>2014-05-26T20:08:14Z</timestamp>
<contributor>
<username>White whirlwind</username>
<id>8761551</id>
</contributor>
<comment>...</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve" bytes="41">#REDIRECT [[Zhuang Zhou]] {{R from move}}</text>
<sha1>9l31fcd4fp0cfxgearifr7jrs3240xl</sha1>
</revision>
<revision>
<id>610251969</id>
<timestamp>2014-06-26T20:08:14Z</timestamp>
<contributor>
<username>White whirlwind</username>
<id>8761551</id>
</contributor>
<comment>...</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve" bytes="41">#REDIRECT [[Zhuang Zhou]] {{R from move}}</text>
<sha1>9l31fcd4fp0cfxgearifr7jrs3240xl</sha1>
</revision>
<revision> <id>610251969</id>
<timestamp>2014-07-26T20:08:14Z</timestamp>
<contributor>
<username>foobar</username>
<id>8761551</id>
</contributor>
<comment>...</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve" bytes="41">#REDIRECT [[Zhuang Zhou]] {{R from move}}</text>
<sha1>9l31fcd4fp0cfxgearifr7jrs3240xl</sha1></revision>
<revision>...</revision>
<revision>...</revision>
<revision>...</revision>
</page>
</mediawiki>
Output:
[('2014-05-26T20:08:14Z', 'White whirlwind'), ('2014-06-26T20:08:14Z', 'White whirlwind')]
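//wiki:revision[.//wiki:username='White whirlwind'] finds every revision tag that contains a username whose value is White whirlwind. You can see that it returns 2 results because foobar does not match. After that you only need to extract the timestamp and username values from each of the filtered revisions in revs.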
For your file in Google Drive it returns:
[('2014-05-26T20:08:14Z', 'White whirlwind'),
('2014-05-26T20:12:49Z', 'White whirlwind'),
('2014-05-26T20:13:04Z', 'White whirlwind'),
('2014-05-31T21:14:15Z', 'White whirlwind'),
('2015-10-11T19:24:46Z', 'White whirlwind'),
('2015-10-11T19:26:31Z', 'White whirlwind')]
This is correct if you check it against your file.
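The XPath version above loads the whole document into memory, which may not be practical for a full Wikipedia dump. A streaming variant using `iterparse` might look like the sketch below. This is a swapped-in technique rather than what the answer does, and the names `iter_user_revisions`, `WIKI`, and `SAMPLE` (including its comment texts) are made up for illustration. It returns the timestamp and the comment, as the question asked, and falls back to the standard library's API-compatible `iterparse` if lxml is not installed.

```python
import io

try:
    from lxml import etree
except ImportError:  # the standard library has a compatible iterparse
    import xml.etree.ElementTree as etree

# Namespace of the dump in the "{uri}tag" form that element tags use.
WIKI = "{http://www.mediawiki.org/xml/export-0.10/}"

# Small hypothetical sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <revision>
      <timestamp>2014-05-26T20:08:14Z</timestamp>
      <contributor><username>White whirlwind</username></contributor>
      <comment>first edit</comment>
    </revision>
    <revision>
      <timestamp>2014-07-26T20:08:14Z</timestamp>
      <contributor><username>foobar</username></contributor>
      <comment>other edit</comment>
    </revision>
    <revision>
      <timestamp>2014-06-26T20:08:14Z</timestamp>
      <contributor><username>White whirlwind</username></contributor>
    </revision>
  </page>
</mediawiki>"""


def iter_user_revisions(source, username):
    """Yield (timestamp, comment) for each revision by `username`, streaming."""
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag != WIKI + "revision":
            continue  # only act once a whole <revision> has been read
        name = elem.findtext(f"{WIKI}contributor/{WIKI}username")
        if name == username:
            yield (elem.findtext(WIKI + "timestamp"),
                   elem.findtext(WIKI + "comment", default=""))
        elem.clear()  # free the children of the revision just processed


print(list(iter_user_revisions(io.BytesIO(SAMPLE), "White whirlwind")))
```

`iterparse` also accepts a filename, so `iter_user_revisions("test.xml", "White whirlwind")` would process a file on disk in the same way.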