How to extract plain text from mediawiki?

Question

I have exported some categories from https://awoiaf.westeros.org/index.php/Special:Export. They come in XML format. I would like plain text from the "Synopsis" sections. You can download the whole thing here (54KB zipped).

A typical Synopsis section looks like this:

==Synopsis==
[[Catelyn Tully|Catelyn]] listens to the continuous pounding noise of the drums the musicians in the hall are playing. She is seated between [[Ryman Frey]] and [[Roose Bolton]] during the wedding feast. She remarks to herself how joyless the wedding is, and watches as [[Robb Stark|Robb]] dances with several of the Frey maids and [[Edmure Tully|Edmure]] dotes on his soon to be wife, [[Roslin Frey|Roslin]]. Catelyn becomes more wary when she learns that [[Olyvar Frey|Olyvar]], [[Perwyn Frey|Perwyn]], and [[Alesander Frey]] are all not in attendance at the wedding. She notices [[Merrett Frey]] trying to drink the [[Greatjon Umber|Greatjon]] under the table, and finally Lord [[Walder Frey]] calls for the bedding. Robb does not participate as the Greatjon carries a weeping Roslin to the bed chamber.

How can I extract plain text from all of the Synopsis sections?

Answer 1

First, you need to parse the export as XML. I recommend using lxml and XPath.

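from lxml import etree

# 'file.xml' stands for the XML file downloaded from Special:Export.
tree = etree.parse('file.xml')
# The export declares a default namespace, so bind it to a prefix for XPath.
# The version in the URI must match the xmlns attribute of your export.
namespaces = {"m": "http://www.mediawiki.org/xml/export-0.10/"}
# Select the raw wikitext of every page revision.
expression = '/m:mediawiki/m:page/m:revision/m:text/text()'
texts = tree.xpath(expression, namespaces=namespaces)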

Once you have all the text sections, parse them one by one with regular expressions. Or write your own parser.
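As a rough illustration of the regular-expression step, here is a minimal sketch that uses only the standard `re` module. The function name `extract_synopsis` and both patterns are my own assumptions, not part of any MediaWiki tooling. It assumes the section starts with a level-2 `==Synopsis==` heading and runs until the next level-2 heading, and it only handles simple `[[Target]]` and `[[Target|Label]]` links plus bold/italic quotes. Templates, references, and links with several pipes (such as images) would need extra handling.

```python
import re

# Level-2 "Synopsis" heading, capturing everything up to the next
# level-2 heading or the end of the page. Level-3 subsections
# (===...===) stay inside the captured body.
SYNOPSIS_RE = re.compile(
    r"^==\s*Synopsis\s*==[ \t]*\n?(.*?)(?=^==[^=]|\Z)",
    re.DOTALL | re.MULTILINE,
)

# [[Target|Label]] -> Label, [[Target]] -> Target
LINK_RE = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def extract_synopsis(wikitext):
    """Return the Synopsis section as plain text, or None if absent."""
    match = SYNOPSIS_RE.search(wikitext)
    if match is None:
        return None
    body = LINK_RE.sub(r"\1", match.group(1))
    # Drop bold (''') and italic ('') quote markup.
    body = body.replace("'''", "").replace("''", "")
    return body.strip()
```

With the `texts` list from the XPath query above, something like `[s for s in map(extract_synopsis, texts) if s is not None]` would collect the synopsis of every page that has one.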


python

mediawiki