Extracting XML info using PHP
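I have a strangely formatted XML file, and I want to extract the <text xml:space="preserve" bytes="1099"> section and turn that information into its own array.
I assume I would have to find the | delimiter and split at that point, but I'm not quite sure how to do that.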
<page>
<title>Martial Ares</title>
<ns>0</ns>
<id>23026</id>
<sha1>7imznp2a51dh3kslf5gtqnkpkidlvom</sha1>
<revision>
<id>79960</id>
<timestamp>2014-02-25T07:24:27Z</timestamp>
<contributor>
<username>JScathach</username>
<id>8024930</id>
</contributor>
<text xml:space="preserve" bytes="1017">{{Infobox card (2)
|card name=[Martial] Ares
|character name=Ares
|release_date=May 1 2013
|image 1=MartialAres5.jpg
|rarity 1=Super Special Rare
|pwr req 1=28
|sale price 1=94200
|max card lv 1=60
|max mastery lv 1=40
|quote 1=Ares prefers weapons that were used during the age of Greek myth: sword, axe, and spear. But he can use any weapon expertly, and turn most ordinary objects into lethal weapons.
|base atk 1=2440
|base def 1=2650
|max atk 1=7015
|max def 1=7613
|mastery bonus atk 1=915
|mastery bonus def 1=993
|image 2=MartialAres6.jpg
|rarity 2=Ultimate Rare
|sale price 2=188400
|max mastery lv 2=200
|quote 2=Next time I see Hercules, We're going to have a steel conversation. It's about time for him to answer for massacring my Stymphalian Birds.
|max atk 2=9822
|max def 2=10660
|mastery bonus atk 2=1098
|mastery bonus def 2=1192
|alignment=Bruiser
|ability=Warhawk
|gender=Male
|usage=Average
|faction=Super Hero
|effect=Significantly harden DEF of your Bruisers.
|centretrait=None
}}
__NOWYSIWYG__
</text>
</revision>
</page>
You can get the text like this:
$xml = simplexml_load_string($string);
$text = $xml->revision->text;
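PHP's libxml-based libraries (including SimpleXMLElement and DOMDocument) preserve the whitespace inside the text element from your question with their default options, so you should run into few problems there.
As for the format encoded in the text value of the text element: this is MediaWiki syntax, and the largest part, at the beginning, is an Infobox Template.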
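A minimal sketch of that first step, assuming a shortened stand-in for the <page> document from the question (the `$string` content and its `bytes` value are made up for illustration). It casts the element to a string and checks that the line breaks in front of the `|` fields are still present:

```php
<?php
// Shortened, hypothetical stand-in for the <page> document in the question.
$string = <<<XML
<page>
  <revision>
    <text xml:space="preserve" bytes="40">{{Infobox card (2)
|card name=[Martial] Ares
}}
</text>
  </revision>
</page>
XML;

$xml = simplexml_load_string($string);

// Property access returns a SimpleXMLElement; cast it to get a plain string.
$text = (string) $xml->revision->text;

// The newline before each "|" field survives loading with default options.
var_dump(strpos($text, "\n|card name") !== false);
```

The explicit `(string)` cast matters once you pass the value to string functions or compare it strictly.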
{{Infobox
| title = Top level title
| data1 = {{Infobox | decat = yes | child = yes
| title = First subsection
| label1= Label 1.1
| data1 = Data 1.1
}}
| data2 = {{Infobox | decat = yes | child = yes
|title = Second subsection
| label1= Label 2.1
| data1 = Data 2.1
}}
| belowstyle =
| below = Below text
}}
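Example of an Infobox template with embedded (nested) Infobox templates
Infobox templates follow the general template syntax for naming a template and giving it (named or unnamed) parameters. You will find this outlined in the Usage Syntax section of Help:Template. As MediaWiki itself is written in PHP, you can even find PHP code related to parsing these templates within its source. It shows how to parse this data more rigorously than the help page does, but because it is source code written in a modular and structured way, it is also more complex. Depending on what kind of programmer you are, it may overwhelm you, as it requires the skill to read PCRE regular expressions.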
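From what I have seen so far, you first need to extract the {{ ... }} blocks (recursively). Within these, you can parse the title and any number of (named or unnamed) parameters. The delimiter separating these fields is |. I can't say whether or how any of these delimiters can be escaped, nor can I say whether field names and values can span multiple lines; the example given, however, shows that a value can be another template, which in turn can be multiline.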
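The approach described above could be sketched roughly as follows. This is not MediaWiki's own parser: `parseTemplate` is a hypothetical helper that scans for the first {{ ... }} block, tracks the nesting depth, and only splits on | at the top level. It assumes braces are balanced and does not handle escaping, triple-brace {{{...}}} parameters, or pipes inside [[link|label]] wiki links.

```php
<?php
// Hypothetical helper: parse the first {{...}} template found in $wikitext.
// Returns ['name' => string, 'params' => array], or null if none is found.
// Nested templates are kept as raw text so they can be parsed again.
function parseTemplate(string $wikitext): ?array
{
    $start = strpos($wikitext, '{{');
    if ($start === false) {
        return null;
    }

    $depth = 0;
    $field = '';
    $fields = [];
    $len = strlen($wikitext);

    for ($i = $start; $i < $len; $i++) {
        $pair = substr($wikitext, $i, 2);
        if ($pair === '{{') {
            $depth++;
            if ($depth > 1) {
                $field .= $pair;      // part of a nested template value
            }
            $i++;                     // skip the second brace
        } elseif ($pair === '}}') {
            $depth--;
            if ($depth === 0) {
                $fields[] = $field;   // outer template is complete
                break;
            }
            $field .= $pair;
            $i++;
        } elseif ($wikitext[$i] === '|' && $depth === 1) {
            $fields[] = $field;       // top-level delimiter: start a new field
            $field = '';
        } else {
            $field .= $wikitext[$i];
        }
    }

    if ($depth !== 0) {
        return null;                  // unbalanced braces
    }

    $name = trim(array_shift($fields));
    $params = [];
    $position = 1;
    foreach ($fields as $raw) {
        $parts = explode('=', $raw, 2);
        // Named parameter, unless the "=" belongs to a nested template.
        if (count($parts) === 2 && strpos($parts[0], '{{') === false) {
            $params[trim($parts[0])] = trim($parts[1]);
        } else {
            $params[$position++] = trim($raw);
        }
    }

    return ['name' => $name, 'params' => $params];
}

$card = parseTemplate("{{Infobox card (2)\n|card name=[Martial] Ares\n|pwr req 1=28\n}}\n__NOWYSIWYG__\n");
echo $card['name'], "\n";                // Infobox card (2)
echo $card['params']['card name'], "\n"; // [Martial] Ares
```

A value that is itself a template, as in the nested Infobox example, comes back as raw text, so calling `parseTemplate` on that value again gives you the recursion.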