Getting RDF element's attribute value from XML file using PHP

Question

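I am trying to get the value of the 'rdf:resource' attribute from the 'rdf:li' elements in this XML: http://www.ecb.europa.eu/rss/fxref-usd.html

What is the correct way to achieve this? How do I parse these RDF elements properly?

This is what I have so far: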

<!DOCTYPE html>
<html>

    <head>
        <meta charset="UTF-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <title>RDF</title>
    </head>

    <body>

    <ul> 
 <?php      

            $rdf = file_get_contents('http://www.ecb.europa.eu/rss/fxref-usd.html');


            $rdf = str_replace('rdf:', 'rdf_', $rdf);


            $xml = simplexml_load_string($rdf);


            foreach ($xml->channel->items->rdf_Seq->rdf_li as $item) {
                $attributes = $item->attributes();              

                if(isset($attributes['rdf_resource'])) {
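                    // Cast to string and escape before printing; quote the href and close the <li> properly
                    $url = htmlspecialchars((string) $attributes['rdf_resource'], ENT_QUOTES);
                    echo '<li><a href="'.$url.'" target="_blank">'.$url.'</a></li>';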
                }
            }
?>
    </ul>

    </body>

</html>

As you can see, this is kind of a hack, and I don't think it's the right way to do it.

Any help is appreciated!

Answer 1

I am trying to get attribute 'rdf:resource' value from 'rdf:li' element from this XML: http://www.ecb.europa.eu/rss/fxref-usd.html

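First, this isn't actually legal RDF, at least according to Jena's parser. After removing the xsd schema location, which apparently isn't allowed on the rdf:RDF element, I still get the error message: Expecting XML start or end element(s). String data "U2" not allowed. Maybe there should be an rdf:parseType='Literal' for embedding mixed XML content in RDF. Maybe a striping error.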

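But even if it were legal RDF/XML, there are two problems with your approach that will end up making it a bit fragile. The first is that it's very difficult to process RDF/XML reliably with XML tools, as described in an answer that I wrote to "How to access OWL documents using XPath in Java?". In general, the same RDF graph can be serialized as a bunch of different RDF/XML documents. This is especially important for your use of rdf:li: the RDF graph doesn't actually have any resources with rdf:li properties, even though there are rdf:li elements in the XML document. Have a look at: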

2.15 Container Membership Property Elements: rdf:li and rdf:_n

RDF has a set of container membership properties and corresponding property elements that are mostly used with instances of the rdf:Seq, rdf:Bag and rdf:Alt classes which may be written as typed node elements. The list properties are rdf:_1, rdf:_2 etc. and can be written as property elements or property attributes as shown in Example 17. There is an rdf:li special property element that is equivalent to rdf:_1, rdf:_2 in order, explained in detail in section 7.4. The mapping to the container membership properties is always done in the order that the rdf:li special property elements appear in XML — the document order is significant. The equivalent RDF/XML to Example 17 written in this form is shown in Example 18.

That means that an RDF/XML snippet (not quite legal, but it gives the general idea) like:

<ex:Collection>
  <rdf:li rdf:about="member1"/>
  <rdf:li rdf:about="member2"/>
</ex:Collection>

could also be written as:

<ex:Collection>
  <rdf:_2 rdf:about="member2"/>
  <rdf:_1 rdf:about="member1"/>
</ex:Collection>

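That means that any purely XML-based approach here is likely to be fragile, because it depends on a structure that isn't guaranteed to always be represented the same way.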
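To make the rdf:li to rdf:_n mapping concrete, here is a minimal PHP sketch of what an RDF/XML parser does with container members. The helper name `containerMembers` and the `http://example.org/` namespace are made up for illustration, and the snippets use rdf:resource (rather than rdf:about) so that there is a value to read. It is only a sketch of the numbering rule, not a full RDF/XML parser.

```php
<?php
const RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#';

// Hypothetical helper: returns [n => resource] for a container element.
// rdf:li children are numbered in document order; explicit rdf:_n keep their n.
function containerMembers(SimpleXMLElement $container): array
{
    $members = [];
    $next = 1; // counter used only for rdf:li elements
    foreach ($container->children(RDF_NS) as $child) {
        $name = $child->getName(); // local name: "li", "_1", "_2", ...
        if ($name === 'li') {
            $n = $next++;
        } elseif (preg_match('/^_([1-9][0-9]*)$/', $name, $m)) {
            $n = (int) $m[1];
        } else {
            continue; // not a container membership element
        }
        $members[$n] = (string) $child->attributes(RDF_NS)->resource;
    }
    ksort($members); // order by n, not by position in the document
    return $members;
}

$withLi = <<<'XML'
<ex:Collection xmlns:ex="http://example.org/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:li rdf:resource="member1"/>
  <rdf:li rdf:resource="member2"/>
</ex:Collection>
XML;

$withN = <<<'XML'
<ex:Collection xmlns:ex="http://example.org/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:_2 rdf:resource="member2"/>
  <rdf:_1 rdf:resource="member1"/>
</ex:Collection>
XML;

$a = containerMembers(simplexml_load_string($withLi));
$b = containerMembers(simplexml_load_string($withN));
var_dump($a === $b); // bool(true): both documents describe the same members
```

Code that only looks for elements named rdf:li would see two members in the first document and none in the second, which is exactly the fragility described above.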

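The usual answer is to query with an RDF query language, so that you work at the RDF level. The standard RDF query language is SPARQL. Unfortunately, since there are actually infinitely many of these properties (rdf:_1, rdf:_2, ...), it's a bit hard to do this efficiently in SPARQL too, because you end up needing to match properties that look like rdf:_xxx and then figure out what comes after the underscore.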

OK, so if you can get the RDF/XML into legal form, you might end up with something like this:

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns="http://purl.org/rss/1.0/" xmlns:cb="http://www.cbwiki.net/wiki/index.php/Specification_1.1" xmlns:dc = "http://purl.org/dc/elements/1.1/" xmlns:dcterms = "http://purl.org/dc/terms/" xmlns:rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance">
<channel  rdf:about = "http://www.ecb.europa.eu/rss/usd.html">
<title>ECB | US dollar (USD) - Euro foreign exchange reference rates</title>  
<link>http://www.ecb.europa.eu/home/html/rss.en.html</link>
<description>The reference rates are based on the regular daily concertation procedure between central banks within and outside the European System of Central Banks, which normally takes place at 2.15 p.m. (14:15) ECB time.</description>
<items>
<rdf:Seq>
<rdf:li rdf:resource="http://www.ecb.europa.eu/stats/exchange/eurofxref/html/eurofxref-graph-usd.en.html?date=2015-10-09&amp;rate=1.1362" />
<rdf:li rdf:resource="http://www.ecb.europa.eu/stats/exchange/eurofxref/html/eurofxref-graph-usd.en.html?date=2015-10-08&amp;rate=1.1254" />
<rdf:li rdf:resource="http://www.ecb.europa.eu/stats/exchange/eurofxref/html/eurofxref-graph-usd.en.html?date=2015-10-07&amp;rate=1.1266" />
<rdf:li rdf:resource="http://www.ecb.europa.eu/stats/exchange/eurofxref/html/eurofxref-graph-usd.en.html?date=2015-10-06&amp;rate=1.1224" />
<rdf:li rdf:resource="http://www.ecb.europa.eu/stats/exchange/eurofxref/html/eurofxref-graph-usd.en.html?date=2015-10-05&amp;rate=1.1236" />
</rdf:Seq>
</items>
</channel>
</rdf:RDF>
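If you do decide to stay at the XML level for this particular feed, SimpleXML can at least address the elements by namespace URI, which avoids the str_replace hack from the question. The following is a sketch under that assumption: it embeds a trimmed two-entry copy of the document above instead of fetching the live feed, and it still depends on the items being serialized as rdf:li inside rdf:Seq, so the fragility caveat applies.

```php
<?php
const RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#';
const RSS_NS = 'http://purl.org/rss/1.0/';

// Trimmed copy of the document shown above (two entries only).
$doc = <<<'XML'
<rdf:RDF xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<channel rdf:about="http://www.ecb.europa.eu/rss/usd.html">
<items>
<rdf:Seq>
<rdf:li rdf:resource="http://www.ecb.europa.eu/stats/exchange/eurofxref/html/eurofxref-graph-usd.en.html?date=2015-10-09&amp;rate=1.1362" />
<rdf:li rdf:resource="http://www.ecb.europa.eu/stats/exchange/eurofxref/html/eurofxref-graph-usd.en.html?date=2015-10-08&amp;rate=1.1254" />
</rdf:Seq>
</items>
</channel>
</rdf:RDF>
XML;

$rdf = simplexml_load_string($doc);

// Walk down the tree, naming the namespace URI at every step.
$seq = $rdf->children(RSS_NS)->channel
           ->children(RSS_NS)->items
           ->children(RDF_NS)->Seq;

$resources = [];
foreach ($seq->children(RDF_NS)->li as $li) {
    // attributes() takes the namespace URI too; "resource" is the local name
    $resources[] = (string) $li->attributes(RDF_NS)->resource;
}

foreach ($resources as $url) {
    $safe = htmlspecialchars($url, ENT_QUOTES); // re-escape "&" for HTML output
    echo '<li><a href="' . $safe . '" target="_blank">' . $safe . "</a></li>\n";
}
```

Because the lookup uses namespace URIs rather than the "rdf" prefix, it keeps working if the feed binds the RDF namespace to a different prefix.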

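Now, remember that those rdf:li XML elements don't mean that there are rdf:li properties in the graph, but rather that there are a bunch of rdf:_n properties. In the Turtle serialization (which is similar to SPARQL syntax), the data is: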

@prefix :      <http://purl.org/rss/1.0/> .
@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix cb:    <http://www.cbwiki.net/wiki/index.php/Specification_1.1> .
@prefix dc:    <http://purl.org/dc/elements/1.1/> .
@prefix xsi:   <http://www.w3.org/2001/XMLSchema-instance> .

<http://www.ecb.europa.eu/rss/usd.html>
        a             :channel ;
        :description  "The reference rates are based on the regular daily concertation procedure between central banks within and outside the European System of Central Banks, which normally takes place at 2.15 p.m. (14:15) ECB time." ;
        :items        [ a       rdf:Seq ;
                        rdf:_1  <http://www.ecb.europa.eu/stats/exchange/eurofxref/html/eurofxref-graph-usd.en.html?date=2015-10-09&rate=1.1362> ;
                        rdf:_2  <http://www.ecb.europa.eu/stats/exchange/eurofxref/html/eurofxref-graph-usd.en.html?date=2015-10-08&rate=1.1254> ;
                        rdf:_3  <http://www.ecb.europa.eu/stats/exchange/eurofxref/html/eurofxref-graph-usd.en.html?date=2015-10-07&rate=1.1266> ;
                        rdf:_4  <http://www.ecb.europa.eu/stats/exchange/eurofxref/html/eurofxref-graph-usd.en.html?date=2015-10-06&rate=1.1224> ;
                        rdf:_5  <http://www.ecb.europa.eu/stats/exchange/eurofxref/html/eurofxref-graph-usd.en.html?date=2015-10-05&rate=1.1236>
                      ] ;
        :link         "http://www.ecb.europa.eu/home/html/rss.en.html" ;
        :title        "ECB | US dollar (USD) - Euro foreign exchange reference rates" .

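What I'd do at this point is look up the :items property of your channel, check that its value is an rdf:Seq, and then either get all of its properties except rdf:type and assume that they are the rdf:_n values, or actually select only the values of the rdf:_xxx properties. The first approach looks like: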

prefix :      <http://purl.org/rss/1.0/>
prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

select ?item {
  <http://www.ecb.europa.eu/rss/usd.html> :items ?x .
  ?x a rdf:Seq .
  ?x ?p ?item .
  filter (?p != rdf:type)
}

--------------------------------------------------------------------------------------------------------------------
| item                                                                                                             |
====================================================================================================================
| <http://www.ecb.europa.eu/stats/exchange/eurofxref/html/eurofxref-graph-usd.en.html?date=2015-10-05&rate=1.1236> |
| <http://www.ecb.europa.eu/stats/exchange/eurofxref/html/eurofxref-graph-usd.en.html?date=2015-10-06&rate=1.1224> |
| <http://www.ecb.europa.eu/stats/exchange/eurofxref/html/eurofxref-graph-usd.en.html?date=2015-10-07&rate=1.1266> |
| <http://www.ecb.europa.eu/stats/exchange/eurofxref/html/eurofxref-graph-usd.en.html?date=2015-10-08&rate=1.1254> |
| <http://www.ecb.europa.eu/stats/exchange/eurofxref/html/eurofxref-graph-usd.en.html?date=2015-10-09&rate=1.1362> |
--------------------------------------------------------------------------------------------------------------------

Or, the latter approach (actually checking for rdf:_):

prefix :      <http://purl.org/rss/1.0/>
prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix xsd:   <http://www.w3.org/2001/XMLSchema#>

select ?n ?item {
  <http://www.ecb.europa.eu/rss/usd.html> :items ?x .
  ?x a rdf:Seq .
  ?x ?p ?item .

  # check that ?p starts with rdf:_
  filter strstarts(str(?p),str(rdf:_))

  # and extract the part after rdf:_ and convert
  # it to an integer
  bind (xsd:integer(strafter(str(?p),str(rdf:_))) as ?n)
}

------------------------------------------------------------------------------------------------------------------------
| n | item                                                                                                             |
========================================================================================================================
| 5 | <http://www.ecb.europa.eu/stats/exchange/eurofxref/html/eurofxref-graph-usd.en.html?date=2015-10-05&rate=1.1236> |
| 4 | <http://www.ecb.europa.eu/stats/exchange/eurofxref/html/eurofxref-graph-usd.en.html?date=2015-10-06&rate=1.1224> |
| 3 | <http://www.ecb.europa.eu/stats/exchange/eurofxref/html/eurofxref-graph-usd.en.html?date=2015-10-07&rate=1.1266> |
| 2 | <http://www.ecb.europa.eu/stats/exchange/eurofxref/html/eurofxref-graph-usd.en.html?date=2015-10-08&rate=1.1254> |
| 1 | <http://www.ecb.europa.eu/stats/exchange/eurofxref/html/eurofxref-graph-usd.en.html?date=2015-10-09&rate=1.1362> |
------------------------------------------------------------------------------------------------------------------------
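For comparison, the strstarts/strafter logic of that second query can be mirrored in plain PHP once the predicate/object pairs of the rdf:Seq node are available from whatever RDF parser you end up using. This is only a sketch: the helper `membershipIndex` and the hand-written `$pairs` array are illustrative stand-ins for real parser output, and no SPARQL engine is involved.

```php
<?php
const RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#';

// Hypothetical helper: returns n for an rdf:_n predicate IRI, null otherwise.
function membershipIndex(string $predicate): ?int
{
    $prefix = RDF_NS . '_';
    // Equivalent of strstarts(str(?p), str(rdf:_))
    if (strncmp($predicate, $prefix, strlen($prefix)) !== 0) {
        return null;
    }
    // Equivalent of strafter(str(?p), str(rdf:_)), then the integer cast
    $suffix = (string) substr($predicate, strlen($prefix));
    return preg_match('/^[1-9][0-9]*$/', $suffix) ? (int) $suffix : null;
}

// Stand-in for the (predicate, object) pairs of the rdf:Seq node, in arbitrary order.
$base = 'http://www.ecb.europa.eu/stats/exchange/eurofxref/html/eurofxref-graph-usd.en.html';
$pairs = [
    [RDF_NS . 'type', RDF_NS . 'Seq'],
    [RDF_NS . '_2', $base . '?date=2015-10-08&rate=1.1254'],
    [RDF_NS . '_1', $base . '?date=2015-10-09&rate=1.1362'],
];

$items = [];
foreach ($pairs as [$p, $o]) {
    $n = membershipIndex($p);
    if ($n !== null) {
        $items[$n] = $o; // rdf:type and any other property are skipped
    }
}
ksort($items); // restore sequence order

foreach ($items as $n => $item) {
    echo $n, ' ', $item, "\n";
}
```

Sorting by the extracted number is what recovers the sequence order, since neither the query results nor a parser's triple order are guaranteed to follow it.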

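Now you just need a SPARQL library for PHP. I'm not really a PHP user, so I can't recommend one, but I know that there are other questions on Stack Overflow about PHP and SPARQL, and that some libraries exist.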
