How to extract an XHTML meetgrid section using an XSL transformation?
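I want to extract data from a website, transform it (using XSL) and get the output as XML.
Why doesn't my XSL transform the XML into the desired output?
The XML I am using to test the transformation is as follows: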
<?xml version= "1.0"?>
<?xml-stylesheet type="text/xsl" href="diverecorder.xsl"?>
<head>
<body>
<div id="container">
<div id="content">
<br/><h3>2015</h3>
<table class="meetgrid" summary="List of Meets">
<tr><td>Mar 08</td><td>&nbsp;<a href="selectevent.php?mref=486">Manifestazione Regionale Cat. C4 – C2 –C1 - R</a></td></tr>
<tr><td>Mar 07</td><td>&nbsp;<a href="selectevent.php?mref=484">Diving SA State Age Open &amp; Synchro 2015</a></td></tr>
</table>
<br /><h3>2014</h3>
<table class="meetgrid" summary="List of Meets">
<tr><td>Dec 13</td><td>&nbsp;<a href="selectevent.php?mref=461">Sheffield Santa Skills 2014</a></td></tr>
<tr><td>Dec 11</td><td>&nbsp;<a href="selectevent.php?mref=460">2014/15 Australian Open Championships</a></td></tr>
</table>
</html>
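This comes straight from the website, except that I edited the first three lines to link to the diverecorder.xsl file in order to test the XSL transformation. The rest of the page repeats the same pattern as the example; the main thing that changes is the number after "mref=".
Below is the XSL code with which I tried to extract the meetgrid and h3 sections from the website and transform them.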
<?xml version="1.0" encoding="UTF-8"?
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:msxsl="urn:schemas-microsoft-com:xslt"
xmlns:soap="http://soap/envelope/">
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<body>
<xsl:for-each select="body/div">
<event>
<xsl:for-each select="div">
<xsl:if test="h3">
<yearRange>
<xsl:value-of select="translate(normalize-space(.), ' ', ',')" />
</yearRange>
</xsl:if>
<xsl:if test="@class='meetgrid'">
<eventmonthDay>
<xsl:value-of select="tr/td" />
</eventmonthDay>
<eventUrl>
<xsl:value-of select="substring-before(a/@href, '/event/')" />/download/<xsl:value-of select="substring-after(a/@href, '/event/')" />multi/
</eventUrl>
<eventTitle>
<xsl:value-of select="/a" />
</eventTitle>
</xsl:if>
</xsl:for-each>
</event>
</xsl:for-each>
</body>
</xsl:template>
</xsl:stylesheet>
The current output is "Select Meet".
Expected/desired output, which I am currently not getting:
<head>
<body>
<year>
2015
</year>
<eventmonthday>Mar 08</eventmonthday><event>Manifestazione Regionale Cat. C4 – C2 –C1 - R</event>
<eventmonthday>Mar 07</eventmonthday><event>Diving SA State Age Open & Synchro 2015</event>
...
<year>
2014
</year>
<eventmonthday>Dec 13</eventmonthday><event>Sheffield Santa Skills 2014</event>
<eventmonthday>Dec 11</eventmonthday><event>2014/15 Australian Open Championships</event>
...
</body>
</head>
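Notes:
The full source of the content I want to extract is view-source:http://www.diverecorder.co.uk/meetexplorer/selectmeet.php
To check whether the transformation works, I create diverecorder.xml and diverecorder.xsl and open the XML file in Internet Explorer.
Similar questions that I have looked at, but that did not help me solve this, include: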
How to extract a div section from one xhtml document into another xhtml document
Extracting data from website with XSLT
How to replace a text in XML file using XSLT
Hopefully the question is more clear now. I added the namespace,
changed the template match to "/" and changed example input and
required output.
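Unfortunately, your input is still not well-formed XML, because (1) it is missing the closing tags of the body and the two div elements, and (2) it contains an undeclared entity, &nbsp;.
To move this along:
Given a well-formed input such as: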
XML
<!DOCTYPE html [
<!ENTITY nbsp "&#160;">
]>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head/>
<body>
<div id="container">
<div id="content">
<br/>
<h3>2015</h3>
<table class="meetgrid" summary="List of Meets">
<tr>
<td>Mar 08</td>
<td>&nbsp;<a href="selectevent.php?mref=486">Manifestazione Regionale Cat. C4 – C2 –C1 - R</a></td>
</tr>
<tr>
<td>Mar 07</td>
<td>&nbsp;<a href="selectevent.php?mref=484">Diving SA State Age Open &amp; Synchro 2015</a></td>
</tr>
</table>
<br/>
<h3>2014</h3>
<table class="meetgrid" summary="List of Meets">
<tr>
<td>Dec 13</td>
<td>&nbsp;<a href="selectevent.php?mref=461">Sheffield Santa Skills 2014</a></td>
</tr>
<tr>
<td>Dec 11</td>
<td>&nbsp;<a href="selectevent.php?mref=460">2014/15 Australian Open Championships</a></td>
</tr>
</table>
</div>
</div>
</body>
</html>
the following stylesheet:
XSLT 1.0
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:x="http://www.w3.org/1999/xhtml"
exclude-result-prefixes="x">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="/">
<head>
<body>
<xsl:for-each select="x:html/x:body/x:div/x:div/x:table">
<year>
<xsl:value-of select="preceding-sibling::x:h3[1]"/>
</year>
<xsl:for-each select="x:tr">
<eventmonthday>
<xsl:value-of select="x:td[1]"/>
</eventmonthday>
<event>
<xsl:value-of select="x:td[2]/x:a"/>
</event>
</xsl:for-each>
</xsl:for-each>
</body>
</head>
</xsl:template>
</xsl:stylesheet>
would produce this result:
<?xml version="1.0" encoding="UTF-8"?>
<head>
<body>
<year>2015</year>
<eventmonthday>Mar 08</eventmonthday>
<event>Manifestazione Regionale Cat. C4 – C2 –C1 - R</event>
<eventmonthday>Mar 07</eventmonthday>
<event>Diving SA State Age Open &amp; Synchro 2015</event>
<year>2014</year>
<eventmonthday>Dec 13</eventmonthday>
<event>Sheffield Santa Skills 2014</event>
<eventmonthday>Dec 11</eventmonthday>
<event>2014/15 Australian Open Championships</event>
</body>
</head>
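The stylesheet above addresses every element through the x: prefix because the well-formed input places its elements in the XHTML namespace; in XSLT 1.0 an unprefixed path such as body/div selects elements in no namespace and would match nothing in that input. As a minimal sketch of the same extraction arranged as template rules instead of nested xsl:for-each (an alternative layout that is not part of the answer above; the @class='meetgrid' filter and the // search are assumptions added here so that the table's exact position under the div elements does not matter):

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:x="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="x">
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

  <!-- Root rule: build the wrapper elements, then visit each meetgrid table -->
  <xsl:template match="/">
    <head>
      <body>
        <xsl:apply-templates select="//x:table[@class='meetgrid']"/>
      </body>
    </head>
  </xsl:template>

  <!-- One table: emit the year from the nearest preceding h3 sibling, then its rows -->
  <xsl:template match="x:table">
    <year>
      <xsl:value-of select="preceding-sibling::x:h3[1]"/>
    </year>
    <xsl:apply-templates select="x:tr"/>
  </xsl:template>

  <!-- One row: first cell is the date, the link text in the second cell is the title -->
  <xsl:template match="x:tr">
    <eventmonthday>
      <xsl:value-of select="x:td[1]"/>
    </eventmonthday>
    <event>
      <xsl:value-of select="x:td[2]/x:a"/>
    </event>
  </xsl:template>
</xsl:stylesheet>
```

For the sample input this should yield the same elements in the same order as the result shown above; the exact indentation depends on the processor.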
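Note:
Some processors (e.g. Saxon) are able to process documents that contain HTML entities without declaring them explicitly, by pointing to a specific DTD instead, for example: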
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
They will decode the entities by consulting the actual DTD document found at the URL given in the DOCTYPE declaration. In my testing, this was very slow.