Using Schematron QuickFixes to tag individual words in mixed content elements
I have an XML file that looks like this (simplified):
<defs>
  <def>Pure text</def>
  <def>Mixed content, cuz there is also another: <element>element inside</element> and more.</def>
  <def><element>Text nodes within elements other than def are ok.</element></def>
</defs>
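I am trying to write a Schematron rule with a quick fix that would let me take every word in a def with mixed content and wrap each one in a <w> element, and wrap each punctuation character in a <pc> element. In other words, after applying the quick fix I would get: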
<defs>
  <def>Pure text.</def>
  <def><w>Mixed</w> <w>content</w><pc>,</pc> <w>cuz</w> <w>there</w> <w>is</w> <w>also</w> <w>another</w><pc>:</pc> <element>element inside</element> <w>and</w> <w>more</w><pc>.</pc></def>
  <def><element>Text nodes within elements other than def are ok.</element></def>
</defs>
Whitespace between the <w>s and <pc>s is fine.
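Now, identifying mixed content is easy, and I think I got that part right. The problem is that I don't know how to tokenize strings within Schematron and then apply a fix to each token. This is how far I've got: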
<sch:pattern id="mixed">
  <sch:rule context="def[child::text()][child::*]">
    <sch:report test="tokenize(child::text(), '\s+')" sqf:fix="mix_in_def">
      Element has mixed content
      <!-- the test above gives me the error: a sequence of more than one item is not allowed as the first argument of tokenize -->
    </sch:report>
    <sqf:fix id="mix_in_def">
      <sqf:description>
        <sqf:title>Wrap words in w</sqf:title>
        <sqf:p>Fixes the mixed content in def by treating each non-tagged string as w.</sqf:p>
      </sqf:description>
      <sqf:replace match="." node-type="element" target="w">
        <!-- how do I represent the content of the matched token? -->
      </sqf:replace>
      <!-- also, do I create an altogether separate rule for punctuation? -->
    </sqf:fix>
  </sch:rule>
</sch:pattern>
Any tips would be greatly appreciated.
丁奇
You can use XSL inside the fix. See this example (explanations are in the code comments):
<!-- Assumes the schema declares the xsl namespace and uses an XSLT 2.0 query binding, since tokenize() is an XPath 2.0 function -->
<sch:pattern id="mixed">
  <!-- The context is now def, which makes it easier to add further reports on def -->
  <sch:rule context="def">
    <!-- Report every def that has both text node children and element children -->
    <sch:report test="child::text() and child::*" sqf:fix="mix_in_def">
      Element has mixed content
      <!-- Your earlier test raised an error because it passed a sequence of text nodes to tokenize(), which expects a single string -->
    </sch:report>
    <sqf:fix id="mix_in_def">
      <sqf:description>
        <sqf:title>Wrap words in w</sqf:title>
        <sqf:p>Fixes the mixed content in def by treating each non-tagged string as w.</sqf:p>
      </sqf:description>
      <!-- Replace every text node child of this def (the body runs once per matched node) -->
      <sqf:replace match="child::text()">
        <!-- Tokenize this text node on whitespace and handle each token in turn -->
        <xsl:for-each select="tokenize(., '\s+')">
          <xsl:choose>
            <!-- If the token is one of , . : wrap it in <pc>. Note that tokens are split on \s+, so a comma is only a token of its own when it is surrounded by whitespace -->
            <xsl:when test=". = (',', '.', ':', 'is')"> <!-- "is" just to test results -->
              <pc><xsl:value-of select="."/></pc>
            </xsl:when>
            <!-- Otherwise wrap it in <w> -->
            <xsl:otherwise>
              <w><xsl:value-of select="."/></w>
            </xsl:otherwise>
          </xsl:choose>
        </xsl:for-each>
      </sqf:replace>
    </sqf:fix>
  </sch:rule>
</sch:pattern>
You will have to adapt this to your specific problem, but I think it will help you.
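One possible adaptation, since the sample data has punctuation attached directly to words ("content," and "another:"): instead of splitting on whitespace, the text node can be scanned with xsl:analyze-string so that runs of word characters and single punctuation characters are matched separately, while whitespace is copied through unchanged. The following is only a sketch under some assumptions: the processor supports XSLT 2.0 inside sqf:replace (as the example above already requires), the stricter report test that ignores whitespace-only text nodes is my own addition, and the regex is a starting point rather than a complete tokenizer (for instance, "don't" would be split at the apostrophe).

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron"
            xmlns:sqf="http://www.schematron-quickfix.com/validator/process"
            xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
            queryBinding="xslt2">
  <sch:pattern id="mixed">
    <sch:rule context="def">
      <!-- Assumption: only non-whitespace text next to child elements counts as mixed content -->
      <sch:report test="text()[normalize-space()] and *" sqf:fix="mix_in_def">
        Element has mixed content
      </sch:report>
      <sqf:fix id="mix_in_def">
        <sqf:description>
          <sqf:title>Wrap words in w and punctuation in pc</sqf:title>
        </sqf:description>
        <!-- Runs once for every text node child of the reported def -->
        <sqf:replace match="text()">
          <!-- Match either a run of word characters or one character that is neither word nor whitespace -->
          <xsl:analyze-string select="." regex="\w+|[^\w\s]">
            <xsl:matching-substring>
              <xsl:choose>
                <!-- A match starting with a word character is a word -->
                <xsl:when test="matches(., '^\w')">
                  <w><xsl:value-of select="."/></w>
                </xsl:when>
                <!-- Any other match is treated as punctuation -->
                <xsl:otherwise>
                  <pc><xsl:value-of select="."/></pc>
                </xsl:otherwise>
              </xsl:choose>
            </xsl:matching-substring>
            <!-- Whitespace between tokens is kept as it was -->
            <xsl:non-matching-substring>
              <xsl:value-of select="."/>
            </xsl:non-matching-substring>
          </xsl:analyze-string>
        </sqf:replace>
      </sqf:fix>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

Because the whitespace is passed through rather than used as a separator, the spacing of the original text should survive the fix, and the child element inside def is left alone since only text node children of def are matched. If you would rather classify punctuation with a Unicode category such as \p{P}, remember that the regex attribute is an attribute value template, so the curly braces have to be doubled there.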