XSLT: Efficiently identify repeated nodes within a given range

Question

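I am working on some manuscript transcriptions in XML-TEI, and I am using XSLT to transform them into a .tex document. My input document consists of tei:w elements, one for each word of the text. MWE: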

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml"
    schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
   <teiHeader>
      ...
   </teiHeader>
   <text>
      <body>
         <p><w>Lorem</w>
            <w>ipsum</w>
            <w>dolor</w>
            <w>sit</w>
            <w>amet</w>
            <pc>,</pc>
            <w>consectetur</w>
            <w>adipiscing</w>
            <w>elit</w>
            <pc>,</pc>
            <w>sed</w>
            <w>do</w>
            <w>eiusmod</w>
            <w>tempor</w>
            <w>incididunt</w>
            <w>ut</w>
            <w>labore</w>
            <w>et</w>
            <w>dolore</w>
            <w>magna</w>
            <w>aliqua</w>
            <pc>;</pc>
            <w>ut</w>
            <w>enim</w>
            <w>ad</w>
            <w>minim</w>
            <w>veniam</w>
         </p>
      </body>
   </text>
</TEI>

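I need to identify words that are repeated within a certain range, say 10 words, so that LaTeX can disambiguate them in the edition (using the \sameword command from the reledmac package). For example, in the MWE above, I would like both occurrences of ut to be marked with this command.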

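I think I have found a way to do it; my question is more about how to improve my code. For small documents the template below seems to work fine, but my corpus consists of 300,000 tokens and the transformation takes far too long: the engine evaluates the left and right context of every word...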

 <xsl:template match="tei:w">
        <xsl:variable name="current_position" select="count(preceding::tei:w)"/>
        <xsl:variable name="same_word_before"
            select="preceding::tei:w[($current_position - 10) &lt; count(preceding::tei:w)][not(count(preceding::tei:w) > $current_position)]/text() = text()"/>
        <xsl:variable name="same_word_after"
            select="following::tei:w[($current_position + 10) > count(preceding::tei:w)][count(preceding::tei:w) > $current_position]/text() = text()"/>
        ...
        <xsl:choose>
            <xsl:when test="$same_word_before or $same_word_after">
                <xsl:text>\sameword{</xsl:text>
                <xsl:apply-templates/>
                <xsl:text>}</xsl:text>
            </xsl:when>
            <xsl:otherwise>
                <xsl:apply-templates/>
            </xsl:otherwise>
        </xsl:choose>
        ...
    </xsl:template>

Is there a simpler and/or more efficient way to do this? One solution I am considering is to use Python, but I would rather stick with XSL for this task.

Edit: I am using XSLT 2.0.

Answer 1

This is much the same as what you did, but it is quite fast:

  <xsl:template match="tei:w">
    <xsl:variable name="preceding"  as="xs:string*" select="preceding-sibling::tei:w[position() lt 11]/text()" />
    <xsl:variable name="following"  as="xs:string*" select="following-sibling::tei:w[position() lt 11]/text()" />
    <xsl:choose>
      <xsl:when test="text()=($preceding,$following)">
        <xsl:text>\sameword{</xsl:text>
        <xsl:apply-templates/>
        <xsl:text>}</xsl:text>
      </xsl:when>
      <xsl:otherwise>
        <xsl:apply-templates/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

I tested it with 2,000 p elements of 50 words each, and it took 0.3 seconds.
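For reference, a self-contained stylesheet built around the template above might look like the sketch below. The namespace declarations, the text output method, and the empty template for tei:teiHeader are assumptions, not part of the original answer. It also swaps the sibling axes for preceding:: and following::, so that the 10-word window can cross paragraph boundaries as in the question's code; this variant has not been timed, and how well the positional predicate is optimized depends on the processor.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:tei="http://www.tei-c.org/ns/1.0">

  <xsl:output method="text" encoding="UTF-8"/>

  <!-- Assumption: the header is not typeset, so it is skipped -->
  <xsl:template match="tei:teiHeader"/>

  <xsl:template match="tei:w">
    <!-- preceding:: is a reverse axis, so position() counts backwards from
         the current word: this selects the 10 nearest preceding words -->
    <xsl:variable name="before" as="xs:string*"
        select="preceding::tei:w[position() le 10]/text()"/>
    <!-- The 10 nearest following words, in document order -->
    <xsl:variable name="after" as="xs:string*"
        select="following::tei:w[position() le 10]/text()"/>
    <xsl:choose>
      <!-- True if the current word equals any word in either window -->
      <xsl:when test="text() = ($before, $after)">
        <xsl:text>\sameword{</xsl:text>
        <xsl:apply-templates/>
        <xsl:text>}</xsl:text>
      </xsl:when>
      <xsl:otherwise>
        <xsl:apply-templates/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

With the sibling axes of the original template, a word at the start of a paragraph is never compared with the last words of the previous paragraph; which behaviour is wanted depends on how the edition numbers its lines.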

Since XSLT 2.0 we have built-in data types, which can be used to declare the type of a variable, a parameter, or a function.

For example, <xsl:variable name="preceding" as="xs:string*"/> means that the variable can hold zero or more strings.
Or <xsl:variable name="firstNextSibling" as="element()?"/> means that the variable can hold zero or one element.

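The @test attribute of this xsl:when means that the value of the current text() node must occur in the combined string sequences $preceding and $following. The = operator is a general comparison, so the test is true as soon as any one item on the left equals any one item on the right.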
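The sequence comparison used in the test can be tried in isolation. The following is a minimal sketch; the variable names $window and $empty and the sample words are made up for illustration. It can be run against any XML input, since the template only matches the document node and ignores its content.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- as="xs:string*": a sequence of zero or more strings -->
    <xsl:variable name="window" as="xs:string*" select="('labore', 'et', 'ut')"/>
    <xsl:variable name="empty" as="xs:string*" select="()"/>

    <!-- 'ut' equals one item of the sequence: prints true -->
    <xsl:value-of select="'ut' = $window"/>
    <xsl:text>&#10;</xsl:text>

    <!-- no item equals 'sed': prints false -->
    <xsl:value-of select="'sed' = $window"/>
    <xsl:text>&#10;</xsl:text>

    <!-- comparing with an empty sequence is always false: prints false -->
    <xsl:value-of select="'ut' = $empty"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

The third case is why the template needs no special handling for the first or last word of a paragraph: an empty window simply never matches.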
