XSL creating 'chapters' or 'groups' from similar tagged entries

I have a large XML corpus document whose structure looks roughly like this:

<corpus>
   <document n="001">
       <front>
          <title>foo title</title>
          <group n="foo_group_A"/>
       </front>
       <body>
          <seg n="1">some text with markups</seg>
          <seg n="2">some text with markups</seg>
          <seg n="3">some text with markups</seg>
       </body>
   </document>
   <document n="002">
       <front>
          <title>foo title</title>
          <group n="foo_group_A"/>
       </front>
       <body>
          <seg n="1">some text with markups</seg>
          <seg n="2">some text with markups</seg>
       </body>
   </document>
   <document n="003">
       <front>
          <title>foo title</title>
          <group n="foo_group_A"/>
       </front>
       <body>
          <seg n="1">some text with markups</seg>
          <seg n="2">some text with markups</seg>
          <seg n="3">some text with markups</seg>
       </body>
   </document>
   <document n="004">
       <front>
          <title>foo title</title>
          <group n="foo_group_B"/>
       </front>
       <body>
          <seg n="1">some text with markups</seg>
       </body>
   </document>
   <document n="005">
       <front>
          <title>foo title</title>
          <group n="foo_group_B"/>
       </front>
       <body>
          <seg n="1">some text with markups</seg>
          <seg n="2">some text with markups</seg>
       </body>
   </document>
    [...]
</corpus>

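I'm using XSL 3.0 to preprocess this XML file into another XML format before the final output to PDF. As part of the transformation, I want to collect or 'wrap' the <document> elements in new <chapter> elements that reflect the value of front/group/@n. The new corpus would look like this, where the group/@n value provides the logic for grouping under a new chapter: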

<corpus>
  <chapter n="foo_group_A">
   <document n="001">
       <front>
          <title>foo title</title>
       </front>
       <body>
          <seg n="1">some text with markups</seg>
          <seg n="2">some text with markups</seg>
          <seg n="3">some text with markups</seg>
       </body>
   </document>
   <document n="002">
       <front>
          <title>foo title</title>
       </front>
       <body>
          <seg n="1">some text with markups</seg>
          <seg n="2">some text with markups</seg>
       </body>
   </document>
   <document n="003">
       <front>
          <title>foo title</title>
       </front>
       <body>
          <seg n="1">some text with markups</seg>
          <seg n="2">some text with markups</seg>
          <seg n="3">some text with markups</seg>
       </body>
   </document>
  </chapter>
  <chapter n="foo_group_B">
   <document n="004">
       <front>
          <title>foo title</title>
       </front>
       <body>
          <seg n="1">some text with markups</seg>
       </body>
   </document>
   <document n="005">
       <front>
          <title>foo title</title>
       </front>
       <body>
          <seg n="1">some text with markups</seg>
          <seg n="2">some text with markups</seg>
       </body>
   </document>
  </chapter>
    [...]
</corpus>

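The file is already pre-sorted by foo_group_A, foo_group_B, etc., so no additional sorting is needed. It only needs to create a new <chapter> element to contain the related documents. I tried this with xsl:for-each, but I think I'm missing some kind of 'summary' or 'collection' of the groups to iterate over.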

Many thanks.

If you use XSLT 3 and want to group items, then you would of course not use xsl:for-each but rather xsl:for-each-group, for example:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="3.0">

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="corpus">
      <xsl:copy>
          <xsl:for-each-group select="document" group-by="front/group/@n">
              <chapter n="{current-grouping-key()}">
                  <xsl:apply-templates select="current-group()"/>
              </chapter>
          </xsl:for-each-group>
      </xsl:copy>
  </xsl:template>  

  <xsl:template match="front/group"/>

</xsl:stylesheet>

http://xsltfiddle.liberty-development.net/nbUY4ki

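If the document elements are already sorted by the grouping key front/group/@n, then using xsl:for-each-group select="document" group-adjacent="front/group/@n" instead of the group-by above should also suffice. That approach also makes it easier to use streaming on large documents: add streamable="yes" to the xsl:mode declaration and use xsl:for-each-group select="copy-of(document)" group-adjacent="front/group/@n" for the grouping.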
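
As a minimal sketch of the group-adjacent variant, assuming the input is pre-sorted as described in the question and that every document has exactly one front/group/@n value (group-adjacent needs a single grouping key per item), the stylesheet could look like this. The comments mark the two places that would change for streaming; streaming itself is not enabled here because it depends on a processor that supports it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="3.0">

  <!-- Copy everything that no other template matches.
       For streaming, streamable="yes" would be added to this declaration. -->
  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="corpus">
      <xsl:copy>
          <!-- group-adjacent starts a new group each time the key changes
               between neighbouring document elements, so it relies on the
               input already being ordered by front/group/@n.
               For streaming, select copy-of(document) instead of document. -->
          <xsl:for-each-group select="document" group-adjacent="front/group/@n">
              <chapter n="{current-grouping-key()}">
                  <xsl:apply-templates select="current-group()"/>
              </chapter>
          </xsl:for-each-group>
      </xsl:copy>
  </xsl:template>

  <!-- Drop the group marker; its value is now carried by chapter/@n. -->
  <xsl:template match="front/group"/>

</xsl:stylesheet>
```

One design difference to keep in mind: if the documents were not actually sorted, group-by would still produce one chapter per distinct key, whereas group-adjacent would produce a separate chapter for each consecutive run of the same key.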