Use XSLT to transform XML to text with a maximum width

I am using XSLT (XSLT 2.0 is fine) to transform XML (TEI) into readable plain text, with a few small modifications and challenges: preserving whitespace for poetry and making headings uppercase.

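So far everything works the way I want, but for readability I would also like to limit the length of each line of output to some value (say, 80 characters wide), breaking only at spaces (no splitting of words, etc.). To be clear, I want to set a maximum line length for the output (say, 80 characters), not output only the first, say, 80 characters.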

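Does anyone have suggestions for the best approach? Would it be a template that matches all text() nodes and then uses XSLT's built-in string functions? I have tried to imagine doing this with string functions (string-length, substring, or similar), but have not had any luck yet.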

(I could do this separately with a Python script, very easily, so perhaps "do it afterwards" is the best answer. I am just curious whether I am overlooking a simple solution.)
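
For comparison, the "do it afterwards" route could look roughly like the sketch below. It assumes the XSLT output has already been read into a string and that paragraphs are separated by blank lines; the function name wrap_plain_text is made up for illustration. It relies on Python's standard textwrap module, which breaks only at whitespace when break_long_words and break_on_hyphens are turned off. Note that textwrap discards the original line breaks inside each paragraph, so verse that must keep its layout would need to bypass it.

```python
import textwrap


def wrap_plain_text(text: str, width: int = 80) -> str:
    """Wrap each blank-line-separated paragraph to at most `width` columns.

    Hypothetical helper: breaks happen only at whitespace. A single word
    longer than `width` is left on a line of its own instead of being split.
    """
    paragraphs = text.split("\n\n")
    wrapped = [
        textwrap.fill(
            paragraph,
            width=width,
            break_long_words=False,   # never cut inside a word
            break_on_hyphens=False,   # never break at hyphens either
        )
        for paragraph in paragraphs
    ]
    return "\n\n".join(wrapped)
```

Calling wrap_plain_text(output, 80) on the transformed text would then give lines of at most 80 characters, apart from any single word that is itself longer than that.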

I. This is a solution I wrote more than 10 years ago.

This transformation (from the FXSL library):

<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:f="http://fxsl.sf.net/"
 xmlns:str-split2lines-func="f:str-split2lines-func"
 exclude-result-prefixes="f str-split2lines-func">

   <xsl:import href="str-foldl.xsl"/>
   <xsl:output method="text"/>

   <str-split2lines-func:str-split2lines-func/>

    <xsl:template match="/">
      <xsl:call-template name="str-split-to-lines">
        <xsl:with-param name="pStr" select="/*"/>
        <xsl:with-param name="pLineLength" select="64"/>
        <xsl:with-param name="pDelimiters" select="' &#9;&#10;&#13;'"/>
      </xsl:call-template>
    </xsl:template>

    <xsl:template name="str-split-to-lines">
      <xsl:param name="pStr"/>
      <xsl:param name="pLineLength" select="60"/>
      <xsl:param name="pDelimiters" select="' &#9;&#10;&#13;'"/>

      <xsl:variable name="vsplit2linesFun"
                    select="document('')/*/str-split2lines-func:*[1]"/>

      <xsl:variable name="vrtfParams">
       <delimiters><xsl:value-of select="$pDelimiters"/></delimiters>
       <lineLength><xsl:copy-of select="$pLineLength"/></lineLength>
      </xsl:variable>

      <xsl:variable name="vResult">
          <xsl:call-template name="str-foldl">
            <xsl:with-param name="pFunc" select="$vsplit2linesFun"/>
            <xsl:with-param name="pStr" select="$pStr"/>
            <xsl:with-param name="pA0" select="$vrtfParams"/>
          </xsl:call-template>
      </xsl:variable>

      <xsl:for-each select="$vResult/line">
        <xsl:for-each select="word">
          <xsl:value-of select="concat(., ' ')"/>
        </xsl:for-each>
        <xsl:value-of select="'&#10;'"/>
      </xsl:for-each>
    </xsl:template>

    <xsl:template match="str-split2lines-func:*" mode="f:FXSL">
      <xsl:param name="arg1" select="/.."/>
      <xsl:param name="arg2"/>

      <xsl:copy-of select="$arg1/*[position() &lt; 3]"/>
      <xsl:copy-of select="$arg1/line[position() != last()]"/>

      <xsl:choose>
        <xsl:when test="contains($arg1/*[1], $arg2)">
          <xsl:if test="string($arg1/word)">
             <xsl:call-template name="fillLine">
               <xsl:with-param name="pLine" select="$arg1/line[last()]"/>
               <xsl:with-param name="pWord" select="$arg1/word"/>
               <xsl:with-param name="pLineLength" select="$arg1/*[2]"/>
             </xsl:call-template>
          </xsl:if>
        </xsl:when>
        <xsl:otherwise>
          <xsl:copy-of select="$arg1/line[last()]"/>
          <word><xsl:value-of select="concat($arg1/word, $arg2)"/></word>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:template>

      <!-- Test if the new word fits into the last line -->
    <xsl:template name="fillLine">
      <xsl:param name="pLine" select="/.."/>
      <xsl:param name="pWord" select="/.."/>
      <xsl:param name="pLineLength" />

      <xsl:variable name="vnWordsInLine" select="count($pLine/word)"/>
      <xsl:variable name="vLineLength" select="string-length($pLine) + $vnWordsInLine"/>
      <xsl:choose>
        <xsl:when test="not($vLineLength + string-length($pWord) > $pLineLength)">
          <line>
            <xsl:copy-of select="$pLine/*"/>
            <xsl:copy-of select="$pWord"/>
          </line>
        </xsl:when>
        <xsl:otherwise>
          <xsl:copy-of select="$pLine"/>
          <line>
            <xsl:copy-of select="$pWord"/>
          </line>
          <word/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:template>

</xsl:stylesheet>

when applied to the following XML document:

<text>
Dec. 13 — As always for a presidential inaugural, security and surveillance were
extremely tight in Washington, DC, last January. But as George W. Bush prepared to
take the oath of office, security planners installed an extra layer of protection: a
prototype software system to detect a biological attack. The U.S. Department of
Defense, together with regional health and emergency-planning agencies, distributed
a special patient-query sheet to military clinics, civilian hospitals and even aid
stations along the parade route and at the inaugural balls. Software quickly
analyzed complaints of seven key symptoms — from rashes to sore throats — for
patterns that might indicate the early stages of a bio-attack. There was a brief
scare: the system noticed a surge in flulike symptoms at military clinics.
Thankfully, tests confirmed it was just that — the flu.
</text>

reflows the text into lines of at most 64 characters (any length can be specified as the value of the parameter $pLineLength). The result is:

Dec. 13 — As always for a presidential inaugural, security and 
surveillance were extremely tight in Washington, DC, last 
January. But as George W. Bush prepared to take the oath of 
office, security planners installed an extra layer of 
protection: a prototype software system to detect a biological 
attack. The U.S. Department of Defense, together with regional 
health and emergency-planning agencies, distributed a special 
patient-query sheet to military clinics, civilian hospitals and 
even aid stations along the parade route and at the inaugural 
balls. Software quickly analyzed complaints of seven key 
symptoms — from rashes to sore throats — for patterns that might 
indicate the early stages of a bio-attack. There was a brief 
scare: the system noticed a surge in flulike symptoms at 
military clinics. Thankfully, tests confirmed it was just that — 
the flu. 

The separate stylesheet imported by the transformation above is:

str-foldl.xsl:


<xsl:stylesheet version="2.0" 
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:f="http://fxsl.sf.net/"
 exclude-result-prefixes="f">
    <xsl:template name="str-foldl">
      <xsl:param name="pFunc" select="/.."/>
      <xsl:param name="pA0"/>
      <xsl:param name="pStr"/>

      <xsl:choose>
         <xsl:when test="not(string($pStr))">
            <xsl:copy-of select="$pA0"/>
         </xsl:when>
         <xsl:otherwise>
            <xsl:variable name="vFunResult">
              <xsl:apply-templates select="$pFunc[1]" mode="f:FXSL">
                <xsl:with-param name="arg0" select="$pFunc[position() > 1]"/>
                <xsl:with-param name="arg1" select="$pA0"/>
                <xsl:with-param name="arg2" select="substring($pStr,1,1)"/>
              </xsl:apply-templates>
            </xsl:variable>

            <xsl:call-template name="str-foldl">
              <xsl:with-param name="pFunc" select="$pFunc"/>
              <xsl:with-param name="pStr" select="substring($pStr,2)"/>
              <xsl:with-param name="pA0" select="$vFunResult"/>
            </xsl:call-template>
         </xsl:otherwise>
      </xsl:choose>

    </xsl:template>
</xsl:stylesheet>

Note that this is essentially an XSLT 1.0 solution. A much shorter XSLT 2.0 solution is possible using the regular expression capabilities of XSLT 2.0.
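
The greedy logic of the fold-based transformation may be easier to follow in a compact form. The following Python sketch is not part of FXSL; it only mirrors the idea: fold over the characters of the string, accumulate the current word, and on each delimiter append the word to the last line if it still fits, otherwise start a new line. The names split_to_lines, step, and fill_line are illustrative. As choices of the sketch, runs of consecutive delimiters are treated as one separator, a final word is emitted even when the input does not end with a delimiter, and no trailing space is written after the last word of a line.

```python
from functools import reduce

DELIMITERS = " \t\n\r"


def split_to_lines(text: str, line_length: int = 64,
                   delimiters: str = DELIMITERS) -> str:
    """Greedy word wrap expressed as a left fold over the characters."""

    def fill_line(lines, word):
        # Append `word` to the last line if it fits, else open a new line.
        last = lines[-1] if lines else []
        # Each word already on the line is counted with one following space.
        used = sum(len(w) for w in last) + len(last)
        if last and used + len(word) > line_length:
            return lines + [[word]]
        return lines[:-1] + [last + [word]]

    def step(state, ch):
        lines, word = state
        if ch not in delimiters:
            return lines, word + ch       # still inside a word
        if not word:
            return state                  # repeated delimiter: nothing to flush
        return fill_line(lines, word), ""

    lines, word = reduce(step, text, ([], ""))
    if word:                              # flush a trailing word, if any
        lines = fill_line(lines, word)
    return "".join(" ".join(line) + "\n" for line in lines)
```

The accumulator plays the role of the temporary tree passed as pA0 in the stylesheet: it carries the lines built so far together with the word currently being read.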


II. Using XSLT 2.0 regular expressions

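Here is a function, f:getLine(), that, when passed a string and a maximum line length, returns the first line of that string: the longest leading substring, no longer than the maximum line length, that ends at a word boundary. The transformation below uses this function to produce the first line of the desired multi-line result.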

<xsl:stylesheet version="2.0"  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:f="my:f" xmlns:xs="http://www.w3.org/2001/XMLSchema">
 <xsl:output method="text"/>

  <xsl:template match="/*/text()">
    <xsl:sequence select="f:getLine(., 64)"/>
  </xsl:template>

  <xsl:function name="f:getLine" as="xs:string?">
    <xsl:param name="pText" as="xs:string?"/>
    <xsl:param name="pLength" as="xs:integer"/>

    <xsl:variable name="vChunk" select="substring($pText, 1, $pLength)"/>

    <xsl:choose>
      <xsl:when test="not(string-length($pText) > $pLength) 
                      or matches(substring($pText, $pLength+1, 1), '\W')">
        <xsl:sequence select="$vChunk"/>
      </xsl:when>
      <xsl:otherwise>
            <xsl:analyze-string select="$vChunk" 
                 regex="^((\W*\w*)*?)(\W+\w*)$">
              <xsl:matching-substring>
                <xsl:sequence select="regex-group(1)"/>
              </xsl:matching-substring>
            </xsl:analyze-string>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:function>
</xsl:stylesheet>

When this transformation is applied to the same XML document, the correct first line is produced:

Dec. 13 — As always for a presidential inaugural, security and
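
To make the boundary logic of f:getLine() concrete, here is a rough Python counterpart. It is a sketch under assumptions, not a translation verified against an XSLT processor: Python's \w and \W classes are not defined identically to those of the XML Schema regular expressions used by XSLT 2.0, and the name get_line is illustrative. One intentional difference: when the chunk contains no usable break point, the sketch falls back to returning the hard-cut chunk, so that a caller which loops over the text always makes progress.

```python
import re


def get_line(text: str, max_length: int) -> str:
    """Return the longest prefix of `text`, at most `max_length` characters,
    that ends at a word boundary (illustrative counterpart of f:getLine)."""
    chunk = text[:max_length]

    # The whole text fits, or the character right after the chunk is a
    # non-word character: the chunk already ends at a word boundary.
    if len(text) <= max_length or re.match(r"\W", text[max_length]):
        return chunk

    # Otherwise the chunk ends inside a word: drop that partial word
    # together with the run of non-word characters in front of it.
    trimmed = re.sub(r"\W+\w*\Z", "", chunk)

    # Assumption of this sketch: if nothing usable is left, hard-cut.
    return trimmed or chunk
```

For the sample text, get_line(text, 64) cuts the chunk in the middle of "surveillance" and then trims back to "... security and".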

Finally, the complete XSLT 2.0 transformation using regular expressions:

<xsl:stylesheet version="2.0"  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:f="my:f" xmlns:xs="http://www.w3.org/2001/XMLSchema">
 <xsl:output method="text"/>

  <xsl:template match="/*/text()" name="reformat">
    <xsl:param name="pText" select="translate(., '&#xA;', ' ')"/>
    <xsl:param name="pMaxLength" select="64"/>
    <xsl:param name="pTotalLength" select="string-length(.)"/>
    <xsl:param name="pLengthFormatted" select="0"/>

    <xsl:if test="not($pLengthFormatted >= $pTotalLength)">
        <xsl:variable name="vNextLine" 
         select="f:getLine(substring($pText, $pLengthFormatted+1), $pMaxLength)"/>
        <xsl:sequence select="concat($vNextLine, '&#xA;')"/>

        <xsl:call-template name="reformat">
          <xsl:with-param name="pText" select="$pText"/>
          <xsl:with-param name="pMaxLength" select="$pMaxLength"/>
          <xsl:with-param name="pTotalLength" select="$pTotalLength"/>
          <xsl:with-param name="pLengthFormatted" 
                    select="$pLengthFormatted + string-length($vNextLine)"/>
        </xsl:call-template>
    </xsl:if>
  </xsl:template>

  <xsl:function name="f:getLine" as="xs:string?">
    <xsl:param name="pText" as="xs:string?"/>
    <xsl:param name="pLength" as="xs:integer"/>

    <xsl:variable name="vChunk" select="substring($pText, 1, $pLength)"/>

    <xsl:choose>
      <xsl:when test="not(string-length($pText) > $pLength) 
                      or matches(substring($pText, $pLength+1, 1), '\W')">
        <xsl:sequence select="$vChunk"/>
      </xsl:when>
      <xsl:otherwise>
            <xsl:analyze-string select="$vChunk" 
                 regex="^((\W*\w*)*?)(\W+\w*)$">
              <xsl:matching-substring>
                <xsl:sequence select="regex-group(1)"/>
              </xsl:matching-substring>
            </xsl:analyze-string>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:function>
</xsl:stylesheet>
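
The loop performed by the recursive reformat template can also be written iteratively. The sketch below is a hedged Python illustration of that idea, with a local get_line helper standing in for f:getLine(); both names are illustrative, and Python's \W is only an approximation of the XSLT 2.0 regex class. Unlike the stylesheet, it collapses all runs of whitespace and strips the whitespace left at the start of each following line, which are assumptions about the desired output. Because the break test uses \W rather than whitespace only, punctuation can occasionally end up at the start of a line.

```python
import re


def get_line(text: str, max_length: int) -> str:
    """Longest prefix of at most `max_length` chars ending at a word boundary."""
    chunk = text[:max_length]
    if len(text) <= max_length or re.match(r"\W", text[max_length]):
        return chunk
    # Drop the trailing partial word and the non-word run before it;
    # fall back to the hard-cut chunk so the caller always advances.
    return re.sub(r"\W+\w*\Z", "", chunk) or chunk


def reformat(text: str, max_length: int = 64) -> str:
    """Reflow `text` into lines of at most `max_length` characters."""
    remaining = " ".join(text.split())    # collapse newlines and extra spaces
    lines = []
    while remaining:
        line = get_line(remaining, max_length)
        lines.append(line)
        # Skip what was emitted plus the whitespace that follows it.
        remaining = remaining[len(line):].lstrip()
    return "\n".join(lines) + "\n"
```

Each pass emits one line and advances by its length, which corresponds to the pLengthFormatted parameter threaded through the recursive template.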