Using multiple regexes in XSLT: how to reduce processing time

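I am new to programming and to XSLT. I am trying to improve the way I ask and explain my questions, but I still have a long way to go, so apologies if anything is unclear.

I need to detect various alphabets in my XML document. It looks like this, with many more language combinations: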

<text>
<p>Some text. dise´mbər Some text. Some text.</p> <!-- text in International Phonetic Alphabet + English -->
<p>Some text. dise´mbər Some text. Издательство Академии Наук СССР Some text.</p> <!-- text in International Phonetic Alphabet +  English + Cyrillic alphabet -->
<p>Some text. Издательство Академии Наук СССР dise´mbər Some text.  Some text.</p>
<p>Some text. Some text. Издательство Академии Наук СССР Some text.</p> <!-- text in English + Cyrillic alphabet -->
</text>

This is what I started with in XSLT:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xs="http://www.w3.org/2001/XMLSchema"
   exclude-result-prefixes="xs"
   version="2.0">
   <xsl:output method="xml" indent="no" encoding="UTF-8" omit-xml-declaration="no" />
   
   <xsl:template match="*">
      <xsl:element name="{local-name()}">
         <xsl:for-each select="@*">
            <xsl:attribute name="{local-name()}">
               <xsl:value-of select="."/>
            </xsl:attribute>
         </xsl:for-each>
         <xsl:apply-templates/>
      </xsl:element>
   </xsl:template>
   <xsl:template match="processing-instruction()">
      <xsl:processing-instruction name="{local-name()}"><xsl:apply-templates></xsl:apply-templates></xsl:processing-instruction>
   </xsl:template>
   <xsl:template name="IPA">
      <xsl:variable name="text" ><xsl:copy-of select="."/></xsl:variable>
      <xsl:analyze-string select="$text" regex="((\p{{IsIPAExtensions}}|\p{{IsPhoneticExtensions}})+)" >
         
         <xsl:matching-substring>
            <IPA><xsl:value-of select="regex-group(1)"/></IPA>
         </xsl:matching-substring>
         <xsl:non-matching-substring><xsl:copy-of select="."></xsl:copy-of></xsl:non-matching-substring>
      </xsl:analyze-string>
   </xsl:template>
   
   
   <xsl:template name="Cyrillic">
      <xsl:variable name="texte" ><xsl:call-template name="IPA"></xsl:call-template></xsl:variable>
      <xsl:analyze-string select="$texte" regex="(\p{{IsCyrillic}}+)" >
         
         <xsl:matching-substring>
            <Cyrillic><xsl:apply-templates select="regex-group(1)"/></Cyrillic>
         </xsl:matching-substring>
         <xsl:non-matching-substring><xsl:call-template name="IPA"></xsl:call-template></xsl:non-matching-substring>
      </xsl:analyze-string>
   </xsl:template>
   
   
   <xsl:template match="text()">
      <xsl:call-template name="Cyrillic"></xsl:call-template>
   </xsl:template>
   
</xsl:stylesheet>

This gives me XML like this:

<?xml version="1.0" encoding="UTF-8"?><text>
<p>Some text. dise´mb<IPA>ə</IPA>r Some text. Some text.</p>  
<p>Some text. dise´mb<IPA>ə</IPA>r Some text. <Cyrillic>Издательство</Cyrillic>   <Cyrillic>Академии</Cyrillic> <Cyrillic>Наук</Cyrillic> <Cyrillic>СССР</Cyrillic> Some text.</p>  
<p>Some text. <Cyrillic>Издательство</Cyrillic> <Cyrillic>Академии</Cyrillic>
<Cyrillic>Наук</Cyrillic> <Cyrillic>СССР</Cyrillic> dise´mb<IPA>ə</IPA>r Some text.  Some text.</p>
<p>Some text. Some text. <Cyrillic>Издательство</Cyrillic> <Cyrillic>Академии</Cyrillic>   
<Cyrillic>Наук</Cyrillic> <Cyrillic>СССР</Cyrillic> Some text.</p>   
</text>

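This is exactly what I need. However, I use about ten regex blocks, and with this approach the processing time is very long. How would you do it? Do you think XSLT is suitable for this?

Thanks! Maria (XSLT 2, Saxon-HE 9.8.0.8)

Edit: here is the profile: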

<html>
    <head>
     <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
     <title>Analysis of Stylesheet Execution Time</title>
    </head>
    <body>
     <h1>Analysis of Stylesheet Execution Time</h1>
     <p>Total time: 72128.065 milliseconds</p>
     <h2>Time spent in each template, function or global variable:</h2>
     <p>The table below is ordered by the total net time spent in the template,     function
     or global variable. Gross time means the time including called templates and functions
     (recursive calls only count from the original entry);  net time means time excluding
     time spent in called templates and functions.
     </p>
     <table border="border" cellpadding="10">
     <thead>
     <tr>
     <th>file</th>
     <th>line</th>
     <th>instruction</th>
     <th>count</th>
     <th>average time (gross/ms)</th>
     <th>total time (gross/ms)</th>
     <th>average time (net/ms)</th>
     <th>total time (net/ms)</th>
     </tr>
     </thead>
     <tbody>
     <tr>
     <td>     "*code/unicode.xsl"      </td>
     <td>21</td>
     <td>template Greek</td>
     <td align="right">2,755,968</td>
     <td align="right">0.017</td>
     <td align="right">46,854.785</td>
     <td align="right">0.017</td>
     <td align="right">46,854.785</td>
     </tr>
     <tr>
     <td>     "*code/unicode.xsl"      </td>
     <td>32</td>
     <td>template Hebrew</td>
     <td align="right">1,329,696</td>
     <td align="right">0.043</td>
     <td align="right">57,529.163</td>
     <td align="right">0.008</td>
     <td align="right">10,674.378</td>
     </tr>
     <tr>
     <td>     "*code/unicode.xsl"      </td>
     <td>54</td>
     <td>template IPA</td>
     <td align="right">333,984</td>
     <td align="right">0.206</td>
     <td align="right">68,964.076</td>
     <td align="right">0.019</td>
     <td align="right">6,381.186</td>
     </tr>
     <tr>
     <td>     "*code/unicode.xsl"      </td>
     <td>43</td>
     <td>template Cyrillic</td>
     <td align="right">665,392</td>
     <td align="right">0.094</td>
     <td align="right">62,582.890</td>
     <td align="right">0.008</td>
     <td align="right">5,053.727</td>
     </tr>
     <tr>
     <td>     "*code/unicode.xsl"      </td>
     <td>65</td>
     <td>template Arabic</td>
     <td align="right">167,068</td>
     <td align="right">0.421</td>
     <td align="right">70,284.800</td>
     <td align="right">0.008</td>
     <td align="right">1,320.724</td>
     </tr>
     <tr>
     <td>     "*code/unicode.xsl"      </td>
     <td>76</td>
     <td>template Arrows</td>
     <td align="right">83,536</td>
     <td align="right">0.849</td>
     <td align="right">70,945.946</td>
     <td align="right">0.008</td>
     <td align="right">661.146</td>
     </tr>
     <tr>
     <td>     "*code/unicode.xsl"      </td>
     <td>8</td>
     <td>template *</td>
     <td align="right">12,122</td>
     <td align="right">5.959</td>
     <td align="right">72,238.100</td>
     <td align="right">0.034</td>
     <td align="right">413.937</td>
     </tr>
     <tr>
     <td>     "*code/unicode.xsl"      </td>
     <td>87</td>
     <td>template Dingbats</td>
     <td align="right">41,768</td>
     <td align="right">1.708</td>
     <td align="right">71,323.074</td>
     <td align="right">0.009</td>
     <td align="right">377.128</td>
     </tr>
     <tr>
     <td>     "*code/unicode.xsl"      </td>
     <td>98</td>
     <td>template Private</td>
     <td align="right">20,884</td>
     <td align="right">3.427</td>
     <td align="right">71,576.916</td>
     <td align="right">0.012</td>
     <td align="right">253.842</td>
     </tr>
     <tr>
     <td>     "*code/unicode.xsl"      </td>
     <td>18</td>
     <td>template processing-instruction()</td>
     <td align="right">6,907</td>
     <td align="right">0.014</td>
     <td align="right">98.490</td>
     <td align="right">0.014</td>
     <td align="right">98.490</td>
     </tr>
     <tr>
     <td>     "*code/unicode.xsl"      </td>
     <td>121</td>
     <td>template text()</td>
     <td align="right">20,884</td>
     <td align="right">3.429</td>
     <td align="right">71,600.976</td>
     <td align="right">0.001</td>
     <td align="right">24.060</td>
     </tr>
     </tbody>
     </table>
    </body>
</html>

Profile of Martin Honnen's code:

<html>
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      <title>Analysis of Stylesheet Execution Time</title>
   </head>
   <body>
      <h1>Analysis of Stylesheet Execution Time</h1>
      <p>Total time: 2900.594 milliseconds</p>
      <h2>Time spent in each template, function or global variable:</h2>
      <p>The table below is ordered by the total net time spent in the template,    function
         or global variable. Gross time means the time including called templates and functions
         (recursive calls only count from the original entry);  net time means time excluding
         time spent in called templates and functions.
      </p>
      <table border="border" cellpadding="10">
         <thead>
            <tr>
               <th>file</th>
               <th>line</th>
               <th>instruction</th>
               <th>count</th>
               <th>average time (gross/ms)</th>
               <th>total time (gross/ms)</th>
               <th>average time (net/ms)</th>
               <th>total time (net/ms)</th>
            </tr>
         </thead>
         <tbody>
            <tr>
               <td>            "*code/unicode.xsl"       </td>
               <td>44</td>
               <td>template text()</td>
               <td align="right">222,968</td>
               <td align="right">0.009</td>
               <td align="right">1,949.720</td>
               <td align="right">0.009</td>
               <td align="right">1,949.720</td>
            </tr>
            <tr>
               <td>            "*code/unicode.xsl"       </td>
               <td>26</td>
               <td>template text()</td>
               <td align="right">20,884</td>
               <td align="right">0.135</td>
               <td align="right">2,823.597</td>
               <td align="right">0.042</td>
               <td align="right">873.877</td>
            </tr>
         </tbody>
      </table>
   </body>
</html>

In XSLT 3, I would consider the following approach:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xmlns:map="http://www.w3.org/2005/xpath-functions/map"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:param name="scripts"
    as="map(xs:string, xs:string)*"
    select="map { 'Cyrillic' : '\p{IsCyrillic}+'},
            map { 'IPA' : '[\p{IsIPAExtensions}\p{IsPhoneticExtensions}]+' }"/>

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:template match="text()">
      <xsl:iterate select="$scripts">
          <xsl:param name="input" select="."/>
          <xsl:on-completion>
              <xsl:sequence select="$input"/>
          </xsl:on-completion>
          <xsl:next-iteration>
              <xsl:with-param name="input">
                  <xsl:apply-templates select="$input" mode="wrap">
                      <xsl:with-param name="script-map" tunnel="yes" select="."/>
                  </xsl:apply-templates>
              </xsl:with-param>
          </xsl:next-iteration>
      </xsl:iterate>
  </xsl:template>
  
  <xsl:mode name="wrap" on-no-match="shallow-copy"/>
  
  <xsl:template match="text()" mode="wrap">
      <xsl:param name="script-map" tunnel="yes"/>
      <xsl:analyze-string select="." regex="{$script-map?*}">
          <xsl:matching-substring>
              <xsl:element name="{map:keys($script-map)}">
                  <xsl:value-of select="."/>
              </xsl:element>              
          </xsl:matching-substring>
          <xsl:non-matching-substring>
              <xsl:value-of select="."/>
          </xsl:non-matching-substring>
      </xsl:analyze-string>
  </xsl:template>
  
</xsl:stylesheet>

I have not measured whether it performs better, but as for the regular expression, I think [\p{IsIPAExtensions}\p{IsPhoneticExtensions}]+ is simpler than (\p{IsIPAExtensions}|\p{IsPhoneticExtensions})+.
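To illustrate that the two forms are interchangeable, here is a minimal sketch that runs both over a sample string and compares the matches they find. The sample string and the variable names are made up for illustration, and the sketch assumes an XSLT 3.0 processor started at the initial template (for example with Saxon's -it option).

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:output method="text"/>

  <!-- Hypothetical sample text, not taken from the real data -->
  <xsl:variable name="sample" as="xs:string"
      select="'Some text. dise´mbər Some text.'"/>

  <xsl:template name="xsl:initial-template">
    <!-- Matches found with the character-class form -->
    <xsl:variable name="with-class" as="xs:string*"
        select="analyze-string($sample,
                  '[\p{IsIPAExtensions}\p{IsPhoneticExtensions}]+')/fn:match/string()"/>
    <!-- Matches found with the alternation form -->
    <xsl:variable name="with-alternation" as="xs:string*"
        select="analyze-string($sample,
                  '(\p{IsIPAExtensions}|\p{IsPhoneticExtensions})+')/fn:match/string()"/>
    <!-- Outputs whether both forms found the same sequence of substrings -->
    <xsl:value-of select="deep-equal($with-class, $with-alternation)"/>
  </xsl:template>

</xsl:stylesheet>
```

This only checks that the simpler class form finds the same substrings as the alternation. It says nothing about speed.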

The other improvements rely on xsl:mode for the identity transformation and on xsl:iterate.

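Regular expressions such as \p{IsIPAExtensions} should be quite efficient: most blocks are a single contiguous range of code points, so testing a character should only require checking whether it falls in that range. I suspect the cost comes not from testing one character against one Unicode block, but from the number of characters multiplied by the number of blocks.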
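If the number of passes is what matters, one option is to scan each text node once with a single regex that has one capture group per script. In the first profile above, the call counts roughly double from one template to the next (20,884 calls for Private, 2,755,968 for Greek), which suggests that nesting the templates multiplies the work. The following is only a sketch for two scripts, written for XSLT 2.0. The plain identity template is an assumption: it keeps namespaces, unlike the original `*` template, which rebuilds elements from local-name().

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <!-- Identity template (assumption): copies whatever is not handled below -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- One scan per text node; each capture group stands for one script.
       Braces are doubled because the regex attribute is an attribute value template. -->
  <xsl:template match="text()">
    <xsl:analyze-string select="."
        regex="(\p{{IsCyrillic}}+)|([\p{{IsIPAExtensions}}\p{{IsPhoneticExtensions}}]+)">
      <xsl:matching-substring>
        <xsl:choose>
          <!-- regex-group(n) is a zero-length string when group n did not take part in the match -->
          <xsl:when test="regex-group(1)">
            <Cyrillic><xsl:value-of select="."/></Cyrillic>
          </xsl:when>
          <xsl:otherwise>
            <IPA><xsl:value-of select="."/></IPA>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Adding another script would mean adding one more group and one more xsl:when (testing regex-group(2), regex-group(3) and so on) rather than another nested pass. Whether this is actually faster on the real data would still have to be measured.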

It might be worth getting a Java-level profile to see where the time is going. I can guess, but a profile would show whether my guess is right.

Backtracking can hurt regex performance, but I don't immediately see any risk of backtracking with this code.

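The only other approach that comes to mind is to generate one massive translate() call that maps characters to group codes (so all Latin characters become "1", all Cyrillic characters become "2", and so on) and then to process the result. But there is no guarantee that this would perform any better, and it would take a lot of experimentation to find out.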
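For what it is worth, here is one possible reading of the translate() idea as an XSLT 2.0 sketch, limited to two scripts. Everything in it is an assumption made for illustration: the variable names, the hard-coded block ranges, the choice of "2" and "3" as class codes, and the use of xsl:for-each-group to process the translated string. Unlike the description above, unlisted characters (Latin and everything else) are left unchanged rather than mapped to "1". It has not been measured.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">

  <!-- Assumed block ranges: Cyrillic U+0400-U+04FF,
       IPA Extensions U+0250-U+02AF, Phonetic Extensions U+1D00-U+1D7F -->
  <xsl:variable name="cyrillic" as="xs:integer*" select="1024 to 1279"/>
  <xsl:variable name="ipa" as="xs:integer*" select="592 to 687, 7424 to 7551"/>

  <!-- translate() leaves unlisted characters unchanged, so the literal digits
       2 and 3 are remapped to 1 to keep them from looking like class codes -->
  <xsl:variable name="from" as="xs:string"
      select="concat(codepoints-to-string(($cyrillic, $ipa)), '23')"/>
  <xsl:variable name="to" as="xs:string"
      select="string-join(((for $c in $cyrillic return '2'),
                           (for $c in $ipa return '3'), '11'), '')"/>

  <!-- Identity template (assumption): copies whatever is not handled below -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()">
    <!-- Original characters and their class codes, position by position -->
    <xsl:variable name="chars" as="xs:integer*" select="string-to-codepoints(.)"/>
    <xsl:variable name="codes" as="xs:integer*"
        select="string-to-codepoints(translate(., $from, $to))"/>
    <!-- Group adjacent positions whose class code is the same (50 is '2', 51 is '3') -->
    <xsl:for-each-group select="1 to count($chars)"
        group-adjacent="for $code in subsequence($codes, ., 1) return
                        if ($code eq 50) then 'Cyrillic'
                        else if ($code eq 51) then 'IPA'
                        else 'other'">
      <!-- Rebuild the run of original characters covered by this group -->
      <xsl:variable name="run" as="xs:string"
          select="codepoints-to-string(
                    subsequence($chars, current-group()[1], count(current-group())))"/>
      <xsl:choose>
        <xsl:when test="current-grouping-key() eq 'other'">
          <xsl:value-of select="$run"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:element name="{current-grouping-key()}">
            <xsl:value-of select="$run"/>
          </xsl:element>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each-group>
  </xsl:template>

</xsl:stylesheet>
```

The per-character grouping may well cost more than it saves, so this is something to profile against the regex versions, not a recommendation.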