How to efficiently compare 2 large-volume XML files

Question

-- EDIT -- Clarified the documents and the desired output. (This is also why the first responses differ.)

I am trying to compare 2 large XML datasets using XSLT 2.0 (I can also use 3.0), but I am running into performance problems.

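File 1 has about 300k records, and I need to compare them against about 300k records in File 2 to see whether each entry in File 1 exists in File 2. If it does, I need to insert a node into the result. I also need to exclude certain record types from File 1.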

File 1

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <row>
        <col1>100035</col1>
        <col2>3000009091</col2>
        <col3>SSL</col3>
        <col4>8.000000</col4>
        <col5>06-Jul-2020</col5>
        <col6>A</col6>
    </row>
    <row>
        <col1>100002</col1>
        <col2>3000009091</col2>
        <col3>UUT</col3>
        <col4>8.000000</col4>
        <col5>07-Jul-2020</col5>
        <col6>P</col6>
    </row>
    <row>
        <col1>100028</col1>
        <col2>3000009091</col2>
        <col3>UUT</col3>
        <col4>8.000000</col4>
        <col5>08-Jul-2020</col5>
        <col6>P</col6>
    </row>
    <row>
        <col1>100200</col1>
        <col2>3000009091</col2>
        <col3>UUT</col3>
        <col4>8.000000</col4>
        <col5>09-Jul-2020</col5>
        <col6>A</col6>
    </row>
    <row>
        <col1>100689</col1>
        <col2>3000009091</col2>
        <col3>UUT</col3>
        <col4>8.000000</col4>
        <col5>10-Jul-2020</col5>
        <col6>A</col6>
    </row>
    <row>
        <col1>100035</col1>
        <col2>3000013528</col2>
        <col3>UFH</col3>
        <col4>8.000000</col4>
        <col5>16-Jul-2020</col5>
        <col6>A</col6>
    </row>
</root>

File 2

<?xml version="1.0" encoding="UTF-8"?>
<nm:Data xmlns:nm="namespace">
    <nm:Entry>
        <nm:Record>
            <nm:ID>10084722-Jun-2020UUT</nm:ID>
        </nm:Record>
        <nm:Record>
            <nm:ID>48548310-Jul-2020SSL</nm:ID>
        </nm:Record>
        <nm:Record>
            <nm:ID>10000201-Jul-2020UUT</nm:ID>
        </nm:Record>
        <nm:Record>
            <nm:ID>57307407-Jul-2020SSL</nm:ID>
        </nm:Record>
        <nm:Record>
            <nm:ID>10003516-Jul-2020UFH</nm:ID>
        </nm:Record>
        <nm:Record>
            <nm:ID>10020009-Jul-2020UUT</nm:ID>
        </nm:Record>
        <nm:Record>
            <nm:ID>00155501-Jun-2020UUT</nm:ID>
        </nm:Record>
        <nm:Record>
            <nm:ID>10533728-May-2020UUT</nm:ID>
        </nm:Record>
    </nm:Entry>
    <nm:Entry>
        <nm:Record>
            <nm:ID>99954801-Jul-2020UUT</nm:ID>
        </nm:Record>
    </nm:Entry>
    <nm:Entry>
        <nm:Record>
            <nm:ID>30254801-Jun-2020UFH</nm:ID>
        </nm:Record>
    </nm:Entry>
</nm:Data>

Desired output (copy the 'A' records and add a "type" node). The type is "Adj" if File 2 contains a matching ID, otherwise "New":

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <row>
        <type>New</type>
        <col1>100035</col1>
        <col2>3000009091</col2>
        <col3>SSL</col3>
        <col4>8.000000</col4>
        <col5>06-Jul-2020</col5>
        <col6>A</col6>
    </row> 
    <row>
        <type>Adj</type>
        <col1>100200</col1>
        <col2>3000009091</col2>
        <col3>UUT</col3>
        <col4>8.000000</col4>
        <col5>09-Jul-2020</col5>
        <col6>A</col6>
    </row>
    <row>
        <type>New</type>
        <col1>100689</col1>
        <col2>3000009091</col2>
        <col3>UUT</col3>
        <col4>8.000000</col4>
        <col5>10-Jul-2020</col5>
        <col6>A</col6>
    </row>
    <row>
        <type>Adj</type>
        <col1>100035</col1>
        <col2>3000013528</col2>
        <col3>UFH</col3>
        <col4>8.000000</col4>
        <col5>16-Jul-2020</col5>
        <col6>A</col6>
    </row>
</root>

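Initially I could not get the exact output, so I settled for the following XSLT. However, its performance is poor and I need a more efficient solution.

XSLT attempt 1 (I want to replace the exists() and copy-of() calls):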

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:nm="namespace"
    exclude-result-prefixes="xs" version="3.0">
    
    <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
    <xsl:strip-space elements="*"/>
    
    <xsl:variable name="report" select="document('File2.xml')"/>
    
    <xsl:template match="root">
        <root>
            <xsl:for-each select="row[col6 = 'A']">
                <record>
                    <!-- Create value to match against -->
                    <xsl:variable name="inputID" select="concat(col1,col5,col3)"/>
                    
                    <!-- Add Node based on existing match or not -->
                    <xsl:choose>
                        <xsl:when test="exists($report/nm:Data/nm:Entry/nm:Record/nm:ID[. = $inputID])">
                            <type>Adj</type>
                        </xsl:when>
                        <xsl:otherwise>
                            <type>New</type>
                        </xsl:otherwise>
                    </xsl:choose>
                    <!-- Copy all other nodes -->
                    <xsl:copy-of select="."/>
                </record>
            </xsl:for-each>
        </root>
    </xsl:template>
</xsl:stylesheet>

Actual output 1 (not the ideal output, but acceptable):

<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:nm="namespace">
   <record>
      <type>New</type>
      <row>
         <col1>100035</col1>
         <col2>3000009091</col2>
         <col3>SSL</col3>
         <col4>8.000000</col4>
         <col5>06-Jul-2020</col5>
         <col6>A</col6>
      </row>
   </record>
   <record>
      <type>Adj</type>
      <row>
         <col1>100200</col1>
         <col2>3000009091</col2>
         <col3>UUT</col3>
         <col4>8.000000</col4>
         <col5>09-Jul-2020</col5>
         <col6>A</col6>
      </row>
   </record>
   <record>
      <type>New</type>
      <row>
         <col1>100689</col1>
         <col2>3000009091</col2>
         <col3>UUT</col3>
         <col4>8.000000</col4>
         <col5>10-Jul-2020</col5>
         <col6>A</col6>
      </row>
   </record>
   <record>
      <type>Adj</type>
      <row>
         <col1>100035</col1>
         <col2>3000013528</col2>
         <col3>UFH</col3>
         <col4>8.000000</col4>
         <col5>16-Jul-2020</col5>
         <col6>A</col6>
      </row>
   </record>
</root>

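I then took the advice below and tried to apply streaming and the key() function in XSLT 3.0, but I have not been able to get anything working. The closest I got is the XSLT below, but its output is incorrect.

XSLT 3.0 attempt: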

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:nm="namespace"
    exclude-result-prefixes="#all" version="3.0">

    <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:variable name="report" select="document('File2.xml')"/>

    <xsl:key name="ref" match="nm:Data/nm:Entry/nm:Record/nm:ID" use="."/>
    
    <xsl:key name="type-ref" match="row" use="col6"/>
    
    <xsl:mode on-no-match="shallow-copy"/>
    
    <xsl:template match="key('type-ref', 'A')[key('ref', col1 || col3 || col5, $report)]">
        <xsl:copy>
            <type>Adj</type>
            <xsl:apply-templates/>
        </xsl:copy>
    </xsl:template>
    
    <xsl:template match="key('type-ref', 'A')[not(key('ref', col1 || col3 || col5, $report))]">
        <xsl:copy>
            <type>New</type>
            <xsl:apply-templates/>
        </xsl:copy>
    </xsl:template>
    
    <xsl:template match="key('type-ref', 'P')"/>

</xsl:stylesheet>

3.0 output (note that the "Adj" type is not applied correctly, although the P records are removed):

<?xml version="1.0" encoding="UTF-8"?>
<root>
   <row>
      <type>New</type>
      <col1>100035</col1>
      <col2>3000009091</col2>
      <col3>SSL</col3>
      <col4>8.000000</col4>
      <col5>06-Jul-2020</col5>
      <col6>A</col6>
   </row>
   <row>
      <type>New</type>
      <col1>100200</col1>
      <col2>3000009091</col2>
      <col3>UUT</col3>
      <col4>8.000000</col4>
      <col5>09-Jul-2020</col5>
      <col6>A</col6>
   </row>
   <row>
      <type>New</type>
      <col1>100689</col1>
      <col2>3000009091</col2>
      <col3>UUT</col3>
      <col4>8.000000</col4>
      <col5>10-Jul-2020</col5>
      <col6>A</col6>
   </row>
   <row>
      <type>New</type>
      <col1>100035</col1>
      <col2>3000013528</col2>
      <col3>UFH</col3>
      <col4>8.000000</col4>
      <col5>16-Jul-2020</col5>
      <col6>A</col6>
   </row>
</root>

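I do not understand the key() function well enough to adjust this further, nor do I know how to apply the copy() statement correctly when trying to use streaming mode.

Thanks again for your input; I will keep working on it.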

Answer 1

I would use a key (https://www.w3.org/TR/xslt-30/#key) to index the second document and (perhaps additionally) a key to select only certain rows for processing:

  <xsl:key name="ref" match="data/id" use="."/>
  
  <xsl:key name="type-ref" match="row" use="type"/>

  <xsl:mode on-no-match="shallow-copy"/>
  
  <xsl:template match="root">
      <xsl:copy>
          <xsl:apply-templates select="key('type-ref', 'A')"/>
      </xsl:copy>
  </xsl:template>

  <xsl:template match="row[key('ref', id || code || date, $report)]">
      <xsl:copy>
         <type>Adj</type>
         <xsl:apply-templates/>
      </xsl:copy>
  </xsl:template>
  
  <xsl:template match="row[not(key('ref', id || code || date, $report))]">
      <xsl:copy>
         <type>New</type>
         <xsl:apply-templates/>
      </xsl:copy>
  </xsl:template>

https://xsltfiddle.liberty-development.net/a9HjZH/2

The arguments of the key function are explained in https://www.w3.org/TR/xslt-30/#func-key:

fn:key( $key-name    as xs:string,
        $key-value   as xs:anyAtomicType*,
        $top     as node()) as node()*

The third argument is used to identify the selected subtree. If the argument is present, the selected subtree is the set of nodes that have $top as an ancestor-or-self node. If the argument is omitted, the selected subtree is the document containing the context node. This means that the third argument effectively defaults to /.
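
To illustrate the difference the third argument makes, here is a minimal sketch. It assumes File 1 is the principal input and that File2.xml (the file name used in the question) sits next to the stylesheet; the lookup, in-context-doc and in-report element names are made up for the illustration.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:nm="namespace"
    exclude-result-prefixes="#all" version="3.0">

  <!-- Assumption: File2.xml is resolved relative to this stylesheet -->
  <xsl:variable name="report" select="document('File2.xml')"/>

  <xsl:key name="ref" match="nm:Data/nm:Entry/nm:Record/nm:ID" use="."/>

  <xsl:template match="/">
    <lookup>
      <!-- Two arguments: the key is searched in the document that contains
           the context node (File 1 here), which has no nm:ID elements -->
      <in-context-doc>
        <xsl:value-of select="exists(key('ref', '10020009-Jul-2020UUT'))"/>
      </in-context-doc>
      <!-- Three arguments: the key is searched in the subtree rooted at
           $report, i.e. in File 2 -->
      <in-report>
        <xsl:value-of select="exists(key('ref', '10020009-Jul-2020UUT', $report))"/>
      </in-report>
    </lookup>
  </xsl:template>

</xsl:stylesheet>
```

Only the three-argument call can find IDs from the second document, which is why the templates pass $report.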

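Applied to your changed input samples (the only difficulty is concatenating the colX elements in the order in which their values appear in the IDs of the second document), that gives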

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:nm="namespace"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:param name="report">
<nm:Data xmlns:nm="namespace">
    <nm:Entry>
        <nm:Record>
            <nm:ID>10084722-Jun-2020UUT</nm:ID>
        </nm:Record>
        <nm:Record>
            <nm:ID>48548310-Jul-2020SSL</nm:ID>
        </nm:Record>
        <nm:Record>
            <nm:ID>10000201-Jul-2020UUT</nm:ID>
        </nm:Record>
        <nm:Record>
            <nm:ID>57307407-Jul-2020SSL</nm:ID>
        </nm:Record>
        <nm:Record>
            <nm:ID>10003516-Jul-2020UFH</nm:ID>
        </nm:Record>
        <nm:Record>
            <nm:ID>10020009-Jul-2020UUT</nm:ID>
        </nm:Record>
        <nm:Record>
            <nm:ID>00155501-Jun-2020UUT</nm:ID>
        </nm:Record>
        <nm:Record>
            <nm:ID>10533728-May-2020UUT</nm:ID>
        </nm:Record>
    </nm:Entry>
    <nm:Entry>
        <nm:Record>
            <nm:ID>99954801-Jul-2020UUT</nm:ID>
        </nm:Record>
    </nm:Entry>
    <nm:Entry>
        <nm:Record>
            <nm:ID>30254801-Jun-2020UFH</nm:ID>
        </nm:Record>
    </nm:Entry>
</nm:Data>
  </xsl:param>
  
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>
    
  <xsl:key name="ref" match="nm:Data/nm:Entry/nm:Record/nm:ID" use="."/>
  
  <xsl:key name="type-ref" match="row" use="col6"/>

  <xsl:mode on-no-match="shallow-copy"/>
  
  <xsl:template match="root">
      <xsl:copy>
          <xsl:apply-templates select="key('type-ref', 'A')"/>
      </xsl:copy>
  </xsl:template>

  <xsl:template match="row[key('ref', col1 || col5 || col3, $report)]">
      <xsl:copy>
         <type>Adj</type>
         <xsl:apply-templates/>
      </xsl:copy>
  </xsl:template>
  
  <xsl:template match="row[not(key('ref', col1 || col5 || col3, $report))]">
      <xsl:copy>
         <type>New</type>
         <xsl:apply-templates/>
      </xsl:copy>
  </xsl:template>
  
</xsl:stylesheet>

https://xsltfiddle.liberty-development.net/a9HjZH/3
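
If only an XSLT 2.0 processor is available, the same key lookup can be sketched without `||` and xsl:mode, which are 3.0 features. This is an untested sketch. It assumes the second document is loaded with document('File2.xml') as in the question, and it uses an explicit identity template in place of on-no-match="shallow-copy":

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:nm="namespace"
    exclude-result-prefixes="nm" version="2.0">

  <!-- Assumption: File2.xml is resolved relative to this stylesheet -->
  <xsl:variable name="report" select="document('File2.xml')"/>

  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:key name="ref" match="nm:Data/nm:Entry/nm:Record/nm:ID" use="."/>

  <!-- Identity template: copies everything not matched more specifically -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Process only the 'A' rows; all other rows are dropped -->
  <xsl:template match="root">
    <xsl:copy>
      <xsl:apply-templates select="row[col6 = 'A']"/>
    </xsl:copy>
  </xsl:template>

  <!-- concat() replaces the 3.0 '||' operator; the order col1, col5, col3
       follows the layout of the nm:ID values -->
  <xsl:template match="row">
    <xsl:copy>
      <type>
        <xsl:value-of select="if (key('ref', concat(col1, col5, col3), $report))
                              then 'Adj' else 'New'"/>
      </type>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Deciding between Adj and New inside a single row template avoids evaluating the key lookup twice per row.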

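Finally, with XSLT 3 and streaming (e.g. using Saxon 9 or 10 EE), you can use a different approach: read the second document into a map using streaming, then stream through the first input document and perform template matching on each row materialized in memory: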

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:map="http://www.w3.org/2005/xpath-functions/map"
    exclude-result-prefixes="#all"
    version="3.0">
    
    <xsl:param name="doc2-uri" as="xs:string">input-sample2.xml</xsl:param>
    
    <xsl:strip-space elements="*"/>
    <xsl:output indent="yes"/>
        
    <xsl:param name="key-map" as="map(xs:string, xs:boolean)">
        <xsl:map>
            <xsl:source-document href="{$doc2-uri}" streamable="yes">
                <xsl:iterate select="data/id">
                    <xsl:map-entry key="string()" select="true()"/>
                </xsl:iterate>
            </xsl:source-document>
        </xsl:map>
    </xsl:param>
    
    <xsl:mode on-no-match="shallow-copy" streamable="yes"/>
    
    <xsl:template match="root">
        <xsl:copy>
            <xsl:apply-templates select="row!copy-of()" mode="grounded"/>
        </xsl:copy>
    </xsl:template>
    
    <xsl:mode name="grounded" on-no-match="shallow-copy"/>
    
    <xsl:template match="row[map:contains($key-map, id || code || date)]" mode="grounded">
        <xsl:copy>
            <type>Adj</type>
            <xsl:apply-templates mode="#current"/>
        </xsl:copy>
    </xsl:template>
    
    <xsl:template match="row[not(map:contains($key-map, id || code || date))]" mode="grounded">
        <xsl:copy>
            <type>New</type>
            <xsl:apply-templates mode="#current"/>
        </xsl:copy>
    </xsl:template>
    
</xsl:stylesheet>

Or, for the adjusted input samples and the explicit requirement to process only rows of certain types:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:map="http://www.w3.org/2005/xpath-functions/map"
    xmlns:nm="namespace"
    exclude-result-prefixes="#all"
    version="3.0">
    
    <xsl:param name="doc2-uri" as="xs:string">input2-sample2.xml</xsl:param>
    
    <xsl:strip-space elements="*"/>
    <xsl:output indent="yes"/>
    
    <xsl:param name="key-map" as="map(xs:string, xs:boolean)">
        <xsl:map>
            <xsl:source-document href="{$doc2-uri}" streamable="yes">
                <xsl:iterate select="nm:Data/nm:Entry/nm:Record/nm:ID">
                    <xsl:map-entry key="string()" select="true()"/>
                </xsl:iterate>
            </xsl:source-document>
        </xsl:map>
    </xsl:param>
    
    <xsl:mode on-no-match="shallow-copy" streamable="yes"/>
    
    <xsl:template match="root">
        <xsl:copy>
            <xsl:apply-templates select="row!copy-of()[col6 = 'A']" mode="grounded"/>
        </xsl:copy>
    </xsl:template>
    
    <xsl:mode name="grounded" on-no-match="shallow-copy"/>
    
    <xsl:template match="row[map:contains($key-map, col1 || col5 || col3)]" mode="grounded">
        <xsl:copy>
            <type>Adj</type>
            <xsl:apply-templates mode="#current"/>
        </xsl:copy>
    </xsl:template>
    
    <xsl:template match="row[not(map:contains($key-map, col1 || col5 || col3))]" mode="grounded">
        <xsl:copy>
            <type>New</type>
            <xsl:apply-templates mode="#current"/>
        </xsl:copy>
    </xsl:template>
    
</xsl:stylesheet>

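This should keep memory consumption for the first document low, even if you have millions of rows. The second document is streamed through to build a lightweight map that stores the keys, instead of holding the full XML tree and its key index in memory.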

xslt

xslt-2.0

xslt-3.0