Using discard-document with Saxon and XQuery

I'm looking for an example of how to use Saxon's discard-document feature. I have about 50 files of 40 MB each, so together they use about 4.5 GB of memory in my XQuery script.

I tried using saxon:discard-document(doc("filename.xml")) after each call to an XML file, but maybe that is not the right approach? It made no difference to the memory footprint.

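I also found some questions about its usage (from 7 years ago) that suggest running the XPath against the result of discard-document. But I have a lot of calls to that document, so I would have to replace every declaration with saxon:discard-document(doc("filename.xml"))/xpath/etc/etc/etc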

Thanks

I think this is a good question and there isn't much information available, so I'll try to answer it myself.

Here is an example of how to use saxon:discard-document:

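(: Bind the Saxon extension namespace explicitly; harmless if the prefix is already bound. :)
declare namespace saxon = "http://saxon.sf.net/";

(: Whatever processing is needed on a single document. :)
declare function local:doStuffInDocument($doc as document-node()) {
  $doc//testPath
};

let $urls := ("http://url1", "http://url2")
let $results :=
  for $url in $urls
  (: Load the document and immediately drop it from Saxon's document cache. :)
  let $doc := saxon:discard-document(doc($url))
  return local:doStuffInDocument($doc)
return $results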

With code like this, I managed to reduce memory consumption from over 4 GB to only 300 MB.

To understand what discard-document does, see this excellent comment by Michael Kay on the SF mailing list:

Just to explain what discard-document() does:

Saxon maintains (owned by the Transformer/Controller) a table that maps document URIs to document nodes. When you call the document() function, Saxon looks to see if the URI is in this table, and if it is, it returns the corresponding document node. If it isn't, it reads and parses the resource found at that URI. The effect of saxon:discard-document() is to remove the entry for a document from this mapping table. (Of course, if a document is referenced from this table then the garbage collector will hold the document in memory; if it is not referenced from the table then it becomes eligible for garbage collection. It won't be garbage collected if it's referenced from a global variable; but it will still be absent from the table in the event that another call on document() uses the same URI again.)

And here is another one from Michael Kay, found on the Altova mailing list:

In Saxon, if you use the doc() or document() function, then the file will be loaded into memory, and will stay in memory until the end of the run, just in case it's referenced again. So you will hit the same memory problem with lots of small files as with one large file - worse, in fact, since there is a significant per-document overhead.

However, there's a workaround: an extension function saxon:discard-document() that causes a document to be discarded from memory by the garbage collector as soon as there are no more references to it.

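It may be useful to understand what actually happens behind the scenes. The doc() function looks in a cache to see whether the document is already there; if not, it reads the document, adds it to the cache, and then returns it. The discard-document() function checks whether the document is in the cache; if it is, it removes it, and then returns the document. Because the document is no longer in the cache, it becomes eligible for garbage collection once nothing else references it. If using discard-document has no effect on memory consumption, that is probably because something else still holds a reference to the document, for example a global variable.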
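As a rough sketch of how that plays out in a query: the document can be loaded and discarded once per iteration and bound to a local variable, so the individual path expressions do not each need to be wrapped. The helper name local:load, the file names, the summary element and the testPath element are placeholders for illustration only, and the explicit namespace declaration assumes the standard Saxon namespace URI.

```xquery
declare namespace saxon = "http://saxon.sf.net/";

(: Hypothetical helper: read a document and drop its cache entry in one step. :)
declare function local:load($uri as xs:string) as document-node() {
  saxon:discard-document(doc($uri))
};

(: Placeholder file names. By contrast, a global variable such as
     declare variable $all := for $u in $uris return doc($u);
   would keep every document referenced for the whole query. :)
for $uri in ("file1.xml", "file2.xml")
(: $doc is local to this iteration; the result below holds only counts, not nodes. :)
let $doc := local:load($uri)
return
  <summary uri="{$uri}" elements="{count($doc//*)}" testPaths="{count($doc//testPath)}"/>
```

Because the result contains only atomic values derived from each document, no reference to the document node should survive the iteration, which is the condition under which the garbage collector can reclaim it.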