XQuery: exclude certain files by underscore in file name

Question

I have the following collection structure:

SCTA
 --lectio1
   --lectio1.xml
   --reims_lectio1.xml
   --sorb_lectio1.xml
 --lectio2
   --lectio2.xml
   --reims_lectio2.xml
   --sorb_lectio2.xml

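Using XQuery, I want to search only the files whose names do not contain an "_".

The following query works, but it searches all of the files. I would like to modify it so that it searches only lectio1.xml and lectio2.xml, and skips the files with an "_" in their names.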

for $file in collection('/db/SCTA/')
    for $p at $i in $file/tei:TEI//tei:p
        let $param1:= request:get-parameter('param1', 'oyta')
        let $pid := data($p/@xml:id)
        let $fs := data($file/tei:TEI/tei:text/tei:body/tei:div/@xml:id)
        let $title := $file/tei:TEI/tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title/text()

        where ($p[contains(., $param1)])
        order by $fs
        return 
        <p>{$fs}: {$title}: {$pid}: {$p/text()}</p>

Any ideas?

Answer 1

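An answer that relies only on functions available in the XQuery specification would filter the results of the collection() function by parsing the result of the base-uri() function for each document in the collection. For example:

for $file in collection('/db/SCTA')[not(contains(replace(base-uri(.), '^.*/([^/]+?)$', '$1'), '_'))]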

Since it appears you are using eXist, we can use one of eXist's utility functions, util:document-name(), to make this a bit easier:

for $file in collection('/db/SCTA')[not(contains(util:document-name(.), '_'))]

For the function documentation of util:document-name(), see http://exist-db.org/exist/apps/fundocs/view.html?uri=http://exist-db.org/xquery/util#document-name.1
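
As a variation on the standards-only filter, the file name can also be isolated with tokenize() instead of a regular-expression replace. The following is a minimal sketch, assuming the same /db/SCTA collection as in the question; it returns only the URIs of the documents that pass the filter, which is a quick way to check the selection before running the full search:

```xquery
(: List the documents whose file name contains no underscore.
   tokenize() splits the document URI on "/" and [last()] keeps
   the final segment, i.e. the file name. :)
for $file in collection('/db/SCTA')[not(contains(tokenize(base-uri(.), '/')[last()], '_'))]
return base-uri($file)
```

With the structure shown in the question, this should list only the URIs ending in lectio1.xml and lectio2.xml.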

--

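While you didn't ask for advice on opportunities to optimize your query, some aspects of your code seem worth discussing.

Unless you have reasons beyond what is shown in this code sample, you might consider merging the two nested FLWOR expressions into one: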

let $param1:= request:get-parameter('param1', 'oyta')
let $docs := collection('/db/SCTA')[not(contains(util:document-name(.), '_'))]

for $p in $docs//tei:p[contains(., $param1)]
let $pid := $p/@xml:id/string()
let $fs := $p/ancestor::tei:div[last()]/@xml:id/string()
let $title := root($p)/tei:TEI/tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title/string()
order by $fs
return 
    <p>{$fs}: {$title}: {$pid}: {$p/string()}</p>

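A few notes:

We get the value of $param1 once, rather than again on every iteration of the FLWOR expression.
We identify the documents in a let clause rather than a for clause, since we are really interested in iterating over the sequence of all tei:p elements, period, rather than over the tei:p elements in each document.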

We take advantage of eXist's structural index to descend directly to the tei:p elements, rather than specifying any intermediate child axis steps; we use the XPath ancestor axis to reach up to the p's highest (outermost) tei:div; and we use the root() function to jump up to the document node in order to get back down to the tei:teiHeader (alternatively, use $p/preceding::tei:titleStmt/tei:title). For more, see "Prefer short paths".

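We use a predicate instead of a where clause. As noted in eXist's documentation, predicates allow eXist's query optimizer to get more performance out of a FLWOR expression. Not that you can't use where; it is just best avoided in eXist.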

We use the string() function in place of data() and text(). In some respects this can be seen as a stylistic choice, but after reading articles such as Evan Lenz's "text() is a code smell", I prefer the precision of string() when I want the string value of an attribute, or a single string value of an element that may contain mixed content. (The article mostly covers text(), but see also the discussion of data() in this thread.)
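
The difference between text() and string() on mixed content can be seen in a small sketch. The in-memory p element below is a made-up example, not taken from the SCTA files:

```xquery
(: Hypothetical paragraph with mixed content: text plus an inline element. :)
let $p := <p xml:id="p1">In <hi>principio</hi> erat verbum</p>
return (
    (: text() selects only the direct child text nodes, so this yields
       two nodes, "In " and " erat verbum"; "principio" is skipped :)
    $p/text(),
    (: string() yields one string with all descendant text:
       "In principio erat verbum" :)
    $p/string(),
    (: string value of the attribute: "p1" :)
    $p/@xml:id/string()
)
```

In the original query, {$p/text()} would therefore drop any words wrapped in inline markup, while {$p/string()} keeps them.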

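One step I have not demonstrated here is applying a full-text index to your tei:p elements to speed up and improve the search capabilities of this query. If you defined a full-text index on tei:p, you could change the for clause to: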

for $p in $docs//tei:p[ft:query(., $param1)]

Then param1 could use the full power of Lucene's query parser syntax, including stemming, case insensitivity (contains is case sensitive), wildcarding, proximity, etc. But full text indexing is covered in eXist's documentation: http://exist-db.org/exist/apps/doc/lucene.xml
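
Put together, the full-text variant might look roughly like the sketch below. It assumes that the documents use the standard TEI namespace, that a Lucene index on tei:p has already been configured for the collection, and that the request, util and ft prefixes are bound as in a default eXist installation; none of this is confirmed by the question.

```xquery
(: Assumption: the files use the standard TEI namespace. :)
declare namespace tei = "http://www.tei-c.org/ns/1.0";

let $param1 := request:get-parameter('param1', 'oyta')
(: Keep only documents whose file name has no underscore. :)
let $docs := collection('/db/SCTA')[not(contains(util:document-name(.), '_'))]

(: Assumption: a Lucene full-text index is defined on tei:p;
   without it, ft:query() has nothing to search. :)
for $p in $docs//tei:p[ft:query(., $param1)]
let $pid := $p/@xml:id/string()
let $fs := $p/ancestor::tei:div[last()]/@xml:id/string()
order by $fs
return
    <p>{$fs}: {$pid}: {$p/string()}</p>
```

Because the search term is now handed to Lucene, a value such as param1=oyt* would be treated as a wildcard query rather than as a literal substring.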
