Punctuation and Near Query

Even when I turn on punctuation-insensitive in cts:word-query, a NEAR query still splits a hyphenated word into two words.

let $xml :=

  <abstracts count="1">
            <abstract>
              <abstract_text count="1">
                <p>We assessed the impact of a pharmacotherapy follow-up programme on key safety points [adverse events (AE) 
                and drug administration] in outpatients treated with oral antineoplastic agents (OAA). We performed a comparative, 
                interventional, quasi-experimental study of outpatients treated with OAA in a Spanish hospital to compare pre-intervention 
                group patients (not monitored by pharmacists during 2011) with intervention group patients (prospectively monitored by 
                pharmacists during 2013). AE data were collected from medical records. Follow-up was 6 months, and 249 patients were 
                included (pre-intervention, 115; intervention, 134). After the first month, AE were detected in 86.5% of patients 
                in the pre-intervention group and 80.6% of patients in the intervention group, P = 0.096. During the remaining months, 
                79.0% patients had at least one AE in the pre-intervention group compared with 78.0% in the intervention group, P = 0.431. 
                AE were more prevalent with sorafenib and sunitinib. In total, 173 drug interactions were recorded (pre-intervention, 80; 
                intervention, 93; P = 0.045). Drug interactions were more frequent with erlotinib and gefitinib; food interactions were 
                more common with sorafenib and pazopanib. Our follow-up of cancer outpatients revealed a reduction in severe AE and major 
                drug interactions, thus helping health professionals to monitor the safety of OAA.</p>
              </abstract_text>
            </abstract>
          </abstracts>

let $q3 :=
  cts:near-query(
    (
      (: first operand: study-design terms inside abstract_text :)
      cts:element-query(
        xs:QName("abstract_text"),
        cts:word-query(
          ("Controlled", "randomized", "randomised", "clinical", "masked", "blind*",
           "multi center", "open label*", "compar*", "cross over", "placebo",
           "post market", "meta analysis", "volunteer*", "prospective"),
          ("case-insensitive", "punctuation-insensitive", "wildcarded")
        )
      ),
      (: second operand: "study" / "trial" variants inside abstract_text :)
      cts:element-query(
        xs:QName("abstract_text"),
        cts:word-query(
          ("stud*", "trial*"),
          ("case-insensitive", "punctuation-insensitive", "wildcarded")
        )
      )
    ),
    (: distance, in words, allowed between the two matches :)
    3
  )

return
  cts:highlight($xml, $q3, <b>{$cts:text}</b>)

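When I set the NEAR distance to 3, it does not match "comparative" and "study", even though I count the distance as 3 and I have punctuation-insensitive turned on. When I change the distance to 4, it works.

When I switch to punctuation-sensitive instead, it still does not match at a NEAR distance of 3. Why is that?

I also want a word-query to match both "placebo-controlled" and "placebo controlled". I assumed that once I turn on punctuation-insensitive and search for "placebo controlled" in my word query, it would find every combination of the words. But how does that affect the NEAR distance when the same word query is used inside a NEAR query?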

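This actually has nothing to do with how punctuation is handled at search time. It comes down to how MarkLogic tokenizes the text and indexes the position of each word. By default, MarkLogic's tokenization breaks a hyphenated phrase into separate words. If you don't like the default behavior, you can use a custom tokenizer to tell MarkLogic how words should be indexed. There is a very detailed guide on using a custom tokenizer to ignore hyphens in word tokenization available here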

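For your case, though, I'm not sure I would recommend exploring a custom tokenizer. It can have unintended consequences, and it does not perform as well as the default tokenization. It probably makes more sense to adapt your code to the way default tokenization works.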
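As a rough sketch of what adapting to default tokenization could look like, the query below counts each part of a hyphenated term as its own word when choosing the NEAR distance. The shortened `$xml` fragment and the trimmed word lists are illustrative assumptions; only the distance of 4 comes from the question, which reports that 4 matches where 3 does not.

```xquery
xquery version "1.0-ml";

(: Illustrative fragment only: one sentence taken from the abstract in the question. :)
let $xml :=
  <abstract_text>
    <p>We performed a comparative, interventional, quasi-experimental study of outpatients.</p>
  </abstract_text>

let $q :=
  cts:near-query(
    (
      cts:word-query("compar*", ("case-insensitive", "wildcarded")),
      cts:word-query("stud*", ("case-insensitive", "wildcarded"))
    ),
    (: 4 rather than 3: "quasi-experimental" counts as two words :)
    4
  )

return
  cts:highlight($xml, $q, <b>{$cts:text}</b>)
```

The trade-off is that a wider distance also admits matches that are further apart in text with no hyphens, so the value is worth checking against real documents.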

Let's look at: comparative, interventional, quasi-experimental study

It will be tokenized as:

Word            | Position
comparative     | 0
interventional  | 1
quasi           | 2
experimental    | 3
study           | 4

So the distance between "comparative" and "study" is 4. Note that "quasi-experimental" is tokenized as two words.
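If you want to check word positions like these yourself, one option is to run the text through `cts:tokenize` and keep only the word tokens. This is a minimal sketch that assumes default tokenization for the database's language; the zero-based numbering is added here only to mirror the table above, and the exact tokens can differ if a custom tokenizer or another language is configured.

```xquery
xquery version "1.0-ml";

let $text := "comparative, interventional, quasi-experimental study"

(: cts:tokenize returns word, punctuation and space tokens; keep the words only :)
let $words := cts:tokenize($text)[. instance of cts:word]

for $w at $i in $words
return fn:concat($i - 1, ": ", $w)
```

Subtracting the position of one word from the position of the other gives the distance that the NEAR query has to cover.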

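I'm not sure I understand the question in your last paragraph, but I hope this gives you enough information to better understand how default tokenization behaves.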