Elasticsearch

Question

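I need to get the count of word X across all texts in index Y, which has a single field, "content". Note that I need the total number of occurrences of a specific word, that is, how many times it appears in all documents combined. I found that ES is not well optimized for this (because the field is a text type), but this is a university assignment, so I have no choice.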

So far I have tried the following (taken from here):

{
  "script_fields": {
    "phrase_Count": {
      "script": {
        "lang": "painless",
        "source": "int count = 0; if(doc['content.keyword'].size() > 0 && doc['content'].value.indexOf(params.phrase)!=-1) count++; return count;",
        "params": {
          "phrase": "ustawa"
        }
      }
    }
  }
}

The script approach returns:

{
  "error": {
    "root_cause": [
      {
        "type": "script_exception",
        "reason": "runtime error",
        "script_stack": [
          "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:88)",
          "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:40)",
          "if(doc['content.keyword'].size() > 0 && doc['content'].value.indexOf(params.phrase)!=-1) ",
          "       ^---- HERE"
        ],
        "script": "int count = 0; if(doc['content.keyword'].size() > 0 && doc['content'].value.indexOf(params.phrase)!=-1) count++; return count;",
        "lang": "painless",
        "position": {
          "offset": 22,
          "start": 15,
          "end": 104
        }
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "bills",
        "node": "MXtcD7-zT-mhDyxMeRTMLw",
        "reason": {
          "type": "script_exception",
          "reason": "runtime error",
          "script_stack": [
            "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:88)",
            "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:40)",
            "if(doc['content.keyword'].size() > 0 && doc['content'].value.indexOf(params.phrase)!=-1) ",
            "       ^---- HERE"
          ],
          "script": "int count = 0; if(doc['content.keyword'].size() > 0 && doc['content'].value.indexOf(params.phrase)!=-1) count++; return count;",
          "lang": "painless",
          "position": {
            "offset": 22,
            "start": 15,
            "end": 104
          },
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "No field found for [content.keyword] in mapping with types []"
          }
        }
      }
    ]
  },
  "status": 400
}

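I used content.keyword above because, with plain content, ES complained that text fields are not optimized for this kind of search.

I also tried using text statistics (from here), but I could not get it to work: it only counts the documents that contain the word, which is not what I am looking for.

As a last resort, I tried an aggregation search (from here), but it also returns only the number of documents, not the number of word occurrences: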

{
  "query": {
    "query_string": {
      "fields": ["content"],
      "query": "ustawa"
    }
  },  
  "aggs": {
    "my-terms": {
      "terms": {
        "field": "content.keyword"
      }
    }
  }
}

How can I achieve this? In case it matters, I am using Python.

Edit: this is the index mapping I am using:

  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }

Answer 1

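Elasticsearch 7.11 introduced runtime_mappings. With this feature you can build a new field at runtime and then use a regular "sum" aggregation to calculate the word count across all documents.

For example: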

PUT test/_doc/1
{
  "field" : "test test test ss"
}
PUT test/_doc/2
{
  "field" : "test test test ss"
}
GET test/_search
{
  "size": 0, 
  "runtime_mappings": {
    "phrase_count": {
      "type": "long",
      "script": """
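         // Read the raw string from the keyword subfield.
         String tmp = doc['field.keyword'].value;
         // Match every occurrence of the target word.
         Matcher m = /(test)/.matcher(tmp);
         int count = 0;
         while (m.find()){
           count++;
         }
         // Emit this document's count as the runtime field value.
         emit(count);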
          """
    }
  },
  "query": {
    "match_all": {}
  }, 
  "aggs": {
    "word_count": {
      "sum": {
        "field": "phrase_count"
      }
    }
  }
}

The word "test" in the matcher is the word you are looking for and want to count.
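Since the question mentions Python, here is a minimal sketch of how the same request body could be built there, together with a client-side count for cross-checking the aggregation on a small sample. The function names `build_word_count_search` and `count_locally` are made up for illustration. The sketch assumes, like the example above, that a `.keyword` subfield exists; the mapping shown in the question does not define one, and the default dynamic keyword subfield skips values longer than 256 characters, so this may need adapting.

```python
import re
from typing import Iterable


def build_word_count_search(field: str, word: str) -> dict:
    """Build a search body mirroring the request in the answer above.

    `field` and `word` are illustrative parameters; the answer itself
    hard-codes `field.keyword` and `test`.
    """
    # Only plain letters/digits are accepted so the word can be placed
    # inside the Painless regex literal without any escaping.
    if not word.isalnum():
        raise ValueError("word must contain only letters and digits")
    script = (
        f"String tmp = doc['{field}'].value; "
        f"Matcher m = /({word})/.matcher(tmp); "
        "int count = 0; "
        "while (m.find()) { count++; } "
        "emit(count);"
    )
    return {
        "size": 0,
        "runtime_mappings": {"phrase_count": {"type": "long", "script": script}},
        "query": {"match_all": {}},
        "aggs": {"word_count": {"sum": {"field": "phrase_count"}}},
    }


def count_locally(texts: Iterable[str], word: str) -> int:
    """Count non-overlapping matches of `word` across `texts`, like the regex above."""
    pattern = re.compile(re.escape(word))
    return sum(len(pattern.findall(text)) for text in texts)


if __name__ == "__main__":
    body = build_word_count_search("field.keyword", "test")
    print(body["runtime_mappings"]["phrase_count"]["script"])
    # The two sample documents from the answer contain "test" three times each.
    print(count_locally(["test test test ss", "test test test ss"], "test"))
```

Note that both the regex and the local count match substrings, so a word embedded in a longer word is counted as well.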

Answer 2

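The Elasticsearch API has a built-in feature for retrieving this kind of information, because document and term frequencies are central to BM25 scoring in Elasticsearch. See the Term vectors API and its term statistics option. You are looking for the "total term frequency" value.

If you only want term statistics for a specific term rather than for every term in an existing document, you can send an "artificial document" to the API that contains only the terms you are looking for.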
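A minimal Python sketch of the artificial-document approach, using only the standard library. The helper names (`build_termvectors_body`, `extract_ttf`, `fetch_ttf`) and the base URL in the usage note are assumptions for illustration; the request options and the `ttf` response key are those of the Term vectors API.

```python
import json
import urllib.request


def build_termvectors_body(field: str, word: str) -> dict:
    """Request body for an artificial document holding only `word`."""
    return {
        "doc": {field: word},
        "fields": [field],
        "term_statistics": True,   # needed to get `ttf` and `doc_freq`
        "field_statistics": False,
        "positions": False,
        "offsets": False,
        "payloads": False,
    }


def extract_ttf(response: dict, field: str) -> dict:
    """Map each returned term to its total term frequency (`ttf`)."""
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    return {term: stats.get("ttf", 0) for term, stats in terms.items()}


def fetch_ttf(base_url: str, index: str, field: str, word: str) -> dict:
    """POST to `<base_url>/<index>/_termvectors`; needs a reachable cluster."""
    request = urllib.request.Request(
        f"{base_url}/{index}/_termvectors",
        data=json.dumps(build_termvectors_body(field, word)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return extract_ttf(json.load(reply), field)
```

With a local cluster this might be called as `fetch_ttf("http://localhost:9200", "bills", "content", "ustawa")`. The keys of the result are the analyzed tokens, so a custom analyzer such as my_analyzer may return a different form of the word. As far as the documentation describes, the statistics are taken from a single shard and ignore deleted documents, so on a multi-shard index the value is approximate.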

Elasticsearch - count word occurrences in all texts from index

python

full-text-search