Greek language support for lunr.js

Question

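Registering a new stemmer function for Greek words in lunr doesn't work as expected. Here is my code on CodePen. I don't get any errors, and the function stemWord() works fine when used on its own, but it fails to stem the words in lunr. Below is a sample of the code: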

function stemWord(w) {
// code that returns the stemmed word
};

// create the new function
var greekStemmer = function (token) {
    return stemWord(token);
};

// register it with lunr.Pipeline, this allows you to still serialise the index
lunr.Pipeline.registerFunction(greekStemmer, 'greekStemmer')

var index = lunr(function () {
  this.field('title', {boost: 10})
  this.field('body')
  this.ref('id')

  this.pipeline.remove(lunr.trimmer) // it doesn't work well with non-latin characters
  this.pipeline.add(greekStemmer)
})

index.add({
  id: 1,
  title: 'ΚΑΠΟΙΟΣ',
  body: 'Foo foo foo!'
})

index.add({
  id: 2,
  title: 'ΚΑΠΟΙΕΣ',
  body: 'Bar bar bar!'
})

index.add({
  id: 3,
  title: 'ΤΙΠΟΤΑ',
  body: 'Bar bar bar!'
})

Answer 1

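In lunr, a stemmer is implemented as a pipeline function. A pipeline function is run against every word in a document when the document is indexed, and against every word in a search query when searching.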

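For a function to work in a pipeline, it has to implement a very simple interface: it needs to accept a single string as input, and it must return a string as output.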
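As a rough mental model of the above, and not lunr's actual implementation, a pipeline can be thought of as a list of string-to-string functions applied to each token in turn. The names `runPipeline` and `dropTrailingS` below are made up for illustration. The point is only that indexing and searching go through the same functions, so a document word and a query word end up as the same token.

```javascript
// Simplified model of a pipeline; this is NOT lunr's real implementation.
// Each function receives one string and returns one string.
function runPipeline(fns, tokens) {
  return tokens.map(function (token) {
    return fns.reduce(function (current, fn) {
      return fn(current)
    }, token)
  })
}

// Toy "stemmer" used only for this illustration (hypothetical).
var dropTrailingS = function (word) {
  return word.replace(/s$/, '')
}

// The same pipeline runs at index time and at search time.
var indexedTokens = runPipeline([dropTrailingS], ['cats', 'dog'])
var queryTokens = runPipeline([dropTrailingS], ['cat'])

console.log(indexedTokens) // [ 'cat', 'dog' ]
console.log(queryTokens)   // [ 'cat' ]
```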

So a very simple (and useless) pipeline function would look like this:

var simplePipelineFunction = function (word) {
  return word
}

To actually use this pipeline function we need to do two things:

1. Register it as a pipeline function; this allows lunr to correctly serialise and deserialise your pipeline.
2. Add it to your index's pipeline.

That would look something like this:

// registering our pipeline function with the name 'simplePipelineFunction'
lunr.Pipeline.registerFunction(simplePipelineFunction, 'simplePipelineFunction')

var idx = lunr(function () {
  // adding the pipeline function to our index's pipeline
  // when defining the pipeline
  this.pipeline.add(simplePipelineFunction)
})

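Now you can take the above and swap out the implementation of our pipeline function. So, instead of just returning the word unchanged, it could use the Greek stemmer you have found to stem the word, perhaps like this: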

var myGreekStemmer = function (word) {
  // I don't know how to use the Greek stemmer, but I think
  // it's safe to assume it won't be that different from this
  return greekStem(word)
}

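Adapting lunr to work with a language other than English takes more than just adding a stemmer. lunr's default language is English, so by default it includes pipeline functions that are specialised for English. English and Greek are different enough that you will probably run into problems if you try to index Greek words with the English defaults, so we need to do the following: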

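- Replace the default stemmer with our language-specific stemmer.
- Remove the default trimmer, which doesn't play well with non-Latin characters.
- Replace or remove the default stop word filter, which is unlikely to be of much use for a language other than English.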

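The trimmer and the stop word filter are implemented as pipeline functions, so writing language-specific versions of them is similar to writing the stemmer.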
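As a minimal sketch of what such language-specific functions might look like: the trimmer below strips leading and trailing characters that are not letters or digits in any script, and the stop word filter drops a handful of common Greek words. The word list is a tiny illustrative sample, not a vetted Greek stop word list, and the convention that returning `undefined` drops a token is an assumption about the lunr version in use.

```javascript
// Sketch of a trimmer that keeps letters and digits from any script.
// \p{L} and \p{N} need the "u" flag (ES2018 Unicode property escapes).
var greekTrimmer = function (token) {
  return token.replace(/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu, '')
}

// Tiny illustrative stop word list; a real one would be much longer.
var greekStopWords = ['και', 'το', 'η', 'ο', 'να', 'του', 'της']

// Assumption: returning undefined tells the pipeline to drop the token.
var greekStopWordFilter = function (token) {
  if (greekStopWords.indexOf(token) === -1) return token
  return undefined
}
```

Both functions would still need to be registered with lunr.Pipeline.registerFunction, in the same way as the stemmer, before being added to an index's pipeline.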

So, to set up lunr for Greek you would have something like this:

var idx = lunr(function () {
  this.pipeline.after(lunr.stemmer, greekStemmer)
  this.pipeline.remove(lunr.stemmer)

  this.pipeline.after(lunr.trimmer, greekTrimmer)
  this.pipeline.remove(lunr.trimmer)

  this.pipeline.after(lunr.stopWordFilter, greekStopWordFilter)
  this.pipeline.remove(lunr.stopWordFilter)

  // define the index as normal
  this.ref('id')
  this.field('title')
  this.field('body')
})

For more inspiration, take a look at the excellent lunr-languages project, which has many examples of language extensions for lunr. You could even submit one for Greek!

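EDIT: It looks like I don't know the lunr.Pipeline API as well as I thought. There is no replace function; instead, we insert the replacement after the function we want to remove, and then remove that function.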
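If that insert-then-remove pair is needed in several places, it could be wrapped in a small helper. `replacePipelineFunction` is a hypothetical name, not part of lunr; it only relies on the pipeline's `after` and `remove` methods.

```javascript
// Hypothetical helper, not part of lunr: emulates a "replace" by
// inserting the new function after the old one, then removing the old one.
function replacePipelineFunction(pipeline, existingFn, newFn) {
  pipeline.after(existingFn, newFn)
  pipeline.remove(existingFn)
}
```

Inside the index definition it would be called as, for example, `replacePipelineFunction(this.pipeline, lunr.stemmer, greekStemmer)`.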

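EDIT: Adding this to help others in the future... It turns out the problem was down to the casing of tokens within lunr. lunr wants to treat all tokens as lowercase; this is done in the tokenizer, without any configurability. For most language processing functions this isn't an issue; indeed, most of them assume words are lowercased. In this case, though, the Greek stemmer only stems uppercase words, due to the complexity of stemming in Greek (I'm not a Greek speaker, so I can't comment on how much more complex that stemming is). A solution is to convert the token to uppercase before calling the Greek stemmer, and then convert the result back to lowercase before passing it on to the rest of the pipeline.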
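The workaround described above could be sketched as follows. The `stemWord` here is only a stand-in that mimics a stemmer which ignores anything that is not uppercase; it strips two endings for the sake of the demonstration and is not a real Greek stemmer. The real implementation from the question would be used in its place.

```javascript
// Stand-in for the real stemmer (hypothetical): like the one in the
// question, it leaves input untouched unless it is entirely uppercase.
function stemWord(w) {
  if (w !== w.toUpperCase()) return w
  return w.replace(/(ΟΣ|ΕΣ)$/, '')
}

// lunr's tokenizer hands the pipeline lowercase tokens, so convert to
// uppercase for the stemmer and back to lowercase for the rest of the pipeline.
var greekStemmer = function (token) {
  return stemWord(token.toUpperCase()).toLowerCase()
}

console.log(stemWord('καποιος'))     // καποιος (lowercase input is not stemmed)
console.log(greekStemmer('καποιος')) // καποι
console.log(greekStemmer('καποιες')) // καποι
```

With the wrapper in place, the first two titles from the question reduce to the same token, which is what lets a search for one form find the other.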

Tags: javascript, full-text-search, static-site, non-latin, lunrjs