How can I solve the Turkish letter issue in Elasticsearch by using C# NEST?

Question

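Turkish has letters such as 'ğ', 'ü', 'ş', 'ı', 'ö', 'ç'. When we search, however, we usually type 'g', 'u', 's', 'i', 'o', 'c' instead. This is not a rule, just a habit we are used to. For example, if I type an uppercase "Ş", the search should match both "ş" and "s". The linked question describes the same problem, but its solution is too long and not perfect. What should I do?

My goals are:

ProductName or Category.CategoryName may contain Turkish letters ("Eşarp"), or may be mistyped and written with English letters ("Esarp").
The query string may contain Turkish letters ("eşarp") or not ("esarp").
The query string may contain multiple words.
Every indexed string field should be searched against the query string (full-text search).

My code is: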


try
{
    var node = new Uri(ConfigurationManager.AppSettings["elasticseachhost"]);
    var settings = new ConnectionSettings(node);
    settings.DefaultIndex("defaultindex").MapDefaultTypeIndices(m => m.Add(typeof(Customer), "myindex"));
    var client = new ElasticClient(settings);

    string command = Resource1.GetAllData;
    using (var ctx = new SearchEntities())
    {
        Console.WriteLine("Oracle db is connected...");
        var customers = ctx.Database.SqlQuery<Customer>(command).ToList();
        Console.WriteLine("Customer count : {0}", customers.Count);
        if (customers.Count > 0)
        {
            // Drop the old index, then re-index every customer one by one.
            var delete = client.DeleteIndex(new DeleteIndexRequest("myindex"));
            foreach (var customer in customers)
            {
                client.Index(customer, idx => idx.Index("myindex"));
                Console.WriteLine("Data is indexed in elasticSearch engine");
            }
        }
    }
}
catch (Exception ex)
{
    Trace.WriteLine(ex.Message);
    Console.WriteLine(ex.Message);
}

My entity:


public class Customer
{
    public string Name { get; set; }
    public string SurName { get; set; }
    public string Address { get; set; }
}

I think the solution I want is this one: (Create index with multi field mapping syntax with NEST 2.x)

But I cannot understand it.


Check this out:

[Nest.ElasticsearchType]
public class MyType
{
    // Index this & allow for retrieval.
    [Nest.Number(Store=true)]
    int Id { get; set; }

    // Index this & allow for retrieval.
    // **Also**, in my searching & sorting, I need to sort on this **entire** field, not just individual tokens.
    [Nest.String(Store = true, Index=Nest.FieldIndexOption.Analyzed, TermVector=Nest.TermVectorOption.WithPositionsOffsets)]
    string CompanyName { get; set; }

    // Don't index this for searching, but do store for display.
    [Nest.Date(Store=true, Index=Nest.NonStringIndexOption.No)]
    DateTime CreatedDate { get; set; }

    // Index this for searching BUT NOT for retrieval/displaying.
    [Nest.String(Store=false, Index=Nest.FieldIndexOption.Analyzed)]
    string CompanyDescription { get; set; }

    [Nest.Nested(Store=true, IncludeInAll=true)]
    // Nest this.
    List<MyChildType> Locations { get; set; }
}

[Nest.ElasticsearchType]
public class MyChildType
{
    // Index this & allow for retrieval.
    [Nest.String(Store=true, Index = Nest.FieldIndexOption.Analyzed)]
    string LocationName { get; set; }

    // etc. other properties.
}
After this declaration, to create this mapping in elasticsearch you need to make a call similar to:

var mappingResponse = elasticClient.Map<MyType>(m => m.AutoMap());

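My second attempt at the problem above. Error: Analysis is not found. The big problem is version differences. Many of the samples I found produce errors like: 'CreateIndexDescriptor' does not contain a definition for 'Analysis'...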


using Nest;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ElasticSearchTest2
{
    class Program
    {
        public static Uri EsNode;
        public static ConnectionSettings EsConfig;
        public static ElasticClient client;
        static void Main(string[] args)
        {
            EsNode = new Uri("http://localhost:9200/");
            EsConfig = new ConnectionSettings(EsNode);
            client = new ElasticClient(EsConfig);

            var partialName = new CustomAnalyzer
            {
                Filter = new List<string> { "lowercase", "name_ngrams", "standard", "asciifolding" },
                Tokenizer = "standard"
            };

            var fullName = new CustomAnalyzer
            {
                Filter = new List<string> { "standard", "lowercase", "asciifolding" },
                Tokenizer = "standard"
            };

            client.CreateIndex("employeeindex5", c => c
                            .Analysis(descriptor => descriptor
                                .TokenFilters(bases => bases.Add("name_ngrams", new EdgeNGramTokenFilter
                                {
                                    MaxGram = 20,
                                    MinGram = 2,
                                    Side = "front"
                                }))
                                .Analyzers(bases => bases
                                    .Add("partial_name", partialName)
                                    .Add("full_name", fullName))
                            )
                            .AddMapping<Employee>(m => m
                                .Properties(o => o
                                    .String(i => i
                                        .Name(x => x.Name)
                                        .IndexAnalyzer("partial_name")
                                        .SearchAnalyzer("full_name")
                                    ))));

            Employee emp = new Employee() { Name = "yılmaz", SurName = "eşarp" };
            client.Index(emp, idx => idx.Index("employeeindex5"));
            Employee emp2 = new Employee() { Name = "ayşe", SurName = "eşarp" };
            client.Index(emp2, idx => idx.Index("employeeindex5"));
            Employee emp3 = new Employee() { Name = "ömer", SurName = "eşarp" };
            client.Index(emp3, idx => idx.Index("employeeindex5"));
            Employee emp4 = new Employee() { Name = "gazı", SurName = "emir" };
            client.Index(emp4, idx => idx.Index("employeeindex5"));
        }
    }

    public class Employee
    {

        public string Name { set; get; }
        public string SurName { set; get; }


    }
}

Answer 1

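A simple solution is to use a technique called Unicode decomposition. The character Ş can be split into an ASCII S followed by a combining diacritic. When searching, you perform the following steps:

Decompose the string.
Remove all diacritics.
Convert all remaining characters to lowercase.
Compare against the search keyword, transformed the same way.

Specifically, you need FormD decomposition, and you remove the combining diacritics by looking at their UnicodeCategory. You can also use the UnicodeCategory to remove spaces and other punctuation.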
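The steps above can be sketched in plain .NET as follows. This is a minimal illustration, not the answerer's own code: the class and method names (`TurkishFolding`, `Fold`) are hypothetical. One caveat the decomposition approach does not cover by itself is the dotless 'ı', which has no canonical decomposition, so the sketch maps it to 'i' by hand.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper, named for illustration only.
public static class TurkishFolding
{
    // Folds text to a lowercase, diacritic-free form suitable for comparison.
    public static string Fold(string input)
    {
        if (string.IsNullOrEmpty(input)) return string.Empty;

        // Step 1: canonical decomposition, e.g. "ş" becomes "s" + U+0327.
        string decomposed = input.Normalize(NormalizationForm.FormD);

        var sb = new StringBuilder(decomposed.Length);
        foreach (char ch in decomposed)
        {
            // Step 2: drop the combining marks left behind by the decomposition.
            if (CharUnicodeInfo.GetUnicodeCategory(ch) == UnicodeCategory.NonSpacingMark)
                continue;

            // Dotless ı (U+0131) does not decompose, so map it explicitly.
            sb.Append(ch == 'ı' ? 'i' : ch);
        }

        // Step 3: culture-independent lowercasing, so "I" becomes "i"
        // even when the current culture is Turkish.
        return sb.ToString().ToLowerInvariant();
    }
}
```

Step 4 is then an ordinary comparison of folded values: `TurkishFolding.Fold("Eşarp") == TurkishFolding.Fold("esarp")` holds because both sides fold to "esarp". Using `ToLowerInvariant` rather than `ToLower` is deliberate, since Turkish culture rules would lowercase "I" to "ı".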

Answer 2

What you want is the ASCII Folding Token Filter. This is quoted from the official Elasticsearch page:

A token filter of type asciifolding that converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.

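This means it converts characters such as ç into plain Latin characters (here, the letter c), because that is the closest match among the standard ASCII characters.

So you can have a value like çar, and as long as the same token filter is applied at search time, searching for either car or çar will return the result you expect.

As an example, you can try the following call.

Execute this POST request against your Elasticsearch instance: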

URL:

http://YOUR_ELASTIC_SEARCH_INSTANCE_URL/_analyze/

Request body: { "tokenizer": "standard", "filter": [ "lowercase", "asciifolding" ], "text": "déja öne ğuess" }

The result is as follows:

{
  "tokens": [
    {
      "token": "deja",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "one",
      "start_offset": 5,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "guess",
      "start_offset": 9,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

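Note that the token property (the text Elasticsearch actually indexes and processes) is the ASCII-folded form of the original text: the accented letters have been replaced by their plain English counterparts.

To learn more about the ASCII Folding Token Filter, see this link: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-asciifolding-tokenfilter.html

Note: to use this technique, you need to create your own analyzer.

This is quoted from the official page on custom analyzers: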

When the built-in analyzers do not fulfill your needs, you can create a custom analyzer which uses the appropriate combination of:

zero or more character filters

a tokenizer

zero or more token filters.

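More information on creating custom analyzers can be found here: https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer-anatomy.html

You can also find an example of how to create a custom analyzer with NEST in the following answer: Create custom token filter with NEST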
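For orientation, here is a rough sketch of what such a custom analyzer could look like for the question's Employee index. It assumes NEST 2.x, where the analysis configuration sits under Settings(...) and mappings under Mappings(...), which would explain the "does not contain a definition for 'Analysis'" error quoted in the question. The index, analyzer, and filter names are taken from the question; the wrapper class `IndexSetup` is hypothetical. It has not been run against a live cluster, and later NEST versions differ (for example, string fields are mapped with Text instead of String), so treat it as a starting point rather than a verified implementation.

```csharp
using System;
using Nest; // assumption: the NEST 2.x NuGet package

public class Employee
{
    public string Name { get; set; }
    public string SurName { get; set; }
}

// Hypothetical wrapper, named for illustration only.
public static class IndexSetup
{
    public static void CreateEmployeeIndex(ElasticClient client)
    {
        var response = client.CreateIndex("employeeindex5", c => c
            .Settings(s => s
                .Analysis(a => a
                    // Edge n-grams of 2 to 20 characters, for prefix matching.
                    .TokenFilters(tf => tf
                        .EdgeNGram("name_ngrams", e => e.MinGram(2).MaxGram(20)))
                    .Analyzers(an => an
                        // Index time: lowercase, fold to ASCII, then build n-grams.
                        .Custom("partial_name", ca => ca
                            .Tokenizer("standard")
                            .Filters("lowercase", "asciifolding", "name_ngrams"))
                        // Search time: same folding, but no n-grams.
                        .Custom("full_name", ca => ca
                            .Tokenizer("standard")
                            .Filters("lowercase", "asciifolding")))))
            .Mappings(m => m
                .Map<Employee>(mm => mm
                    .Properties(p => p
                        .String(st => st
                            .Name(e => e.Name)
                            .Analyzer("partial_name")
                            .SearchAnalyzer("full_name"))))));

        // Surface mapping or analysis errors instead of failing silently.
        if (!response.IsValid)
            Console.WriteLine(response.DebugInformation);
    }
}
```

Because both analyzers apply lowercase and asciifolding, an indexed "eşarp" and a query for "esarp" or "Eşarp" should reduce to the same tokens. The other string fields (SurName here) would need the same mapping to be searchable in the same way.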

c#

full-text-indexing

elasticsearch

nest