SOLR 和重音字符

SOLR and accented characters

我有一个职业索引(标识符+职业):

<field name="occ_id" type="int" indexed="true" stored="true" required="true" />
<field name="occ_tx_name" type="text_es" indexed="true" stored="true" multiValued="false" />


<!-- Spanish -->
<fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">
  <analyzer> 
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_es.txt" format="snowball" />
    <filter class="solr.SpanishLightStemFilterFactory"/>
  </analyzer>
</fieldType>

这是一个真实的查询,针对三个标识符(1、195 和 129):

curl -X GET "http://192.168.1.11:8983/solr/cyp_occupations/select?indent=on&q=occ_id:1+occ_id:195+occ_id:129&wt=json"
{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"occ_id:1 occ_id:195 occ_id:129",
      "indent":"on",
      "wt":"json"}},
  "response":{"numFound":3,"start":0,"docs":[
      {
        "occ_id":1,
        "occ_tx_name":"Abogado",
        "_version_":1565225103805906944},
      {
        "occ_id":129,
        "occ_tx_name":"Informático",
        "_version_":1565225103843655680},
      {
        "occ_id":195,
        "occ_tx_name":"Osteópata",
        "_version_":1565225103858335746}]
  }}

其中两个有重音字符,一个没有。因此,让我们在不使用重音符号的情况下按 occ_tx_name 进行搜索:

curl -X GET "http://192.168.1.11:8983/solr/cyp_occupations/select?indent=on&q=occ_tx_name:abogado&wt=json"
{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"occ_tx_name:abogado",
      "indent":"on",
      "wt":"json"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "occ_id":1,
        "occ_tx_name":"Abogado",
        "_version_":1565225103805906944}]
  }}

curl -X GET "http://192.168.1.11:8983/solr/cyp_occupations/select?indent=on&q=occ_tx_name:informatico&wt=json"
{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"occ_tx_name:informatico",
      "indent":"on",
      "wt":"json"}},
  "response":{"numFound”:1,”start":0,"docs":[
      {
        "occ_id":129,
        "occ_tx_name":"Informático",
        "_version_":1565225103843655680}]
  }}


curl -X GET "http://192.168.1.11:8983/solr/cyp_occupations/select?indent=on&q=occ_tx_name:osteopata&wt=json"
{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"occ_tx_name:osteopata",
      "indent":"on",
      "wt":"json"}},
  "response":{"numFound":0,"start":0,"docs":[]
  }}

最后一次搜索“osteopata”失败,而“informatico”成功,这让我很恼火。索引的源数据是一个简单的 MySQL table:

-- -----------------------------------------------------
-- Table `mydb`.`occ_occupation`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `mydb`.`occ_occupation` (
  `occ_id` INT UNSIGNED NOT NULL,
  `occ_tx_name` VARCHAR(255) NOT NULL,
  PRIMARY KEY (`occ_id`)
ENGINE = InnoDB

table 的排序规则是“utf8mb4_general_ci”。索引是使用 DataImportHandler 创建的。这是定义:

<dataConfig>
    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://192.168.1.11:3306/mydb"
        user=“mydb” password=“mydb” />
    <document name="occupations">
        <entity name="occupation" pk="occ_id"
            query="SELECT occ.occ_id, occ.occ_tx_name FROM occ_occupation occ WHERE occ.sta_bo_deleted = false">
            <field column="occ_id" name="occ_id" />
            <field column="occ_tx_name" name="occ_tx_name" />
        </entity>
    </document>
</dataConfig> 

我需要一些线索来检测问题。谁能帮我?提前致谢。

我认为 mysql 或您的 jvm 设置与此无关。我怀疑一个有效,另一个可能由于 SpanishLightStemFilterFactory 而无效。

无论变音符号如何实现匹配的正确方法是使用以下内容:

  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>

将其放在索引和查询分析器链中的分词器之前,任何变音符号都应转换为 ascii 版本。这将使它始终有效。

只需将 solr.ASCIIFoldingFilterFactory 添加到您的过滤器分析器链中,甚至更好地创建一个新的 fieldType:

<!-- Spanish -->
<fieldType name="text_es_ascii_folding" class="solr.TextField" positionIncrementGap="100">
  <analyzer> 
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_es.txt" format="snowball" />
    <filter class="solr.SpanishLightStemFilterFactory"/>
  </analyzer>
</fieldType>

This filter converts alphabetic, numeric, and symbolic Unicode characters which are not in the Basic Latin Unicode block (the first 127 ASCII characters) to their ASCII equivalents, if one exists.

即使重音字符缺失,这也能让您匹配搜索。 缺点是像 "cañon" 和 "canon" 这样的词现在是等价的并且都命中相同的文档 IIRC.

好的,我发现了源问题。我已经在十六进制模式下使用 VI 打开 SQL 加载脚本。

这是 INSERT 语句中 'Agrónomo' 的十六进制内容:41 67 72 6f cc 81 6e 6f 6d 6f.

6f cc 81!!!! This is "o COMBINING ACUTE ACCENT" UTF code!!!!

所以这就是问题所在...一定是 "c3 b3"...我从网页上获取文字 copy/pasting,所以来源上的源字符是问题所在。

谢谢你们,因为我对SOLR的灵魂有了更多的了解。

此致。