Custom tokenizer not generating tokens as expected if the text contains special characters like # or @
I have defined the following tokenizer:
PUT /testanlyzer2
{
"settings" : {
"analysis" : {
"analyzer" : {
"my_ngram_analyzer" : {
"tokenizer" : "my_ngram_tokenizer"
}
},
"tokenizer" : {
"my_ngram_tokenizer" : {
"type" : "nGram",
"min_gram" : "1",
"max_gram" : "3",
"token_chars": [ "letter", "digit","symbol","currency_symbol","modifier_symbol","other_symbol" ]
}
}
}
}
}
For the following request:
GET /testanlyzer2/_analyze?analyzer=my_ngram_analyzer&text="i a#m not available 9177"
Result is:
{
"tokens": [
{
"token": "i",
"start_offset": 1,
"end_offset": 2,
"type": "word",
"position": 1
},
{
"token": "a",
"start_offset": 3,
"end_offset": 4,
"type": "word",
"position": 2
}
]
}
For the following request:
GET /testanlyzer2/_analyze?analyzer=my_ngram_analyzer&text="i a@m not available 9177"
Result is:
Request failed to get to the server (status code: 0):
The expected result should contain these special characters (@, #, currency symbols, etc.) as tokens. Please correct me if anything is wrong with my custom tokenizer.
-- Thanks
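To make the expectation concrete, here is a minimal sketch (my own illustration, not part of the analyzer config) of the character n-grams that min_gram 1 / max_gram 3 should produce over a word such as a@m, assuming the symbol character class is kept:

```python
def char_ngrams(word, min_gram=1, max_gram=3):
    """Generate all character n-grams of lengths min_gram..max_gram."""
    grams = []
    for n in range(min_gram, max_gram + 1):
        for i in range(len(word) - n + 1):
            grams.append(word[i:i + n])
    return grams

# If the tokenizer keeps symbol characters, "a@m" should yield grams
# that include "@" itself as a token.
print(char_ngrams("a@m"))  # → ['a', '@', 'm', 'a@', '@m', 'a@m']
```

So the "@" and "#" characters are expected to show up both as standalone 1-grams and inside longer grams.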
# is a special character in Sense (if you are using Marvel's Sense dashboard): it comments out the rest of the line. To rule out any HTML escaping or Sense special-character issues, I would test it like this:
PUT /testanlyzer2
{
"settings": {
"analysis": {
"analyzer": {
"my_ngram_analyzer": {
"tokenizer": "keyword",
"filter": [
"substring"
]
}
},
"filter": {
"substring": {
"type": "nGram",
"min_gram": 1,
"max_gram": 3
}
}
}
},
"mappings": {
"test": {
"properties": {
"text": {
"type": "string",
"analyzer": "my_ngram_analyzer"
}
}
}
}
}
POST /testanlyzer2/test/1
{
"text": "i a@m not available 9177"
}
POST /testanlyzer2/test/2
{
"text": "i a#m not available 9177"
}
GET /testanlyzer2/test/_search
{
"fielddata_fields": ["text"]
}
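One further note, as an assumption on my part rather than something confirmed above: even outside Sense, a literal # in a URL starts the fragment and is never sent to the server, so when you pass the text in a query string it should be percent-encoded. A minimal sketch using Python's standard library:

```python
from urllib.parse import quote

text = "i a#m not available 9177"

# Percent-encode everything that is not URL-safe, including '#' (%23)
# and spaces (%20), so the full text reaches the server intact.
encoded = quote(text, safe="")
print(encoded)  # i%20a%23m%20not%20available%209177

# Hypothetical request path built from the encoded text:
url = "/testanlyzer2/_analyze?analyzer=my_ngram_analyzer&text=" + encoded
print(url)
```

With the text encoded this way, the _analyze request should see the whole string rather than silently truncating at the # character.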