Sqlite FTS5 punctuation marks not working in select query
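I am using sqlite for full-text search, and below are some examples of the select queries I am running.
For example:
SELECT * FROM table WHERE table MATCH 'column:father's' ORDER BY rank;
SELECT * FROM table WHERE table MATCH 'column:example:' ORDER BY rank;
SELECT * FROM table WHERE table MATCH 'column:month&' ORDER BY rank;
Because the search text contains the characters ' : &, these queries give me errors. I also tried putting an escape character (\, backslash) before the punctuation marks.
Is there any way to search for punctuation marks (, . / " ' - & etc.) in fts5 using the MATCH operator?
These characters do work with the MATCH operator: _, €, £, ¥
Thanks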
This seems to be a duplicate of this question. Try the top answer there, which says you should wrap the search string in both single and double quotes.
-- father's
SELECT * FROM table WHERE table MATCH 'column:"father''s"';
-- example:
SELECT * FROM table WHERE table MATCH 'column:"example:"';
-- month&
SELECT * FROM table WHERE table MATCH 'column:"month&"';
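When the search text comes from user input, the same quoting can be done in code. The following is a minimal sketch in Python, assuming the standard sqlite3 module is linked against a SQLite build with FTS5 enabled. The table name docs, the column name body, and the helpers fts5_quote and search are hypothetical names used only for illustration. Binding the query as a parameter means the single quote no longer needs to be doubled at the SQL level; only double quotes inside the FTS5 string need doubling.

```python
import sqlite3


def fts5_quote(text: str) -> str:
    """Wrap text in double quotes for FTS5, doubling embedded double quotes."""
    return '"' + text.replace('"', '""') + '"'


conn = sqlite3.connect(":memory:")
# Hypothetical table with the default tokenizer (requires FTS5 support).
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
conn.executemany(
    "INSERT INTO docs (body) VALUES (?)",
    [("father's",), ("example:",), ("month&",)],
)


def search(user_text: str) -> list:
    # Build a column-filtered FTS5 query such as body:"father's".
    query = "body:" + fts5_quote(user_text)
    rows = conn.execute(
        "SELECT body FROM docs WHERE docs MATCH ? ORDER BY rank", (query,)
    )
    return [row[0] for row in rows]


for text in ("father's", "example:", "month&"):
    print(text, "->", search(text))
```

With the default tokenizer the punctuation is still discarded when the quoted string is tokenized, so this only avoids the syntax errors; it does not make the punctuation itself searchable.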
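I wanted to see a complete example, because I have found it easy to get subtle and unexpected results with fts5.
First, while wrapping your search string in quotes might give you the correct answer, it might not be what you actually want. Here is an example to illustrate: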
$ sqlite3 ":memory:"
sqlite> CREATE VIRTUAL TABLE IF NOT EXISTS bad USING fts5(term, tokenize="unicode61");
sqlite>
sqlite> INSERT INTO bad (term) VALUES ('father''s');
sqlite>
sqlite> SELECT * from bad WHERE term MATCH 'father';
father's
sqlite> SELECT * from bad WHERE term MATCH '"father''s"';
father's
sqlite> SELECT * from bad WHERE term MATCH 's';
father's
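Notice how s matches father's as well? That is because when you run father's through the tokenizer, the apostrophe acts as a separator by default. The FTS5 documentation gives the following rules for which characters can form a bareword: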
An FTS5 bareword is a string of one or more consecutive characters
that are all either:
- Non-ASCII range characters (i.e. unicode codepoints greater than 127), or
- One of the 52 upper and lower case ASCII characters, or
- One of the 10 decimal digit ASCII characters, or
- The underscore character (unicode codepoint 96).
- The substitute character (unicode codepoint 26).
So father's will be tokenized into father and s, which may or may not be what you want, but for the sake of this answer I will assume it is not.
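One way to check what the tokenizer actually stored is to read the index back through an fts5vocab table. This is a small sketch in Python, assuming an FTS5-enabled sqlite3 module; the vocabulary table name bad_vocab is a hypothetical name chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same table as in the shell session above, default unicode61 tokenizer.
conn.execute("CREATE VIRTUAL TABLE bad USING fts5(term, tokenize='unicode61')")
conn.execute("INSERT INTO bad (term) VALUES (?)", ("father's",))

# fts5vocab exposes the terms stored in an FTS5 index, one row per term.
conn.execute("CREATE VIRTUAL TABLE bad_vocab USING fts5vocab(bad, row)")
tokens = [row[0] for row in conn.execute("SELECT term FROM bad_vocab ORDER BY term")]
print(tokens)
```

The list contains father and s as two separate terms, which is why a search for s alone matches the row.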
So how do you tell the tokenizer to keep father's together? By using the tokenchars option of the tokenize parameter:
tokenchars This option is used to specify additional unicode characters that should be considered token characters, even if they are white-space or punctuation characters according to Unicode 6.1. All characters in the string that this option is set to are considered token characters.
Let's look at another example, this time using tokenchars:
$ sqlite3 ":memory:"
sqlite> CREATE VIRTUAL TABLE IF NOT EXISTS good USING fts5(term, tokenize="unicode61 tokenchars '''&:'");
sqlite>
sqlite> INSERT INTO good (term) VALUES ('father''s');
sqlite> INSERT INTO good (term) VALUES ('month&');
sqlite> INSERT INTO good (term) VALUES ('example:');
sqlite>
sqlite> SELECT count(*) from good WHERE term MATCH 'father';
0
sqlite> SELECT count(*) from good WHERE term MATCH '"father''s"';
1
sqlite> SELECT count(*) from good WHERE term MATCH 'example';
0
sqlite> SELECT count(*) from good WHERE term MATCH '"example:"';
1
sqlite> SELECT count(*) from good WHERE term MATCH 'month';
0
sqlite> SELECT count(*) from good WHERE term MATCH '"month&"';
1
Those results look much more like what you would expect. But what about the stray s result from the first example?
sqlite> SELECT count(*) from good WHERE term MATCH 's';
0
Great!
Hopefully this helps you set up your table the way you intended.
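For completeness, here is a sketch of the same tokenchars setup driven from Python rather than the sqlite3 shell, assuming an FTS5-enabled sqlite3 module. The helpers fts5_quote and count_matches are hypothetical names for illustration; the table definition mirrors the good table above.

```python
import sqlite3


def fts5_quote(text: str) -> str:
    """Wrap text in double quotes for FTS5, doubling embedded double quotes."""
    return '"' + text.replace('"', '""') + '"'


conn = sqlite3.connect(":memory:")
# Treat ' & and : as token characters, as in the "good" table above.
conn.execute("""
    CREATE VIRTUAL TABLE good
    USING fts5(term, tokenize="unicode61 tokenchars '''&:'")
""")
conn.executemany(
    "INSERT INTO good (term) VALUES (?)",
    [("father's",), ("month&",), ("example:",)],
)


def count_matches(fts_query: str) -> int:
    # The FTS5 query is bound as a parameter, so no SQL-level escaping is needed.
    row = conn.execute(
        "SELECT count(*) FROM good WHERE good MATCH ?", (fts_query,)
    ).fetchone()
    return row[0]


for text in ("father's", "example:", "month&"):
    print(text, count_matches(fts5_quote(text)))
for text in ("father", "s"):
    print(text, count_matches(text))
```

The quoted searches each find one row, while father and s on their own find none, matching the shell session. Note that tokenchars changes how the index is built, so it has to be chosen when the table is created.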