MySQL REGEXP to check a string contains all these characters in any order

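I'm trying to write a query that searches a word list. One of the conditions is to find words that contain, in any order, all of the characters in a given string.

For example, the word must contain both 'o' and 'd' in any order, so 'ABDOMEN' and 'ABOUND' are both valid.

My query is: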

SELECT word
FROM words
WHERE lower(word) like 'ab%'                   /* Words starts with AB               */
AND   REGEXP_INSTR(lower(word), '[str]') = 0   /* does not contain any of r, s or t  */
AND   REGEXP_INSTR(lower(word), '[od]') > 0    /* must  contain both o and d         */

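The problem is the 'must contain' condition, specifically getting it to check for both 'O' and 'D'; the pattern above behaves more like 'O' or 'D'.

Experimenting, I found that this works: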

AND   REGEXP_INSTR(lower(word), '(o.*d|d.*o)' ) > 0    /* must  contain both o and d         */

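The problem is that I have to generate (o.*d|d.*o) from the original od in PHP. The alternation needs one branch per ordering of the letters (6 branches for 3 letters, 24 for 4), so generating it becomes difficult once the list grows beyond 3 characters.

Another approach is to add a separate condition for each character in the 'must contain' list: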

AND   INSTR(lower(word), 'o' ) > 0    /* must  contain o          */
AND   INSTR(lower(word), 'd' ) > 0    /* must  contain d         */

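However, passing these through bind_param in PHP makes the code messy.

Is there a 'one-liner' in MySQL that achieves the above?

A set of letters can be handled as in this example; the order in which they appear is ignored: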

WHERE REGEXP_INSTR(lower(word), '(?=.*o)(?=.*d)')
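
Because each required letter contributes exactly one lookahead, the pattern can be built from the letter list by plain concatenation, with no permutations involved. Here is a minimal sketch in Python (the question uses PHP, where the same concatenation would work). The function name build_lookahead_pattern is hypothetical, and Python's re module is used only to sanity-check the pattern locally; MySQL 8 uses ICU regular expressions, which also support lookaheads.

```python
import re


def build_lookahead_pattern(letters: str) -> str:
    """Return one positive lookahead per distinct letter, e.g. 'od' -> '(?=.*o)(?=.*d)'."""
    # dict.fromkeys removes duplicates while keeping first-seen order.
    # re.escape applies Python's escaping rules; for plain letters it is a no-op.
    unique = dict.fromkeys(letters.lower())
    return "".join(f"(?=.*{re.escape(ch)})" for ch in unique)


pattern = build_lookahead_pattern("od")

# Check the pattern locally against lower-cased sample words.
samples = ["ABDOMEN", "ABOUND", "ABOUT", "ABIDE"]
matches = [w for w in samples if re.search(pattern, w.lower())]
print(pattern, matches)
```

The resulting string can then be passed to the query as a single bound parameter, however many letters are in the list.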

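Whether the match is case-sensitive is determined by the collation of the column. Unless you have a specific reason to use a case-sensitive collation, I suggest changing it to a case-insensitive one, so that you don't need to force the case explicitly. Instead of applying another function to every word, you can make REGEXP_INSTR case-insensitive by passing 'i' as its match_type argument. You can also move the prefix check into the regular expression: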

SELECT word
FROM words
WHERE REGEXP_INSTR(word, '(?=^ab)(?=.*o)(?=.*d)', 1, 1, 0, 'i');

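Of course, the query above cannot use any available index for filtering, so moving the prefix into the regular expression is not a good idea. This led me to run some tests. I took a reduced copy of a dictionary to create the following table (111,745 rows):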

CREATE TABLE `words` (
  `word_cs` varchar(25) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_as_cs NOT NULL,
  `word_ci` varchar(25) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_as_ci NOT NULL,
  KEY `idx_word_ci` (`word_ci`),
  KEY `idx_word_cs` (`word_cs`)
) ENGINE=InnoDB;

I ran the first batch of queries against the case-sensitive word_cs. I ran each query five times, and the times shown are averages:

SELECT word_cs
FROM words
WHERE REGEXP_INSTR(word_cs, '(?=^ab)(?=.*o)(?=.*d)', 1, 1, 0, 'i');
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.166s */

SELECT word_cs
FROM words
WHERE REGEXP_INSTR(lower(word_cs), '(?=^ab)(?=.*o)(?=.*d)');
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.193s */

SELECT word_cs
FROM words
WHERE lower(word_cs) LIKE 'ab%'
AND REGEXP_INSTR(word_cs, '(?=.*o)(?=.*d)', 1, 1, 0, 'i');
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.067s */

SELECT word_cs
FROM words
WHERE lower(word_cs) LIKE 'ab%'
AND REGEXP_INSTR(lower(word_cs), '(?=.*o)(?=.*d)');
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.065s */

SELECT word_cs
FROM words
WHERE lower(word_cs) LIKE 'ab%'
AND INSTR(lower(word_cs), 'o' ) > 0
AND INSTR(lower(word_cs), 'd' ) > 0;
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.064s */

SELECT word_cs
FROM words
WHERE lower(word_cs) LIKE 'ab%'
AND lower(word_cs) LIKE '%o%'
AND lower(word_cs) LIKE '%d%';
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.063s */

Then I ran a similar batch of queries (slightly modified because of the case-insensitivity) against the case-insensitive word_ci:

SELECT word_ci
FROM words
WHERE REGEXP_INSTR(word_ci, '(?=^ab)(?=.*o)(?=.*d)', 1, 1, 0, 'i');
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.147s */

SELECT word_ci
FROM words
WHERE REGEXP_INSTR(word_ci, '(?=^ab)(?=.*o)(?=.*d)');
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.157s */

SELECT word_ci
FROM words
WHERE word_ci LIKE 'ab%'
AND REGEXP_INSTR(word_ci, '(?=.*o)(?=.*d)', 1, 1, 0, 'i');
/* Returned 58 rows; Examined 525 rows; Serverside execution time: 0.003s */

SELECT word_ci
FROM words
WHERE word_ci LIKE 'ab%'
AND REGEXP_INSTR(word_ci, '(?=.*o)(?=.*d)');
/* Returned 58 rows; Examined 525 rows; Serverside execution time: 0.003s */

SELECT word_ci
FROM words
WHERE word_ci LIKE 'ab%'
AND INSTR(word_ci, 'o' ) > 0
AND INSTR(word_ci, 'd' ) > 0;
/* Returned 58 rows; Examined 525 rows; Serverside execution time: 0.001s */

SELECT word_ci
FROM words
WHERE word_ci LIKE 'ab%'
AND word_ci LIKE '%o%'
AND word_ci LIKE '%d%';
/* Returned 58 rows; Examined 525 rows; Serverside execution time: 0.001s */

Average server-side execution time in seconds:

Query     word_cs   word_ci
Query 1   0.166     0.147
Query 2   0.193     0.157
Query 3   0.067     0.003
Query 4   0.065     0.003
Query 5   0.064     0.001
Query 6   0.063     0.001

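The real performance difference comes not from the case-sensitivity of the collation, or lack of it, but from whether the query is sargable. Using LOWER() on the column value prevents the index from being used.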

/* Full table scan */
SELECT word_cs
FROM words
WHERE lower(word_cs) LIKE 'ab%';

/* Uses index if available */
SELECT word_cs
FROM words
WHERE word_cs LIKE 'ab%'
OR word_cs LIKE 'AB%'
OR word_cs LIKE 'Ab%'
OR word_cs LIKE 'aB%';
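
The four LIKE conditions above are every upper/lower combination of the two prefix letters, so the list doubles with each extra letter (2^n conditions for an n-letter prefix). As a rough illustration, here is a sketch in Python that enumerates them; case_variants is a hypothetical helper name, and the SQL string is assembled by interpolation purely for display.

```python
from itertools import product


def case_variants(prefix: str) -> list:
    """Return every upper/lower-case combination of prefix, without duplicates."""
    # For each character, offer its upper and lower form; a character with
    # no case (e.g. a digit) contributes a single option.
    options = [sorted({ch.lower(), ch.upper()}) for ch in prefix]
    return ["".join(combo) for combo in product(*options)]


variants = case_variants("ab")

# Illustration only: real code should pass the patterns as bound parameters.
where = " OR ".join(f"word_cs LIKE '{v}%'" for v in variants)
print(where)
```

This growth is one more reason to prefer a case-insensitive collation on the column over spelling out the variants against a case-sensitive one.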

Collation matters. lower(word_cs) is not "sargable", so the index will not be used.

This should speed up the previous answer:

WHERE word LIKE 'ab%'
  AND REGEXP_INSTR(word, '(?=.*o)(?=.*d)')

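together with an index starting with word, and a _ci collation on word.

(Hopefully it will use the INDEX to quickly fetch the 'ab' words, then spend its time applying the regular expression to that subset, looking for 'o' and 'd'.)

Here is another idea (with the same index):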

SELECT word
    FROM tbl
    WHERE word LIKE 'ab%o%'
      AND word LIKE 'ab%d%'

For "none of [str]", simply add

     AND NOT word RLIKE '[str]'

Again, performance depends on the LIKE being done before the RLIKE.
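
Putting these pieces together on the application side, the whole filter needs only three bound parameters regardless of how many letters are in each list, which addresses the bind_param concern in the question. Here is a minimal sketch in Python (the question uses PHP, where the same three strings could be handed to bind_param). build_word_query is a hypothetical helper, the table and column names are taken from the examples above, and both letter lists are assumed to be non-empty and limited to plain a-z, so no regex or LIKE escaping is done.

```python
def build_word_query(prefix: str, must: str, must_not: str):
    """Return (sql, params) for prefix + must-contain-all + must-contain-none."""
    sql = (
        "SELECT word FROM words "
        "WHERE word LIKE ? "              # sargable prefix test, can use the index
        "AND REGEXP_INSTR(word, ?) > 0 "  # one lookahead per required letter
        "AND NOT word RLIKE ?"            # character class of forbidden letters
    )
    # Assumption: letters are plain a-z and both lists are non-empty.
    lookaheads = "".join(f"(?=.*{ch})" for ch in must)
    params = (prefix + "%", lookaheads, f"[{must_not}]")
    return sql, params


sql, params = build_word_query("ab", "od", "str")
print(sql)
print(params)
```

The SQL text stays fixed, so the statement can be prepared once and reused with different letter lists.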