Unicode support in regular expressions

Question

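I am trying to highlight (bold) the words in a string that match a search term. The function below works for English but does not support Unicode. I tried adding the u modifier to the regex for Unicode support, but it does not work for me.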

function highlight_term($text, $words)
{
    preg_match_all('~[A-Za-z0-9_äöüÄÖÜ]+~u', $words, $m);
    if( !$m )
    {
        return $text;
    }

    $re = '~(' . implode('|', $m[0]) . ')~i';
    return preg_replace($re, '<b>$0</b>', $text);
}

$str = "ह ट इ ड यन भ भ और द";

echo highlight_term($str, 'और');

Output

��

Expected output

ह ट इ ड यन भ भ और द

Answer 1

Fixing your current approach

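Note that you can change the first regex to ~[\p{L}\p{M}]+~u so that it matches all Unicode letters (with the u modifier, \p{L} is Unicode-aware and matches any Unicode letter) and diacritics (\p{M} matches combining marks), and add the u modifier to the second regex, the one passed to preg_replace: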

function highlight_term($text, $words)
{
    $i = preg_match_all('~[\p{L}\p{M}]+~u', $words, $m);
    if( $i == 0  )
    {
        return $text;
    }

    $re = '~' . implode('|', $m[0]) . '~iu';
    return preg_replace($re, '<b>$0</b>', $text);
}

$str = "ह ट इ ड यन भ भ और द";

echo highlight_term($str, 'और');

Result: ह ट इ ड यन भ भ <b>और</b> द

See the PHP demo.
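To see why \p{M} belongs in the character class, here is a minimal sketch. The sample word हिंदी is my own choice, not from the question, and is written with "\u{...}" escapes, which assumes PHP 7.0 or later. It compares \p{L}+ alone against [\p{L}\p{M}]+ on a word that contains vowel signs:

```php
<?php
// हिंदी as code points: HA, VOWEL SIGN I, ANUSVARA, DA, VOWEL SIGN II.
// The vowel signs and the anusvara are combining marks (\p{M}), not letters.
$word = "\u{0939}\u{093F}\u{0902}\u{0926}\u{0940}";

// Letters only: each match stops as soon as a combining mark follows.
preg_match_all('~\p{L}+~u', $word, $lettersOnly);

// Letters and marks together: the word is captured in one piece.
preg_match_all('~[\p{L}\p{M}]+~u', $word, $lettersAndMarks);

var_dump(count($lettersOnly[0]));            // int(2): only the two bare consonants
var_dump($lettersAndMarks[0][0] === $word);  // bool(true)
```

Without \p{M}, the search terms extracted from $words would be fragments of the original word, and the highlight would land in the wrong places.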

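You need the u modifier in the second regex as well, because the text you build the pattern from is Unicode and the string you match against is also Unicode. The outer parentheses in the second regex are not needed, since you are only interested in the whole match value (you replace with the $0 backreference).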
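The effect of the u modifier can be checked directly. The following is a small sketch, not part of the original function; the variable names are illustrative and the "\u{...}" escapes assume PHP 7.0 or later. It counts what "." matches with and without u, and shows case-insensitive matching of a non-ASCII letter under u:

```php
<?php
// और: two code points (U+0914, U+0930), six bytes in UTF-8.
$word = "\u{0914}\u{0930}";

// Without u the subject is treated as bytes, so "." matches one byte at a time.
$bytes = preg_match_all('~.~', $word, $byteMatches);

// With u the subject is treated as UTF-8, so "." matches one code point at a time.
$chars = preg_match_all('~.~u', $word, $charMatches);

var_dump($bytes); // int(6)
var_dump($chars); // int(2)

// With both i and u, case folding also applies to non-ASCII letters (é vs É).
$caseInsensitive = preg_match("~\u{00E9}~iu", "\u{00C9}");
var_dump($caseInsensitive); // int(1)
```

This is why a replacement built without u can split a multibyte character and produce broken output such as the �� shown in the question.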

A better approach

You can pass an array of words to the highlight function and match them only as whole words by using word boundaries, passing the regex straight to preg_replace:

function highlight_term($text, $words)
{
    return preg_replace('~\b(?:' . implode("|", $words) . ')\b~u', '<b>$0</b>', $text);
}

$str = "ह ट इ ड यन भ भ और द";
echo highlight_term($str, ['और','भ']);
// => ह ट इ ड यन <b>भ</b> <b>भ</b> <b>और</b> द

See this PHP demo.
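If the words come from user input, two things may be worth hardening. This is a sketch under my own assumptions, not part of the answer above, and highlight_terms_safe is a hypothetical name. First, preg_quote keeps regex metacharacters in a word from changing the pattern. Second, \b is defined in terms of \w, and as far as I know PCRE's Unicode \w does not cover spacing combining marks (\p{Mc}), so a word ending in a vowel sign may not get a boundary where you expect one. Lookarounds over an explicit character class avoid depending on that:

```php
<?php
// Hypothetical variant of highlight_term(): escapes every word and delimits
// whole words with lookarounds instead of \b.
function highlight_terms_safe(string $text, array $words): string
{
    // Drop empty entries so the alternation can never match the empty string.
    $words = array_filter($words, function ($w) {
        return $w !== '';
    });
    if (!$words) {
        return $text;
    }

    // Escape regex metacharacters; '~' is the delimiter used below.
    $alternatives = array_map(function ($w) {
        return preg_quote($w, '~');
    }, $words);

    // Assumption: a "word character" is a letter, combining mark, digit or underscore.
    $wordChar = '[\p{L}\p{M}\p{N}_]';
    $re = '~(?<!' . $wordChar . ')(?:' . implode('|', $alternatives) . ')(?!' . $wordChar . ')~u';

    // preg_replace() returns null on error; fall back to the unmodified text.
    return preg_replace($re, '<b>$0</b>', $text) ?? $text;
}

echo highlight_terms_safe("ह ट इ ड यन भ भ और द", ['और', 'भ']);
// => ह ट इ ड यन <b>भ</b> <b>भ</b> <b>और</b> द
```

The definition of a word character is a design choice; adjust the class if, for example, hyphenated terms should count as one word.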


