Converting regex to account for international characters

I currently have the following regex to validate a company name entered into a form:

$regexpRange = $min.','.$max;
$regexpPattern = '/^(?=[A-Za-z\d\'\s\,\.]{'.$regexpRange.'}$)(?=.*[a-z\d])[a-zA-Z\d]+[A-Za-z\d\'\s\,\.]+$/m';

I need to update it to allow international characters. I have zero experience with this.

Can someone help me understand how to tackle this?

Here are the required steps:

  • Use the u pattern option. This turns on both PCRE_UTF8 and PCRE_UCP (the PHP docs forgot to mention that one):

    PCRE_UTF8

    This option causes PCRE to regard both the pattern and the subject as strings of UTF-8 characters instead of single-byte strings. However, it is available only when PCRE is built to include UTF support. If not, the use of this option provokes an error. Details of how this option changes the behaviour of PCRE are given in the pcreunicode page.

    PCRE_UCP

    This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. By default, only ASCII characters are recognized, but if PCRE_UCP is set, Unicode properties are used instead to classify characters. More details are given in the section on generic character types in the pcrepattern page. If you set PCRE_UCP, matching one of the items it affects takes much longer. The option is available only if PCRE has been compiled with Unicode property support.

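  • \d will do fine with PCRE_UCP (it is then equivalent to \p{Nd}, i.e. any Unicode decimal digit), but you have to replace the [a-z] ranges so that accented characters are accounted for: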

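    • Replace [a-zA-Z] with \p{L}
    • Replace [a-z] with \p{Ll}
    • Replace [A-Z] with \p{Lu}

    \p{X} means: a character from Unicode category X, where L means letter, Ll means lowercase letter and Lu means uppercase letter. You can get a list from the docs.

    Note that you can use \p{X} inside a character class: for example [\p{L}\d\s]

  • Make sure your strings in PHP are encoded as UTF-8. Also make sure you use Unicode-aware functions to handle these strings.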
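
Putting the steps above together, the pattern from the question might end up looking roughly like this. Treat it as a sketch, not a drop-in replacement: the function names buildCompanyNamePattern and isValidCompanyName are made up for illustration, and it assumes the source file is saved as UTF-8 and that the input is NFC-normalized (a decomposed "e" + combining accent would be rejected, because combining marks are \p{M}, not \p{L}).

```php
<?php
// Hypothetical helper: Unicode-aware version of the pattern from the question.
// Changes: [A-Za-z] -> \p{L}, [a-z] -> \p{Ll}, and the u option is added.
function buildCompanyNamePattern(int $min, int $max): string
{
    $range = $min . ',' . $max;

    return '/^(?=[\p{L}\d\'\s,.]{' . $range . '}$)' // allowed characters, length check
         . '(?=.*[\p{Ll}\d])'                        // at least one lowercase letter or digit
         . '[\p{L}\d]+[\p{L}\d\'\s,.]+$/mu';         // must start with a letter or digit
}

// Hypothetical wrapper around preg_match().
function isValidCompanyName(string $name, int $min, int $max): bool
{
    // With the u option, preg_match() returns false (not 0) when the subject
    // is not valid UTF-8, so the strict comparison rejects such input as well.
    return preg_match(buildCompanyNamePattern($min, $max), $name) === 1;
}

var_dump(isValidCompanyName('Société Générale', 2, 50)); // bool(true)
var_dump(isValidCompanyName('ACME', 2, 50));             // bool(false): no lowercase letter or digit
var_dump(isValidCompanyName('Müller & Söhne', 2, 50));   // bool(false): & is not in the allowed set
var_dump(isValidCompanyName('Äö', 2, 2));                // bool(true): 2 characters, although 4 bytes
```

The last call shows why the u option matters for the length check too: in UTF-8 mode the {min,max} quantifier counts characters (code points) rather than bytes. If you measure the length elsewhere in PHP, mb_strlen() would be the matching choice over strlen(), assuming the mbstring extension is available.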