Design decision: Matching Cyrillic chars in JSON with PHP

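I'm developing a plugin for a CMS and have stumbled on an unexpected problem: because the plugin is multi-language-enabled, input can be in any Unicode character set. The plugin saves data in JSON format and contains objects with the properties value and lookup. For value everything is fine, but the lookup property is used by PHP to retrieve these entities, and at some points it passes through regexes (content filters). The problems are: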

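  1. For non-Latin characters (e.g. Экспорт), the \w (word-char) in a regex matches nothing. Is there any way to recognize Cyrillic chars as word chars? Any other hidden catches?
  2. The data format being JSON, non-Latin characters are converted to JS Unicode escapes, e.g. for the above: \u042D\u043A\u0441\u043F\u043E\u0440\u0442. Is it safe not to do this? (server restrictions etc.)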

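My big 'design' question resulting from the previous two problems is:

Should I allow users with non-Latin alphabet languages to use their own chars for the lookup properties, or should I force them to use traditional 'word' chars, that is a, b, c etc. + underscore (thus an alphabet from another language)? I'd welcome technical advice to guide this decision (not UX advice).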

First question

For non-latin characters (eg. Экспорт), the \w (word-char) in a regex matches nothing. Is there any way to recognize cyrillic chars as word chars? Any other hidden catches?

You just have to turn on the u flag:

preg_match("#^\w+$#u", $str);

Demo.
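To see the difference locally, here is a minimal sketch comparing the same pattern with and without the u modifier. The sample value is only an illustration, and the sketch assumes the script file is saved as UTF-8, the default "C" locale is in effect, and PCRE was built with Unicode support.

```php
<?php
// Hypothetical sample value; any UTF-8 encoded Cyrillic word would do.
$lookup = 'Экспорт';

// Without the u modifier the subject is read as single bytes,
// and (in the default "C" locale) none of those bytes count as \w.
$withoutU = preg_match('/^\w+$/', $lookup);

// With the u modifier, pattern and subject are read as UTF-8 and
// \w is classified by Unicode properties, so Cyrillic letters match.
$withU = preg_match('/^\w+$/u', $lookup);

var_dump($withoutU, $withU);
```

Single quotes are used around the pattern here so that PHP never tries to interpret backslashes or $ inside it.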

The PHP docs here are misleading:

u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.

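I say this is misleading because, judging from the ideone test above, the flag enables not only PCRE_UTF8 but also PCRE_UCP (Unicode Character Properties), which is the behavior you want here.

Here is what the PCRE docs say about it: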

PCRE_UTF8
This option causes PCRE to regard both the pattern and the subject as strings of UTF-8 characters instead of single-byte strings. However, it is available only when PCRE is built to include UTF support. If not, the use of this option provokes an error. Details of how this option changes the behaviour of PCRE are given in the pcreunicode page.

PCRE_UCP
This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. By default, only ASCII characters are recognized, but if PCRE_UCP is set, Unicode properties are used instead to classify characters. More details are given in the section on generic character types in the pcrepattern page. If you set PCRE_UCP, matching one of the items it affects takes much longer. The option is available only if PCRE has been compiled with Unicode property support.

If you want to make it obvious at first glance that the PCRE_UCP flag will be set, you can insert it into the pattern itself, at the start, like this:

preg_match("#(*UCP)^\w+$#u", $str);

Another special sequence that may appear at the start of a pattern is (*UCP). This has the same effect as setting the PCRE_UCP option: it causes sequences such as \d and \w to use Unicode properties to determine character types, instead of recognizing only characters with codes less than 128 via a lookup table.
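If relying on \w feels too implicit, one alternative (not what the answer above uses, just a related technique) is to spell out the Unicode properties yourself. The sketch below assumes a hypothetical helper named isValidLookup and accepts letters, numbers and underscore, which is roughly what \w covers under PCRE_UCP. It also shows one hidden catch from the quoted PHP docs: a subject that is not valid UTF-8 matches nothing.

```php
<?php
// Hypothetical helper name, introduced only for this illustration.
function isValidLookup(string $value): bool
{
    // \p{L} = any letter, \p{N} = any number, plus underscore.
    // The u modifier makes PCRE read the subject as UTF-8.
    // preg_match() returns false for invalid UTF-8, so compare with 1.
    return preg_match('/^[\p{L}\p{N}_]+$/u', $value) === 1;
}

var_dump(isValidLookup('Экспорт'));   // Cyrillic letters
var_dump(isValidLookup('export_1'));  // ASCII letters, underscore, digit
var_dump(isValidLookup('foo bar'));   // contains a space
var_dump(isValidLookup("\xFF"));      // not valid UTF-8
```

Comparing against 1 rather than using the result as a boolean keeps the "no match" (0) and "error" (false) cases from being confused with a match.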

Second question

The data format being JSON, non-latin characters are converted to JS unicodes, eg for the above: \u042D\u043A\u0441\u043F\u043E\u0440\u0442. Is it safe not to do this? (server restrictions etc.)

It is safe not to do this, as long as your Content-Type header declares the right encoding.

So you may want to use something like:

header('Content-Type: application/json; charset=utf-8');

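And make sure you actually send it as UTF-8.

However, encoding these characters as escape sequences makes the whole thing ASCII-compatible, so you essentially sidestep the problem entirely that way.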
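Both options can be compared side by side with json_encode. This is a minimal sketch: the $entity array is a hypothetical record shaped like the value/lookup objects from the question, and JSON_UNESCAPED_UNICODE requires PHP 5.4 or later.

```php
<?php
// Hypothetical record shaped like the plugin data described in the question.
$entity = ['value' => 'Export', 'lookup' => 'Экспорт'];

// Default: non-ASCII characters are written as \uXXXX escapes,
// so the resulting JSON text is pure ASCII.
$escaped = json_encode($entity);

// JSON_UNESCAPED_UNICODE: characters are written as raw UTF-8 bytes.
$raw = json_encode($entity, JSON_UNESCAPED_UNICODE);

// Both forms decode back to the same PHP array.
$sameData = json_decode($escaped, true) === json_decode($raw, true);

echo $escaped, "\n", $raw, "\n";
var_dump($sameData);
```

The escaped form is larger but survives any transport that mangles non-ASCII bytes; the raw form is more readable but depends on every layer agreeing on UTF-8.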

Design question

Should I either allow users with non-Latin alphabet languages to use their own chars for the lookup properties or should I force them to traditional 'word' chars, that is a,b,c etc. + underscore (thus an alphabet from another language)? I'd welcome a technical advice to guide this decision (not a UX one).

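Technically, as long as your whole stack supports Unicode (browser, PHP, database etc.), I see no problem with this approach. Just make sure to test it well and to use Unicode-enabled column types in your database.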

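Be careful: PHP is a language with poor string support, so you'll have to make sure you use the right functions (avoid non-Unicode-aware ones like strlen etc., unless you really want a byte count).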
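To illustrate the byte-versus-character distinction, here is a small sketch. It assumes the mbstring extension is loaded and the script file is saved as UTF-8; the sample value is hypothetical.

```php
<?php
// Assumes the mbstring extension is available; sample value is hypothetical.
$lookup = 'Экспорт';

$bytes = strlen($lookup);              // counts bytes
$chars = mb_strlen($lookup, 'UTF-8');  // counts code points

// Byte-based substr() can cut a multi-byte character in half;
// mb_substr() works on whole characters.
$firstByteBased = substr($lookup, 0, 3);
$firstCharBased = mb_substr($lookup, 0, 3, 'UTF-8');

// Check whether the byte-based slice is still valid UTF-8.
$byteSliceIsValid = mb_check_encoding($firstByteBased, 'UTF-8');

var_dump($bytes, $chars, $byteSliceIsValid, $firstCharBased);
```

Each Cyrillic letter here takes two bytes in UTF-8, which is why the two length functions disagree and why a byte-based slice of three bytes ends in the middle of a character.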

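It may take a bit more work to make sure everything behaves as expected, but if this is something you want to support, there is no problem with it.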