How to find the Unicode character class in PHP

Question

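I'm having trouble finding a way to get the Unicode class (general category) of a character.

List of Unicode classes: https://www.php.net/manual/en/regexp.reference.unicode.php

The function I need, as it exists in Python: https://docs.python.org/3/library/unicodedata.html#unicodedata.category

I just want the PHP equivalent of this Python function.

For example, if I call a function x like this: x('-'), it should return Pd, because Pd is the class the hyphen belongs to.

Thanks.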

Answer 1

Apparently there is no built-in function that does this, so I wrote this one:

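<?php
// Two-letter general categories to probe. The categories are mutually
// exclusive, so the order does not matter. The letter categories (Lu, Ll,
// Lt, Lo) and the number categories (Nd, Nl) are included so that letters
// and digits do not fall through to null.
$UNICODE_CATEGORIES = [
    "Lu", "Ll", "Lt", "Lm", "Lo",
    "Mn", "Mc", "Me",
    "Nd", "Nl", "No",
    "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po",
    "Sm", "Sc", "Sk", "So",
    "Zs", "Zl", "Zp",
    "Cc", "Cf", "Cs", "Co", "Cn"
];

function uni_category($char, $UNICODE_CATEGORIES) {
    foreach ($UNICODE_CATEGORIES as $category) {
        // The u modifier makes PCRE read $char as UTF-8 rather than as single bytes.
        if (preg_match('/\p{' . $category . '}/u', $char)) {
            return $category;
        }
    }
    return null;
}
// call the function
print uni_category('-', $UNICODE_CATEGORIES); // prints Pd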

This code works for me; I hope it helps someone in the future :).
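The same probing idea can be applied to a whole UTF-8 string. What follows is only a minimal sketch, assuming PCRE was built with Unicode property support; the names GENERAL_CATEGORIES, uni_category_utf8 and uni_categories_of_string are hypothetical and not part of the answer above.

```php
<?php
// Hypothetical helpers for illustration; the names are not from the answer.
const GENERAL_CATEGORIES = [
    'Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Mn', 'Mc', 'Me', 'Nd', 'Nl', 'No',
    'Pc', 'Pd', 'Ps', 'Pe', 'Pi', 'Pf', 'Po', 'Sm', 'Sc', 'Sk', 'So',
    'Zs', 'Zl', 'Zp', 'Cc', 'Cf', 'Co', 'Cs', 'Cn',
];

// Return the two-letter category of the first character of $char,
// or null if nothing matches (for example, empty or invalid UTF-8 input).
function uni_category_utf8(string $char): ?string
{
    foreach (GENERAL_CATEGORIES as $category) {
        // \A anchors at the start of the subject; u enables UTF-8 mode.
        if (preg_match('/\A\p{' . $category . '}/u', $char) === 1) {
            return $category;
        }
    }
    return null;
}

// Split a UTF-8 string into characters and pair each one with its category.
function uni_categories_of_string(string $str): array
{
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return []; // preg_split fails on invalid UTF-8
    }
    $result = [];
    foreach ($chars as $ch) {
        $result[] = [$ch, uni_category_utf8($ch)];
    }
    return $result;
}
```

Splitting with preg_split('//u', ...) rather than indexing the string byte by byte keeps multi-byte characters such as 'é' or '€' intact before they are classified.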

Answer 2

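One possible approach is to use IntlChar::charType. Unfortunately, this method returns only an int, but that int is one of the constants defined in the IntlChar class. The constants for all 30 categories lie in the range 0 to 29, with no gaps. So all you have to do is build an indexed array that follows the same order: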

$shortCats = [
    'Cn', 'Lu', 'Ll', 'Lt', 'Lm', 'Lo',
    'Mn', 'Me', 'Mc', 'Nd', 'Nl', 'No',
    'Zs', 'Zl', 'Zp', 'Cc', 'Cf', 'Co',
    'Cs', 'Pd', 'Ps', 'Pe', 'Pc', 'Po',
    'Sm', 'Sc', 'Sk', 'So', 'Pi', 'Pf'
];

echo $shortCats[IntlChar::charType('-')]; //Pd

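Note: if you are worried that the numeric values defined in the class might change in the future and want to be more rigorous, you can also write the array like this: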

$shortCats = [
    IntlChar::CHAR_CATEGORY_UNASSIGNED => 'Cn',
    IntlChar::CHAR_CATEGORY_UPPERCASE_LETTER => 'Lu',
    IntlChar::CHAR_CATEGORY_LOWERCASE_LETTER => 'Ll',
    IntlChar::CHAR_CATEGORY_TITLECASE_LETTER => 'Lt',
    // etc.
];
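For reference, the constant-keyed array might be spelled out in full and wrapped in a small helper. This is a sketch under assumptions: it needs the intl extension, the function name uni_category_intl is hypothetical, and the constant names are the ones listed in the PHP manual for IntlChar.

```php
<?php
// Sketch only: requires the intl extension. The function name is hypothetical.
function uni_category_intl(string $char): ?string
{
    $shortCats = [
        IntlChar::CHAR_CATEGORY_UNASSIGNED => 'Cn',
        IntlChar::CHAR_CATEGORY_UPPERCASE_LETTER => 'Lu',
        IntlChar::CHAR_CATEGORY_LOWERCASE_LETTER => 'Ll',
        IntlChar::CHAR_CATEGORY_TITLECASE_LETTER => 'Lt',
        IntlChar::CHAR_CATEGORY_MODIFIER_LETTER => 'Lm',
        IntlChar::CHAR_CATEGORY_OTHER_LETTER => 'Lo',
        IntlChar::CHAR_CATEGORY_NON_SPACING_MARK => 'Mn',
        IntlChar::CHAR_CATEGORY_ENCLOSING_MARK => 'Me',
        IntlChar::CHAR_CATEGORY_COMBINING_SPACING_MARK => 'Mc',
        IntlChar::CHAR_CATEGORY_DECIMAL_DIGIT_NUMBER => 'Nd',
        IntlChar::CHAR_CATEGORY_LETTER_NUMBER => 'Nl',
        IntlChar::CHAR_CATEGORY_OTHER_NUMBER => 'No',
        IntlChar::CHAR_CATEGORY_SPACE_SEPARATOR => 'Zs',
        IntlChar::CHAR_CATEGORY_LINE_SEPARATOR => 'Zl',
        IntlChar::CHAR_CATEGORY_PARAGRAPH_SEPARATOR => 'Zp',
        IntlChar::CHAR_CATEGORY_CONTROL_CHAR => 'Cc',
        IntlChar::CHAR_CATEGORY_FORMAT_CHAR => 'Cf',
        IntlChar::CHAR_CATEGORY_PRIVATE_USE_CHAR => 'Co',
        IntlChar::CHAR_CATEGORY_SURROGATE => 'Cs',
        IntlChar::CHAR_CATEGORY_DASH_PUNCTUATION => 'Pd',
        IntlChar::CHAR_CATEGORY_START_PUNCTUATION => 'Ps',
        IntlChar::CHAR_CATEGORY_END_PUNCTUATION => 'Pe',
        IntlChar::CHAR_CATEGORY_CONNECTOR_PUNCTUATION => 'Pc',
        IntlChar::CHAR_CATEGORY_OTHER_PUNCTUATION => 'Po',
        IntlChar::CHAR_CATEGORY_MATH_SYMBOL => 'Sm',
        IntlChar::CHAR_CATEGORY_CURRENCY_SYMBOL => 'Sc',
        IntlChar::CHAR_CATEGORY_MODIFIER_SYMBOL => 'Sk',
        IntlChar::CHAR_CATEGORY_OTHER_SYMBOL => 'So',
        IntlChar::CHAR_CATEGORY_INITIAL_PUNCTUATION => 'Pi',
        IntlChar::CHAR_CATEGORY_FINAL_PUNCTUATION => 'Pf',
    ];

    // charType() returns null when $char is not exactly one valid code point.
    $type = IntlChar::charType($char);
    return $type === null ? null : ($shortCats[$type] ?? null);
}
```

Returning null for input that is not a single code point mirrors what charType itself does, so callers only have one failure value to check.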

Answer 3

I'm posting this because it might be useful. I have done this on a large scale before.

Below is a concise way to do it in PHP.

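Notes:

Only a single regex is generated, at startup.
It contains one lookahead assertion with a capture group for each property.
Example: (?=(\p{Property1}))?(?=(\p{Property2}))? ... (?=(\p{PropertyN}))?
Every character in the target is checked against all of the properties in the array.
Each capture group number is the index of its property in the array $General_Cat_Props.
That association is what is used when analyzing the match, for printing.

This solves the problem that a single character can match more than one property.
To check for more properties, just add them to $General_Cat_Props.
No other changes are needed.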
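The multiple-property point is easier to see with overlapping properties. Here is a minimal sketch, assuming PCRE with Unicode property support; the property list (L, Lu, Latin, Nd) and the names $props, $rx and matching_props are hypothetical choices for illustration only.

```php
<?php
// Hypothetical property list. Index 0 is a placeholder so that array
// indexes line up with capture group numbers.
$props = ['', 'L', 'Lu', 'Latin', 'Nd'];

// Build one optional lookahead with a capture group per property.
$rx = '(?=.)';
for ($i = 1; $i < count($props); $i++) {
    $rx .= '(?=(\p{' . $props[$i] . '}))?';
}
$rx = '/' . $rx . '/su';

// Return every property from $props that the first character of $char has.
function matching_props(string $char, string $rx, array $props): array
{
    $found = [];
    if (preg_match($rx, $char, $m) === 1) {
        // $m ends at the last group that matched; groups that were
        // skipped before it are present as empty strings.
        for ($i = 1; $i < count($m); $i++) {
            if ($m[$i] !== '') {
                $found[] = $props[$i];
            }
        }
    }
    return $found;
}
```

With this list, 'A' is reported under L, Lu and Latin at once, while '7' is reported only under Nd and '-' under none of them.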

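There are 2 functions:

Get_UniCategories_From_Char( $char ) analyzes one character at a time.
Get_UniCategories_From_String( $str ) is for strings (it calls the first function on each character).

It is worth noting that the array $General_Cat_Props below can be added to or trimmed as needed, for custom filters.
You can keep as many specific constant property arrays as your special checks require. The order of the properties in the array does not matter.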

Regex101 quick global test bench

/(?=.)(?=(\p{Cn}))?(?=(\p{Cc}))?(?=(\p{Cf}))?(?=(\p{Co}))?(?=(\p{Cs}))?(?=(\p{Lu}))?(?=(\p{Ll}))?(?=(\p{Lt}))?(?=(\p{Lm}))?(?=(\p{Lo}))?(?=(\p{Mn}))?(?=(\p{Me}))?(?=(\p{Mc}))?(?=(\p{Pd}))?(?=(\p{Ps}))?(?=(\p{Pe}))?(?=(\p{Pc}))?(?=(\p{Po}))?(?=(\p{Pi}))?(?=(\p{Pf}))?(?=(\p{Sm}))?(?=(\p{Sc}))?(?=(\p{Sk}))?(?=(\p{So}))?(?=(\p{Zs}))?(?=(\p{Zl}))?(?=(\p{Zp}))?/su

https://regex101.com/r/fvVZX0/1

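PHP
Mod: after realizing that PHP only fills the $matches array up to the last optional group that matched, I added a check when building the result (see $last_grp_matched = sizeof($matches);).

Previously this was forced by adding a capture group (.) at the end. The old code still works, if you need to use or see the previous version.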

http://sandbox.onlinephpfunctions.com/code/f1aeca3d9a99d1b2d1bfc72c3dd004ad232bc29e
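The truncation of $matches described in the note can be observed in isolation. A minimal sketch, assuming PHP 7.4 or later, where the PREG_UNMATCHED_AS_NULL flag also pads trailing unmatched groups with null; the shortened three-property regex and the variable names are hypothetical.

```php
<?php
// Shortened, hypothetical version of the lookahead regex: Lu, Ll, Nd.
$rx = '/(?=.)(?=(\p{Lu}))?(?=(\p{Ll}))?(?=(\p{Nd}))?/su';

// Default behavior: $plain stops at the last group that matched.
preg_match($rx, 'A', $plain);

// With PREG_UNMATCHED_AS_NULL (PHP 7.4+), every group gets an entry,
// and unmatched groups are null instead of being dropped.
preg_match($rx, 'A', $padded, PREG_UNMATCHED_AS_NULL);

var_dump(count($plain));
var_dump(count($padded));
```

With the flag, the array always has one entry per group, so a loop could run over the whole property array and test each entry against null; the size check used in the answer works on older versions too.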

<?php

// The prop array. Index 0 is a placeholder so that the array index of a
// property equals its capture group number in the regex.
$General_Cat_Props = [
"",
"Cn", "Cc", "Cf", "Co", "Cs",
"Lu", "Ll", "Lt", "Lm", "Lo",
"Mn", "Me", "Mc", // "Nd", "Nl", "No",
"Pd", "Ps", "Pe", "Pc", "Po", "Pi", "Pf",
"Sm", "Sc", "Sk", "So",
"Zs", "Zl", "Zp"
];

// The Rx (filled in by makeGCRx)
$GCRx = null;

// One-time make function
function makeGCRx()
{
    global $General_Cat_Props, $GCRx;
    $rxstr = '(?=.)';     // Start of regex, something must be ahead
    for ($i = 1; $i < count( $General_Cat_Props ); $i++) {
        $rxstr .= '(?=(\p{' . $General_Cat_Props[ $i ] . '}))?';
    }
    $GCRx = "/$rxstr/su";
}

makeGCRx();
// print_r($GCRx . "\n");

function Get_UniCategories_From_Char( $char )
{
    global $General_Cat_Props, $GCRx;
    $ret = "";
    if ( preg_match( $GCRx, $char, $matches )) {
        // $matches is only filled up to the last optional group that matched,
        // so its size is the loop bound; skipped groups are empty strings.
        $last_grp_matched = count( $matches );
        for ($i = 1; $i < $last_grp_matched; $i++) {
            if ( $matches[ $i ] !== "" ) {
                $ret .= $General_Cat_Props[ $i ] . " ";
            }
        }
    }
    return $ret;
}

function Get_UniCategories_From_String( $str )
{
    $ret = "";
    // Split into UTF-8 characters rather than bytes, so that multi-byte
    // characters reach the regex intact.
    $chars = preg_split( '//u', $str, -1, PREG_SPLIT_NO_EMPTY );
    foreach ( $chars ?: [] as $ch ) {
        $ret .= $ch . "  " . Get_UniCategories_From_Char( $ch ) . "\n";
    }
    return $ret;
}

print_r( "-  " . Get_UniCategories_From_Char( "-" ) . "\n--------\n" );
// or 
print_r( Get_UniCategories_From_String( "Hello 270 -,+?" ) . "\n" );

Output:

-  Pd 
--------
H  Lu 
e  Ll 
l  Ll 
l  Ll 
o  Ll 
   Zs 
2  
7  
0  
   Zs 
-  Pd 
,  Po 
+  Sm 
?  Po


php

regex

unicode