Parsing a Unicode emoji text file using PHP

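The Unicode Consortium provides a text file that contains the category and name details for every emoji.

The latest version is available here: http://unicode.org/Public/emoji/5.0/emoji-test.txt

Each emoji belongs to one of 8 broad groups, and each group is divided into subgroups. For example, the subgroups of the Animals & Nature group are listed below: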

# group: Smileys & People
# group: Animals & Nature

    # subgroup: animal-mammal
    # subgroup: animal-bird
    # subgroup: animal-amphibian
    # subgroup: animal-reptile
    # subgroup: animal-marine
    # subgroup: animal-bug
    # subgroup: plant-flower
    # subgroup: plant-other

# group: Food & Drink
# group: Travel & Places
# group: Activities
# group: Objects
# group: Symbols
# group: Flags

The emoji in each subgroup are then listed under that subgroup. For example, the animal-bird subgroup lists these emoji:

1F983                                      ; fully-qualified     # 🦃 turkey
1F414                                      ; fully-qualified     # 🐔 chicken
1F413                                      ; fully-qualified     # 🐓 rooster
1F423                                      ; fully-qualified     # 🐣 hatching chick
1F424                                      ; fully-qualified     # 🐤 baby chick
1F425                                      ; fully-qualified     # 🐥 front-facing baby chick
1F426                                      ; fully-qualified     # 🐦 bird
1F427                                      ; fully-qualified     # 🐧 penguin
1F54A FE0F                                 ; fully-qualified     # 🕊️ dove
1F54A                                      ; non-fully-qualified # 🕊 dove
1F985                                      ; fully-qualified     # 🦅 eagle
1F986                                      ; fully-qualified     # 🦆 duck
1F989                                      ; fully-qualified     # 🦉 owl
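
Each data line in the file follows the pattern "code points ; status # emoji description". As a minimal sketch (the function name parseEmojiLine is hypothetical and not part of any library), a single line can be split with a regular expression, and the emoji itself can be rebuilt from its hexadecimal code points with mb_chr(), assuming PHP 7.2 or later with the mbstring extension:

```php
<?php
// Hypothetical helper: splits one data line of emoji-test.txt into its fields.
// Returns null for blank lines, comment lines, or anything that does not match.
function parseEmojiLine(string $line): ?array
{
    // code points ; status # emoji description
    if (!preg_match('/^([0-9A-F ]+?)\s*;\s*([a-z-]+)\s*#\s*(\S+)\s+(.*)$/u', rtrim($line), $m)) {
        return null;
    }

    // Rebuild the emoji from its code points, e.g. "1F54A FE0F" -> U+1F54A U+FE0F.
    $emoji = '';
    foreach (explode(' ', $m[1]) as $hex) {
        $emoji .= mb_chr(hexdec($hex), 'UTF-8');
    }

    return [
        'name'        => $m[1],   // the code point sequence, as written in the file
        'status'      => $m[2],
        'emoji'       => $emoji,
        'description' => $m[4],
    ];
}

$row = parseEmojiLine('1F983                                      ; fully-qualified     # 🦃 turkey');
echo $row['name'], ' | ', $row['status'], ' | ', $row['description'], PHP_EOL;
// prints: 1F983 | fully-qualified | turkey
```

Building the emoji from the code points rather than copying it from the comment means the value survives even if the character after the # is lost or mangled in transit.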

So each emoji has the following properties, taking the turkey emoji as an example:

I have a MySQL table in which I want to store the emoji details:

CREATE TABLE `xx_emoji` (
  `fld_id` int(11) NOT NULL AUTO_INCREMENT,
  `fld_group` varchar(255) DEFAULT NULL,
  `fld_cat` varchar(255) CHARACTER SET utf8 DEFAULT NULL,
  `fld_name` varchar(255) CHARACTER SET utf8 DEFAULT NULL,
  `fld_status` varchar(255) CHARACTER SET utf8 DEFAULT NULL,
  `fld_emoji` varbinary(255) DEFAULT NULL,
  `fld_description` varchar(255) CHARACTER SET utf8 DEFAULT NULL,
  PRIMARY KEY (`fld_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4

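I could go through the text file manually and save the details into the MySQL table one at a time, so that I end up with data like this in the table, for example:

However, I wonder whether PHP could be used to parse the text file instead?

I imagine it would need a series of nested loops: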

foreach group {
    foreach subgroup {
        loop through emoji list and save into MySQL table...
            group
            subgroup
            name
            status
            emoji
            description
        end loop
    }
}

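I know this is only a very basic outline, and I apologise for asking such a broad question.

I have already looked on the Unicode website to see whether the emoji data is available in any other, more useful format such as XML or JSON, but I could not find anything, and I can only see the current emoji version: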

https://unicode.org/Public/emoji/5.0/

It's not pretty, and it may break if they change the format, but here goes. At least it points you in the right direction :p

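<?php
// download the data file once and cache it locally
if (!file_exists('emoji-test.txt')) {
    file_put_contents('emoji-test.txt', file_get_contents('http://unicode.org/Public/emoji/5.0/emoji-test.txt'));
}

// normalise line endings, then break into blocks separated by blank lines
// ("\n" rather than PHP_EOL, because PHP_EOL is "\r\n" on Windows)
$data = str_replace("\r\n", "\n", file_get_contents('emoji-test.txt'));
$blocks = explode("\n\n", $data);

// unset header
unset($blocks[0]);

$emoji = [];
$group = '';

foreach ($blocks as $chunk) {
    $lines = explode("\n", $chunk);
    $top = array_shift($lines);

    if (strpos($top, '# group:') === 0) {
        // group header: remember it for the subgroup blocks that follow
        $group = trim(substr($top, strlen('# group:')));
    } elseif (strpos($top, '# subgroup:') === 0) {
        $subgroup = trim(substr($top, strlen('# subgroup:')));

        foreach ($lines as $line) {
            // skip blank lines, comment lines, and anything that is not a data line
            if (trim($line) === '' || $line[0] === '#' || strpos($line, ';') === false) {
                continue;
            }

            // data line: code points ; status # emoji description
            // the limit of 2 keeps a "#" inside the emoji or description intact (e.g. "keycap: #")
            $linegroup = explode(';', $line, 2);
            $parts = explode('#', $linegroup[1], 2);
            $icon = explode(' ', trim($parts[1] ?? ''), 2);

            $emoji[$group][$subgroup][] = [
                'group' => $group,
                'subgroup' => $subgroup,
                'name' => trim($linegroup[0]),
                'status' => trim($parts[0]),
                'emoji' => trim($icon[0]),
                'description' => trim($icon[1] ?? ''),
            ];
        }
    }
}

print_r($emoji);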

The output looks like this, nested by group and subgroup, so you can easily loop over it and insert into your database.

Array
(
    [Smileys & People] => Array
        (
            [face-positive] => Array
                (
                    [0] => Array
                        (
                            [group] => Smileys & People
                            [subgroup] => face-positive
                            [name] => 1F600
                            [status] => fully-qualified
                            [emoji] => 😀
                            [description] => grinning face
                        )

                    [1] => Array
                        (
                            [group] => Smileys & People
                            [subgroup] => face-positive
                            [name] => 1F601
                            [status] => fully-qualified
                            [emoji] => 😁
                            [description] => beaming face with smiling eyes
                        )

                    [2] => Array
                        (
                            [group] => Smileys & People
                            [subgroup] => face-positive
                            [name] => 1F602
                            [status] => fully-qualified
                            [emoji] => 😂
                            [description] => face with tears of joy
                        )
 ...snip
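
Looping over the nested array to build database rows can be sketched as follows. This is only an illustration: the function name flattenEmoji is hypothetical, and the sample array is hand-written in the same shape as the output above rather than produced by parsing the real file.

```php
<?php
// Hypothetical helper: turns the nested [group][subgroup][] structure
// into a flat list of rows, one per emoji, ready to be inserted.
function flattenEmoji(array $emoji): array
{
    $rows = [];
    foreach ($emoji as $group => $subgroups) {
        foreach ($subgroups as $subgroup => $items) {
            foreach ($items as $item) {
                // The array keys take precedence, so group and subgroup are always set.
                $rows[] = ['group' => $group, 'subgroup' => $subgroup] + $item;
            }
        }
    }
    return $rows;
}

// Hand-written sample in the same shape as the parser output.
$sample = [
    'Smileys & People' => [
        'face-positive' => [
            ['name' => '1F600', 'status' => 'fully-qualified',
             'emoji' => "\u{1F600}", 'description' => 'grinning face'],
        ],
    ],
];

foreach (flattenEmoji($sample) as $row) {
    // Each $row could be passed to a prepared INSERT statement at this point.
    echo $row['subgroup'], ': ', $row['name'], ' ', $row['description'], PHP_EOL;
}
// prints: face-positive: 1F600 grinning face
```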

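Hope it helps.

Following on from Lawrence's very useful answer, I'm adding an answer here to show how I used the code to access the 6 properties of each emoji, so that I could load them into the MySQL table: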

...
// Prepare the statement once and reuse it for every row.
// Note: this assumes the table has a fld_sub_group column;
// the CREATE TABLE statement in the question names that column fld_cat.
$sql = "INSERT INTO xx_emoji (fld_group
                            , fld_sub_group
                            , fld_name
                            , fld_status
                            , fld_emoji
                            , fld_description)
                      VALUES (:var_group
                            , :var_sub_group
                            , :var_name
                            , :var_status
                            , :var_emoji
                            , :var_description)";

$stmt = $pdo->prepare($sql);

foreach ($blocks as $chunk) {
    $lines = explode("\n", $chunk);
    $top = array_shift($lines);

    if (substr($top, 0, strlen('# group:')) == '# group:') {
        $group = trim(str_replace('# group:', '', $top));
    } elseif (substr($top, 0, strlen('# subgroup:')) == '# subgroup:') {
        $subgroup = trim(str_replace('# subgroup:', '', $top));

        foreach ($lines as $line) {
            // skip blank lines and anything that is not a data line
            if (trim($line) === '' || strpos($line, ';') === false) {
                continue;
            }

            // data line: code points ; status # emoji description
            $linegroup = explode(';', $line, 2);
            $parts = explode('#', $linegroup[1], 2);
            $icon = explode(' ', trim($parts[1] ?? ''), 2);

            // one row per emoji, using the 6 properties
            $stmt->execute([
                ':var_group'       => trim($group),
                ':var_sub_group'   => $subgroup,
                ':var_name'        => trim($linegroup[0]),
                ':var_status'      => trim($parts[0]),
                ':var_emoji'       => trim($icon[0]),
                ':var_description' => trim($icon[1] ?? ''),
            ]);
        }
    }
}

Update, 26 December 2020: Lawrence has updated his GitHub code to use Unicode 13, here: https://github.com/lcherone/emoji-parse

In addition, Lawrence's GitHub code includes a solution for loading the data into MySQL: https://github.com/lcherone/emoji-parse/blob/master/index.php, which contains an include pointing to 'parse.php'.