Better explanation of $convmap in mb_encode_numericentity()

Question

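The description of the convmap parameter of mb_encode_numericentity in the PHP manual is vague to me. Could someone give a better explanation of it, or "dumb it down" enough for me? What do the array elements used in this parameter mean? Example 1 in the manual page has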

<?php
$convmap = array (
 int start_code1, int end_code1, int offset1, int mask1,
 int start_code2, int end_code2, int offset2, int mask2,
 ........
 int start_codeN, int end_codeN, int offsetN, int maskN );
// Specify Unicode value for start_codeN and end_codeN
// Add offsetN to value and take bit-wise 'AND' with maskN, then
// it converts value to numeric string reference.
?>

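which is helpful, but then I see lots of usage examples like array(0x80, 0xffff, 0, 0xffff); which throw me off. Does that mean the offset would be 0 and the mask would be 0xffff? If so, does offset mean the number of characters into the string at which to start converting, and what does mask mean in this context?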

Answer 1

Looking down the rabbit hole, it appears that the comments in the documentation for mb_encode_numericentity are accurate, although somewhat cryptic.

The four major parts to the convmap appear to be:

start_code: The map affects items starting from this character code.
end_code: The map affects items up to this character code.
offset: Add a specific offset amount (positive or negative) for this character code.
mask: Value to be used for mask operation (character code bitwise AND mask value).
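
The four parts listed above can be read as a rule applied to each character. The following is a minimal sketch in plain PHP, assuming the behavior described in this answer: rows are checked in order, the first matching row is used, and the output is (code + offset) & mask. The name encode_code_point is a hypothetical helper written for illustration and is not part of mbstring; it also assumes the convmap length is a multiple of four.

```php
<?php
// Hypothetical helper (not part of mbstring): applies a convmap to ONE code point.
// Assumption: rows are checked in order and the first matching row is used.
function encode_code_point(int $code, array $convmap): ?string
{
    // The convmap is a flat array that is read in groups of four integers.
    foreach (array_chunk($convmap, 4) as [$start, $end, $offset, $mask]) {
        if ($code >= $start && $code <= $end) {
            // Add the offset, then take the bitwise AND with the mask.
            return '&#' . (($code + $offset) & $mask) . ';';
        }
    }

    return null; // Not covered by any row: the character would be left as-is.
}

$inRange    = encode_code_point(162, [0x80, 0xff, 0, 0xff]); // "&#162;"
$outOfRange = encode_code_point(65, [0x80, 0xff, 0, 0xff]);  // null
```

Note that offset is added to the character code itself; it is not a position within the string.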

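Character codes can be visualized with a character table, such as this Codepage Layout example for ISO-8859-1 encoding. (ISO-8859-1 is the encoding used in the original PHP documentation Example #2.) Looking at this encoding table, we can see that the convmap is only intended to affect character codes from 0x80 (which appears to be blank for this particular encoding) up to the final character in this encoding, 0xff (which appears to be ÿ).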
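
As a small illustration of the range part only, the sketch below uses a map that starts at 0x80, so plain ASCII characters fall outside the range and are left untouched. It assumes the mbstring extension is loaded; the sample string is made up for illustration, and the variable names are arbitrary.

```php
<?php
// Only code points 0x80..0xff are affected; offset 0 and mask 0xff leave the value unchanged.
$convmap = [0x80, 0xff, 0, 0xff];

// "\u{e9}" is é (code point 233), written as an escape so the file encoding does not matter.
$input = "caf\u{e9}";

// Guarded so the script still runs where mbstring is not installed.
$encoded = function_exists('mb_encode_numericentity')
    ? mb_encode_numericentity($input, $convmap, 'UTF-8')
    : null;

echo "converted: ", $encoded ?? '(mbstring not available)', "\n";
// With mbstring loaded: converted: caf&#233;
```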

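To better understand the offset and mask features of the convmap, here are some examples of how offset and mask affect a character code (in the examples below, our character code has a value of 162):

Simple Example: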

<?php    
$original_str = "¢";
$convmap = array(0x00, 0xff, 0, 0xff);
$converted_str = mb_encode_numericentity($original_str, $convmap, "UTF-8");
echo "original:  $original_str\n";
echo "converted: $converted_str\n";
?>

Result:

original:  ¢
converted: &#162;

Offset Example:

<?php
$original_str = "¢";
$convmap = array(0x00, 0xff, 1, 0xff);
$converted_str = mb_encode_numericentity($original_str, $convmap, "UTF-8");
echo "original:  $original_str\n";
echo "converted: $converted_str\n";
?>

Result:

original:  ¢
converted: &#163;

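Notes:

The offset appears to allow finer control over the current start_code and end_code section of items to be converted. For example, you might have some special reason to add an offset for one row of character codes in your convmap, but then need to ignore the offset for another row in the convmap.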
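
To make the per-row point concrete, here is a hedged sketch with a two-row map: the first row shifts uppercase ASCII letters by one, and the second row leaves lowercase letters unshifted. The ranges and the input are arbitrary choices for illustration, and the mbstring extension is assumed to be loaded.

```php
<?php
$convmap = [
    0x41, 0x5a, 1, 0xff, // 'A'..'Z': add an offset of 1
    0x61, 0x7a, 0, 0xff, // 'a'..'z': no offset
];

$input = 'Ab'; // 'A' is 65, 'b' is 98

// Guarded so the script still runs where mbstring is not installed.
$encoded = function_exists('mb_encode_numericentity')
    ? mb_encode_numericentity($input, $convmap, 'UTF-8')
    : null;

echo "converted: ", $encoded ?? '(mbstring not available)', "\n";
// With mbstring loaded: converted: &#66;&#98;
```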

Mask Examples:

<?php
// Mask Example 1
$original_str = "¢";
$convmap = array(0x00, 0xff, 0, 0xf0);
$converted_str = mb_encode_numericentity($original_str, $convmap, "UTF-8");
echo "original:  $original_str\n";
echo "converted: $converted_str\n\n";

// Mask Example 2
$convmap = array(0x00, 0xff, 0, 0x0f);
$converted_str = mb_encode_numericentity($original_str, $convmap, "UTF-8");
echo "original:  $original_str\n";
echo "converted: $converted_str\n\n";

// Mask Example 3
$convmap = array(0x00, 0xff, 0, 0x00);
$converted_str = mb_encode_numericentity($original_str, $convmap, "UTF-8");
echo "original:  $original_str\n";
echo "converted: $converted_str\n";
?>

Result:

original:  ¢
converted: &#160;

original:  ¢
converted: &#2;

original:  ¢
converted: &#0;

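Notes:

This answer does not intend to cover masking in great detail, but masking can help keep or remove certain bits from a given value.

Mask Example 1

In the first mask example, 0xf0, the f indicates that we want to keep the bits on the left side of the binary value. Here, f has a binary value of 1111 and 0 has a binary value of 0000; together they form the value 11110000.

Then, when we do a bitwise AND operation with the character code (in this case 162, which has a binary value of 10100010), the bitwise operation looks like this: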

  11110000
& 10100010
----------
  10100000

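When converted back to its decimal value, 10100000 is 160.

Therefore, we have effectively kept the "left side" bits of the character code and discarded the "right side" bits.

Mask Example 2

In the second mask example, the mask 0x0f (binary 00001111) in the bitwise AND operation gives the following binary result: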

  00001111
& 10100010
----------
  00000010

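When converted back to its decimal value, it is 2.

Therefore, we have effectively kept the "right side" bits of the character code and discarded the "left side" bits.

Mask Example 3

Finally, the third mask example shows what happens when a mask of 0x00 (binary 00000000) is used in the bitwise AND operation: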

  00000000
& 10100010
----------
  00000000

This results in 0.
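
The three bitwise operations worked through above can be reproduced with PHP's own & operator, without mbstring. This is only a sketch of the arithmetic; the variable names are arbitrary.

```php
<?php
$code = 162; // character code of "¢", binary 10100010

$results = [];
foreach ([0xf0, 0x0f, 0x00] as $mask) {
    // Bitwise AND keeps only the bits that are set in both values.
    $results[$mask] = $code & $mask;
    printf("%08b & %08b = %08b (%d)\n", $mask, $code, $results[$mask], $results[$mask]);
}
// 11110000 & 10100010 = 10100000 (160)
// 00001111 & 10100010 = 00000010 (2)
// 00000000 & 10100010 = 00000000 (0)
```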
