How to decode a specific byte string of varints in PHP

I'm trying to decode strings in a specific format (Hearthstone deck codes) using PHP. They look like this:

AAEBAc2xAgjAAe0E7QX3DdYRh6wC8fsCoIADC8kDqwTLBPsMhRDH0wKW6AK0/ALNiQPXiQOfmwMA

AAEBAf0GBAD6DoyAA6CAAw37AZwCigbJB/gHlA+CEIUQrRDy0AL2/QKJgAPRgAMA

The specification (original description) is:

The datastring is a base64-encoded byte string.

Unless specified otherwise, every value that follows is an integer, encoded as an unsigned varint.

  1. Header block

    • Reserved byte 0x00
    • Version (1)
    • Format
  2. Data block
    The data block is split in four pairs of length + array in the following order:

    • Heroes
    • Single-copy cards
    • 2-copy cards
    • n-copy cards

Each pair has a leading varint specifying the number of items in the array. For the first three blocks, those are arrays of varints. For the last block, it is an array of pairs of varints. The goal of this structure is to make the datastring as compact as possible.
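To make the layout concrete, here is a rough sketch of how a flat list of already-decoded varints could be split into the header and the four data blocks. The function name `parse_deck_values` and the array keys are made up for illustration, and the order inside each n-copy pair (card ID first, then count) is an assumption, since the description above only says "pairs of varints".

```php
<?php
// Hypothetical helper: splits a flat list of decoded varints into the
// header fields and the four data blocks described in the specification.
function parse_deck_values(array $values): array
{
    $values = array_values($values); // make sure indexing starts at 0
    $pos = 0;
    $next = function () use (&$pos, $values) {
        if (!array_key_exists($pos, $values)) {
            throw new UnexpectedValueException('Ran out of values.');
        }
        return $values[$pos++];
    };

    $deck = [];
    $deck['reserved'] = $next();
    $deck['version'] = $next();
    $deck['format'] = $next();

    // First three blocks: a count followed by that many plain varints.
    foreach (['heroes', 'single_copy', 'two_copy'] as $block) {
        $deck[$block] = [];
        $count = $next();
        for ($i = 0; $i < $count; $i++) {
            $deck[$block][] = $next();
        }
    }

    // Last block: a count followed by that many pairs of varints.
    // ASSUMPTION: each pair is (card ID, number of copies).
    $deck['n_copy'] = [];
    $count = $next();
    for ($i = 0; $i < $count; $i++) {
        $id = $next();
        $copies = $next();
        $deck['n_copy'][] = ['id' => $id, 'count' => $copies];
    }

    return $deck;
}
```

Reading through a single cursor keeps the four blocks in step: each block's count decides how many values to consume before the next count is read.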

I've started putting something together, but I'm new to working with raw bytes. My code is:

    // I found this to decode Variable-length quantity (varint)
    function vlq_decode(array $bytes) {
        $result = [];
        $integer = 0;
        foreach ($bytes as $byte) {
            if ($integer > 0xfffffff - 0x7f) {
                throw new OverflowException('The value exceeds the maximum allowed.');
            }
            $integer <<= 7;
            $integer |= 0x7f & $byte;

            if (($byte & 0x80) === 0) {
                $result[] = $integer;
                $integer = 0;
            }
        }
        if (($byte & 0x80) !== 0) {
            throw new InvalidArgumentException('Incomplete byte sequence.');
        }
        return $result;
    }

    $datastring = 'AAEBAc2xAgjAAe0E7QX3DdYRh6wC8fsCoIADC8kDqwTLBPsMhRDH0wKW6AK0/ALNiQPXiQOfmwMA';

    $raw = base64_decode($datastring);

    $byte_array = unpack('C*', $raw);

    $result = vlq_decode($byte_array);

    print_r($result);

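The only part I'm sure about is base64_decode. I don't know whether the unpack arguments are correct, or whether the vlq_decode function works as intended, since I didn't write it myself.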
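One way to check the `unpack` step in isolation is to run it on a handful of bytes whose values are easy to verify by hand. The sketch below uses only the first 8 characters of the first deck code. The format `C*` means "unsigned char, repeated", and the resulting array is keyed from 1 rather than 0, which is harmless when iterating with `foreach`.

```php
<?php
// First 8 characters of the first deck code, i.e. its first 6 bytes.
$raw = base64_decode('AAEBAc2x');

$bytes = unpack('C*', $raw);
// Keys start at 1: [1 => 0, 2 => 1, 3 => 1, 4 => 1, 5 => 205, 6 => 177]

$bytes = array_values($bytes); // optional: re-index from 0
// [0, 1, 1, 1, 205, 177]
```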

The original site has reference implementations in Python and JavaScript, but they are beyond my understanding and I couldn't use them to get my own code working.

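Update:

The code does produce an array that looks similar to what I expect, but many of the values seem to be incorrect. I think the varint conversion is still off somewhere.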

// this is the $result I get (wrong)
Array (
    [0] => 0 // this is always 0
    [1] => 1 // Version
    [2] => 1 // Format
    [3] => 1 // What follows is an array of length 1 (data block Heroes)
    [4] => 1267842
    [5] => 8 // What follows is an array of length 8 (data block single-copy cards)
    [6] => 8193
    [7] => 13956
    [8] => 13957
    [9] => 15245
    [10] => 11025
    [11] => 120322
    [12] => 1867138
    [13] => 524291
    [14] => 11 // What follows is an array of length 11 (data block 2-copy cards)
    [15] => 9347
    [16] => 5508
    [17] => 9604
    [18] => 15756
    [19] => 656
    [20] => 1173890
    [21] => 373762
    [22] => 867842
    [23] => 1262723
    [24] => 1426563
    [25] => 511363
    [26] => 0  // What follows is an array of length 0 (data block n-copy cards)
)

The Python implementation (Gist) produces different numbers, in a slightly different format, which match nicely with the database containing the ID data (in the dbfId field).

// this is the expected (correct) $result
Array (
    [0] => 0
    [1] => 1
    [2] => 1
    [3] => 1
    [4] => 39117
    [5] => 8
    [6] => 192 
    [7] => 621 
    [8] => 749 
    [9] => 1783 
    [10] => 2262 
    [11] => 38407 
    [12] => 48625 
    [13] => 49184 
    [14] => 11
    [15] => 457 
    [16] => 555 
    [17] => 587 
    [18] => 1659 
    [19] => 2053 
    [20] => 43463 
    [21] => 46102 
    [22] => 48692 
    [23] => 50381 
    [24] => 50391 
    [25] => 52639
    [26] => 0
)

Any help is appreciated!

There is already a similar question, but it is poorly written and has no code examples, so I'm trying again.

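This is an endianness issue: you need to push the 7-bit groups from each varint's bytes in the reverse order. These varints store the least-significant group first, whereas the decoder in the question treats the first byte as the most significant. The clue is that values below 128, which fit in a single byte, come through fine.

Here is an illustrative hack that should not be used in real code: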

str_split(decbin(1267842),7)

Yields:

array(3) {
  [0]=>
  string(7) "1001101"
  [1]=>
  string(7) "0110001"
  [2]=>
  string(7) "0000010"
}

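The bit string is already a multiple of 7 bits long, which is super convenient, but also probably a symptom of the endianness issue.

Reverse, implode, convert back: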

bindec(implode('', array_reverse(str_split(decbin(1267842),7))))

Yields:

int(39117)
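The same reversal can be shown with shifts on the raw bytes instead of string manipulation. This is only a small worked example; the variable names are made up, and the three bytes are the ones that encode the hero ID in the first deck code.

```php
<?php
// The three bytes that encode the hero ID in the first deck code.
$bytes = [0xCD, 0xB1, 0x02];

$little = 0; // first byte holds the LOWEST 7 bits (what the deck code uses)
$big = 0;    // first byte holds the HIGHEST 7 bits (what the question's decoder assumes)
$shift = 0;

foreach ($bytes as $byte) {
    $group = $byte & 0x7f;        // drop the continuation bit
    $little |= $group << $shift;  // each later group lands 7 bits higher
    $shift += 7;
    $big = ($big << 7) | $group;  // each later group pushes the earlier ones up
}

var_dump($little); // int(39117)
var_dump($big);    // int(1267842)
```

Both readings consume the same three 7-bit groups (77, 49 and 2); only the order in which they are weighted differs.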

I re-jiggered that function so that it can handle this:

function vlq_decode(array $bytes, $swap_endian=false) {
    $result = [];
    $segments = [];
    foreach ($bytes as $byte) {
        if( $swap_endian ) {
            array_unshift($segments, 0x7f & $byte);
        } else {
            $segments[] = ( 0x7f & $byte );
        }

        if (($byte & 0x80) === 0) {
            $integer = 0;
            foreach($segments as $segment) {
                $integer <<= 7;
                $integer |= ( 0x7f & $segment );
            }
            $result[] = $integer;
            $segments = [];
        }
    }
    if ($segments !== []) { // leftover groups mean the last varint was never terminated
        throw new InvalidArgumentException('Incomplete byte sequence.');
    }
    return $result;
}

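Then vlq_decode($byte_array, true); will give you what you want.

I removed that bunk overflow code because it would never actually detect a real overflow, and it would also hamstring you on 32-bit integers. If you do want to detect overflow during decoding, you need to count the number of bits you are unpacking, which is a real headache :P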
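For anyone who does want the overflow check, below is a minimal sketch of the bit-counting idea. It is not the function above: `varint_decode_checked` is a hypothetical name, it reads the raw byte string directly instead of an unpacked array, and it always assumes the least-significant group comes first. The check is deliberately conservative: it rejects any varint whose next 7-bit group might not fit in the value bits of a PHP integer, even if the actual bits in that group would have fit.

```php
<?php
// Hypothetical name; decodes a raw byte string of varints whose
// least-significant 7-bit group comes first.
function varint_decode_checked(string $raw): array
{
    $maxBits = PHP_INT_SIZE * 8 - 1; // value bits in a PHP int (sign bit excluded)
    $result = [];
    $value = 0;
    $shift = 0; // number of bits unpacked so far for the current varint

    for ($i = 0, $len = strlen($raw); $i < $len; $i++) {
        $byte = ord($raw[$i]);

        // Conservative check: refuse a group that could spill past the value bits.
        if ($shift + 7 > $maxBits) {
            throw new OverflowException('Varint does not fit in a PHP integer.');
        }

        $value |= ($byte & 0x7f) << $shift;
        $shift += 7;

        if (($byte & 0x80) === 0) { // continuation bit clear: value is complete
            $result[] = $value;
            $value = 0;
            $shift = 0;
        }
    }

    if ($shift !== 0) { // input ended in the middle of a varint
        throw new InvalidArgumentException('Incomplete byte sequence.');
    }

    return $result;
}

$raw = base64_decode('AAEBAc2xAgjAAe0E7QX3DdYRh6wC8fsCoIADC8kDqwTLBPsMhRDH0wKW6AK0/ALNiQPXiQOfmwMA');
$values = varint_decode_checked($raw);
```

For the first deck code, `$values` begins 0, 1, 1, 1, 39117, 8, 192, in line with the expected output in the question.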