How to decode a specific byte string of varints in PHP
I am trying to decode a string in a specific format (a Hearthstone deck code) using PHP, for example:
AAEBAc2xAgjAAe0E7QX3DdYRh6wC8fsCoIADC8kDqwTLBPsMhRDH0wKW6AK0/ALNiQPXiQOfmwMA
or
AAEBAf0GBAD6DoyAA6CAAw37AZwCigbJB/gHlA+CEIUQrRDy0AL2/QKJgAPRgAMA
The specification (original description) is:
The datastring is a base64-encoded byte string.
Unless specified otherwise, every value that follows is an integer, encoded as an unsigned varint.
Header block
- Reserved byte 0x00
- Version (1)
- Format
Data block
The data block is split in four pairs of length + array in the following order:
- Heroes
- Single-copy cards
- 2-copy cards
- n-copy cards
Each pair has a leading varint specifying the number of items in the array. For the first three blocks, those are arrays of varints. For the last block, it is an array of pairs of varints. The goal of this structure is to make the datastring as compact as possible.
I have started to put something together, but I am still new to handling raw bytes. My code is:
// I found this to decode Variable-length quantity (varint)
function vlq_decode(array $bytes) {
    $result = [];
    $integer = 0;
    foreach ($bytes as $byte) {
        if ($integer > 0xfffffff - 0x7f) {
            throw new OverflowException('The value exceeds the maximum allowed.');
        }
        $integer <<= 7;
        $integer |= 0x7f & $byte;
        if (($byte & 0x80) === 0) {
            $result[] = $integer;
            $integer = 0;
        }
    }
    if (($byte & 0x80) !== 0) {
        throw new InvalidArgumentException('Incomplete byte sequence.');
    }
    return $result;
}
$datastring = 'AAEBAc2xAgjAAe0E7QX3DdYRh6wC8fsCoIADC8kDqwTLBPsMhRDH0wKW6AK0/ALNiQPXiQOfmwMA';
$raw = base64_decode($datastring);
$byte_array = unpack('C*', $raw);
$result = vlq_decode($byte_array);
print_r($result);
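The only thing I am sure about is base64_decode. I do not know whether the unpack arguments are correct, or whether the vlq_decode function works as intended, because I did not write it myself.
On the original site there are reference implementations in Python and Javascript, but they are over my head and I could not use that code to get mine working.
Update:
The code does produce an array that looks similar to what I expected, but many of the values seem to be incorrect. I think the varint conversion is still somewhat off.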
// this is the $result I get (wrong)
Array (
[0] => 0 // this is always 0
[1] => 1 // Version
[2] => 1 // Format
[3] => 1 // What follows is an array of length 1 (data block Heroes)
[4] => 1267842
[5] => 8 // What follows is an array of length 8 (data block single-copy cards)
[6] => 8193
[7] => 13956
[8] => 13957
[9] => 15245
[10] => 11025
[11] => 120322
[12] => 1867138
[13] => 524291
[14] => 11 // What follows is an array of length 11 (data block 2-copy cards)
[15] => 9347
[16] => 5508
[17] => 9604
[18] => 15756
[19] => 656
[20] => 1173890
[21] => 373762
[22] => 867842
[23] => 1262723
[24] => 1426563
[25] => 511363
[26] => 0 // What follows is an array of length 0 (data block n-copy cards)
)
The Python implementation (Gist) produces different numbers, in a slightly different format, which match nicely to the database containing the ID data (in the dbfId field).
// this is the expected (correct) $result
Array (
[0] => 0
[1] => 1
[2] => 1
[3] => 1
[4] => 39117
[5] => 8
[6] => 192
[7] => 621
[8] => 749
[9] => 1783
[10] => 2262
[11] => 38407
[12] => 48625
[13] => 49184
[14] => 11
[15] => 457
[16] => 555
[17] => 587
[18] => 1659
[19] => 2053
[20] => 43463
[21] => 46102
[22] => 48692
[23] => 50381
[24] => 50391
[25] => 52639
[26] => 0
)
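Thanks for any help!
There is already an answer, but it is poorly written and has no code example, so I'll give it another go.
This is an endianness issue: the 7-bit groups of each varint need to be assembled in the opposite order, with the first byte holding the least significant bits. The clue is that values below 128 come through fine.
The following is an illustrative hack and should not be used in real code: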
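To make the byte order concrete, here is a small sketch using the three bytes that encode the hero ID in the first deck string (0xCD 0xB1 0x02 after base64 decoding). The variable names are made up for illustration; the point is only that the same 7-bit groups give either number, depending on which group is treated as most significant.

```php
<?php
// The three bytes of the hero varint from the first deck string.
// Mask off the continuation bit (0x80) to get each 7-bit payload.
$groups = [0xCD & 0x7f, 0xB1 & 0x7f, 0x02 & 0x7f]; // [77, 49, 2]

// First byte treated as MOST significant (what the question's function does).
$bigEndian = ($groups[0] << 14) | ($groups[1] << 7) | $groups[2];

// First byte treated as LEAST significant (what the deck format expects).
$littleEndian = $groups[0] | ($groups[1] << 7) | ($groups[2] << 14);

var_dump($bigEndian);    // int(1267842), the wrong value from the question
var_dump($littleEndian); // int(39117), the expected hero ID
```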
str_split(decbin(1267842),7)
Yields:
array(3) {
[0]=>
string(7) "1001101"
[1]=>
string(7) "0110001"
[2]=>
string(7) "0000010"
}
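That is already a multiple of 7 bits, which is super convenient, but it is also likely a symptom of the endianness problem.
Reverse, implode, convert back: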
bindec(implode('', array_reverse(str_split(decbin(1267842),7))))
Yields:
int(39117)
I re-jiggered that function so it can handle this:
function vlq_decode(array $bytes, $swap_endian=false) {
    $result = [];
    $segments = [];
    foreach ($bytes as $byte) {
        // Collect the 7-bit payloads; prepend when swapping so the last byte ends up first.
        if( $swap_endian ) {
            array_unshift($segments, 0x7f & $byte);
        } else {
            $segments[] = ( 0x7f & $byte );
        }
        // A clear high bit marks the last byte of the current varint.
        if (($byte & 0x80) === 0) {
            $integer = 0;
            foreach($segments as $segment) {
                $integer <<= 7;
                $integer |= ( 0x7f & $segment );
            }
            $result[] = $integer;
            $segments = [];
        }
    }
    if (($byte & 0x80) !== 0) {
        throw new InvalidArgumentException('Incomplete byte sequence.');
    }
    return $result;
}
Then vlq_decode($byte_array, true); will give you what you want.
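The decoded list is still flat. As a rough sketch of how it maps onto the header and the four data blocks from the spec, something like the following could work. split_deck and its array keys are hypothetical names, and the order inside each n-copy pair (card ID first, then count) is an assumption, since the quoted spec only says "pairs of varints". The input is the expected list from the question.

```php
<?php
// Hypothetical helper: turns the flat list of decoded varints into the
// header fields plus heroes and a dbfId => copies map.
function split_deck(array $values): array
{
    $values = array_values($values); // tolerate 1-indexed input
    $pos = 0;
    // Read the next value, failing loudly if the list ends early.
    $next = function () use ($values, &$pos): int {
        if (!array_key_exists($pos, $values)) {
            throw new UnexpectedValueException('Deck data ended early.');
        }
        return $values[$pos++];
    };

    $deck = ['reserved' => $next(), 'version' => $next(), 'format' => $next()];

    // Heroes: a count followed by that many IDs.
    $deck['heroes'] = [];
    for ($n = $next(); $n > 0; $n--) {
        $deck['heroes'][] = $next();
    }

    // Single-copy and 2-copy cards: a count followed by that many IDs.
    $deck['cards'] = [];
    foreach ([1, 2] as $copies) {
        for ($n = $next(); $n > 0; $n--) {
            $deck['cards'][$next()] = $copies;
        }
    }

    // n-copy cards: a count followed by pairs (assumed order: ID, then count).
    for ($n = $next(); $n > 0; $n--) {
        $id = $next();
        $deck['cards'][$id] = $next();
    }
    return $deck;
}

$values = [0, 1, 1, 1, 39117, 8, 192, 621, 749, 1783, 2262, 38407, 48625, 49184,
           11, 457, 555, 587, 1659, 2053, 43463, 46102, 48692, 50381, 50391, 52639, 0];
$deck = split_deck($values);
echo array_sum($deck['cards']), " cards\n"; // 8 singles + 11 pairs = 30 cards
```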
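I removed that bunk overflow code because it would never really detect an actual overflow, and it would also get in the way of using 32-bit integers. If you do want to detect overflow while decoding, you need to count the bits you are unpacking, which is a real headache :P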
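For what it is worth, counting the bits is manageable if the decoder shifts each payload into place as it goes instead of collecting segments first. Below is a minimal sketch under some assumptions: varint_decode_checked and its $maxBits parameter are made-up names, a 64-bit PHP build is assumed, and it deliberately rejects any varint that continues past $maxBits, even if the extra bytes only carry zero bits.

```php
<?php
// Sketch: little-endian varint decoder that tracks how many bits were consumed.
function varint_decode_checked(array $bytes, int $maxBits = 32): array
{
    $result = [];
    $value = 0;
    $shift = 0; // number of bits already filled in $value
    foreach ($bytes as $byte) {
        $payload = $byte & 0x7f;
        // Reject bytes at or beyond $maxBits, and payload bits that would
        // land above bit position $maxBits - 1.
        if ($shift >= $maxBits || ($payload >> ($maxBits - $shift)) !== 0) {
            throw new OverflowException("Varint does not fit in $maxBits bits.");
        }
        $value |= $payload << $shift; // first byte holds the lowest 7 bits
        $shift += 7;
        if (($byte & 0x80) === 0) {   // clear high bit: last byte of this varint
            $result[] = $value;
            $value = 0;
            $shift = 0;
        }
    }
    if ($shift !== 0) {
        throw new InvalidArgumentException('Incomplete byte sequence.');
    }
    return $result;
}

$datastring = 'AAEBAc2xAgjAAe0E7QX3DdYRh6wC8fsCoIADC8kDqwTLBPsMhRDH0wKW6AK0/ALNiQPXiQOfmwMA';
$decoded = varint_decode_checked(unpack('C*', base64_decode($datastring)));
print_r($decoded);
```

With the first deck string this should reproduce the expected list from the question, starting 0, 1, 1, 1, 39117, 8, 192.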