压缩 60 位字符串的最佳方法

Question

给定 15 个随机十六进制数（60 位），其中每 20 位总是至少有 1 个重复项运行（5 个十六进制数）。

压缩字节的最佳方式是什么？

这里有一些例子：

01230 45647 789AA
D8D9F 8AAAF 21052
20D22 8CC56 AA53A
AECAB 3BB95 E1E6D
9993F C9F29 B3130

最初我一直试图在 20 位上使用霍夫曼编码，因为霍夫曼编码可以从 20 位下降到 ~10 位，但存储 table 需要超过 9 位。

这是 01230

显示 20 位 -> 10 位的细分

Character   Frequency   Assignment  Space Savings
0           2           0           2×4 - 2×1 = 6 bits
2           1           10          1×4 - 1×2 = 2 bits
1           1           110         1×4 - 1×3 = 1 bits
3           1           111         1×4 - 1×3 = 1 bits

然后我尝试对所有 300 位（5 个 60 位运行s）进行霍夫曼编码，这是给定上述示例的映射：

Character   Frequency   Assignment  Space Savings
---------------------------------------------------------
a           10          101         10×4 - 10×3 = 10 bits
9           8           000         8×4 - 8×3 = 8 bits
2           7           1111        7×4 - 7×4 = 0 bits
3           6           1101        6×4 - 6×4 = 0 bits
0           5           1100        5×4 - 5×4 = 0 bits
5           5           1001        5×4 - 5×4 = 0 bits
1           4           0010        4×4 - 4×4 = 0 bits
8           4           0111        4×4 - 4×4 = 0 bits
d           4           0101        4×4 - 4×4 = 0 bits
f           4           0110        4×4 - 4×4 = 0 bits
c           4           1000        4×4 - 4×4 = 0 bits
b           4           0011        4×4 - 4×4 = 0 bits
6           3           11100       3×4 - 3×5 = -3 bits
e           3           11101       3×4 - 3×5 = -3 bits
4           2           01000       2×4 - 2×5 = -2 bits
7           2           01001       2×4 - 2×5 = -2 bits

这总共节省了 8 位，但 8 位不足以存储霍夫曼 table。似乎由于数据的随机性，您尝试使用霍夫曼编码的位数越多，它的效果就越差。霍夫曼编码似乎最适合 20 位（减少 50%），但据我所知，将 table 存储在 9 位或更少位是不可能的。

在 60 位字符串的最坏情况下，仍然至少有 3 个重复项，平均情况下有超过 3 个重复项（我的假设）。由于至少有 3 次重复，您可以在 60 位的运行中拥有的最多符号仅为 12。

因为重复加上不到16个符号，我不禁觉得有某种类型的压缩可以使用

Answer 1

如果我简单统计至少两个十六进制数相等的20位值的个数，有524,416个。比 2¹⁹ 多了一点点。因此，您最多可以节省不到 20 位中的一位。

似乎不值得。

Answer 2

如果您想先删除重复项，请执行此操作，然后查看页面底部的链接。如果你不想去掉重复的，那么还是看看页面底部的链接：

Array.prototype.contains = function(v) {
  for (var i = 0; i < this.length; i++) {
    if (this[i] === v) return true;
  }
  return false;
};

Array.prototype.unique = function() {
  var arr = [];
  for (var i = 0; i < this.length; i++) {
    if (!arr.contains(this[i])) {
      arr.push(this[i]);
    }
  }
  return arr;
}

var duplicates = [1, 3, 4, 2, 1, 2, 3, 8];
var uniques = duplicates.unique(); // result = [1,3,4,2,8]

console.log(uniques);

那么你会缩短你必须处理的代码。那么你可能想查看 Smaz

Smaz is a simple compression library suitable for compressing strings.

如果还是不行，你可以看看这个：

http://ed-von-schleck.github.io/shoco/

Shoco is a C library to compress and decompress short strings. It is very fast and easy to use. The default compression model is optimized for english words, but you can generate your own compression model based on your specific input data.

如果有效请告诉我！

Answer 3

如果我把你的问题分成两部分：

我如何压缩（完美）随机数据：你不能。每一位都是一些新的熵，压缩算法无法“猜到”。
如何压缩“五个字符中的一个重复项”：正好有 10 个选项可以重复（请参阅下面的 table）。这基本上就是熵。只需存储它是哪个选项（可能为整行分组）。

这些是选项：

AAbcd = 1    AbAcd = 2    AbcAd = 3    AbcdA = 4    (<-- cases where first character is duplicated somewhere)
             aBBcd = 5    aBcBd = 6    aBcdB = 7    (<-- cases where second character is duplicated somewhere)
                          abCCd = 8    abCdC = 9    (<-- cases where third character is duplicated somewhere)
                                       abcDD = 0    (<-- cases where last characters are duplicated)

所以对于你的第一个例子：

01230 45647 789AA

第一个（01230）是选项4，第二个3和第三个选项0。

您可以通过将每个连续乘以 10 来压缩它：(4*10 + 3)*10 + 0 = 430 并使用除法和模数解压缩：430%10=0、(430/10)%10=3、(430/10/10)%10=4。所以你可以这样存储你的号码：

1AE 0123 4567 789A
^^^ this is 430 in hex and requires only 10 bit

三个选项加起来的最大数量是1000，所以10位就够了。

与正常存储这3个字符相比你节省了2位。正如其他人已经评论过的那样——这可能不值得。对于整条线，它甚至更少：2 位/60 位 = 3.3% 保存。

压缩 60 位字符串的最佳方法

Optimal way to compress 60 bit string

compression

entropy

huffman-code

information-theory