How would one calculate binary statistics in Perl?
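I mainly want to do what a typical/good hex editor can do:
https://www.hhdsoftware.com/doc/hex-editor/statistics-statistics-tool-window.html
I want to be able to count the occurrences of each byte and put them in a table, so that I can determine the percentage of, say, "00" compared to "FF".
I have already managed to get entropy working, and other statistics such as mean, median, and mode become somewhat redundant once the above is done.
Another issue is that the binary files I am analyzing are fairly large, 32 MB+.
Any suggestions?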
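use strict;
use warnings;
use List::Util qw( sum );
use constant BLOCK_SIZE => 4*1024*1024;   # read in 4 MiB chunks

my $qfn = shift(@ARGV) // die("usage: $0 file\n");   # path of the file to analyze
open(my $fh, '<:raw', $qfn)
    or die("Can't open \"$qfn\": $!\n");

my @counts = (0) x 256;   # one slot per possible byte value
while (1) {
    my $rv = sysread($fh, my $buf, BLOCK_SIZE);
    die($!) if !defined($rv);   # read error
    last if !$rv;               # end of file
    ++$counts[$_] for unpack 'C*', $buf;
}

my $N = sum @counts;   # total number of bytes read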
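Once the counts exist, the percentage table the question asks for is a short loop over them. The following is only a minimal sketch: the sub name `percent_table` and the small in-memory demo buffer are illustrative assumptions, not part of the answer above, and the demo data stands in for counts gathered from a real file.

```perl
use strict;
use warnings;
use List::Util qw( sum );

# Turn an array of per-byte counts into rows of
# [ byte value, count, percentage of all bytes ].
# (Hypothetical helper, named here for illustration only.)
sub percent_table {
    my @counts = @_;
    my $total = sum(@counts) or return;   # no bytes counted: no rows
    return map { [ $_, $counts[$_], 100 * $counts[$_] / $total ] } 0 .. $#counts;
}

# Demo data standing in for a real file: six 0x00, three 0xFF, one 0x41.
my @counts = (0) x 256;
++$counts[$_] for unpack 'C*', ("\x00" x 6) . ("\xFF" x 3) . "A";

for my $row (percent_table(@counts)) {
    my ($byte, $count, $pct) = @$row;
    next if !$count;                      # skip byte values that never occur
    printf "%02X  %10d  %6.2f%%\n", $byte, $count, $pct;
}
```

With the demo buffer, 00 accounts for 60% of the bytes and FF for 30%, which is the kind of comparison the question describes. Dropping the `next if !$count` line prints all 256 rows.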
Here is another approach:
use strict;
use warnings;
use Time::HiRes qw( time );

$/ = \1;   # read fixed-size records of one byte each
open( my $file, '<', shift(@ARGV) ) or die "Can't open file: $!\n";
binmode $file;

my %seen;    # byte (as a 1-character string) => number of occurrences
my $start = time();
my $n = 0;   # total number of bytes read
while (<$file>) {
    $seen{$_}++;
    $n++;
}
my $end = time();

for ( sort keys %seen ) {
    printf( "%s%s%.2f%s\n", uc( unpack( 'H*', $_ ) ), " seen $seen{$_} times - ", $seen{$_} / $n * 100, "%" );
}
printf( "took %.3f seconds!\n", $end - $start );
Output:
...
...
F8 seen 46475 times - 0.28%
F9 seen 46611 times - 0.28%
FA seen 46703 times - 0.28%
FB seen 48902 times - 0.29%
FC seen 46829 times - 0.28%
FD seen 47707 times - 0.28%
FE seen 47276 times - 0.28%
FF seen 1752333 times - 10.44%
took 2.374 seconds!
This is perl 5.22.1 built for x86_64-linux-gnu-thread-multi (WSL on Windows) (with 69 registered patches)
The same thing in C - https://github.com/james28909/count/blob/master/count.c
EDIT:
Actually, here is another, better example given by BrowserUK at perlmonks - https://www.perlmonks.org/?node_id=1159266 - it seems to run faster than both of the examples/answers given.
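use strict;
use warnings;
use Time::HiRes qw[ time ];

my $start = time;
open I, '<:raw', $ARGV[ 0 ] or die "Can't open \"$ARGV[ 0 ]\": $!\n";
my @seen;   # $seen[$b] = number of times byte value $b occurred
while( read( I, my $buf, 16384 ) ) {   # 16 KiB per read
    ++$seen[$_] for unpack 'C*', $buf;
}
printf "Took %f secs\n", time() - $start;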
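The loop above only times the counting and prints no statistics. Because @seen is never pre-filled, slots for byte values that do not occur stay undefined, so any report has to treat them as zero. The sketch below shows one way to derive entropy, mean, median, and mode from such a histogram. The sub name `byte_stats`, the choice of the lower median, and the tiny in-memory demo buffer are assumptions made for illustration, not part of the PerlMonks code.

```perl
use strict;
use warnings;
use List::Util qw( sum );

# Summary statistics from a byte histogram passed as an array reference.
# Undefined or missing slots are treated as a count of zero.
# (Hypothetical helper, named here for illustration only.)
sub byte_stats {
    my ($seen) = @_;
    my @counts = map { $seen->[$_] // 0 } 0 .. 255;
    my $total  = sum(@counts) or return;   # no bytes counted: empty list

    # Shannon entropy in bits per byte: -sum p*log2(p) over bytes that occur.
    my $entropy = 0;
    for my $c (grep { $_ } @counts) {
        my $p = $c / $total;
        $entropy -= $p * log($p) / log(2);
    }

    # Mean byte value, weighting each value by how often it occurs.
    my $mean = sum( map { $_ * $counts[$_] } 0 .. 255 ) / $total;

    # Mode: the most frequent byte value (the lowest value wins ties).
    my $mode = 0;
    for my $byte (1 .. 255) {
        $mode = $byte if $counts[$byte] > $counts[$mode];
    }

    # Lower median: first byte value whose cumulative count reaches half the total.
    my ($median, $running) = (0, 0);
    for my $byte (0 .. 255) {
        $running += $counts[$byte];
        if ($running * 2 >= $total) { $median = $byte; last }
    }

    return ( total => $total, entropy => $entropy, mean => $mean,
             median => $median, mode => $mode );
}

# Demo data standing in for a real file: three 0x00 bytes and one 0xFF byte.
my @seen;
++$seen[$_] for unpack 'C*', "\x00\x00\x00\xFF";

my %stats = byte_stats(\@seen);
printf "bytes %d, entropy %.4f bits/byte, mean %.2f, median 0x%02X, mode 0x%02X\n",
    @stats{qw( total entropy mean median mode )};
```

Everything here is computed from the 256 counts alone, so the file is read only once no matter how many statistics are reported, which matters for inputs of 32 MB and up.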