如何快速统计字符串中连续单个字符的最大个数?
How can I quickly count the maximum number of consecutive single characters in a string?
我的字符串类似于:但更长
my $a = "000000001111111111000000011111111111111111111111111111111";
我正在计算“1”的数量:
my $total_1_available = $a =~ tr/1//;
而且效果非常好,而且速度非常快。
但是,我还想(快速地)计算连续 1 的总数。连续“1”的最大计数。
在上面的示例中,它将 return 计数:
11111111111111111111111111111111
因为这是连续的最大值。
所以,我得到 TOTAL_COUNT 和 TOTAL_CONSECUTIVE_COUNT。
我让它与 REGEXP 一起工作,它基本上替换了 1,然后计算被替换的内容并循环……这实际上完全没问题并且可以工作……但它“感觉”不对。
理想情况下,我根本不想替换字符串,因为我正在寻找最大连续计数。
但是,我知道在 Perl 中这可能不是最快或最干净的方法。
能不能教教我更好的方法,增加我的学习?
根据要求,这是我当前的代码:
my $a= "0110011001101111";
my $total_1_available = $a =~ tr/1//;
print "Total number of 1's = $total_1_available\n";
my $max_c = 0;
while ( $a=~s/(1+)/ / ) {
$max_c = length() if length() > $max_c;
}
print "Consecutive count = $max_c\n";
最终代码:
use strict;
use warnings;
use Benchmark ':all';
use String::Random;
## We test 525,600 as this is the length of the string.
## Actually each 0 or 1 represents a minute of the year.
## And these represent engineer minues available in a 24 hr / 365 day year.
## And there are lots and lots of engineers.
## Hence my wish to improve the performance and I wish to thank everyone whom responded.
## there are a lot more 0's than 1's so hack to sort of simulate
my $test_regex = '[0][0][0][0][0][0-1][0-1][0-1][0-1][0-1]' x 52560;
my $pass = String::Random->new;
my $string = $pass->randregex($test_regex);
cmpthese(-1, {
org => sub { my $max = 0; while ($string =~ /(1+)/g) { my $len = length(); if ($max < $len) { $max = $len } } },
hack => sub { my $match = ""; while ($string =~ /(${match}1+)/g) { $match = ; } length $match }
});
# BLOWN AWAY !!!!!!
# BLOWN AWAY !!!!!!
# BLOWN AWAY !!!!!!
# BLOWN AWAY !!!!!!
Perl 允许您动态创建哈希值,您可以用它来进行计数。
循环遍历 $a 的每个字符,使用该字母递增散列的内容。在循环结束时,您将得到一个散列,其中包含每个字母的键和包含每个字母的计数的值。
foreach $letter (split //, $a) {
if $letter eq $last {
$consecutive_count{$letter}++
} else {
if ($consecutive_count{$letter} > $consecutive_runs{$letter})
$consecutive_runs{$letter} = $consecutive_count{$letter};
$consecutive_count{$letter} = 0;
}
}
$counts{$letter}++;
$last = $letter;
}
foreach my $key (keys %counts) {
print "$key occured $counts{$letter} times";
print "longest consecutive run for $key was $consecutive_runs{$key}";
}
我可能会这样做:
use List::Util 'max';
my $string = '01011101100000111111001';
my $longest_run = max( 0, map { length } $string =~ /(1+)/g );
获取每个匹配的 1 组的长度并选择最大的一个。插入了一个 0,这样如果没有 undef
,您就不会得到
$ perl -MList::Util=max \
-E 'say $_, " ", max(0, map { length } /(1+)/g) for @ARGV' \
0 1 00010110 011101111110100110
0 0
1 1
00010110 2
011101111110100110 6
编辑:@TLP 的评论让我很好奇,因为我喜欢 sort
解决方案。
#!/usr/bin/env perl
use v5.16;
use warnings;
use Benchmark ':all';
use List::Util 'max';
my $string = '0100100101111011010010101101101110101011111111101010100100100001011101010100';
cmpthese(1_000_000, {
sort => sub { my $x = ( sort { $b <=> $a } $string =~ /(1+)/g)[0] },
max => sub { my $x = max(0, map { length } $string =~ /(1+)/g) },
});
结果:
Rate sort max
sort 84890/s -- -9%
max 93023/s 10% --
也许 longer/shorter 测试字符串会产生不同的结果?
使用动态正则表达式可以显着提高速度。我们可以使用一个变量来存储最大长度的字符串,然后搜索那个长度加上一个或多个的字符串。理论是我们只需要寻找比我们已有的字符串更长的字符串。
我使用了如下所示的解决方案
sub hack {
my $match = ""; # original search string
while ($string =~ /(${match}1+)/g) { # search for $match plus 1 or more 1s
$match = ; # when found, change to new match
}
length $match; # return max length
}
并将其与 OP 描述的原始方法进行比较,结果如下
use strict;
use warnings;
use Benchmark ':all';
my $string = '0100100101111011010010101101101110101011111111101010100100100001011101010100' x 10_000;
cmpthese(-1, {
org => sub { my $max = 0; while ($string =~ /(1+)/g) { my $len = length(); if ($max < $len) { $max = $len } } },
hack => sub { my $match = ""; while ($string =~ /(${match}1+)/g) { $match = ; } length $match }
});
输出:
Rate org hack
org 7.31/s -- -99%
hack 1372/s 18669% --
这似乎高得惊人,快了 19000%。这让我觉得我犯了一个错误,但我想不出那会是什么。也许我在正则表达式机器内部遗漏了一些东西,但这将是对原始解决方案的改进。
对于短字符串,以下解决方案比之前提出的所有解决方案都快:
use List::Util qw( max );
max 0, map length, split /[^1]+/, $s
Rate hack sort org max mxsp xs
hack 76879/s -- -12% -34% -37% -48% -98% <-- TLP
sort 87664/s 14% -- -24% -28% -41% -98% <-- Jim Davis
org 115660/s 50% 32% -- -6% -22% -98% <-- OP
max 122504/s 59% 40% 6% -- -17% -98% <-- Jim Davis
mxsp 147867/s 92% 69% 28% 21% -- -97% <-- ikegami (above)
xs 4950278/s 6339% 5547% 4180% 3941% 3248% -- <-- ikegami (below)
基准代码:
use Benchmark qw( cmpthese );
my $string = ( '01001001011110110100101011011011101010'
. '11111111101010100100100001011101010100' );
cmpthese(-3, {
org => sub { my $max = 0; while ($string =~ /(1+)/g) { my $len = length(); if ($max < $len) { $max = $len; } } },
hack => sub { my $match = ""; while ($string =~ /(${match}1+)/g) { $match = ; } my $max = length($match); },
sort => sub { my $max = ( sort { $b <=> $a } $string =~ /(1+)/g )[0]; },
max => sub { my $max = max 0, map length, $string =~ /(1+)/g; },
mxsp => sub { my $max = max 0, map length, split /[^1]+/, $string; },
xs => sub { my $max = longuest_ones_count($string); },
});
也就是说,最快的解决方案将涉及 XS。以下是我的尝试:
IV longuest_ones_count(SV* sv) {
IV max = 0;
IV count = 0;
// This code works whether the string is upgraded or downgraded.
STRLEN len;
char *s = SvPV(sv, len);
while (len--) {
if (*(s++) == '1') {
++count;
}
else if (count) {
if (max < count)
max = count;
count = 0;
}
}
if (max < count)
max = count;
return max;
}
一种使用方式:
use 5.014;
use warnings;
use Inline C => <<'__';
...above code here...
__
say "$_: ", longuest_ones_count($_)
for qw(
0
11111
011111
111110
01110111110
01111101110
);
您看到了它在短字符串方面击败了其他解决方案。但是你没有短弦。对于长字符串,这不如 TLP 的版本快!!!
与上面相同的基准,但使用
my $string = ( '01001001011110110100101011011011101010'
. '11111111101010100100100001011101010100' ) x 10_000;
Rate sort org max mxsp xs hack
sort 8.61/s -- -25% -31% -44% -99% -99%
org 11.6/s 34% -- -8% -24% -99% -99%
max 12.5/s 46% 9% -- -18% -99% -99%
mxsp 15.3/s 77% 32% 22% -- -99% -99% <-- ikegami (Perl)
xs 1031/s 11870% 8822% 8118% 6653% -- -25% <-- ikegami (XS)
hack 1366/s 15772% 11731% 10797% 8855% 33% -- <-- TLP
哇,这个正则表达式引擎真棒!使用 XS 显然可以击败它(通过消除编译模式所需的时间),但有什么意义呢?
我的字符串类似于:但更长
my $a = "000000001111111111000000011111111111111111111111111111111";
我正在计算“1”的数量:
my $total_1_available = $a =~ tr/1//;
而且效果非常好,而且速度非常快。
但是,我还想(快速地)计算连续 1 的总数。连续“1”的最大计数。
在上面的示例中,它将 return 计数:
11111111111111111111111111111111
因为这是连续的最大值。
所以,我得到 TOTAL_COUNT 和 TOTAL_CONSECUTIVE_COUNT。
我让它与 REGEXP 一起工作,它基本上替换了 1,然后计算被替换的内容并循环……这实际上完全没问题并且可以工作……但它“感觉”不对。
理想情况下,我根本不想替换字符串,因为我正在寻找最大连续计数。
但是,我知道在 Perl 中这可能不是最快或最干净的方法。
能不能教教我更好的方法,增加我的学习?
根据要求,这是我当前的代码:
my $a= "0110011001101111";
my $total_1_available = $a =~ tr/1//;
print "Total number of 1's = $total_1_available\n";
my $max_c = 0;
while ( $a=~s/(1+)/ / ) {
$max_c = length() if length() > $max_c;
}
print "Consecutive count = $max_c\n";
最终代码:
use strict;
use warnings;
use Benchmark ':all';
use String::Random;
## We test 525,600 as this is the length of the string.
## Actually each 0 or 1 represents a minute of the year.
## And these represent engineer minues available in a 24 hr / 365 day year.
## And there are lots and lots of engineers.
## Hence my wish to improve the performance and I wish to thank everyone whom responded.
## there are a lot more 0's than 1's so hack to sort of simulate
my $test_regex = '[0][0][0][0][0][0-1][0-1][0-1][0-1][0-1]' x 52560;
my $pass = String::Random->new;
my $string = $pass->randregex($test_regex);
cmpthese(-1, {
org => sub { my $max = 0; while ($string =~ /(1+)/g) { my $len = length(); if ($max < $len) { $max = $len } } },
hack => sub { my $match = ""; while ($string =~ /(${match}1+)/g) { $match = ; } length $match }
});
# BLOWN AWAY !!!!!!
# BLOWN AWAY !!!!!!
# BLOWN AWAY !!!!!!
# BLOWN AWAY !!!!!!
Perl 允许您动态创建哈希值,您可以用它来进行计数。
循环遍历 $a 的每个字符,使用该字母递增散列的内容。在循环结束时,您将得到一个散列,其中包含每个字母的键和包含每个字母的计数的值。
foreach $letter (split //, $a) {
if $letter eq $last {
$consecutive_count{$letter}++
} else {
if ($consecutive_count{$letter} > $consecutive_runs{$letter})
$consecutive_runs{$letter} = $consecutive_count{$letter};
$consecutive_count{$letter} = 0;
}
}
$counts{$letter}++;
$last = $letter;
}
foreach my $key (keys %counts) {
print "$key occured $counts{$letter} times";
print "longest consecutive run for $key was $consecutive_runs{$key}";
}
我可能会这样做:
use List::Util 'max';
my $string = '01011101100000111111001';
my $longest_run = max( 0, map { length } $string =~ /(1+)/g );
获取每个匹配的 1 组的长度并选择最大的一个。插入了一个 0,这样如果没有 undef
,您就不会得到
$ perl -MList::Util=max \
-E 'say $_, " ", max(0, map { length } /(1+)/g) for @ARGV' \
0 1 00010110 011101111110100110
0 0
1 1
00010110 2
011101111110100110 6
编辑:@TLP 的评论让我很好奇,因为我喜欢 sort
解决方案。
#!/usr/bin/env perl
use v5.16;
use warnings;
use Benchmark ':all';
use List::Util 'max';
my $string = '0100100101111011010010101101101110101011111111101010100100100001011101010100';
cmpthese(1_000_000, {
sort => sub { my $x = ( sort { $b <=> $a } $string =~ /(1+)/g)[0] },
max => sub { my $x = max(0, map { length } $string =~ /(1+)/g) },
});
结果:
Rate sort max
sort 84890/s -- -9%
max 93023/s 10% --
也许 longer/shorter 测试字符串会产生不同的结果?
使用动态正则表达式可以显着提高速度。我们可以使用一个变量来存储最大长度的字符串,然后搜索那个长度加上一个或多个的字符串。理论是我们只需要寻找比我们已有的字符串更长的字符串。
我使用了如下所示的解决方案
sub hack {
my $match = ""; # original search string
while ($string =~ /(${match}1+)/g) { # search for $match plus 1 or more 1s
$match = ; # when found, change to new match
}
length $match; # return max length
}
并将其与 OP 描述的原始方法进行比较,结果如下
use strict;
use warnings;
use Benchmark ':all';
my $string = '0100100101111011010010101101101110101011111111101010100100100001011101010100' x 10_000;
cmpthese(-1, {
org => sub { my $max = 0; while ($string =~ /(1+)/g) { my $len = length(); if ($max < $len) { $max = $len } } },
hack => sub { my $match = ""; while ($string =~ /(${match}1+)/g) { $match = ; } length $match }
});
输出:
Rate org hack
org 7.31/s -- -99%
hack 1372/s 18669% --
这似乎高得惊人,快了 19000%。这让我觉得我犯了一个错误,但我想不出那会是什么。也许我在正则表达式机器内部遗漏了一些东西,但这将是对原始解决方案的改进。
对于短字符串,以下解决方案比之前提出的所有解决方案都快:
use List::Util qw( max );
max 0, map length, split /[^1]+/, $s
Rate hack sort org max mxsp xs
hack 76879/s -- -12% -34% -37% -48% -98% <-- TLP
sort 87664/s 14% -- -24% -28% -41% -98% <-- Jim Davis
org 115660/s 50% 32% -- -6% -22% -98% <-- OP
max 122504/s 59% 40% 6% -- -17% -98% <-- Jim Davis
mxsp 147867/s 92% 69% 28% 21% -- -97% <-- ikegami (above)
xs 4950278/s 6339% 5547% 4180% 3941% 3248% -- <-- ikegami (below)
基准代码:
use Benchmark qw( cmpthese );
my $string = ( '01001001011110110100101011011011101010'
. '11111111101010100100100001011101010100' );
cmpthese(-3, {
org => sub { my $max = 0; while ($string =~ /(1+)/g) { my $len = length(); if ($max < $len) { $max = $len; } } },
hack => sub { my $match = ""; while ($string =~ /(${match}1+)/g) { $match = ; } my $max = length($match); },
sort => sub { my $max = ( sort { $b <=> $a } $string =~ /(1+)/g )[0]; },
max => sub { my $max = max 0, map length, $string =~ /(1+)/g; },
mxsp => sub { my $max = max 0, map length, split /[^1]+/, $string; },
xs => sub { my $max = longuest_ones_count($string); },
});
也就是说,最快的解决方案将涉及 XS。以下是我的尝试:
IV longuest_ones_count(SV* sv) {
IV max = 0;
IV count = 0;
// This code works whether the string is upgraded or downgraded.
STRLEN len;
char *s = SvPV(sv, len);
while (len--) {
if (*(s++) == '1') {
++count;
}
else if (count) {
if (max < count)
max = count;
count = 0;
}
}
if (max < count)
max = count;
return max;
}
一种使用方式:
use 5.014;
use warnings;
use Inline C => <<'__';
...above code here...
__
say "$_: ", longuest_ones_count($_)
for qw(
0
11111
011111
111110
01110111110
01111101110
);
您看到了它在短字符串方面击败了其他解决方案。但是你没有短弦。对于长字符串,这不如 TLP 的版本快!!!
与上面相同的基准,但使用
my $string = ( '01001001011110110100101011011011101010'
. '11111111101010100100100001011101010100' ) x 10_000;
Rate sort org max mxsp xs hack
sort 8.61/s -- -25% -31% -44% -99% -99%
org 11.6/s 34% -- -8% -24% -99% -99%
max 12.5/s 46% 9% -- -18% -99% -99%
mxsp 15.3/s 77% 32% 22% -- -99% -99% <-- ikegami (Perl)
xs 1031/s 11870% 8822% 8118% 6653% -- -25% <-- ikegami (XS)
hack 1366/s 15772% 11731% 10797% 8855% 33% -- <-- TLP
哇,这个正则表达式引擎真棒!使用 XS 显然可以击败它(通过消除编译模式所需的时间),但有什么意义呢?