如何快速统计字符串中连续单个字符的最大个数?

How can I quickly count the maximum number of consecutive single characters in a string?

我的字符串类似于:但更长

my $a = "000000001111111111000000011111111111111111111111111111111";

我正在计算“1”的数量:

my $total_1_available = $a =~ tr/1//;

而且效果非常好,而且速度非常快。

但是,我还想(快速地)计算连续 1 的总数。连续“1”的最大计数。

在上面的示例中,它将 return 计数:

11111111111111111111111111111111

因为这是连续的最大值。

所以,我得到 TOTAL_COUNT 和 TOTAL_CONSECUTIVE_COUNT。

我让它与 REGEXP 一起工作,它基本上替换了 1,然后计算被替换的内容并循环……这实际上完全没问题并且可以工作……但它“感觉”不对。

理想情况下,我根本不想替换字符串,因为我正在寻找最大连续计数。

但是,我知道在 Perl 中这可能不是最快或最干净的方法。

能不能教教我更好的方法,增加我的学习?

根据要求,这是我当前的代码:

 my $a= "0110011001101111";
 my $total_1_available = $a =~ tr/1//;
 print "Total number of 1's = $total_1_available\n";

 my $max_c = 0;
 while ( $a=~s/(1+)/ / ) {
   $max_c = length() if length() > $max_c;
 }
 print "Consecutive count   = $max_c\n";

最终代码:

use strict;
use warnings;
use Benchmark ':all';
use String::Random;

## We test 525,600 as this is the length of the string.
## Actually each 0 or 1 represents a minute of the year.
## And these represent engineer minues available in a 24 hr / 365 day year.
## And there are lots and lots of engineers.
## Hence my wish to improve the performance and I wish to thank everyone whom responded.

## there are a lot more 0's than 1's so hack to sort of simulate
my $test_regex = '[0][0][0][0][0][0-1][0-1][0-1][0-1][0-1]' x 52560;
my $pass       = String::Random->new;
my $string     = $pass->randregex($test_regex);

cmpthese(-1, {
    org  => sub { my $max = 0; while ($string =~ /(1+)/g) { my $len = length(); if ($max < $len) { $max = $len } } },
    hack => sub { my $match = ""; while ($string =~ /(${match}1+)/g) { $match = ; } length $match }
});

#                BLOWN AWAY !!!!!!
#                BLOWN AWAY !!!!!!
#                BLOWN AWAY !!!!!!
#                BLOWN AWAY !!!!!!

Perl 允许您动态创建哈希值,您可以用它来进行计数。

循环遍历 $a 的每个字符,使用该字母递增散列的内容。在循环结束时,您将得到一个散列,其中包含每个字母的键和包含每个字母的计数的值。

foreach $letter (split //, $a) {
    if $letter eq $last { 
        $consecutive_count{$letter}++
    } else {
        if ($consecutive_count{$letter} > $consecutive_runs{$letter})
             $consecutive_runs{$letter} = $consecutive_count{$letter};
             $consecutive_count{$letter} = 0;
        }
    }   
    $counts{$letter}++;
    $last = $letter;    
}

foreach my $key (keys %counts) {
    print "$key occured $counts{$letter} times";
    print "longest consecutive run for $key was $consecutive_runs{$key}";
}

我可能会这样做:

use List::Util 'max';

my $string = '01011101100000111111001';

my $longest_run = max( 0, map { length } $string =~ /(1+)/g );

获取每个匹配的 1 组的长度并选择最大的一个。插入了一个 0,这样如果没有 undef,您就不会得到

$ perl -MList::Util=max \
  -E 'say $_, " ", max(0, map { length } /(1+)/g) for @ARGV' \
  0 1 00010110 011101111110100110

0 0
1 1
00010110 2
011101111110100110 6

编辑:@TLP 的评论让我很好奇,因为我喜欢 sort 解决方案。

#!/usr/bin/env perl

use v5.16;
use warnings;

use Benchmark ':all';
use List::Util 'max';

my $string = '0100100101111011010010101101101110101011111111101010100100100001011101010100';

cmpthese(1_000_000, {
    sort => sub { my $x =    (  sort { $b <=> $a } $string =~ /(1+)/g)[0] },
    max  => sub { my $x = max(0, map { length    } $string =~ /(1+)/g)    },
});

结果:

        Rate sort  max
sort 84890/s   --  -9%
max  93023/s  10%   --

也许 longer/shorter 测试字符串会产生不同的结果?

使用动态正则表达式可以显着提高速度。我们可以使用一个变量来存储最大长度的字符串,然后搜索那个长度加上一个或多个的字符串。理论是我们只需要寻找比我们已有的字符串更长的字符串。

我使用了如下所示的解决方案

sub hack {
    my $match = "";                        # original search string
    while ($string =~ /(${match}1+)/g) {   # search for $match plus 1 or more 1s
        $match = ;                       # when found, change to new match
    }
    length $match;                         # return max length
}

并将其与 OP 描述的原始方法进行比较,结果如下

use strict;
use warnings;
use Benchmark ':all';

my $string = '0100100101111011010010101101101110101011111111101010100100100001011101010100' x 10_000;

cmpthese(-1, {
    org  => sub { my $max = 0; while ($string =~ /(1+)/g) { my $len = length(); if ($max < $len) { $max = $len } } },
    hack => sub { my $match = ""; while ($string =~ /(${match}1+)/g) { $match = ; } length $match }
});

输出:

       Rate    org   hack
org  7.31/s     --   -99%
hack 1372/s 18669%     --

这似乎高得惊人,快了 19000%。这让我觉得我犯了一个错误,但我想不出那会是什么。也许我在正则表达式机器内部遗漏了一些东西,但这将是对原始解决方案的改进。

对于短字符串,以下解决方案比之前提出的所有解决方案都快:

use List::Util qw( max );

max 0, map length, split /[^1]+/, $s
          Rate  hack  sort   org   max  mxsp    xs
hack   76879/s    --  -12%  -34%  -37%  -48%  -98%   <-- TLP
sort   87664/s   14%    --  -24%  -28%  -41%  -98%   <-- Jim Davis
org   115660/s   50%   32%    --   -6%  -22%  -98%   <-- OP
max   122504/s   59%   40%    6%    --  -17%  -98%   <-- Jim Davis
mxsp  147867/s   92%   69%   28%   21%    --  -97%   <-- ikegami (above)
xs   4950278/s 6339% 5547% 4180% 3941% 3248%    --   <-- ikegami (below)

基准代码:

use Benchmark qw( cmpthese );

my $string = ( '01001001011110110100101011011011101010'
             . '11111111101010100100100001011101010100' );

cmpthese(-3, {
    org  => sub { my $max = 0; while ($string =~ /(1+)/g) { my $len = length(); if ($max < $len) { $max = $len; } } },
    hack => sub { my $match = ""; while ($string =~ /(${match}1+)/g) { $match = ; } my $max = length($match); },
    sort => sub { my $max = ( sort { $b <=> $a } $string =~ /(1+)/g )[0]; },
    max  => sub { my $max = max 0, map length, $string =~ /(1+)/g; },
    mxsp => sub { my $max = max 0, map length, split /[^1]+/, $string; },
    xs   => sub { my $max = longuest_ones_count($string); },
});

也就是说,最快的解决方案将涉及 XS。以下是我的尝试:

IV longuest_ones_count(SV* sv) {
   IV max = 0;
   IV count = 0;

   // This code works whether the string is upgraded or downgraded.
   STRLEN len;
   char *s = SvPV(sv, len);
   while (len--) {
      if (*(s++) == '1') {
         ++count;
      }
      else if (count) {
         if (max < count)
            max = count;

         count = 0;
      }
   }

   if (max < count)
      max = count;

   return max;
}

一种使用方式:

use 5.014;
use warnings;

use Inline C => <<'__';

...above code here...

__


say "$_: ", longuest_ones_count($_)
   for qw(
      0
      11111
      011111
      111110
      01110111110
      01111101110
   );

您看到了它在短字符串方面击败了其他解决方案。但是你没有短弦。对于长字符串,这不如 TLP 的版本快!!!

与上面相同的基准,但使用

my $string = ( '01001001011110110100101011011011101010'
             . '11111111101010100100100001011101010100' ) x 10_000;
       Rate   sort    org    max   mxsp     xs   hack
sort 8.61/s     --   -25%   -31%   -44%   -99%   -99%
org  11.6/s    34%     --    -8%   -24%   -99%   -99%
max  12.5/s    46%     9%     --   -18%   -99%   -99%
mxsp 15.3/s    77%    32%    22%     --   -99%   -99%  <-- ikegami (Perl)
xs   1031/s 11870%  8822%  8118%  6653%     --   -25%  <-- ikegami (XS)
hack 1366/s 15772% 11731% 10797%  8855%    33%     --  <-- TLP

哇,这个正则表达式引擎真棒!使用 XS 显然可以击败它(通过消除编译模式所需的时间),但有什么意义呢?