如何快速统计字符串中连续单个字符的最大个数？

Question

我的字符串类似于：但更长

my $a = "000000001111111111000000011111111111111111111111111111111";

我正在计算“1”的数量：

my $total_1_available = $a =~ tr/1//;

而且效果非常好，而且速度非常快。

但是，我还想（快速地）计算连续 1 的总数。连续“1”的最大计数。

在上面的示例中，它将 return 计数：

11111111111111111111111111111111

因为这是连续的最大值。

所以，我得到 TOTAL_COUNT 和 TOTAL_CONSECUTIVE_COUNT。

我让它与 REGEXP 一起工作，它基本上替换了 1，然后计算被替换的内容并循环……这实际上完全没问题并且可以工作……但它“感觉”不对。

理想情况下，我根本不想替换字符串，因为我正在寻找最大连续计数。

但是，我知道在 Perl 中这可能不是最快或最干净的方法。

能不能教教我更好的方法，增加我的学习？

根据要求，这是我当前的代码：

 my $a= "0110011001101111";
 my $total_1_available = $a =~ tr/1//;
 print "Total number of 1's = $total_1_available\n";

 my $max_c = 0;
 while ( $a=~s/(1+)/ / ) {
   $max_c = length() if length() > $max_c;
 }
 print "Consecutive count   = $max_c\n";

最终代码：

use strict;
use warnings;
use Benchmark ':all';
use String::Random;

## We test 525,600 as this is the length of the string.
## Actually each 0 or 1 represents a minute of the year.
## And these represent engineer minues available in a 24 hr / 365 day year.
## And there are lots and lots of engineers.
## Hence my wish to improve the performance and I wish to thank everyone whom responded.

## there are a lot more 0's than 1's so hack to sort of simulate
my $test_regex = '[0][0][0][0][0][0-1][0-1][0-1][0-1][0-1]' x 52560;
my $pass       = String::Random->new;
my $string     = $pass->randregex($test_regex);

cmpthese(-1, {
    org  => sub { my $max = 0; while ($string =~ /(1+)/g) { my $len = length(); if ($max < $len) { $max = $len } } },
    hack => sub { my $match = ""; while ($string =~ /(${match}1+)/g) { $match = ; } length $match }
});

#                BLOWN AWAY !!!!!!
#                BLOWN AWAY !!!!!!
#                BLOWN AWAY !!!!!!
#                BLOWN AWAY !!!!!!

Answer 1

Perl 允许您动态创建哈希值，您可以用它来进行计数。

循环遍历 $a 的每个字符，使用该字母递增散列的内容。在循环结束时，您将得到一个散列，其中包含每个字母的键和包含每个字母的计数的值。

foreach $letter (split //, $a) {
    if $letter eq $last { 
        $consecutive_count{$letter}++
    } else {
        if ($consecutive_count{$letter} > $consecutive_runs{$letter})
             $consecutive_runs{$letter} = $consecutive_count{$letter};
             $consecutive_count{$letter} = 0;
        }
    }   
    $counts{$letter}++;
    $last = $letter;    
}

foreach my $key (keys %counts) {
    print "$key occured $counts{$letter} times";
    print "longest consecutive run for $key was $consecutive_runs{$key}";
}

Answer 2

我可能会这样做：

use List::Util 'max';

my $string = '01011101100000111111001';

my $longest_run = max( 0, map { length } $string =~ /(1+)/g );

获取每个匹配的 1 组的长度并选择最大的一个。插入了一个 0，这样如果没有 undef，您就不会得到

$ perl -MList::Util=max \
  -E 'say $_, " ", max(0, map { length } /(1+)/g) for @ARGV' \
  0 1 00010110 011101111110100110

0 0
1 1
00010110 2
011101111110100110 6

编辑：@TLP 的评论让我很好奇，因为我喜欢 sort 解决方案。

#!/usr/bin/env perl

use v5.16;
use warnings;

use Benchmark ':all';
use List::Util 'max';

my $string = '0100100101111011010010101101101110101011111111101010100100100001011101010100';

cmpthese(1_000_000, {
    sort => sub { my $x =    (  sort { $b <=> $a } $string =~ /(1+)/g)[0] },
    max  => sub { my $x = max(0, map { length    } $string =~ /(1+)/g)    },
});

结果：

        Rate sort  max
sort 84890/s   --  -9%
max  93023/s  10%   --

也许 longer/shorter 测试字符串会产生不同的结果？

Answer 3

使用动态正则表达式可以显着提高速度。我们可以使用一个变量来存储最大长度的字符串，然后搜索那个长度加上一个或多个的字符串。理论是我们只需要寻找比我们已有的字符串更长的字符串。

我使用了如下所示的解决方案

sub hack {
    my $match = "";                        # original search string
    while ($string =~ /(${match}1+)/g) {   # search for $match plus 1 or more 1s
        $match = ;                       # when found, change to new match
    }
    length $match;                         # return max length
}

并将其与 OP 描述的原始方法进行比较，结果如下

use strict;
use warnings;
use Benchmark ':all';

my $string = '0100100101111011010010101101101110101011111111101010100100100001011101010100' x 10_000;

cmpthese(-1, {
    org  => sub { my $max = 0; while ($string =~ /(1+)/g) { my $len = length(); if ($max < $len) { $max = $len } } },
    hack => sub { my $match = ""; while ($string =~ /(${match}1+)/g) { $match = ; } length $match }
});

输出：

       Rate    org   hack
org  7.31/s     --   -99%
hack 1372/s 18669%     --

这似乎高得惊人，快了 19000%。这让我觉得我犯了一个错误，但我想不出那会是什么。也许我在正则表达式机器内部遗漏了一些东西，但这将是对原始解决方案的改进。

Answer 4

对于短字符串，以下解决方案比之前提出的所有解决方案都快：

use List::Util qw( max );

max 0, map length, split /[^1]+/, $s

          Rate  hack  sort   org   max  mxsp    xs
hack   76879/s    --  -12%  -34%  -37%  -48%  -98%   <-- TLP
sort   87664/s   14%    --  -24%  -28%  -41%  -98%   <-- Jim Davis
org   115660/s   50%   32%    --   -6%  -22%  -98%   <-- OP
max   122504/s   59%   40%    6%    --  -17%  -98%   <-- Jim Davis
mxsp  147867/s   92%   69%   28%   21%    --  -97%   <-- ikegami (above)
xs   4950278/s 6339% 5547% 4180% 3941% 3248%    --   <-- ikegami (below)

基准代码：

use Benchmark qw( cmpthese );

my $string = ( '01001001011110110100101011011011101010'
             . '11111111101010100100100001011101010100' );

cmpthese(-3, {
    org  => sub { my $max = 0; while ($string =~ /(1+)/g) { my $len = length(); if ($max < $len) { $max = $len; } } },
    hack => sub { my $match = ""; while ($string =~ /(${match}1+)/g) { $match = ; } my $max = length($match); },
    sort => sub { my $max = ( sort { $b <=> $a } $string =~ /(1+)/g )[0]; },
    max  => sub { my $max = max 0, map length, $string =~ /(1+)/g; },
    mxsp => sub { my $max = max 0, map length, split /[^1]+/, $string; },
    xs   => sub { my $max = longuest_ones_count($string); },
});

也就是说，最快的解决方案将涉及 XS。以下是我的尝试：

IV longuest_ones_count(SV* sv) {
   IV max = 0;
   IV count = 0;

   // This code works whether the string is upgraded or downgraded.
   STRLEN len;
   char *s = SvPV(sv, len);
   while (len--) {
      if (*(s++) == '1') {
         ++count;
      }
      else if (count) {
         if (max < count)
            max = count;

         count = 0;
      }
   }

   if (max < count)
      max = count;

   return max;
}

一种使用方式：

use 5.014;
use warnings;

use Inline C => <<'__';

...above code here...

__


say "$_: ", longuest_ones_count($_)
   for qw(
      0
      11111
      011111
      111110
      01110111110
      01111101110
   );

您看到了它在短字符串方面击败了其他解决方案。但是你没有短弦。对于长字符串，这不如 TLP 的版本快！！！

与上面相同的基准，但使用

my $string = ( '01001001011110110100101011011011101010'
             . '11111111101010100100100001011101010100' ) x 10_000;

       Rate   sort    org    max   mxsp     xs   hack
sort 8.61/s     --   -25%   -31%   -44%   -99%   -99%
org  11.6/s    34%     --    -8%   -24%   -99%   -99%
max  12.5/s    46%     9%     --   -18%   -99%   -99%
mxsp 15.3/s    77%    32%    22%     --   -99%   -99%  <-- ikegami (Perl)
xs   1031/s 11870%  8822%  8118%  6653%     --   -25%  <-- ikegami (XS)
hack 1366/s 15772% 11731% 10797%  8855%    33%     --  <-- TLP

哇，这个正则表达式引擎真棒！使用 XS 显然可以击败它（通过消除编译模式所需的时间），但有什么意义呢？

如何快速统计字符串中连续单个字符的最大个数？

How can I quickly count the maximum number of consecutive single characters in a string?

string

perl

replace

count

tr