bash subsequence matching speed-up

Question

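I would like to know whether there is a simple way to check in bash whether one string is a subsequence of another, or rather a subsequence with an extra rule. Let me explain.

Some subsequences of "apple" are "aple", "al", "pp" and "ale". The subsequences with the extra rule that I want are those that begin and end with the same letters as the string itself, so only "aple" and "ale" qualify.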

I wrote the following program:

#!/bin/bash
while read line
do
    search=$(echo "$line" | tr -s 'A-Za-z' | sed 's/./\.\*&/g;s/^\.\*//' )
    expr match "$1" "$search" >/dev/null && echo "$line"
done

It is run like this:

./program.sh greogdgedlqfe < words.txt

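The program works, but it is slow.

It takes each line of the file, turns it into a regular expression, checks whether that expression matches the argument, and if so prints the original line. For example:

One of the lines contains the word google

$search becomes g.*o.*g.*l.*e (repeated letters are squeezed, an extra rule)

Then we check that expression against the given argument, and if it matches we print the line: google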
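
The transformation described above can be run on its own to inspect the regex generated for a word. This is a small sketch that reuses the tr and sed pipeline from the program; the function name word_to_regex is made up for illustration.

```bash
#!/bin/bash
# Hypothetical helper: print the regex the program builds for one word.
# tr -s squeezes runs of the same letter, the first sed command puts ".*"
# in front of every character, and the second removes the leading ".*".
word_to_regex() {
    printf '%s\n' "$1" | tr -s 'A-Za-z' | sed 's/./\.\*&/g;s/^\.\*//'
}

word_to_regex google    # prints g.*o.*g.*l.*e
```

Each call starts a subshell plus tr and sed, and the loop does this once per line of words.txt.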

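This works fine, but once words.txt gets too big the program becomes too slow. How can I speed it up, possibly by matching subsequences in a faster way?

Edit after Kamilcuk's possible solution

For the string "qwertyuihgfcvbnhjk" the solution returns quick, quiff, quin, qwerty. It should return only quick, so it is almost correct, but not quite.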
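
For reference, the rule as stated at the top of the question (a subsequence with the same first and last letter) can also be checked directly, without a regular expression. The following is a minimal sketch and not code from the thread: the function name is_rule_subseq is hypothetical, and it implements the strict definition, without the squeezing of repeated letters that the program above applies.

```bash
#!/bin/bash
# Hypothetical helper: succeed if word ($1) is a subsequence of target ($2)
# and both share the same first and last character.
is_rule_subseq() {
    local word=$1 target=$2 i=0 j
    [[ -n $word && -n $target ]] || return 1
    [[ ${word:0:1} == "${target:0:1}" ]] || return 1
    [[ ${word: -1} == "${target: -1}" ]] || return 1
    # Walk through target once, advancing in word on every matching character.
    for ((j = 0; j < ${#target} && i < ${#word}; j++)); do
        if [[ ${word:i:1} == "${target:j:1}" ]]; then
            i=$((i + 1))
        fi
    done
    (( i == ${#word} ))
}

# The four examples from the question: only aple and ale are printed.
for w in aple al pp ale; do
    if is_rule_subseq "$w" apple; then echo "$w"; fi
done
```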

Answer 1

bash does not need expr (an external program) for regular-expression matching; it has built-in access to the system's regex library.

#!/bin/bash
while read line
do
    search=$(echo "$line" | tr -s 'A-Za-z' | sed 's/./\.\*&/g;s/^\.\*//' )
    [[ $1 =~ $search ]] && echo "$line"
done
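
Two things may be worth keeping in mind with the version above: it still starts tr and sed once per line, and [[ ... =~ ... ]] matches anywhere in the string unless the regex is anchored, whereas expr match anchors at the start. The sketch below is one possible way to handle both using builtins only. It assumes the words contain only letters, so nothing has to be escaped; the names build_word_regex, WORD_REGEX and filter_stdin are invented for this example.

```bash
#!/bin/bash
# Sketch only: assumes words consist of letters (no regex metacharacters).

# Store an anchored regex for word $1 in the global WORD_REGEX.
# Using a global instead of $(...) avoids forking a subshell per line.
build_word_regex() {
    local word=$1 re="" prev="" ch i
    for ((i = 0; i < ${#word}; i++)); do
        ch=${word:i:1}
        # Skip a character equal to the previous one (like tr -s, but
        # applied to every character, not just letters).
        if [[ $ch != "$prev" ]]; then
            re+="${re:+.*}$ch"
        fi
        prev=$ch
    done
    WORD_REGEX="^${re}\$"
}

# Read words from stdin and print those whose regex matches target $1.
filter_stdin() {
    local target=$1 line
    while IFS= read -r line; do
        build_word_regex "$line"
        if [[ $target =~ $WORD_REGEX ]]; then
            printf '%s\n' "$line"
        fi
    done
}

printf '%s\n' google good al | filter_stdin greogdgedlqfe    # prints google
```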

Answer 2

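You can use a pattern instead of a regular expression. Just insert an asterisk after every letter of each word (except the last one) and use normal pattern matching.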

#!/bin/bash
while read line
do
    pattern=""
    for ((i=${#line}-1 ; i>=0 ; --i)) ; do
        pattern="${line:i:1}*"$pattern
    done
    pattern=${pattern%'*'}

    if [[ "$1" == $pattern ]] ; then
        echo "$line"
    fi
done

Answer 3

Try:

grep -x "$(<<<"$1" tr -s 'A-Za-z' | sed 's/./&*/g;s/\*$//;s/\*//1')" words.txt

Tested with:

set -- apple
cat >words.txt <<EOF
aple
al
pp
ale
fdafda
apppppppple
apple
google
EOF

Output:

aple
ale
apppppppple
apple

And with set -- greogdgedlqfe it outputs only google.

If I understood correctly, the "subsequences" of apple are everything that matches ap*l*e.
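
To see which regex the command substitution actually hands to grep, the pipeline can be run on its own. The sketch below wraps it in a function; the name arg_to_grep_regex is hypothetical, the tr and sed commands are the ones from the answer.

```bash
#!/bin/bash
# Hypothetical helper: print the regex that the grep command above builds.
# Steps: squeeze repeated letters, append "*" to every character, drop the
# trailing "*", then drop the first "*" so the first letter is mandatory.
arg_to_grep_regex() {
    printf '%s\n' "$1" | tr -s 'A-Za-z' | sed 's/./&*/g;s/\*$//;s/\*//1'
}

arg_to_grep_regex apple    # prints ap*l*e

# grep -x only accepts lines that match the regex as a whole.
printf '%s\n' aple al pp ale | grep -x -e "$(arg_to_grep_regex apple)"
```

Because p* matches a run of any length, a line such as apppppppple is accepted as well, as the output above shows.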

Tested on repl

Answer 4

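It is hard to beat perl when it comes to regular expressions.

Performance

The key to performance is to avoid forking extra processes. Most of the bash solutions presented here (except KamilCuk's grep-based solution, which is not always correct) need to call sed, tr and so on many times. Perl will outperform those solutions. Even if a pure bash solution is possible (using bash regexes or patterns), Perl will probably still outperform it when the word list is large.

Consider program.pl appl < words.txt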

#! /usr/bin/perl
use strict ;

my $word = shift @ARGV ;

while ( <> ) {
    chomp ;
    my $p = $_ ;
    tr/A-Za-z//s ;
    s/(.)/.*$1/g ;
    s/^\.\*// ;
    print $p, "\n" if $word =~ "^$_$" ;
} ;

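Update 1: Perl implementation of KamilCuk's solution, plus a fix.

After a small fix, I believe the idea from the grep-based solution can be used to create an even faster Perl program. It builds a single regular expression and tests every word in the word-list file against it. I think this is about the best Perl can do.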

#! /usr/bin/perl
use strict ;

$_ = shift @ARGV ;
tr/A-Za-z//s ;
s/(.)/$1*/g ;
s/\*// ;
s/\*$// ;
my $re = "^$_$" ;
print "RE=$re\n" ;

while ( <> ) {
    chomp ;
    print $_, "\n" if /$re/ ;
} ;
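
For comparison, the same idea (build one regex from the argument, then test every word against it) can be sketched in bash with builtins only, so that no process is forked per line. This is an illustration under assumptions, not a measured alternative: it assumes the argument contains only letters, the names build_arg_regex, ARG_REGEX and filter_by_arg are invented here, and whether it gets close to the Perl or grep versions on a large word list has not been tested.

```bash
#!/bin/bash
# Sketch only: assumes the argument consists of letters.

# Build the anchored regex once and store it in the global ARG_REGEX.
# For "apple" this gives ^ap*l*e$, like the Perl program above.
build_arg_regex() {
    local arg=$1 squeezed="" prev="" ch i n
    for ((i = 0; i < ${#arg}; i++)); do
        ch=${arg:i:1}
        if [[ $ch != "$prev" ]]; then squeezed+=$ch; fi
        prev=$ch
    done
    n=${#squeezed}
    if (( n <= 2 )); then
        ARG_REGEX="^${squeezed}\$"
        return
    fi
    # First and last letters are mandatory; letters in between are optional.
    local middle=${squeezed:1:n-2} re=${squeezed:0:1}
    for ((i = 0; i < ${#middle}; i++)); do
        re+="${middle:i:1}*"
    done
    ARG_REGEX="^${re}${squeezed:n-1:1}\$"
}

# Print every line of stdin that matches the regex built from $1.
filter_by_arg() {
    local line
    build_arg_regex "$1"
    while IFS= read -r line; do
        if [[ $line =~ $ARG_REGEX ]]; then
            printf '%s\n' "$line"
        fi
    done
}
```

It would be called as filter_by_arg greogdgedlqfe < words.txt.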
