How to check if one file is part of another?

Question

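I need to check, from a bash script, whether one file is contained in another, for a given multi-line pattern and an input file.

Return value:

I want to get a status (as with the grep command): 0 if a match is found, 1 if no match is found.

Pattern:

multi-line,
the order of the lines matters (treated as a single block of lines),
includes characters such as digits, letters, ?, &, *, #, etc.

Explanation

Only the following examples should be found as matches: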

pattern     file1 file2 file3 file4
222         111   111   222   222
333         222   222   333   333
            333   333         444
            444

The following should not:

pattern     file1 file2 file3 file4 file5 file6 file7
222         111   111   333   *222  111   111   222
333         *222  222   222   *333  222   222   
            333   333*        444   111         333
            444                     333   333

Here is my script:

#!/bin/bash

function writeToFile {
    if [ -w "$1" ] ; then
        echo "$2" >> "$1"
    else
        echo -e "$2" | sudo tee -a "$1" > /dev/null
    fi
}

function writeOnceToFile {
        pcregrep --color -M "$2" "$1"
        #echo $?

        if [ $? -eq 0 ]; then
            echo This file contains text that was added previously
        else
            writeToFile "$1" "$2"
        fi
}

file=file.txt 
#1?1
#2?2
#3?3
#4?4

pattern=`cat pattern.txt`
#2?2
#3?3

writeOnceToFile "$file" "$pattern"

I could run grep for every line of the pattern, but that fails in this example:

file.txt 
#1?1
#2?2
#=== added line
#3?3
#4?4

pattern.txt
#2?2
#3?3

Or even if you swap lines 2 and 3:

file=file.txt 
#1?1
#3?3
#2?2
#4?4

it should not return 0.

How can I solve this? Note that I'd prefer to use natively installed programs (without pcregrep, if possible). Maybe sed or awk can solve this?

Answer 1

I would just use diff for this task:

diff pattern <(grep -f file pattern)

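Explanation

diff file1 file2 reports whether two files differ.
By saying grep -f file pattern you see which content of pattern is in file.

So what you are doing is checking which lines of pattern are in file and then comparing that to pattern itself. If they match, it means pattern is a subset of file!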
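To get the 0/1 status the question asks for, the diff approach can be wrapped in a small function. This is only a sketch: the name is_line_subset is made up for illustration, and the -F and -x options are my additions (not part of the command above) so that lines containing ?, * or # are compared literally and as whole lines. Like the original command, it only checks that every line of pattern occurs somewhere in file; it does not check order or contiguity.

```shell
# is_line_subset PATTERN_FILE TARGET_FILE   (hypothetical helper name)
# Returns 0 if every line of PATTERN_FILE also occurs as a line of TARGET_FILE.
is_line_subset() {
    # -F: fixed strings, -x: whole-line match (both are assumptions added here)
    # -f "$2": take the lines of the target file as the patterns to search for
    # The pipeline's exit status is diff's: 0 if identical, 1 if different.
    grep -F -x -f "$2" "$1" | diff -q "$1" - >/dev/null
}
```

Usage would be something like: if is_line_subset pattern.txt file.txt; then echo found; fi. Note that a file with the pattern lines in swapped order still returns 0 here.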

Test

seq 10 is part of seq 20! Let's check:

$ diff <(seq 10) <(grep -f <(seq 20) <(seq 10))
$

seq 10 is not entirely inside seq 2 20 (1 is not in the second one):

$ diff -q <(seq 10) <(grep -f <(seq 2 20) <(seq 10))
Files /dev/fd/63 and /dev/fd/62 differ

Answer 2

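I have a working version using perl.

I thought I could do it with GNU awk, but I couldn't. RS=empty string splits on blank lines. See the edit history for a broken awk version.

How can I search for a multiline pattern in a file? shows how to use pcregrep, but I can't see a way to make it work when the pattern to search for may contain regex special characters. The -F fixed-string mode doesn't work with multi-line mode: it still treats the pattern as a set of lines to be matched separately (not as one multi-line fixed string). I see you were already trying pcregrep.

By the way, I think your code has a bug in the non-sudo case: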

function writeToFile {
    if [ -w "$1" ] ; then
        "$2" >> "$1"   # you probably meant:  echo "$2" >> "$1"
    else
        echo -e "$2" | sudo tee -a "$1" > /dev/null
    fi
}

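Anyway, the attempts with line-based tools have failed, so it's time to pull out a more serious programming language that doesn't force the newline convention on us. Just read both files into variables and use a non-regex search: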

#!/usr/bin/perl -w
# multi_line_match.pl  pattern_file  target_file
# exit(0) if a match is found, else exit(1)

#use IO::File;
use File::Slurp;
my $pat = read_file($ARGV[0]);
my $target = read_file($ARGV[1]);

if ((substr($target, 0, length($pat)) eq $pat) or index($target, "\n".$pat) >= 0) {
    exit(0);
}
exit(1);

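See What is the best way to slurp a file into a string in Perl? to avoid the dependency on File::Slurp (which isn't part of the standard perl distribution, nor of a default Ubuntu 15.04 system). I went with File::Slurp partly so that non-perl-geeks can read what the program is doing, compared to: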

my $contents = do { local(@ARGV, $/) = $file; <> };
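For comparison, the same slurp-and-search idea can be sketched in plain POSIX shell, with no Perl at all. This is a rough sketch under some assumptions: the function name contains_block_slurp is made up, neither file contains NUL bytes, and trailing blank lines are not significant (command substitution strips trailing newlines). A newline is added on both sides of each string so that the pattern can only match on whole lines, including at the very start of the file.

```shell
# contains_block_slurp PATTERN_FILE TARGET_FILE   (hypothetical helper name)
# Returns 0 if the lines of PATTERN_FILE appear in TARGET_FILE as one
# contiguous block of whole lines, 1 otherwise.
contains_block_slurp() {
    nl='
'
    needle=$(cat "$1")      # $(...) strips trailing newlines
    haystack=$(cat "$2")
    # The quoted part of a case pattern is matched literally, so
    # characters like ?, * and # in the pattern file are not special.
    case "$nl$haystack$nl" in
        *"$nl$needle$nl"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Both files end up in memory, just like in the Perl version, so this is only reasonable for files of modest size.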

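I was trying to avoid reading the whole file into memory, using an idea from http://www.perlmonks.org/?node_id=98208. I think the non-matching case would usually still read the whole file at once. Also, the logic for handling a match at the beginning of the file got quite complicated, and I didn't want to spend a long time testing to make sure it was correct in all cases. Here's what I had before giving up: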

#IO::File->input_record_separator($pat);
$/ = $pat;  # pat must include a trailing newline if you want it to match one

my $fh = IO::File->new($ARGV[2], O_RDONLY)
    or die 'Could not open file ', $ARGV[2], ": $!";

$tail = substr($fh->getline, -1);  #fast forward to the first match
#print each occurrence in the file
#print IO::File->input_record_separator  while $fh->getline;

#FIXME: something clever here to handle the case where $pat matches at the beginning of the file.
do {
    # fixme: need to check defined($fh->getline)
    if (($tail eq '\n') or ($tail = substr($fh->getline, -1))) {
    exit(0);  # if there's a 2nd line
    }
} while($tail);

exit(1);
$fh->close;

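Another idea would be to filter both the pattern and the file to be searched through something like tr '\n' '\r', so that each becomes a single line. (\r is probably a safe choice that won't collide with anything already in the file or the pattern.)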
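A rough sketch of that tr idea, untested beyond simple cases. The function name contains_block_cr is made up, and it assumes that neither file contains \r already and that both files end with a newline. A leading \r is prepended to both sides so the pattern can only match starting at a line boundary, and grep -F keeps ?, * and # literal.

```shell
# contains_block_cr PATTERN_FILE TARGET_FILE   (hypothetical helper name)
# Returns 0 if PATTERN_FILE occurs in TARGET_FILE as a contiguous block
# of whole lines, 1 otherwise.
contains_block_cr() {
    # Turn the pattern into one line: \r222\r333\r
    needle=$(printf '\r'; tr '\n' '\r' < "$1")
    # Do the same to the target and search it as a fixed string.
    { printf '\r'; tr '\n' '\r' < "$2"; } | grep -qF -- "$needle"
}
```

The exit status is grep's, so it can be used directly in an if, the same way as grep -q.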

Answer 3

I came back to this problem, and I think awk can handle it better:

awk 'FNR==NR {a[FNR]=$0; next}
     FNR==1 && NR>1 {for (i in a) len++}
     {for (i=last; i<=len; i++) {
         if (a[i]==$0) 
            {last=i; next}
     } status=1}
     END {print status+0}' file pattern

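The idea is:
- Read the whole of file into memory, in the array a[line_number] = line.
- Count the elements in the array.
- Loop through pattern and check whether the current line occurs anywhere in file between the current cursor position and the end of file. If it does, move the cursor to the position where it was found. If it doesn't, set status to 1, i.e. there is a line in pattern that does not occur in file after the previous match.
- Print the status, which will be 0 unless it was set to 1 at some point.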
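The script above accepts the pattern lines in order, but they need not be adjacent in file, and it prints the status instead of exiting with it. If the stricter behaviour from the question is wanted (one contiguous block, result as exit status), a variant could look like the sketch below. This is not the code from this answer: the function name contains_block_awk is made up, it assumes a non-empty pattern file, and it stores both files in memory. Appending "" forces string comparison, so that lines like 010 and 10 are not treated as equal numbers.

```shell
# contains_block_awk PATTERN_FILE TARGET_FILE   (hypothetical helper name)
# Exit status 0 if PATTERN_FILE occurs in TARGET_FILE as a contiguous
# block of whole lines, 1 otherwise. Assumes PATTERN_FILE is not empty.
contains_block_awk() {
    awk '
        FNR == NR { pat[FNR] = $0; n = FNR; next }   # first file: pattern lines
        { buf[FNR] = $0; m = FNR }                   # second file: target lines
        END {
            # Try every possible starting line in the target.
            for (start = 1; start + n - 1 <= m; start++) {
                k = 1
                while (k <= n && (buf[start + k - 1] "") == (pat[k] ""))
                    k++
                if (k > n) exit 0    # all n pattern lines matched in a row
            }
            exit 1
        }
    ' "$1" "$2"
}
```

Note the argument order here is pattern first, then file, which is the reverse of the script above.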

Test

They do match:

$ tail f p
==> f <==
222
333
555

==> p <==
222
333
$ awk 'FNR==NR {a[FNR]=$0; next} FNR==1 && NR>1{for (i in a) len++} {for (i=last; i<=len; i++) {if (a[i]==$0) {last=i; next}} status=1} END {print status+0}' f p
0

They don't:

$ tail f p
==> f <==
333
222
555

==> p <==
222
333
$ awk 'FNR==NR {a[FNR]=$0; next} FNR==1 && NR>1{for (i in a) len++} {for (i=last; i<=len; i++) {if (a[i]==$0) {last=i; next}} status=1} END {print status+0}' f p
1

With seq:

$ awk 'FNR==NR {a[FNR]=$0; next} FNR==1 && NR>1{for (i in a) len++} {for (i=last; i<=len; i++) {if (a[i]==$0) {last=i; next}} status=1} END {print status+0}' <(seq 2 20) <(seq 10)
1
$ awk 'FNR==NR {a[FNR]=$0; next} FNR==1 && NR>1{for (i in a) len++} {for (i=last; i<=len; i++) {if (a[i]==$0) {last=i; next}} status=1} END {print status+0}' <(seq 20) <(seq 10)
0
