How to remove whitespace from a file and extract the corresponding indices from another file? - bash

Question

I have three files:

file1.txt:

XYZ与ABC
DFC什么
FBFBBBFde
warlaugh世界

file2.txt:

XYZ 与 ABC
warlaugh 世界

file3.txt:

XYZ with abc
DFC whatever
FBFBBBF
world of warlaugh

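file2.txt is a processed version of file1.txt with spaces added. The lines of file1.txt are aligned with those of file3.txt, i.e. XYZ与ABC <-> XYZ with abc.

For some reason the processing dropped some lines, so file2.txt has fewer lines than file1.txt. What matters more is retrieving the corresponding lines from file3.txt after the processing.

How can I check which lines were dropped from file2.txt and then produce a file4.txt that looks like this: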

file4.txt:

XYZ with abc
world of warlaugh

I can do this in Python, but I am sure there is a simple way to do it with sed/awk or bash tricks:

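with open('file1.txt') as file1, open('file2.txt') as file2, open('file3.txt') as file3:
    # Strip spaces (and the trailing newline) from each processed line.
    file2_nospace = {line.rstrip('\n').replace(' ', '') for line in file2}
    # Indices of the file1 lines that survived the processing.
    file2_indices = {i for i, line in enumerate(file1) if line.rstrip('\n') in file2_nospace}
    # Matching lines of file3, with their newlines intact.
    file4 = [line for i, line in enumerate(file3) if i in file2_indices]

with open('file4.txt', 'w') as out:
    out.writelines(file4)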

How can I create file4.txt using sed/awk/grep or bash tricks?

Answer 1

First remove the spaces from file2.txt so that its lines look like those of file1.txt:

sed 's/ //g' file2.txt

Then use the result as patterns to match against file1.txt. Do this with grep -f, and add -n to see the line numbers of the file1.txt lines that match the patterns built from file2.txt:

$ grep -nf <(sed 's/ //g' file2.txt) file1.txt
1:XYZ与ABC
4:warlaugh世界
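
One caveat about the command above: grep -f treats every line of the pattern file as a regular expression and reports substring matches, so a processed line containing characters such as . or *, or one that is merely part of a longer file1.txt line, could match more than intended. A minimal sketch of a stricter variant, assuming the stripped file2.txt lines must equal whole file1.txt lines: -F takes the patterns literally and -x requires a whole-line match. The scratch directory, the patterns.txt file name, and the matches variable are illustrative only.

```shell
# Recreate two of the sample files from the question in a scratch directory (illustrative).
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' 'XYZ与ABC' 'DFC什么' 'FBFBBBFde' 'warlaugh世界' > file1.txt
printf '%s\n' 'XYZ 与 ABC' 'warlaugh 世界' > file2.txt

# Strip the spaces, then match the results as fixed strings (-F) against whole lines (-x).
sed 's/ //g' file2.txt > patterns.txt
matches=$(grep -nFx -f patterns.txt file1.txt)
printf '%s\n' "$matches"
```

Writing the patterns to a file instead of using <(...) keeps the sketch usable in plain sh as well as bash.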

Now strip everything after the : to create new patterns for matching the lines of file3.txt:

$ grep -nf <(sed 's/ //g' file2.txt) file1.txt | sed 's/:.*/:/'
1:
4:

To prefix each line of file3.txt with its line number, use:

$ nl -s':' file3.txt | sed -r 's/^ +//'
1:XYZ with abc
2:DFC whatever
3:FBFBBBF
4:world of warlaugh
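
Note that nl by default numbers only non-empty lines (its -b t mode), so its numbers drift out of step with those from grep -n as soon as file3.txt contains a blank line. A small sketch of two ways around that, assuming blank lines may occur: nl -ba numbers every line, and awk can print NR directly. The sample.txt file and the variable names are made up for the illustration.

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1
# A three-line sample whose second line is empty.
printf '%s\n' 'first' '' 'third' > sample.txt

# Default nl skips the empty line, so the last line is numbered 2 instead of 3.
default_last=$(nl -s':' sample.txt | sed 's/^ *//' | tail -n 1)

# Two ways to number every line, empty or not.
with_nl=$(nl -ba -s':' sample.txt | sed 's/^ *//')
with_awk=$(awk '{ print NR ":" $0 }' sample.txt)

printf '%s\n' "$default_last" "$with_awk"
```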

Now you can use the first output as patterns to match against the second:

$ grep -f <(grep -nf <(sed 's/ //g' file2.txt) file1.txt | sed 's/:.*/:/')  <(nl -s':' file3.txt | sed -r 's/^ +//')
1:XYZ with abc
4:world of warlaugh
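
A caveat once file3.txt grows past nine lines: the pattern 1: is an unanchored regular expression, so it also matches the numbered lines 11:, 21:, and so on, as well as any line whose text happens to contain 1:. A hedged sketch of the effect and of one possible fix, anchoring each pattern with ^. The eleven-line letters.txt file and the pattern file names are invented for the demonstration.

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1
# Eleven numbered lines: 1:a, 2:b, ... 11:k
printf '%s\n' a b c d e f g h i j k > letters.txt
nl -ba -s':' letters.txt | sed 's/^ *//' > numbered.txt

# Unanchored pattern, as produced by sed 's/:.*/:/'
printf '%s\n' '1:' > loose.pat
# Anchored pattern: keep the number and prepend ^
printf '%s\n' '1:whatever' | sed 's/^\([0-9]*\):.*/^\1:/' > anchored.pat

loose=$(grep -f loose.pat numbered.txt)
anchored=$(grep -f anchored.pat numbered.txt)
printf 'loose:\n%s\nanchored:\n%s\n' "$loose" "$anchored"
```

With the anchored pattern only line 1 is selected, while the loose pattern also picks up line 11.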

To remove the leading line numbers, just use cut (-f2- keeps the rest of the line even if it contains a colon):

$ grep -f <(grep -nf <(sed 's/ //g' file2.txt) file1.txt | sed 's/:.*/:/')  <(nl -s':' file3.txt | sed -r 's/^ +//') | cut -d':' -f2-
XYZ with abc
world of warlaugh

Finally, save the result to file4.txt:

$ grep -f <(grep -nf <(sed 's/ //g' file2.txt) file1.txt | sed 's/:.*/:/')  <(nl -s':' file3.txt | sed -r 's/^ +//') | cut -d':' -f2- > file4.txt

Answer 2

You can do something similar with a single call to awk:

awk 'FILENAME ~ /file2.txt/ { gsub(/ /, ""); a[$0]; next }
     FILENAME ~ /file1.txt/ && $0 in a { b[FNR]; next }
     FILENAME ~ /file3.txt/ && FNR in b { print }' file2.txt file1.txt file3.txt

You can also use two awk calls to avoid the FILENAME variable:

awk 'FNR==NR { gsub(/ /, ""); a[$0]; next }
    $0 in a { print FNR }' file2.txt file1.txt |
awk 'FNR==NR { a[$0]; next } FNR in a { print }' - file3.txt

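Use > file4.txt to write the output to file4.txt.

Basically, this:

Takes file2.txt and, after stripping the spaces, stores its lines in an associative array.
Compares the lines of file1.txt against that associative array and stores the line numbers of the matches in a second associative array.
Tests whether each line number of file3.txt is in the second associative array and prints the line when it is.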
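
The three steps above can be sketched as one runnable awk call on the sample data. This variant is an illustration rather than the answer's own code: it counts input files with a hypothetical nfile counter (incremented whenever FNR is 1) instead of testing FILENAME, which assumes that none of the three files is empty.

```shell
# Recreate the sample files from the question in a scratch directory (illustrative).
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' 'XYZ与ABC' 'DFC什么' 'FBFBBBFde' 'warlaugh世界' > file1.txt
printf '%s\n' 'XYZ 与 ABC' 'warlaugh 世界' > file2.txt
printf '%s\n' 'XYZ with abc' 'DFC whatever' 'FBFBBBF' 'world of warlaugh' > file3.txt

awk '
  FNR == 1   { nfile++ }                            # a new input file starts
  nfile == 1 { gsub(/ /, ""); seen[$0]; next }      # file2.txt: store space-stripped lines
  nfile == 2 { if ($0 in seen) keep[FNR]; next }    # file1.txt: remember matching line numbers
  nfile == 3 && (FNR in keep)                       # file3.txt: print lines at those numbers
' file2.txt file1.txt file3.txt > file4.txt

cat file4.txt
```

Counting files this way sidesteps the regular-expression match on file names, so the command keeps working if the files are renamed.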

Answer 3

Walk through the original file, looking for the corresponding line in file2. When a line matches, print the corresponding line from file3.

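linenr=0
filternr=1
# Read file1.txt line by line; file2.txt holds the surviving lines in the same order.
while IFS= read -r line; do
   (( linenr = linenr + 1 ))
   # First space-separated word of the next unmatched file2.txt line.
   line2=$(sed -n "${filternr}p" file2.txt | cut -d" " -f1)
   # Skip the test once file2.txt is exhausted; an empty prefix would match every line.
   if [[ -n "${line2}" && "${line}" = "${line2}"* ]]; then
      (( filternr = filternr + 1 ))
      sed -n "${linenr}p" file3.txt
   fi
done < file1.txt > file4.txt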

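When the files are large (really, when file2 has many lines), you will want to change this solution so that sed does not scan file2 and file3 on every iteration. That solution is not as simple to write, understand, or maintain.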
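
One possible shape for such a single-pass version, offered as a sketch only: each file is opened once on its own file descriptor and read in lock-step, so nothing is rescanned. It keeps this answer's heuristic of comparing the first space-separated word of the file2.txt line with the start of the file1.txt line, and it assumes every file ends with a newline. Descriptor numbers 3 to 5 and the variable names are arbitrary choices.

```shell
# Recreate the sample files from the question in a scratch directory (illustrative).
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' 'XYZ与ABC' 'DFC什么' 'FBFBBBFde' 'warlaugh世界' > file1.txt
printf '%s\n' 'XYZ 与 ABC' 'warlaugh 世界' > file2.txt
printf '%s\n' 'XYZ with abc' 'DFC whatever' 'FBFBBBF' 'world of warlaugh' > file3.txt

exec 5< file2.txt                      # file2.txt stays open on descriptor 5
IFS= read -r filtered <&5 || true      # first surviving line (empty if there is none)

# file1.txt (fd 3) and file3.txt (fd 4) advance together, one line per iteration.
while IFS= read -r line1 <&3 && IFS= read -r line3 <&4; do
  [ -n "$filtered" ] || break          # nothing left to look for
  word=${filtered%% *}                 # first word of the current file2.txt line
  case $line1 in
    "$word"*)
      printf '%s\n' "$line3"
      IFS= read -r filtered <&5 || true
      ;;
  esac
done 3< file1.txt 4< file3.txt > file4.txt
exec 5<&-

cat file4.txt
```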

Answer 4

Reading each file only once can be done with diff and redirection of stdin.
This solution only works when you are sure the files contain no "|" character:

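#!/bin/bash

function mycheck {
   if [ -z "${filteredline}" ]; then
      exec 0<file2.txt
      read -r filteredline
   fi
   line2=${filteredline%% *}
   if [[ "${line}" = "${line2}"* ]]; then
      # Keep only the file3.txt half of the side-by-side diff line.
      printf '%s\n' "${line}" | sed 's/.*|\t//'
      read -r filteredline
      # file2.txt is exhausted: tell the caller to stop, since a break inside
      # a function does not end the caller's loop in bash 4.4 and later.
      [ -n "${filteredline}" ] || return 1
   fi
}

IFS="
"
for line in $(diff -y file1.txt file3.txt); do
   mycheck "${line}" || break
done > file4.txt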
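
If the "|" restriction is a concern, a different technique (paste rather than diff -y, so not this answer's method) can pair the aligned lines instead. The sketch below assumes that neither file1.txt nor file3.txt contains a tab character and that both have the same number of lines; keys.txt is a made-up helper file.

```shell
# Recreate the sample files from the question in a scratch directory (illustrative).
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' 'XYZ与ABC' 'DFC什么' 'FBFBBBFde' 'warlaugh世界' > file1.txt
printf '%s\n' 'XYZ 与 ABC' 'warlaugh 世界' > file2.txt
printf '%s\n' 'XYZ with abc' 'DFC whatever' 'FBFBBBF' 'world of warlaugh' > file3.txt

# Space-stripped file2.txt lines are the keys to look for.
sed 's/ //g' file2.txt > keys.txt

# paste joins line N of file1.txt and file3.txt with a tab; awk prints the
# file3.txt half whenever the file1.txt half is one of the keys.
paste file1.txt file3.txt |
  awk -F '\t' 'NR == FNR { keys[$0]; next } $1 in keys { print $2 }' keys.txt - > file4.txt

cat file4.txt
```

Each file is still read only once, and the match is on the whole stripped line rather than on its first word.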
