How to nested-loop over 2 files and compare columns without using `while read`?

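I have 2 files with the same structure. For lines that share the same value in a specific column, I want to output the columns that differ between the two files.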

#!/bin/bash
set -e

result_dir='/home/folder1'

#2 test files
cat << EOF > $result_dir/old
1 a /home
5 b /home/me
6 e /home/me/file 2
3 c /home/oth
EOF
cat << EOF > $result_dir/new
1 a /home
4 b /home/me
6 f /home/me/file 2
5 c /home/oth/file
EOF

#loop
changed=()
while read -r -u 5 OWNER GROUP FOLDER; do
    temp=''
    while read -r -u 6 OWNER_NEW GROUP_NEW FOLDER_NEW; do
    #exist in both old & new
    if [[ "$FOLDER" == "$FOLDER_NEW" ]]; then
        temp+=$FOLDER
        if [[ $OWNER != $OWNER_NEW ]]; then
        temp+=$sep$OWNER_NEW
        else
        temp+=$sep
        fi
        if [[ $GROUP != $GROUP_NEW ]]; then
        temp+=$sep$GROUP_NEW
        else
        temp+=$sep       
        fi
        #changed?
        if [[ "$(echo -e "${temp}" | sed -e 's/[[:space:]]*$//')" != $FOLDER ]] ; then
        changed+=($temp)
        fi
        break
    fi
    done 6<$result_dir/acl_folder_new

#old loop
done 5<$result_dir/acl_folder_old

echo -e "${changed[@]}"

The output is as follows:

/home/me 4
/home/me/file 2  f

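Everything works, but it is too slow when the files contain more than 10,000 lines (see post1, post2).

How can I nested-loop over 2 files and compare columns without using while read?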

You can use GAWK instead:

BEGIN {
   while ((getline < "old.txt") > 0) {
      owner = $1
      group = $2
      folder = $3
      old[folder]["owner"] = owner
      old[folder]["group"] = group
   }
   while ((getline < "new.txt") > 0) {
      owner = $1
      group = $2
      folder = $3
      if (folder in old) {
         if (owner != old[folder]["owner"] || group != old[folder]["group"]) {
            print
         }
      }
   }
}
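
The arrays of arrays used above (old[folder]["owner"]) are a gawk extension. As a rough sketch for other awks, the same lookup can be written with POSIX-style multi-dimensional subscripts. The function name compare_acl and its two file arguments are hypothetical, and like the script above it treats the folder as the single field $3, so a folder containing spaces is matched on its first word only.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not from the original answer): compare_acl OLD_FILE NEW_FILE
# Prints the lines of NEW_FILE whose folder ($3) also exists in OLD_FILE
# but whose owner ($1) or group ($2) differs.
compare_acl() {
    awk '
    NR == FNR {                      # first file: remember owner and group per folder
        old[$3, "owner"] = $1
        old[$3, "group"] = $2
        seen[$3]                     # referencing the element creates it
        next
    }
    # second file: print the line if the folder is known and something changed
    ($3 in seen) && ($1 != old[$3, "owner"] || $2 != old[$3, "group"])
    ' "$1" "$2"
}
```

It would be called as compare_acl old.txt new.txt. Passing both files as arguments lets awk do the reading, so no getline loop is needed.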

Or PHP:

<?php

foreach (file('old.txt', FILE_IGNORE_NEW_LINES) as $r) {
   $c = explode(' ', $r);
   $folder = $c[2];
   $old[$folder]['owner'] = $c[0];
   $old[$folder]['group'] = $c[1];
}

foreach (file('new.txt', FILE_IGNORE_NEW_LINES) as $r) {
   $c = explode(' ', $r);
   $owner = $c[0];
   $group = $c[1];
   $folder = $c[2];
   if (key_exists($folder, $old)) {
      if ($owner != $old[$folder]['owner'] || $group != $old[$folder]['group']) {
         echo $r, "\n";
      }
   }
}

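UPDATE: The OP recently commented that only mawk is available; I don't have access to mawk, so I'm not sure whether the following will work...

Assumptions:

  • two input files: old and new
  • both files have 3 (space-delimited) fields, which we'll label user, group and folder
  • while the sample data shows single-character user and group values, assume these values can be multiple characters
  • the user and group fields do not contain white space
  • the folder field can contain white space (e.g., /home/me/file 2)

Objectives:

  • if field #3 (folder) matches in both files, and ...
  • the user and/or group differ ...
  • print the folder name and the fields from the new file that differ; format: folder [user(new)] [group(new)]

Sample data: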

$ cat old
1 a /home
5 b /home/me
6 e /home/me/file 2
3 c /home/oth
9 X /home/both/are/diff er ent

$ cat new
1 a /home
4 b /home/me                                 # user is different
6 f /home/me/file 2                          # group is different
5 c /home/oth/file
124 long_group /home/both/are/diff er ent    # user and group are different

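NOTE: the comments shown in file new do not exist in the actual file; they are added here only to highlight what should be flagged as different.

One awk idea: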

awk '

FNR==NR { folder=""                             # file #1 processing
          for (i=3; i<NF; i++)
              folder=folder $(i) OFS
          folder=folder $(NF)

          user[folder]=$1
          group[folder]=$2
          next
        }

        { folder=""                             # file #2 processing
          for (i=3; i<NF; i++)
              folder=folder $(i) OFS
          folder=folder $(NF)

          output=folder

          if (folder in user) {
              if ( $1 != user[folder]  ) output=output OFS $1
              if ( $2 != group[folder] ) output=output OFS $2
          }

          if ( output != folder )        print output
        }
' old new

This generates:

/home/me 4
/home/me/file 2 f
/home/both/are/diff er ent 124 long_group

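The term "folder" comes from Windows; in Unix the equivalent is "directory". The following will handle spaces in your directory names (as you have with /home/me/file 2 in your sample input, although that alone is not adequate to test whether a given script handles them) and will work using any awk in any shell on every Unix box: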

$ cat tst.sh
#!/usr/bin/env bash

result_dir='/home/directory1'
mkdir -p "$result_dir" || exit

#2 test files
cat << EOF > "$result_dir/old"
1 a /home
5 b /home/me
6 e /home/me/file 2
3 c /home/oth
EOF

cat << EOF > "$result_dir/new"
1 a /home
4 b /home/me
6 f /home/me/file 2
5 c /home/oth/file
EOF

awk '
{
    match($0,/^([^ ]+ ){2}/)
    dir = substr($0,RLENGTH+1)
    $0 = substr($0,1,RLENGTH-1)
}
NR==FNR {
    olds[dir] = $0
    next
}
dir in olds {
    split(olds[dir],old)
    for (i=1; i<=NF; i++) {
        if ($i != old[i]) {
            print dir, $i
        }
    }
}
' "$result_dir/old" "$result_dir/new"

$ ./tst.sh
/home/me 4
/home/me/file 2 f
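
To illustrate the point about spaces in directory names, here is a small sketch using a made-up line whose path contains two consecutive spaces (this path is an assumption for the example, not part of the question's data). It contrasts rebuilding the path from fields, which collapses the run of spaces, with slicing it out of the record using match() and substr(), which keeps it intact. The regexp is written without an interval expression purely as a precaution for older awks.

```shell
#!/usr/bin/env bash

# Hypothetical input line: owner, group, then a path with two consecutive spaces.
line='9 X /home/two  spaces'

# Rebuilding the path from fields joins them with OFS (one space),
# so the original run of two spaces is lost.
rebuilt=$(printf '%s\n' "$line" |
    awk '{ d = $3; for (i = 4; i <= NF; i++) d = d OFS $i; print d }')

# Cutting the path out of the record after the first two fields
# preserves the original spacing exactly.
exact=$(printf '%s\n' "$line" |
    awk '{ match($0, /^[^ ]+ [^ ]+ /); print substr($0, RLENGTH + 1) }')

printf 'rebuilt: [%s]\n' "$rebuilt"
printf 'exact:   [%s]\n' "$exact"
```

A directory name that is rebuilt from fields would no longer match the real name on disk if it is later used as a path.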

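This is why you want to avoid the nested loops: for each line in acl_folder_old, you read and process the file acl_folder_new again from the start. If both files have 10,000 lines, then in the worst case you read 100,010,000 (= 10,000 + 10,000 * 10,000) lines in total, and on top of that you launch a sed process for every matching pair of lines. If you read each file only once, you read 20,000 lines in total. You are right to look for an awk solution.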
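
The line counts in the paragraph above can be reproduced with shell arithmetic. This is only a sketch of the worst case, assuming both files have the same number of lines n and that no inner loop ends early.

```shell
#!/usr/bin/env bash

# Assumption: both files contain n lines.
n=10000

# Nested loops: the old file is read once, and the new file is read
# in full for every line of the old file.
nested_reads=$(( n + n * n ))

# One pass over each file, with lookups done in memory.
single_pass_reads=$(( n + n ))

echo "nested loops: $nested_reads lines read"
echo "single pass:  $single_pass_reads lines read"
```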

awk will be faster than bash, but here is a bash solution for comparison. It requires bash 4.0+ for associative arrays:

#!/usr/bin/env bash

declare -A owners
declare -A groups

while read -r owner group path; do
    owners[$path]=$owner
    groups[$path]=$group
done < "$result_dir/acl_folder_old"

while read -r owner group path; do
    new_owner=""; new_group=""
    if [[ -n ${owners[$path]} ]]; then
        [[ $owner != "${owners[$path]}" ]] && new_owner=$owner
        [[ $group != "${groups[$path]}" ]] && new_group=$group
        if [[ -n $new_owner || -n $new_group ]]; then
            # using semicolon as the sep char
            printf '%s;%s;%s\n' "$path" "$new_owner" "$new_group"
        fi
    fi
done < "$result_dir/acl_folder_new"

Output:

/home/me;4;
/home/me/file 2;;f