How to loop over 2 files and compare columns without using nested `while read` loops?
I have 2 files with the same structure. For rows that share the same value in a specific column, output the columns that differ between the two files.
#!/bin/bash
set -e
result_dir='/home/folder1'
#2 test files
cat << EOF > $result_dir/old
1 a /home
5 b /home/me
6 e /home/me/file 2
3 c /home/oth
EOF
cat << EOF > $result_dir/new
1 a /home
4 b /home/me
6 f /home/me/file 2
5 c /home/oth/file
EOF
#loop
changed=()
while read -r -u 5 OWNER GROUP FOLDER; do
    temp=''
    while read -r -u 6 OWNER_NEW GROUP_NEW FOLDER_NEW; do
        #exist in both old & new
        if [[ "$FOLDER" == "$FOLDER_NEW" ]]; then
            temp+=$FOLDER
            if [[ $OWNER != $OWNER_NEW ]]; then
                temp+=$sep$OWNER_NEW
            else
                temp+=$sep
            fi
            if [[ $GROUP != $GROUP_NEW ]]; then
                temp+=$sep$GROUP_NEW
            else
                temp+=$sep
            fi
            #changed?
            if [[ "$(echo -e "${temp}" | sed -e 's/[[:space:]]*$//')" != $FOLDER ]] ; then
                changed+=($temp)
            fi
            break
        fi
    done 6<$result_dir/acl_folder_new
#old loop
done 5<$result_dir/acl_folder_old
echo -e "${changed[@]}"
The output is as follows:
/home/me 4
/home/me/file 2 f
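Everything works, but it is too slow when the files contain more than 10,000 lines (post1, post2).
How can I loop over the 2 files and compare columns without using nested `while read` loops?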
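You can use GAWK instead:
BEGIN {
    # Load the old file, keyed by folder (arrays of arrays are a gawk extension).
    while ((getline < "old.txt") > 0) {
        owner = $1
        group = $2
        folder = $3    # only the first space-separated word of the path
        old[folder]["owner"] = owner
        old[folder]["group"] = group
    }
    # Print each line of the new file whose owner or group has changed.
    while ((getline < "new.txt") > 0) {
        owner = $1
        group = $2
        folder = $3
        if (folder in old) {
            if (owner != old[folder]["owner"] || group != old[folder]["group"]) {
                print
            }
        }
    }
}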
Or PHP:
<?php
// Load the old file, keyed by folder.
$old = [];
foreach (file('old.txt', FILE_IGNORE_NEW_LINES) as $r) {
    $c = explode(' ', $r);
    $folder = $c[2];
    $old[$folder]['owner'] = $c[0];
    $old[$folder]['group'] = $c[1];
}
// Print each line of the new file whose owner or group has changed.
foreach (file('new.txt', FILE_IGNORE_NEW_LINES) as $r) {
    $c = explode(' ', $r);
    $owner = $c[0];
    $group = $c[1];
    $folder = $c[2];
    if (array_key_exists($folder, $old)) {
        if ($owner != $old[$folder]['owner'] || $group != $old[$folder]['group']) {
            echo $r, "\n";
        }
    }
}
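The GAWK version above relies on arrays of arrays (`old[folder]["owner"]`), which is a gawk extension. As a rough sketch of the same lookup in plain POSIX awk, comma subscripts (joined internally with `SUBSEP`) can stand in for the nested arrays. The wrapper name `compare_acl` and the helper array `seen` are made up for this illustration.

```shell
#!/usr/bin/env bash
# compare_acl OLD NEW: hypothetical wrapper name, not from the answer above.
# Prints the lines of NEW whose folder (field 3) also appears in OLD
# with a different owner or group.
compare_acl() {
    awk '
        NR == FNR {                    # first file: remember owner and group
            old[$3, "owner"] = $1      # comma subscripts are joined with SUBSEP
            old[$3, "group"] = $2
            seen[$3]                   # marks the folder as present in OLD
            next
        }
        # second file: a pattern with no action prints the matching line
        ($3 in seen) && ($1 != old[$3, "owner"] || $2 != old[$3, "group"])
    ' "$1" "$2"
}
```

Called as `compare_acl old.txt new.txt`, it should print the same lines as the GAWK script. Like that script, it keys on `$3` only, so a path containing spaces is identified by its first word.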
Update: OP recently commented that only `mawk` is available; I don't have access to `mawk`, so I'm not sure whether the following works...
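Assumptions:
- two input files: `old` and `new`
- both files have 3 (space-delimited) fields, which we'll label `user`, `group` and `folder`
- while the sample data shows single-character `user` and `group` values, assume these values could be multiple characters
- the `user` and `group` fields do not contain white space
- the `folder` field can contain white space (e.g., `/home/me/file 2`)
Objectives:
- if field #3 (`folder`) matches in both files, and ...
- `user` and/or `group` are different, then ...
- print the `folder` name and the field(s) that differ, taken from the `new` file; format: folder [user(new)] [group(new)]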
Sample data:
$ cat old
1 a /home
5 b /home/me
6 e /home/me/file 2
3 c /home/oth
9 X /home/both/are/diff er ent
$ cat new
1 a /home
4 b /home/me # user is different
6 f /home/me/file 2 # group is different
5 c /home/oth/file
124 long_group /home/both/are/diff er ent # user and group are different
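NOTE: the comments in file `new` are not part of the actual file; they are added here only to highlight what should be flagged as different
One `awk` idea: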
awk '
FNR==NR { folder=""                      # file #1 processing
          for (i=3; i<NF; i++)           # rebuild folder from fields 3..NF
              folder=folder $(i) OFS
          folder=folder $(NF)
          user[folder]=$1
          group[folder]=$2
          next
        }
        { folder=""                      # file #2 processing
          for (i=3; i<NF; i++)
              folder=folder $(i) OFS
          folder=folder $(NF)
          output=folder
          if (folder in user) {
              if ( $1 != user[folder]  ) output=output OFS $1
              if ( $2 != group[folder] ) output=output OFS $2
          }
          if ( output != folder ) print output
        }
' old new
This generates:
/home/me 4
/home/me/file 2 f
/home/both/are/diff er ent 124 long_group
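The term "folder" comes from Windows; in Unix the equivalent is "directory". The following will handle spaces in your directory names (as in the `/home/me/file 2` of your sample input, though that one case is not enough to test whether a given script handles them) and will work using any awk in any shell on every Unix box: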
$ cat tst.sh
#!/usr/bin/env bash
result_dir='/home/directory1'
mkdir -p "$result_dir" || exit
#2 test files
cat << EOF > "$result_dir/old"
1 a /home
5 b /home/me
6 e /home/me/file 2
3 c /home/oth
EOF
cat << EOF > "$result_dir/new"
1 a /home
4 b /home/me
6 f /home/me/file 2
5 c /home/oth/file
EOF
awk '
{
    # split off the directory (which may contain spaces) and keep only
    # the first two fields in $0
    match($0,/^([^ ]+ ){2}/)
    dir = substr($0,RLENGTH+1)
    $0 = substr($0,1,RLENGTH-1)
}
NR==FNR {
    olds[dir] = $0
    next
}
dir in olds {
    split(olds[dir],old)
    for (i=1; i<=NF; i++) {
        if ($i != old[i]) {
            print dir, $i
        }
    }
}
' "$result_dir/old" "$result_dir/new"
$ ./tst.sh
/home/me 4
/home/me/file 2 f
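To see what the first block of `tst.sh` does to a record, the sketch below runs the same `match()`/`substr()` steps on a single line and prints the pieces. `split_record` is a made-up name. The regexp is spelled out as `[^ ]+ [^ ]+ ` rather than `([^ ]+ ){2}` on the assumption that some older `mawk` builds do not support brace intervals; that is worth verifying on your own system.

```shell
#!/usr/bin/env bash
# split_record: hypothetical helper that reads "user group dir" lines on
# stdin and shows how each one is split. The pattern is written without
# the {2} interval so it does not depend on interval support.
split_record() {
    awk '{
        match($0, /^[^ ]+ [^ ]+ /)          # matches "user group " at the start
        dir = substr($0, RLENGTH + 1)       # everything after that prefix
        $0  = substr($0, 1, RLENGTH - 1)    # keep "user group" and re-split it
        printf "dir=[%s] fields=[%s] NF=%d\n", dir, $0, NF
    }'
}

printf '%s\n' '6 e /home/me/file 2' | split_record
# prints: dir=[/home/me/file 2] fields=[6 e] NF=2
```

Because the directory is taken with `substr()` instead of being rebuilt field by field, any run of spaces inside it is kept exactly as it appears in the input.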
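This is why you want to avoid nested loops: for every line in acl_folder_old, you read and process acl_folder_new again. If both files have 10,000 lines, you read up to 100,010,000 (= 10,000 + 10,000 * 10,000) lines in total in the worst case, where no folder matches and the inner loop never reaches `break`, plus you launch one `sed` process for each folder that does match. If you read each file only once, you read 20,000 lines in total. You're right to look for an awk solution.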
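To make the arithmetic concrete, here is a small sketch that computes both totals for files of n and m lines. The function names `nested_reads` and `single_pass_reads` are made up for this illustration, and the nested figure is the worst case, where the inner loop never stops early.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the original script.

# Worst case for the nested loops: read the old file once (n lines) and
# re-read the whole new file (m lines) for each of those n lines.
nested_reads() {
    local n=$1 m=$2
    echo $(( n + n * m ))
}

# One pass over each file, as the awk solutions do.
single_pass_reads() {
    local n=$1 m=$2
    echo $(( n + m ))
}

nested_reads 10000 10000        # prints 100010000
single_pass_reads 10000 10000   # prints 20000
```

The first total grows with the product of the two file sizes and the second with their sum, which is why doubling both files roughly quadruples the nested-loop work but only doubles the single-pass work.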
awk will be faster than bash, but here is a bash solution for comparison. It requires bash 4.0+ for associative arrays:
#!/usr/bin/env bash
declare -A owners
declare -A groups
while read -r owner group path; do
    owners[$path]=$owner
    groups[$path]=$group
done < "$result_dir/acl_folder_old"
while read -r owner group path; do
    new_owner=""; new_group=""
    if [[ -n ${owners[$path]} ]]; then
        [[ $owner != "${owners[$path]}" ]] && new_owner=$owner
        [[ $group != "${groups[$path]}" ]] && new_group=$group
        if [[ -n $new_owner || -n $new_group ]]; then
            # using semicolon as the sep char
            printf '%s;%s;%s\n' "$path" "$new_owner" "$new_group"
        fi
    fi
done < "$result_dir/acl_folder_new"
Output:
/home/me;4;
/home/me/file 2;;f
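To compare the approaches on inputs of the size mentioned in the question, one option is to generate synthetic files. The sketch below is only an illustration: `make_acl_file`, the `g<N>` group names and the `/home/dir <N>` paths are invented here and do not come from the original data.

```shell
#!/usr/bin/env bash
# make_acl_file COUNT OFFSET: hypothetical generator for synthetic test data.
# Prints COUNT lines of "owner group path". The owner is (i + OFFSET) % 7,
# so two files built with different offsets share every path but disagree
# on the owner column. Paths contain a space on purpose.
make_acl_file() {
    awk -v n="$1" -v off="$2" 'BEGIN {
        for (i = 1; i <= n; i++)
            printf "%d g%d /home/dir %d\n", (i + off) % 7, i % 5, i
    }'
}

make_acl_file 3 0
# prints:
# 1 g1 /home/dir 1
# 2 g2 /home/dir 2
# 3 g3 /home/dir 3
```

For a timing run, something like `make_acl_file 10000 0 > "$result_dir/acl_folder_old"` and `make_acl_file 10000 1 > "$result_dir/acl_folder_new"` would produce two 10,000-line files, after which each script can be run under `time`.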