How to enhance a script for look ups from multiple CSV files
I need to enhance the script below. It takes an input file with nearly a million lines; for each of those lines there are matching values in 3 lookup files that I want to append to my output as comma-separated values.
The script below works, but it takes several hours to finish. I'm looking for a solution that is really fast and also puts less load on the system.
#!/bin/bash
while read -r ONT
do
{
  ONSTATUS=$(grep "$ONT," lookupfile1.csv | cut -d" " -f2)
  CID=$(grep "$ONT." lookupfile3.csv | head -1 | cut -d, -f2)
  line1=$(grep "$ONT.C2.P1," lookupfile2.csv | head -1 | cut -d"," -f2,7 | sed 's/ //')
  line2=$(grep "$ONT.C2.P2," lookupfile2.csv | head -1 | cut -d"," -f2,7 | sed 's/ //')
  echo "$ONT,$ONSTATUS,$CID,$line1,$line2" >> BUwithPO.csv
} &
done < inputfile.csv
inputfile.csv contains lines like these:
343OL5:LT1.PN1.ONT1
343OL5:LT1.PN1.ONT10
225OL0:LT1.PN1.ONT34
225OL0:LT1.PN1.ONT39
343OL5:LT1.PN1.ONT100
225OL0:LT1.PN1.ONT57
lookupfile1.csv contains:
343OL5:LT1.PN1.ONT100, Down,Locked,No
225OL0:LT1.PN1.ONT57, Up,Unlocked,Yes
343OL5:LT1.PN1.ONT1, Down,Unlocked,No
225OL0:LT1.PN1.ONT34, Up,Unlocked,Yes
225OL0:LT1.PN1.ONT39, Up,Unlocked,Yes
lookupfile2.csv contains:
225OL0:LT1.PN1.ONT34.C2.P1, +123125302766,REG,DigitMap,Unlocked,_media_BNT,FD_BSFU.xml,
225OL0:LT1.PN1.ONT57.C2.P1, +123125334019,REG,DigitMap,Unlocked,_media_BNT,FD_BSFU.xml,
225OL0:LT1.PN1.ONT57.C2.P2, +123125334819,REG,DigitMap,Unlocked,_media_BNT,FD_BSFU.xml,
343OL5:LT1.PN1.ONT100.C2.P11, +123128994019,REG,DigitMap,Unlocked,_media_ANT,FD_BSFU.xml,
lookupfile3.csv contains:
343OL5:LT1.PON1.ONT100.SERV1,12-654-0330
343OL5:LT1.PON1.ONT100.C1.P1,12-654-0330
343OL5:LT7.PON8.ONT75.SERV1,12-664-1186
225OL0:LT1.PN1.ONT34.C1.P1.FLOW1,12-530-2766
225OL0:LT1.PN1.ONT57.C1.P1.FLOW1,12-533-4019
The output is:
225OL0:LT1.PN1.ONT57, Up,Unlocked,Yes,12-533-4019,+123125334019,FD_BSFU.xml,+123125334819,FD_BSFU.xml
225OL0:LT1.PN1.ONT34, Up,Unlocked,Yes,12-530-2766,+123125302766,FD_BSFU.xml,
343OL5:LT1.PN1.ONT1, Down,Unlocked,No,,,
343OL5:LT1.PN1.ONT100, Down,Locked,No,,,
343OL5:LT1.PN1.ONT10,,,,
225OL0:LT1.PN1.ONT39, Up,Unlocked,Yes,,,
As you can see, the bottleneck is that grep is executed many times inside the loop. You can make this much more efficient by building lookup tables with associative arrays.
If awk is available, please try the following:
[Update]
#!/bin/bash
awk '
FILENAME=="lookupfile1.csv" {
sub(",$", "", );
onstatus[] =
}
FILENAME=="lookupfile2.csv" {
split(, a, ",")
if (sub("\.C2\.P1,$", "", )) line1[] = a[1]","a[6]
else if (sub("\.C2\.P2,$", "", )) line2[] = a[1]","a[6]
}
FILENAME=="lookupfile3.csv" {
split([=10=], a, ",")
if (match(a[1], ".+\.ONT[0-9]+")) {
ont = substr(a[1], RSTART, RLENGTH)
cid[ont] = a[2]
}
}
FILENAME=="inputfile.csv" {
print [=10=]","onstatus[[=10=]]","cid[[=10=]]","line1[[=10=]]","line2[[=10=]]
}
' lookupfile1.csv lookupfile2.csv lookupfile3.csv inputfile.csv > BUwithPO.csv
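For comparison only, and not part of the original answer: the same associative-array idea can also be expressed in pure bash (4+), which still makes a single pass over each file and avoids spawning grep once per input line. The variable and array names below are my own sketch based on the sample data, so treat it as an illustration rather than a drop-in replacement.
#!/bin/bash
# Rough pure-bash sketch of the lookup-table idea (requires bash 4+ associative arrays).
# Slower than awk on ~1M lines, but far faster than grep-per-line.
declare -A onstatus cid line1 line2

# lookupfile1.csv: key is everything before the first comma
while IFS= read -r l; do
  key=${l%%,*}
  onstatus[$key]=${l#*, }        # everything after the first ", "
done < lookupfile1.csv

# lookupfile2.csv: keep the phone number (field 2) and the xml file (field 7)
while IFS=, read -r key phone _ _ _ _ fdfile _; do
  case $key in
    *.C2.P1) line1[${key%.C2.P1}]="${phone# },$fdfile" ;;
    *.C2.P2) line2[${key%.C2.P2}]="${phone# },$fdfile" ;;
  esac
done < lookupfile2.csv

# lookupfile3.csv: key is the part of field 1 up to and including ".ONT<digits>"
while IFS=, read -r key value _; do
  [[ $key =~ ^(.+\.ONT[0-9]+) ]] && cid[${BASH_REMATCH[1]}]=$value
done < lookupfile3.csv

# single pass over the input, joining the record from the lookup tables
while IFS= read -r ont; do
  printf '%s,%s,%s,%s,%s\n' \
    "$ont" "${onstatus[$ont]}" "${cid[$ont]}" "${line1[$ont]}" "${line2[$ont]}"
done < inputfile.csv > BUwithPO.csv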
[Edit]
If you need to specify absolute paths to the files, try:
#!/bin/bash
awk '
FILENAME ~ /lookupfile1.csv$/ {
    sub(",$", "", $1);
    onstatus[$1] = $2
}
FILENAME ~ /lookupfile2.csv$/ {
    split($2, a, ",")
    if (sub("\\.C2\\.P1,$", "", $1)) line1[$1] = a[1]","a[6]
    else if (sub("\\.C2\\.P2,$", "", $1)) line2[$1] = a[1]","a[6]
}
FILENAME ~ /lookupfile3.csv$/ {
    split($0, a, ",")
    if (match(a[1], ".+\\.ONT[0-9]+")) {
        ont = substr(a[1], RSTART, RLENGTH)
        cid[ont] = a[2]
    }
}
FILENAME ~ /inputfile.csv$/ {
    print $0","onstatus[$0]","cid[$0]","line1[$0]","line2[$0]
}
' /path/to/lookupfile1.csv /path/to/lookupfile2.csv /path/to/lookupfile3.csv /path/to/inputfile.csv > /path/to/BUwithPO.csv
Hope this helps.
If, as you indicated in the comments, you cannot use the solution provided by @tshiono because your awk lacks the gensub function that GNU awk provides, you can replace the gensub call with two calls to sub, using a temporary variable to trim off the unwanted suffix.
Example:
awk '
FILENAME=="lookupfile1.csv" {
sub(",$", "", );
onstatus[] =
}
FILENAME=="lookupfile2.csv" {
split(, a, ",")
if (sub("\.C2\.P1,$", "", )) line1[] = a[1]","a[6]
else if (sub("\.C2\.P2,$", "", )) line2[] = a[1]","a[6]
}
FILENAME=="lookupfile3.csv" {
split([=10=], a, ",")
# ont = gensub("(\.ONT[0-9]+).*", "\1", 1, a[1])
sfx = a[1]
sub(/^.*[.]ONT[^.]*/, "", sfx)
sub(sfx, "", a[1])
# cid[ont] = a[2]
cid[a[1]] = a[2]
}
FILENAME=="inputfile.csv" {
print [=10=]","onstatus[[=10=]]","cid[[=10=]]","line1[[=10=]]","line2[[=10=]]
}
' lookupfile1.csv lookupfile2.csv lookupfile3.csv inputfile.csv > BUwithPO.csv
I commented out the use of gensub in the section related to FILENAME=="lookupfile3.csv" and replaced the gensub expression with two calls to sub, using sfx (the suffix) as a temporary variable.
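For example, applied to the first line of lookupfile3.csv, the two sub calls behave like this (a standalone illustration using echo, not part of the answer's script):
echo '343OL5:LT1.PON1.ONT100.SERV1,12-654-0330' | awk '{
    split($0, a, ",")
    sfx = a[1]
    sub(/^.*[.]ONT[^.]*/, "", sfx)   # sfx is now ".SERV1"
    sub(sfx, "", a[1])               # a[1] is now "343OL5:LT1.PON1.ONT100"
    print a[1] " -> " a[2]           # prints: 343OL5:LT1.PON1.ONT100 -> 12-654-0330
}'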
Give it a try and let me know if it works for you.
Perl solution
The following script is similar to the awk solution, but written in Perl.
Save it as filter.pl and make it executable.
#!/usr/bin/env perl
use strict;
use warnings;
my %lookup1;
my %lookup2_1;
my %lookup2_2;
my %lookup3;
while( <> ) {
    if ( $ARGV eq 'lookupfile1.csv' ) {
        # 225OL0:LT1.PN1.ONT34, Up,Unlocked,Yes
        # ^^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^
        if (/^([^,]+),\s*(.*)$/) {
            $lookup1{$1} = $2;
        }
    } elsif ( $ARGV eq 'lookupfile2.csv' ) {
        # 225OL0:LT1.PN1.ONT34.C2.P1, +123125302766,REG,DigitMap,Unlocked,_media_BNT,FD_BSFU.xml,
        # ^^^^^^^^^^^^^^^^^^^^         ^^^^^^^^^^^^^                                 ^^^^^^^^^^^
        if (/^(.+ONT\d+)\.C2\.P1,\s*([^,]+),(?:[^,]+,){4}([^,]+)/) {
            $lookup2_1{$1} = "$2,$3";
        } elsif (/^(.+ONT\d+)\.C2\.P2,\s*([^,]+),(?:[^,]+,){4}([^,]+)/) {
            $lookup2_2{$1} = "$2,$3";
        }
    } elsif ( $ARGV eq 'lookupfile3.csv' ) {
        # 225OL0:LT1.PN1.ONT34.C1.P1.FLOW1,12-530-2766
        # ^^^^^^^^^^^^^^^^^^^^             ^^^^^^^^^^^
        if (/^(.+ONT\d+)[^,]+,\s*(.*)$/) {
            $lookup3{$1} = $2;
        }
    } else { # assume 'inputfile.csv'
        no warnings 'uninitialized';  # because not all keys ($_) have values in the lookup tables
        # 225OL0:LT1.PN1.ONT34
        chomp;
        print "$_,$lookup1{$_},$lookup3{$_},$lookup2_1{$_},$lookup2_2{$_}\n";
    }
}
Execute it like this:
./filter.pl lookupfile{1,2,3}.csv inputfile.csv > BUwithPO.csv
It is important that the lookup files come first (as in the awk solution, by the way), because they build the four dictionaries (hashes in Perl parlance) %lookup1, %lookup2_1, and so on.
The values from inputfile.csv are then matched against those dictionaries.
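As a quick sanity check of that ordering requirement (my own illustration, not part of the answer): if inputfile.csv is passed first, the hashes are still empty when its lines are printed, so every lookup column comes out blank.
./filter.pl inputfile.csv lookupfile{1,2,3}.csv | head -n 2
# 343OL5:LT1.PN1.ONT1,,,,
# 343OL5:LT1.PN1.ONT10,,,,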