Reduce processing time for 'while read' loop
New to shell scripting..
I have a huge CSV file whose 11th field (f11) varies in length, e.g.
"000000aaad000000bhb200000uwwed..."
"000000aba200000bbrb2000000wwqr00000caba2000000bhbd000000qwew..."
.
.
After splitting each string into chunks of size 10, I need characters 6-9 of every chunk. Then I have to join them with the delimiter '|', like
0aaa|0bhb|uwwe...
0aba|bbrb|0wwq|caba|0bhb|0qwe...
and merge the processed f11 back with the other fields.
Here is the time taken to process 10k records ->
real 4m43.506s
user 0m12.366s
sys 0m12.131s
20K records ->
real 5m20.244s
user 2m21.591s
sys 3m20.042s
80K records (~3.7 million f11 splits and '|' joins) ->
real 21m18.854s
user 9m41.944s
sys 13m29.019s
I am expecting to process 650K records (roughly 56 million f11 splits and joins) within 30 minutes. Is there any way to optimize this?
while read -r line1; do
    f10=$( echo $line1 | cut -d',' -f1,2,3,4,5,7,9,10)
    echo $f10 >> $path/other_fields
    f11=$( echo $line1 | cut -d',' -f11 )
    f11_trim=$(echo "$f11" | tr -d '"')
    echo $f11_trim | fold -w10 > $path/f11_extract
    cat $path/f11_extract | awk '{print $1}' | cut -c6-9 >> $path/str_list_trim
    arr=($(cat $path/str_list_trim))
    printf "%s|" ${arr[@]} >> $path/str_list_serialized
    printf '\n' >> $path/str_list_serialized
    arr=()
    rm $path/f11_extract
    rm $path/str_list_trim
done < $input
sed -i 's/.$//' $path/str_list_serialized
sed -i 's/\(.*\)/"\1"/g' $path/str_list_serialized
paste -d "," $path/other_fields $path/str_list_serialized > $path/final_out
Your code is inefficient for the following reasons:
- It invokes multiple external commands, including awk, inside the loop.
- It creates many intermediate temporary files.
You can do the whole job with awk alone:
awk -F, -v OFS="," '                             # assign input/output field separators to a comma
{
    len = length($11)                            # length of the 11th field
    s = ""; d = ""                               # clear the output string and the delimiter
    for (i = 1; i <= len / 10; i++) {            # iterate over the 10-character chunks of the 11th field
        s = s d substr($11, (i - 1) * 10 + 6, 4) # append characters 6-9 of the current chunk
        d = "|"                                  # set the delimiter to a pipe character
    }
    $11 = "\"" s "\""                            # assign the generated string back to the 11th field
} 1' "$input"                                    # the final "1" tells awk to print all fields
Sample input:
1,2,3,4,5,6,7,8,9,10,000000aaad000000bhb200000uwwed
1,2,3,4,5,6,7,8,9,10,000000aba200000bbrb2000000wwqr00000caba2000000bhbd000000qwew
Output:
1,2,3,4,5,6,7,8,9,10,"0aaa|0bhb|uwwe"
1,2,3,4,5,6,7,8,9,10,"0aba|bbrb|0wwq|caba|0bhb|0qwe"