How to speed up reading a txt file in a bash script
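I am building a script that reads the temperature data for the 24 hours of each day and extracts a smaller longitude/latitude region from it. Each data file has three columns (temperature, longitude, latitude) and 188426 rows.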
> ==> 20120810234500.txt <==
> 0.0362,-12.5000,33.5000
> -0.0188,-12.5000,33.5400
> -0.0732,-12.5000,33.5800
> -0.1263,-12.5000,33.6200
> -0.1778,-12.5000,33.6600
> -0.2278,-12.5000,33.7000
> -0.2761,-12.5000,33.7400
> -0.3226,-12.5000,33.7800
> -0.3677,-12.5000,33.8200
> -0.4115,-12.5000,33.8600
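I have used for and while loops together with awk commands to read the data, but reading, extracting, and writing the new, smaller files takes a long time (at least for me). Here is the relevant part of the script: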
# Start 24 hours loop
# Region bounds: longitude lon1..lon2, latitude lat1..lat2
lon1=-3
lon2=3
lat1=35
lat2=42
nhoras=24
n=1
while [ $n -le $nhoras ]
do
    # File name (nom_file) and length (nstation=188426)
    nom_file=`awk -v i=$n 'BEGIN { FS = ","} NR==i { print }' lista_datos.txt`
    nstation=`awk 'END{print NR}' $nom_file`
    # Original data came from windows system and has carriage returns
    dos2unix -q $nom_file
    # Date, time values from file name
    year=`echo $nom_file | cut -c 1-4`
    month=`echo $nom_file | cut -c 5-6`
    day=`echo $nom_file | cut -c 7-8`
    hour=`echo $nom_file | cut -c 9-14`
    # Part of the string to write in the new smaller file
    var1=`echo $nom_file | awk '{print substr($0,1,4) " " substr($0,5,2) " " substr($0,7,2) " " substr($0,9,6)}'`
    # Read rows 65000 to 125000 to gain processing time
    m=65000
    #while [ $m -le $nstation ] # Data extraction loop
    while [ $m -le 125000 ] # Data extraction loop
    do
        station_id=$m
        elevation=1.5
        # Columns are temperature,longitude,latitude
        lat=`awk -v i=$m 'BEGIN { FS = ","} NR==i { print $3 }' $nom_file`
        lon=`awk -v i=$m 'BEGIN { FS = ","} NR==i { print $2 }' $nom_file`
        # As lon/lat are floating point I use this workaround to get a smaller region
        lom1=`echo $lon'>'$lon1 | bc -l`
        lom2=`echo $lon'<'$lon2 | bc -l`
        lam1=`echo $lat'>'$lat1 | bc -l`
        lam2=`echo $lat'<'$lat2 | bc -l`
        if [ $lom1 -eq 1 ] && [ $lom2 -eq 1 ];
        then
            if [ $lam1 -eq 1 ] && [ $lam2 -eq 1 ];
            then
                # Second part of the string to write in the new smaller file:
                # station_id lat lon elevation 000 temperature 000
                var2=`awk -v i=$m -v e=$elevation 'BEGIN { FS = ","} NR==i { print "'${station_id}' " $3 " " $2 " '${elevation}' 000 " $1 " 000" }' $nom_file`
                # Paste
                paste <(echo "$var1") <(echo "$var2") -d ' ' >> out.txt
            fi # end of lat condition
        fi # end of lon condition
        m=$(( $m + 1 ))
    done # End of extracting loop
    # Save results
    cat cabecera-dp-s.txt out.txt > dp-s$year-$month-$day-$hour
    rm out.txt
    n=$(( $n + 1 ))
done # End 24 hours loop
Processing one input file currently takes two hours. Is there any way to speed this up?
Thanks in advance.
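Thanks for all the comments, and special thanks to @fedorqui.
Using awk properly has improved the processing speed significantly. My first attempt took 2 hours to process one file; now 24 files are processed in 93 minutes. There should still be room for improvement, but this is good enough for me for now. Thanks again.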
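The change credited here, filtering by region in a single awk pass instead of running one awk call per row, can be sketched in isolation as follows. This is only a minimal sketch: it assumes the temperature,longitude,latitude column order of the sample file, and the function name `filter_region` and its argument order are made up for illustration.

```shell
# Hypothetical helper (not part of the original script): print the rows of a
# temperature,longitude,latitude CSV whose longitude and latitude fall inside
# the given bounds (inclusive).
# Usage: filter_region FILE LON_MIN LON_MAX LAT_MIN LAT_MAX
filter_region() {
    awk -F, -v lon1="$2" -v lon2="$3" -v lat1="$4" -v lat2="$5" '
        { sub(/\r$/, "") }    # drop a trailing carriage return, if any
        # A pattern with no action prints every matching row
        $2 >= lon1 && $2 <= lon2 && $3 >= lat1 && $3 <= lat2
    ' "$1"
}
```

Called as `filter_region 20120810234500.txt -3 3 35 42`, it reads the file once and lets awk do the floating-point comparisons, so neither bc nor a shell loop over rows is needed.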
I am attaching the script in case it is useful to someone:
#!/bin/bash
# PATHS
base=/home/meteo/PROJECTES/TERMED
dades=$base/DADES
files=$base/FILES
msg_data=$dades/MSG/Agosto
treball=$base/TREBALL
# START OF SCRIPT
# Abort if the working directory is missing, so that rm never runs elsewhere
cd $treball || exit 1
rm *
# Header for final output
cp $files/cabecera-dp-s.txt ./
# Start of day loop
for dia in 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
do
    cp $msg_data/$dia/* ./
    ls 2*.txt > lista_datos.txt
    awk '{print substr($0,9,6)}' lista_datos.txt > lista_horas.txt
    nhoras=`awk 'END{print NR}' lista_horas.txt`
    # Start of hour loop
    n=1
    while [ $n -le $nhoras ]
    do
        # File name and size
        nom_file=`awk -v i=$n 'BEGIN { FS = ","} NR==i { print }' lista_datos.txt`
        nstation=`awk 'END{print NR}' $nom_file`
        # avoid carriage returns
        dos2unix -q $nom_file
        # Date values
        year=`echo $nom_file | cut -c 1-4`
        month=`echo $nom_file | cut -c 5-6`
        day=`echo $nom_file | cut -c 7-8`
        hour=`echo $nom_file | cut -c 9-14`
        # Extract region (columns: temperature,longitude,latitude), thanks fedorqui
        awk -F, '$2>=-3 && $2<=3 && $3>=35 && $3<=42' $nom_file > output-$year$month$day$hour.txt
        # Part 1 of the RAMS data line
        var1=`echo $nom_file | awk '{print substr($0,1,4) " " substr($0,5,2) " " substr($0,7,2) " " substr($0,9,6)}'`
        # station_id, latitude, longitude, elevation and temperature for each point
        m=1
        nstation=`awk 'END{print NR}' output-$year$month$day$hour.txt`
        while [ $m -le $nstation ] # Data extraction loop
        do
            station_id=$m
            elevation=1.5
            # Part 2 of the RAMS data line
            var2=`awk -v i=$m -v e=$elevation 'BEGIN { FS = ","} NR==i { print "'${station_id}' " $3 " " $2 " '${elevation}' 000 " $1 " 000" }' output-$year$month$day$hour.txt`
            # Join the two parts to build the data line
            paste <(echo "$var1") <(echo "$var2") -d ' ' >> out.txt
            m=$(( $m + 1 ))
        done # End of data extraction loop
        # Save the output with the RAMS format and name
        cat cabecera-dp-s.txt out.txt > dp-s$year-$month-$day-$hour
        n=$(( $n + 1 ))
        rm out.txt
    done # End of hour loop
    # Delete the day's data to avoid conflicts with lista_horas, lista_datos
    # (the header cabecera-dp-s.txt is kept for the next day)
    rm 2*.txt output-*.txt lista_datos.txt lista_horas.txt
done # End of day loop
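As for the remaining room for improvement: the inner while loop still starts one awk and one paste per output row. A possible next step, sketched here under assumptions, is to filter and format in the same awk pass. The function name `format_region` is hypothetical, the bounds and the elevation are the ones hard-coded in the script above, and the output column order (station_id, latitude, longitude, elevation, 000, temperature, 000) is taken from the script's comments rather than from a RAMS specification.

```shell
# Hypothetical helper (not part of the original script): filter one data file
# and format the surviving rows in a single awk pass.
# Usage: format_region FILE   (FILE is named YYYYMMDDHHMMSS.txt)
format_region() {
    awk -F, -v name="$(basename "$1")" -v elevation=1.5 '
        BEGIN {
            # Date prefix taken from the file name: "YYYY MM DD HHMMSS"
            stamp = substr(name, 1, 4) " " substr(name, 5, 2) " " \
                    substr(name, 7, 2) " " substr(name, 9, 6)
        }
        { sub(/\r$/, "") }    # tolerate Windows line endings
        $2 >= -3 && $2 <= 3 && $3 >= 35 && $3 <= 42 {
            id++              # station_id counts matching rows only
            print stamp, id, $3, $2, elevation, "000", $1, "000"
        }
    ' "$1"
}
```

With a helper like this, the body of the hour loop could shrink to something like `{ cat cabecera-dp-s.txt; format_region "$nom_file"; } > "dp-s$year-$month-$day-$hour"`. Each input file is then read once, and the out.txt and output-*.txt intermediates are no longer needed.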