Improve Python CSV processing loop
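I have a csv file from which I need to export several columns of data. Column 4 (counting from zero, i.e. the Linename field) contains the line name. Whenever the value in this column changes, the data I am exporting needs to be written to a new, separate file. The following code works, but it is slow. Any hints on how to improve this?
Addition: sample data: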
Altitude,Date,Db,Depth,Linename,Qmag,TF,TF_HP,X,X_ob,X_org,Y,Y_ob,Y_org
10.87,10/2/2015,148,21.8342,10,1316,48831.936,0.060026123,506479.5515,506479.46,506479.46,5726744.3,5726743.73,5726743.73
10.84,10/2/2015,148,21.8342,11,1316,48831.969,0.092713686,506479.7927,506479.77,506479.77,5726744.443,5726744.2,5726744.2
10.85,10/2/2015,148,21.8669,11,1313,48832.014,0.137400275,506479.9672,506479.77,506479.77,5726744.741,5726744.2,5726744.2
10.82,10/2/2015,148,21.8342,12,1311,48831.969,0.092093953,506480.1677,506479.92,506479.92,5726744.945,5726744.44,5726744.44
10.83,10/2/2015,148,21.8669,12,1309,48831.969,0.091807708,506480.326,506480.08,506480.08,5726745.195,5726744.68,5726744.68
Python code:
import glob, csv, os, itertools

list_of_files = glob.glob('C:/test/*.csv')
directory = 'C:/test/conv/'
if not os.path.exists(directory):
    os.makedirs(directory)
for filename in list_of_files:
    with open(filename, "r") as source:
        header_line = next(source)  # skip the header line
        rdr = csv.reader(source, delimiter=',', lineterminator='\n')
        x = 0
        for row in itertools.islice(rdr, 0, None):
            itemRow4 = row[4]  # line name of the current row
            outfileName = directory + itemRow4 + '.csv'
            # the output file is reopened for every single row
            with open(outfileName, "a") as result:
                wtr = csv.writer(result, lineterminator='\n')
                if x == 0:
                    previousitemRow4 = row[4]
                    x = x + 1
                if previousitemRow4 == itemRow4:
                    wtr.writerow((row[8], row[11], row[6], row[0]))
                    previousitemRow4 = itemRow4
                if previousitemRow4 != itemRow4:
                    wtr.writerow((row[8], row[11], row[6], row[0]))
                    print('next line')
                    previousitemRow4 = itemRow4
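The loop above reopens an output file for every row, which is probably where much of the time goes. Here is a minimal sketch of one alternative, assuming Python 3 and that rows sharing a line name are contiguous, as in the sample. `itertools.groupby` hands over each run of equal Linename values at once, so the output file is opened once per run instead of once per row. The function name `split_by_linename` and its parameters are hypothetical, not taken from the code above.

```python
import csv
import itertools
import os


def split_by_linename(source_path, out_dir, key_index=4, columns=(8, 11, 6, 0)):
    """Append X, Y, TF, Altitude of each run of equal line names to <out_dir>/<name>.csv.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    os.makedirs(out_dir, exist_ok=True)
    with open(source_path, "r", newline="") as source:
        rdr = csv.reader(source)
        next(rdr)  # skip the header row
        # groupby yields one (key, rows) pair per run of consecutive equal keys
        for linename, rows in itertools.groupby(rdr, key=lambda row: row[key_index]):
            out_path = os.path.join(out_dir, linename + ".csv")
            # append mode, as in the original, so several source files can
            # contribute to the same line file
            with open(out_path, "a", newline="") as result:
                wtr = csv.writer(result, lineterminator="\n")
                wtr.writerows([row[i] for i in columns] for row in rows)
```

It would be called once per source file, for example inside the existing `for filename in list_of_files:` loop as `split_by_linename(filename, directory)`.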
Using the standard Unix shell tools cut, sort, uniq, and grep:
$ cut -d, -f5 < in.csv | sort | uniq | while read lineno
do
    grep ",${lineno}," in.csv > out-${lineno}.csv
done
$ ls out-*.csv
out-10.csv out-11.csv out-12.csv out-Linename.csv
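The grep expression may not be sophisticated enough, because lineno could also appear in columns other than column 5. In that case, a simple regular expression can make grep match lineno only in column 5.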
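To illustrate that last point, here is a minimal Python sketch of such an expression (the helper name `fifth_field_pattern` is hypothetical). The pattern skips exactly four comma-terminated fields from the start of the line and then requires the value followed by a comma, so it can only match in column 5. It assumes that no field contains a quoted comma and that column 5 is not the last column. The same idea should carry over to grep as something like `grep -E '^([^,]*,){4}10,'`.

```python
import re


def fifth_field_pattern(value):
    """Compile a regex matching lines whose fifth comma-separated field equals value."""
    # ^(?:[^,]*,){4} consumes the first four fields; re.escape guards
    # against regex metacharacters in the value
    return re.compile(r"^(?:[^,]*,){4}" + re.escape(value) + r",")


lines = [
    "10.87,10/2/2015,148,21.8342,10,1316,48831.936",
    "10.84,10/2/2015,148,21.8342,11,1316,48831.969",
    # made-up row: ",10," occurs in column 3, so a plain ",10," search would hit it
    "10.82,10/2/2015,10,21.8342,12,1311,48831.969",
]
pattern = fifth_field_pattern("10")
matches = [line for line in lines if pattern.match(line)]
print(matches)  # only the first line has 10 in column 5
```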
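Based on Eumiro's suggestion, I came up with this solution. I tried using a dictionary of lists but could not get it to work. The following solution works and is very fast. Thanks, everyone, for the help!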
import glob, csv, os, itertools

list_of_files = glob.glob('C:/test/*.csv')
print('By the Power of the Python, Split these here files!')
directory = 'C:/test/conv/'
if not os.path.exists(directory):
    os.makedirs(directory)
for filename in list_of_files:
    storage = []         # rows buffered for the current line name
    specialStorage = []  # the row on which the line name changed
    with open(filename, "r") as source:
        header_line = next(source)  # skip the header line
        rdr = csv.reader(source, delimiter=',', lineterminator='\n')
        x = 0
        resetValue = 0
        for row in itertools.islice(rdr, 0, None):
            itemRow4 = row[4]
            if x == 0:
                previousitemRow4 = row[4]
                x = x + 1
            outfileName = directory + previousitemRow4 + '.csv'
            if previousitemRow4 == itemRow4:
                storage.append((row[8], row[11], row[6], row[0]))
                previousitemRow4 = itemRow4
            if previousitemRow4 != itemRow4:
                # line name changed: flush the buffered rows to the previous file
                with open(outfileName, "a") as result:
                    wtr = csv.writer(result, lineterminator='\n')
                    previousitemRow4 = itemRow4
                    if len(specialStorage) != 0:
                        wtr.writerow(specialStorage)
                    wtr.writerows(storage)
                    storage = []
                    specialStorage = (row[8], row[11], row[6], row[0])
        else:
            # the loop finished: flush whatever is still buffered
            with open(outfileName, "a") as result:
                wtr = csv.writer(result, lineterminator='\n')
                previousitemRow4 = itemRow4
                wtr.writerow(specialStorage)
                wtr.writerows(storage)
                storage = []
            print('end of file reached')
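For comparison, the dictionary-of-lists approach mentioned above can be sketched as follows. This is not the code from the post, only a hedged illustration assuming Python 3 and a source file that fits in memory; `split_with_dict` and its parameters are hypothetical names. Rows are collected per line name in a `collections.defaultdict(list)` and each output file is opened once at the end, so this variant would also cope with line names that are not contiguous.

```python
import collections
import csv
import os


def split_with_dict(source_path, out_dir):
    """Group the exported columns by line name, then write one file per name.

    Returns a dict mapping each line name to the number of rows written.
    """
    groups = collections.defaultdict(list)
    with open(source_path, "r", newline="") as source:
        rdr = csv.reader(source)
        next(rdr)  # skip the header row
        for row in rdr:
            # row[4] is Linename; export X, Y, TF, Altitude as in the original
            groups[row[4]].append((row[8], row[11], row[6], row[0]))
    os.makedirs(out_dir, exist_ok=True)
    for linename, rows in groups.items():
        out_path = os.path.join(out_dir, linename + ".csv")
        # append mode, as in the original code
        with open(out_path, "a", newline="") as result:
            csv.writer(result, lineterminator="\n").writerows(rows)
    return {name: len(rows) for name, rows in groups.items()}
```

The trade-off is memory: everything from one source file is held in the dictionary until the files are written.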