IndexError: list index out of range is thrown now that I've changed the way the file is read
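I am trying to read and reformat a very large (2GB+) .out file that is structured like a CSV. I previously used the standard open() without this problem, but switched to codecs.open() because the former had trouble with certain characters.
It now throws the following on the first row:
Traceback (most recent call last):
  line 21, in <module>
    if(r[5]==""):
IndexError: list index out of range
even though there is definitely an element at r[5].
(The run time was 0.301 seconds.)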
import sys
import csv
import datetime
import codecs

maxInt=sys.maxsize
decrement=True
while decrement:
    decrement=False
    try:
        csv.field_size_limit(maxInt)
    except OverflowError:
        maxInt = int(maxInt/10)
        decrement = True
with codecs.open("file.out", 'rU', 'utf-16-be') as source:
    rdr = csv.reader(source)
    with open("out.csv","w", newline='') as result:
        wtr = csv.writer(result)
        wtr.writerow(("Column1", "column2", "column3", "etc..."))
        for r in rdr:
            if(r[5]==""):
                continue
            wtr.writerow((datetime.datetime.strptime(r[5], '%m/%d/%Y').strftime('%Y-%m-%d'), r[3], r[7], r[9]+r[10]+" "+r[12]))
Using utf-8 throws UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 12: invalid continuation byte
Using latin-1 or ISO-8859-1 throws UnicodeEncodeError: 'charmap' codec can't encode characters in position 57-58: character maps to <undefined>, although only after it has run for longer.
The input file looks like this:
"A00017","K","G","1999","4530","01/12/1999","","","","PEOPLE TO ELECT MANGINELLI","","","","258 MAGNIOLIA DRIVE","SELDEN","NY","11784","","","404.57","","","","","","","2","","NAA","07/22/1999 08:43:59"
"A00037","K","G","1999","999999","01/12/1999","","","","CITIZENS TO ELECT TEDISCO TO ASSEMBLY","","","","","","","","","","0","","","","","","","2","","",""
"A00037","K","N","1999","1693","01/15/1999","","","","OUTSTANDING LOAN","","","","2176 GUILDERLAND AVE","SCHENECTADY","NY","12306","","","10474.8","10474.8","","","OTHER","","PREVIOUS LOAN FROM JAMES TEDISCO","","P","JM","07/15/1999 15:08:17"
"A00037","J","N","2000","1694","01/13/2000","","","","OUTSTANDING LOAN","","","","2176 GUILDERLAND","SCHENECTADY","NY","12306","","","10474.8","10474.8","","","OTHER","","LOANS FROM PREVIOUS CAMPAIGNS FROM J","","P","JM","01/14/1900 16:35:09"
"A00037","K","X","2000","999999","","","","","","","","","","","","","","","","","","","","","","","","","07/20/2000 00:00:00"
"A00037","J","X","2001","999999","","","","","","","","","","","","","","","","","","","","","","","","","01/17/2001 00:00:00"
"A00037","K","X","2002","999999","","","","","","","","","","","","","","","","","","","","","","","","","07/19/2002 00:00:00"
"A00037","J","X","2003","999999","","","","","","","","","","","","","","","","","","","","","","","","","01/21/2003 00:00:00"
"A00037","K","X","2003","999999","","","","","","","","","","","","","","","","","","","","","","","","","07/16/2003 00:00:00"
"A00037","J","X","2004","999999","","","","","","","","","","","","","","","","","","","","","","","","","01/22/2004 00:00:00"
I got this far thanks to:
"Line contains NULL byte" in CSV reader (Python)
_csv.Error: field larger than field limit (131072)
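In the 'file.out' you are reading, find out which separator sits between the cells of each row. It will be something like '\t' (tab) or ',' (comma); pass it to the 'delimiter' argument.
Try printing 'r' and look at the characters between the column names or between the values in the row.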
rdr = csv.reader(source,delimiter=<separator>)
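To make the "print r" suggestion concrete, here is a minimal sketch that reports how many fields the reader actually produced for the first few rows. The helper name preview_rows and the two-line sample are hypothetical stand-ins, not part of the original script. A count below 6 on the first row would explain the IndexError at r[5]. A wrong delimiter gives such a count, and a wrong codec may too, because the commas do not survive decoding.

```python
import csv
import io


def preview_rows(text_stream, delimiter=",", limit=3):
    """Return (field_count, row) pairs for the first `limit` rows."""
    reader = csv.reader(text_stream, delimiter=delimiter)
    preview = []
    for row in reader:
        preview.append((len(row), row))
        if len(preview) >= limit:
            break
    return preview


# Hypothetical sample standing in for the first lines of file.out.
sample = io.StringIO(
    '"A00017","K","G","1999","4530","01/12/1999"\n'
    '"A00037","K","X","2000","999999",""\n'
)

for count, row in preview_rows(sample):
    # repr() makes stray or undecoded characters visible in the output.
    print(count, repr(row))
```

On the real file, the same function can be given the already opened source object in place of the StringIO sample. If the first row comes back as a single field, the delimiter or the encoding passed when opening the file is the likely culprit rather than the data.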