unpack_from() does not work with big files
I am trying to use Python 3 to read data from some fixed-width-format files (from here).
It works fine if I only preselect a few lines, but if I want to go
through the whole file (about 1000 lines, each with 611 blocks of 4 characters = 2444 characters), Python tells me that struct.Struct(bytes).unpack_from(bytes) requires
a buffer of at least 2444 bytes
, and at the moment I don't see why it wouldn't have a buffer that big.
I am running this on 64-bit Linux with 4 GB of RAM and 20 GB of swap, in case that is relevant.
The code snippet looks like this:
#edit
import struct
"""rowMask is 611 times 4s, just to prevent you from counting it... """
rowMask="4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s4s"
def readUsableFields(filename, stdPath):
    usableFields = []
    with open(stdPath + filename, "r") as f:
        count_line = 0
        for line in f:
            count_col = 0
            fields = struct.Struct(bytes(rowMask, "UTF-8")).unpack_from(bytes(line, "UTF-8"))
            for field in fields:
                if field != -999:
                    usableFields.append([count_line, count_col])
                count_col += 1
            count_line += 1
    return usableFields
I have also looked at this and this, but neither of them answers my question.
Some help would be great, and if my question is a duplicate (I did not find one), please tell me.
Since many fixed-width files have a footer (or a header), the code
will fail on the footer because its length is probably not correct.
Therefore you have to check for the correct line length:
rowMask="4s"*611
def readUsableFields(filename,stdPath):
usableFields=[]
with open(stdPath+filename,"r") as f:
count_line=0
for line in f:
count_col=0
# len(line) = 611 * 4 +1
# as there is a trailing '[=10=]'
if(len(line)!=2445):
continue
fields=struct.Struct(bytes(rowMask,"UTF-8")).unpack_from(bytes(line,"UTF-8"))
for field in fields:
if(field!=-999):
usableFields.append([count_line,count_col])
count_col+=1
count_line+=1
f.close()
return usableFields
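To see why the length check matters, here is a minimal sketch (the footer text is made up) showing that unpack_from() raises exactly the error from the question as soon as a line is shorter than the 2444 bytes the format describes:

import struct

rowMask = "4s" * 611
lineStruct = struct.Struct(bytes(rowMask, "UTF-8"))

data_line = "x" * 2444 + "\n"      # a full data row: 611 blocks of 4 characters
footer_line = "END OF DATA\n"      # hypothetical short footer line

# a full row works: unpack_from() returns a tuple of 611 bytes objects
print(len(lineStruct.unpack_from(bytes(data_line, "UTF-8"))))

# a short row fails
try:
    lineStruct.unpack_from(bytes(footer_line, "UTF-8"))
except struct.error as e:
    print(e)   # e.g. "unpack_from requires a buffer of at least 2444 bytes"

With the length check in place, only real data rows ever reach unpack_from(), so the buffer is always large enough.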