Strip header from binary file

Question

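I have a raw binary file that is several gigabytes in size, and I am trying to process it in chunks. Before I can start processing the data, I have to remove the header it carries. None of the string methods (such as .find, or checking for a string in a chunk of data) work because of the raw binary file format. I would like to strip the header automatically, but its length can vary, and my current approach of looking for the last newline character doesn't work because the raw binary data itself contains matching bytes.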

Data format:
BEGIN_HEADER\r\n
header of various line count\r\n
HEADER_END\r\n raw data starts here

How I am reading the file:

filename="binary_filename"
chunksize=1024
with open(filename, "rb") as f:
    chunk = f.read(chunksize)
    for index, byte in enumerate(chunk):
        if byte == ord('\n'):
            print("found one " + str(index))

Is there an easy way to locate the HEADER_END\r\n line without sliding a byte array through the file? Current method:

chunk = f.read(chunksize)
index=0
not_found=True
while not_found:
    if chunk[index:index+12] == b'HEADER_END\r\n':
        print("found")
        not_found=False
    index+=1

Answer 1

You can use linecache:

import linecache
currentline = 0
while(linecache.getline("file.bin",currentline)!="HEADER_END\n"):
    currentline=currentline+1

#print raw data
currentline = currentline + 1
rawdata = linecache.getline("file.bin",currentline)
currentrawdata = rawdata
while(currentrawdata):
    currentrawdata = linecache.getline("file.bin",currentline+1)
    rawdata = rawdata + currentrawdata
    currentline = currentline + 1
print(rawdata)
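Since linecache reads the file as text, a binary-mode variant of the same line-by-line idea may be safer for raw data. The following is only a minimal sketch under the assumption that the header lines end in \n and that the marker is exactly HEADER_END\r\n as in the question; find_data_offset and read_chunks are hypothetical helper names, not library functions.

```python
HEADER_END = b"HEADER_END\r\n"  # marker taken from the data format in the question


def find_data_offset(f, marker=HEADER_END):
    """Return the offset of the first byte after the marker line.

    Hypothetical helper: `f` must be a file object opened in binary mode
    and positioned at the start of the header.
    """
    while True:
        line = f.readline()  # in binary mode lines are split on b"\n" only
        if not line:
            # readline() returns b"" at end of file
            raise ValueError("header end marker not found")
        if line == marker:
            return f.tell()  # position right after the marker line


def read_chunks(f, chunksize=1024):
    """Yield successive chunks starting at the current position of `f`."""
    while True:
        chunk = f.read(chunksize)
        if not chunk:
            break
        yield chunk
```

Typical use would be to open the file with open(filename, "rb"), call find_data_offset(f) once, and then loop over read_chunks(f, chunksize). Only one line or one chunk is held in memory at a time, which matters for a file of several gigabytes.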

Update

We can split the problem in two: first remove the header, then read the remaining data in chunks:

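# Open in binary mode so the raw bytes (including \r\n) are left untouched
with open('test_file.bin', 'rb') as f:
    lines = f.readlines()
currentline = 0
while lines[currentline] != b"HEADER_END\r\n":
    currentline = currentline + 1
# Skip the HEADER_END line itself and keep everything after it
with open('newfile.bin', 'wb') as out:
    out.writelines(lines[currentline + 1:])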

This creates a file (newfile.bin) that contains only the raw data. It can now be read directly in chunks:

chunksize=1024
with open('newfile.bin', "rb") as f:
    chunk = f.read(chunksize)
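If copying several gigabytes into a second file is undesirable, one alternative is to locate the marker with a memory map and then seek past it in the original file. This is a rough sketch, assuming the file is not empty and can be mapped on the platform in use (a 32-bit process may not be able to map a multi-gigabyte file); find_marker_end is a hypothetical helper name.

```python
import mmap


def find_marker_end(path, marker=b"HEADER_END\r\n"):
    """Return the offset just past the first `marker` in the file at `path`.

    Hypothetical helper. The file is mapped read-only, so the search does
    not load the whole file into a Python bytes object.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(marker)  # -1 if the marker does not occur
    if pos == -1:
        raise ValueError("header end marker not found")
    return pos + len(marker)
```

After that, opening the file in "rb" mode, calling f.seek(find_marker_end(filename)) and reading with f.read(chunksize) should start directly at the raw data.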

Update 2

This can also be done without an intermediate file:

# Size of each chunk in bytes
chunksize = 20
filename = 'test_file.bin'
endHeaderTag = b"HEADER_END\r\n"
# Find the line that holds HEADER_END (binary mode keeps \r\n intact)
with open(filename, 'rb') as f:
    lines = f.readlines()
currentline = 0
while lines[currentline] != endHeaderTag:
    currentline = currentline + 1
currentline = currentline + 1
# Now currentline is the index of the first line of raw data

# Join the lines after the header back into a single bytes object
header_stripped = b"".join(lines[currentline:])

# Lastly, slice it into successive chunks and store them in the chunks list;
# the final chunk may be shorter than chunksize
chunks = []
for start in range(0, len(header_stripped), chunksize):
    chunks.append(header_stripped[start:start + chunksize])
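Because readlines() loads the whole file into memory, a streaming variant may suit a multi-gigabyte file better. The sketch below is one possible way to do it, not a definitive implementation: it searches each block with bytes.find and carries the last len(marker) - 1 bytes over to the next read so that a marker split across two reads is still found. iter_data_chunks is a hypothetical helper name, and the marker is assumed to be HEADER_END\r\n as in the question.

```python
def iter_data_chunks(f, marker=b"HEADER_END\r\n", chunksize=1024):
    """Yield the bytes that follow `marker` in chunks of `chunksize` bytes.

    Hypothetical helper: `f` is a file object opened in binary mode.
    The last chunk may be shorter than `chunksize`.
    """
    keep = len(marker) - 1  # bytes to carry over between reads
    tail = b""
    while True:
        block = f.read(chunksize)
        if not block:
            raise ValueError("header end marker not found")
        window = tail + block
        pos = window.find(marker)
        if pos != -1:
            # Everything after the marker already belongs to the raw data
            buffer = window[pos + len(marker):]
            break
        tail = window[-keep:] if keep else b""

    while True:
        # Top the buffer up until it holds a full chunk or the file ends
        while len(buffer) < chunksize:
            more = f.read(chunksize)
            if not more:
                break
            buffer += more
        if not buffer:
            return
        yield buffer[:chunksize]
        buffer = buffer[chunksize:]
```

It would be used as for chunk in iter_data_chunks(f, chunksize=1024) on a file opened with "rb". Note that the error for a missing marker is only raised once iteration starts, since this is a generator.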

python

binary-data

python-3.x