How to split a file so that each chunk ends just before a date line
I have a .txt file like this:
2019-03-29 12:03:07 line1
line2
line3
....
2019-03-30 07:05:09 line1
line2
....
2019-03-31 10:03:20 line1
line2
....
I split the file into several files like this:
import math
import os

inputData = r'dirname\..'  # path elided in the original post
numThrd = 3

def chunkFiles():
    # First pass: count the lines so the file can be cut into equal parts.
    nline = sum(1 for line in open(inputData, 'r', encoding='utf-8', errors='ignore'))
    chunk_size = math.floor(nline / int(numThrd))
    n_thread = int(numThrd)
    j = 0
    with open(inputData, 'r', encoding='utf-8', errors='ignore') as fileout:
        for i, line in enumerate(fileout):
            # Close the current chunk when a line-count boundary is reached.
            if (i + 1 == j * chunk_size and j != n_thread) or i == nline:
                out.close()
            # Open the next chunk file at the first line and at each boundary.
            if i + 1 == 1 or (j != n_thread and i + 1 == j * chunk_size):
                chunkFile = 'rawData' + str(j+1) + '.txt'
                if os.path.isfile(chunkFile):
                    break
                out = open(chunkFile, 'w+', encoding='utf-8', errors='ignore')
                j = j + 1
            fLine = line[:-1]
            if not matchLine:  # matchLine is not defined in the posted snippet
                if out.closed != True:
                    out.write(line)
            if i % 1000 == 0 and i != 0:
                print('Processing line %i ...' % (i))
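However, I want the split to satisfy one condition: the last line of each chunk file must be the line immediately before a date line, so that every chunk after the first starts with a date line.
The output I currently get: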
rawData1.txt
2019-03-29 12:03:07 line1
line2
....
-------------------------
rawData2.txt
line50
line51
2019-03-30 07:05:09 line1
line2
.....
Expected output:
rawData1.txt
2019-03-29 12:03:07 line1
line2
line3
....
-------------------------
rawData2.txt
2019-03-30 07:05:09 line1
line2
....
What should I add to the script above to satisfy that condition?
Many thanks.
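You can produce the desired output by collecting the lines you are about to write in a list (see below).
def write_chunk(filename, chunk):
    with open(filename, "w") as out:
        for i in chunk:
            out.write(i)

chunk = []
n_chunk = 1
with open("data.txt") as f:
    for line in f:
        # A line that does not start with whitespace begins a new record,
        # so flush the lines collected so far before starting the next chunk.
        if not line[0].isspace() and chunk:
            write_chunk("{}.txt".format(n_chunk), chunk)
            chunk = []
            n_chunk += 1
        chunk.append(line)
# write final chunk
write_chunk("{}.txt".format(n_chunk), chunk)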
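Note that the code above writes one file per record and treats any line that does not start with whitespace as the start of a record. If the aim is to keep roughly numThrd files, as in the question, one possible variation is to detect date lines with a regular expression and to start a new file only at a date line, once the current file has reached its share of the lines. The following is a minimal sketch under those assumptions: the names split_on_date_lines, write_chunks and DATE_RE, and the timestamp pattern itself, are illustrative and not part of the original post.

```python
import math
import re

# Assumption: a record starts with a "YYYY-MM-DD HH:MM:SS" timestamp at column 0.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def split_on_date_lines(lines, n_chunks):
    """Group lines into at most n_chunks lists, cutting only before date lines."""
    lines = list(lines)
    if not lines:
        return []
    # Target size per chunk; a chunk may grow past it until the next date line.
    target = max(1, math.ceil(len(lines) / n_chunks))
    chunks = [[]]
    for line in lines:
        current = chunks[-1]
        may_cut = len(current) >= target and len(chunks) < n_chunks
        if may_cut and DATE_RE.match(line):
            chunks.append([])
        chunks[-1].append(line)
    return chunks


def write_chunks(input_path, n_chunks, prefix="rawData"):
    """Read input_path and write prefix1.txt, prefix2.txt, ... (hypothetical helper)."""
    with open(input_path, encoding="utf-8", errors="ignore") as f:
        chunks = split_on_date_lines(f, n_chunks)
    for k, chunk in enumerate(chunks, start=1):
        with open("{}{}.txt".format(prefix, k), "w", encoding="utf-8") as out:
            out.writelines(chunk)
    return len(chunks)
```

With this sketch, a call such as write_chunks("data.txt", 3) would write at most three files named rawData1.txt, rawData2.txt and so on. Fewer files can result when date lines are sparse, because a cut is never made in the middle of a record. The whole file is read into memory, which keeps the logic short but may not suit very large inputs.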