How to chunk a big file with certain size and condition
I have a big text file, and I split it into smaller files of a certain size. Here is the code I have so far:
import math
import os

numThread = 4
inputData = r'dir\example.txt'  # raw string, so the backslash is kept literally

def chunk_files():
    # First pass: count the lines so the file can be cut into numThread parts.
    nline = sum(1 for line in open(inputData, 'r', encoding='utf-8', errors='ignore'))
    chunk_size = math.floor(nline / int(numThread))
    n_thread = int(numThread)
    j = 0
    with open(inputData, 'r', encoding='utf-8', errors='ignore') as file_:
        for i, line in enumerate(file_):
            # Close the current chunk when a chunk boundary is reached.
            if (i + 1 == j * chunk_size and j != n_thread) or i == nline:
                out.close()
            # Open the next chunk file on the first line and at each boundary.
            if i + 1 == 1 or (j != n_thread and i + 1 == j * chunk_size):
                chunk_file = '_raw' + str(j) + '.txt'
                if os.path.isfile(chunk_file):
                    break
                out = open(chunk_file, 'w+', encoding='utf-8', errors='ignore')
                j = j + 1
            if out.closed != True:
                out.write(line)
            if i % 1000 == 0 and i != 0:
                print('Processing line %i ...' % (i))
    print('Done.')
Here is a sample of the text in the file:
190219 7:05:30 line3 success
line3 this is the 1st success process
line3 this process need 3sec
200219 9:10:10 line2 success
line2 this is the 1st success process
Because of the chunk size, the text ends up split at arbitrary points, like this:
190219 7:05:30 line3 success
line3 this is the 1st success process
line3 this process need 3sec
200219 9:10:10 line2 success
line2 this is the 1st success process
I need to split at the 6-digit date that is followed by a time, using the regex reg = re.compile(r"\b(\d{6})(?=\s\d{1,}:\d{2}:\d{2})\b"), like this:
190219 7:05:30 line3 success
line3 this is the 1st success process
line3 this process need 3sec

200219 9:10:10 line2 success
line2 this is the 1st success process
I have tried "Python: regex match across file chunk boundaries", but I could not adapt it to my problem.
Can anyone help me put the regex into the chunk_files function? Thanks in advance.
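Since the number of lines per record does not seem to be fixed, we could capture the 6-digit date and the time, collect all the remaining lines, and then script the rest of the problem from there. The expression of interest might be:
(\d{6})\s(\d{1,}:\d{2}:\d{2})|\s*(.*)\s*
The first alternative matches the date and time:
(\d{6})\s(\d{1,}:\d{2}:\d{2})
and the second alternative matches the remaining lines:
\s*(.*)\s*
Test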
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re

regex = r"(\d{6})\s(\d{1,}:\d{2}:\d{2})|\s*(.*)\s*"

# Continuation lines are indented by 15 spaces, matching the offsets in the output.
test_str = ("190219 7:05:30 line3 success \n"
    "               line3 this is the 1st success process\n"
    "               line3 this process need 3sec\n"
    "200219 9:10:10 line2 success \n"
    "               line2 this is the 1st success process\n"
    "190219 7:05:30 line3 success \n"
    "               line3 this is the 1st success process\n"
    "               line3 this process need 3sec\n"
    "200219 9:10:10 line2 success \n"
    "               line2 this is the 1st success process\n"
    "200219 9:10:10 line2 success \n"
    "               line2 this is the 1st success process\n"
    "               line2 this is the 1st success process\n"
    "               line2 this is the 1st success process\n"
    "               line2 this is the 1st success process\n"
    "               line2 this is the 1st success process\n"
    "               line2 this is the 1st success process")

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Output
Match 1 was found at 0-14: 190219 7:05:30
Group 1 found at 0-6: 190219
Group 2 found at 7-14: 7:05:30
Group 3 found at -1--1: None
Match 2 was found at 14-45: line3 success
Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 15-29: line3 success
Match 3 was found at 45-98: line3 this is the 1st success process
Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 45-82: line3 this is the 1st success process
Match 4 was found at 98-127: line3 this process need 3sec
Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 98-126: line3 this process need 3sec
Match 5 was found at 127-141: 200219 9:10:10
Group 1 found at 127-133: 200219
Group 2 found at 134-141: 9:10:10
Group 3 found at -1--1: None
Match 6 was found at 141-172: line2 success
Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 142-156: line2 success
Match 7 was found at 172-210: line2 this is the 1st success process
Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 172-209: line2 this is the 1st success process
Match 8 was found at 210-224: 190219 7:05:30
Group 1 found at 210-216: 190219
Group 2 found at 217-224: 7:05:30
Group 3 found at -1--1: None
Match 9 was found at 224-255: line3 success
Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 225-239: line3 success
Match 10 was found at 255-308: line3 this is the 1st success process
Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 255-292: line3 this is the 1st success process
Match 11 was found at 308-337: line3 this process need 3sec
Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 308-336: line3 this process need 3sec
Match 12 was found at 337-351: 200219 9:10:10
Group 1 found at 337-343: 200219
Group 2 found at 344-351: 9:10:10
Group 3 found at -1--1: None
Match 13 was found at 351-382: line2 success
Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 352-366: line2 success
Match 14 was found at 382-420: line2 this is the 1st success process
Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 382-419: line2 this is the 1st success process
Match 15 was found at 420-434: 200219 9:10:10
Group 1 found at 420-426: 200219
Group 2 found at 427-434: 9:10:10
Group 3 found at -1--1: None
Match 16 was found at 434-465: line2 success
Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 435-449: line2 success
Match 17 was found at 465-518: line2 this is the 1st success process
Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 465-502: line2 this is the 1st success process
Match 18 was found at 518-571: line2 this is the 1st success process
Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 518-555: line2 this is the 1st success process
Match 19 was found at 571-624: line2 this is the 1st success process
Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 571-608: line2 this is the 1st success process
Match 20 was found at 624-677: line2 this is the 1st success process
Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 624-661: line2 this is the 1st success process
Match 21 was found at 677-730: line2 this is the 1st success process
Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 677-714: line2 this is the 1st success process
Match 22 was found at 730-767: line2 this is the 1st success process
Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 730-767: line2 this is the 1st success process
Match 23 was found at 767-767:
Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 767-767:
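Building on the match offsets shown above, one possible way to turn the date/time matches into whole records is to slice the text between consecutive header matches. The following is only a sketch under assumptions: the names HEADER, split_records and sample are made up for illustration, the pattern is the date/time alternative anchored to the start of a line, and the whole text is assumed to fit in memory.

```python
import re

# Same date/time shape as the first alternative above, anchored to line starts.
HEADER = re.compile(r"^\d{6}\s\d{1,}:\d{2}:\d{2}", re.MULTILINE)

def split_records(text):
    """Return the pieces of text that each start at a date/time header."""
    starts = [m.start() for m in HEADER.finditer(text)]
    if not starts:
        return [text] if text else []
    records = []
    # Anything before the first header is kept as its own leading piece.
    if starts[0] > 0:
        records.append(text[:starts[0]])
    bounds = starts + [len(text)]
    for begin, end in zip(bounds, bounds[1:]):
        records.append(text[begin:end])
    return records

# Hypothetical sample in the same shape as the question's data.
sample = (
    "190219 7:05:30 line3 success \n"
    "               line3 this is the 1st success process\n"
    "               line3 this process need 3sec\n"
    "200219 9:10:10 line2 success \n"
    "               line2 this is the 1st success process\n"
)

records = split_records(sample)
for record in records:
    print(repr(record))
```

Joining the returned pieces gives back the original text, so nothing is lost by the split.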
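I believe keeping things simple helps a lot here.
import re

# l holds the input text, e.g. the test_str from the other answer.
all_parts = []
part = []
for line in l.split('\n'):
    # A line starting with a date and time begins a new part.
    if re.search(r"^\d+\s\d+:\d+:\d+\s", line):
        if part:
            all_parts.append(part)
            part = []
    part.append(line)
else:
    # Runs once the loop finishes: store the last part.
    all_parts.append(part)

print(all_parts)
Trying this with your test_str gives: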
In [37]: all_parts
Out[37]:
[['190219 7:05:30 line3 success ',
  '               line3 this is the 1st success process',
  '               line3 this process need 3sec'],
 ['200219 9:10:10 line2 success ',
  '               line2 this is the 1st success process'],
 ['190219 7:05:30 line3 success ',
  '               line3 this is the 1st success process',
  '               line3 this process need 3sec'],
 ['200219 9:10:10 line2 success ',
  '               line2 this is the 1st success process'],
 ['200219 9:10:10 line2 success ',
  '               line2 this is the 1st success process',
  '               line2 this is the 1st success process',
  '               line2 this is the 1st success process',
  '               line2 this is the 1st success process',
  '               line2 this is the 1st success process',
  '               line2 this is the 1st success process']]
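You can then turn this code into a generator/iterator, which makes it easy to chunk a file of any size and get the lines back as lists, one list per chunk.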
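A minimal sketch of that generator idea, under stated assumptions: the names HEADER, iter_records, chunk_records and write_chunks are hypothetical, the header pattern is a stricter 6-digit variant of the one above, and chunks are sized by a target line count (which could be computed as in the question, floor(nline / numThread)). A chunk is only cut between records, so it can run over the target when a record is long.

```python
import re

# Assumed header shape: 6-digit date, whitespace, h:mm:ss, whitespace.
HEADER = re.compile(r"^\d{6}\s\d{1,}:\d{2}:\d{2}\s")

def iter_records(lines):
    """Yield lists of lines; a new list starts at each date/time header."""
    part = []
    for line in lines:
        if HEADER.match(line) and part:
            yield part
            part = []
        part.append(line)
    if part:
        yield part

def chunk_records(lines, lines_per_chunk):
    """Yield lists of about lines_per_chunk lines, cut only between records."""
    chunk = []
    for record in iter_records(lines):
        if chunk and len(chunk) + len(record) > lines_per_chunk:
            yield chunk
            chunk = []
        chunk.extend(record)
    if chunk:
        yield chunk

def write_chunks(input_path, lines_per_chunk, prefix="_raw"):
    """Write each chunk to prefix + index + '.txt'; return the file names."""
    names = []
    with open(input_path, "r", encoding="utf-8", errors="ignore") as src:
        for index, chunk in enumerate(chunk_records(src, lines_per_chunk)):
            name = "{}{}.txt".format(prefix, index)
            with open(name, "w", encoding="utf-8") as out:
                out.writelines(chunk)
            names.append(name)
    return names

# Hypothetical in-memory demo; a file object can be passed the same way.
demo = [
    "190219 7:05:30 line3 success \n",
    "               line3 this is the 1st success process\n",
    "               line3 this process need 3sec\n",
    "200219 9:10:10 line2 success \n",
    "               line2 this is the 1st success process\n",
]
chunks = list(chunk_records(demo, 2))
print(len(chunks))
```

Because the file is read line by line, only one chunk is held in memory at a time. Unlike the question's chunk_files, this sketch overwrites existing output files rather than stopping when one already exists.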