Read a line, store it in a variable, then read another line and come back to the first line (Python 2)
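This is a tricky problem. I have read many posts about it, but I haven't been able to get it to work.
I have a big file. I need to read it line by line, and once I reach a line of the form "Total is: (any decimal number)", take that string and save the number in a variable. If the number is greater than 40.0, I need to find the fourth line above the Total line (for example, if the Total line is line 39, that line is line 35). That line has the format "(number).(space)(substring)". Finally, I need to parse out this substring and do further processing on it.
This is an example of the input file: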
many lines that we don't care about
many lines that we don't care about
...
1. Hi45
People: bla bla bla bla bla bla
whitespace
bla bla bla bla bla
Total is: (*here there will be a decimal number*)
bla bla
white space
...
more lines we don't care about
and then more lines and then
again we get
2. How144
People: bla bla bla bla bla bla
whitespace
bla bla bla bla bla
Total is: (*here there will be a decimal number*)
bla bla
white space
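I have tried many things, including using the re.search() method to capture what I need from each line I care about.
Here is my code, which I adapted from another Stack Overflow Q&A: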
import re
import linecache

number = ""
higher_line = ""
found_line = ""

with open("filename_with_many_lines.txt") as aFile:
    for num, line in enumerate(aFile, 1):
        searchObj = re.search(r'(\bTotal\b)(\s)(\w+)(\:)(\s)(\d+.\d+)', line)
        if searchObj:
            print "this is on line", line
            print "this is the line number:", num
            var1 = searchObj.group(6)
            print var1
            if float(var1) > 40.0:
                number = num
                higher_line = number - 4
                print number
                print higher_line
                found_line = linecache.getline("filename_with_many_lines.txt", higher_line)
                print "found the", found_line
The expected output is:
this is on line Total is: 45.5
this is the line number: 14857
14857
14853
found the 1. Hi145
this is on line Total is: 62.1
this is the line number: 14985
14985
14981
found the 2.How144
If the line you need is always four lines above the Total is: line, you can keep the preceding lines in a bounded deque.
from collections import deque

with open(filename, 'r') as file:
    previous_lines = deque(maxlen=4)
    for line in file:
        if line.startswith('Total is: '):
            try:
                higher_line = previous_lines[-4]
                # store higher_line, do calculations, whatever
                break  # if you only want to do this once; else let it keep going
            except IndexError:
                # we don't have four previous lines yet
                # I've elected to simply skip this total line in that case
                pass
        previous_lines.append(line)
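A bounded deque (one with a maximum length) discards an item from the opposite end if adding a new item would take it past its maximum length. Here we append strings to the right side of the deque, so once the deque's length reaches 4, each new string appended on the right causes it to discard one string from the left. Thus, once at least four lines have been read, the deque at the start of each iteration of the for loop holds the four lines before the current line, with the oldest line leftmost (index 0).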
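The eviction behavior described above can be checked in isolation. This is a minimal sketch with made-up strings standing in for file lines; the names recent, oldest, newest, and snapshot are illustrative only.

```python
from collections import deque

# A bounded deque that keeps at most the four most recent items.
recent = deque(maxlen=4)

for item in ["line 1", "line 2", "line 3", "line 4", "line 5", "line 6"]:
    # Appending on the right evicts the leftmost item once the deque is full.
    recent.append(item)

oldest = recent[0]       # same element as recent[-4] when the deque is full
newest = recent[-1]
snapshot = list(recent)
print(snapshot)          # ['line 3', 'line 4', 'line 5', 'line 6']
```

Because the deque is full here, index 0 and index -4 refer to the same element, which is why the answer's code can use previous_lines[-4] and rely on IndexError while fewer than four lines have been seen.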
In fact, the documentation on collections.deque mentions a use case very similar to ours:
Bounded length deques provide functionality similar to the tail filter in Unix. They are also useful for tracking transactions and other pools of data where only the most recent activity is of interest.
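The tail analogy can be sketched in a few lines. The helper name tail and the in-memory sample are assumptions for illustration; io.StringIO stands in for an open file so the snippet runs without touching disk.

```python
from collections import deque
import io

def tail(lines, n=4):
    """Return the last n items of any iterable (for example, an open file)."""
    # The deque consumes the whole iterable but never holds more than n items.
    return list(deque(lines, maxlen=n))

sample = io.StringIO(u"a\nb\nc\nd\ne\nf\n")
last_four = tail(sample, 4)
print(last_four)
```

Only n lines are held in memory at any time, which is what makes the same idea practical for a large file.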
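This stores the line that starts with a number and a dot in a variable called prevline. We print prevline only when re.search returns a match object and the number it captured is greater than 40.0.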
import re
with open("file") as aFile:
    prevline = ""
    for num, line in enumerate(aFile,1):
        m = re.match(r'\d+\.\s*.*', line) # match object for a line that starts with a number and a dot
        if m:
            prevline = m.group() # remember the whole header line; assigning instead of appending keeps headers of records scoring 40.0 or less from piling up
        searchObj = re.search(r'(\bTotal\b\s+\w+:\s+(\d+\.\d+))', line) # search for a line containing Total, a word, a colon and a float number
        if searchObj: # if this is a Total line
            score = float(searchObj.group(2)) # the float number is assigned to the variable called score
            if score > 40.0: # do all the operations below only if the number we fetched is greater than 40.0
                print "this is the line number: ", num
                print "this is the line", searchObj.group(1)
                print num
                print num-4
                print "found the", prevline
                prevline = ""
Output:
this is on line Total is: 45.5
this is the line number: 8
8
4
found the 1. Hi45
this is on line Total is: 62.1
this is the line number: 20
20
16
found the 2. How144
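I suggested an edit to Blacklight Shining's post, building on its deque solution, but it was rejected with the advice to post it as an answer instead. Below, I show how Blacklight's solution solves your problem once you stare at it for a bit.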
with open(filename, 'r') as file:
    # To be clear: we don't care about checking the first 4 lines for totals.
    # Instead, we just store them for later.
    previousLines = []
    previousLines.append(file.readline())
    previousLines.append(file.readline())
    previousLines.append(file.readline())
    previousLines.append(file.readline())
    # The earliest we should expect a total is at line 5.
    for lineNum, line in enumerate(file, 5):
        if line.startswith('Total is: '):
            prevLine = previousLines[0]
            high_num = prevLine.split()[1] # A
            score = float(line[len('Total is: '):].strip()) # B (slice off the prefix; str.strip takes a set of characters, not a prefix)
            if score > 40.0:
                # That's all! We've now got everything we need.
                # Display results as shown in example code.
                print "this is the line number : ", lineNum
                print "this is the line ", line.strip('\n')
                print lineNum
                print (lineNum - 4)
                print "found the ", prevLine
        # Critical - drop the oldest line & push the current line onto the list.
        previousLines = previousLines[1:] + [line]
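I didn't use a deque, but my code does the same thing imperatively. I don't think this is necessarily a better answer than either of the others; I'm posting it to show how the problem you're trying to solve can be tackled with a very simple algorithm and simple tools. (Compare Avinash's clever 17-line solution with my dumbed-down 18-line solution.)
This simplified approach won't make you look like a wizard to anyone reading your code, but it also won't accidentally match anything in the intervening lines. If you're dead set on hitting your lines with a regex, just modify lines A and B. The general solution still works.
The point is, a simple way to remember what was 4 lines back is to store the last four lines in memory.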
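For reference, here is one way the pieces from the answers might be combined: a bounded deque for the look-back, plus two regular expressions for the Total line and the "(number).(space)(substring)" line. This is only a sketch under assumptions. The names find_high_totals, TOTAL_RE, and HEADER_RE are hypothetical, the sample text is invented to mirror the question's layout, and it uses no print statements so it should run unchanged on Python 2.7 and Python 3.

```python
from collections import deque
import io
import re

# Assumed line shapes, taken from the question's description.
TOTAL_RE = re.compile(r'^Total is:\s*(\d+(?:\.\d+)?)\s*$')
HEADER_RE = re.compile(r'^(\d+)\.\s+(.*\S)\s*$')

def find_high_totals(lines, threshold=40.0, lookback=4):
    """Yield (line_number, total, substring) for each total above threshold."""
    previous = deque(maxlen=lookback)   # the last `lookback` lines seen
    for line_number, line in enumerate(lines, 1):
        total_match = TOTAL_RE.match(line)
        # Skip totals that appear before `lookback` lines have been read.
        if total_match and len(previous) == lookback:
            total = float(total_match.group(1))
            header_match = HEADER_RE.match(previous[0])  # fourth line above
            if total > threshold and header_match:
                yield line_number, total, header_match.group(2)
        previous.append(line)

sample = io.StringIO(
    u"noise\n"
    u"1. Hi45\n"
    u"People: bla bla\n"
    u"\n"
    u"bla bla\n"
    u"Total is: 45.5\n"
    u"noise\n"
    u"2. How144\n"
    u"People: bla bla\n"
    u"\n"
    u"bla bla\n"
    u"Total is: 12.0\n"
)
results = list(find_high_totals(sample))
```

With a real file you would pass the open file object instead of sample. Writing it as a generator keeps the "further processing" step separate from the scanning loop.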