Concatenate lines in Python if the next line does not match a pattern
Hi, I have an Encore subtitle file like the one below:
00:00:29:02 00:00:35:00 text 1
text 2
00:00:36:04 00:00:44:08 text 3
text 4
00:00:44:12 00:00:48:00 text 5
00:00:49:17 00:00:52:17 text 6
What should I put in place of "HELP PLEASE" in Python
newdata = re.sub("""HELP PLEASE""", r"", filedata)
to produce lines like these:
00:00:29:02 00:00:35:00 text 1 text 2
00:00:36:04 00:00:44:08 text 3 text 4
00:00:44:12 00:00:48:00 text 5
00:00:49:17 00:00:52:17 text 6
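Thanks
If the file is not too big, you can read each line into a new list. If a line does not start with a timestamp, pop the last line that was added to new_lines, append the new line to it, and add the result back.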
>>> import re
>>>
>>> # assume all_lines = somefile.read().splitlines() (no trailing newlines),
... # but simplifying to this
... all_lines = [
...     "00:00:29:02 00:00:35:00 text 1",
...     "text 2",
...     "00:00:36:04 00:00:44:08 text 3",
...     "text 4",
...     "00:00:44:12 00:00:48:00 text 5",
...     "00:00:49:17 00:00:52:17 text 6",
...     "text 7",  # added for interest
...     "text 8",  # added for interest
... ]
>>>
>>> new_lines = []
>>> for line in all_lines:
...     if not re.match(r'(?:(?:\d\d:){3}(?:\d\d) ){2}.*', line):
...         # line did not start with a timestamp
...         new_lines.append(new_lines.pop() + ' ' + line)
...     else:
...         new_lines.append(line)
...
>>> print('\n'.join(new_lines))
00:00:29:02 00:00:35:00 text 1 text 2
00:00:36:04 00:00:44:08 text 3 text 4
00:00:44:12 00:00:48:00 text 5
00:00:49:17 00:00:52:17 text 6 text 7 text 8
>>>
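Using a prev_line variable that you keep dumping/yielding, instead of a potentially huge new_lines list, shouldn't be too hard.
By the way, this will fail if the first line is not a timestamp: new_lines.pop() raises IndexError on an empty list.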
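One possible way to guard against that first-line failure is sketched below. The helper name join_continuations and the compiled TIMESTAMP pattern are my own illustrative names, and the sketch assumes that a leading line without a timestamp should simply be kept as its own entry.

```python
import re

# Two leading timestamps of the form NN:NN:NN:NN, each followed by a space.
TIMESTAMP = re.compile(r'(?:(?:\d\d:){3}\d\d ){2}')


def join_continuations(lines):
    """Hypothetical helper: merge lines without a timestamp into the previous line."""
    merged = []
    for line in lines:
        if merged and not TIMESTAMP.match(line):
            # Continuation line: glue it onto the entry collected so far.
            merged[-1] += ' ' + line
        else:
            # Either a timestamped line, or the very first line (kept as-is).
            merged.append(line)
    return merged


print(join_continuations([
    "orphan text",
    "00:00:29:02 00:00:35:00 text 1",
    "text 2",
]))
```

Checking `merged` before touching the last element avoids the pop() on an empty list, so an input that starts with plain text no longer raises.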
PS: Not sure why everyone is so keen on regular expressions.
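That said, if you do want the single re.sub call from the question, something along these lines might work. It is only a sketch: it assumes the whole file fits in memory as filedata, and that every real entry starts with two timestamps, each followed by a space.

```python
import re

filedata = (
    "00:00:29:02 00:00:35:00 text 1\n"
    "text 2\n"
    "00:00:36:04 00:00:44:08 text 3\n"
    "text 4\n"
    "00:00:44:12 00:00:48:00 text 5\n"
    "00:00:49:17 00:00:52:17 text 6\n"
)

# Turn a newline into a space unless the next line starts with two timestamps
# or the newline is the very last character of the data.
newdata = re.sub(r'\n(?!(?:(?:\d\d:){3}\d\d ){2}|\Z)', ' ', filedata)
print(newdata, end='')
```

The negative lookahead leaves the text after the newline untouched, so only the line break itself is replaced.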
Edit: No need to create a potentially huge new_lines list...
>>> prev_line = ''
>>> for line in all_lines:
...     if not re.match(r'(?:(?:\d\d:){3}(?:\d\d) ){2}.*', line):
...         prev_line += ' ' + line
...     else:
...         if prev_line:  # prevents the initial '' placeholder from printing
...             print(prev_line)
...         prev_line = line
...
00:00:29:02 00:00:35:00 text 1 text 2
00:00:36:04 00:00:44:08 text 3 text 4
00:00:44:12 00:00:48:00 text 5
>>> print(prev_line)  # make sure to print/dump the last one
00:00:49:17 00:00:52:17 text 6 text 7 text 8
>>>
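Two caveats: (1) If a line is actually blank, it will be skipped. (2) Although the second version with prev_line is memory-efficient even when the file is large, it will still eat memory if you have many consecutive lines without a timestamp (like lines 7 and 8), because prev_line has to hold everything until a line with a timestamp shows up. You can work around that by dumping to a file without an explicit newline (\n) and writing a newline just before dumping a line that starts with a timestamp.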
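A rough sketch of that write-through idea follows. The helper name stream_join is hypothetical, `out` can be any writable text file object, and skipping blank lines is my own choice here.

```python
import io
import re

# Two leading timestamps of the form NN:NN:NN:NN, each followed by a space.
TIMESTAMP = re.compile(r'(?:(?:\d\d:){3}\d\d ){2}')


def stream_join(lines, out):
    """Hypothetical helper: write merged entries to `out` one input line at a time."""
    first = True
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue  # assumption: blank lines carry no subtitle text
        if not first:
            # A timestamped line closes the previous entry; anything else continues it.
            out.write('\n' if TIMESTAMP.match(line) else ' ')
        out.write(line)
        first = False
    if not first:
        out.write('\n')  # terminate the last entry


buf = io.StringIO()
stream_join(
    ["00:00:29:02 00:00:35:00 text 1\n", "text 2\n", "00:00:36:04 00:00:44:08 text 3\n"],
    buf,
)
print(buf.getvalue(), end='')
```

Nothing longer than a single input line is held in memory, so a long run of lines without timestamps no longer matters. With a real file you would pass the open input file as `lines` and an output file opened for writing as `out`.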