Splitting subtitle files with dialogue into strings (or files) in Python
I have a set of subtitle files containing dialogue, like this:
1
00:00:02,460 --> 00:00:07,020
JOHN: Great.
2
00:00:07,020 --> 00:00:11,850
How are you today?
JANE: Quite alright.
JOHN: Perfect.
3
00:00:11,850 --> 00:00:17,230
Had a busy day?
4
00:00:17,230 --> 00:00:28,070
JANE: Not so much. And you?
5
00:00:28,070 --> 00:00:32,300
JOHN: Mine was okay too. Gimme a few extra minutes.
I would like to extract, for example, only JANE, and then both speakers, and get a resulting string or file like this:
Quite alright
Not so much
And you
And then both speakers combined, like this:
Great
How are you today
Quite alright
Perfect
Had a busy day
Not so much
And you
Mine was okay too
Gimme a few extra minutes
So the result is one sentence per line with punctuation removed (all punctuation except ', which is kept for contractions such as don't).
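In fact, I have already managed to strip the punctuation and the numbers/timestamps. I have been using regular expressions (infile is the input file; the first re.sub() tidies up cases where a punctuation mark is not followed by a space):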
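import re
lines = ''
for line in infile:
    if not line[0].isnumeric():  # skip block counters and timestamps
        # add the missing space after punctuation, then keep letters, ', spaces, newlines
        line = re.sub(r'(?<=[,;:.!?])(?=[a-zA-Z])', ' ', line)
        lines += re.sub(r'[^a-zA-Z\'\ \n]+', '', line)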
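To make the effect of the two substitutions above concrete, here is a small sketch that applies them to single strings. The helper name clean_line and the sample strings are made up for illustration and are not part of the original files.

```python
import re

def clean_line(line):
    """Apply the two substitutions from the loop above to one line."""
    # Zero-width match: insert a space where punctuation is directly
    # followed by a letter (e.g. "much.And" becomes "much. And").
    line = re.sub(r"(?<=[,;:.!?])(?=[a-zA-Z])", " ", line)
    # Remove every run of characters that is not a letter, apostrophe,
    # space or newline.
    return re.sub(r"[^a-zA-Z' \n]+", "", line)

sample = "JANE: Not so much.And you?\n"  # made-up line with a missing space
print(repr(clean_line(sample)))
print(repr(clean_line("I don't know.")))
```

Note that the speaker name survives the cleanup ("JANE" is still at the start of the line), which is exactly why a separate step is needed to attribute lines to speakers.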
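Unfortunately, I have not found an elegant way to condition on and extract the lines that belong to a particular speaker. In principle, I would like to be able to choose whether to save everything to the same string/file, to a separate string/file per speaker, or just one speaker.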
You basically just need to keep sniffing for a change of speaker and build up a nicely structured array of data:
import re

current_speaker = None
dialogue = []
# infile: the open subtitle file, as in the question (must be an iterator)
for the_line in infile:
    the_line = the_line.rstrip('\n')
    if the_line == '':
        continue
    if the_line.isnumeric():
        next(infile, None)  # skip the timestamp line that follows a block count
        continue  # ignore both for now
    # Actual dialogue line; a leading "NAME:" marks a change of speaker.
    m = re.search(r"^(\S+):", the_line)
    if m is not None:
        current_speaker = m.group(1)
        the_line = the_line[len(current_speaker) + 2:]  # remove name, colon, and 1 space
    dialogue.append({"spk": current_speaker, "text": the_line})

print(dialogue)
[{'spk': 'JOHN', 'text': 'Great.'}, {'spk': 'JOHN', 'text': 'How are you today? '}, {'spk': 'JANE', 'text': 'Quite alright. '}, {'spk': 'JOHN', 'text': 'Perfect.'}, {'spk': 'JOHN', 'text': 'Had a busy day?'}, {'spk': 'JANE', 'text': 'Not so much. And you?'}, {'spk': 'JOHN', 'text': 'Mine was okay too. Gimme a few extra minutes.'}]
After that, it is a simple matter of post-processing the array: splitting sentences into more entries, writing to files, and so on.
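One possible shape for that post-processing step is sketched below. It assumes the list of {"spk", "text"} entries printed above, treats any run of ., ! or ? as a sentence boundary (so abbreviations such as "Mr." would be split incorrectly), and keeps apostrophes for contractions. The function names sentences and by_speaker are hypothetical helpers, not part of the answer's code.

```python
import re

# Sample data in the shape produced by the speaker-tracking loop.
dialogue = [
    {"spk": "JOHN", "text": "Great."},
    {"spk": "JOHN", "text": "How are you today? "},
    {"spk": "JANE", "text": "Quite alright. "},
    {"spk": "JOHN", "text": "Perfect."},
    {"spk": "JOHN", "text": "Had a busy day?"},
    {"spk": "JANE", "text": "Not so much. And you?"},
    {"spk": "JOHN", "text": "Mine was okay too. Gimme a few extra minutes."},
]

def sentences(dialogue, speaker=None):
    """Return cleaned sentences in order, optionally for one speaker only."""
    result = []
    for entry in dialogue:
        if speaker is not None and entry["spk"] != speaker:
            continue
        # Assumption: sentences end at runs of '.', '!' or '?'.
        for part in re.split(r"[.!?]+", entry["text"]):
            # Keep only letters, apostrophes and spaces, then trim.
            cleaned = re.sub(r"[^a-zA-Z' ]+", "", part).strip()
            if cleaned:
                result.append(cleaned)
    return result

def by_speaker(dialogue):
    """Return a dict mapping each speaker to that speaker's sentences."""
    speakers = {entry["spk"] for entry in dialogue}
    return {spk: sentences(dialogue, spk) for spk in speakers}

print("\n".join(sentences(dialogue, "JANE")))  # one speaker
print("\n".join(sentences(dialogue)))          # both speakers combined
```

Each list can be joined with "\n" into a single string or written to its own file, for example one file per key of by_speaker(dialogue), which covers the three cases asked about: everything together, one output per speaker, or a single speaker.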