How to convert a subtitle file to have only one sentence per subtitle?
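I am trying to write a method that converts a subtitle file so that each subtitle always contains exactly one sentence.
My idea is the following:
1. For each subtitle:
1.1 -> I get the subtitle duration
1.2 -> I calculate the characters_per_second
1.3 -> I use this to store (inside dict_times_word_subtitle) the time it takes to speak word i
2. I extract the sentences from the entire text
3. For each sentence:
3.1 I store (inside dict_sentences_subtitle) the time it takes to speak the sentence, computed from the times of its individual words
4. I create a new srt file (subtitle file) that starts at the same time as the original srt file; the subtitle timings can then be taken from the durations it takes to speak the sentences.
For now, I have written the following code: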
#---------------------------------------------------------
import pysrt
import re
from datetime import datetime, date, time, timedelta
#---------------------------------------------------------

def convert_subtitle_one_sentence(file_name):

    sub = pysrt.open(file_name)

    ### ----------------------------------------------------------------------
    ### Store Each Word and the Average Time it Takes to Say it in a dictionary
    ### ----------------------------------------------------------------------
    dict_times_word_subtitle = {}
    running_variable = 0
    for i in range(len(sub)):
        subtitle_text = sub[i].text
        subtitle_duration = (datetime.combine(date.min, sub[i].duration.to_time()) - datetime.min).total_seconds()
        # Compute characters per second
        characters_per_second = len(subtitle_text)/subtitle_duration
        # Store Each Word and the Average Time (seconds) it Takes to Say in a Dictionary
        for j,word in enumerate(subtitle_text.split()):
            if j == len(subtitle_text.split())-1:
                time = len(word)/characters_per_second
            else:
                time = len(word+" ")/characters_per_second
            dict_times_word_subtitle[str(running_variable)] = [word, time]
            running_variable += 1

    ### ----------------------------------------------------------------------
    ### Store Each Sentence and the Average Time to Say it in a Dictionary
    ### ----------------------------------------------------------------------
    total_number_of_words = len(dict_times_word_subtitle.keys())
    # Get the entire text
    entire_text = ""
    for i in range(total_number_of_words):
        entire_text += dict_times_word_subtitle[str(i)][0] +" "
    # Initialize the dictionary
    dict_times_sentences_subtitle = {}
    # Loop through all found sentences
    last_number_of_words = 0
    for i,sentence in enumerate(re.findall(r'([A-Z][^\.!?]*[\.!?])', entire_text)):
        number_of_words = len(sentence.split())
        # Compute the time it takes to speak the sentence
        time_sentence = 0
        for j in range(last_number_of_words, last_number_of_words + number_of_words):
            time_sentence += dict_times_word_subtitle[str(j)][1]
        # Store the sentence together with the time it takes to say the sentence
        dict_times_sentences_subtitle[str(i)] = [sentence, round(time_sentence,3)]
        ## Update last number_of_words
        last_number_of_words += number_of_words
    # Check if there is a non-sentence remaining at the end
    if j < total_number_of_words:
        remaining_string = ""
        remaining_string_time = 0
        for k in range(j+1, total_number_of_words):
            remaining_string += dict_times_word_subtitle[str(k)][0] + " "
            remaining_string_time += dict_times_word_subtitle[str(k)][1]
        dict_times_sentences_subtitle[str(i+1)] = [remaining_string, remaining_string_time]

    ### ----------------------------------------------------------------------
    ### Create a new Subtitle file with only 1 sentence at a time
    ### ----------------------------------------------------------------------
    # Initialize new srt file
    new_srt = pysrt.SubRipFile()
    # Loop through all sentences
    # get initial start time (seconds)
    start_time = (datetime.combine(date.min, sub[0].start.to_time()) - datetime.min).total_seconds()
    for i in range(len(dict_times_sentences_subtitle.keys())):
        sentence = dict_times_sentences_subtitle[str(i)][0]
        print(sentence)
        time_sentence = dict_times_sentences_subtitle[str(i)][1]
        print(time_sentence)
        item = pysrt.SubRipItem(
            index=i,
            start=pysrt.SubRipTime(seconds=start_time),
            end=pysrt.SubRipTime(seconds=start_time+time_sentence),
            text=sentence)
        new_srt.append(item)
        ## Update Start Time
        start_time += time_sentence
    new_srt.save(file_name)
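Problem:
There is no error message, but when I apply this to real subtitle files and then watch the video, the subtitles start out correct, but as the video progresses the error accumulates and the subtitles align less and less with what is actually being said.
Example: the speaker has finished talking, but subtitles keep appearing.
Simple example to test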
srt = """
1
00:00:13,100 --> 00:00:14,750
Dr. Martin Luther King, Jr.,

2
00:00:14,750 --> 00:00:18,636
in a 1968 speech where he reflects
upon the Civil Rights Movement,

3
00:00:18,636 --> 00:00:21,330
states, "In the end,

4
00:00:21,330 --> 00:00:24,413
we will remember not the words of our enemies

5
00:00:24,413 --> 00:00:27,280
but the silence of our friends."

6
00:00:27,280 --> 00:00:29,800
As a teacher, I've internalized this message.
"""
with open('test.srt', "w") as file:
    file.write(srt)
convert_subtitle_one_sentence("test.srt")
The output looks like this (yes, there is still some work to do on the sentence recognition part, e.g. Dr.):
0
00:00:13,100 --> 00:00:13,336
Dr.
1
00:00:13,336 --> 00:00:14,750
Martin Luther King, Jr.
2
00:00:14,750 --> 00:00:23,514
Civil Rights Movement, states, "In the end, we will remember not the words of our enemies but the silence of our friends.
3
00:00:23,514 --> 00:00:26,175
As a teacher, I've internalized this message.
4
00:00:26,175 --> 00:00:29,859
our friends." As a teacher, I've internalized this message.
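As you can see, the original last timestamp is 00:00:29,800, whereas in the output file it is 00:00:29,859. This may not seem like much at the start, but the difference grows as the video gets longer.
The full sample video can be downloaded here: https://ufile.io/19nuvqb3
The full subtitle file: https://ufile.io/qracb7ai
Note: the subtitle file will be overwritten, so you may want to store a copy under a different name in order to compare.
Way to fix it:
The exact time at which words start or end an original subtitle is known. This can be used to cross-check and adjust the timing accordingly.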
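One way to use those known boundaries, sketched under assumptions: instead of accumulating estimated durations from the first subtitle onward, spread each cue's words across that cue's own interval. Every original start and end then acts as a fixed anchor, so an estimation error in one cue cannot carry over into later ones. The function name `word_timings` and the `(start, end, text)` tuples in seconds are hypothetical and not part of pysrt.

```python
def word_timings(cues):
    """Assign each word an absolute (start, end) inside its own cue.

    cues: list of (start_seconds, end_seconds, text) tuples. This input
    shape is an assumption made for the sketch.
    """
    timed = []
    for start, end, text in cues:
        words = text.split()
        if not words:
            continue
        # Weight each word by its length plus one separator character.
        weights = [len(w) + 1 for w in words]
        total = sum(weights)
        span = end - start
        elapsed = 0
        for i, (word, weight) in enumerate(zip(words, weights)):
            w_start = start + span * elapsed / total
            elapsed += weight
            # Pin the last word to the cue's real end time.
            w_end = end if i == len(words) - 1 else start + span * elapsed / total
            timed.append((word, w_start, w_end))
    return timed


cues = [
    (24.413, 27.280, 'but the silence of our friends."'),
    (27.280, 29.800, "As a teacher, I've internalized this message."),
]
timings = word_timings(cues)
```

With this layout, the start of a sentence can be read from its first word and the end from its last word, and the final timestamp stays at the original 29.8 seconds rather than drifting.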
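Edit
Here is code that creates a dictionary storing each character, its character_duration (averaged over the subtitle), and the original start or end timestamp if one exists for that character.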
import pysrt

sub = pysrt.open('video.srt')
running_variable = 0
dict_subtitle = {}
for i in range(len(sub)):
    # Extract Start Time Stamp
    timestamb_start = sub[i].start
    # Extract Text (whitespace normalised once so the indices below line up)
    text = " ".join(sub[i].text.split())
    # Extract End Time Stamp
    timestamb_end = sub[i].end
    # Extract Characters per Second
    characters_per_second = sub[i].characters_per_second
    # Fill Dictionary
    for j,character in enumerate(text):
        # Seconds per character is the inverse of characters per second
        character_duration = len(character)/characters_per_second
        dict_subtitle[str(running_variable)] = [character,character_duration,False, False]
        if j == 0: dict_subtitle[str(running_variable)] = [character, character_duration, timestamb_start, False]
        if j == len(text)-1 : dict_subtitle[str(running_variable)] = [character, character_duration, False, timestamb_end]
        running_variable += 1
More videos to try
Here you can download more videos and their respective subtitle files: https://filebin.net/kwygjffdlfi62pjs
Edit 3
4
00:00:18,856 --> 00:00:25,904
Je rappelle la définition de ce qu'est un produit scalaire, <i>dot product</i> dans <i>Ⅎ</i>.
5
00:00:24,855 --> 00:00:30,431
Donc je prends deux vecteurs dans <i>Ⅎ</i> et je définis cette opération-là , linéaire, <i>u
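This may not be what you are after, but rather than calculating the times, why not take them directly from the subtitle file itself?
I mocked this up as an example. It is not perfect by a long way, but it may help.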
import re

#Pre-process file to remove blank lines, line numbers and timestamp --> chars
with open('video.srt','r') as f:
    lines = f.readlines()
with open('video.tmp','w') as f:
    for line in lines:
        line = line.strip()
        if line.strip():
            if line.strip().isnumeric():
                continue
            else:
                line = line.replace(' --> ', ' ')
                line = line+" "
                f.write(line)

# Process pre-processed file
with open('video.tmp','r') as f:
    lines = f.readlines()

outfile = open('new_video.srt','w')
idx = 0

# Define the regex options we will need
#regex to look for the time stamps in each sentence using the first and last only
timestamps = re.compile(r'\d{1,2}(?::\d{2}){1,2}(?:,)\d{3}')
#regex to remove html tags from length calculations
tags = re.compile(r'<.*?>')
#re.split(r'([^\s0-9]\.)',a)
# This is to cope with text that contains mathematical, chemical formulae, ip addresses etc
# where "." does not mean full-stop (end of sentence)
# This is used to split on a "." only if it is NOT preceded by space or a number
# this should catch most things but will fail to split the sentence if it genuinely
# ends with a number followed by a full-stop.
end_of_sentence = re.compile(r'([^\s0-9]\.)')
#sentences = str(lines).split('.')
sentences = re.split(end_of_sentence,str(lines))

# Because the sentences were split on "x." we now have to add that back
# so we concatenate every other list item with the previous one.
idx = 0
joined =[]
while idx < (len(sentences) -1) :
    joined.append(sentences[idx]+sentences[idx+1])
    idx += 2
sentences = joined

previous_timings =["00:00:00,000","00:00:00,000"]
previous_sentence = ""
#Dictionary of timestamps that will require post-processing
registry = {}
loop = 0
for sentence in sentences:
    print(sentence)
    timings = timestamps.findall(sentence)
    idx+=1
    outfile.write(str(idx)+"\n")
    if timings:
        #There are timestamps in the sentence
        previous_timings = timings
        loop = 0
        start_time = timings[0]
        end_time = timings[-1]
        # Revert list item to a string
        sentence = ''.join(sentence)
        # Remove timestamps from the text
        sentence = ''.join(re.sub(timestamps,' ', sentence))
        # Get rid of multiple spaces and \ characters
        sentence = ' '.join(sentence.split())
        sentence = sentence.replace('  ', ' ')
        sentence = sentence.replace("\'", "'")
        previous_sentence = sentence
        print("Starts at", start_time)
        print(sentence)
        print("Ends at", end_time,'\n')
        outfile.write(start_time+" --> "+end_time+"\n")
        outfile.write(sentence+"\n\n")
    else:
        # There are no timestamps in the sentence therefore this must
        # be a separate sentence cut adrift from an existing timestamp
        # We will have to estimate its start and end times using data
        # from the last time stamp we know of
        start_time = previous_timings[0]
        reg_end_time = previous_timings[-1]
        # Convert timestamp to seconds
        h,m,s,milli = re.split(':|,',start_time)
        s_time = (3600*int(h))+(60*int(m))+int(s)+(int(milli)/1000)
        # Guess the timing for the previous sentence and add it
        # but only for the first adrift sentence as the start time will be adjusted
        # This number may well vary depending on the cadence of the speaker
        if loop == 0:
            registry[reg_end_time] = reg_end_time
            #s_time += 0.06 * len(previous_sentence)
            s_time += 0.06 * len(tags.sub('',previous_sentence))
        # Guess the end time
        e_time = s_time + (0.06 * len(tags.sub('',previous_sentence)))
        # Convert start to a timestamp
        s,milli = divmod(s_time,1)
        m,s = divmod(int(s),60)
        h,m = divmod(m,60)
        start_time = "{:02d}:{:02d}:{:02d},{:03d}".format(h,m,s,round(milli*1000))
        # Convert end to a timestamp
        s,milli = divmod(e_time,1)
        m,s = divmod(int(s),60)
        h,m = divmod(m,60)
        end_time = "{:02d}:{:02d}:{:02d},{:03d}".format(h,m,s,round(milli*1000))
        #Register new end time for previous sentence
        if loop == 0:
            loop = 1
            registry[reg_end_time] = start_time
        print("Starts at", start_time)
        print(sentence)
        print("Ends at", end_time,'\n')
        outfile.write(start_time+" --> "+end_time+"\n")
        outfile.write(sentence+"\n\n")
        try:
            # re-set the previous start time in case the following sentence
            # was cut adrift from its time stamp as well
            previous_timings[0] = end_time
        except:
            pass
outfile.close()
#Post processing
if registry:
    outfile = open('new_video.srt','r')
    text = outfile.read()
    new_text = text
    # Run through registered end times and replace them
    # if not the video player will not display the subtitles
    # correctly because they overlap in time
    for key, end in registry.items():
        new_text = new_text.replace(key, end, 1)
        print("replacing", key, "with", end)
    outfile.close()
    outfile = open('new_video.srt','w')
    outfile.write(new_text)
    outfile.close()
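Edit:
Happily, I stuck with this code because the problem interested me.
I appreciate that it is hacky and does not use the pysrt subtitle module, only re, but I believe that in this case it does a fair job.
I have commented the edited code, so hopefully it is clear what I am doing and why.
The regex looks for timestamp patterns such as 00:00:00,000 or 0:00:00,000, i.e.
\d{1,2}(?::\d{2}){1,2}(?:,)\d{3}
1 or 2 digits, followed by one or two groups of : plus 2 digits, followed by , plus 3 digits.
If a concatenated sentence contains more than one start and end time, we only need the first (the sentence start time) and the last (the sentence end time) for the whole sentence. I hope that is clear.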
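As a small illustration of that first-and-last rule, here is a sketch that runs the same pattern over a made-up concatenated sentence. The sample string is an assumption chosen for the example, not output from the script above.

```python
import re

# Same pattern as above: 1-2 digits, one or two ":dd" groups, then ",ddd".
timestamps = re.compile(r'\d{1,2}(?::\d{2}){1,2}(?:,)\d{3}')

# A sentence spanning two cues after pre-processing (illustrative text).
sentence = ('00:00:24,413 00:00:27,280 but the silence of our friends." '
            '00:00:27,280 00:00:29,800 As a teacher')

found = timestamps.findall(sentence)
# Only the first and the last timestamp matter for the whole sentence.
start_time, end_time = found[0], found[-1]
# Strip the timestamps and collapse the leftover whitespace.
text = ' '.join(timestamps.sub(' ', sentence).split())
```

Here `start_time` is the first timestamp found and `end_time` the last, while the two in the middle are discarded along with the others when the text is cleaned.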
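Edit 2
This version copes with full stops in mathematical and chemical formulae, IP numbers and so on; basically, places where a full stop does not mark the end of a sentence.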
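A minimal sketch of that splitting rule, assuming the intended character class is `[^\s0-9]` (a full stop counts as a sentence end only when the character before it is neither whitespace nor a digit, as the code comments describe). The sample text is made up for the illustration.

```python
import re

# Split on "." only when the preceding character is not whitespace or a digit.
end_of_sentence = re.compile(r'([^\s0-9]\.)')

text = "Pi is roughly 3.14 in most cases. The host is 10.0.0.1 today. Done."
pieces = end_of_sentence.split(text)

# re.split keeps the captured "x." delimiters at the odd indices, so glue
# each one back onto the fragment that precedes it.
sentences = [
    (pieces[i] + pieces[i + 1]).strip()
    for i in range(0, len(pieces) - 1, 2)
]
```

The decimal number and the IP address survive intact because every full stop inside them follows a digit. The known limitation stays the same: a sentence that genuinely ends in a number followed by a full stop is not split.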
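As requested, I have re-coded this to rely on the pysrt package, plus a smidgen of re.
The idea is to build a dictionary keyed on start times.
If the start time already exists, the data is added to the entry for that time and the end_time is updated at the same time, so the end time advances with the text.
If the start time does not exist, it is simply a new dictionary entry.
The start time only advances once we know that a sentence has been completed.
So in essence, we start building a sentence with a fixed start time. The sentence continues to be built by adding more text and updating the end time until the sentence ends. At that point we advance the start time using the current record, which we know begins a new sentence.
Sub-title entries containing multiple sentences are broken up, with the start and end times calculated using the pysrt characters_per_second value of the entire sub-title entry before it was split.
Finally, a new sub-title file is written to disk from the entries in the dictionary.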
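The accumulation idea can be sketched without pysrt as follows. This is a simplified illustration, not the full code: the function name `merge_into_sentences` and the `(start, end, text)` tuples are hypothetical, a cue is treated as closing a sentence only when it ends in `.`, `?` or `!` (optionally followed by a closing quote or bracket), and cues containing several sentences are not split.

```python
def merge_into_sentences(cues, closers='"\')'):
    """Merge consecutive cues until one of them ends a sentence.

    cues: list of (start, end, text) tuples holding timestamp strings
    (an assumed shape for this sketch).
    Returns a dict mapping start_time -> [end_time, sentence].
    """
    merged = {}
    start_time = None
    for start, end, text in cues:
        if start_time is None:
            start_time = start                    # a new sentence begins here
        text = ' '.join(text.split())             # flatten line breaks
        if start_time in merged:
            merged[start_time][0] = end           # end time advances
            merged[start_time][1] += ' ' + text   # text accumulates
        else:
            merged[start_time] = [end, text]
        # Simplification: only the final character decides whether it ended.
        if text.rstrip(closers).endswith(('.', '?', '!')):
            start_time = None
    return merged


cues = [
    ("00:00:13,100", "00:00:14,750", "Dr. Martin Luther King, Jr.,"),
    ("00:00:14,750", "00:00:18,636",
     "in a 1968 speech where he reflects\nupon the Civil Rights Movement,"),
    ("00:00:18,636", "00:00:21,330", 'states, "In the end,'),
    ("00:00:21,330", "00:00:24,413",
     "we will remember not the words of our enemies"),
    ("00:00:24,413", "00:00:27,280", 'but the silence of our friends."'),
    ("00:00:27,280", "00:00:29,800",
     "As a teacher, I've internalized this message."),
]
merged = merge_into_sentences(cues)
```

Because every start and end time is copied from an original cue, nothing is estimated in this simplified case; estimation from characters per second is only needed when a single cue has to be split.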
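Clearly, with just a single file to play with, I may well be missing some sub-title layout hurdles, but at least it gives you a working starting point.
The code is commented throughout, so most things should be clear as to how and why.
Edit:
I have improved the check for an existing dictionary start time and changed the method used to decide whether a sentence has ended, i.e. the full stop is put back into the text after splitting.
The second video you mentioned does have sub-titles that are slightly off; for a start, note that there are no millisecond values at all.
The following code does a decent job on the second video and on the first as well.
Edit 2: Added handling of consecutive full stops and removal of html <> tags.
Edit 3: It turns out that pysrt removes html tags from its characters-per-second calculation. I now do so as well, which means that <html> formatting can be retained in the sub-titles.
Edit 4: This version copes with full stops in mathematical and chemical formulae, IP numbers and so on; basically, places where a full stop does not mark the end of a sentence.
It also allows for sentences ending in ? and !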
import pysrt
import re

abbreviations = ['Dr.','Mr.','Mrs.','Ms.','etc.','Jr.','e.g.'] # You get the idea!
abbrev_replace = ['Dr','Mr','Mrs','Ms','etc','Jr','eg']
subs = pysrt.open('new.srt')
subs_dict = {}          # Dictionary to accumulate new sub-titles (start_time:[end_time,sentence])
start_sentence = True   # Toggle this at the start and end of sentences

# regex to remove html tags from the character count
tags = re.compile(r'<.*?>')

# regex to split on ".", "?" or "!" ONLY if it is preceded by something else
# which is not a digit and is not a space. (Not perfect but close enough)
# Note: ? and ! can be an issue in some languages (e.g. french) where both ? and !
# are traditionally preceded by a space ! rather than!
end_of_sentence = re.compile(r'([^\s0-9][\.\?\!])')

# End of sentence characters
eos_chars = set([".","?","!"])
for sub in subs:
    if start_sentence:
        start_time = sub.start
        start_sentence = False
    text = sub.text

    #Remove multiple full-stops e.g. "and ....."
    text = re.sub(r'\.+', '.', text)

    # Optional
    for idx, abr in enumerate(abbreviations):
        if abr in text:
            text = text.replace(abr,abbrev_replace[idx])
    # A test could also be made for initials in names i.e. John E. Rotten - showing my age there ;)

    multi = re.split(end_of_sentence,text.strip())
    cps = sub.characters_per_second

    # Test for a sub-title with multiple sentences
    if len(multi) > 1:
        # regex end_of_sentence breaks sentence start and sentence end into 2 parts
        # we need to put them back together again.
        # hence the odd range because the joined end part is then deleted
        for cnt in range(divmod(len(multi),2)[0]): # e.g. len=3 give 0 | 5 gives 0,1 | 7 gives 0,1,2
            multi[cnt] = multi[cnt] + multi[cnt+1]
            del multi[cnt+1]

        for part in multi:
            if len(part): # Avoid blank parts
                pass
            else:
                continue
            # Convert start time to seconds
            h,m,s,milli = re.split(':|,',str(start_time))
            s_time = (3600*int(h))+(60*int(m))+int(s)+(int(milli)/1000)

            # test for existing data
            try:
                existing_data = subs_dict[str(start_time)]
                end_time = str(existing_data[0])
                h,m,s,milli = re.split(':|,',str(existing_data[0]))
                e_time = (3600*int(h))+(60*int(m))+int(s)+(int(milli)/1000)
            except:
                existing_data = []
                e_time = s_time

            # End time is the start time or existing end time + the time taken to say the current words
            # based on the calculated number of characters per second
            # use regex "tags" to remove any html tags from the character count.
            e_time = e_time + len(tags.sub('',part)) / cps

            # Convert start to a timestamp
            s,milli = divmod(s_time,1)
            m,s = divmod(int(s),60)
            h,m = divmod(m,60)
            start_time = "{:02d}:{:02d}:{:02d},{:03d}".format(h,m,s,round(milli*1000))

            # Convert end to a timestamp
            s,milli = divmod(e_time,1)
            m,s = divmod(int(s),60)
            h,m = divmod(m,60)
            end_time = "{:02d}:{:02d}:{:02d},{:03d}".format(h,m,s,round(milli*1000))

            # if text already exists add the current text to the existing text
            # if not use the current text to write/rewrite the dictionary entry
            if existing_data:
                new_text = existing_data[1] + " " + part
            else:
                new_text = part
            subs_dict[str(start_time)] = [end_time,new_text]

            # if sentence ends re-set the current start time to the end time just calculated
            if any(x in eos_chars for x in part):
                start_sentence = True
                start_time = end_time
                print ("Split",start_time,"-->",end_time,)
                print (new_text)
                print('\n')
            else:
                start_sentence = False

    else:   # This is Not a multi-part sub-title
        end_time = str(sub.end)

        # Check for an existing dictionary entry for this start time
        try:
            existing_data = subs_dict[str(start_time)]
        except:
            existing_data = []

        # if it already exists add the current text to the existing text
        # if not use the current text
        if existing_data:
            new_text = existing_data[1] + " " + text
        else:
            new_text = text
        # Create or Update the dictionary entry for this start time
        # with the updated text and the current end time
        subs_dict[str(start_time)] = [end_time,new_text]

        if any(x in eos_chars for x in text):
            start_sentence = True
            print ("Single",start_time,"-->",end_time,)
            print (new_text)
            print('\n')
        else:
            start_sentence = False

# Generate the new sub-title file from the dictionary
idx=0
outfile = open('video_new.srt','w')
for key, text in subs_dict.items():
    idx+=1
    outfile.write(str(idx)+"\n")
    outfile.write(key+" --> "+text[0]+"\n")
    outfile.write(text[1]+"\n\n")
outfile.close()
The output after running the above code on your video.srt file is as follows:
1
00:00:13,100 --> 00:00:27,280
Dr Martin Luther King, Jr, in a 1968 speech where he reflects
upon the Civil Rights Movement, states, "In the end, we will remember not the words of our enemies but the silence of our friends."
2
00:00:27,280 --> 00:00:29,800
As a teacher, I've internalized this message.
3
00:00:29,800 --> 00:00:39,701
Every day, all around us, we see the consequences of silence manifest themselves in the form of discrimination, violence, genocide and war.
4
00:00:39,701 --> 00:00:46,178
In the classroom, I challenge my students to explore the silences in their own lives through poetry.
5
00:00:46,178 --> 00:00:54,740
We work together to fill those spaces, to recognize them, to name them, to understand that they don't
have to be sources of shame.
6
00:00:54,740 --> 00:01:14,408
In an effort to create a culture within my classroom where students feel safe sharing the intimacies of their own silences, I have four core principles posted on the board that sits in the front of my class, which every student signs
at the beginning of the year: read critically, write consciously, speak clearly, tell your truth.
7
00:01:14,408 --> 00:01:18,871
And I find myself thinking a lot about that last point, tell your truth.
8
00:01:18,871 --> 00:01:28,848
And I realized that if I was going to ask my students to speak up, I was going to have to tell my truth and be honest with them about the times where I failed to do so.
9
00:01:28,848 --> 00:01:44,479
So I tell them that growing up, as a kid in a Catholic family in New Orleans, during Lent I was always taught that the most meaningful thing one could do was to give something up, sacrifice something you typically indulge in to prove to God you understand his sanctity.
10
00:01:44,479 --> 00:01:50,183
I've given up soda, McDonald's, French fries, French kisses, and everything in between.
11
00:01:50,183 --> 00:01:54,071
But one year, I gave up speaking.
12
00:01:54,071 --> 00:02:03,286
I figured the most valuable thing I could sacrifice was my own voice, but it was like I hadn't realized that I had given that up a long time ago.
13
00:02:03,286 --> 00:02:23,167
I spent so much of my life telling people the things they wanted to hear instead of the things they needed to, told myself I wasn't meant to be anyone's conscience because I still had to figure out being my own, so sometimes I just wouldn't say anything, appeasing ignorance with my silence, unaware that validation doesn't need words to endorse its existence.
14
00:02:23,167 --> 00:02:29,000
When Christian was beat up for being gay, I put my hands in my pocket and walked with my head
down as if I didn't even notice.
15
00:02:29,000 --> 00:02:39,502
I couldn't use my locker for weeks
because the bolt on the lock reminded me of the one I had put on my lips when the homeless man on the corner looked at me with eyes up merely searching for an affirmation that he was worth seeing.
16
00:02:39,502 --> 00:02:43,170
I was more concerned with
touching the screen on my Apple than actually feeding him one.
17
00:02:43,170 --> 00:02:46,049
When the woman at the fundraising gala said "I'm so proud of you.
18
00:02:46,049 --> 00:02:53,699
It must be so hard teaching
those poor, unintelligent kids," I bit my lip, because apparently
we needed her money more than my students needed their dignity.
19
00:02:53,699 --> 00:03:02,878
We spend so much time listening to the things people are saying that we rarely pay attention to the things they don't.
20
00:03:02,878 --> 00:03:06,139
Silence is the residue of fear.
21
00:03:06,139 --> 00:03:09,615
It is feeling your flaws gut-wrench guillotine your tongue.
22
00:03:09,615 --> 00:03:13,429
It is the air retreating from your chest because it doesn't feel safe in your lungs.
23
00:03:13,429 --> 00:03:15,186
Silence is Rwandan genocide.
24
00:03:15,186 --> 00:03:16,423
Silence is Katrina.
25
00:03:16,553 --> 00:03:19,661
It is what you hear when there
aren't enough body bags left.
26
00:03:19,661 --> 00:03:22,062
It is the sound after the noose is already tied.
27
00:03:22,062 --> 00:03:22,870
It is charring.
28
00:03:22,870 --> 00:03:23,620
It is chains.
29
00:03:23,620 --> 00:03:24,543
It is privilege.
30
00:03:24,543 --> 00:03:25,178
It is pain.
31
00:03:25,409 --> 00:03:28,897
There is no time to pick your battles when your battles have already picked you.
32
00:03:28,897 --> 00:03:31,960
I will not let silence wrap itself around my indecision.
33
00:03:31,960 --> 00:03:36,287
I will tell Christian that he is a lion, a sanctuary of bravery and brilliance.
34
00:03:36,287 --> 00:03:42,340
I will ask that homeless man what his name is and how his day was, because sometimes all people want to be is human.
35
00:03:42,340 --> 00:03:51,665
I will tell that woman that my students can talk about transcendentalism like their last name was Thoreau, and just because you watched
one episode of "The Wire" doesn't mean you know anything about my kids.
36
00:03:51,665 --> 00:04:03,825
So this year, instead of giving something up, I will live every day as if there were a microphone tucked under my tongue, a stage on the underside of my inhibition.
37
00:04:03,825 --> 00:04:10,207
Because who has to have a soapbox when all you've ever needed is your voice?
38
00:04:10,207 --> 00:04:12,712
Thank you.
39
00:04:12,712 --> 00:00:00,000
(Applause)
我正在尝试编写一种转换字幕文件的方法,这样每个字幕总是 一个句子。
我的想法是:
- 对于每个字幕:
1.1 -> 我得到了字幕时长
1.2 -> 计算 characters_per_second
1.3 -> 使用它来存储(在 dict_times_word_subtitle
内)说出单词 i
我从全文中提取句子
对于每个句子:
3.1 我存储(在 dict_sentences_subtitle
内)用特定单词说句子所需的时间(从中我可以得到说这些单词的持续时间)
- 我创建了一个新的 srt 文件(字幕文件),它与原始 srt 文件同时开始,然后可以从说句子的持续时间中获取字幕时间。
目前,我已经编写了以下代码:
#---------------------------------------------------------
import pysrt
import re
from datetime import datetime, date, time, timedelta
#---------------------------------------------------------
def convert_subtitle_one_sentence(file_name):
sub = pysrt.open(file_name)
### ----------------------------------------------------------------------
### Store Each Word and the Average Time it Takes to Say it in a dictionary
### ----------------------------------------------------------------------
dict_times_word_subtitle = {}
running_variable = 0
for i in range(len(sub)):
subtitle_text = sub[i].text
subtitle_duration = (datetime.combine(date.min, sub[i].duration.to_time()) - datetime.min).total_seconds()
# Compute characters per second
characters_per_second = len(subtitle_text)/subtitle_duration
# Store Each Word and the Average Time (seconds) it Takes to Say in a Dictionary
for j,word in enumerate(subtitle_text.split()):
if j == len(subtitle_text.split())-1:
time = len(word)/characters_per_second
else:
time = len(word+" ")/characters_per_second
dict_times_word_subtitle[str(running_variable)] = [word, time]
running_variable += 1
### ----------------------------------------------------------------------
### Store Each Sentence and the Average Time to Say it in a Dictionary
### ----------------------------------------------------------------------
total_number_of_words = len(dict_times_word_subtitle.keys())
# Get the entire text
entire_text = ""
for i in range(total_number_of_words):
entire_text += dict_times_word_subtitle[str(i)][0] +" "
# Initialize the dictionary
dict_times_sentences_subtitle = {}
# Loop through all found sentences
last_number_of_words = 0
for i,sentence in enumerate(re.findall(r'([A-Z][^\.!?]*[\.!?])', entire_text)):
number_of_words = len(sentence.split())
# Compute the time it takes to speak the sentence
time_sentence = 0
for j in range(last_number_of_words, last_number_of_words + number_of_words):
time_sentence += dict_times_word_subtitle[str(j)][1]
# Store the sentence together with the time it takes to say the sentence
dict_times_sentences_subtitle[str(i)] = [sentence, round(time_sentence,3)]
## Update last number_of_words
last_number_of_words += number_of_words
# Check if there is a non-sentence remaining at the end
if j < total_number_of_words:
remaining_string = ""
remaining_string_time = 0
for k in range(j+1, total_number_of_words):
remaining_string += dict_times_word_subtitle[str(k)][0] + " "
remaining_string_time += dict_times_word_subtitle[str(k)][1]
dict_times_sentences_subtitle[str(i+1)] = [remaining_string, remaining_string_time]
### ----------------------------------------------------------------------
### Create a new Subtitle file with only 1 sentence at a time
### ----------------------------------------------------------------------
# Initalize new srt file
new_srt = pysrt.SubRipFile()
# Loop through all sentence
# get initial start time (seconds)
#
start_time = (datetime.combine(date.min, sub[0].start.to_time()) - datetime.min).total_seconds()
for i in range(len(dict_times_sentences_subtitle.keys())):
sentence = dict_times_sentences_subtitle[str(i)][0]
print(sentence)
time_sentence = dict_times_sentences_subtitle[str(i)][1]
print(time_sentence)
item = pysrt.SubRipItem(
index=i,
start=pysrt.SubRipTime(seconds=start_time),
end=pysrt.SubRipTime(seconds=start_time+time_sentence),
text=sentence)
new_srt.append(item)
## Update Start Time
start_time += time_sentence
new_srt.save(file_name)
问题:
没有错误消息,但是当我将其应用于真正的字幕文件然后观看视频时,字幕正确开始,但是随着视频的进行(错误进行),字幕越来越不符合实际内容居然说了
示例:演讲者讲完了,但字幕一直在出现。
要测试的简单示例
srt = """
1
00:00:13,100 --> 00:00:14,750
Dr. Martin Luther King, Jr.,
2
00:00:14,750 --> 00:00:18,636
in a 1968 speech where he reflects
upon the Civil Rights Movement,
3
00:00:18,636 --> 00:00:21,330
states, "In the end,
4
00:00:21,330 --> 00:00:24,413
we will remember not the words of our enemies
5
00:00:24,413 --> 00:00:27,280
but the silence of our friends."
6
00:00:27,280 --> 00:00:29,800
As a teacher, I've internalized this message.
"""
with open('test.srt', "w") as file:
file.write(srt)
convert_subtitle_one_sentence("test.srt")
输出看起来像这样(是的,在句子识别 par(即博士)方面还有一些工作要做):
0 00:00:13,100 --> 00:00:13,336 Dr. 1 00:00:13,336 --> 00:00:14,750 Martin Luther King, Jr. 2 00:00:14,750 --> 00:00:23,514 Civil Rights Movement, states, "In the end, we will remember not the words of our enemies but the silence of our friends. 3 00:00:23,514 --> 00:00:26,175 As a teacher, I've internalized this message. 4 00:00:26,175 --> 00:00:29,859 our friends." As a teacher, I've internalized this message.
如您所见,原始的最后一个时间戳是 00:00:29,800
,而在输出文件中它是 00:00:29,859
。这在开始时可能看起来并不多,但随着视频变长,差异会增加。
完整的示例视频可以在这里下载:https://ufile.io/19nuvqb3
完整字幕文件:https://ufile.io/qracb7ai
注意:字幕文件将被覆盖,因此您可能需要用另一个名称存储一个副本以便能够进行比较。
修复方法:
单词开始或结束原始字幕的确切时间是已知的。这可用于交叉检查和相应地调整时间。
编辑
这里是创建一个字典的代码,它存储字符,character_duration(字幕的平均值)和开始或结束原始时间戳,如果它存在于这个字符。
sub = pysrt.open('video.srt')
running_variable = 0
dict_subtitle = {}
for i in range(len(sub)):
# Extract Start Time Stamb
timestamb_start = sub[i].start
# Extract Text
text =sub[i].text
# Extract End Time Stamb
timestamb_end = sub[i].end
# Extract Characters per Second
characters_per_second = sub[i].characters_per_second
# Fill Dictionary
for j,character in enumerate(" ".join(text.split())):
character_duration = len(character)*characters_per_second
dict_subtitle[str(running_variable)] = [character,character_duration,False, False]
if j == 0: dict_subtitle[str(running_variable)] = [character, character_duration, timestamb_start, False]
if j == len(text)-1 : dict_subtitle[str(running_variable)] = [character, character_duration, False, timestamb_end]
running_variable += 1
更多视频可供尝试
在这里您可以下载更多视频及其各自的字幕文件:https://filebin.net/kwygjffdlfi62pjs
编辑 3
4
00:00:18,856 --> 00:00:25,904
Je rappelle la définition de ce qu'est un produit scalaire, <i>dot product</i> dans <i>Ⅎ</i>.
5
00:00:24,855 --> 00:00:30,431
Donc je prends deux vecteurs dans <i>Ⅎ</i> et je définis cette opération-là , linéaire, <i>u
这可能不是你想要的,而不是计算时间,为什么不直接从字幕文件中取出呢。
我嘲笑这个作为一个例子。从长远来看,它并不完美,但可能会有所帮助。
import re
#Pre-process file to remove blank lines, line numbers and timestamp --> chars
with open('video.srt','r') as f:
lines = f.readlines()
with open('video.tmp','w') as f:
for line in lines:
line = line.strip()
if line.strip():
if line.strip().isnumeric():
continue
else:
line = line.replace(' --> ', ' ')
line = line+" "
f.write(line)
# Process pre-processed file
with open('video.tmp','r') as f:
lines = f.readlines()
outfile = open('new_video.srt','w')
idx = 0
# Define the regex options we will need
#regex to look for the time stamps in each sentence using the first and last only
timestamps = re.compile('\d{1,2}(?::\d{2}){1,2}(?:,)\d{3}')
#regex to remove html tags from length calculations
tags = re.compile(r'<.*?>')
#re.split('([^\s[=10=]-9]\.)',a)
# This is to cope with text that contains mathematical, chemical formulae, ip addresses etc
# where "." does not mean full-stop (end of sentence)
# This is used to split on a "." only if it is NOT preceded by space or a number
# this should catch most things but will fail to split the sentence if it genuinely
# ends with a number followed by a full-stop.
end_of_sentence = re.compile(r'([^\s[=10=]-9]\.)')
#sentences = str(lines).split('.')
sentences = re.split(end_of_sentence,str(lines))
# Because the sentences where split on "x." we now have to add that back
# so we concatenate every other list item with the previous one.
idx = 0
joined =[]
while idx < (len(sentences) -1) :
joined.append(sentences[idx]+sentences[idx+1])
idx += 2
sentences = joined
previous_timings =["00:00:00,000","00:00:00,000"]
previous_sentence = ""
#Dictionary of timestamps that will require post-processing
registry = {}
loop = 0
for sentence in sentences:
print(sentence)
timings = timestamps.findall(sentence)
idx+=1
outfile.write(str(idx)+"\n")
if timings:
#There are timestamps in the sentence
previous_timings = timings
loop = 0
start_time = timings[0]
end_time = timings[-1]
# Revert list item to a string
sentence = ''.join(sentence)
# Remove timestamps from the text
sentence = ''.join(re.sub(timestamps,' ', sentence))
# Get rid of multiple spaces and \ characters
sentence = ' '.join(sentence.split())
sentence = sentence.replace(' ', ' ')
sentence = sentence.replace("\'", "'")
previous_sentence = sentence
print("Starts at", start_time)
print(sentence)
print("Ends at", end_time,'\n')
outfile.write(start_time+" --> "+end_time+"\n")
outfile.write(sentence+"\n\n")
else:
# There are no timestamps in the sentence therefore this must
# be a separate sentence cut adrift from an existing timestamp
# We will have to estimate its start and end times using data
# from the last time stamp we know of
start_time = previous_timings[0]
reg_end_time = previous_timings[-1]
# Convert timestamp to seconds
h,m,s,milli = re.split(':|,',start_time)
s_time = (3600*int(h))+(60*int(m))+int(s)+(int(milli)/1000)
# Guess the timing for the previous sentence and add it
# but only for the first adrift sentence as the start time will be adjusted
# This number may well vary depending on the cadence of the speaker
if loop == 0:
registry[reg_end_time] = reg_end_time
#s_time += 0.06 * len(previous_sentence)
s_time += 0.06 * len(tags.sub('',previous_sentence))
# Guess the end time
e_time = s_time + (0.06 * len(tags.sub('',previous_sentence)))
# Convert start to a timestamp
s,milli = divmod(s_time,1)
m,s = divmod(int(s),60)
h,m = divmod(m,60)
start_time = "{:02d}:{:02d}:{:02d},{:03d}".format(h,m,s,round(milli*1000))
# Convert end to a timestamp
s,milli = divmod(e_time,1)
m,s = divmod(int(s),60)
h,m = divmod(m,60)
end_time = "{:02d}:{:02d}:{:02d},{:03d}".format(h,m,s,round(milli*1000))
#Register new end time for previous sentence
if loop == 0:
loop = 1
registry[reg_end_time] = start_time
print("Starts at", start_time)
print(sentence)
print("Ends at", end_time,'\n')
outfile.write(start_time+" --> "+end_time+"\n")
outfile.write(sentence+"\n\n")
try:
# re-set the previous start time in case the following sentence
# was cut adrift from its time stamp as well
previous_timings[0] = end_time
except:
pass
outfile.close()
#Post processing
if registry:
outfile = open('new_video.srt','r')
text = outfile.read()
new_text = text
# Run through registered end times and replace them
# if not the video player will not display the subtitles
# correctly because they overlap in time
for key, end in registry.items():
new_text = new_text.replace(key, end, 1)
print("replacing", key, "with", end)
outfile.close()
outfile = open('new_video.srt','w')
outfile.write(new_text)
outfile.close()
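Edit:
Happily, I stuck with this code because the problem interested me.
While I appreciate that it is hacky and does not use the pysrt subtitle module, just re, I believe that, in this case, it does a fair job.
I have commented the edited code, so hopefully it is clear what I am doing and why.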
The regex looks for timestamp patterns such as 00:00:00,000, 0:00:00,000 or 00:00,000, i.e.
\d{1,2}(?::\d{2}){1,2}(?:,)\d{3}
That is: 1 or 2 digits, followed by one or two groups of : plus 2 digits, followed by , plus 3 digits.
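If a concatenated sentence contains several start and end times, we only need two of them for the whole sentence: the first (the sentence start time) and the last (the sentence end time). I hope that is clear.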
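A minimal sketch of that first/last rule, using the timestamp pattern quoted above. The sample sentence is made up for illustration, and removing the `-->` arrows is an assumption about how the timing lines were concatenated into the text.

```python
import re

# Pattern from the text: H or HH, one or two ":MM" groups, then ",mmm"
timestamps = re.compile(r'\d{1,2}(?::\d{2}){1,2}(?:,)\d{3}')

# Hypothetical concatenated sentence that still carries its SRT timing lines
sentence = ("00:00:13,100 --> 00:00:16,500 In the end, we will remember "
            "00:00:16,500 --> 00:00:20,250 the silence of our friends.")

# Every timestamp found in the sentence, in order of appearance
timings = timestamps.findall(sentence)

# Only the first and the last one matter for the merged sentence
start_time, end_time = timings[0], timings[-1]

# Drop the timestamps and arrows, then collapse the leftover whitespace
text = ' '.join(timestamps.sub(' ', sentence).replace('-->', ' ').split())

print(start_time, "-->", end_time)
print(text)
```

The two timestamps in the middle are simply discarded, which is why the merged subtitle spans the whole sentence.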
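Edit 2: This version copes with full stops in mathematical and chemical formulas, IP numbers and so on; basically, places where a full stop does not mark the end of a sentence.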
As requested, I have re-coded this to rely on the pysrt package, plus a smidgeon of re.
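The idea is to use a dictionary keyed on start times.
If the start time already exists, the data is added to the entry for that time and the end time is updated at the same moment, so the end time advances with the text.
If the start time does not exist, it is simply a new dictionary entry.
The start time only advances once we know that a sentence has been completed.
So, in essence, we start building a sentence with a fixed start time. The sentence keeps growing as more text is added and the end time is updated, until the sentence has finished. At that point we advance the start time using the current record, which we know begins a new sentence.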
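The accumulation step can be sketched without pysrt as follows. The `cues` list is hypothetical sample data, and the end-of-sentence check is simplified to "the text ends in `.`, `?` or `!`", which is an assumption made for brevity rather than the exact test used in the full script.

```python
# Hypothetical, already-parsed cues: (start, end, text)
cues = [
    ("00:00:01,000", "00:00:03,000", "We work together"),
    ("00:00:03,000", "00:00:05,000", "to fill those spaces."),
    ("00:00:05,000", "00:00:07,000", "Silence is the residue of fear."),
]

subs_dict = {}          # start_time -> [end_time, sentence so far]
start_sentence = True   # True when the next cue begins a new sentence
start_time = None

for cue_start, cue_end, text in cues:
    if start_sentence:
        # A new sentence: its start time is fixed from this cue
        start_time = cue_start
    if start_time in subs_dict:
        # Sentence already under construction: append the new text
        text = subs_dict[start_time][1] + " " + text
    # Create or update the entry; the end time advances with the text
    subs_dict[start_time] = [cue_end, text]
    # Simplified check: the sentence is complete if it ends in . ? or !
    start_sentence = text.rstrip().endswith((".", "?", "!"))

for start, (end, sentence) in subs_dict.items():
    print(start, "-->", end, sentence)
```

The first two cues end up merged under the start time of the first one, while the third cue becomes its own entry.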
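Sub-title entries that contain several sentences are broken up, and the start and end time of each part are calculated using the pysrt characters_per_second value for the entire sub-title entry, taken before it is split up.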
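As a rough illustration of that calculation, the sketch below shares the duration of one entry between its sentences in proportion to their length. The helper names `to_seconds` and `to_stamp` and the sample entry are made up for this example, and the characters-per-second value is computed by hand here instead of being read from pysrt.

```python
def to_seconds(stamp):
    """Convert an SRT timestamp 'HH:MM:SS,mmm' to seconds."""
    h, m, rest = stamp.split(":")
    s, ms = rest.split(",")
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def to_stamp(seconds):
    """Convert seconds back to an SRT timestamp, rounded to the millisecond."""
    total_ms = round(seconds * 1000)
    s, ms = divmod(total_ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return "{:02d}:{:02d}:{:02d},{:03d}".format(h, m, s, ms)

# Hypothetical entry holding two sentences (29 characters over 2.9 seconds)
start, end = "00:00:10,000", "00:00:12,900"
parts = ["It is charring.", " It is chains."]

# Characters per second for the whole entry, before it is split up
cps = len("".join(parts)) / (to_seconds(end) - to_seconds(start))

cursor = to_seconds(start)
timed = []
for part in parts:
    # Each part lasts as long as its share of the characters
    part_end = cursor + len(part) / cps
    timed.append((to_stamp(cursor), to_stamp(part_end), part.strip()))
    cursor = part_end   # the next sentence starts where this one ended

for item in timed:
    print(item)
```

Because the rate is taken from the whole entry, the parts always add up to the original duration, so the last part ends where the original entry ended.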
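Finally, a new sub-title file is written to disk from the entries in the dictionary.
Clearly, with only a single file to play with, I may well have missed some sub-title layout hurdles, but at the least it gives you a working starting point.
The code is commented throughout, so most things should be clear as to how and why.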
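Edit:
I have improved the check for an existing dictionary start time and changed the method used to decide whether a sentence has ended, i.e. putting the full stop back in the text after the split.
The second video that you mentioned does have sub-titles that are slightly off; to start with, notice that there are no milli-second values at all.
The following code does a fair job on the second video and a good job on the first.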
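Edit 2: Added handling of consecutive full stops and removal of html <> tags.
Edit 3: It turns out that pysrt removes html tags from the calculation of characters per second. I now do so as well, which means that the <html> formatting can be retained within the sub-titles.
Edit 4: This version copes with full stops in mathematical and chemical formulas, IP numbers and so on; basically, places where a full stop does not mark the end of a sentence. It also allows for sentences ending in ? and !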
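A small sketch of the splitting rule from Edit 4, assuming the intent is "split on `.`, `?` or `!` only when the preceding character is neither whitespace nor a digit". The function name `split_sentences` and the sample text are made up for this example.

```python
import re

# Capture the terminator together with the character before it, so that
# re.split() keeps both and they can be glued back onto the sentence.
end_of_sentence = re.compile(r'([^\s0-9][.?!])')

def split_sentences(text):
    pieces = end_of_sentence.split(text.strip())
    # With one capturing group, pieces alternate:
    # body, captured pair, body, captured pair, ..., trailing remainder
    sentences = [pieces[i] + pieces[i + 1]
                 for i in range(0, len(pieces) - 1, 2)]
    if pieces[-1].strip():
        # Trailing text that has no terminator yet (sentence continues)
        sentences.append(pieces[-1])
    return [s.strip() for s in sentences]

sample = "The pH is 7.4 today. Ping 10.0.0.1 now! Is H2O wet?"
for sentence in split_sentences(sample):
    print(sentence)
```

The price of protecting numbers is that a sentence which really does end in a digit, such as one ending in a year, is not split by this rule.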
import pysrt
import re
abbreviations = ['Dr.','Mr.','Mrs.','Ms.','etc.','Jr.','e.g.'] # You get the idea!
abbrev_replace = ['Dr','Mr','Mrs','Ms','etc','Jr','eg']
subs = pysrt.open('new.srt')
subs_dict = {} # Dictionary to accumulate new sub-titles (start_time:[end_time,sentence])
start_sentence = True # Toggle this at the start and end of sentences
# regex to remove html tags from the character count
tags = re.compile(r'<.*?>')
# regex to split on ".", "?" or "!" ONLY if it is preceded by something else
# which is not a digit and is not a space. (Not perfect but close enough)
# Note: ? and ! can be an issue in some languages (e.g. french) where both ? and !
# are traditionally preceded by a space ! rather than!
end_of_sentence = re.compile(r'([^\s0-9][\.\?\!])')
# End of sentence characters
eos_chars = set([".","?","!"])
for sub in subs:
    if start_sentence:
        start_time = sub.start
        start_sentence = False
    text = sub.text
    # Remove multiple full-stops e.g. "and ....."
    text = re.sub(r'\.+', '.', text)
    # Optional
    for idx, abr in enumerate(abbreviations):
        if abr in text:
            text = text.replace(abr, abbrev_replace[idx])
    # A test could also be made for initials in names i.e. John E. Rotten - showing my age there ;)
    multi = re.split(end_of_sentence, text.strip())
    cps = sub.characters_per_second
    # Test for a sub-title with multiple sentences
    if len(multi) > 1:
        # regex end_of_sentence breaks sentence start and sentence end into 2 parts
        # we need to put them back together again.
        # hence the odd range because the joined end part is then deleted
        for cnt in range(divmod(len(multi), 2)[0]):  # e.g. len=3 give 0 | 5 gives 0,1 | 7 gives 0,1,2
            multi[cnt] = multi[cnt] + multi[cnt+1]
            del multi[cnt+1]
        for part in multi:
            if not len(part):  # Avoid blank parts
                continue
            # Convert start time to seconds
            h, m, s, milli = re.split(':|,', str(start_time))
            s_time = (3600*int(h)) + (60*int(m)) + int(s) + (int(milli)/1000)
            # test for existing data
            try:
                existing_data = subs_dict[str(start_time)]
                end_time = str(existing_data[0])
                h, m, s, milli = re.split(':|,', str(existing_data[0]))
                e_time = (3600*int(h)) + (60*int(m)) + int(s) + (int(milli)/1000)
            except KeyError:
                existing_data = []
                e_time = s_time
            # End time is the start time or existing end time + the time taken to say the current words
            # based on the calculated number of characters per second
            # use regex "tags" to remove any html tags from the character count.
            e_time = e_time + len(tags.sub('', part)) / cps
            # Convert start to a timestamp
            s, milli = divmod(s_time, 1)
            m, s = divmod(int(s), 60)
            h, m = divmod(m, 60)
            start_time = "{:02d}:{:02d}:{:02d},{:03d}".format(h, m, s, round(milli*1000))
            # Convert end to a timestamp
            s, milli = divmod(e_time, 1)
            m, s = divmod(int(s), 60)
            h, m = divmod(m, 60)
            end_time = "{:02d}:{:02d}:{:02d},{:03d}".format(h, m, s, round(milli*1000))
            # if text already exists add the current text to the existing text
            # if not use the current text to write/rewrite the dictionary entry
            if existing_data:
                new_text = existing_data[1] + " " + part
            else:
                new_text = part
            subs_dict[str(start_time)] = [end_time, new_text]
            # if sentence ends re-set the current start time to the end time just calculated
            if any(x in eos_chars for x in part):
                start_sentence = True
                start_time = end_time
                print("Split", start_time, "-->", end_time)
                print(new_text)
                print('\n')
            else:
                start_sentence = False
    else:  # This is Not a multi-part sub-title
        end_time = str(sub.end)
        # Check for an existing dictionary entry for this start time
        try:
            existing_data = subs_dict[str(start_time)]
        except KeyError:
            existing_data = []
        # if it already exists add the current text to the existing text
        # if not use the current text
        if existing_data:
            new_text = existing_data[1] + " " + text
        else:
            new_text = text
        # Create or Update the dictionary entry for this start time
        # with the updated text and the current end time
        subs_dict[str(start_time)] = [end_time, new_text]
        if any(x in eos_chars for x in text):
            start_sentence = True
            print("Single", start_time, "-->", end_time)
            print(new_text)
            print('\n')
        else:
            start_sentence = False
# Generate the new sub-title file from the dictionary
idx = 0
with open('video_new.srt', 'w') as outfile:
    for key, text in subs_dict.items():
        idx += 1
        outfile.write(str(idx)+"\n")
        outfile.write(key+" --> "+text[0]+"\n")
        outfile.write(text[1]+"\n\n")
The output after running the above code on your video.srt file is as follows:
1
00:00:13,100 --> 00:00:27,280
Dr Martin Luther King, Jr, in a 1968 speech where he reflects
upon the Civil Rights Movement, states, "In the end, we will remember not the words of our enemies but the silence of our friends."
2
00:00:27,280 --> 00:00:29,800
As a teacher, I've internalized this message.
3
00:00:29,800 --> 00:00:39,701
Every day, all around us, we see the consequences of silence manifest themselves in the form of discrimination, violence, genocide and war.
4
00:00:39,701 --> 00:00:46,178
In the classroom, I challenge my students to explore the silences in their own lives through poetry.
5
00:00:46,178 --> 00:00:54,740
We work together to fill those spaces, to recognize them, to name them, to understand that they don't
have to be sources of shame.
6
00:00:54,740 --> 00:01:14,408
In an effort to create a culture within my classroom where students feel safe sharing the intimacies of their own silences, I have four core principles posted on the board that sits in the front of my class, which every student signs
at the beginning of the year: read critically, write consciously, speak clearly, tell your truth.
7
00:01:14,408 --> 00:01:18,871
And I find myself thinking a lot about that last point, tell your truth.
8
00:01:18,871 --> 00:01:28,848
And I realized that if I was going to ask my students to speak up, I was going to have to tell my truth and be honest with them about the times where I failed to do so.
9
00:01:28,848 --> 00:01:44,479
So I tell them that growing up, as a kid in a Catholic family in New Orleans, during Lent I was always taught that the most meaningful thing one could do was to give something up, sacrifice something you typically indulge in to prove to God you understand his sanctity.
10
00:01:44,479 --> 00:01:50,183
I've given up soda, McDonald's, French fries, French kisses, and everything in between.
11
00:01:50,183 --> 00:01:54,071
But one year, I gave up speaking.
12
00:01:54,071 --> 00:02:03,286
I figured the most valuable thing I could sacrifice was my own voice, but it was like I hadn't realized that I had given that up a long time ago.
13
00:02:03,286 --> 00:02:23,167
I spent so much of my life telling people the things they wanted to hear instead of the things they needed to, told myself I wasn't meant to be anyone's conscience because I still had to figure out being my own, so sometimes I just wouldn't say anything, appeasing ignorance with my silence, unaware that validation doesn't need words to endorse its existence.
14
00:02:23,167 --> 00:02:29,000
When Christian was beat up for being gay, I put my hands in my pocket and walked with my head
down as if I didn't even notice.
15
00:02:29,000 --> 00:02:39,502
I couldn't use my locker for weeks
because the bolt on the lock reminded me of the one I had put on my lips when the homeless man on the corner looked at me with eyes up merely searching for an affirmation that he was worth seeing.
16
00:02:39,502 --> 00:02:43,170
I was more concerned with
touching the screen on my Apple than actually feeding him one.
17
00:02:43,170 --> 00:02:46,049
When the woman at the fundraising gala said "I'm so proud of you.
18
00:02:46,049 --> 00:02:53,699
It must be so hard teaching
those poor, unintelligent kids," I bit my lip, because apparently
we needed her money more than my students needed their dignity.
19
00:02:53,699 --> 00:03:02,878
We spend so much time listening to the things people are saying that we rarely pay attention to the things they don't.
20
00:03:02,878 --> 00:03:06,139
Silence is the residue of fear.
21
00:03:06,139 --> 00:03:09,615
It is feeling your flaws gut-wrench guillotine your tongue.
22
00:03:09,615 --> 00:03:13,429
It is the air retreating from your chest because it doesn't feel safe in your lungs.
23
00:03:13,429 --> 00:03:15,186
Silence is Rwandan genocide.
24
00:03:15,186 --> 00:03:16,423
Silence is Katrina.
25
00:03:16,553 --> 00:03:19,661
It is what you hear when there
aren't enough body bags left.
26
00:03:19,661 --> 00:03:22,062
It is the sound after the noose is already tied.
27
00:03:22,062 --> 00:03:22,870
It is charring.
28
00:03:22,870 --> 00:03:23,620
It is chains.
29
00:03:23,620 --> 00:03:24,543
It is privilege.
30
00:03:24,543 --> 00:03:25,178
It is pain.
31
00:03:25,409 --> 00:03:28,897
There is no time to pick your battles when your battles have already picked you.
32
00:03:28,897 --> 00:03:31,960
I will not let silence wrap itself around my indecision.
33
00:03:31,960 --> 00:03:36,287
I will tell Christian that he is a lion, a sanctuary of bravery and brilliance.
34
00:03:36,287 --> 00:03:42,340
I will ask that homeless man what his name is and how his day was, because sometimes all people want to be is human.
35
00:03:42,340 --> 00:03:51,665
I will tell that woman that my students can talk about transcendentalism like their last name was Thoreau, and just because you watched
one episode of "The Wire" doesn't mean you know anything about my kids.
36
00:03:51,665 --> 00:04:03,825
So this year, instead of giving something up, I will live every day as if there were a microphone tucked under my tongue, a stage on the underside of my inhibition.
37
00:04:03,825 --> 00:04:10,207
Because who has to have a soapbox when all you've ever needed is your voice?
38
00:04:10,207 --> 00:04:12,712
Thank you.
39
00:04:12,712 --> 00:00:00,000
(Applause)