R: Extracting time from srt (subtitles) file
I need to calculate the speaking rate for each subtitle line. The contents of the srt (subtitle) file look like this:
1
00:00:19,000 --> 00:00:21,989
I'm Annita McVeigh and welcome to Election Today where we'll bring you

2
00:00:22,000 --> 00:00:23,989
the latest from the campaign trail, plus debate and analysis.

3
00:00:24,000 --> 00:00:28,989
The Liberal Democrats promise to protect the pay of millions
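For example, it takes 4 seconds 989 milliseconds to say the 10 words "The Liberal Democrats promise to protect the pay of millions". The average speaking rate for these 10 words is 498.9 milliseconds per word.
How can I read the srt file so that I get a data frame with startTime, endTime, textString and wordCount as columns and the subtitle lines as rows, like the following?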
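The arithmetic behind that figure can be sketched in R as follows. The variable names are illustrative only and do not come from the original post; the numbers are those of the third subtitle above.

```r
# Illustrative names (assumptions, not from the original post)
duration_ms <- 4989   # 00:00:24,000 --> 00:00:28,989 is 4 s 989 ms
word_count  <- 10     # words in the third subtitle

# Average time spent per word, in milliseconds
ms_per_word <- duration_ms / word_count
print(ms_per_word)    # 498.9
```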
startTime<-c("00:00:19,000", "00:00:22,000", "00:00:24,000")
endTime<-c("00:00:21,989", "00:00:23,989", "00:00:28,989")
textString<-c("I'm Annita McVeigh and welcome to Election Today where we'll bring you", "the latest from the campaign trail, plus debate and analysis.", "The Liberal Democrats promise to protect the pay of millions")
wordCount<-c(12,10,10)
rate.df<-data.frame(startTime, endTime, textString, wordCount)
How do I subtract startTime from endTime in R when the time is given in the form hour:minute:second,millisecond?
Here is a possible solution (the code is fairly self-explanatory):
text="

1
00:00:19,000 --> 00:00:21,989
I'm Annita McVeigh and welcome to Election Today where we'll bring you

2
00:00:22,000 --> 00:00:23,989
the latest from the campaign trail, 
plus debate 
and analysis.



3
00:00:24,000 --> 00:00:28,989
The Liberal Democrats promise to protect 
the pay of millions"
con <- textConnection(text)
lines <- readLines(con)
close(con)
# the previous lines of code are just to replicate your case, and
# they should be replaced by the following single line in the real case
# lines <- readLines(srtFileName)

# blocks are separated by one or more blank lines: start a new group at each blank line
listOfEntries <-
  lapply(split(seq_along(lines), cumsum(grepl("^\\s*$", lines))), function(blockIdx){
    block <- lines[blockIdx]
    # drop the blank separator lines from the block
    block <- block[!grepl("^\\s*$", block)]
    if(length(block) == 0){
      return(NULL)
    }
    if(length(block) < 3){
      warning("a block not respecting srt standards has been found")
    }
    # line 1 is the id, line 2 the time range, the remaining lines are the text
    return(data.frame(id=block[1],
                      times=block[2],
                      textString=paste0(block[3:length(block)], collapse="\n"),
                      stringsAsFactors = FALSE))
  })
m <- do.call(rbind, listOfEntries)

# split start and end times
tmp <- do.call(rbind, strsplit(m[,'times'], ' --> '))
m$startTime <- tmp[,1]
m$endTime <- tmp[,2]

# parse times: split into hours, minutes, seconds, milliseconds and convert to seconds
tmp <- do.call(rbind, lapply(strsplit(m$startTime, ':|,'), as.numeric))
m$fromSeconds <- tmp %*% c(60*60, 60, 1, 1/1000)
tmp <- do.call(rbind, lapply(strsplit(m$endTime, ':|,'), as.numeric))
m$toSeconds <- tmp %*% c(60*60, 60, 1, 1/1000)

# compute time difference in seconds
m$timeDiffInSecs <- m$toSeconds - m$fromSeconds

# word count: number of runs of non-word characters, plus one
m$wordCount <- vapply(gregexpr("\\W+", m$textString), length, 0) + 1
# or if you consider "I'm" a single word you can remove the occurrences of ', e.g. :
#m$wordCount <- vapply(gregexpr("\\W+", gsub("'", "", m$textString)), length, 0) + 1

m$millisecsPerWord <- m$timeDiffInSecs * 1000 / m$wordCount
Result:
> m
  id                         times                                                             textString
2  1 00:00:19,000 --> 00:00:21,989 I'm Annita McVeigh and welcome to Election Today where we'll bring you
3  2 00:00:22,000 --> 00:00:23,989      the latest from the campaign trail, \nplus debate \nand analysis.
6  3 00:00:24,000 --> 00:00:28,989         The Liberal Democrats promise to protect \nthe pay of millions
     startTime      endTime fromSeconds toSeconds timeDiffInSecs wordCount millisecsPerWord
2 00:00:19,000 00:00:21,989          19    21.989          2.989        14         213.5000
3 00:00:22,000 00:00:23,989          22    23.989          1.989        11         180.8182
6 00:00:24,000 00:00:28,989          24    28.989          4.989        10         498.9000
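Note that the word counts above (14, 11, 10) differ from the 12, 10, 10 expected in the question, because every run of non-word characters is counted, including apostrophes and the trailing full stop. If whitespace-delimited words are what is wanted, one alternative is to split on whitespace instead. The following is only a sketch: `count_words_ws` is a hypothetical helper name, not part of the original answer, and the durations are copied from the result shown above.

```r
# Hypothetical helper (an assumption, not from the original answer):
# count whitespace-delimited tokens in each string
count_words_ws <- function(x) {
  lengths(strsplit(trimws(x), "\\s+"))
}

# Subtitle texts as produced by the parsing step (lines joined with "\n")
subtitles <- c(
  "I'm Annita McVeigh and welcome to Election Today where we'll bring you",
  "the latest from the campaign trail, \nplus debate \nand analysis.",
  "The Liberal Democrats promise to protect \nthe pay of millions"
)

word_count <- count_words_ws(subtitles)
print(word_count)  # 12 10 10

# Durations in seconds, copied from the timeDiffInSecs column above
time_diff_secs <- c(2.989, 1.989, 4.989)
ms_per_word <- time_diff_secs * 1000 / word_count
print(ms_per_word)
```

With this definition "I'm" and "we'll" each count as one word, and punctuation attached to a word does not add to the count.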