Parsing transcript .srt files into readable text

I have a video transcript SRT file containing lines in the conventional SRT format. Here is an example:

1
00:00:00,710 --> 00:00:03,220
Lorem ipsum dolor sit amet
consectetur, adipisicing elit.

2
00:00:03,220 --> 00:00:05,970
Dignissimos et quod laboriosam
iure magni expedita

3
00:00:05,970 --> 00:00:09,130
nisi, quis quaerat. Rem, facere!

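I am trying to use Python to read and then parse this file, removing (or skipping) the lines that consist of numeric strings (e.g., skip '1' and '00:00:00,710 --> 00:00:03,220'), and then format the remaining lines of text so that they are joined and presented in a readable form. Here is an example of the output I am trying to produce: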

Lorem ipsum dolor sit amet consectetur, adipisicing elit. Dignissimos et quod laboriosam iure magni expedita nisi, quis quaerat. Rem, facere!

Here is the code I have come up with so far:

def main():
    # Access folder in filesystem

    # After parsing content of file, move to next file

    # Declare variable empty list
    lineList = []

    # read file line by line
    file = open( "/Sample-SRT-File.srt", "r")
    lines = file.readlines()
    file.close()

    # look for patterns and parse

    # Remove blank lines from file
    lines = [i for i in lines if i[:-1]]

    # Discount first and second line of each segment using a match pattern
    for line in lines:
        line = line.strip()
        if isinstance(line[0], int) != False:

            # store all text into a list
            lineList.append(line)

    # for every item in the list that ends with '', '.', '?', or '!', append a space at end
    for line in lineList:
        line = line + ' '

    # Finish with list.join() to bring everything together
    text = ''.join(lineList)
    print(text)

main()

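I'm pretty new to Python, but I'm now wondering whether the only efficient and reliable way to match the first and second lines of each segment, so they can be removed or skipped, is to use a regular expression. Otherwise, this could perhaps be done with the itertools library, or with some function that skips lines 1 and 2 of each segment as well as any blank lines.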

Can anyone who knows Python help me get past this?

I would just use a library like pysrt to parse the srt file. That should prove to be the most robust approach.

import pysrt
subs = pysrt.open("foo.srt")

for sub in subs:
    print(sub.text)
    print()

Output:

Lorem ipsum dolor sit amet
consectetur, adipisicing elit.

Dignissimos et quod laboriosam
iure magni expedita

nisi, quis quaerat. Rem, facere!
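Note that this output still contains the line breaks inside each subtitle. To get the single paragraph asked for in the question, the pieces can be joined on whitespace. The following is only a sketch: `join_subtitles` is a hypothetical helper name, and it assumes nothing more than that each item has a `.text` attribute, as the items returned by `pysrt.open` do.

```python
def join_subtitles(subs):
    """Join the .text of each subtitle into one space-separated paragraph."""
    # str.split() with no arguments splits on any whitespace, including the
    # newlines inside a multi-line subtitle, so each subtitle is flattened first.
    return " ".join(" ".join(sub.text.split()) for sub in subs)

# Usage with pysrt (assumes pysrt is installed and "foo.srt" exists):
# import pysrt
# print(join_subtitles(pysrt.open("foo.srt")))
```

Because the helper only touches `.text`, it can be tried out on any objects that have that attribute, without needing a real file.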

If you want to use a regex to filter out the numeric lines and the blank lines, you can do it like this:

import re

def main():
    # read the file line by line; the with block closes it automatically
    with open("sample.srt", "r") as file:
        lines = file.readlines()

    # matches index lines, timestamp lines, and blank lines
    skip = re.compile(r'^[0-9]+$|^[0-9]{2}:[0-9]{2}:[0-9]{2}|^$')

    # keep everything else, stripped of surrounding whitespace
    kept = [line.strip() for line in lines if skip.search(line) is None]
    print(' '.join(kept))

main()

This will output:

Lorem ipsum dolor sit amet consectetur, adipisicing elit. Dignissimos et quod laboriosam iure magni expedita nisi, quis quaerat. Rem, facere!
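One limitation of filtering line by line is that a subtitle whose text is just a number (say, "42") would be dropped along with the index lines. A possible alternative is to treat the file as blocks separated by blank lines and discard the first two lines of each block. This is a minimal sketch, assuming a well-formed file; `srt_to_text` is a hypothetical helper name, not part of any library.

```python
import re

def srt_to_text(srt: str) -> str:
    """Return the subtitle text of an SRT string as one paragraph."""
    # Normalize Windows line endings, then split into blocks on blank lines.
    blocks = re.split(r"\n\s*\n", srt.replace("\r\n", "\n").strip())
    parts = []
    for block in blocks:
        lines = block.split("\n")
        # lines[0] is the index and lines[1] the timestamp; the rest is text.
        parts.extend(line.strip() for line in lines[2:] if line.strip())
    return " ".join(parts)

sample = """1
00:00:00,710 --> 00:00:03,220
Lorem ipsum dolor sit amet
consectetur, adipisicing elit.

2
00:00:03,220 --> 00:00:05,970
42
"""
print(srt_to_text(sample))
```

Here the second subtitle's text, "42", is kept because lines are skipped by position within the block and not by what they look like.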

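If you want to skip lines based on a specific list of prefixes, the following code will solve your problem, and it lets you specify the items to look for as a tuple of strings.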

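with open('foo.srt', 'r') as f:
    for line in f:
        # skip blank lines and lines starting with any digit (index and timestamp lines)
        if line.strip() and not line.startswith(tuple('0123456789')):
            # the line already ends with a newline, so don't add another one
            print(line, end='')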

This is a loop, though, so if you are worried about the speed of your program, I would suggest going with the pysrt answer above.

Thanks to Python 3, no extra imports are needed:

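filename = "foo.srt"  # placeholder: path to your .srt file
text = ""
with open(filename, 'r') as f:
    for line in f:
        line = line.strip()
        # keep only non-blank lines that do not start with a digit
        if line and not line[0].isdigit():
            text += " " + line
text = text.lstrip()
print(text)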
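
The question also mentions itertools. As a rough sketch of that idea, `itertools.groupby` can split the lines into blocks at the blank lines, and `itertools.islice` can skip the first two lines of each block. `srt_lines_to_text` is a hypothetical name, and the sketch assumes every block starts with an index line followed by a timestamp line.

```python
from itertools import groupby, islice

def srt_lines_to_text(lines):
    """Join subtitle text from an iterable of SRT lines into one paragraph."""
    stripped = (line.strip() for line in lines)
    parts = []
    # With key=bool, groupby yields alternating runs of non-blank and blank lines.
    for has_content, block in groupby(stripped, key=bool):
        if has_content:
            # Skip the index and timestamp lines that open each block.
            parts.extend(islice(block, 2, None))
    return " ".join(parts)

sample = [
    "1\n",
    "00:00:00,710 --> 00:00:03,220\n",
    "Lorem ipsum dolor sit amet\n",
    "consectetur, adipisicing elit.\n",
    "\n",
    "2\n",
    "00:00:03,220 --> 00:00:05,970\n",
    "Dignissimos et quod laboriosam\n",
    "iure magni expedita\n",
]
print(srt_lines_to_text(sample))
```

Since an open file is itself an iterable of lines, the same function could be called as `srt_lines_to_text(f)` inside a `with open(...) as f:` block.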