Matching .srt file subtitle lines and timestamps with regex

Question

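As the title says, I want to match the timestamps and text lines of the subtitles in .srt files.

Some of these files are not well-formed, so I need something that handles almost all of them.

A well-formed file looks like this: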

1
00:00:02,160 --> 00:00:04,994
You really don't remember
what happened last year?

2
00:00:06,440 --> 00:00:07,920
- School. Now.
- I dropped out.

3
00:00:08,120 --> 00:00:10,510
- Get your diploma, I'll get mine.
- What you doing?

4
00:00:10,680 --> 00:00:13,514
- Studying.
- You taking your GED? All right, Fi.

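The regex pattern I came up with works well for files like this.

As I said, though, some files are not well-formed: some have no line numbers, and some have no blank line after each subtitle entry. My regex does not work properly on those.

Similar questions have already been answered, but I want to match each timestamp and the text lines in separate capture groups. For the first entry of the example above, the groups would be:

Group 1: 00:00:02,160

Group 2: 00:00:04,994

Group 3: You really don't remember\nwhat happened last year?

This is what I have so far: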

LINE_RE = (
    # group 1: the line starts with optional whitespace,
    # followed by a timestamp like 00:00:00,000
    r"^\s*(\d+:\d+:\d+,\d+)"
    # non-capturing group for ' --> ':
    # two or three '-' followed by a '>'
    r"(?:\s*-{2,3}>\s*)"
    # group 2: a timestamp again,
    # followed by optional whitespace and a \n
    r"(\d+:\d+:\d+,\d+)\s*\n"
    # group 3: any characters, until an empty line,
    # a timestamp, or a line with only a number in it
    r"([\s\S]*?(?:^\s*$|\d+:\d+:\d+,\d+|^\s*\d+\s*\n))"
)

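I think my exact problem is in the last non-capturing group: it does not work properly when the next line is not an empty line.

This is a sample file; I made some changes to it to show the problem more clearly.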

Answer 1

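In that case, you can match the lines that start with a timestamp-like pattern, and then capture all following lines that neither start with another timestamp-like pattern nor are a newline followed by a line containing only digits.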

^\s*(\d+:\d+:\d+,\d+)[^\S\n]+-->[^\S\n]+(\d+:\d+:\d+,\d+)((?:\n(?!\d+:\d+:\d+,\d+\b|\n+\d+$).*)*)

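The pattern matches, in parts:

^ Start of string (or start of a line when the multiline flag is on, which is needed here to match every entry in a file)
\s* Match optional whitespace chars
(\d+:\d+:\d+,\d+) Capture group 1, match a timestamp-like pattern
[^\S\n]+-->[^\S\n]+ Match --> between 1 or more whitespace chars other than a newline
(\d+:\d+:\d+,\d+) Capture group 2, same pattern as group 1
( Capture group 3
- (?: Non-capturing group
  - \n Match a newline
  - (?! Negative lookahead, assert that what is directly to the right is not
    - \d+:\d+:\d+,\d+\b|\n+\d+$ Match a timestamp, or 1+ newlines followed by digits at the end of a line
  - ) Close the lookahead
  - .* Match the whole line
- )* Close the non-capturing group and optionally repeat it
) Close group 3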

See a regex demo.
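For illustration, here is a minimal Python sketch of how the pattern above could be applied. The names `PATTERN`, `SAMPLE` and `parse_srt` are made up for this example, and the sample text is a shortened variant of the question's file with the third entry deliberately malformed (no blank line and no index before it). It assumes `re.MULTILINE`, so that `^` and `$` work per line.

```python
import re

# The answer's pattern, split over two string literals for readability.
# re.MULTILINE lets ^ and $ match at the start and end of every line.
PATTERN = re.compile(
    r"^\s*(\d+:\d+:\d+,\d+)[^\S\n]+-->[^\S\n]+(\d+:\d+:\d+,\d+)"
    r"((?:\n(?!\d+:\d+:\d+,\d+\b|\n+\d+$).*)*)",
    re.MULTILINE,
)

# Hypothetical test input: the third entry has no blank line
# and no index line in front of it.
SAMPLE = (
    "1\n"
    "00:00:02,160 --> 00:00:04,994\n"
    "You really don't remember\n"
    "what happened last year?\n"
    "\n"
    "2\n"
    "00:00:06,440 --> 00:00:07,920\n"
    "- School. Now.\n"
    "- I dropped out.\n"
    "00:00:08,120 --> 00:00:10,510\n"
    "- Get your diploma, I'll get mine.\n"
    "- What you doing?\n"
)


def parse_srt(text):
    """Return a list of (start, end, text) tuples, one per subtitle entry."""
    return [
        # Group 3 starts with a newline (and may end with one), so strip it.
        (m.group(1), m.group(2), m.group(3).strip())
        for m in PATTERN.finditer(text)
    ]


entries = parse_srt(SAMPLE)
for start, end, text in entries:
    print(start, end, repr(text))
```

Note that group 3 always begins with the newline that follows the second timestamp, hence the `strip()`. One limitation of this sketch: an index line that follows the text directly, with no blank line in between, is not excluded by the lookahead and ends up in group 3.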

Tags: python, regex, subtitle, srt