Regular expression to match a line starting with a digit only if the same line contains letters after

Question

I have a text/subtitle file that looks like this:

1
00:00:58,178 --> 00:00:59,327
Some text!

2
00:00:59,329 --> 00:01:01,819
<i>Some text</i>

3
00:01:40,512 --> 00:01:41,629
2350 some text.

4
00:01:41,631 --> 00:01:43,771
Some text.

I have more or less figured out how to match the actual subtitle lines with the following regular expression:

^([^\d^\n].*)

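But what if an actual subtitle line starts with a digit, as in the third subtitle? I now also need to match lines that start with a digit, provided that letters appear later on the same line, before the line ends.

How can I do this in combination with the regular expression I used above?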

Answer 1

I suggest an approach that skips every line that consists only of digits or that is an SRT timestamp range:

^(?!\d+$|\d{2}:\d{2}:\d{2},\d+ --> \d{2}:\d{2}:\d{2},\d+$).+

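See this regex demo.

Details:

^ - start of a line
(?! - start of a negative lookahead that fails the match if its pattern is found immediately to the right of the current position:
- \d+$ - one or more digits up to the end of the line
- | - or
- \d{2}:\d{2}:\d{2},\d+ --> \d{2}:\d{2}:\d{2},\d+$ - two timestamps separated by -->
) - end of the lookahead
.+ - matches the whole non-empty line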
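
As a rough illustration of the pattern above, here is a minimal Python sketch that applies it to the sample from the question. The names SUBTITLE_LINE, SAMPLE, and subtitle_lines are made up for this example, and it assumes the regex engine runs in multiline mode so that ^ and $ match at every line boundary (re.MULTILINE in Python).

```python
import re

# Pattern from the answer. re.MULTILINE lets ^ and $ match at each line
# boundary instead of only at the start and end of the whole string.
SUBTITLE_LINE = re.compile(
    r"^(?!\d+$|\d{2}:\d{2}:\d{2},\d+ --> \d{2}:\d{2}:\d{2},\d+$).+",
    re.MULTILINE,
)

# Sample taken from the question.
SAMPLE = """\
1
00:00:58,178 --> 00:00:59,327
Some text!

2
00:00:59,329 --> 00:01:01,819
<i>Some text</i>

3
00:01:40,512 --> 00:01:41,629
2350 some text.

4
00:01:41,631 --> 00:01:43,771
Some text.
"""


def subtitle_lines(text):
    """Return every line that is neither a counter nor a timestamp range."""
    return SUBTITLE_LINE.findall(text)


for line in subtitle_lines(SAMPLE):
    print(line)
```

Note that a subtitle line made up of digits only (for example a bare "2350") would also be skipped, because the lookahead cannot tell it apart from a subtitle counter.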

Answer 2

Update #1

This update brings a large performance improvement.

I assume subtitles can span multiple lines:

^\d+:\d+:[^-]+-->.*\R+\K.+(?:\R.+)*(?=\s*(?:^\d+$|\z))

Explanation:

^\d+:\d+:[^-]+-->.*     # Match the timestamp line
\R+\K                   # One or more newlines (and discard everything matched so far)
.+                      # Match the line that immediately follows
(?:\R.+)*               # And any continuation lines of the subtitle
(?=\s*(?:^\d+$|\z))     # Up to a digit-only line or the end of the input string

Live demo
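
The pattern above relies on PCRE features. Python's built-in re module does not support \K or \R, so the following is only a hedged adaptation, not the answer's exact regex: a capture group stands in for \K, \n stands in for \R (assuming the file uses LF line endings), and \Z is Python's end-of-string anchor. The names BLOCK, MULTILINE_SAMPLE, and subtitle_blocks are invented for this sketch.

```python
import re

# Adaptation of the PCRE pattern for Python's re module (assumptions:
# LF line endings, capture group 1 instead of \K).
BLOCK = re.compile(
    r"^\d+:\d+:[^-]+-->.*\n+"   # timestamp line and the newline(s) after it
    r"(.+(?:\n.+)*)"            # group 1: subtitle text, possibly multi-line
    r"(?=\s*(?:^\d+$|\Z))",     # up to a digit-only line or end of input
    re.MULTILINE,
)

# Hypothetical sample with one two-line subtitle.
MULTILINE_SAMPLE = """\
1
00:00:58,178 --> 00:00:59,327
Some text!
second line

2
00:01:40,512 --> 00:01:41,629
2350 some text.
"""


def subtitle_blocks(text):
    """Return the text of each subtitle, keeping multi-line subtitles together."""
    return BLOCK.findall(text)


for block in subtitle_blocks(MULTILINE_SAMPLE):
    print(repr(block))
```

Unlike the line-by-line pattern in Answer 1, this variant returns each subtitle as a single string with its internal line breaks preserved.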

Tags: regex, pcre, subtitle, geany, regex-lookarounds