RegEx matching for SRT and VTT syntax from subtitles
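I have subtitles in srt and vtt format, and I need to match and remove the format-specific syntax so that only the clean text lines remain.
I came up with this regex: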
/\n?\d*?\n?^.* --> [012345]{2}:.*$/m
Sample content (mixed srt and vtt):
1
00:00:04,019 --> 00:00:07,299
line1
line2
2
00:00:07,414 --> 00:00:09,155
line1
00:00:09,276 --> 00:00:11,429
line1
00:00:11,549 --> 00:00:14,874
line1
line2
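This matches the subtitle numbers and timings as expected in the simulation at https://regex101.com/r/zRsRMR/2/
But when it is used in the code itself (even with the code snippet generated by https://regex101.com), it only matches the timings, not the subtitle numbers.
See the output: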
array (5)
0 => array (1)
0 => "00:00:04,019 --> 00:00:07,299
" (30)
1 => array (1)
0 => "
00:00:07,414 --> 00:00:09,155
" (31)
2 => array (1)
0 => "
00:00:09,276 --> 00:00:11,429
" (31)
3 => array (1)
0 => "
00:00:11,549 --> 00:00:14,874
" (31)
4 => array (1)
0 => "
00:00:11,549 --> 00:00:14,874
" (31)
It can be tested at: http://sandbox.onlinephpfunctions.com/code/dec294251b879144f40a6d1bdd516d2050321242
The goal is to match the subtitle number as well. For example, the first expected match should be:
1
00:00:04,019 --> 00:00:07,299
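I'm not quite sure whether this is what you want to capture. However, you may want to wrap the parts of your string in capturing groups so that they are easy to retrieve. For example, the following expression shows how capturing groups can wrap the characters you want: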
^([0-9]+\n|)([0-9:,->\s]+)
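This may not be the way to do it, nor the best expression. However, it might give you an idea for approaching the problem differently.
I'm guessing that you want to capture the timing line and the line before it, which may or may not contain a number.
Graph
[Figure: visualization of how the expression works]
You might want to write a script that cleans your data before sending it to the RegEx engine, so that you end up with a simple expression.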
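As one example of such a cleanup step, line endings can be normalized before matching. This is an assumption rather than something confirmed in the thread: every match in the question's output ends with a line break inside the quotes, which `.*` cannot produce from a `\n`, so the input may well use Windows-style `\r\n` line endings. In that case `\d*?\n?` cannot bridge from the number to the timing line, because a `\r` sits in between. A minimal JavaScript sketch (the helper name `normalizeNewlines` and the CRLF sample are hypothetical):

```javascript
// Hypothetical helper (not from the thread): convert CRLF and lone CR to LF.
function normalizeNewlines(text) {
    return text.replace(/\r\n?/g, "\n");
}

// The regex from the question, unchanged.
const pattern = /\n?\d*?\n?^.* --> [012345]{2}:.*$/m;

// Assumed input: the first cue of the sample, saved with CRLF line endings.
const crlfSample = "1\r\n00:00:04,019 --> 00:00:07,299\r\nline1\r\n";

// Without normalization the "\r" blocks the number from being matched.
const rawMatch = crlfSample.match(pattern)[0];

// After normalization the number and the timing line match together.
const cleanMatch = normalizeNewlines(crlfSample).match(pattern)[0];

console.log(JSON.stringify(rawMatch));   // "\n00:00:04,019 --> 00:00:07,299"
console.log(JSON.stringify(cleanMatch)); // "1\n00:00:04,019 --> 00:00:07,299"
```

Note that in PCRE's usual configuration `.` also matches `\r`, while in JavaScript it does not, so the exact matched text can differ slightly between the two engines.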
Sample test in JavaScript
const regex = /^([0-9]+\n|)([0-9:,->\s]+)/mg;
const str = `1
00:00:04,019 --> 00:00:07,299
line1
line2
2
00:00:07,414 --> 00:00:09,155
line1
00:00:09,276 --> 00:00:11,429
line1
00:00:11,549 --> 00:00:14,874
line1
line2
`;
let m;
while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
PHP test
This may not generate the output you want; it is just an example:
$re = '/^([0-9]+\n|)([0-9:,->\s]+)/m';
$str = '1
00:00:04,019 --> 00:00:07,299
line1
line2
2
00:00:07,414 --> 00:00:09,155
line1
00:00:09,276 --> 00:00:11,429
line1
00:00:11,549 --> 00:00:14,874
line1
line2
';
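// With PREG_SET_ORDER, $matches[0] is the first match set:
// the full match followed by its capture groups.
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
foreach ($matches[0] as $key => $value) {
    if ($value == "") {
        // Drop empty groups (e.g. a cue without a number).
        unset($matches[0][$key]);
    } else {
        // Strip the surrounding whitespace and newlines.
        $matches[0][$key] = trim($value);
    }
}
var_dump($matches[0]);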
Performance test
This JavaScript snippet shows the performance of that expression using a simple for loop with 1 million iterations.
const repeat = 1000000;
const start = Date.now();
let match;
for (let i = 0; i < repeat; i++) {
    const string = '2 \n00:00:07,414 --> 00:00:09,155';
    const regex = /(.*)([0-9:,->\s]+)/gm;
    // `match` holds what is left of the string after removing the matched parts.
    match = string.replace(regex, "");
}
const end = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match ");
console.log(end / 1000 + " is the runtime of " + repeat + " times benchmark test. ");
If you want to capture all of your desired output in a single variable, you can simply add a capturing group around the whole expression and then refer to that group.
If needed, you can also add or reduce boundaries, for example this one (https://regex101.com/r/Gx36yK/1):
^(?:[0-9]+\n|\n)(([0-9:,]+)([\s->]+)([0-9:,]+))$
Sample test in JavaScript for the second expression
const regex = /^(?:[0-9]+\n|\n)(([0-9:,]+)([\s->]+)([0-9:,]+))$/gm;
const str = `1
00:00:04,019 --> 00:00:07,299
- cdcdc
- cddcd
2
00:00:07,414 --> 00:00:09,155
54564
00:00:09,276 --> 00:00:11,429
- 445454 - ccd
- cdscdcdcd
00:00:11,549 --> 00:00:14,874
line1
line2
`;
let m;
while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
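You could turn this part of your expression, \n?\d*?\n?, into an optional group that matches 1+ digits followed by a newline. The character class [012345] can also be written as [0-5].
You could update your expression to:
^(?:\d+\n)?.*\h+-->\h+[0-5]{2}:.*$
^
Start of string (start of a line with the m flag)
(?:\d+\n)?
Optional 1+ digits and a newline
.*\h+-->\h+
Match 0+ times any char except a newline, then 1+ horizontal whitespace chars, --> and 1+ horizontal whitespace chars
[0-5]{2}:
Match 2 times a digit 0-5, followed by a colon
.*
Match 0+ times any char except a newline
$
End of string (end of a line with the m flag)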
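To tie this back to the stated goal of keeping only the clean text lines, here is a minimal JavaScript sketch of using that expression with `replace`. Several things here are assumptions and not part of the answer: JavaScript has no `\h`, so `[ \t]` stands in for it; a trailing `\n?` is added so the line break after each timing line is removed as well; the function name `stripCueHeaders` is hypothetical; and the input is assumed to use `\n` line endings.

```javascript
// Adaptation of ^(?:\d+\n)?.*\h+-->\h+[0-5]{2}:.*$ for JavaScript:
// [ \t] replaces PCRE's \h, and the added \n? also consumes the line break.
const cueHeader = /^(?:\d+\n)?.*[ \t]+-->[ \t]+[0-5]{2}:.*$\n?/gm;

// Hypothetical helper: remove cue numbers and timing lines, keep the text.
function stripCueHeaders(subtitles) {
    return subtitles.replace(cueHeader, "");
}

// A shortened version of the mixed srt/vtt sample from the question.
const sample = [
    "1",
    "00:00:04,019 --> 00:00:07,299",
    "line1",
    "line2",
    "2",
    "00:00:07,414 --> 00:00:09,155",
    "line1",
    "00:00:09,276 --> 00:00:11,429",
    "line1",
].join("\n");

console.log(stripCueHeaders(sample));
// line1
// line2
// line1
// line1
```

One limitation of this sketch: a subtitle text line that consists only of digits and directly precedes a timing line would be treated as a cue number and removed too.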