How to extract text from a Unicode subtitle file?
I have a Unicode subtitle file in the following format:
3
00:01:40,200 --> 00:01:43,326
english part

4
00:01:43,534 --> 00:01:44,851
خط فارسی

5
00:01:45,063 --> 00:01:48,485
complex part مخلوط

6
00:01:45,063 --> 00:01:48,485
complex part مخلوط
in 2 lines
How can I extract the numbers as keys and the text as values?
[
[3] => english part
[4] => خط فارسی
[5] => complex part مخلوط
[6] => complex part مخلوط</br>in 2 lines
]
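Don't use the numbers you find as array indices. It is better to keep a running index and store key/value pairs.
That said, you could use the following pattern (with the multiline and verbose flags, m and x, enabled):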
^(\d+)\R
[->\d: ,]+\R
((?:.+\R?)+)
In PHP this could be:
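<?php
// Sample subtitle text. This assumes cues are separated by blank lines, as in
// a regular .srt file; the pattern relies on that to know where a cue ends.
$text = <<<END
3
00:01:40,200 --> 00:01:43,326
english part

4
00:01:43,534 --> 00:01:44,851
خط فارسی

5
00:01:45,063 --> 00:01:48,485
complex part مخلوط

6
00:01:45,063 --> 00:01:48,485
complex part مخلوط
in 2 lines
END;

// The ~ delimiters and the m (multiline) and x (verbose) flags are part of the pattern string.
$regex = <<<END
~
^(?P<line>\d+)\R
[->\d: ,]+\R
(?P<content>(?:.+\R?)+)
~mx
END;

preg_match_all($regex, $text, $matches);
// $matches['line'] holds the cue numbers, $matches['content'] the cue text.
print_r($matches);
?>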
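The advice above about keeping a running index with key/value pairs could be sketched roughly as follows. This is only an illustration, not part of the original answer: the `cuesToPairs` helper, the `number`/`text` keys, and the `$separator` parameter are made-up names. It assumes the input is valid UTF-8 (hence the added `u` flag) and that cues are separated by blank lines. It also joins multi-line text with `<br>` by default rather than the `</br>` shown in the question.

```php
<?php
// Hypothetical helper (not from the original answer): turns SRT-style text
// into a list of ['number' => int, 'text' => string] pairs.
function cuesToPairs(string $srt, string $separator = '<br>'): array
{
    // Same pattern as above, with comments (allowed by the x flag) and the
    // u flag added so the subject is treated as UTF-8.
    $regex = '~
        ^(?P<line>\d+)\R            # cue number on its own line
        [->\d: ,]+\R                # timestamp line
        (?P<content>(?:.+\R?)+)     # one or more non-empty text lines
    ~mxu';

    // PREG_SET_ORDER groups the captures per cue instead of per group.
    if (preg_match_all($regex, $srt, $matches, PREG_SET_ORDER) === false) {
        throw new RuntimeException('PCRE error code ' . preg_last_error());
    }

    $pairs = [];
    foreach ($matches as $m) {
        // Drop the trailing line break, then join the remaining lines.
        $lines = preg_split('~\R~u', trim($m['content']));
        // Append with a running index; the cue number is stored as a value.
        $pairs[] = [
            'number' => (int) $m['line'],
            'text'   => implode($separator, $lines),
        ];
    }
    return $pairs;
}

$srt = "3\n00:01:40,200 --> 00:01:43,326\nenglish part\n\n"
     . "6\n00:01:45,063 --> 00:01:48,485\ncomplex part مخلوط\nin 2 lines\n";

print_r(cuesToPairs($srt));
```

Because entries are appended rather than keyed by cue number, repeated or out-of-order numbers in a file do not overwrite earlier entries.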