How to extract text from a Unicode subtitle file?

I have a Unicode subtitle file in the following format:

3
00:01:40,200 --> 00:01:43,326
english part

4
00:01:43,534 --> 00:01:44,851
خط فارسی

5
00:01:45,063 --> 00:01:48,485
complex part مخلوط

6
00:01:45,063 --> 00:01:48,485
complex part مخلوط
in 2 lines

How can I extract the numbers as keys and the text as values?

[
   [3] => english part
   [4] => خط فارسی
   [5] => complex part مخلوط
   [6] => complex part مخلوط</br>in 2 lines
]

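Don't use the found numbers as indexes. Better use an ongoing index and key/value pairs.
That said, you could go for the following (with the multiline and verbose modes enabled, i.e. the `m` and `x` modifiers):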

^(\d+)\R
[->\d: ,]+\R
((?:.+\R?)+)

See a demo on regex101.com


In PHP, this could be:

<?php

$text = <<<END
3
00:01:40,200 --> 00:01:43,326
english part

4
00:01:43,534 --> 00:01:44,851
خط فارسی

5
00:01:45,063 --> 00:01:48,485
complex part مخلوط

6
00:01:45,063 --> 00:01:48,485
complex part مخلوط
in 2 lines
END;

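$regex = <<<END
~
    ^(?P<line>\d+)\R           # subtitle number on its own line
    [->\d: ,]+\R               # timestamp line (digits, colons, commas, spaces and the arrow)
    (?P<content>(?:.+\R?)+)    # one or more consecutive non-empty lines of text
~mx
END;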

preg_match_all($regex, $text, $matches);
print_r($matches);
?>

another demo on ideone.com
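
To get from the matches to the number => text array asked for in the question, one possible follow-up is sketched below. It uses a shortened, hypothetical sample input, writes the same pattern on one line, adds the `u` modifier (not part of the pattern above) so that PCRE treats the subject as UTF-8, and joins multi-line text with `<br>`. Keying by the subtitle number means a repeated number would overwrite an earlier entry, which may be one reason to prefer an ongoing index.

```php
<?php

// Hypothetical, shortened sample in the same layout as the question.
$text = <<<END
3
00:01:40,200 --> 00:01:43,326
english part

6
00:01:45,063 --> 00:01:48,485
complex part مخلوط
in 2 lines
END;

// Same pattern as above on one line; the "u" modifier is an addition.
$regex = '~^(?P<line>\d+)\R[->\d: ,]+\R(?P<content>(?:.+\R?)+)~mu';

// PREG_SET_ORDER yields one array per subtitle block.
preg_match_all($regex, $text, $matches, PREG_SET_ORDER);

$subtitles = [];
foreach ($matches as $m) {
    // Drop the trailing line break, then join the remaining lines with <br>.
    $lines = preg_split('~\R~u', trim($m['content']));
    $subtitles[(int) $m['line']] = implode('<br>', $lines);
}

print_r($subtitles);
```

With this input, `$subtitles` ends up with the keys 3 and 6. To follow the ongoing-index advice instead, append `['number' => ..., 'text' => ...]` pairs with `$subtitles[] = ...` rather than keying by the number.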