Convert a WebVTT file from YouTube to plain text

Question

I am using youtube-dl to download WebVTT files from YouTube.

A typical file looks like this:

WEBVTT
Kind: captions
Language: en

00:00:00.730 --> 00:00:05.200 align:start position:0%

[Applause]

00:00:05.200 --> 00:00:05.210 align:start position:0%
[Applause]


00:00:05.210 --> 00:00:11.860 align:start position:0%
[Applause]
hi<00:00:06.440><c> I'm</c><00:00:07.440><c> here</c><00:00:07.740><c> to</c><00:00:08.160><c> talk</c><00:00:08.429><c> to</c><00:00:09.019><c> share</c><00:00:10.019><c> an</c><00:00:10.469><c> idea</c><00:00:10.820><c> to</c>

00:00:11.860 --> 00:00:11.870 align:start position:0%
hi I'm here to talk to share an idea to


00:00:11.870 --> 00:00:15.890 align:start position:0%
hi I'm here to talk to share an idea to
communicate<00:00:12.920><c> but</c><00:00:13.920><c> what</c><00:00:14.790><c> is</c><00:00:14.940><c> communication</c>

00:00:15.890 --> 00:00:15.900 align:start position:0%
communicate but what is communication

I would like a text file containing:

hi I'm here to talk to share an idea to
communicate but what is communication

Using code I found online, I got as far as this:

cat output.vtt | sed "s/^[0-9]*[0-9\:\.\ \>\-]*//g" | grep -v "^WEBVTT\|^Kind: cap\|^Language" | awk 'BEGIN{ RS="\n\n+"; RS="\n\n" }NR>=2{ print }' > dialogues.txt

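But it is far from perfect: the output contains a lot of useless whitespace, and every sentence appears twice. Would you mind helping me? A similar question has been asked before, but the answers submitted there did not work for me.

Thanks!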

Answer 1

You could do something like this:

sed -e '1,4d' -E -e '/^$|]|>$|%$/d' output.vtt | awk '!seen[$0]++' > dialogues.txt

sed deletes the first 4 lines.
sed then deletes any line that is empty, contains ], or ends with > or %.
awk removes duplicate lines.

Result:

hi I'm here to talk to share an idea to
communicate but what is communication

You may need to tweak it a little, but the result should be much closer to what you want.
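To see what each stage contributes, here is a minimal sketch that runs the same pipeline against a small file. The sample is a shortened, made-up excerpt modeled on the file in the question, and the names `sample` and `result` are only illustrative.

```shell
# Shortened, made-up sample modeled on the file in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
WEBVTT
Kind: captions
Language: en

00:00:05.210 --> 00:00:11.860 align:start position:0%
[Applause]
hi<00:00:06.440><c> I'm</c><00:00:07.440><c> here</c>

00:00:11.860 --> 00:00:11.870 align:start position:0%
hi I'm here

00:00:11.870 --> 00:00:15.890 align:start position:0%
hi I'm here
communicate<00:00:12.920><c> but</c>

00:00:15.890 --> 00:00:15.900 align:start position:0%
communicate but
EOF

result=$(
  # 1,4d          drop the 4-line header
  # /^$|]|>$|%$/d drop empty lines, lines containing ], lines ending in > or %
  sed -E -e '1,4d' -e '/^$|]|>$|%$/d' "$sample" |
  # keep only the first occurrence of each remaining line
  awk '!seen[$0]++'
)
printf '%s\n' "$result"
rm -f "$sample"
```

The timestamp lines disappear because they end with `%`, and the word-timed lines disappear because they end with `>`. A caption line that legitimately contains `]` or ends with `%` would be dropped as well, which is one of the things you may need to tweak.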

Answer 2

Could you please try the following, which uses awk alone.

awk 'FNR<=4 || ($0 ~ /^$|-->|\[|\]|</){next} !a[$0]++'  Input_file

Explanation: here is a commented version of the code above.

awk '                                     ##Start the awk program.
FNR<=4 || ($0 ~ /^$|-->|\[|\]|</){        ##If the line number is 4 or less, OR the line is empty or contains -->, [, ] or <, then:
  next                                    ##skip this line and move on to the next one.
  }
!a[$0]++                                  ##Print a line only the first time it is seen: a[$0] starts at 0, so !a[$0] is true; ++ then marks the line as seen.
'  Input_file                             ##Name of the input file.
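The de-duplication idiom `!a[$0]++` can be tried in isolation. The following sketch feeds it a made-up five-line input; the variable name `deduped` is only illustrative.

```shell
# The first time a line is seen, a[$0] is 0, so !a[$0] is true and the line is
# printed; the post-increment then makes every later copy evaluate to false.
deduped=$(printf 'a\nb\na\nc\nb\n' | awk '!a[$0]++')
printf '%s\n' "$deduped"
# prints:
# a
# b
# c
```

Unlike `sort -u`, this keeps the lines in their original order, which matters for a transcript.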

Answer 3

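If you analyze the pattern of the .vtt file, you basically want to keep every 8th line, starting from line 10. So the algorithm is to delete the first 2 lines and then keep every 8th line: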

$ cat output.vtt | sed '1,2 d' | awk 'NR%8==0'

[Applause]
hi I'm here to talk to share an idea to
communicate but what is communication

sed '1,2 d' deletes the range from line 1 to line 2.
awk 'NR%8==0' prints every 8th line.

If you also want to filter out the "[...]" lines, you can add a grep command, for example grep -v '^\[.*\]$'
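Put together, the three commands can be sketched as follows. The sample is a shortened, made-up file that assumes the same layout as the one in the question (a 4-line header, then a pattern that repeats every 8 lines); the names `sample` and `filtered` are only illustrative.

```shell
# Made-up sample: the wanted text sits on lines 10, 18 and 26.
sample=$(mktemp)
cat > "$sample" <<'EOF'
WEBVTT
Kind: captions
Language: en

00:00:00.730 --> 00:00:05.200 align:start position:0%

[Applause]

00:00:05.200 --> 00:00:05.210 align:start position:0%
[Applause]


00:00:05.210 --> 00:00:11.860 align:start position:0%
[Applause]
hi<00:00:06.440><c> I'm</c><00:00:07.440><c> here</c>

00:00:11.860 --> 00:00:11.870 align:start position:0%
hi I'm here


00:00:11.870 --> 00:00:15.890 align:start position:0%
hi I'm here
communicate<00:00:12.920><c> but</c>

00:00:15.890 --> 00:00:15.900 align:start position:0%
communicate but
EOF

# Drop 2 lines so that lines 10, 18, 26 become 8, 16, 24, keep every 8th
# line, then discard lines that consist only of a bracketed annotation.
filtered=$(sed '1,2 d' "$sample" | awk 'NR%8==0' | grep -v '^\[.*\]$')
printf '%s\n' "$filtered"
rm -f "$sample"
```

Because this approach relies purely on line positions, it only works while the file keeps exactly this 8-line rhythm; a single extra or missing line shifts everything after it.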

Answer 4

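In my case, I wanted to:

Remove the first 4 lines
Remove all timestamp lines
Keep the blank lines between captions

I managed to do this with the following single sed command: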

sed -En '1,4d;/^[0-9].:[0-9].:[0-9].+$/!p' input.vtt > output.txt

If, like me, you need to do this often and you are using Bash, you can also turn it into a Bash function:

function vtt_to_txt() {
    # $1 is the input .vtt file, $2 is the output text file
    sed -En '1,4d;/^[0-9].:[0-9].:[0-9].+$/!p' "$1" > "$2"
}

This lets you simply call the function whenever you need it, like this:

vtt_to_txt input.vtt output.txt
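If the function is going to live in a shell profile, it may help to guard against a missing argument, since an empty output name would make the redirection fail. The following is a minimal sketch of such a variant; the name `vtt_to_txt_checked` and the usage message are hypothetical and not part of the answer above.

```shell
# Hypothetical variant of the function above with an argument check.
vtt_to_txt_checked() {
    if [ "$#" -ne 2 ]; then
        echo "usage: vtt_to_txt_checked INPUT.vtt OUTPUT.txt" >&2
        return 1
    fi
    # 1,4d drops the header; with -n, the !p prints every remaining line
    # that does NOT look like a timestamp line (e.g. "00:00:11.860 --> ...").
    sed -En '1,4d;/^[0-9].:[0-9].:[0-9].+$/!p' "$1" > "$2"
}
```

It is called the same way, for example `vtt_to_txt_checked input.vtt output.txt`. Note that this command only removes the header and the timestamp lines: inline tags such as `<c>` and repeated caption lines are left in place.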
