String formatting for YouTube subtitles in Python

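I'm trying to scrape a section of subtitles from a YouTube video. I got the data, but now I'm having trouble formatting it. I want to remove the timestamps and the extra newlines (\n) and end up with one clean string. What is the best way to approach this so that I can extract subtitles correctly in the future?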

Data:

what if I told you that the world
0:06
creates 2.5 quintillion bytes of data
0:09
every single day would you believe me
0:12
what if I told you that 90% of all data
0:14
ever created in the history of the world
0:17
sprouted in just the past two years do
0:20
you believe me yet well hold on your
0:22
brains folks because both are very very
0:24
true whether it's the 13 new spotify
0:27
songs or the 600 Wikipedia page edits
0:29
maybe the five hundred and twenty seven
0:31
thousand snapchats or how about sixty
0:34
million texts all of this data is
0:36
created no not just in a day in 60
0:39

If you just want to remove the timestamp lines, try this regex call:

import re

rawSubtitles = """what if I told you that the world
0:06
creates 2.5 quintillion bytes of data
0:09
every single day would you believe me
0:12
what if I told you that 90% of all data
0:14
ever created in the history of the world
0:17
sprouted in just the past two years do
0:20
you believe me yet well hold on your
0:22
brains folks because both are very very
0:24
true whether it's the 13 new spotify
0:27
songs or the 600 Wikipedia page edits
0:29
maybe the five hundred and twenty seven
0:31
thousand snapchats or how about sixty
0:34
million texts all of this data is
0:36
created no not just in a day in 60
0:39
"""

# Replace each timestamp line (e.g. "0:06"), together with the newline
# on either side of it, with a single space
cleanSubtitle = re.sub(r"\n\d+:\d+\n", " ", rawSubtitles)

print(cleanSubtitle)

Output:

what if I told you that the world creates 2.5 quintillion bytes of data every single day would you believe me what if I told you that 90% of all data ever created in the history of the world sprouted in just the past two years do you believe me yet well hold on your brains folks because both are very very true whether it's the 13 new spotify songs or the 600 Wikipedia page edits maybe the five hundred and twenty seven thousand snapchats or how about sixty million texts all of this data is created no not just in a day in 60

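How it works:

  • \d+ matches a run of one or more digits
  • ":" matches the colon
  • \n on each side of the pattern matches a newline, which ensures the whole line has the form "#:#"
  • Whatever the pattern matches is replaced with a space, but you can use a different separator if you prefer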
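If you need something a little more defensive, a line-by-line filter is one option. The single-pass pattern above consumes the newline on both sides of a timestamp, so it would skip the second of two back-to-back timestamp lines and a timestamp on the very first line. The sketch below is only an illustration under some assumptions: each timestamp sits on its own line in m:ss or h:mm:ss form, and the names clean_transcript and TIMESTAMP_LINE are hypothetical helpers, not part of any library.

```python
import re

# A line consisting only of a timestamp, e.g. "0:06", "12:34" or "1:02:03"
# (assumed formats; adjust if your transcript differs)
TIMESTAMP_LINE = re.compile(r"\d+(?::\d{2}){1,2}")


def clean_transcript(raw: str) -> str:
    """Drop timestamp-only and blank lines, then join the rest with spaces."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and lines that are nothing but a timestamp
        if not line or TIMESTAMP_LINE.fullmatch(line):
            continue
        kept.append(line)
    return " ".join(kept)


sample = "what if I told you that the world\n0:06\ncreates 2.5 quintillion bytes of data\n0:09\n"
print(clean_transcript(sample))
```

Because fullmatch is applied to the whole stripped line, text that merely contains something like "3:45" in the middle of a sentence is kept, and the result has no trailing space.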