How to programmatically download a m3u8 video referenced in a blob in Python?

Question

Please note that this question differs from "How do we download a blob url video [closed]" in that it must work without any human interaction with a browser.

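I have the following problem:

I have a list of URLs. They point to HTML pages that all share the same underlying structure.
In the middle of each page there is an image; clicking it loads a player.
The player, loaded as a blob, references an m3u8 playlist, although this is not visible in the HTML itself (it is visible in Chrome's Network tab).
The player plays a short video.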

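What I need to do:

Programmatically visit each of the URLs. Get the HTML and click the image to start the player.
Get the blob reference and use it to get the m3u8 playlist.
Download the stream as a video (bonus points for downloading it as a GIF).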

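Note that the solution must not require any human interaction with the browser. API-wise, the input should be a list of URLs and the output should be a list of videos/GIFs.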

An example page can be found here if you want to test your solution.

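My understanding is that I can use Selenium to get the HTML and click the image to start the player. However, I don't know how to process the blob to get the m3u8, and then how to use that to get the actual video.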

Answer 1

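With a bit of digging, it turns out you don't need to click any button. Clicking the button simply requests a master.m3u8 file. Using the dev tools, you can piece together the URL being requested. The catch is that this first file doesn't contain links to the actual video. You piece together a second request to get the final m3u8 file. From there, you can use the other SO links to download the video. It is segmented, so it can't be downloaded directly. You can uncomment the print statements below to see what each m3u8 file contains. The code also loops through the pages.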

    import re

    import requests
    from bs4 import BeautifulSoup

    for i in range(6119, 6121):
        url = 'https://www2.nhk.or.jp/signlanguage/sp/enquete.cgi?dno={}'.format(i)
        page = requests.get(url)
        soup = BeautifulSoup(page.text, 'html.parser')
        # locate the element whose onclick handler carries the data we need
        tag = soup.find(onclick=re.compile('signlanguage/movie'))
        print(tag)

        # the video id is the second argument of the onclick call
        video_id = tag.get('onclick').split(',')[1].replace("'", "")
        m3u8_url = 'https://nhks-vh.akamaihd.net/i/signlanguage/movie/v4/{}/{}.mp4/master.m3u8'.format(video_id[-1], video_id)
        # this m3u8 file doesn't contain download links, the next one does; so download and save that one
        r = requests.get(m3u8_url)
        # print(r.text)

        m3u8_url_2 = r.text.split('\n')[2]  # get first link; high bandwidth
        r2 = requests.get(m3u8_url_2)
        # print(r2.text)

        # there are other ways to download the file, i'm just creating a new one with the data read and writing to a file
        fn = video_id + '.m3u8'
        with open(fn, 'w') as f:  # the with block closes the file on exit
            f.write(r2.text)
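
The snippet above picks the variant by line position (`r.text.split('\n')[2]`) and stops once the final playlist is saved. As a rough sketch of the remaining step, the code below chooses the variant by its `BANDWIDTH` attribute and then concatenates the segments into a single file. The helper names (`pick_best_variant`, `segment_urls`, `download_stream`) are hypothetical and not part of any library. It also assumes the stream is unencrypted (no `#EXT-X-KEY`) and uses MPEG-TS segments, which has not been checked against the site in question.

```python
import re
from urllib.parse import urljoin


def pick_best_variant(master_text, master_url):
    """Return the absolute URL of the highest-BANDWIDTH variant, or None."""
    lines = [ln.strip() for ln in master_text.splitlines() if ln.strip()]
    best_bandwidth, best_uri = -1, None
    for idx, line in enumerate(lines):
        if not line.startswith('#EXT-X-STREAM-INF'):
            continue
        # the lookbehind skips AVERAGE-BANDWIDTH, which may appear on the same line
        match = re.search(r'(?<!-)BANDWIDTH=(\d+)', line)
        bandwidth = int(match.group(1)) if match else 0
        # the variant URI is the line that follows the STREAM-INF tag
        if idx + 1 < len(lines) and not lines[idx + 1].startswith('#'):
            if bandwidth > best_bandwidth:
                best_bandwidth, best_uri = bandwidth, lines[idx + 1]
    return urljoin(master_url, best_uri) if best_uri else None


def segment_urls(media_text, media_url):
    """Return the absolute URLs of the segments listed in a media playlist."""
    urls = []
    for line in media_text.splitlines():
        line = line.strip()
        if line and not line.startswith('#'):  # non-tag lines are segment URIs
            urls.append(urljoin(media_url, line))
    return urls


def download_stream(master_url, out_path):
    """Fetch every segment of the best variant and append it to out_path."""
    import requests  # third-party; imported here so the parsers stay stdlib-only

    master = requests.get(master_url)
    master.raise_for_status()
    media_url = pick_best_variant(master.text, master_url)
    if media_url is None:
        raise ValueError('no variant streams found in master playlist')
    media = requests.get(media_url)
    media.raise_for_status()
    with open(out_path, 'wb') as out:
        for seg_url in segment_urls(media.text, media_url):
            seg = requests.get(seg_url)
            seg.raise_for_status()
            out.write(seg.content)  # assumes plain, unencrypted MPEG-TS segments
```

Inside the loop above this could be called as `download_stream(m3u8_url, video_id + '.ts')`. If ffmpeg is installed, another option is to pass the playlist URL to it directly (for example `ffmpeg -i <playlist url> -c copy out.mp4`) and let it handle the segments. It can also convert the resulting video to a GIF.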

python

beautifulsoup

web-scraping

m3u

m3u8