Extract all YouTube video URLs from the homepage using Python and Selenium
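I want to build a YouTube recommendation scraper that crawls the YouTube homepage for video IDs/links, so that I can download the videos later with youtube-dl. However, I don't know how or where to actually get this information.
The code I tried is as follows: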
from selenium import webdriver
driver = webdriver.Chrome('./chromedriver/chromedriver')
driver.get("https://www.youtube.com")
while True:
    data = driver.find_elements_by_xpath("?")
    for i in data:
        l = i.get_attribute('href')  # Should obtain some of the links/ids on the page but is None...
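Your selector does not match any element. Looking at the HTML source of my YouTube front page, I noticed that each element containing a video is an a-tag with the id 'thumbnail', and that this tag carries the 'href' attribute directly.
Given that, you can find the elements by this exact id and collect the 'href' attribute of each one with a simple list comprehension, like so: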
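from selenium.webdriver.common.by import By  # Selenium 4 removed find_elements_by_id

driver.get("https://www.youtube.com")
# Collect the href of every element whose id is "thumbnail"
hrefs = [video.get_attribute('href') for video in driver.find_elements(By.ID, "thumbnail")]
for href in hrefs:
    print(href)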
Output:
https://www.youtube.com/watch?v=PcYxbxXJhcc
https://www.youtube.com/watch?v=oTL52-NvyE4
https://www.youtube.com/watch?v=8kVI621fZug
https://www.youtube.com/watch?v=Pr9TdbTDMH0
https://www.youtube.com/watch?v=iL9upp5jahg
https://www.youtube.com/watch?v=iWnb3IqCfgc
https://www.youtube.com/watch?v=ehAwNw4xDRM
https://www.youtube.com/watch?v=PzVj7s4JZhE
https://www.youtube.com/watch?v=7fBdqdqRxFM
https://www.youtube.com/watch?v=WMweEpGlu_U
https://www.youtube.com/watch?v=2ljGwsbRLaI
https://www.youtube.com/watch?v=aUgEPebvR2Q
https://www.youtube.com/watch?v=Gh6ovYtD2Q8
https://www.youtube.com/watch?v=dVICcSLIHCM
https://www.youtube.com/watch?v=bl6mPR5t6Dk
https://www.youtube.com/watch?v=mMKXCfTDjvg
https://www.youtube.com/watch?v=z_HhNWNm_jo
https://www.youtube.com/watch?v=ZtiqfY8fixU
https://www.youtube.com/watch?v=9eAcRFlXxgo
https://www.youtube.com/watch?v=omC2eg-d-6Q
https://www.youtube.com/watch?v=E90SOw7fIVk
https://www.youtube.com/watch?v=5qap5aO4i9A
https://www.youtube.com/watch?v=T3ua3xTfbFI
https://www.youtube.com/watch?v=DTvS9lvRxZ8
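Since the question asks for video IDs as well as links, the following is a minimal sketch of one way to turn such hrefs into IDs using only the standard library. The helper name `extract_video_ids` and the `sample_hrefs` list are hypothetical and only for illustration; the sketch assumes the links have the `watch?v=...` form shown in the output, and it skips `None` entries, which `get_attribute('href')` returns when an element has no href.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_ids(hrefs):
    """Return unique video IDs from watch URLs, keeping first-seen order."""
    seen = set()
    video_ids = []
    for href in hrefs:
        if not href:  # skip None or empty hrefs
            continue
        # A watch URL carries the video ID in its "v" query parameter.
        values = parse_qs(urlparse(href).query).get("v")
        if not values:
            continue
        video_id = values[0]
        if video_id not in seen:
            seen.add(video_id)
            video_ids.append(video_id)
    return video_ids


# Hypothetical sample input, taken from the first links in the output
sample_hrefs = [
    "https://www.youtube.com/watch?v=PcYxbxXJhcc",
    None,
    "https://www.youtube.com/watch?v=oTL52-NvyE4",
    "https://www.youtube.com/watch?v=PcYxbxXJhcc",
]
print(extract_video_ids(sample_hrefs))  # ['PcYxbxXJhcc', 'oTL52-NvyE4']
```

Deduplicating here is a design choice: the same video may be linked more than once on a page, and each video presumably only needs to be downloaded once.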
Always analyze the HTML structure of your target before scraping, then choose the selector that best fits the data you are looking for.