Scraping New YouTube Videos With BeautifulSoup
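I am new to Python and I want to do some web scraping on YouTube.
I want to use this link to get the newest uploaded videos: 'https://www.youtube.com/results?search_query=programming&sp=CAISBAgBEAE%253D'. I want to scrape the 5 newest videos. How can I do that?
I have tested it with this code (I only want the links):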
from bs4 import BeautifulSoup
import requests

url = "https://www.youtube.com/results?search_query=programming&sp=CAISBAgBEAE%253D"
html = requests.get(url)
soup = BeautifulSoup(html.text, features="html.parser")
for entry in soup.find_all("entry"):
    for link in entry.find_all("link"):
        print(link["href"])
Edit: I get no response in the Python terminal. It doesn't scrape anything; it only shows the default ">>>" prompt.
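You cannot scrape YouTube without a Google YouTube API key, which you can obtain by following these steps. If you still want to try, I can repost a legitimate answer to your question.
In the meantime, try practicing parsing with BeautifulSoup on this site: videvo.net
Here is some code to help you get started: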
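from bs4 import BeautifulSoup
import requests

def get_source(url):
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})  # browser-like User-Agent
    return BeautifulSoup(response.text, "html.parser")

soup = get_source("http://videvo.net")
for tag in soup.find_all("a", href=True):  # skip anchors that have no href attribute
    print(tag["href"])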
Edit
I stand corrected (slightly). YouTube's main URL can't be parsed this way, but the channel feed can. You can try this code:
from bs4 import BeautifulSoup
import requests

def get_source(url):
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})  # browser-like User-Agent
    return BeautifulSoup(response.text, "html.parser")

soup = get_source("https://www.youtube.com/feeds/videos.xml?user=kinagrannis")
# Each <entry> in the feed describes one video.
for entry in soup.find_all("entry"):
    for title in entry.find_all("title"):
        print(title.text)
    for link in entry.find_all("link"):
        print(link["href"])
    for name in entry.find_all("name"):
        print(name.text)
    for pub in entry.find_all("published"):
        print(pub.text)
Note: you can replace 'kinagrannis' with any username: user=[username]
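Since the question asks for only the 5 newest videos, the feed approach above can be trimmed to the first few entries. The following is a minimal sketch, not the answer's own code: it swaps BeautifulSoup for the standard library's xml.etree.ElementTree so it runs without extra packages, it assumes the feed is Atom XML that lists the newest entries first, and `latest_entries` and `sample_feed` are hypothetical names with made-up sample data.

```python
import xml.etree.ElementTree as ET

# Atom elements live in this XML namespace.
ATOM = "{http://www.w3.org/2005/Atom}"

def latest_entries(feed_xml, limit=5):
    """Return title, link and publish date for the first `limit` feed entries."""
    root = ET.fromstring(feed_xml)
    videos = []
    for entry in root.findall(f"{ATOM}entry")[:limit]:
        link = entry.find(f"{ATOM}link")
        videos.append({
            "title": entry.findtext(f"{ATOM}title"),
            "link": link.get("href") if link is not None else None,
            "published": entry.findtext(f"{ATOM}published"),
        })
    return videos

# Hand-written stand-in for a downloaded feed (7 fake entries).
sample_feed = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    + "".join(
        f"<entry><title>Sample video {i}</title>"
        f'<link rel="alternate" href="https://www.youtube.com/watch?v=SAMPLE{i}"/>'
        f"<published>2020-01-0{i}T00:00:00+00:00</published></entry>"
        for i in range(1, 8)
    )
    + "</feed>"
)

for video in latest_entries(sample_feed):
    print(video["title"], video["link"], video["published"])
```

With a real feed you would pass the downloaded body, for example `requests.get(feed_url).content`, in place of `sample_feed`.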
You can scrape YouTube by:
- using the requests-HTML, playwright, or selenium libraries;
- using regular expressions;
- using the YouTube Search Engine Results API from SerpApi.
Code (really basic, just to give an idea):
from requests_html import HTMLSession

session = HTMLSession()
url = "https://www.youtube.com/results?search_query=programming&sp=CAISBAgBEAE%253D"
response = session.get(url)
# Render the JavaScript-driven page and scroll down to load more results.
response.html.render(sleep=1, keep_page=True, scrolldown=2)
for links in response.html.find('a#video-title'):
    link = next(iter(links.absolute_links))
    print(link)
Output:
https://www.youtube.com/watch?v=OUnxJk3Bphk
https://www.youtube.com/watch?v=vWvtt1ESNeY
https://www.youtube.com/watch?v=b8OIZu5y_Ak
https://www.youtube.com/watch?v=xp3fHaT2_VE
https://www.youtube.com/watch?v=e9toQAcjOrw
https://www.youtube.com/watch?v=em0Is0nyaXA
https://www.youtube.com/watch?v=N5JVTUAGmAM
https://www.youtube.com/watch?v=a0hQG-UdhYc
https://www.youtube.com/watch?v=SmQFxQ1fa2o
https://www.youtube.com/watch?v=uuMS1FYLgWQ
https://www.youtube.com/watch?v=8WJ-zSE32ZY
https://www.youtube.com/watch?v=c5MtH-xDspg
https://www.youtube.com/watch?v=5Xktqz6VUTU
https://www.youtube.com/watch?v=Wbo6j_iq2XY
https://www.youtube.com/watch?v=8eu9nliySO4
https://www.youtube.com/watch?v=j28PjOy_uk8
https://www.youtube.com/watch?v=fM2Ordt8Q9E
https://www.youtube.com/watch?v=tFSkaIVyNno
https://www.youtube.com/watch?v=1hDXlc2C3Rw
https://www.youtube.com/watch?v=vH9_Eo7VW3c
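The output above contains 20 links, while the question asks for only 5. A minimal sketch of trimming any list of scraped links to the first 5 distinct ones follows; `first_unique` is a hypothetical helper, and the sample list reuses a few links from the output above with one deliberate duplicate.

```python
from itertools import islice

def first_unique(links, limit=5):
    """Return the first `limit` distinct links, keeping their original order."""
    # dict.fromkeys drops duplicates while preserving insertion order.
    return list(islice(dict.fromkeys(links), limit))

scraped = [
    "https://www.youtube.com/watch?v=OUnxJk3Bphk",
    "https://www.youtube.com/watch?v=vWvtt1ESNeY",
    "https://www.youtube.com/watch?v=OUnxJk3Bphk",  # duplicate on purpose
    "https://www.youtube.com/watch?v=b8OIZu5y_Ak",
    "https://www.youtube.com/watch?v=xp3fHaT2_VE",
    "https://www.youtube.com/watch?v=e9toQAcjOrw",
    "https://www.youtube.com/watch?v=em0Is0nyaXA",
]

for link in first_unique(scraped):
    print(link)
```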
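Using regex without a headless browser.
You need to reach the var ytInitialData element and then "commandMetadata", where you will find the video URL: {"url":"/watch?v=Ae2TRkpjRCc",....
Here is a starting point on regex101 that grabs everything inside var ytInitialData.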
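To make the regex idea concrete, here is a minimal sketch under stated assumptions, not the exact pattern from the regex101 link. It assumes the page embeds its data as `var ytInitialData = {...};` directly before a closing `</script>` tag and that the JSON contains "url" keys holding `/watch?v=` paths. The names `YT_INITIAL_DATA`, `iter_watch_urls`, `extract_watch_urls`, and the hand-written `sample_html` are hypothetical; YouTube's real markup may differ and can change.

```python
import json
import re

# Assumed layout: var ytInitialData = {...};</script>
YT_INITIAL_DATA = re.compile(
    r"var ytInitialData\s*=\s*(\{.*?\})\s*;\s*</script>", re.DOTALL
)

def iter_watch_urls(node):
    """Recursively yield every "url" value that starts with /watch?v=."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "url" and isinstance(value, str) and value.startswith("/watch?v="):
                yield value
            else:
                yield from iter_watch_urls(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_watch_urls(item)

def extract_watch_urls(html, limit=5):
    """Return up to `limit` distinct absolute watch URLs found in the page source."""
    match = YT_INITIAL_DATA.search(html)
    if match is None:
        return []
    data = json.loads(match.group(1))
    urls = []
    for path in iter_watch_urls(data):
        full = "https://www.youtube.com" + path
        if full not in urls:
            urls.append(full)
        if len(urls) == limit:
            break
    return urls

# Hand-written stand-in for the page source; the nesting is illustrative only.
sample_html = (
    '<script>var ytInitialData = {"contents": [{"videoRenderer": {"navigationEndpoint": '
    '{"commandMetadata": {"webCommandMetadata": {"url": "/watch?v=Ae2TRkpjRCc"}}}}}, '
    '{"videoRenderer": {"navigationEndpoint": {"commandMetadata": {"webCommandMetadata": '
    '{"url": "/watch?v=KIPp63tXKpU"}}}}}]};</script>'
)

print(extract_watch_urls(sample_html))
```

Parsing the captured text with `json.loads` and walking the result tends to be less brittle than matching each URL with a second regex, because it does not depend on the exact order of keys around "commandMetadata".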
Alternatively, you can use the YouTube Search Engine Results API from SerpApi. It's a paid API with a free plan. Check out the Playground.
Code to integrate:
from serpapi import GoogleSearch

params = {
    "engine": "youtube",
    "search_query": "programming",
    "sp": "CAISBAgBEAE%253D",
    "api_key": "your_secret_api_key"
}

search = GoogleSearch(params)
results = search.get_dict()
for link in results['video_results']:
    print(f"Title: {link['title']}\nLink: {link['link']}\n")
Output:
Title: CLASS VIII BASIC HTML TAGS AND PROGRAMMING 15 4 101
Link: https://www.youtube.com/watch?v=KIPp63tXKpU
Title: For loop in c programming #bssdlectureclasses
Link: https://www.youtube.com/watch?v=nfRN0x9VvQc
Title: [C#] Programming NatsukiBot
Link: https://www.youtube.com/watch?v=chnigx-ezwg
Title: CS201 Short Lecture - 03 | VU Short Lecture | Introduction to Programming in (Urdu / Hindi)
Link: https://www.youtube.com/watch?v=qoxXJchd7N4
Title: Programming in C Language - While statement
Link: https://www.youtube.com/watch?v=cl0OpNCdF5I
Title: Introduction to html and Basic programming
Link: https://www.youtube.com/watch?v=A4We3NGqxuA
Title: Use of Printf & Scanf functions | Part 7 | C Programming | PadhoChalo
Link: https://www.youtube.com/watch?v=578xS-Ugc2c
Title: C++ course has started | Computer Programming | Aashu |
Link: https://www.youtube.com/watch?v=SjFgTK2HqbE
Title: Mitsubishi Outlander 2008 prox/twist transponder key programming tip
Link: https://www.youtube.com/watch?v=HlSJcBwxKFQ
Title: Computer Programming 1 -Introduction to the course
Link: https://www.youtube.com/watch?v=xdmPbhTT01g
Title: Programming, Data Structures and Algorithms in Python
Link: https://www.youtube.com/watch?v=0fUddu9cdAU
P.S. I wrote two blog posts, Scrape YouTube Search with Python (part 1) and Scrape YouTube Search with Python (part 2), that cover this in more depth with visual representations.
Disclaimer: I work for SerpApi.