Scraping New YouTube Videos With BeautifulSoup

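I'm new to Python and I want to do some web scraping on YouTube. I want to use this link to get the most recently uploaded videos: 'https://www.youtube.com/results?search_query=programming&sp=CAISBAgBEAE%253D'. I want to scrape the 5 newest videos. How can I do that? I have tested it with the code below (I only want the links):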
from bs4 import BeautifulSoup
import requests

url="https://www.youtube.com/results?search_query=programming&sp=CAISBAgBEAE%253D"
html = requests.get(url)
soup = BeautifulSoup(html.text, features="html.parser") 

for entry in soup.find_all("entry"):
    for link in entry.find_all("link"):
        print(link["href"])

Edit: I don't get any response from the Python terminal. It doesn't scrape anything; it only shows the default ">>>" prompt.

You can't scrape YouTube without a Google YouTube API key, which you can obtain by following these steps. If you still want to try, I can repost a legitimate answer to your question.

In the meantime, try practicing parsing with BeautifulSoup on this site: videvo.net

Here is some code to get you started:

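from bs4 import BeautifulSoup
import requests

def get_source(url):
    # Fetch the page with a browser-like User-Agent and parse the HTML.
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    return BeautifulSoup(response.text, "html.parser")

soup = get_source("http://videvo.net")

# Print the href of every anchor tag that actually has one.
for tag in soup.find_all("a", href=True):
    print(tag["href"])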

Edit: I stand corrected (slightly). YouTube's main URLs can't be parsed this way, but you can try this code, which reads a channel's feed instead:

from bs4 import BeautifulSoup
import requests

def get_source(url):
    # Fetch the URL with a browser-like User-Agent and parse the response.
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    return BeautifulSoup(response.text, "html.parser")

soup = get_source("https://www.youtube.com/feeds/videos.xml?user=kinagrannis")

# Each <entry> element in the feed describes one video.
for entry in soup.find_all("entry"):
    for title in entry.find_all("title"):
        print(title.text)
    for link in entry.find_all("link"):
        print(link["href"])
    for name in entry.find_all("name"):
        print(name.text)
    for pub in entry.find_all("published"):
        print(pub.text)

Note: you can replace 'kinagrannis' with any username, i.e. user=[username].
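Since the question asks for only the five newest videos, here is a minimal sketch of how the feed approach above could be limited to the first few entries. It swaps BeautifulSoup for the standard library's xml.etree.ElementTree so it runs without extra packages. The function name latest_links, the sample feed, and its placeholder video IDs are made up for illustration, and the sketch assumes the feed is Atom and lists entries newest first.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace prefix

def latest_links(feed_xml, limit=5):
    """Return the link href of the first `limit` entries in an Atom feed."""
    root = ET.fromstring(feed_xml)
    links = []
    for entry in root.iter(ATOM + "entry"):
        if len(links) >= limit:
            break
        link = entry.find(ATOM + "link")
        if link is not None and link.get("href"):
            links.append(link.get("href"))
    return links

# Made-up sample feed with placeholder video IDs, for illustration only.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example channel</title>
  <link rel="alternate" href="https://www.youtube.com/channel/EXAMPLE"/>
  <entry>
    <title>Video A</title>
    <link rel="alternate" href="https://www.youtube.com/watch?v=VIDEO_A"/>
  </entry>
  <entry>
    <title>Video B</title>
    <link rel="alternate" href="https://www.youtube.com/watch?v=VIDEO_B"/>
  </entry>
  <entry>
    <title>Video C</title>
    <link rel="alternate" href="https://www.youtube.com/watch?v=VIDEO_C"/>
  </entry>
</feed>"""

print(latest_links(SAMPLE_FEED, limit=2))
```

With the real feed you would pass the downloaded XML instead of the sample, for example requests.get(feed_url).content, where feed_url is the videos.xml address used above.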

You can scrape YouTube by:

  • using the requests-html, playwright, or selenium libraries.
  • using regular expressions.
  • using the YouTube Search Engine Results API from SerpApi.

Code (really basic, just to give you the idea):

from requests_html import HTMLSession

session = HTMLSession()
url = "https://www.youtube.com/results?search_query=programming&sp=CAISBAgBEAE%253D"
response = session.get(url)
response.html.render(sleep=1, keep_page = True, scrolldown = 2)

for links in response.html.find('a#video-title'):
    link = next(iter(links.absolute_links))
    print(link)

Output:

https://www.youtube.com/watch?v=OUnxJk3Bphk
https://www.youtube.com/watch?v=vWvtt1ESNeY
https://www.youtube.com/watch?v=b8OIZu5y_Ak
https://www.youtube.com/watch?v=xp3fHaT2_VE
https://www.youtube.com/watch?v=e9toQAcjOrw
https://www.youtube.com/watch?v=em0Is0nyaXA
https://www.youtube.com/watch?v=N5JVTUAGmAM
https://www.youtube.com/watch?v=a0hQG-UdhYc
https://www.youtube.com/watch?v=SmQFxQ1fa2o
https://www.youtube.com/watch?v=uuMS1FYLgWQ
https://www.youtube.com/watch?v=8WJ-zSE32ZY
https://www.youtube.com/watch?v=c5MtH-xDspg
https://www.youtube.com/watch?v=5Xktqz6VUTU
https://www.youtube.com/watch?v=Wbo6j_iq2XY
https://www.youtube.com/watch?v=8eu9nliySO4
https://www.youtube.com/watch?v=j28PjOy_uk8
https://www.youtube.com/watch?v=fM2Ordt8Q9E
https://www.youtube.com/watch?v=tFSkaIVyNno
https://www.youtube.com/watch?v=1hDXlc2C3Rw
https://www.youtube.com/watch?v=vH9_Eo7VW3c

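Using regex without a headless browser

You need to reach the var ytInitialData element and then "commandMetadata", where you will find the video URL: {"url":"/watch?v=Ae2TRkpjRCc",....

Here is a starting point on regex101 that grabs everything inside var ytInitialData.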
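As a rough sketch of the regex idea in Python: the pattern below is my own guess at how the data is embedded (var ytInitialData = {...}; followed by the closing script tag), not the regex101 pattern mentioned above, and the sample HTML is a heavily simplified, made-up stand-in for a real page. The function names are hypothetical as well. It pulls out the JSON, then walks it and collects every "url" value that starts with /watch?v=.

```python
import json
import re

# Assumption: the page embeds its data as `var ytInitialData = {...};</script>`.
INITIAL_DATA_RE = re.compile(
    r"var ytInitialData\s*=\s*(\{.*?\})\s*;\s*</script>", re.DOTALL
)

def extract_initial_data(html):
    """Return the ytInitialData JSON as a dict, or None if it is not found."""
    match = INITIAL_DATA_RE.search(html)
    return json.loads(match.group(1)) if match else None

def collect_watch_urls(node, found=None):
    """Walk nested dicts/lists and collect unique "url" values for videos."""
    if found is None:
        found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "url" and isinstance(value, str) and value.startswith("/watch?v="):
                full = "https://www.youtube.com" + value
                if full not in found:  # keep order, skip duplicates
                    found.append(full)
            else:
                collect_watch_urls(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_watch_urls(item, found)
    return found

# Simplified, made-up page fragment; real pages are far larger and may differ.
SAMPLE_HTML = (
    '<html><script>var ytInitialData = {"contents": ['
    '{"commandMetadata": {"webCommandMetadata": {"url": "/watch?v=Ae2TRkpjRCc"}}}, '
    '{"commandMetadata": {"webCommandMetadata": {"url": "/watch?v=Ae2TRkpjRCc"}}}, '
    '{"commandMetadata": {"webCommandMetadata": {"url": "/results?search_query=x"}}}'
    ']};</script></html>'
)

data = extract_initial_data(SAMPLE_HTML)
print(collect_watch_urls(data))
```

On a real page you would pass the text returned by requests.get(url) instead of SAMPLE_HTML. Slicing the result with [:5] would give the first five links, though whether those are the five newest depends on the sort order of the search URL.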


Alternatively, you can use the YouTube Search Engine Results API from SerpApi. It's a paid API with a free plan. Check out the Playground.

Code to integrate:

from serpapi import GoogleSearch

params = {
  "engine": "youtube",
  "search_query": "programming",
  "sp": "CAISBAgBEAE%253D",
  "api_key": "your_secret_api_key"
}

search = GoogleSearch(params)
results = search.get_dict()

for link in results['video_results']:
    print(f"Title: {link['title']}\nLink: {link['link']}\n")

Output:

Title: CLASS VIII BASIC HTML TAGS AND PROGRAMMING 15 4 101`
Link: https://www.youtube.com/watch?v=KIPp63tXKpU

Title: For loop in c programming #bssdlectureclasses
Link: https://www.youtube.com/watch?v=nfRN0x9VvQc

Title: [C#] Programming NatsukiBot
Link: https://www.youtube.com/watch?v=chnigx-ezwg

Title: CS201 Short Lecture - 03 | VU Short Lecture | Introduction to Programming in (Urdu / Hindi)
Link: https://www.youtube.com/watch?v=qoxXJchd7N4

Title: Programming in C Language - While statement
Link: https://www.youtube.com/watch?v=cl0OpNCdF5I

Title: Introduction to html and Basic programming
Link: https://www.youtube.com/watch?v=A4We3NGqxuA

Title: Use of Printf & Scanf functions | Part 7 | C Programming | PadhoChalo
Link: https://www.youtube.com/watch?v=578xS-Ugc2c

Title: C++ course has started | Computer Programming | Aashu |
Link: https://www.youtube.com/watch?v=SjFgTK2HqbE

Title: Mitsubishi Outlander 2008 prox/twist transponder key programming tip
Link: https://www.youtube.com/watch?v=HlSJcBwxKFQ

Title: Computer Programming 1 -Introduction to the course
Link: https://www.youtube.com/watch?v=xdmPbhTT01g

Title: Programming, Data Structures and Algorithms in Python
Link: https://www.youtube.com/watch?v=0fUddu9cdAU

P.S. I wrote two blog posts, Scrape YouTube Search with Python (part 1) and Scrape YouTube Search with Python (part 2), which cover this in more depth with visual illustrations.

Disclaimer: I work for SerpApi.