Extracting values from Beautiful Soup

Question

I'm quite new to programming and I'm building a voice assistant in Python. I found this code on GitHub, but it doesn't work properly. Here is the code:

# Imports assumed from how get(), quote() and BeautifulSoup are used below
from urllib.parse import quote

from bs4 import BeautifulSoup
from requests import get


def Play(speech):
    if speech.endswith("on YouTube"):
        searchTerm = speech.split()
        response = get("https://www.youtube.com/results?search_query=" + quote(" ".join(searchTerm[:-2])))
        soup = BeautifulSoup(response.text, "html.parser")
        videos = soup.findAll(attrs={"class":"yt-uix-tile-link"})[1:4]
        #Was [:3], changed to [1:4] to try to stop ads
        #Try to remove google ads if possible (May have fixed, but test this)
        names = list()
        links = list()
        for i in range(len(videos)):
            names.insert(i, videos[i]["title"])
            links.insert(i, "https://www.youtube.com" + videos[i]["href"])
        print("I found 3 videos. " + ". ".join(names), links)

The URL passed as an argument to get() works fine, and so does the soup variable, but videos is empty, so nothing gets printed at the end, and I don't know how to fix this.

Any ideas would be appreciated :)

Answer 1

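You can't get the content of dynamic websites like YouTube with requests alone: the results are built by JavaScript in the browser, so the HTML that requests downloads does not contain them. Sorry to be so blunt, but that's how it is.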

You need to load the URL first, then render the response in the background with something like Chromium, and then pass the rendered result to Beautiful Soup.

Rendering takes 1-2 seconds. That is simply how it has to be done.
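To make the point above concrete, here is a minimal offline sketch. The HTML string is a made-up stand-in for what a plain HTTP GET might return from a JavaScript-driven page (it is not real YouTube output): the link only appears as text inside a script, so Beautiful Soup never sees it as an element.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw response of a JavaScript-driven page:
# an empty container plus a script that would fill it in only once a
# browser actually runs it.
STATIC_HTML = """
<html><body>
  <div id="results"></div>
  <script>
    document.getElementById("results").innerHTML =
      '<a class="yt-uix-tile-link" href="/watch?v=abc" title="Demo">Demo</a>';
  </script>
</body></html>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")

# The parser does not execute the script, so the link exists only as text
# inside the <script> tag and not as a real <a> element.
videos = soup.find_all(attrs={"class": "yt-uix-tile-link"})
print(len(videos))
```

This is the same symptom as in the question: the request succeeds and the soup is built, but the search for the class comes back empty.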

Here is a snippet that extracts the content of a dynamic website and passes it to BeautifulSoup:

# pip install playwright
from urllib.parse import quote

from bs4 import BeautifulSoup
# After installing the package you also need to download a browser
# (Chromium, the "thing" I was talking about): python -m playwright install
from playwright.sync_api import sync_playwright


def get_dynamic_soup(url: str) -> BeautifulSoup:
    with sync_playwright() as p:
        # Launch the browser
        browser = p.chromium.launch()

        # Open a new browser page
        page = browser.new_page()

        # Navigate to the URL in the opened page
        page.goto(url)

        # Parse the rendered HTML with BeautifulSoup
        soup = BeautifulSoup(page.content(), "html.parser")

        browser.close()

        return soup

# searchTerm is the list from your code (speech.split())
_url = "https://www.youtube.com/results?search_query=" + quote(" ".join(searchTerm[:-2]))
soup = get_dynamic_soup(_url)
# now you can do whatever you want with the soup

Then you can do your thing:

videos = soup.findAll(attrs={"class":"yt-uix-tile-link"})[1:4]
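The search URL used above relies on `quote` and `searchTerm` from the question's code. As a rough sketch of that step on its own, the helper below (`build_search_url` is a hypothetical name, not part of the original code) drops the trailing "on YouTube" and percent-encodes the rest:

```python
from urllib.parse import quote

SEARCH_URL = "https://www.youtube.com/results?search_query="


def build_search_url(speech: str) -> str:
    """Turn a phrase such as 'lofi hip hop on YouTube' into a search URL."""
    words = speech.split()
    # Drop the trailing "on YouTube" (the last two words), as the
    # question's code does with searchTerm[:-2].
    query = " ".join(words[:-2])
    # quote() percent-encodes spaces and other unsafe characters.
    return SEARCH_URL + quote(query)


print(build_search_url("lofi hip hop on YouTube"))
```

The returned string can then be passed to whichever function fetches and renders the page.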

Installing Playwright

python -m pip install playwright # this installs the Python package
python -m playwright install # this installs the browser executables, including Chromium

See the Playwright installation documentation.

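Edit: I also found a problem in your code. This line

videos = soup.findAll(attrs={"class":"yt-uix-tile-link"})[1:4]

is too loose: without a tag name it matches any element carrying that class, so you should specify the HTML element you are searching for.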

A good example would be:

videos = soup.findAll("div", attrs={
    "class": "yt-uix-tile-link"
})[1:4]
# or 
videos = soup.findAll("span", attrs={
    "class": "yt-uix-tile-link"
})[1:4]
# or whatever element it is
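To see what naming the element changes, here is a small sketch on made-up markup (the HTML, titles, and hrefs are hypothetical, and `"a"` is only an assumption about which element carries the class):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two different tags share the same class.
HTML = """
<div class="yt-uix-tile-link">not a link</div>
<a class="yt-uix-tile-link" href="/watch?v=1" title="First">First</a>
<a class="yt-uix-tile-link" href="/watch?v=2" title="Second">Second</a>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Without a tag name, every element carrying the class is returned.
any_tag = soup.find_all(attrs={"class": "yt-uix-tile-link"})

# With a tag name, only <a> elements with that class are returned.
anchors = soup.find_all("a", attrs={"class": "yt-uix-tile-link"})

# Collect titles and absolute links from the anchors only.
names = [a["title"] for a in anchors]
links = ["https://www.youtube.com" + a["href"] for a in anchors]
print(len(any_tag), len(anchors))
print(names, links)
```

Restricting the search to the tag that actually has `title` and `href` attributes also avoids a `KeyError` when an unrelated element happens to share the class.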
