Extracting values from Beautiful Soup
I'm new to programming and I'm building a voice assistant in Python. I found this code on GitHub, but it doesn't work properly. Here is the code:
def Play(speech):
    if speech.endswith("on YouTube"):
        searchTerm = speech.split()
        response = get("https://www.youtube.com/results?search_query=" + quote(" ".join(searchTerm[:-2])))
        soup = BeautifulSoup(response.text, "html.parser")
        videos = soup.findAll(attrs={"class":"yt-uix-tile-link"})[1:4]
        #Was [:3], changed to [1:4] to try to stop ads
        #Try to remove google ads if possible (May have fixed, but test this)
        names = list()
        links = list()
        for i in range(len(videos)):
            names.insert(i, videos[i]["title"])
            links.insert(i, "https://www.youtube.com" + videos[i]["href"])
        print("I found 3 videos. " + ". ".join(names), links)
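The URL passed as an argument to get() works fine, and so does the soup variable, but videos ends up empty, so nothing useful is printed at the end, and I don't know how to fix this.
Any ideas would be appreciated :)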
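You can't fetch the content of dynamic websites such as YouTube using requests. Sorry to be so blunt, but that's the truth.
You need to get the URL first, then render the response in the background using something like Chromium, and then pass the result to Beautiful Soup.
The rendering takes 1-2 seconds. That's just how it's done.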
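To illustrate why the plain request comes back empty, here is a minimal sketch. It assumes a made-up page whose result link is created by JavaScript; STATIC_HTML, RENDERED_HTML, ClassCounter and count_class are hypothetical names, and the markup is not YouTube's real markup. It uses the standard library's HTMLParser rather than BeautifulSoup so it runs without extra packages, but the idea is the same: a parser never executes scripts, so it only sees what is literally in the response.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Parse html and return how many elements carry class_name."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical response body as a plain HTTP client would receive it:
# the link only exists after the script has run in a browser.
STATIC_HTML = """
<html><body>
<div id="results"></div>
<script>
  const a = document.createElement("a");
  a.className = "yt-uix-tile-link";
  a.href = "/watch?v=demo";
  document.getElementById("results").appendChild(a);
</script>
</body></html>
"""

# Hypothetical DOM after a browser has executed the script.
RENDERED_HTML = """
<html><body>
<div id="results"><a class="yt-uix-tile-link" href="/watch?v=demo" title="Demo"></a></div>
</body></html>
"""

print(count_class(STATIC_HTML, "yt-uix-tile-link"))    # nothing to find before rendering
print(count_class(RENDERED_HTML, "yt-uix-tile-link"))  # the link exists after rendering
```

The first call finds no matching element and the second finds one, which mirrors the empty videos list in the question.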
Here is a snippet that fetches the rendered content of a dynamic website and passes it to BeautifulSoup:
# pip install playwright
from playwright.sync_api import sync_playwright
# After installing, you will be prompted to install Chromium,
# the "something" mentioned above
from bs4 import BeautifulSoup

def get_dynamic_soup(url: str) -> BeautifulSoup:
    with sync_playwright() as p:
        # Launch the browser
        browser = p.chromium.launch()
        # Open a new browser page
        page = browser.new_page()
        # Navigate to the URL so the page gets rendered
        page.goto(url)
        # Parse the rendered HTML with BeautifulSoup
        soup = BeautifulSoup(page.content(), "html.parser")
        browser.close()
        return soup

# quote and searchTerm are defined in your code
_url = "https://www.youtube.com/results?search_query=" + quote(" ".join(searchTerm[:-2]))
soup = get_dynamic_soup(_url)
# now you can do whatever you want with the soup
Then you can do your thing:
videos = soup.findAll(attrs={"class":"yt-uix-tile-link"})[1:4]
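The search URL above depends on searchTerm and quote from the question's code. As a rough sketch, the URL construction could be pulled into its own helper so it can be checked without a browser. build_search_url, SEARCH_URL and SUFFIX are hypothetical names, and the sketch keeps the question's behaviour of dropping the last two words ("on YouTube").

```python
from urllib.parse import quote

SEARCH_URL = "https://www.youtube.com/results?search_query="
SUFFIX = "on YouTube"


def build_search_url(speech):
    """Return the search URL for a phrase ending in 'on YouTube', else None."""
    if not speech.endswith(SUFFIX):
        return None
    words = speech.split()
    # Drop the trailing "on YouTube", as the question's code does
    query = " ".join(words[:-2])
    # quote() percent-encodes the spaces so the query is URL-safe
    return SEARCH_URL + quote(query)


print(build_search_url("lofi hip hop on YouTube"))
```

The returned string can then be passed to get_dynamic_soup in place of _url.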
Installing Playwright
python -m pip install playwright # this installs the python package
python -m playwright install # this installs the browser executables, including chromium
See Playwright's documentation.
Edit
I found a mistake in your code.
This line
videos = soup.findAll(attrs={"class":"yt-uix-tile-link"})[1:4]
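is too loose: you should specify which HTML element to search. BeautifulSoup accepts attrs on its own, but naming the tag narrows the match to the element you actually want.
A good example would be: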
videos = soup.findAll("div", attrs={
    "class": "yt-uix-tile-link"
})[1:4]
# or
videos = soup.findAll("span", attrs={
    "class": "yt-uix-tile-link"
})[1:4]
# or whatever element it is
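Once the lookup returns elements, the loop from the question can be tightened so that it reports the real number of results and copes with an empty list. This is only a sketch: describe_videos and BASE_URL are hypothetical names, and plain dicts stand in for the elements returned by findAll, which support the same ["title"] and ["href"] lookups.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.youtube.com"


def describe_videos(videos, limit=3):
    """Return (message, links) for up to `limit` result elements."""
    picked = list(videos)[:limit]
    if not picked:
        # Report an empty result explicitly instead of claiming 3 videos
        return "I found no videos.", []
    names = [v["title"] for v in picked]
    # urljoin turns a relative href such as /watch?v=... into a full URL
    links = [urljoin(BASE_URL, v["href"]) for v in picked]
    message = f"I found {len(picked)} videos. " + ". ".join(names)
    return message, links


# Stand-in data; in the real code this would be the result of findAll(...)
sample = [
    {"title": "First clip", "href": "/watch?v=aaa"},
    {"title": "Second clip", "href": "/watch?v=bbb"},
]
message, links = describe_videos(sample)
print(message)
print(links)
```

Returning the message instead of printing it also makes it easier to hand the text to the voice assistant's speech output.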