Beautiful Soup not working on this website
I want to scrape the URLs of all the items in the table, but when I try, nothing comes up. The code is pretty basic, so I understand why it might not work. However, even trying to scrape the title of the website returns nothing. I expected at least the h1 tag, since it sits outside the table...

Website: https://www.vanguard.com.au/personal/products/en/overview
import requests
from bs4 import BeautifulSoup

lists = []

url = 'https://www.vanguard.com.au/personal/products/en/overview'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

title = soup.find_all('h1', class_='heading2 gbs-font-vanguard-red')

for links in soup.find_all('a', style='padding-bottom: 1px;'):
    link_text = links['href']
    lists.append(link_text)

print(title)
print(lists)
If the problem is caused by JavaScript event listeners, I suggest scraping the site with beautifulsoup together with selenium. So let's use selenium to send the request and get the page source back, then parse it with beautifulsoup.
Also, you should use title = soup.find() instead of title = soup.find_all() so that you get just the one title.
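The difference between the two, sketched on a tiny inline document:

```python
from bs4 import BeautifulSoup

html = '<h1>First</h1><h1>Second</h1>'
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching Tag (or None if nothing matches)
print(soup.find('h1').text)

# find_all() always returns a list of Tags, even for a single match
print([h.text for h in soup.find_all('h1')])
```

So printing the result of find_all() gives you a list of tags rather than the tag itself, which is why the original output looked odd even when matches existed.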
Code example using Firefox:
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from webdriver_manager.firefox import GeckoDriverManager
from bs4 import BeautifulSoup

url = 'https://www.vanguard.com.au/personal/products/en/overview'

# Selenium 4 takes a Service object instead of the old executable_path argument
browser = webdriver.Firefox(service=Service(GeckoDriverManager().install()))
browser.get(url)
soup = BeautifulSoup(browser.page_source, 'html.parser')
browser.close()

lists = []

title = soup.find('h1', class_='heading2 gbs-font-vanguard-red')

for links in soup.find_all('a', style='padding-bottom: 1px;'):
    link_text = links['href']
    lists.append(link_text)

print(title)
print(lists)
Output:
<h1 class="heading2 gbs-font-vanguard-red">Investment products</h1>
['/personal/products/en/detail/8132', '/personal/products/en/detail/8219', '/personal/products/en/detail/8121',...,'/personal/products/en/detail/8217']
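Note that the scraped hrefs are relative paths. If you need absolute URLs, the standard library's urllib.parse.urljoin can resolve them against the page URL:

```python
from urllib.parse import urljoin

base = 'https://www.vanguard.com.au/personal/products/en/overview'
relative_links = ['/personal/products/en/detail/8132',
                  '/personal/products/en/detail/8219']

# urljoin resolves each root-relative path against the site origin
absolute_links = [urljoin(base, href) for href in relative_links]
print(absolute_links)
```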
The most common problem (with many modern pages): the page uses JavaScript to add elements, but requests/BeautifulSoup can't run JavaScript.

You may need Selenium to control a real web browser, which can run JavaScript.

This example uses only Selenium, without BeautifulSoup. I used xpath, but you could also use a css selector.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service

url = 'https://www.vanguard.com.au/personal/products/en/overview'

lists = []

# Selenium 4 takes a Service object instead of the old executable_path argument
#driver = webdriver.Chrome(service=Service("/path/to/chromedriver.exe"))
driver = webdriver.Firefox(service=Service("/path/to/geckodriver.exe"))
driver.get(url)

title = driver.find_element(By.XPATH, '//h1[@class="heading2 gbs-font-vanguard-red"]')
print(title.text)

all_items = driver.find_elements(By.XPATH, '//a[@style="padding-bottom: 1px;"]')
for links in all_items:
    link_text = links.get_attribute('href')
    print(link_text)
    lists.append(link_text)
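For comparison, here is a minimal sketch of the same two selections written as CSS selectors. To keep it runnable without a browser, it uses BeautifulSoup's select() on a stub of the page's markup (the HTML below is assumed for illustration, not fetched from the site); the same selector strings work with Selenium's By.CSS_SELECTOR:

```python
from bs4 import BeautifulSoup

# Stub HTML standing in for the rendered page (assumed structure)
html = '''
<h1 class="heading2 gbs-font-vanguard-red">Investment products</h1>
<a style="padding-bottom: 1px;" href="/personal/products/en/detail/8132">Fund A</a>
<a style="padding-bottom: 1px;" href="/personal/products/en/detail/8219">Fund B</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# CSS equivalent of //h1[@class="heading2 gbs-font-vanguard-red"]
title = soup.select_one('h1.heading2.gbs-font-vanguard-red')
print(title.text)

# CSS equivalent of //a[@style="padding-bottom: 1px;"]
links = [a['href'] for a in soup.select('a[style="padding-bottom: 1px;"]')]
print(links)
```

One difference to keep in mind: the XPath @class comparison matches the attribute string exactly, while the CSS .class form matches individual class tokens regardless of order.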
You also need the driver executable for your browser:
- ChromeDriver (for Chrome)
- GeckoDriver (for Firefox)
It's always more efficient to get the data from its source than through Selenium. It looks like the links are created from the portId.
import pandas as pd
import requests

url = 'https://www3.vanguard.com.au/personal/products/funds.json'
payload = {
    'context': '/personal/products/',
    'countryCode': 'au.ret',
    'paths': "[[['funds','legacyFunds'],'AU']]",
    'method': 'get'}

jsonData = requests.get(url, params=payload).json()
results = jsonData['jsonGraph']['funds']['AU']['value']

df1 = pd.json_normalize(results, record_path=['children'])
df2 = pd.json_normalize(results, record_path=['listings'])

df = pd.concat([df1, df2], axis=0)
df['url_link'] = 'https://www.vanguard.com.au/personal/products/en/detail/' + df['portId'] + '/Overview'
Output:
print(df[['fundName', 'url_link']])
fundName url_link
0 Vanguard Active Emerging Market Equity Fund https://www.vanguard.com.au/personal/products/...
1 Vanguard Active Global Credit Bond Fund https://www.vanguard.com.au/personal/products/...
2 Vanguard Active Global Growth Fund https://www.vanguard.com.au/personal/products/...
3 Vanguard Australian Corporate Fixed Interest I... https://www.vanguard.com.au/personal/products/...
4 Vanguard Australian Fixed Interest Index Fund https://www.vanguard.com.au/personal/products/...
.. ... ...
23 Vanguard MSCI Australian Small Companies Index... https://www.vanguard.com.au/personal/products/...
24 Vanguard MSCI Index International Shares (Hedg... https://www.vanguard.com.au/personal/products/...
25 Vanguard MSCI Index International Shares ETF https://www.vanguard.com.au/personal/products/...
26 Vanguard MSCI International Small Companies In... https://www.vanguard.com.au/personal/products/...
27 Vanguard International Credit Securities Hedge... https://www.vanguard.com.au/personal/products/...
[66 rows x 2 columns]
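To see what record_path is doing above, here is a minimal sketch on a mock payload shaped like the funds JSON. The nesting (jsonGraph/funds/AU/value with children and listings lists) mirrors the real response, but the fund data itself is invented:

```python
import pandas as pd

# Mock response shaped like funds.json (invented data)
jsonData = {
    'jsonGraph': {'funds': {'AU': {'value': [
        {'children': [{'fundName': 'Fund A', 'portId': '8132'}],
         'listings': [{'fundName': 'ETF B', 'portId': '8219'}]},
    ]}}}
}

results = jsonData['jsonGraph']['funds']['AU']['value']

# record_path drills into the nested list and flattens each entry to a row
df1 = pd.json_normalize(results, record_path=['children'])
df2 = pd.json_normalize(results, record_path=['listings'])
df = pd.concat([df1, df2], axis=0, ignore_index=True)

# Build the detail-page URL from each fund's portId
df['url_link'] = ('https://www.vanguard.com.au/personal/products/en/detail/'
                  + df['portId'] + '/Overview')
print(df[['fundName', 'url_link']])
```

The children and listings records land as separate DataFrames, which is why the real answer concatenates two json_normalize calls before deriving url_link.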