Not getting all the information when scraping bet365.com

Question

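I am having trouble scraping https://www.bet365.com/ with urllib.request and BeautifulSoup. The code below does not retrieve all the information on the page; for example, the players' names do not appear. Is there another framework or configuration I should use to extract this information?

My code is: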

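from bs4 import BeautifulSoup
import urllib.error
import urllib.request

url = "https://www.bet365.com/"
try:
    page = urllib.request.urlopen(url)
except urllib.error.URLError as exc:
    # Stop here: without a response there is nothing to parse below.
    raise SystemExit(f"An error occurred: {exc}")

soup = BeautifulSoup(page, 'html.parser')
soup = str(soup)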

Answer 1

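Looking at the source of the page in question, it appears that essentially all of the data is populated by JavaScript. BeautifulSoup is not a headless client: urllib.request only downloads the HTML and BeautifulSoup only parses it, so neither sees anything that JavaScript fills in afterwards. To scrape a page like this you need a tool that drives a real browser, optionally in headless mode, such as Selenium.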
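
To illustrate why a parser alone misses script-populated data, here is a minimal sketch. It uses Python's standard html.parser module, which is what BeautifulSoup's 'html.parser' backend is built on. The HTML snippet, the "players" id, and the helper names are hypothetical and are not taken from bet365.com.

```python
from html.parser import HTMLParser

# Hypothetical page: the server sends an empty container plus a script
# that would fill it in only when a browser executes it.
STATIC_HTML = """
<html><body>
  <div id="players"></div>
  <script>
    document.getElementById("players").textContent = "Player A";
  </script>
</body></html>
"""


class DivTextCollector(HTMLParser):
    """Collects the text found inside <div id="players">."""

    def __init__(self):
        super().__init__()
        self.inside_target = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "players":
            self.inside_target = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside_target = False

    def handle_data(self, data):
        # Script bodies also arrive here, but only as inert text
        # outside the target div; nothing ever runs them.
        if self.inside_target:
            self.text.append(data)


def players_text(html):
    """Return the text a parser sees inside the players container."""
    parser = DivTextCollector()
    parser.feed(html)
    return "".join(parser.text).strip()


print(repr(players_text(STATIC_HTML)))
```

The container comes back empty because the script is never executed; the same thing happens when the downloaded bet365 page is handed to BeautifulSoup.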

Answer 2

You need to fetch the page with Selenium instead of a plain HTTP request, and then pass the result to BeautifulSoup.

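from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

url = "https://www.bet365.com"
# Selenium 4 takes the driver path through a Service object;
# the older executable_path argument was deprecated and later removed.
driver = webdriver.Chrome(service=Service(r"the_path_of_driver"))

driver.get(url)

driver.maximize_window()  # optional: maximize the browser window
driver.implicitly_wait(60)  # optional: wait up to 60 s when looking up elements

soup = BeautifulSoup(driver.page_source, 'html.parser')  # parse the rendered DOM
driver.quit()  # close the browser once the page source has been captured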
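
One caveat: page_source can be read before the scripts have finished filling in the page. The sketch below is one possible way to wait until some element shows up before taking the source. The function name wait_for_rendered_source and its parameters are hypothetical. Selenium ships WebDriverWait for this purpose; the loop here only spells out the idea in plain Python and relies on the driver's get, find_elements, and page_source members.

```python
import time


def wait_for_rendered_source(driver, url, css_selector, timeout=30.0, poll=0.5):
    """Load url and return page_source once css_selector matches an element.

    driver is any Selenium WebDriver. The string "css selector" is the
    value behind selenium's By.CSS_SELECTOR constant.
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        # find_elements returns an empty list while nothing matches yet.
        if driver.find_elements("css selector", css_selector):
            return driver.page_source
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"{css_selector!r} did not appear within {timeout} s"
            )
        time.sleep(poll)
```

It could then be used as BeautifulSoup(wait_for_rendered_source(driver, url, "div.some-class"), 'html.parser'), where "div.some-class" is a placeholder for a selector that only exists once the data you want has loaded. If an implicit wait is also set on the driver, each find_elements call can itself block for up to that long.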

python

web-scraping

beautifulsoup

scrapy

screen-scraping