Scraping g2a[dot]com with BeautifulSoup
I'm trying to scrape this gaming site (g2a[dot]com) to get a list of the best prices for the games I'm looking for. The prices are usually inside a table (see image).
The code I use to reach the table is:
import urllib.request
from bs4 import BeautifulSoup

for gTitle in gameList:
    page = urllib.request.urlopen('http://www.g2a.com/%s.html' % gTitle).read()
    soup = BeautifulSoup(page, 'lxml')
    table = soup.find('table', class_='mp-user-rating')
But when I print `table`, it looks as if Python has collapsed all of the site's tables into one, with nothing inside:
>>> <table class="mp-user-rating jq-wh-offers wh-table"></table>
Is this a bug, or am I doing something wrong? I'm using Python 3.6.1 with BeautifulSoup4 and urllib. I'd like to stick with these if possible, but I'm open to alternatives.
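This isn't a BeautifulSoup bug: urllib only downloads the raw HTML, and on this site the offer rows are filled in later by JavaScript, so the `<table>` shell is empty in the source the parser ever sees. A minimal sketch of what's happening, using a hard-coded snippet in place of the live page (and the stdlib `html.parser` backend instead of lxml):

```python
from bs4 import BeautifulSoup

# What the server actually sends: the table shell is present,
# but its rows are injected later by JavaScript in the browser.
raw_html = """
<html><body>
  <table class="mp-user-rating jq-wh-offers wh-table"></table>
</body></html>
"""

soup = BeautifulSoup(raw_html, 'html.parser')
table = soup.find('table', class_='mp-user-rating')
print(table)                  # the empty table, exactly as in the question
print(table.find_all('tr'))   # [] -- no rows for BeautifulSoup to scrape
```

So the parser is faithfully reporting what it was given; to see the prices you need either a browser engine (Selenium) or the underlying API the JavaScript calls.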
Following Pedro's suggestion I tried Selenium, and it did the job. Thanks Pedro! For anyone interested, my code:
# importing packages
from selenium import webdriver

# game list
gameList = ['mass-effect-andromeda-origin-cd-key-preorder-global',
            'total-war-warhammer-steam-cd-key-preorder-global',
            'starcraft-2-heart-of-the-swarm-cd-key-global-1']

# scraping
chromePath = r"C:\Users\userName\Documents\Python\chromedriver.exe"
for gTitle in gameList:
    driver = webdriver.Chrome(chromePath)
    driver.get('http://www.g2a.com/%s.html' % gTitle)
    table = driver.find_element_by_xpath('//*[@id="about-game"]/div/div[3]/div[1]/table/tbody')
    bestPrice = table.text.split('\n')[2][12:][:6]
    bestPrice = float(bestPrice.replace(",", "."))
    print(bestPrice)
    driver.quit()  # close the browser so a window isn't leaked per game
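One caveat: the slice-based extraction (`[2][12:][:6]`) breaks as soon as the row text shifts by a single character. Matching the price with a regex is more forgiving; here is a sketch against a hard-coded sample row (the real layout of `table.text` may differ):

```python
import re

# Hypothetical sample of what table.text might contain; the real
# row layout on the site may differ.
row_text = "Seller\nsome-seller\nBest price: 12,34 EUR\nmore rows..."

# Find the first price-like number (two decimals, comma or dot
# separator) and normalise the separator before converting.
match = re.search(r'(\d+[.,]\d{2})', row_text)
if match:
    best_price = float(match.group(1).replace(',', '.'))
    print(best_price)  # 12.34
```

This keeps working even if the label text or its length changes, as long as a price in that numeric form appears in the row.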
I looked at the site. It loads the game list when you click "LOAD MORE", and again after that. If you open your browser's Network tab in the element inspector and filter for "xhr" requests only, you can see it hitting an API endpoint that loads each new batch of games. I used that API endpoint as my URL.
import requests

pageNum = 0  # start with 0 (a value below 0 also starts from 0)
while True:
    url = "https://www.g2a.com/lucene/search/filter?&minPrice=0.00&maxPrice=10000&cn=&kr=&stock=all&event=&platform=0&search=&genre=0&cat=0&sortOrder=popularity+desc&start={}&rows=12&steam_app_id=&steam_category=&steam_prod_type=&includeOutOfStock=&includeFreeGames=false&_=1492758607443".format(pageNum)
    # games_list holds each game as a dictionary, from which you can
    # pull out the information you need
    games_list = requests.get(url).json()['docs']
    if len(games_list) == 0:
        break  # past the maximum of the start parameter, games_list comes back empty
    pageNum += 12  # the start parameter grows by 12 on each "LOAD MORE" click
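The pagination loop above can be exercised without touching the network by swapping in a stub for the HTTP call. This sketch assumes the same contract as the endpoint (a `docs` list windowed by `start`/`rows` that comes back empty past the last page); the stub and its data are made up for illustration:

```python
# Stub standing in for requests.get(url).json(): 30 fake games, 12 per page.
FAKE_GAMES = [{'name': 'game-%d' % i} for i in range(30)]

def fetch_page(start, rows=12):
    # mimics the API's start/rows windowing; returns an empty
    # 'docs' list once start is past the end of the data
    return {'docs': FAKE_GAMES[start:start + rows]}

collected = []
pageNum = 0
while True:
    games_list = fetch_page(pageNum)['docs']
    if len(games_list) == 0:
        break              # past the last page: 'docs' is empty
    collected.extend(games_list)
    pageNum += 12          # same increment the site uses for "LOAD MORE"

print(len(collected))  # 30
```

The loop terminates precisely because the endpoint's empty-`docs` response doubles as an end-of-data signal, so no total count is needed up front.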