Scraping g2a[dot]com with BeautifulSoup
I'm trying to scrape this gaming site (g2a[dot]com) to get a list of the best prices for the games I'm looking for. The prices are usually inside a table (see image).
The code I use to reach the table is:
import urllib.request
from bs4 import BeautifulSoup

for gTitle in gameList:
    page = urllib.request.urlopen('http://www.g2a.com/%s.html' % gTitle).read()
    soup = BeautifulSoup(page, 'lxml')
    table = soup.find('table', class_='mp-user-rating')
But when I print `table`, it looks as if Python has collapsed all of the site's tables into one, with nothing inside:
>>> <table class="mp-user-rating jq-wh-offers wh-table"></table>
Is this a bug, or am I doing something wrong? I'm using Python 3.6.1 with BeautifulSoup4 and urllib. I'd like to stick with these if possible, but I'm open to alternatives.
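This isn't a BeautifulSoup bug: urllib only downloads the raw HTML, and on this site the offer rows are filled in later by JavaScript, so the `<table>` shell is empty in the source the parser ever sees. A minimal sketch of what's happening, using a hard-coded snippet in place of the live page (and the stdlib `html.parser` backend instead of lxml):

```python
from bs4 import BeautifulSoup

# What the server actually sends: the table shell is present,
# but its rows are injected later by JavaScript in the browser.
raw_html = """
<html><body>
  <table class="mp-user-rating jq-wh-offers wh-table"></table>
</body></html>
"""

soup = BeautifulSoup(raw_html, 'html.parser')
table = soup.find('table', class_='mp-user-rating')
print(table)                  # the empty table, exactly as in the question
print(table.find_all('tr'))   # [] -- no rows for BeautifulSoup to scrape
```

So the parser is faithfully reporting what it was given; to see the prices you need either a browser engine (Selenium) or the underlying API the JavaScript calls.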
Following Pedro's suggestion I tried Selenium, and it did the job. Thanks Pedro! For anyone interested, my code:
# importing packages
from selenium import webdriver

# game list
gameList = ['mass-effect-andromeda-origin-cd-key-preorder-global',
            'total-war-warhammer-steam-cd-key-preorder-global',
            'starcraft-2-heart-of-the-swarm-cd-key-global-1']

# scraping
chromePath = r"C:\Users\userName\Documents\Python\chromedriver.exe"
for gTitle in gameList:
    driver = webdriver.Chrome(chromePath)
    driver.get('http://www.g2a.com/%s.html' % gTitle)
    table = driver.find_element_by_xpath('//*[@id="about-game"]/div/div[3]/div[1]/table/tbody')
    bestPrice = table.text.split('\n')[2][12:][:6]
    bestPrice = float(bestPrice.replace(",", "."))
    print(bestPrice)
    driver.quit()  # close the browser so a window isn't leaked per game
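One caveat: the slice-based extraction (`[2][12:][:6]`) breaks as soon as the row text shifts by a single character. Matching the price with a regex is more forgiving; here is a sketch against a hard-coded sample row (the real layout of `table.text` may differ):

```python
import re

# Hypothetical sample of what table.text might contain; the real
# row layout on the site may differ.
row_text = "Seller\nsome-seller\nBest price: 12,34 EUR\nmore rows..."

# Find the first price-like number (two decimals, comma or dot
# separator) and normalise the separator before converting.
match = re.search(r'(\d+[.,]\d{2})', row_text)
if match:
    best_price = float(match.group(1).replace(',', '.'))
    print(best_price)  # 12.34
```

This keeps working even if the label text or its length changes, as long as a price in that numeric form appears in the row.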
I looked at the site. It loads the game list when you click "LOAD MORE", and again after that. If you open your browser's Network tab in the element inspector and filter for "xhr" requests only, you can see it hitting an API endpoint that loads each new batch of games. I used that API endpoint as my URL.
import requests

pageNum = 0  # start with 0 (a value below 0 also starts from 0)
while True:
    url = "https://www.g2a.com/lucene/search/filter?&minPrice=0.00&maxPrice=10000&cn=&kr=&stock=all&event=&platform=0&search=&genre=0&cat=0&sortOrder=popularity+desc&start={}&rows=12&steam_app_id=&steam_category=&steam_prod_type=&includeOutOfStock=&includeFreeGames=false&_=1492758607443".format(pageNum)
    # games_list holds each game as a dictionary, from which you can
    # pull out the information you need
    games_list = requests.get(url).json()['docs']
    if len(games_list) == 0:
        break  # past the maximum of the start parameter, games_list comes back empty
    pageNum += 12  # the start parameter grows by 12 on each "LOAD MORE" click
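The pagination loop above can be exercised without touching the network by swapping in a stub for the HTTP call. This sketch assumes the same contract as the endpoint (a `docs` list windowed by `start`/`rows` that comes back empty past the last page); the stub and its data are made up for illustration:

```python
# Stub standing in for requests.get(url).json(): 30 fake games, 12 per page.
FAKE_GAMES = [{'name': 'game-%d' % i} for i in range(30)]

def fetch_page(start, rows=12):
    # mimics the API's start/rows windowing; returns an empty
    # 'docs' list once start is past the end of the data
    return {'docs': FAKE_GAMES[start:start + rows]}

collected = []
pageNum = 0
while True:
    games_list = fetch_page(pageNum)['docs']
    if len(games_list) == 0:
        break              # past the last page: 'docs' is empty
    collected.extend(games_list)
    pageNum += 12          # same increment the site uses for "LOAD MORE"

print(len(collected))  # 30
```

The loop terminates precisely because the endpoint's empty-`docs` response doubles as an end-of-data signal, so no total count is needed up front.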