How can I scrape a webpage that uses JavaScript to load elements as you scroll?

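A friend asked me whether I could write a web scraping script to collect Pokémon data from a particular website.

I wrote the following code to render the JavaScript and grab a specific class in order to collect data from the site (https://www.smogon.com/dex/ss/pokemon/).

The problem is that the page loads more entries as you scroll down. Is there any way to scrape those as well? I'm new to web scraping, so I'm not entirely sure how all of this works.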

from requests_html import HTMLSession

def getPokemon(link):
    session = HTMLSession()
    r = session.get(link)
    r.html.render()
    for pokemon in r.html.find("div.PokemonAltRow"):
        print(pokemon)
    quit()

getPokemon('https://www.smogon.com/dex/ss/pokemon/')

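The data is actually present in the page source, so a plain HTTP request is enough and no rendering or scrolling is needed. See view-source:https://www.smogon.com/dex/ss/pokemon/ (the data is stored as a JavaScript variable inside a script tag).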

import requests
import re
import json


response = requests.get('https://www.smogon.com/dex/ss/pokemon/')

# Grab the JSON object assigned to the `dexSettings` JavaScript variable
match = re.search(r'dexSettings = (\{.*\})', response.text)
if match is None:
    raise SystemExit('dexSettings not found in the page source')

# json.loads() turns the JSON string into regular Python dicts and lists
data = json.loads(match.group(1))

# Drill down to the list of entries
items = data.get('injectRpcs', [])[1][1].get('items', [])

for row in items:
    print(row.get('name', ''))
    print(row.get('description', ''))

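A greedy regex such as `\{.*\}` can capture too much if another `}` appears later on the same line of the page. As a hedged alternative, the sketch below uses `json.JSONDecoder().raw_decode`, which parses exactly one JSON value and ignores whatever follows it. The helper name `extract_js_object` and the `SAMPLE_PAGE` string are made up for illustration; only the `dexSettings` marker and the `injectRpcs` access path come from the answer above, and the real page may be laid out differently.

```python
import json


def extract_js_object(page_text, marker="dexSettings = "):
    """Parse the JSON object that directly follows `marker` in `page_text`.

    Hypothetical helper: assumes the value after the marker is valid JSON.
    """
    start = page_text.find(marker)
    if start == -1:
        raise ValueError(f"marker {marker!r} not found in page text")
    remainder = page_text[start + len(marker):].lstrip()
    # raw_decode parses one complete JSON value and reports where it ended,
    # so anything after the object (";", more script code) is ignored.
    obj, _end = json.JSONDecoder().raw_decode(remainder)
    return obj


# Made-up stand-in for the real page; only the access path mirrors the answer.
SAMPLE_PAGE = (
    '<script>dexSettings = {"injectRpcs": [["k0", {}], '
    '["k1", {"items": [{"name": "Example", "description": "Demo entry"}]}]]}; '
    'other = {"x": 1};</script>'
)

settings = extract_js_object(SAMPLE_PAGE)
rows = settings.get("injectRpcs", [])[1][1].get("items", [])
for row in rows:
    print(row.get("name", ""), "-", row.get("description", ""))
```

To try it against the live site, pass `response.text` from the `requests.get(...)` call above instead of `SAMPLE_PAGE`. On the sample string, the greedy regex would also swallow the trailing `other = {"x": 1}` assignment and `json.loads` would fail, which is the case this variant is meant to handle.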