Google Scholar scraping with ip-rotator through AWS ApiGateway

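I am getting the error shown below. The code (George's method, https://whosebug.com/users/7173479/george) ran fine a few times at first and then started failing. It is probably related to the HTTP configuration, but I got lost in the AWS documentation. I am working in a Jupyter notebook. Can anyone help?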

Create the gateway object and initialise it in AWS:

import requests
from bs4 import BeautifulSoup
from requests_ip_rotator import ApiGateway

engine = 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q={}&btnG='

gateway = ApiGateway(engine,
                     access_key_id="KEY", access_key_secret="SECRET_KEY")
gateway.start()

Assign the gateway to the session:

session = requests.Session()
session.mount(engine, gateway)

Send the request (the IP will be randomised):

# Adjacent string literals are joined, so no stray indentation ends up in the header value
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
          'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'}

search_string = '{}+and+{}+and+{}+and+{}'.format('term1','term2','term3','term4')

url = engine.format(search_string)
print(url)

response = session.get(url,headers=header)
tree = BeautifulSoup(response.content,'lxml')
result = tree.find('div',id='gs_ab_md')

print(response.status_code)
print(result.text)
print(len(result.text))
number = [int(s.replace('.', '').replace(',', '')) for s in result.text.split()
          if s.replace('.', '').replace(',', '').isdigit()]

Delete the gateway:

gateway.shutdown()

=====================================

BadRequestException: An error occurred (BadRequestException) when calling the PutIntegration operation: Invalid HTTP endpoint specified for URI

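The site argument to the ApiGateway constructor in the requests-ip-rotator package should be just the site. It must not contain any part of the URI beyond the scheme, the domain name or IP address, and the port.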

If you change the constructor call to look like this:

gateway = ApiGateway("https://scholar.google.com")
gateway.start()

it will build the gateway endpoint correctly.
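
If you would rather not hard-code the site, one option is to derive it from the search URL template. The following is a minimal sketch using only the standard library; `site_root` is a hypothetical helper name of my own, not part of requests-ip-rotator.

```python
from urllib.parse import urlsplit


def site_root(url: str) -> str:
    """Return only scheme://host[:port] of url, dropping path, query and fragment.

    Hypothetical helper for illustration; not part of requests-ip-rotator.
    """
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {url!r}")
    return f"{parts.scheme}://{parts.netloc}"


# The full search URL template from the question
engine = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q={}&btnG="

# Only this part should be handed to the gateway
site = site_root(engine)
print(site)  # https://scholar.google.com
```

With that in place, you would pass `site` to `ApiGateway(...)` and to `session.mount(site, gateway)`, and keep `engine.format(search_string)` only for building the URL that goes to `session.get`.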