Scraping Kickstarter Project Page with Python

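For over a year I have been using the code below to scrape certain Kickstarter pages as part of my day-to-day work. Nothing malicious or nefarious, I just need to pull some information from the page to help project creators.

But over the past 4 to 6 months, Kickstarter has put some kind of blocker in place that stops me from accessing/scraping the actual page. All I get is: Backer or bot? Complete this security check to prove that you’re a human. Once you’ve passed this page, you might need to navigate away from your current screen on Kickstarter to refresh and move on. To avoid seeing this page again, double-check that JavaScript and cookies are enabled on your web browser and that you’re not blocking them from loading with an extension (e.g., ad blockers).

Can anyone think of a way to get past this check and actually land on the page? Any input would be very helpful.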

import os
import sys
import requests
import time
import urllib
import urllib.request
import shutil
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from csv import writer
from shutil import copyfile

print('What is the project URL?')
urlInp = input()

elClass = "rte__content"

chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)

driver.get(urlInp)
time.sleep(2)
html = driver.execute_script("return document.documentElement.outerHTML")
driver.quit()

soup = BeautifulSoup(html, 'lxml')
ele = soup.find('div', {'class': elClass})

print(soup)
quit()

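Looking at your script, it looks like you are trying to get the story.

Selenium is great for GUI testing, but it announces itself to the website as an automated browser (for example, via the navigator.webdriver property), which helps sites guard against DoS attacks. If you want to know more, there is more on this in the docs. My take is that these sites are going out of their way to block GUI automation for a reason. They have a lot of smart people working on it, so beating them will be an uphill battle.

As a better alternative, have you considered using the requests library? It lets you simulate the calls without really needing a browser.

I had a look in devtools, and there is even an API that returns the story information for you. You need a CSRF token, and you need to POST some data (which is already available in your URL). This will run much faster than Selenium and lets you do more.

Here is some code I put together for you. I picked a random Kickstarter page and hardcoded it into this demo: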

import requests
from bs4 import BeautifulSoup
from lxml import html  # provides html.fromstring, used to read the CSRF token

urlInp = 'https://www.kickstarter.com/projects/iamlunasol/soft-like-mochi-enamel-pins?ref=section-homepage-featured-project'

# start a session - this stores cookies between requests
s = requests.session()

# load the project page first to get the cookies and the token
landing = s.get(urlInp)
page = html.fromstring(landing.content)
csrf = page.xpath('//meta[@name="csrf-token"]')[0].get('content')
headers = {'x-csrf-token': csrf}

# the slug is the part of the URL after "projects/" and before the "?"
graphslug = urlInp.split("projects/")[1].split("?")[0]

# hit the api with the data
graphData = [{
        "operationName": "Campaign",
        "variables": {
            "slug": graphslug
        },
        "query": "query Campaign($slug: String!) {\n  project(slug: $slug) {\n    id\n    isSharingProjectBudget\n    risks\n    showRisksTab\n    story(assetWidth: 680)\n    currency\n    spreadsheet {\n      displayMode\n      public\n      url\n      data {\n        name\n        value\n        phase\n        rowNum\n        __typename\n      }\n      dataLastUpdatedAt\n      __typename\n    }\n    environmentalCommitments {\n      id\n      commitmentCategory\n      description\n      __typename\n    }\n    __typename\n  }\n}\n"
    }]

response = s.post("https://www.kickstarter.com/graph", json=graphData, headers=headers)

#process the response
graph_json = response.json()
story = graph_json[0]['data']['project']['story']
soup = BeautifulSoup(story, 'lxml')
print(soup)

The first few lines of the output are:

<html><body><p>Hi! I'm Felice Regina (<a href="https://www.instagram.com/iamlunasol/" rel="noopener" target="_blank">@iamlunasol</a> on Instagram) but everyone just calls me Luna! I'm an independent illustrator and pin designer! I've run many successful 
Kickstarter campaigns for enamel pins over the past few years. This campaign will help put new hard enamel pin designs into production.</p>
<p>Pledging ensures that the pins get produced, discounts when you purchase multiple pins, plus any freebies that we may unlock. If the campaign is successful, any extra pins will be sold at  + shipping in my <a href="https://shopiamlunasol.com/" rel="noopener" target="_blank">web store</a>.</p>

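This matches the story field as seen in the devtools JSON; the Preview tab is useful for checking this.

Finally, if you want to adapt this to use other queries, you can find the JSON data to send in the request payload under the Headers tab.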
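As a rough sketch of what adapting the payload might look like, the helpers below separate the slug extraction from the payload construction. The names slug_from_url, build_graph_payload and STORY_QUERY are made up for this illustration, and the trimmed query only reuses fields from the Campaign query above; whether the endpoint accepts a shortened query is an assumption, so copy the real payload from devtools if it does not.

```python
import json
from urllib.parse import urlparse


def slug_from_url(project_url):
    """Return the 'creator/project' slug from a Kickstarter project URL."""
    path = urlparse(project_url).path  # the ?ref=... query string is not part of the path
    marker = "/projects/"
    if marker not in path:
        raise ValueError(f"not a project URL: {project_url!r}")
    return path.split(marker, 1)[1].strip("/")


def build_graph_payload(operation_name, query, variables):
    """Wrap one GraphQL operation in the list-shaped body used in the demo above."""
    return [{
        "operationName": operation_name,
        "variables": variables,
        "query": query,
    }]


# Hypothetical trimmed query: it only uses fields that appear in the
# Campaign query above. It is an assumption that the API accepts it.
STORY_QUERY = (
    "query Campaign($slug: String!) {\n"
    "  project(slug: $slug) {\n"
    "    id\n"
    "    risks\n"
    "    story(assetWidth: 680)\n"
    "  }\n"
    "}\n"
)

url = ("https://www.kickstarter.com/projects/iamlunasol/"
       "soft-like-mochi-enamel-pins?ref=section-homepage-featured-project")
payload = build_graph_payload("Campaign", STORY_QUERY, {"slug": slug_from_url(url)})

# Inspect the body before sending it
print(json.dumps(payload, indent=2))
```

The resulting payload can be passed as json=payload to the same s.post call to https://www.kickstarter.com/graph, with the x-csrf-token header set as before.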