Bypass popup using Scrapy (yummy ice-cream)
I'm trying to scrape ice-cream-related data from a website, https://threetwinsicecream.com/products/ice-cream/. It seems like a very simple site. However, I can't get my spider to work, because of (what I think is) a JavaScript popup blocking my access. I've attached a stripped-down version of my Scrapy code below:
import scrapy


class NutritionSpider(scrapy.Spider):
    name = 'nutrition'
    allowed_domains = ['threetwinsicecream.com']
    start_urls = ['http://threetwinsicecream.com/']

    def parse(self, response):
        products = response.xpath("//div[@id='pints']/div[2]/div")
        for product in products:
            name = product.xpath(".//a/p/text()").extract_first()
            link = product.xpath(".//a/@href").extract_first()
            yield scrapy.Request(
                url=link,
                callback=self.parse_products,
                meta={
                    "name": name,
                    "link": link
                }
            )

    def parse_products(self, response):
        name = response.meta["name"]
        link = response.meta["link"]
        serving_size = response.xpath("//div[@id='nutritionFacts']/ul/li[1]/text()").extract_first()
        calories = response.xpath("//div[@id='nutritionFacts']/ul/li[2]/span/text()").extract_first()
        yield {
            "Name": name,
            "Link": link,
            "Serving Size": serving_size,
            "Calories": calories
        }
I came up with a workaround, but it requires manually writing out all the links to the various ice cream varieties, as shown below. I also tried disabling JavaScript on the site, but that didn't seem to work either.
def parse(self, response):
    urls = [
        "https://threetwinsicecream.com/products/ice-cream/madagascar-vanilla/",
        "https://threetwinsicecream.com/products/ice-cream/sea-salted-caramel/",
        ...
    ]
    for url in urls:
        yield scrapy.Request(
            url=url,
            callback=self.parse_products
        )

def parse_products(self, response):
    pass
Is there a way to bypass the popup using Scrapy, or do I have to use another tool such as Selenium? Thanks for your help!
The spider you posted works, at least on my machine. The only thing I needed to change was start_urls = ['http://threetwinsicecream.com/']
to start_urls = ['https://threetwinsicecream.com/products/ice-cream/'].
If you run into these kinds of problems, you can use Scrapy's open_in_browser
utility, which shows you in your browser exactly what Scrapy sees. It's documented here.