Scrapy program is not scraping all data
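I am writing a program in Scrapy to scrape the following page, https://www.trollandtoad.com/magic-the-gathering/aether-revolt/10066, and it only scrapes the first row of data and none of the rest. I think it has something to do with my for loop, but when I change the loop to use a broader selector, it outputs too much data: each row of data is output multiple times.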
    def parse(self, response):
        item = GameItem()
        saved_name = ""
        for game in response.css("div.row.mt-1.list-view"):
            saved_name = game.css("a.card-text::text").get() or saved_name
            item["Card_Name"] = saved_name.strip()
            if item["Card_Name"] != None:
                saved_name = item["Card_Name"].strip()
            else:
                item["Card_Name"] = saved_name
            yield item
Update #1
    def parse(self, response):
        for game in response.css('div.card > div.row'):
            item = GameItem()
            item["Card_Name"] = game.css("a.card-text::text").get()
            for buying_option in game.css('div.buying-options-table div.row:not(:first-child)'):
                item["Condition"] = game.css("div.col-3.text-center.p-1::text").get()
                item["Price"] = game.css("div.col-2.text-center.p-1::text").get()
                yield item
response.css("div.row.mt-1.list-view") returns only 1 selector, so the code in your loop runs only once. Try this instead:

    for game in response.css(".mt-1.list-view .card-text"):

and you will get a list of selectors to loop over.
I think you need the following CSS (later you can use it as a base for processing the buying-options container):
    def parse(self, response):
        for game in response.css('div.card > div.row'):
            item = GameItem()
            Card_Name = game.css("a.card-text::text").get()
            item["Card_Name"] = Card_Name.strip()
            for buying_option in game.css('div.buying-options-table div.row:not(:first-child)'):
                # process buying-option
                # maybe you need to move the GameItem() initialization inside this loop
                yield item
As you can see, I moved item = GameItem() inside the loop. There is also no need for saved_name here.
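Your code doesn't work because you are creating the GameItem() outside of your list loop. I must have missed the memo about these .get() and .getall() methods. Maybe someone can comment on how they differ from extract()?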
Your failing code:
    def parse(self, response):
        item = GameItem()  # this line creates only 1 game item per page
        saved_name = ""
        # This selector matches the wrapper around all the items on the page,
        # not the individual items, so the loop body runs only once.
        # See the fixed code for the corrected selector.
        for game in response.css("div.row.mt-1.list-view"):
            saved_name = game.css("a.card-text::text").get() or saved_name
            item["Card_Name"] = saved_name.strip()
            if item["Card_Name"] != None:
                saved_name = item["Card_Name"].strip()
            else:
                item["Card_Name"] = saved_name
            yield item
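Why creating the item once outside the loop is risky can be shown in plain Python, without Scrapy. In this sketch dicts stand in for GameItem and the card names are invented. Whether the problem shows up in a real crawl depends on when each yielded item is processed, but the aliasing itself is ordinary Python behavior.

```python
def shared_item(names):
    item = {}  # created once, outside the loop
    for name in names:
        item["Card_Name"] = name
        yield item  # the same object is yielded every time


def fresh_item(names):
    for name in names:
        item = {"Card_Name": name}  # a new object per iteration
        yield item


names = ["Card A", "Card B", "Card C"]
shared = list(shared_item(names))
fresh = list(fresh_item(names))

# Every entry in `shared` is the same dict, so all show the last name written.
shared_names = [entry["Card_Name"] for entry in shared]
fresh_names = [entry["Card_Name"] for entry in fresh]
```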
Fixed code that solves your problem:
    def parse(self, response):
        for game in response.css("div.product-col"):
            item = GameItem()
            item["Card_Name"] = game.css("a.card-text::text").get()
            if not item["Card_Name"]:
                # No card name: skip this product and move on to the next one.
                # A bare "return" here would instead end parse() for the whole
                # page, so use it only if nothing after this point should run.
                continue
            yield item
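The difference between continue and return inside a generator such as parse() can be checked with a small stand-alone sketch. The function names and the sample data are hypothetical, and dicts stand in for GameItem.

```python
def with_continue(names):
    for name in names:
        if not name:
            continue  # skip this entry, keep looping
        yield {"Card_Name": name.strip()}


def with_return(names):
    for name in names:
        if not name:
            return  # stop the generator; later entries are never reached
        yield {"Card_Name": name.strip()}


names = ["Card A", None, "Card C"]
kept_with_continue = [item["Card_Name"] for item in with_continue(names)]
kept_with_return = [item["Card_Name"] for item in with_return(names)]
```

With continue, only the nameless entry is dropped; with return, everything after it is lost as well.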