How to force Scrapy to show all items instead of just the last one?
I have the following spider:
import scrapy

class ScrapeNames(scrapy.Spider):
    name = 'final2'

    start_urls = [
        'https://www.trekearth.com/members/'
    ]

    def parse(self, response):
        for entry in response.xpath('//table[@class="member-table"]'):
            for name in entry.xpath('.//tr[@class="row"]/td/p/a/text()|.//tr/td/p/a/text()').extract():
                item['name'] = name
            for photo in entry.xpath('.//tr[@class="row"]/td[6]/a/text()|.//tr[@class="row"]/td[6]/text()|.//tr/td[6]/text()|.//tr/td[6]/a/text()').extract():
                item['photo'] = photo
            yield item
I want to extract the number of photos each user has taken and export it to a CSV. However, in my .csv I only get the last entry of the table on this page (see the screenshot below).
What I obviously want is the member name and the photo count for every user on the page. What am I doing wrong, and how can I fix it?
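(For context, the symptom of only the last table row reaching the CSV usually means a single item is being overwritten inside the loops and yielded only once per table. A minimal sketch of the per-row approach, assuming the FinalItem class from the items.py shown below and with the XPaths only lightly adapted from the ones above:

import scrapy
from final.items import FinalItem  # import path taken from the later edit; adjust if the project differs

class ScrapeNamesPerRow(scrapy.Spider):  # hypothetical spider name, for illustration only
    name = 'final2_sketch'

    start_urls = [
        'https://www.trekearth.com/members/'
    ]

    def parse(self, response):
        # Create and yield one fresh FinalItem per table row, so every member
        # produces its own record instead of overwriting a single shared item.
        for row in response.xpath('//table[@class="member-table"]//tr[@class="row"]'):
            item = FinalItem()
            item['name'] = row.xpath('./td/p/a/text()').extract_first()
            item['photo'] = row.xpath('./td[6]//text()').extract_first()
            yield item
)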
EDIT:
This is probably also relevant; my items.py file looks like this:
import scrapy

class FinalItem(scrapy.Item):
    name = scrapy.Field()
    photo = scrapy.Field()
    pass
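(One detail worth noting: a scrapy.Item raises a KeyError for any key that was not declared as a Field, so the names here must match the keys the spider assigns. The EDIT 2 spider below assigns item['photos'], which would need a matching field, roughly:

import scrapy

class FinalItem(scrapy.Item):
    name = scrapy.Field()
    photos = scrapy.Field()  # must match item['photos'] used in the EDIT 2 spider
)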
Follow-up question:
I made some improvements to my code; it is now:
class ScrapeMovies(scrapy.Spider):
    name = 'final2'

    start_urls = [
        'https://www.trekearth.com/members/'
    ]

    def parse(self, response):
        item = FinalItem()
        for entry in response.xpath('//table[@class="member-table"]'):
            for name in entry.xpath('.//tr[@class="row"]/td/p/a/text()|.//tr/td/p/a/text()').extract():
                names = entry.xpath('.//tr[@class="row"]/td/p/a/text()|.//tr/td/p/a/text()').extract()
                item['name'] = ";".join(names)
            for photos in entry.xpath('.//tr[@class="row"]/td[6]/a/text()|.//tr[@class="row"]/td[6]/text()|.//tr/td[6]/text()|.//tr/td[6]/a/text()').extract():
                photos = entry.xpath('.//tr[@class="row"]/td[6]/a/text()|.//tr[@class="row"]/td[6]/text()|.//tr/td[6]/text()|.//tr/td[6]/a/text()').extract()
                item['photo'] = ";".join(photos)
            yield item
However, this makes a mess of the final .csv, which now looks like this:
Is there a simple way to fix this?
An example of the desired output is in the .csv below:
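(One way to get one CSV row per member from the two lists this version already builds would be to pair them up instead of joining them. A sketch, assuming the name and photo-count lists stay aligned row by row:

def parse(self, response):
    for entry in response.xpath('//table[@class="member-table"]'):
        names = entry.xpath('.//tr[@class="row"]/td/p/a/text()|.//tr/td/p/a/text()').extract()
        photos = entry.xpath('.//tr[@class="row"]/td[6]/a/text()|.//tr[@class="row"]/td[6]/text()|.//tr/td[6]/text()|.//tr/td[6]/a/text()').extract()
        # Emit one item per (name, photo count) pair rather than one joined item per table.
        for name, photo in zip(names, photos):
            item = FinalItem()
            item['name'] = name
            item['photo'] = photo
            yield item
)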
EDIT 2:
My spider is now:
import scrapy
from final.items import FinalItem

class ScrapeMovies(scrapy.Spider):
    name = 'final2'

    start_urls = [
        'https://www.trekearth.com/members/'
    ]

    def parse(self, response):
        for row in response.xpath('//table[@class="member-table"]//tr[position() > 1]'):
            item = FinalItem()
            item['name'] = row.xpath('./td[2]//a/text()').extract_first()
            item['photos'] = row.xpath('string(./td[6])').extract_first()
            yield item
It still does not produce the correct result: I only get an empty .csv. settings.py has been updated.
UPDATE
You need to add this line to your settings.py (the website blocks the default Scrapy user agent):
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.170 Safari/537.36'
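(As a side note, the same setting can also be applied to just this spider via the custom_settings class attribute instead of editing settings.py; a sketch:

class ScrapeMovies(scrapy.Spider):
    name = 'final2'
    # Per-spider settings override the project-wide settings.py for this spider only.
    custom_settings = {
        'USER_AGENT': ('Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
                       '(KHTML, like Gecko) Chrome/66.0.3359.170 Safari/537.36'),
    }
)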
Then this will work:
def parse(self, response):
    for row in response.xpath('//table[@class="member-table"]//tr[position() > 1]'):
        item = FinalItem()
        item['name'] = row.xpath('./td[2]//a/text()').extract_first()
        item['photos'] = row.xpath('string(./td[6])').extract_first()
        yield item
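For completeness, the CSV itself can then be produced with Scrapy's built-in feed export by running the crawl from the project directory (members.csv is just an example filename):

scrapy crawl final2 -o members.csv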