Scrapy CSV output on multiple lines
Here is my spider:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from ..items import TutorialItem

class Tutorial1(BaseSpider):
    name = "Tut"
    allowed_domains = ['nytimes.com']
    start_urls = ["http://nytimes.com",]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="span-ab-layout layout"]')
        items = []
        for site in sites:
            item = TutorialItem()
            item['title'] = map(unicode.strip, site.select('//h2[@class="story-heading"]/a/text()').extract())
            item['time'] = map(unicode.strip, site.select('//time[@class="timestamp"]/text()').extract())
            yield item
Here is my output:
author time
By PETER BAKER,By JONATHAN M. KATZ and RICHARD PÉREZ-PEÑA,By NEIL MacFARQUHAR,By RON NIXON,By RICHARD GOLDSTEIN,By LOUISE STORY and ALEJANDRA XANIC von BERTRAB,By DAVID CARR,By A.O. SCOTT,By JERÉ LONGMAN,By THE EDITORIAL BOARD,By JON BECKMANN,By C. J. HUGHES,By JOANNE KAUFMAN 10:26 AM ET,1:08 PM ET,11:57 AM ET,8:33 AM ET,10:01 AM ET,12:35 PM ET,1:47 PM ET,10:36 AM ET,10:26 AM ET,9:49 AM ET,12:05 PM ET,9:21 AM ET,12:22 PM ET,11:52 AM ET,8:59 AM ET
By PETER BAKER,By JONATHAN M. KATZ and RICHARD PÉREZ-PEÑA,By NEIL MacFARQUHAR,By RON NIXON,By RICHARD GOLDSTEIN,By LOUISE STORY and ALEJANDRA XANIC von BERTRAB,By DAVID CARR,By A.O. SCOTT,By JERÉ LONGMAN,By THE EDITORIAL BOARD,By JON BECKMANN,By C. J. HUGHES,By JOANNE KAUFMAN 10:26 AM ET,1:08 PM ET,11:57 AM ET,8:33 AM ET,10:01 AM ET,12:35 PM ET,1:47 PM ET,10:36 AM ET,10:26 AM ET,9:49 AM ET,12:05 PM ET,9:21 AM ET,12:22 PM ET,11:52 AM ET,8:59 AM ET
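I indented it so it is clear where the duplication is.
My problem is that whenever I export my results to CSV, everything comes out on one big line. For some reason it also produces a duplicate. Can anyone help me with this conundrum?
I figured it out by experimenting: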
hxs = HtmlXPathSelector(response)
Apparently there is a huge difference between Selector and HtmlXPathSelector.
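For anyone hitting the same symptom, a likely contributing cause is visible in the spider itself: an XPath that starts with `//` searches the whole document even when called on a nested selector, so every `site` collects every heading on the page, and a list stored in one field is joined into a single comma-separated CSV cell. The sketch below illustrates the "one row per story, relative paths only" idea with just the standard library (`xml.etree.ElementTree` and `csv`) rather than Scrapy. The markup in `SAMPLE`, the `article` wrapper, and the helper names `extract_rows` and `to_csv` are assumptions made up for illustration, not the real page structure.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the real page.
SAMPLE = """
<div class="layout">
  <article class="story">
    <h2 class="story-heading"><a> First headline </a></h2>
    <time class="timestamp"> 10:26 AM ET </time>
  </article>
  <article class="story">
    <h2 class="story-heading"><a> Second headline </a></h2>
    <time class="timestamp"> 1:08 PM ET </time>
  </article>
</div>
"""


def extract_rows(markup):
    """Return one dict per story, searching relative to each story node."""
    root = ET.fromstring(markup.strip())
    rows = []
    for story in root.findall(".//article[@class='story']"):
        # The leading "." keeps the search inside this story only.
        title = story.findtext(".//h2[@class='story-heading']/a", default="")
        stamp = story.findtext(".//time[@class='timestamp']", default="")
        rows.append({"title": title.strip(), "time": stamp.strip()})
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text, one line per story."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["title", "time"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_csv(extract_rows(SAMPLE)))
```

In a Scrapy spider the same idea would mean looping over the per-story nodes, using paths that begin with `.//` inside the loop, and yielding one item with scalar fields per story, so the CSV exporter writes one row each.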