What would be the correct way to do this to handle multiple xpath selectors?
Very tired here: I've only slept 3 hours and have been awake for 20+ hours, so forgive my mistakes.
I'm trying to use multiple XPath selectors but can't get it to work. This code is clearly flawed: it repeats the description, ending up taking the last item's description and assigning it to every item. Screenshot and code below:
A visual representation of what I want to see: http://puu.sh/fBjA9/da85290fc2.png
Code (Scrapy web crawler, Python):
Spider:
def parse(self, response):
    item = DmozItem()
    for sel in response.xpath("//td[@class='nblu tabcontent']"):
        item['title'] = sel.xpath("a/big/text()").extract()
        item['link'] = sel.xpath("a/@href").extract()
        for sel in response.xpath("//td[contains(@class,'framed')]"):
            item['description'] = sel.xpath("b/text()").extract()
            yield item
Pipeline:
def process_item(self, item, spider):
    self.cursor.execute("SELECT * FROM data WHERE title= %s", item['title'])
    result = self.cursor.fetchall()
    if result:
        log.msg("Item already in database: %s" % item, level=log.DEBUG)
    else:
        self.cursor.execute(
            "INSERT INTO data(title, url, description) VALUES (%s, %s, %s)",
            (item['title'][0], item['link'][0], item['description'][0]))
        self.connection.commit()
        log.msg("Item stored: %s" % item, level=log.DEBUG)
    return item

def handle_error(self, e):
    log.err(e)
Thanks for reading and for any help.
I think you just need to move the item instantiation into the for loop:
def parse(self, response):
    for sel in response.xpath("//td[@class='nblu tabcontent']"):
        item = DmozItem()
        item['title'] = sel.xpath("a/big/text()").extract()
        item['link'] = sel.xpath("a/@href").extract()
        for sel in response.xpath("//td[contains(@class,'framed')]"):
            item['description'] = sel.xpath("b/text()").extract()
        yield item
The problem is that "//td[@class='nblu tabcontent']" and "//td[contains(@class,'framed')]" correspond one-to-one; you can't iterate one inside the other, or, as you discovered, you only ever get the last item of the inner list.
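Here is a minimal sketch of that failure mode, using plain dicts and hypothetical toy data in place of DmozItem and the scraped selectors; the inner loop runs to completion on every pass, so the last description wins every time:

```python
# Toy stand-ins for the two selector lists (hypothetical values).
titles = ["A", "B", "C"]
descriptions = ["desc-A", "desc-B", "desc-C"]

items = []
for title in titles:
    item = {"title": title}
    # This inner loop runs fully for EVERY title, so
    # item["description"] is overwritten on each iteration
    # and always ends up holding the last description.
    for desc in descriptions:
        item["description"] = desc
    items.append(item)

print(items)  # every item carries "desc-C"
```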
Instead, try:
def parse(self, response):
    title_links = response.xpath("//td[@class='nblu tabcontent']")
    descriptions = response.xpath("//td[contains(@class,'framed')]")
    for tl, d in zip(title_links, descriptions):
        item = DmozItem()
        item['title'] = tl.xpath("a/big/text()").extract()
        item['link'] = tl.xpath("a/@href").extract()
        item['description'] = d.xpath("b/text()").extract()
        yield item
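For illustration, here is how zip pairs the two lists positionally (plain lists used as stand-ins for the selector lists). One caveat worth knowing: zip stops at the shorter input, so if the page ever yields unequal counts of titles and descriptions, the extras are silently dropped:

```python
title_links = ["A", "B", "C"]
descriptions = ["desc-A", "desc-B"]  # one fewer, on purpose

# zip pairs elements by position and truncates to the shorter input.
pairs = list(zip(title_links, descriptions))
print(pairs)  # [('A', 'desc-A'), ('B', 'desc-B')] -- 'C' is dropped
```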