How to separate contents of item containers?
I'm building an email scraper, but I'm having trouble yielding items. The yielded item prints as:
{'email': ['ex1@email.com', 'ex2@email.com', 'ex3@email.com']}
Whenever I export it to CSV, I get an email header, and then all three emails are listed in the same cell. How can I separate them into individual cells?
class EmailSpider(CrawlSpider):
    name = 'emails'
    start_urls = ['https://example.com']
    parsed_url = urlparse(start_urls[0])
    rules = [Rule(LinkExtractor(allow_domains=parsed_url), callback='parse', follow=True)]

    def parse(self, response):
        # Scrape page for email links
        items = EmailscrapeItem()
        hrefs = [response.xpath("//a[starts-with(@href, 'mailto')]/text()").getall()]
        # Removes hrefs that are empty or None
        hrefs = [d for d in hrefs if d]
        # TODO: Add code to capture non-mailto emails as well
        # hrefs.append(response.xpath("//*[contains(text(), '@')]/text()"))
        for href in hrefs:
            items['email'] = href
            yield items
I figured out what I was doing wrong.
I changed the body of parse to:
for res in response.xpath("//a[starts-with(@href, 'mailto')]/text()"):
    item = EmailscrapeItem()
    item['email'] = res.get()
    yield item
This produced the correct result: one item, and therefore one CSV row, per email.
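For anyone hitting the same thing, the likely cause is that `getall()` already returns a list of strings, so wrapping it in `[...]` gives a list containing one list. The loop then runs once and the whole list ends up in a single `email` field, which the CSV export writes into one cell. Below is a minimal sketch of that difference in plain Python, without Scrapy. The helper names `one_item_per_email` and `to_csv` are made up for illustration, and the comma join only approximates what the exporter does with a multi-valued field.

```python
import csv
import io


def one_item_per_email(emails):
    """Yield a separate dict for each non-empty email string."""
    for email in emails:
        if email:
            yield {"email": email.strip()}


def to_csv(rows):
    """Write rows (dicts with an 'email' key) to a CSV string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["email"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Stand-in for what .getall() returns: a flat list of strings.
extracted = ["ex1@email.com", "ex2@email.com", "ex3@email.com"]

# Original approach: the extra [...] nests the list, so there is only one
# element to loop over, and the single item holds every address.
nested = [extracted]
single_item_rows = [{"email": ",".join(group)} for group in nested]

# Fixed approach: one item per email, so one CSV row per email.
per_email_rows = list(one_item_per_email(extracted))

print(to_csv(single_item_rows))
print(to_csv(per_email_rows))
```

The first CSV has a header and one data row with all three addresses in a single cell; the second has a header and three data rows.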