Scrapy: filter duplicate URLs extracted from a webpage

Okay, so I'm working with Scrapy. I'm currently trying to crawl "snipplr.com/all/page" and extract the URLs on each page. Then, the next time the spider runs and extracts URLs again, I want to filter out the ones already extracted by reading a CSV file. That was the plan, but somehow my results keep getting overwritten.

Process: crawl the webpage for links > check the CSV file whether the link has already been extracted in the past > if it has, IgnoreRequest/DropItem, otherwise append it to the CSV file.

Spider code:

import scrapy
import csv

from scrapycrawler.items import DmozItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.exceptions import IgnoreRequest
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["snipplr.com"]

    def start_requests(self):
        #for i in xrange(1000):
        for i in range(2, 5):
            yield self.make_requests_from_url("http://www.snipplr.com/all/page/%d" % i)

    def parse(self, response):
        for sel in response.xpath('//ol/li/h3'):
            item = DmozItem()
            #item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a[last()]/@href').extract()
            #item['desc'] = sel.xpath('text()').extract()

            reader = csv.reader(open('items.csv', 'w+'))  # think of it as a list
            for row in reader:
                if item['link'] == row:
                    raise IgnoreRequest()
                else:
                    f = open('items.csv', 'w')
                    f.write(item['link'])
            yield item

However, I'm getting weird results like these, which get overwritten the next time I crawl a different page; instead, I want the results appended to the file, not overwritten:

       clock/
/view/81327/chatting-swing-gui-tcp/
/view/82731/automate-system-setup/
/view/81215/rmi-factorial/
/view/81214/tcp-addition/
/view/81213/hex-octal-binary-calculator/
/view/81188/abstract-class-book-novel-magazine/
/view/81187/data-appending-to-file/
/view/81186/bouncing-ball-multithreading/
/view/81185/stringtokenizer/
/view/81184/prime-and-divisible-by-3/
/view/81183/packaging/
/view/81182/font-controller/
/view/81181/multithreaded-server-and-client/
/view/81180/simple-calculator/
/view/81179/inner-class-program/
/view/81114/cvv-dumps-paypals-egift-cards-tracks-wu-transfer-banklogins-/
/view/81038/magento-social-login/
/view/81037/faq-page-magento-extension/
/view/81036/slider-revolution-responsive-magento-extension/
/view/81025/bugfix-globalization/

There may be errors in the code, so feel free to edit and correct it as needed. Thanks for the help.

Edit: typo

Actually, you're doing this in the wrong place: writing out the crawled data belongs in an Item Pipeline.

Well, it would be better to use a real database and filter out duplicates with database constraints (a minimal sqlite sketch is included at the end of this answer), but anyway, if you still want to use a csv file, create a pipeline that first reads the existing contents and remembers them for future checks; then, for every item passed through the pipeline from the spider, check whether it has been seen before and write it out only if it hasn't:

from scrapy.exceptions import DropItem


class CsvWriterPipeline(object):
    def __init__(self):
        # remember the links written on previous runs
        # (the file may not exist yet on the very first run)
        try:
            with open('items.csv', 'r') as f:
                self.seen = set(line.strip() for line in f)
        except IOError:
            self.seen = set()

        self.file = open('items.csv', 'a+')

    def process_item(self, item, spider):
        # 'link' is a list here because the spider used .extract()
        link = item['link'][0] if item['link'] else ''

        if link in self.seen:
            raise DropItem('Duplicate link found %s' % link)

        self.file.write(link + '\n')
        self.seen.add(link)

        return item

Add it to ITEM_PIPELINES to turn it on:

ITEM_PIPELINES = {
    'myproject.pipelines.CsvWriterPipeline': 300
}

And your parse() callback would then only yield the items:

def parse(self, response):
    for sel in response.xpath('//ol/li/h3'):
        item = DmozItem()
        item['link'] = sel.xpath('a[last()]/@href').extract()

        yield item
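
As for the database option mentioned above, here is a minimal sketch of such a pipeline (assuming the standard-library sqlite3 module and a hypothetical snippets.db file, neither of which is part of the original project), relying on a UNIQUE constraint to reject duplicates:

import sqlite3

from scrapy.exceptions import DropItem


class SqliteWriterPipeline(object):
    """Hypothetical pipeline: dedupes links with a UNIQUE constraint."""

    def open_spider(self, spider):
        # snippets.db is an assumed filename for illustration only
        self.conn = sqlite3.connect('snippets.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS links (link TEXT UNIQUE)')

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # 'link' is a list because the spider used .extract()
        link = item['link'][0] if item['link'] else ''
        try:
            self.conn.execute('INSERT INTO links (link) VALUES (?)', (link,))
            self.conn.commit()
        except sqlite3.IntegrityError:
            # the UNIQUE constraint does the duplicate check for us
            raise DropItem('Duplicate link found %s' % link)
        return item

You would enable it through ITEM_PIPELINES the same way as the csv pipeline above.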

You are opening the file for writing only, which truncates it each time. To append to the file, you need to open it with 'a' or 'a+'.

Replace

f = open('items.csv', 'w')

with

f = open('items.csv', 'a')

From the BSD Library Functions Manual for fopen:

 The argument mode points to a string beginning with one of the following
 sequences (Additional characters may follow these sequences.):

 ``r''   Open text file for reading.  The stream is positioned at the
         beginning of the file.

 ``r+''  Open for reading and writing.  The stream is positioned at the
         beginning of the file.

 ``w''   Truncate file to zero length or create text file for writing.
         The stream is positioned at the beginning of the file.

 ``w+''  Open for reading and writing.  The file is created if it does not
         exist, otherwise it is truncated.  The stream is positioned at
         the beginning of the file.

 ``a''   Open for writing.  The file is created if it does not exist.  The
         stream is positioned at the end of the file.  Subsequent writes
         to the file will always end up at the then current end of file,
         irrespective of any intervening fseek(3) or similar.

 ``a+''  Open for reading and writing.  The file is created if it does not
         exist.  The stream is positioned at the end of the file.  Subse-
         quent writes to the file will always end up at the then current
         end of file, irrespective of any intervening fseek(3) or similar.