A web crawler in a self-contained Python file

Question

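I've found that many Scrapy tutorials (for example this good tutorial) require the steps listed below. The result is a project with a lot of files (project.cfg + several .py files + a specific folder structure).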

How can I make the steps listed below work as a single self-contained Python file that can be run with python mycrawler.py?

(Instead of a full project with lots of files, some .cfg files, etc., that has to be run with scrapy crawl myproject -o myproject.json... By the way, it seems that scrapy is a new shell command? Is that true?)

Note: here could be an answer to this question, but unfortunately it is deprecated and no longer works.

1) Create a new Scrapy project with scrapy startproject myproject

2) Define the data structure with an Item, like this:

from scrapy.item import Item, Field

class MyItem(Item):
    title = Field()
    link = Field()
    ...

3) Define the spider with:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class MySpider(BaseSpider):
    name = "myproject"
    allowed_domains = ["example.com"] 
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        ...

4) Run it with:

scrapy crawl myproject -o myproject.json

Answer 1

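scrapy is not a Unix command; it is just an executable, like python, javac, gcc, etc. Because you are using a framework, you have to use the commands that the framework provides. One thing you can do is create a bash script and simply execute it whenever you need it, or call it from another program.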
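As a rough sketch of the "call it from another program" idea, written in Python rather than bash: the wrapper below only shells out to the scrapy crawl command from the question. The function names build_crawl_command and run_crawl are made up for this illustration, and the full project still has to exist on disk.

```python
import subprocess


def build_crawl_command(spider_name, output_path):
    """Return the argument list for `scrapy crawl <spider> -o <output>`."""
    return ["scrapy", "crawl", spider_name, "-o", output_path]


def run_crawl(spider_name, output_path, project_dir="."):
    """Run the crawl as a child process and return its exit code.

    project_dir is assumed to be a directory inside the Scrapy project,
    since `scrapy crawl` looks up the project from where it is started.
    """
    command = build_crawl_command(spider_name, output_path)
    completed = subprocess.run(command, cwd=project_dir)
    return completed.returncode
```

Calling run_crawl("myproject", "myproject.json") from inside the project folder should then behave like step 4 of the question.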

You could also write a crawler with urllib3; it is simple.
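To give an idea of what that could look like, here is a minimal sketch and not a replacement for Scrapy: it fetches pages with urllib3, parses them with the standard library's html.parser, and stays on the start URL's host. The names LinkAndTitleParser, parse_page and crawl are invented for this example, and it deliberately ignores robots.txt, rate limiting and error handling.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkAndTitleParser(HTMLParser):
    """Collect the <title> text and all <a href> targets of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_page(html, base_url):
    """Return the page title, its own URL and the links found on it."""
    parser = LinkAndTitleParser(base_url)
    parser.feed(html)
    return {"title": parser.title.strip(), "link": base_url, "links": parser.links}


def crawl(start_url, max_pages=5):
    """Breadth-first crawl of at most max_pages pages on the start host."""
    import urllib3  # third-party; imported here so parse_page works without it

    http = urllib3.PoolManager()
    host = urlparse(start_url).netloc
    seen, queue, items = set(), [start_url], []
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        response = http.request("GET", url)
        page = parse_page(response.data.decode("utf-8", errors="replace"), url)
        items.append({"title": page["title"], "link": page["link"]})
        # Only follow links that stay on the same host.
        queue.extend(u for u in page["links"] if urlparse(u).netloc == host)
    return items
```

Something like crawl("http://www.example.com") would then return a list of title/link dictionaries that can be dumped with json.dump.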

Answer 2

You can run a Scrapy spider as a single script with runspider, without having to start a project. Is this what you want?

# myscript.py
from scrapy import Spider
from scrapy.item import Item, Field


class MyItem(Item):
    title = Field()
    link = Field()


class MySpider(Spider):
    name = 'samplespider'
    start_urls = ['http://www.example.com']

    def parse(self, response):
        # One item per page: all <h1> texts plus the page URL.
        item = MyItem()
        item['title'] = response.xpath('//h1/text()').extract()
        item['link'] = response.url
        yield item

Now you can run it with scrapy runspider myscript.py -o out.json

python

web-crawler

scrapy

web-scraping