Functioning Scrapy spider now dies after one request?

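I had a working Scrapy spider, and now it dies after a single request. I have no idea what is going on. I have posted the complete output of the run and my spider code.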

jeff@deltaskelta:~/Desktop/hangulscrape/hangulscrape$ scrapy crawl englishwiki -o test.json
2015-01-13 22:20:41+0900 [scrapy] INFO: Scrapy 0.24.4 started (bot: hangulscrape)
2015-01-13 22:20:41+0900 [scrapy] INFO: Optional features available: ssl, http11, boto, django
2015-01-13 22:20:41+0900 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'hangulscrape.spiders', 'FEED_URI': 'test.json', 'SPIDER_MODULES': ['hangulscrape.spiders'], 'BOT_NAME': 'hangulscrape', 'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeue.FifoMemoryQueue', 'DEPTH_PRIORITY': 1, 'FEED_FORMAT': 'json', 'SCHEDULER_DISK_QUEUE': 'scrapy.squeue.PickleFifoDiskQueue'}
2015-01-13 22:20:42+0900 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-01-13 22:20:43+0900 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-01-13 22:20:43+0900 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-01-13 22:20:43+0900 [scrapy] INFO: Enabled item pipelines: 
2015-01-13 22:20:43+0900 [englishwiki] INFO: Spider opened
2015-01-13 22:20:43+0900 [englishwiki] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-01-13 22:20:43+0900 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6036
2015-01-13 22:20:43+0900 [scrapy] DEBUG: Web service listening on 127.0.0.1:6093
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Crawled (200) <GET http://en.wikipedia.org/wiki/Garden_warbler> (referer: None)
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'en.wikipedia.org': <GET http://en.wikipedia.org/wiki/Garden_warbler>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'www.mediawiki.org': <GET https://www.mediawiki.org/wiki/Special:MyLanguage/Extension:TimedMediaHandler/Client_download>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'www.iucnredlist.org': <GET http://www.iucnredlist.org/details/22716906>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'dx.doi.org': <GET http://dx.doi.org/10.1111%2Fj.1463-6409.2006.00221.x>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'www.ncbi.nlm.nih.gov': <GET http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1794596>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'www.birdlife.org': <GET http://www.birdlife.org/datazone/speciesfactsheet.php?id=8074>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'www.jstor.org': <GET http://www.jstor.org/stable/4454>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'blx1.bto.org': <GET http://blx1.bto.org/birdfacts/results/bob12760.htm>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'www.webcitation.org': <GET http://www.webcitation.org/6HLrPClx6>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'www.euring.org': <GET http://www.euring.org/data_and_codes/longevity-voous.htm>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'www.tandfonline.com': <GET http://www.tandfonline.com/doi/pdf/10.1080/00063657909476637>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'www.nhm.ac.uk': <GET http://www.nhm.ac.uk/research-curation/scientific-resources/biodiversity/uk-biodiversity/british-flea-distribution/database/Searchpage.do?county=&fleaname=&host=&hostname=Garden+Warbler&listoption=&publication=&search=Search&sortorder=&species=>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'www.biodiversitylibrary.org': <GET http://www.biodiversitylibrary.org/item/88617>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'oops.uni-oldenburg.de': <GET http://oops.uni-oldenburg.de/214/>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'commons.wikimedia.org': <GET http://commons.wikimedia.org/wiki/Sylvia_borin>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'ibc.lynxeds.com': <GET http://ibc.lynxeds.com/species/garden-warbler-sylvia-borin>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'www.ornithos.de': <GET http://www.ornithos.de/Ornithos/Feather_Collection/Sylvia_borin/Sylvia_borin.htm>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'donate.wikimedia.org': <GET https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?uselang=en&utm_campaign=C13_en.wikipedia.org&utm_medium=sidebar&utm_source=donate>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'shop.wikimedia.org': <GET http://shop.wikimedia.org/>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'www.wikidata.org': <GET http://www.wikidata.org/wiki/Q202478>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'kbd.wikipedia.org': <GET http://kbd.wikipedia.org/wiki/%D0%92%D1%8D%D0%B4%D0%B3%D1%8A%D1%83%D0%B0%D0%B1%D0%B6%D1%8D>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'af.wikipedia.org': <GET http://af.wikipedia.org/wiki/Tuinsanger>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'ar.wikipedia.org': <GET http://ar.wikipedia.org/wiki/%D8%AF%D8%AE%D9%84%D8%A9_%D8%A7%D9%84%D8%A8%D8%B3%D8%A7%D8%AA%D9%8A%D9%86>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'ba.wikipedia.org': <GET http://ba.wikipedia.org/wiki/%D2%BA%D0%B0%D2%99_%D0%BA%D0%B8%D0%BB%D0%B5%D0%B9%D0%B5%D0%B3%D0%B5>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'bg.wikipedia.org': <GET http://bg.wikipedia.org/wiki/%D0%93%D1%80%D0%B0%D0%B4%D0%B8%D0%BD%D1%81%D0%BA%D0%BE_%D0%BA%D0%BE%D0%BF%D1%80%D0%B8%D0%B2%D0%B0%D1%80%D1%87%D0%B5>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'br.wikipedia.org': <GET http://br.wikipedia.org/wiki/Devedig-liorzh>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'ca.wikipedia.org': <GET http://ca.wikipedia.org/wiki/Tallarol_gros>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'ceb.wikipedia.org': <GET http://ceb.wikipedia.org/wiki/Sylvia_borin>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'cs.wikipedia.org': <GET http://cs.wikipedia.org/wiki/P%C4%9Bnice_slav%C3%ADkov%C3%A1>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'cy.wikipedia.org': <GET http://cy.wikipedia.org/wiki/Telor_yr_Ardd>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'da.wikipedia.org': <GET http://da.wikipedia.org/wiki/Havesanger>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'de.wikipedia.org': <GET http://de.wikipedia.org/wiki/Gartengrasm%C3%BCcke>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'et.wikipedia.org': <GET http://et.wikipedia.org/wiki/Aed-p%C3%B5%C3%B5salind>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'es.wikipedia.org': <GET http://es.wikipedia.org/wiki/Sylvia_borin>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'eo.wikipedia.org': <GET http://eo.wikipedia.org/wiki/%C4%9Cardensilvio>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'eu.wikipedia.org': <GET http://eu.wikipedia.org/wiki/Baso-txinbo>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'fa.wikipedia.org': <GET http://fa.wikipedia.org/wiki/%D8%A2%D9%84%D9%88%DA%86%D9%87%E2%80%8C%D8%AE%D9%88%D8%B1%DA%A9>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'fo.wikipedia.org': <GET http://fo.wikipedia.org/wiki/Gar%C3%B0lj%C3%B3mari>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'fr.wikipedia.org': <GET http://fr.wikipedia.org/wiki/Fauvette_des_jardins>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'gl.wikipedia.org': <GET http://gl.wikipedia.org/wiki/Papuxa_picafollas>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'hy.wikipedia.org': <GET http://hy.wikipedia.org/wiki/%D4%B1%D5%B5%D5%A3%D5%B8%D6%82_%D5%B7%D5%A1%D5%B0%D6%80%D5%AB%D5%AF>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'io.wikipedia.org': <GET http://io.wikipedia.org/wiki/Bekafiko>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'it.wikipedia.org': <GET http://it.wikipedia.org/wiki/Sylvia_borin>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'he.wikipedia.org': <GET http://he.wikipedia.org/wiki/%D7%A1%D7%91%D7%9B%D7%99_%D7%90%D7%A4%D7%95%D7%A8>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'kk.wikipedia.org': <GET http://kk.wikipedia.org/wiki/%D0%91%D0%B0%D2%9B_%D1%81%D0%B0%D0%BD%D0%B4%D1%83%D2%93%D0%B0%D1%88>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'kv.wikipedia.org': <GET http://kv.wikipedia.org/wiki/%D0%A1%D1%8D%D1%82%D3%A7%D1%80_%D0%BA%D0%B0%D0%B9>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'lt.wikipedia.org': <GET http://lt.wikipedia.org/wiki/Sodin%C4%97_devynbals%C4%97>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'li.wikipedia.org': <GET http://li.wikipedia.org/wiki/Zengersj_van_de_Ouwe_Waereld>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'hu.wikipedia.org': <GET http://hu.wikipedia.org/wiki/Kerti_posz%C3%A1ta>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'mk.wikipedia.org': <GET http://mk.wikipedia.org/wiki/%D0%93%D1%80%D0%B0%D0%B4%D0%B8%D0%BD%D1%81%D0%BA%D0%BE_%D0%B3%D1%80%D0%BC%D1%83%D1%88%D0%B0%D1%80%D1%87%D0%B5>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'ms.wikipedia.org': <GET http://ms.wikipedia.org/wiki/Burung_siul_taman>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'nl.wikipedia.org': <GET http://nl.wikipedia.org/wiki/Tuinfluiter>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'nap.wikipedia.org': <GET http://nap.wikipedia.org/wiki/Fucetula>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'no.wikipedia.org': <GET http://no.wikipedia.org/wiki/Hagesanger>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'nn.wikipedia.org': <GET http://nn.wikipedia.org/wiki/Hagesongar>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'ps.wikipedia.org': <GET http://ps.wikipedia.org/wiki/%D9%BC%D8%B1%D8%A7%DA%A9>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'pms.wikipedia.org': <GET http://pms.wikipedia.org/wiki/Sylvia_borin>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'pl.wikipedia.org': <GET http://pl.wikipedia.org/wiki/Gaj%C3%B3wka_(ptak)>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'pt.wikipedia.org': <GET http://pt.wikipedia.org/wiki/Felosa-das-figueiras>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'ru.wikipedia.org': <GET http://ru.wikipedia.org/wiki/%D0%A1%D0%B0%D0%B4%D0%BE%D0%B2%D0%B0%D1%8F_%D1%81%D0%BB%D0%B0%D0%B2%D0%BA%D0%B0>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'sl.wikipedia.org': <GET http://sl.wikipedia.org/wiki/Vrtna_penica>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'fi.wikipedia.org': <GET http://fi.wikipedia.org/wiki/Lehtokerttu>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'sv.wikipedia.org': <GET http://sv.wikipedia.org/wiki/Tr%C3%A4dg%C3%A5rdss%C3%A5ngare>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'tr.wikipedia.org': <GET http://tr.wikipedia.org/wiki/Boz_%C3%B6tle%C4%9Fen>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'uk.wikipedia.org': <GET http://uk.wikipedia.org/wiki/%D0%9A%D1%80%D0%BE%D0%BF%D0%B8%D0%B2'%D1%8F%D0%BD%D0%BA%D0%B0_%D1%81%D0%B0%D0%B4%D0%BE%D0%B2%D0%B0>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'vi.wikipedia.org': <GET http://vi.wikipedia.org/wiki/Sylvia_borin>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'war.wikipedia.org': <GET http://war.wikipedia.org/wiki/Sylvia_borin>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'creativecommons.org': <GET http://creativecommons.org/licenses/by-sa/3.0/>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'wikimediafoundation.org': <GET http://wikimediafoundation.org/wiki/Terms_of_Use>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'www.wikimediafoundation.org': <GET http://www.wikimediafoundation.org/>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'en.m.wikipedia.org': <GET http://en.m.wikipedia.org/w/index.php?mobileaction=toggle_view_mobile&title=Garden_warbler>
2015-01-13 22:20:47+0900 [englishwiki] INFO: Closing spider (finished)
2015-01-13 22:20:47+0900 [englishwiki] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 234,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 36442,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 1, 13, 13, 20, 47, 424202),
     'log_count/DEBUG': 74,
     'log_count/INFO': 7,
     'offsite/domains': 71,
     'offsite/filtered': 289,
     'request_depth_max': 1,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2015, 1, 13, 13, 20, 43, 114492)}
2015-01-13 22:20:47+0900 [englishwiki] INFO: Spider closed (finished)

The spider code is as follows:

import scrapy
from hangulscrape.items import HangulScrapeItem
import re
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
import json


class HangulSpider(CrawlSpider):

    name='englishwiki'
    allowed_domains = ['en.wikipedia.org/wiki/']
    start_urls = [
    'http://en.wikipedia.org/wiki/Garden_warbler'
    ]

    rules = (
        Rule(SgmlLinkExtractor(), callback='parse_it', follow=True),
        )

    def parse_it(self, response):

        the_item = HangulScrapeItem()
        response.body.decode('utf-8')
        body = response.xpath('//*[@id="mw-content-text"]//text()').extract()

        english_dict = {}
        for i in body:
            english_words = re.findall('[a-zA-Z\'-]+' ,i)
            if english_words:
                for j in english_words:
                    if len(j) > 1:
                        word = j.lower()
                        if word in english_dict:
                            english_dict[word] += 1
                        else:
                            english_dict[word] = 1

        jsondump = json.dumps(english_dict)
        the_item['word'] = jsondump
        the_item['site'] = response.url

        return the_item

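I can't reproduce all of your code because the item definition is missing. In any case, here is a simplified version of your code:

  • Changed the parse function
  • Removed the item and the JSON conversion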

    class testSpider(CrawlSpider):
        name='englishwiki'
        allowed_domains = ["en.wikipedia.org/wiki/Garden_warbler"]
        start_urls = ["http://en.wikipedia.org/wiki/Garden_warbler"]
        rules = (Rule(SgmlLinkExtractor(), callback='parse', follow=True),)
        def parse(self, response):
            response.body.decode('utf-8')
            body = response.xpath('//*[@id="mw-content-text"]//text()').extract()
    
            english_dict = {}
            for i in body:
                english_words = re.findall('[a-zA-Z\'-]+' ,i)
                if english_words:
                    for j in english_words:
                        if len(j) > 1:
                            word = j.lower()
                            if word in english_dict:
                                english_dict[word] += 1
                            else:
                                english_dict[word] = 1
            print english_dict    
    

Output:

1, u'times': 1, u'length': 2, u'south': 5, u'upperparts': 4, u'isbn': 19, u'evans': 1, u'scene': 1, u'reaches': 1, u'svalbard': 1, u'management': 1, u'atricapilla': 1, u'their': 15, u'vocalisation': 1, u'intermediate': 1, u'zoologica': 1, u'shell': 1, u'accompany': 1, u'july': 1, u'ben': 2, u'borini': 1, u'protista': 1, u'sweden': 2, u'migration': 15, u'clip': 2, u'have': 17, u'throat': 1, u'able': 1, u'relatives': 1, u'which': 13, u'vegetation': 2, u'digestive': 1, u'sylviae': 1, u'alarmed': 1, u'class': 1, u'afresh': 1, u'conspecifics': 2, u"dohrn's": 1, u'spleen': 1, u'clive': 1, u'jean-louis': 1, u'sylviid': 1, u'painting': 1, u'phenology': 2, u'warblers': 31, u'selection': 1, u'biebach': 2, u'text': 1, u'supported': 1, u'nagy': 1, u'longevity': 1, u'fear': 1, u'pause': 1, u'interspecific': 3, u'should': 1, u'jan': 1, u'bernard': 1, u'arabian': 1, u'piano': 1, u'local': 2, u'means': 2, u'borin': 16, u'areas': 6, u'organ': 2, u'she': 1, u'nightingale': 1, u'available': 1, u'mid-september': 1, u'edition': 1, u'boddaert': 7, u'oldenburg': 2, u'placed': 2, u'pattern': 1, u'southward': 2, u'identification': 2, u'closed': 4, u'bedfordshire': 1, u'simms': 3, u'kidneys': 1, u'publishers': 1, u'animalia': 1, u'miroslav': 1, u'jon': 2, u'seventeen': 1, u'olga': 1, u'april': 4, u'sexes': 1, u'passing': 1, u'grounds': 3, u'ch': 1, u'cm': 6, u'reorientating': 1, u'coexistence': 1, u"cock's": 1, u'johann': 1, u'fledging': 1, u'anthelme': 1, u'table': 1, u'second': 1, u'silvia': 1, u'quia': 1, u'long-tailed': 2

I think that if you remove:

allowed_domains = ['en.wikipedia.org/wiki/']

you will allow the spider to load pages from domains that are not part of en.wikipedia.org/wiki.

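This log message shows the domain filtering in action:

DEBUG: Filtered offsite request to 'dx.doi.org':

It is filtering offsite requests, i.e. not letting the spider crawl them.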

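I found my problem. It happened because I was trying to specify a subdirectory of the domain in allowed_domains. It seems that domains and subdomains are fine there, but subdirectories (URL paths) are not.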

So (en.wikipedia.org = good) and (en.wikipedia.org/wiki = bad)
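
To illustrate why a path in allowed_domains filters out everything, including en.wikipedia.org itself (as the log above shows), here is a rough, standard-library-only approximation of a hostname-based offsite check. It is a sketch and not Scrapy's actual code: the function names build_host_regex and is_onsite are made up for this example, and the assumption is that the filter compares each entry against the request's hostname only, never against its path.

```python
import re
from urllib.parse import urlparse


def build_host_regex(allowed_domains):
    """Build a regex matching each allowed domain and any of its subdomains."""
    if not allowed_domains:
        return re.compile("")  # nothing configured: every host passes
    alternatives = "|".join(re.escape(domain) for domain in allowed_domains)
    return re.compile(r"^(.*\.)?(%s)$" % alternatives)


def is_onsite(url, allowed_domains):
    """Return True if the URL's hostname falls under an allowed domain."""
    # Only the hostname is compared; the path is never looked at.
    host = urlparse(url).hostname or ""
    return bool(build_host_regex(allowed_domains).search(host))


url = "http://en.wikipedia.org/wiki/Garden_warbler"
print(is_onsite(url, ["en.wikipedia.org"]))        # True
print(is_onsite(url, ["en.wikipedia.org/wiki/"]))  # False
print(is_onsite(url, ["wikipedia.org"]))           # True
```

Under that assumption, a hostname such as en.wikipedia.org can never end in "/wiki/", so every extracted link is treated as offsite and the crawl stops after the start URL.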

The right place to specify which links to follow is in the rule that extracts the links.
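
A spider configuration along those lines might look like the sketch below. Several things here are assumptions rather than part of the original code: the import paths are the ones used from Scrapy 1.0 onward (in 0.24 the equivalents live under scrapy.contrib, as in the question), the allow pattern r"/wiki/" is only a guess at which links are wanted, and the callback is a placeholder that yields the page URL instead of the word counts.

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class EnglishWikiSpider(CrawlSpider):
    name = "englishwiki"
    # Hostnames only: no scheme and no path.
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Garden_warbler"]

    rules = (
        # Path restrictions go in the link extractor's `allow` regex,
        # not in allowed_domains. The pattern is an illustrative guess.
        Rule(LinkExtractor(allow=r"/wiki/"), callback="parse_it", follow=True),
    )

    def parse_it(self, response):
        # Placeholder callback: the real spider would build its item here.
        yield {"site": response.url}
```

The callback is deliberately not named parse, since CrawlSpider uses that method name for its own rule handling.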