Scrapy LinkExtractor Incorrect Link Scraped
When using a Scrapy Rule with a LinkExtractor, the links found on the page that match my regular expression are not quite right. I'm probably missing something obvious, but I don't see it...
Every link extracted from the page that matches my regex is correct, except that an '=' sign seems to be appended to the end of each one. What am I doing wrong?
URL being scraped:
http://rotoguru1.com/cgi-bin/hstats.cgi?pos=0&sort=1&game=k&colA=0&daypt=0&xavg=3&show=1&fltr=00
Example link I want to scrape:
<a href="playrh.cgi?3986">Durant, Kevin</a>
My Rule / LinkExtractor / regular expression:
rules = [
    # Target link looks like: <a href="playrh.cgi?3986">Durant, Kevin</a>
    Rule(LinkExtractor(r'playrh\.cgi\?[0-9]{4}$'),
         callback='parse_player',
         follow=False)
]
Scraped URL (taken from the response object in parse_player):
'http://rotoguru1.com/cgi-bin/playrh.cgi?4496='
Notice the extra '=' appended to the end of the URL!
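To check whether the '=' shows up at link-extraction time (rather than somewhere later in the request cycle), the extractor can be run in isolation against a canned response. A minimal sketch, assuming Scrapy 1.0.x; the HTML body below is illustrative, not taken from the actual site:

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# Canned response containing one of the target anchors.
response = HtmlResponse(
    url='http://rotoguru1.com/cgi-bin/hstats.cgi',
    body=b'<a href="playrh.cgi?3986">Durant, Kevin</a>',
    encoding='utf-8',
)

extractor = LinkExtractor(allow=r'playrh\.cgi\?[0-9]{4}$')
for link in extractor.extract_links(response):
    # On Scrapy 1.0.x this already prints the trailing '=', i.e. the
    # rewrite happens during extraction, not during the download.
    print(link.url)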
Thanks!
EDIT: OK, here is my log...
As far as I can tell no redirects are happening, but the pesky '=' is ending up on the request URL somehow...
For now I plan to work around this with the Rule's link-processing hook, but I'd like to get to the bottom of it.
Thanks!
Testing started at 10:24 AM ...
pydev debugger: process 1352 is connecting
Connected to pydev debugger (build 143.1919)
2016-02-17 10:24:57,789: INFO >> Scrapy 1.0.3 started (bot: Scraper)
2016-02-17 10:24:57,789: INFO >> Optional features available: ssl, http11
2016-02-17 10:24:57,790: INFO >> Overridden settings: {'NEWSPIDER_MODULE': 'Scraper.spiders', 'LOG_ENABLED': False, 'SPIDER_MODULES': ['Scraper.spiders'], 'CONCURRENT_REQUESTS': 128, 'BOT_NAME': 'Scraper'}
2016-02-17 10:24:57,904: INFO >> Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-02-17 10:24:58,384: INFO >> Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-02-17 10:24:58,388: INFO >> Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-02-17 10:24:58,417: INFO >> Enabled item pipelines: MongoOutPipeline
2016-02-17 10:24:58,420: INFO >> Spider opened
2016-02-17 10:24:58,424: INFO >> Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-17 10:24:58,427: DEBUG >> spider_opened (NbaRotoGuruDfsPerformanceSpider) : 'NbaRotoGuruDfsPerformanceSpider'
2016-02-17 10:24:58,428: DEBUG >> Telnet console listening on 127.0.0.1:6023
2016-02-17 10:24:59,957: DEBUG >> Crawled (200) <GET http://rotoguru1.com/cgi-bin/hstats.cgi?pos=0&sort=1&game=k&colA=0&daypt=0&xavg=3&show=1&fltr=00> (referer: None)
2016-02-17 10:25:01,130: DEBUG >> Crawled (200) <GET http://rotoguru1.com/cgi-bin/playrh.cgi?4496=> (referer: http://rotoguru1.com/cgi-bin/hstats.cgi?pos=0&sort=1&game=k&colA=0&daypt=0&xavg=3&show=1&fltr=00)
**********************************>> CUT OUT ABOUT 550 LINES HERE FOR BREVITY (Just links same as directly above/below) *********************************>>
2016-02-17 10:25:28,983: DEBUG >> Crawled (200) <GET http://rotoguru1.com/cgi-bin/playrh.cgi?4632=> (referer: http://rotoguru1.com/cgi-bin/hstats.cgi?pos=0&sort=1&game=k&colA=0&daypt=0&xavg=3&show=1&fltr=00)
2016-02-17 10:25:28,987: DEBUG >> Crawled (200) <GET http://rotoguru1.com/cgi-bin/playrh.cgi?3527=> (referer: http://rotoguru1.com/cgi-bin/hstats.cgi?pos=0&sort=1&game=k&colA=0&daypt=0&xavg=3&show=1&fltr=00)
2016-02-17 10:25:29,400: DEBUG >> Crawled (200) <GET http://rotoguru1.com/cgi-bin/playrh.cgi?4564=> (referer: http://rotoguru1.com/cgi-bin/hstats.cgi?pos=0&sort=1&game=k&colA=0&daypt=0&xavg=3&show=1&fltr=00)
2016-02-17 10:25:29,581: INFO >> Closing spider (finished)
2016-02-17 10:25:29,585: INFO >> Dumping Scrapy stats:
{'downloader/request_bytes': 194884,
'downloader/request_count': 570,
'downloader/request_method_count/GET': 570,
'downloader/response_bytes': 5886991,
'downloader/response_count': 570,
'downloader/response_status_count/200': 570,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 2, 17, 15, 25, 29, 582000),
'log_count/DEBUG': 572,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 570,
'scheduler/dequeued': 570,
'scheduler/dequeued/memory': 570,
'scheduler/enqueued': 570,
'scheduler/enqueued/memory': 570,
'start_time': datetime.datetime(2016, 2, 17, 15, 24, 58, 424000)}
2016-02-17 10:25:29,585: INFO >> Spider closed (finished)
Process finished with exit code 0
The following snippet works around the problem by stripping the rogue '=' sign from the links:
...
rules = [
    Rule(LinkExtractor(r'playrh\.cgi\?[0-9]{4}'),
         process_links='process_links',
         callback='parse_player',
         follow=False)
]
...
def process_links(self, links):
    for link in links:
        # NOTE: this strips every '=' from the URL, which is safe here
        # only because these player URLs carry no other query parameters.
        link.url = link.url.replace('=', '')
    return links
...
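For the record, the stray '=' almost certainly comes from URL canonicalization rather than from anything on the site: in Scrapy 1.0 the LinkExtractor canonicalizes extracted URLs by default (canonicalize=True), and w3lib's canonicalize_url re-encodes the query string, turning a bare, valueless key such as 3986 into 3986=. If that is indeed the cause, disabling canonicalization on the extractor should leave the href untouched and make the process_links hook unnecessary. A sketch, assuming Scrapy 1.0.x:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from w3lib.url import canonicalize_url

# Demonstrating the rewrite: a valueless query key gains a trailing '='
# when the URL is canonicalized.
print(canonicalize_url('http://rotoguru1.com/cgi-bin/playrh.cgi?3986'))
# -> http://rotoguru1.com/cgi-bin/playrh.cgi?3986=

# With canonicalization disabled, the extracted href is left as-is and
# no process_links hook is needed.
rules = [
    Rule(LinkExtractor(allow=r'playrh\.cgi\?[0-9]{4}$', canonicalize=False),
         callback='parse_player',
         follow=False)
]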