Scrapy exclude URLs containing specific text
I'm having trouble building a Scrapy Python program. The code is below.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class LinkscrawlItem(scrapy.Item):
    link = scrapy.Field()
    attr = scrapy.Field()


class someSpider(CrawlSpider):
    name = 'mysitecrawler'
    item = []
    allowed_domains = ['mysite.co.uk']
    start_urls = ['https://mysite.co.uk/']

    rules = (Rule(LinkExtractor(), callback="parse_obj", follow=True),
             Rule(LinkExtractor(deny=('my-account', 'cart', 'checkout', 'wp-content')))
             )

    def parse_obj(self, response):
        item = LinkscrawlItem()
        item["link"] = str(response.url) + ":" + str(response.status)
        filename = 'links2.txt'
        with open(filename, 'a') as f:
            f.write('\n' + str(response.url) + ":" + str(response.status) + '\n')
        self.log('Saved file %s' % filename)
I'm having trouble with the LinkExtractor. To me, deny means excluding the links I list from the crawl, but it is still crawling them. The first three URLs are:
https://mysite.co.uk/my-account/
https://mysite.co.uk/checkout/
The last one contains wp-content, for example:
https://mysite.co.uk/wp-content/uploads/01/22/photo.jpg
Does anyone know what is wrong with my deny list?
Thank you
There are two issues with your code. First, you have two Rules in your crawl spider, and you put the deny restriction in the second rule, which is never checked: the first rule matches every link and invokes the callback before the second rule is ever consulted, so it never excludes the URLs you don't want to crawl. The second issue is that your second rule passes the literal strings you want to avoid, but deny expects regular expressions.
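To see the deny filtering in isolation, here is a minimal sketch that runs a LinkExtractor with deny patterns against a fabricated response (the HTML snippet and its URLs are made up purely for illustration):

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# Made-up page containing one link we want to skip and one we want to keep.
html = b'''
<a href="https://mysite.co.uk/my-account/">account</a>
<a href="https://mysite.co.uk/blog/first-post/">blog post</a>
'''
response = HtmlResponse(url='https://mysite.co.uk/', body=html, encoding='utf-8')

# deny takes regular expressions; each pattern is matched against the full URL.
extractor = LinkExtractor(deny=(r'my-account', r'cart', r'checkout', r'wp-content'))
for link in extractor.extract_links(response):
    print(link.url)  # prints only https://mysite.co.uk/blog/first-post/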
The solution is to remove the first rule and change the deny argument slightly, escaping regex special characters such as - in the URL fragments (the patterns below use raw strings so Python doesn't treat \- as a string escape). See the example below.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class LinkscrawlItem(scrapy.Item):
    link = scrapy.Field()
    attr = scrapy.Field()


class SomeSpider(CrawlSpider):
    name = 'mysitecrawler'
    allowed_domains = ['mysite.co.uk']
    start_urls = ['https://mysite.co.uk/']

    rules = (
        Rule(LinkExtractor(deny=(r'my\-account', r'cart', r'checkout', r'wp\-content')),
             callback="parse_obj", follow=True),
    )

    def parse_obj(self, response):
        item = LinkscrawlItem()
        item["link"] = str(response.url) + ":" + str(response.status)
        filename = 'links2.txt'
        with open(filename, 'a') as f:
            f.write('\n' + str(response.url) + ":" + str(response.status) + '\n')
        self.log('Saved file %s' % filename)
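If you want to double-check which URLs a given set of deny patterns will block before running a full crawl, you can reproduce the check yourself: Scrapy matches each deny pattern against the full URL with re.search. A quick sketch like this (the URL list is just an example) shows the outcome:

import re

deny_patterns = [r'my\-account', r'cart', r'checkout', r'wp\-content']
urls = [
    'https://mysite.co.uk/my-account/',
    'https://mysite.co.uk/checkout/',
    'https://mysite.co.uk/wp-content/uploads/01/22/photo.jpg',
    'https://mysite.co.uk/blog/some-post/',  # example of a URL that should still be crawled
]
for url in urls:
    blocked = any(re.search(p, url) for p in deny_patterns)
    print(url, '->', 'blocked' if blocked else 'crawled')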