CrawlSpider not crawling using URLs from a text file
Problem statement:
I have a list of forum URLs, one per line, in a file named myurls.csv, like this:
https://www.drupal.org/user/3178461/track
https://www.drupal.org/user/511008/track
I wrote a CrawlSpider to crawl the forum posts, as follows:
class fileuserurl(CrawlSpider):
    name = "fileuserurl"
    allowed_domains = []
    start_urls = []

    rules = (
        Rule(SgmlLinkExtractor(allow=('/user/\d/track'),
                               restrict_xpaths=('//li[@class="pager-next"]',),
                               canonicalize=False),
             callback='parse_page', follow=True)
    )

    def __init__(self):
        f = open('./myurls.txt', 'r').readlines()
        self.allowed_domains = ['www.drupal.org']
        self.start_urls = [l.strip() for l in f]
        super(fileuserurl, self).__init__()

    def parse_page(self, response):
        print '*********** START PARSE_PAGE METHOD**************'
        # print response.url
        items = response.xpath("//tbody/tr")
        myposts = []
        for temp in items:
            item = TopicPosts()
            item['topic'] = temp.xpath(".//td[2]/a/text()").extract()
            relative_url = temp.xpath(".//td[2]/a/@href").extract()[0]
            item['topiclink'] = 'https://www.drupal.org' + relative_url
            item['author'] = temp.xpath(".//td[3]/a/text()").extract()
            try:
                item['replies'] = str(temp.xpath(".//td[4]/text()").extract()[0]).strip('\n')
            except:
                continue
            myposts.append(item)
        return myposts
Problem:
It only gives me output for the first page of each URL listed in the text file. I want it to also follow the "next" pager link on each page and crawl the subsequent pages.
Instead of overriding __init__, define a start_requests() method:
def start_requests(self):
    with open('./myurls.txt', 'r') as f:
        for url in f:
            url = url.strip()
            yield scrapy.Request(url)
Also, you need to define rules as an iterable (as written, the parentheses without a trailing comma do not make a tuple, so rules ends up being a single Rule object). And the regular expression in allow should match more than one digit (\d+ instead of \d):
rules = [
    Rule(SgmlLinkExtractor(allow='/user/\d+/track',
                           restrict_xpaths='//li[@class="pager-next"]',
                           canonicalize=False),
         callback='parse_page',
         follow=True)
]
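Putting the two changes together, here is a minimal sketch of the corrected spider. It assumes TopicPosts is an Item class defined elsewhere in your project (it is not shown in the question), and the import paths are the ones used by the older Scrapy releases that still ship SgmlLinkExtractor; on newer Scrapy you would use LinkExtractor from scrapy.linkextractors instead.

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from myproject.items import TopicPosts  # hypothetical: replace with your project's items module


class fileuserurl(CrawlSpider):
    name = "fileuserurl"
    allowed_domains = ['www.drupal.org']

    # rules is a list (an iterable), and \d+ matches the whole user id
    rules = [
        Rule(SgmlLinkExtractor(allow='/user/\d+/track',
                               restrict_xpaths='//li[@class="pager-next"]',
                               canonicalize=False),
             callback='parse_page',
             follow=True),
    ]

    def start_requests(self):
        # seed the crawl from the text file instead of overriding __init__
        with open('./myurls.txt', 'r') as f:
            for url in f:
                yield scrapy.Request(url.strip())

    def parse_page(self, response):
        for row in response.xpath("//tbody/tr"):
            item = TopicPosts()
            item['topic'] = row.xpath(".//td[2]/a/text()").extract()
            item['topiclink'] = 'https://www.drupal.org' + row.xpath(".//td[2]/a/@href").extract()[0]
            item['author'] = row.xpath(".//td[3]/a/text()").extract()
            replies = row.xpath(".//td[4]/text()").extract()
            if not replies:
                continue
            item['replies'] = replies[0].strip()
            yield item

Run it as usual, for example scrapy crawl fileuserurl -o posts.json; each /user/<id>/track URL from myurls.txt should now be followed through its pager links.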