CrawlSpider not crawling using URLs from a text file
Problem statement:
I have a list of forum URLs, one per line, in a file named myurls.csv, like this:
https://www.drupal.org/user/3178461/track
https://www.drupal.org/user/511008/track
I wrote a CrawlSpider to crawl the forum posts, as follows:
class fileuserurl(CrawlSpider):
    name = "fileuserurl"
    allowed_domains = []
    start_urls = []

    rules = (
        Rule(SgmlLinkExtractor(allow=('/user/\d/track'),
                               restrict_xpaths=('//li[@class="pager-next"]',),
                               canonicalize=False),
             callback='parse_page', follow=True)
    )

    def __init__(self):
        f = open('./myurls.txt', 'r').readlines()
        self.allowed_domains = ['www.drupal.org']
        self.start_urls = [l.strip() for l in f]
        super(fileuserurl, self).__init__()

    def parse_page(self, response):
        print '*********** START PARSE_PAGE METHOD**************'
        # print response.url
        items = response.xpath("//tbody/tr")
        myposts = []
        for temp in items:
            item = TopicPosts()
            item['topic'] = temp.xpath(".//td[2]/a/text()").extract()
            relative_url = temp.xpath(".//td[2]/a/@href").extract()[0]
            item['topiclink'] = 'https://www.drupal.org' + relative_url
            item['author'] = temp.xpath(".//td[3]/a/text()").extract()
            try:
                item['replies'] = str(temp.xpath(".//td[4]/text()").extract()[0]).strip('\n')
            except:
                continue
            myposts.append(item)
        return myposts
Problem:
It only gives me output for the first page of each URL listed in the text file. I want it to also follow the "next" pager link on each page and crawl the subsequent pages.
Instead of overriding __init__, define a start_requests() method:
def start_requests(self):
    with open('./myurls.txt', 'r') as f:
        for url in f:
            url = url.strip()
            yield scrapy.Request(url)
Also, you need to define rules as an iterable (as written, the parentheses without a trailing comma do not make a tuple, so rules ends up being a single Rule object). And the regular expression in allow should match more than one digit (\d+ instead of \d):
rules = [
    Rule(SgmlLinkExtractor(allow='/user/\d+/track',
                           restrict_xpaths='//li[@class="pager-next"]',
                           canonicalize=False),
         callback='parse_page',
         follow=True)
]
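Putting the two changes together, here is a minimal sketch of the corrected spider. It assumes TopicPosts is an Item class defined elsewhere in your project (it is not shown in the question), and the import paths are the ones used by the older Scrapy releases that still ship SgmlLinkExtractor; on newer Scrapy you would use LinkExtractor from scrapy.linkextractors instead.

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from myproject.items import TopicPosts  # hypothetical: replace with your project's items module


class fileuserurl(CrawlSpider):
    name = "fileuserurl"
    allowed_domains = ['www.drupal.org']

    # rules is a list (an iterable), and \d+ matches the whole user id
    rules = [
        Rule(SgmlLinkExtractor(allow='/user/\d+/track',
                               restrict_xpaths='//li[@class="pager-next"]',
                               canonicalize=False),
             callback='parse_page',
             follow=True),
    ]

    def start_requests(self):
        # seed the crawl from the text file instead of overriding __init__
        with open('./myurls.txt', 'r') as f:
            for url in f:
                yield scrapy.Request(url.strip())

    def parse_page(self, response):
        for row in response.xpath("//tbody/tr"):
            item = TopicPosts()
            item['topic'] = row.xpath(".//td[2]/a/text()").extract()
            item['topiclink'] = 'https://www.drupal.org' + row.xpath(".//td[2]/a/@href").extract()[0]
            item['author'] = row.xpath(".//td[3]/a/text()").extract()
            replies = row.xpath(".//td[4]/text()").extract()
            if not replies:
                continue
            item['replies'] = replies[0].strip()
            yield item

Run it as usual, for example scrapy crawl fileuserurl -o posts.json; each /user/<id>/track URL from myurls.txt should now be followed through its pager links.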