Using Python Scrapy to extract links from a webpage
I am a beginner in Python and I am using Scrapy to extract links from the following webpage:
http://www.basketball-reference.com/leagues/NBA_2015_games.html
The code I have written is:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from basketball.items import BasketballItem

class BasketballSpider(CrawlSpider):
    name = 'basketball'
    allowed_domains = ['basketball-reference.com/']
    start_urls = ['http://www.basketball-reference.com/leagues/NBA_2015_games.html']
    rules = [Rule(LinkExtractor(allow=['http://www.basketball-reference.com/boxscores/^\w+$']), 'parse_item')]

    def parse_item(self, response):
        item = BasketballItem()
        item['url'] = response.url
        return item
I ran this code from the command prompt, but the output file does not contain any links. Can someone help?
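The spider cannot find the links because the regular expression in the rule never matches: the '^' anchor in the middle of the pattern can only match at the very start of the URL. Fix the regular expression in the rule:
rules = [
    Rule(LinkExtractor(allow=r'boxscores/\w+'))
]
Also, a Rule has no default callback, so parse_item has to be passed as the callback explicitly. The positional 'parse_item' in your code already does this; the keyword form is just clearer. And allow can also be given as a single string instead of a list:
rules = [
    Rule(LinkExtractor(allow=r'boxscores/\w+'), callback='parse_item')
]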
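If it helps to see why the original pattern finds nothing, here is a small check that uses only Python's re module. It is a sketch, not Scrapy itself: it assumes LinkExtractor applies each allow pattern as a regex search against the absolute URL, and the box-score URL is a hypothetical example of the link shape, not one taken from the page. The helper name is_allowed is also made up for illustration.

```python
import re

# Hypothetical box-score URL, assumed only for illustration.
sample_url = "http://www.basketball-reference.com/boxscores/201410280LAL.html"
# The start page from the question, which should not be treated as a box score.
index_url = "http://www.basketball-reference.com/leagues/NBA_2015_games.html"

# Pattern from the question: '^' sits in the middle of the pattern. Without
# re.MULTILINE it only matches at position 0, so this can never match a URL.
original = re.compile(r"http://www.basketball-reference.com/boxscores/^\w+$")

# Pattern from the answer: any URL containing 'boxscores/' followed by
# at least one word character.
fixed = re.compile(r"boxscores/\w+")


def is_allowed(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


original_matches = is_allowed(original, sample_url)
fixed_matches = is_allowed(fixed, sample_url)
index_matches = is_allowed(fixed, index_url)

print(original_matches, fixed_matches, index_matches)
```

Running this kind of check on a few URLs copied from the page is a quick way to test a rule before starting a crawl.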