Scrapy: attempts to extract data from a selector list are not right
I am trying to scrape football fixtures from a website, but my spider isn't quite right: either the same fixture is repeated for every selector, or the homeTeam and awayTeam variables end up as huge arrays containing every home team or every away team respectively. Either way, the output should reflect the home team vs away team format.
Here is my attempt so far:
class FixtureSpider(CrawlSpider):
    name = "fixturesSpider"
    allowed_domains = ["www.bbc.co.uk"]
    start_urls = [
        "http://www.bbc.co.uk/sport/football/premier-league/fixtures"
    ]

    def parse(self, response):
        for sel in response.xpath('//table[@class="table-stats"]/tbody/tr[@class="preview"]'):
            item = Fixture()
            item['kickoff'] = str(sel.xpath("//table[@class='table-stats']/tbody/tr[@class='preview']/td[3]/text()").extract()[0].strip())
            item['homeTeam'] = str(sel.xpath("//table[@class='table-stats']/tbody/tr/td[2]/p/span/a/text()").extract()[0].strip())
            item['awayTeam'] = str(sel.xpath("//table[@class='table-stats']/tbody/tr/td[2]/p/span/a/text()").extract()[1].strip())
            yield item
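For reference, Fixture is assumed to be a plain Scrapy Item holding just the fields used above; a minimal sketch (the items.py module name and field layout are assumptions):

import scrapy

class Fixture(scrapy.Item):
    # field names match the keys assigned in the spider above
    kickoff = scrapy.Field()
    homeTeam = scrapy.Field()
    awayTeam = scrapy.Field()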
This returns the information below, duplicated, which is not right:
2015-03-20 21:41:40+0000 [fixturesSpider] DEBUG: Scraped from <200 http://www.bbc.co.uk/sport/football/premier-league/fixtures>
{'awayTeam': 'West Brom', 'homeTeam': 'Man City', 'kickoff': '12:45'}
2015-03-20 21:41:40+0000 [fixturesSpider] DEBUG: Scraped from <200 http://www.bbc.co.uk/sport/football/premier-league/fixtures>
{'awayTeam': 'West Brom', 'homeTeam': 'Man City', 'kickoff': '12:45'}
Can anyone tell me where I am going wrong?
Try the selectors below. I believe you need ...tbody//tr/... rather than ...tbody/tr/... to get all of the table rows, not just the first one.
item['kickoff'] = str(sel.xpath("//table[@class='table-stats']/tbody//tr[@class='preview']/td[3]/text()").extract()[0].strip())
item['homeTeam'] = str(sel.xpath("//table[@class='table-stats']/tbody//tr/td[2]/p/span/a/text()").extract()[0].strip())
item['awayTeam'] = str(sel.xpath("//table[@class='table-stats']/tbody//tr/td[2]/p/span/a/text()").extract()[1].strip())
The problem is that the XPath expressions you use inside the loop are absolute: they start searching from the document root, but they should be relative to the current row that sel points to. In other words, you need to search within the context of the current row.

Fixed version:
for sel in response.xpath('//table[@class="table-stats"]/tbody/tr[@class="preview"]'):
    item = Fixture()
    item['kickoff'] = str(sel.xpath("td[3]/text()").extract()[0].strip())
    item['homeTeam'] = str(sel.xpath("td[2]/p/span/a/text()").extract()[0].strip())
    item['awayTeam'] = str(sel.xpath("td[2]/p/span/a/text()").extract()[1].strip())
    yield item
This is the output I get:
{'awayTeam': 'West Brom', 'homeTeam': 'Man City', 'kickoff': '12:45'}
{'awayTeam': 'Swansea', 'homeTeam': 'Aston Villa', 'kickoff': '15:00'}
{'awayTeam': 'Arsenal', 'homeTeam': 'Newcastle', 'kickoff': '15:00'}
...
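As a side note, extract()[0] raises an IndexError if a row happens to be missing the expected cells (say, a fixture listed without a kick-off time). A more defensive sketch of the same loop that simply skips such rows; purely optional, the fixtures page as shown above does not strictly need it:

# Defensive variant of the fixed loop: skip rows that do not have the
# expected cells instead of raising IndexError.
for sel in response.xpath('//table[@class="table-stats"]/tbody/tr[@class="preview"]'):
    kickoff = sel.xpath("td[3]/text()").extract()
    teams = sel.xpath("td[2]/p/span/a/text()").extract()
    if not kickoff or len(teams) < 2:
        continue  # malformed row, skip it
    item = Fixture()
    item['kickoff'] = kickoff[0].strip()
    item['homeTeam'] = teams[0].strip()
    item['awayTeam'] = teams[1].strip()
    yield item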
If you also want to get the match date, you need to change the strategy: iterate over the dates (the h2 elements with the table-header class) and take the first following-sibling table element:
for header in response.xpath('//h2[@class="table-header"]'):
    # the matches for this date live in the first table following the header
    matches = header.xpath('.//following-sibling::table[@class="table-stats"][1]/tbody/tr[@class="preview"]')
    date = header.xpath('text()').extract()[0].strip()
    for match in matches:
        item = Fixture()
        item['date'] = date
        item['kickoff'] = match.xpath("td[3]/text()").extract()[0].strip()
        item['homeTeam'] = match.xpath("td[2]/p/span/a/text()").extract()[0].strip()
        item['awayTeam'] = match.xpath("td[2]/p/span/a/text()").extract()[1].strip()
        yield item
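Note that for this version the Fixture item also needs a date field declared (date = scrapy.Field()) alongside the existing ones. With that in place you can run the spider and export the results, for example:

scrapy crawl fixturesSpider -o fixtures.json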