My scraper yields too many items instead of merging them all into only a small bunch
I wrote a scraper that is supposed to go through several pages for each football club, grab essentially all the historic data, and eventually merge it into one nice JSON file with one item per team.
For example, I should end up with something like this for each team:
{'clubName': [u'West Ham United'],
'matches': [{'date': [u'17/08/1974'],
'opponent': [u'Manchester City'],
'place': [u'A'],
'results': [u'0:4 '],
'round': [u'1. Round'],
'time': []},
{'date': [u'19/08/1974'],
'opponent': [u'Luton Town'],
'place': [u'H'],
'results': [u'2:0 '],
'round': [u'2. Round'],
'time': []},
{'date': [u'24/08/1974'],
'opponent': [u'Everton FC'],
'place': [u'H'],
'results': [u'2:3 '],
'round': [u'3. Round'],
'time': []},
Basically the flow is:
- get the 20 teams, then follow the link to each team's page
- get the historic results link
- get all the season links from the historic results page
- merge the match data back into the item
To debug, I tried yielding the item after each function. I should end up with 20 items. If I yield items after functions 1, 2 or 3, I end up with exactly 20 rows, which is perfect, but it goes wrong in the 4th function, where I end up with thousands of items, many items per club, and so on.
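This multiplication is what happens when one shared item is handed to every season request and then yielded again in each callback. A minimal sketch of the effect (plain generators, not Scrapy; all names are made up for illustration):

```python
# One club item is passed to every season "request"; each per-season
# callback then yields that same item again, so the output is
# multiplied by the number of season pages.
def parse_seasons(item, season_urls):
    for url in season_urls:
        # each request will later call parse_results with the SAME item
        yield (url, item)

def parse_results(item, season_url):
    item['matches'] = []  # also resets matches on every call
    yield item            # -> one emitted item PER season page

club = {'clubName': ['Arsenal FC']}
emitted = []
for url, it in parse_seasons(club, ['s1974', 's1975', 's1976']):
    emitted.extend(parse_results(it, url))

print(len(emitted))  # 3 items for one club, not 1
```

With 20 clubs and ~80 seasons each, that is exactly the "thousands of items" seen above.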
I end up with items like this:
{"matches": [], "clubName": ["Arsenal FC"]},
{"matches": [], "clubName": ["Arsenal FC"]},
{"matches": [], "clubName": ["Arsenal FC"]},
{"matches": [], "clubName": ["Arsenal FC"]},
{"matches": [], "clubName": ["Arsenal FC"]},
{"matches": [], "clubName": ["Arsenal FC"]},
{"matches": [], "clubName": ["Arsenal FC"]},
{"matches": [], "clubName": ["Arsenal FC"]},
{"matches": [], "clubName": ["Arsenal FC"]},
Sometimes 30 in a row, mostly blank except for the club name.
I'm new to Python and I've been staring at this all morning without being able to figure out why it doesn't work.
Here's my code:
import scrapy
from ..items import PremierleagueItem


class PremierleagueSpider(scrapy.Spider):
    name = "premierleague"
    allowed_domains = ["worldfootball.net"]
    start_urls = [
        "http://www.worldfootball.net/competition/eng-premier-league/"
    ]

    # get teams in the match
    def parse(self, response):
        for sel in response.xpath('//div[@id="tabelle_0"]/div[@class="data"]/table[1]/tr'):
            clubName = sel.xpath('.//td[3]/a/text()').extract()
            if clubName:
                item = PremierleagueItem()
                item['clubName'] = clubName
                clubHref = sel.xpath('.//td[2]/a/@href').extract_first()
                clubUrl = response.urljoin(clubHref)
                request = scrapy.Request(clubUrl, callback=self.parse_get_historic_results_link)
                request.meta['item'] = item
                yield request

    def parse_get_historic_results_link(self, response):
        item = response.meta['item']
        href2 = response.xpath('//div[@class="navibox2"]/div[@class="data"]/ul[5]/li[2]/a[1]/@href').extract_first()
        url2 = response.urljoin(href2)
        request = scrapy.Request(url2, callback=self.parse_seasons)
        request.meta['item'] = item
        yield request

    def parse_seasons(self, response):
        item = response.meta['item']
        for sel in response.xpath('(//table[@class="standard_tabelle"])[1]/tr/td[2]/a'):
            href = sel.xpath('.//@href').extract_first()
            url = response.urljoin(href)
            request = scrapy.Request(url, callback=self.parse_results)
            request.meta['item'] = item
            yield request

    def parse_results(self, response):
        item = response.meta['item']
        item['matches'] = []
        for sel in response.xpath('(//table[@class="standard_tabelle"])[1]/tr'):
            results = sel.xpath('.//td[7]/a/text()').extract()
            if results:
                matchDict = {
                    'round': sel.xpath('.//td[1]/a/text()').extract(),
                    'date': sel.xpath('.//td[2]/a/text()').extract(),
                    'time': sel.xpath('.//td[3]/text()').extract(),
                    'place': sel.xpath('.//td[4]/text()').extract(),
                    'opponent': sel.xpath('.//td[6]/a/text()').extract(),
                    'results': results
                }
                item['matches'].append(matchDict)
        yield item
Where am I messing up here?
EDIT
To clarify, the ideal format I'd end up with is a multidimensional array, something like (pseudocode):
Team name Y {
premierLeagueMatches {
{'date': [...],
'opponent': [...],
'place': [...],
'results': [...],
'round': [...],
'time': [...]
}
otherMatches {
same as above
}
},
Team name X {
premierLeagueMatches {
{'date': [...],
'opponent': [...],
'place': [...],
'results': [...],
'round': [...],
'time': [...]
}
otherMatches {
same as above
}
}
At the top level of the array there would only be club names, each one unique, with no duplicate team name X or Y etc. But at the moment the only unique key at the top level is the season date.
In the final JSON output from the buggy code I can search for "clubName": [ "West Ham United" ] and get 75 results instead of 1. So while there's a ton of data going back to roughly the 1900s :), the current scrape count is 1670 (total number of seasons × number of teams in the Premier League, I guess), when I'm trying to end up with just 20 items (one per team).
Your xpaths in parse_results are wrong. Here is a runnable example that gets the data you want:
import scrapy


class PremierleagueItem(scrapy.Item):
    round = scrapy.Field()
    date = scrapy.Field()
    time = scrapy.Field()
    place = scrapy.Field()
    opponent = scrapy.Field()
    results = scrapy.Field()
    clubName = scrapy.Field()
    matches = scrapy.Field()


class PremierleagueSpider(scrapy.Spider):
    name = "premierleague"
    allowed_domains = ["worldfootball.net"]
    start_urls = [
        "http://www.worldfootball.net/competition/eng-premier-league/"
    ]

    # get teams in the match
    def parse(self, response):
        for sel in response.xpath('//div[@id="tabelle_0"]/div[@class="data"]/table[1]/tr'):
            clubName = sel.xpath('.//td[3]/a/text()').extract()
            if clubName:
                item = PremierleagueItem()
                item['clubName'] = clubName
                clubHref = sel.xpath('.//td[2]/a/@href').extract_first()
                clubUrl = response.urljoin(clubHref)
                request = scrapy.Request(clubUrl, callback=self.parse_get_historic_results_link)
                request.meta['item'] = item
                yield request

    def parse_get_historic_results_link(self, response):
        item = response.meta['item']
        href2 = response.xpath('//div[@class="navibox2"]/div[@class="data"]/ul[5]/li[2]/a[1]/@href').extract_first()
        url2 = response.urljoin(href2)
        request = scrapy.Request(url2, callback=self.parse_seasons)
        request.meta['item'] = item
        yield request

    def parse_seasons(self, response):
        item = response.meta['item']
        for sel in response.xpath('(//table[@class="standard_tabelle"])[1]/tr/td[2]/a'):
            href = sel.xpath('.//@href').extract_first()
            url = response.urljoin(href)
            request = scrapy.Request(url, callback=self.parse_results)
            request.meta['item'] = item
            yield request

    @staticmethod
    def parse_results(response):
        item = response.meta['item']
        item['matches'] = []
        for sel in response.xpath('(//table[@class="standard_tabelle"])[1]/tr[position() > 3]'):
            matchDict = dict(zip((
                'round',
                'date',
                'place',
                'opponent',
                'results'), filter(None, map(unicode.strip, (sel.xpath("./td[normalize-space(.)]//text()").extract())))))
            item['matches'].append(matchDict)
        yield item
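The `dict(zip(...))` line is doing the field mapping: it strips every cell text, drops the empty ones, and pairs what is left with the five keys in order. A standalone illustration with made-up cell values (using Python 3's `str.strip`; the spider's code uses Python 2's `unicode.strip`):

```python
# Sample cell texts as they might come out of one table row
# (the empty string stands for a blank "time" cell).
cells = ['1. Round', '17/08/1974', '', 'A  ', 'Manchester City', '0:4 ']
keys = ('round', 'date', 'place', 'opponent', 'results')

# strip each cell, drop empties, pair the survivors with the keys
row = dict(zip(keys, filter(None, map(str.strip, cells))))
print(row)
# {'round': '1. Round', 'date': '17/08/1974', 'place': 'A',
#  'opponent': 'Manchester City', 'results': '0:4'}
```

Note this only lines up correctly when each row has exactly five non-empty cells, which is why the blank `time` column can be dropped here.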
Output snippet:
{'clubName': [u'Manchester City'],
'matches': [{'date': u'09/09/1911',
'opponent': u'Liverpool FC',
'place': u'A',
'results': u'2:2',
'round': u'2. Round'},
{'date': u'16/09/1911',
'opponent': u'Aston Villa',
'place': u'H',
'results': u'2:6',
'round': u'3. Round'},
{'date': u'23/09/1911',
'opponent': u'Newcastle United',
'place': u'A',
'results': u'0:1',
'round': u'4. Round'},
{'date': u'30/09/1911',
'opponent': u'Sheffield United',
'place': u'H',
'results': u'0:0',
'round': u'5. Round'},
{'date': u'07/10/1911',
'opponent': u'Oldham Athletic',
'place': u'A',
'results': u'1:4',
'round': u'6. Round'},
{'date': u'14/10/1911',
'opponent': u'Bolton Wanderers',
'place': u'H',
'results': u'3:1',
'round': u'8. Round'},
{'date': u'21/10/1911',
'opponent': u'Bradford City',
'place': u'A',
'results': u'1:4',
'round': u'9. Round'},
{'date': u'28/10/1911',
'opponent': u'Woolwich Arsenal',
'place': u'H',
'results': u'3:3',
'round': u'9. Round'},
{'date': u'04/11/1911',
'opponent': u'Preston North End',
'place': u'A',
'results': u'1:2',
'round': u'10. Round'},
{'date': u'11/11/1911',
'opponent': u'Everton FC',
'place': u'A',
'results': u'0:1',
'round': u'12. Round'},
{'date': u'18/11/1911',
'opponent': u'West Bromwich Albion',
'place': u'H',
'results': u'0:2',
'round': u'12. Round'},
{'date': u'25/11/1911',
'opponent': u'Sunderland AFC',
'place': u'A',
'results': u'1:1',
'round': u'13. Round'},
{'date': u'02/12/1911',
'opponent': u'Blackburn Rovers',
'place': u'H',
'results': u'3:0',
'round': u'15. Round'},
{'date': u'09/12/1911',
'opponent': u'Sheffield Wednesday',
'place': u'A',
'results': u'0:3',
'round': u'15. Round'},
{'date': u'16/12/1911',
'opponent': u'Bury FC',
'place': u'H',
'results': u'2:0',
'round': u'16. Round'},
{'date': u'23/12/1911',
'opponent': u'Middlesbrough FC',
'place': u'A',
'results': u'1:3',
'round': u'17. Round'},
{'date': u'25/12/1911',
'opponent': u'Notts County',
'place': u'A',
'results': u'1:0',
'round': u'18. Round'},
{'date': u'26/12/1911',
'opponent': u'Notts County',
'place': u'H',
'results': u'4:0',
'round': u'19. Round'},
{'date': u'30/12/1911',
'opponent': u'Manchester United',
'place': u'A',
'results': u'0:0',
'round': u'20. Round'},
{'date': u'06/01/1912',
'opponent': u'Liverpool FC',
'place': u'H',
'results': u'2:3',
'round': u'21. Round'},
{'date': u'20/01/1912',
'opponent': u'Aston Villa',
'place': u'A',
'results': u'1:3',
'round': u'22. Round'},
{'date': u'27/01/1912',
'opponent': u'Newcastle United',
'place': u'H',
'results': u'1:1',
'round': u'23. Round'},
{'date': u'10/02/1912',
'opponent': u'Oldham Athletic',
'place': u'H',
'results': u'1:3',
'round': u'24. Round'},
{'date': u'17/02/1912',
'opponent': u'Bolton Wanderers',
'place': u'A',
'results': u'1:2',
'round': u'27. Round'},
{'date': u'26/02/1912',
'opponent': u'Sheffield United',
'place': u'A',
'results': u'2:6',
'round': u'26. Round'},
{'date': u'02/03/1912',
'opponent': u'Woolwich Arsenal',
'place': u'A',
'results': u'0:2',
'round': u'28. Round'},
{'date': u'09/03/1912',
'opponent': u'Preston North End',
'place': u'H',
'results': u'0:0',
'round': u'28. Round'},
{'date': u'16/03/1912',
'opponent': u'Everton FC',
'place': u'H',
'results': u'4:0',
'round': u'29. Round'},
{'date': u'23/03/1912',
'opponent': u'West Bromwich Albion',
'place': u'A',
'results': u'1:1',
'round': u'30. Round'},
{'date': u'28/03/1912',
'opponent': u'Bradford City',
'place': u'H',
'results': u'4:0',
'round': u'31. Round'},
{'date': u'30/03/1912',
'opponent': u'Sunderland AFC',
'place': u'H',
'results': u'2:0',
'round': u'32. Round'},
{'date': u'05/04/1912',
'opponent': u'Tottenham Hotspur',
'place': u'H',
'results': u'2:1',
'round': u'33. Round'},
{'date': u'06/04/1912',
'opponent': u'Blackburn Rovers',
'place': u'A',
'results': u'0:2',
'round': u'31. Round'},
{'date': u'08/04/1912',
'opponent': u'Tottenham Hotspur',
'place': u'A',
'results': u'2:0',
'round': u'35. Round'},
{'date': u'13/04/1912',
'opponent': u'Sheffield Wednesday',
'place': u'H',
'results': u'4:0',
'round': u'36. Round'},
{'date': u'20/04/1912',
'opponent': u'Bury FC',
'place': u'A',
'results': u'2:1',
'round': u'37. Round'},
{'date': u'27/04/1912',
'opponent': u'Middlesbrough FC',
'place': u'H',
'results': u'2:0',
'round': u'38. Round'}]}
You will need to do a bit more work to get the exact format you want, but whatever you do you will need to use correct xpaths. You should also be aware that you are going back to roughly 1900, so there will be a lot of output, which might be better suited to a database. I also only pulled the first table from each page; when there is more than one, that first table is the league results, while some pages only have F.A. Cup results, youth teams and so on... If you wanted to pull all of it, it would be something like:
for tbl in response.xpath('(//table[@class="standard_tabelle"])'):
    for sel in tbl.xpath("./tr[position() > 3]"):
        matchDict = dict(zip((
            'round',
            'date',
            'place',
            'opponent',
            'results'),
            filter(None, map(unicode.strip, (sel.xpath("./td[normalize-space(.)]//text()").extract())))))
        item['matches'].append(matchDict)
yield item
The bottom half of the first table also holds some cup results, so if you only want Premier League matches:
@staticmethod
def parse_results(response):
    item = response.meta['item']
    item['matches'] = []
    table = response.xpath('(//table[@class="standard_tabelle"])[1]')
    for sel in table.xpath("./tr[position() > 3]"):
        title = sel.xpath("./td/a/@title").extract_first()
        if title and "premier" not in title.lower():
            return
        matchDict = dict(zip((
            'round',
            'date',
            'place',
            'opponent',
            'results'),
            filter(None, map(unicode.strip, (sel.xpath("./td[normalize-space(.)]//text()").extract())))))
        item['matches'].append(matchDict)
    yield item
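None of this collapses the output to one item per club on its own, since parse_results still yields once per season page. One hedged way to finish the job is to post-process the exported items and merge them on clubName (the function name and sample data below are made up for illustration):

```python
# Collapse many per-season items into one item per club by merging
# their match lists under the club name.
def merge_by_club(items):
    merged = {}
    for it in items:
        club = it['clubName'][0]
        entry = merged.setdefault(club, {'clubName': it['clubName'], 'matches': []})
        entry['matches'].extend(it.get('matches', []))
    return list(merged.values())

# e.g. two per-season items for the same club collapse into one:
items = [
    {'clubName': ['Arsenal FC'], 'matches': [{'date': '09/09/1911'}]},
    {'clubName': ['Arsenal FC'], 'matches': [{'date': '16/09/1911'}]},
]
merged = merge_by_club(items)
print(len(merged))                # 1
print(len(merged[0]['matches']))  # 2
```

The same merge could instead live in a Scrapy item pipeline that accumulates matches per club and emits everything when the spider closes.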