Scrapy: Unable to get data from XPath

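I am trying to get data with the script below. In the parse function I split the XPath queries into two parts. The first part contains fixed data that I do not want to loop over, and the second part contains the table that I do want to loop over. When I run the script, it only gives me the data from the second part. I use Splash to render the HTML.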

import scrapy
from scrapy_splash import SplashRequest


class RaceSpider(scrapy.Spider):
    name = 'race'
    allowed_domains = ['www.racing.com']

    script = '''
        function main(splash, args)
            splash.private_mode_enabled = false
            assert(splash:go(args.url))
            assert(splash:wait(5))

            splash:set_viewport_full()
            return splash:html()
        end
    '''

    def start_requests(self):
        yield SplashRequest(
            url='https://www.racing.com/form/2020-08-08/flemington/race/1/results#/results',
            callback=self.parse,
            endpoint='execute',
            args={'lua_source': self.script},
        )

    def parse(self, response):
        information = response.xpath("//div[@class='race-results-table ng-scope']/table")

        # Part 1: fixed race details, yielded once as their own item
        yield {
            'Race Number': response.xpath("(.//span[@class='number-circle xlg'])[1]/text()").get(),
            'Title': response.xpath("(.//div[@class='popup ng-scope']/h1)[1]/text()").get(),
            'Result Distance Thumbnail': response.xpath(".//div[@class='ng-scope']/p/text()").get(),
            'Track Condition': response.xpath(".//div[@class='condition']/div/p/span/text()").get(),
            'Rail': response.xpath("(.//div[@class='rail']/div/p/span)[1]/text()").get(),
        }

        # Part 2: one item per result table
        for info in information:
            yield {
                'Position': info.xpath("(.//td[@class='td-position tcenter']/span)[1]/text()").get(),
                'Horse Entry Number': info.xpath("(.//td[@class='horse-name']/h3/a/span)[1]/text()").get(),
                'Horse Full Name': info.xpath("(.//td[@class='horse-name']/h3/a/span)[2]/text()").get(),
                'Horse Barrier Number': info.xpath("(.//td[@class='horse-name']/h3/a/span)[3]/text()").get(),
                'Trainers': info.xpath("(.//td[@class='horse-details']/span/a)[1]/text()").get(),
                'Jockey': info.xpath("(.//td[@class='horse-details']/span/a)[2]/text()").get(),
            }

Output

2021-09-08 22:58:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.racing.com/form/2020-08-08/flemington/race/1/results#/results via http://localhost:8050/execute> (referer: None)
2021-09-08 22:58:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.racing.com/form/2020-08-08/flemington/race/1/results#/results>
{'Race Number': '1', 'Title': 'Flemington', 'Date': 'Sat, 8th Aug', 'Result Time': '2:05am', 'Result Distance': '2530m\xa0\xa0', 'Race Name': 'TAB Handicap', 'Result Distance Thumbnail': '2530m', 'Track Condition': 'Soft 7', 'Rail': 'Out 10m Entire Circuit\n                    ', 'Track Record': 'Unavailable', 'Price Money': '5,000'}
2021-09-08 22:58:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.racing.com/form/2020-08-08/flemington/race/1/results#/results>
{'Position': None, 'Horse Entry Number': None, 'Horse Full Name': None, 'Horse Barrier Number': None, 'Trainers': None, 'Jockey': None, 'Gear': None, 'WGT': None, 'Price': None, '800m': None, '400m': None, 'Margin': None, 'SP': None}
2021-09-08 22:58:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.racing.com/form/2020-08-08/flemington/race/1/results#/results>
{'Position': '1st', 'Horse Entry Number': '5. ', 'Horse Full Name': 'Exemplar (IRE)', 'Horse Barrier Number': ' (7)', 'Trainers': 'C.Maher & D.Eustace', 'Jockey': 'J.Allen', 'Gear': '1', 'WGT': '56.5kg', 'Price': ',250', '800m': '1st', '400m': '1st', 'Margin': '2:45.74', 'SP': '.00'}
2021-09-08 22:58:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.racing.com/form/2020-08-08/flemington/race/1/results#/results>
{'Position': '2nd', 'Horse Entry Number': '3. ', 'Horse Full Name': 'Double You Tee', 'Horse Barrier Number': ' (6)', 'Trainers': 'P.Payne', 'Jockey': 'W.J.Egan', 'Gear': '0', 'WGT': '57.5kg', 'Price': ',300', '800m': '6th', '400m': '4th', 'Margin': '1.25L', 'SP': '.80'}
2021-09-08 22:58:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.racing.com/form/2020-08-08/flemington/race/1/results#/results>
{'Position': '3rd', 'Horse Entry Number': '6. ', 'Horse Full Name': 'Bertwhistle', 'Horse Barrier Number': ' (4)', 'Trainers': 'D.I.Dodson', 'Jockey': 'L.J.Neindorf', 'Gear': '0', 'WGT': '54kg', 'Price': ',150', '800m': '4th', '400m': '3rd', 'Margin': '4.75L', 'SP': '.00'}
2021-09-08 22:58:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.racing.com/form/2020-08-08/flemington/race/1/results#/results>
{'Position': '4th', 'Horse Entry Number': '7. ', 'Horse Full Name': 'Flag Edition (NZ)', 'Horse Barrier Number': ' (2)', 'Trainers': 'M.Payne', 'Jockey': 'M.Payne', 'Gear': '0', 'WGT': '56kg', 'Price': ',750', '800m': '5th', '400m': '6th', 'Margin': '4.85L', 'SP': '.00'}
2021-09-08 22:58:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.racing.com/form/2020-08-08/flemington/race/1/results#/results>
{'Position': '5th', 'Horse Entry Number': '8. ', 'Horse Full Name': 'Blandford Lad (NZ)', 'Horse Barrier Number': ' (3)', 'Trainers': 'P.Gelagotis', 'Jockey': 'W.T.Price', 'Gear': '2', 'WGT': '53kg', 'Price': ',050', '800m': '7th', '400m': '7th', 'Margin': '5.6L', 'SP': '.00'}
2021-09-08 22:58:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.racing.com/form/2020-08-08/flemington/race/1/results#/results>
{'Position': '6th', 'Horse Entry Number': '4. ', 'Horse Full Name': 'South Pacific (GB)', 'Horse Barrier Number': ' (5)', 'Trainers': 'C.Maher & D.Eustace', 'Jockey': 'D.Oliver', 'Gear': '3', 'WGT': '57.5kg', 'Price': ',700', '800m': '2nd', '400m': '2nd', 'Margin': '5.8L', 'SP': '.95'}
2021-09-08 22:58:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.racing.com/form/2020-08-08/flemington/race/1/results#/results>
{'Position': '7th', 'Horse Entry Number': '1. ', 'Horse Full Name': 'Home By Midnight (NZ)', 'Horse Barrier Number': ' (1)', 'Trainers': 'P.Payne', 'Jockey': 'T.J.Hope', 'Gear': '2', 'WGT': '60kg', 'Price': ',700', '800m': '3rd', '400m': '5th', 'Margin': '6.55L', 'SP': '.00'}
2021-09-08 22:58:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.racing.com/form/2020-08-08/flemington/race/1/results#/results>
{'Position': '\n                ', 'Horse Entry Number': '2. ', 'Horse Full Name': 'Lord Belvedere (GB)', 'Horse Barrier Number': None, 'Trainers': 'C.Maher & D.Eustace', 'Jockey': 'B.J.Melham', 'Gear': '0', 'WGT': '60kg', 'Price': '–', '800m': None, '400m': None, 'Margin': None, 'SP': None}
2021-09-08 22:58:36 [scrapy.core.engine] INFO: Closing spider (finished)
2021-09-08 22:58:36 [scrapy.extensions.feedexport] INFO: Stored csv feed (10 items) in: data1.csv
2021-09-08 22:58:36 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 839,
 'downloader/request_count': 1,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 427762,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 23.855061,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 9, 8, 16, 58, 36, 640162),
 'item_scraped_count': 10,
 'log_count/DEBUG': 86,
 'log_count/INFO': 13,
 'log_count/WARNING': 3,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'splash/execute/request_count': 1,
 'splash/execute/response_count/200': 1,
 'start_time': datetime.datetime(2021, 9, 8, 16, 58, 12, 785101)}
2021-09-08 22:58:36 [scrapy.core.engine] INFO: Spider closed (finished)

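Scrapy does let a callback yield more than once, but each yield produces a separate item, so the fixed fields and the table rows are never combined into one record; the log above shows both kinds of item being scraped. The page's data actually comes from a JSON API response, so it is easier to request that API directly and build each item with exactly the fields you want.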
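If you would rather keep the Splash and XPath approach, one option is to collect the fixed race fields once and merge them into every row before yielding, so that each item carries both parts. The following is a minimal sketch in plain Python, not the original spider: `merge_race_rows` is a hypothetical helper, and the sample values are copied from the log above. In the spider, `race_info` would be built from the part 1 XPaths and each row from the part 2 XPaths.

```python
def merge_race_rows(race_info, rows):
    """Yield one flat dict per table row, each carrying the fixed race fields."""
    for row in rows:
        # Copy the fixed fields first, then add the row-specific ones.
        yield {**race_info, **row}


# Sample values copied from the log output in the question.
race_info = {"Race Number": "1", "Title": "Flemington", "Track Condition": "Soft 7"}
rows = [
    {"Position": "1st", "Horse Full Name": "Exemplar (IRE)", "Jockey": "J.Allen"},
    {"Position": "2nd", "Horse Full Name": "Double You Tee", "Jockey": "W.J.Egan"},
]

items = list(merge_race_rows(race_info, rows))
for item in items:
    print(item)
```

Because every item then has the same set of keys, a CSV export gets one consistent set of columns.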

Here is an example of a working solution:

Code:

import scrapy
import json

class RaceSpider(scrapy.Spider):
    name = 'race'
    
    headers = {
        'accept': 'application/json, text/plain, */*',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'en-US,en;q=0.9,bn;q=0.8,es;q=0.7,ar;q=0.6',
        'origin': 'https://www.racing.com',
        'referer': 'https://www.racing.com/',
        'sec-ch-ua': '"Google Chrome";v="93", " Not;A Brand";v="99", "Chromium";v="93"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
        'sec-fetch-dest': 'empty',
        'sec-fetch-mode': 'cors',
        'sec-fetch-site': 'same-site',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'}

    def start_requests(self):
        yield scrapy.Request(
            url='https://api.racing.com/v1/en-au/meet/details/5162295/',
            callback=self.parse,
            method="GET",
            headers=self.headers)

    def parse(self, response):
        # The API returns JSON; decode it into plain Python dicts and lists.
        data = json.loads(response.body)
        for race in data['raceCollection']:
            for result in race['raceResultsCollection']:
                # One item per runner, combining race-level and runner-level fields.
                yield {
                    'Race Number': race['raceNumber'],
                    'Result Distance Thumbnail': race['distance'],
                    'Title_name': race['name'],
                    # Note: this value is the barrier number, not the finishing position.
                    'Position': result['barrierNumber'],
                    'Horse Full Name': result['horse']['fullName'],
                    'Jockey': result['jockey']['fullName'],
                }

Output:

{'Race Number': 1, 'Result Distance Thumbnail': 2530, 'Title_name': 'TAB Handicap', 'Position': 6, 'Horse Full Name': 'Double You Tee', 'Jockey': 'W.J.Egan'}
2021-09-09 23:22:22 [scrapy.core.scraper] DEBUG: Scraped from <200 https://api.racing.com/v1/en-au/meet/details/5162295/>
{'Race Number': 1, 'Result Distance Thumbnail': 2530, 'Title_name': 'TAB Handicap', 'Position': 4, 'Horse Full Name': 'Bertwhistle', 'Jockey': 'L.J.Neindorf'}
2021-09-09 23:22:22 [scrapy.core.scraper] DEBUG: Scraped from <200 https://api.racing.com/v1/en-au/meet/details/5162295/>
{'Race Number': 1, 'Result Distance Thumbnail': 2530, 'Title_name': 'TAB Handicap', 'Position': 2, 'Horse Full Name': 'Flag Edition (NZ)', 'Jockey': 'M.Payne'}
2021-09-09 23:22:22 [scrapy.core.scraper] DEBUG: Scraped from <200 https://api.racing.com/v1/en-au/meet/details/5162295/>
{'Race Number': 9, 'Result Distance Thumbnail': 1410, 'Title_name': 'Rubaroc Handicap', 'Position': 0, 'Horse Full Name': 'Honorable Mention (NZ)', 'Jockey': 'B.Allen'}
2021-09-09 23:22:22 [scrapy.core.scraper] DEBUG: Scraped from <200 https://api.racing.com/v1/en-au/meet/details/5162295/>
{'Race Number': 9, 'Result Distance Thumbnail': 1410, 'Title_name': 'Rubaroc Handicap', 'Position': 0, 'Horse Full Name': 'Copper Fox', 'Jockey': 'G.J.Cartwright'}
2021-09-09 23:22:22 [scrapy.core.scraper] DEBUG: Scraped from <200 https://api.racing.com/v1/en-au/meet/details/5162295/>
{'Race Number': 9, 'Result Distance Thumbnail': 1410, 'Title_name': 'Rubaroc Handicap', 'Position': 0, 'Horse Full Name': 'Muswellbrook', 'Jockey': 'J.Mott'}
2021-09-09 23:22:22 [scrapy.core.engine] INFO: Closing spider (finished)
2021-09-09 23:22:22 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 605,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 19582,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 5.144949,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 9, 9, 17, 22, 22, 334263),
 'httpcompression/response_bytes': 205617,
 'httpcompression/response_count': 1,
 'item_scraped_count': 100,

... and so on
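To check the flattening logic without hitting the network, the nested loop can be run against a small hand-written payload. This is only a sketch under stated assumptions: `SAMPLE` is a hypothetical, trimmed-down payload that contains just the keys used in the spider above (with values taken from the output above), and `flatten_meet` is a hypothetical helper name. The real API response contains many more fields.

```python
import json

# Hypothetical, trimmed-down payload: only the keys used in the spider,
# with values taken from the output shown in the answer.
SAMPLE = json.dumps({
    "raceCollection": [
        {
            "raceNumber": 1,
            "distance": 2530,
            "name": "TAB Handicap",
            "raceResultsCollection": [
                {"barrierNumber": 6,
                 "horse": {"fullName": "Double You Tee"},
                 "jockey": {"fullName": "W.J.Egan"}},
                {"barrierNumber": 4,
                 "horse": {"fullName": "Bertwhistle"},
                 "jockey": {"fullName": "L.J.Neindorf"}},
            ],
        }
    ]
})


def flatten_meet(payload):
    """Yield one flat dict per runner from a meet-details JSON string."""
    data = json.loads(payload)
    for race in data.get("raceCollection", []):
        for result in race.get("raceResultsCollection", []):
            yield {
                "Race Number": race.get("raceNumber"),
                "Distance": race.get("distance"),
                "Race Name": race.get("name"),
                "Barrier Number": result.get("barrierNumber"),
                # Tolerate a missing or null nested object.
                "Horse Full Name": (result.get("horse") or {}).get("fullName"),
                "Jockey": (result.get("jockey") or {}).get("fullName"),
            }


flat_items = list(flatten_meet(SAMPLE))
for item in flat_items:
    print(item)
```

Using `.get()` here is a design choice for the sketch: a missing key yields `None` instead of raising `KeyError`, which may or may not be what you want in a production spider.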