Cannot find a class using a CSS selector in Scrapy
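I'm testing whether I can scrape a website with Scrapy. I get a response from the site, but I can't access the elements or data I want. I believe my selector is correct; although I'm a Scrapy beginner, I don't think there is anything wrong with the commands.
I want to get the tags with the class results-race-name.
I ran it through the Scrapy shell.
In the shell, I used the following commands: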
In [1]: fetch('https://greyhoundbet.racingpost.com/#results-list/r_date=2021-01-01/')
2022-01-07 15:08:58 [scrapy.core.engine] INFO: Spider opened
2022-01-07 15:09:01 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://greyhoundbet.racingpost.com/robots.txt> (referer: None)
2022-01-07 15:09:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://greyhoundbet.racingpost.com/#results-list/r_date=2021-01-01/> (referer: None)
In [2]: view(response)
Out[2]: True
In [3]: response.css('.results-race-name').extract()
Out[3]: []
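Note: view(response) shows the page only up to the loading logo.
This isn't a CSS problem. The data is created dynamically, so it is not in the HTML that Scrapy downloads. You can get it from the JSON file instead (open DevTools in your browser, click the Network tab, find the JSON request, and take what you need from it).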
In [1]: req = scrapy.Request('https://greyhoundbet.racingpost.com/results/blocks.sd?r_date=2021-01-01&blocks=header%2Cmeetings')
In [2]: fetch(req)
[scrapy.core.engine] INFO: Spider opened
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://greyhoundbet.racingpost.com/results/blocks.sd?r_date=2021-01-01&blocks=header%2Cmeetings> (referer: None)
In [3]: json_data = response.json()
In [4]: for data in json_data['meetings']['tracks']['1']['races']:
   ...:     print(data['track'])
   ...:
Newcastle
Swindon
Kinsley
In [5]: for data in json_data['meetings']['tracks']['2']['races']:
   ...:     print(data['track'])
   ...:
Monmore
Crayford
Hove
Harlow
Henlow
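For context on why the first fetch only reached the loading page: everything after the # in a URL is a fragment, which the client keeps to itself and never sends to the server, so Scrapy only ever requested the site root. The sketch below shows that split with the standard library, and also builds the JSON endpoint URL from a date. The helper name results_url is hypothetical, and whether the endpoint accepts other dates in the same format is an assumption based only on the single URL used above.

```python
from urllib.parse import urlencode, urlsplit

PAGE_URL = "https://greyhoundbet.racingpost.com/#results-list/r_date=2021-01-01/"

parts = urlsplit(PAGE_URL)
# The fragment is handled client-side (by the page's JavaScript);
# dropping it gives the URL that is actually requested from the server.
requested = parts._replace(fragment="").geturl()


def results_url(r_date: str) -> str:
    """Build the JSON endpoint URL for a date (hypothetical helper).

    Assumes the endpoint takes r_date as YYYY-MM-DD, as in the URL above.
    """
    # urlencode percent-encodes the comma, giving blocks=header%2Cmeetings.
    query = urlencode({"r_date": r_date, "blocks": "header,meetings"})
    return f"https://greyhoundbet.racingpost.com/results/blocks.sd?{query}"


print(parts.fragment)
print(requested)
print(results_url("2021-01-01"))
```

For 2021-01-01 the helper reproduces the exact URL passed to scrapy.Request above.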
Edit:
spider.py:
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "exampleSpider"
    # Request the JSON endpoint directly instead of the JavaScript-rendered page.
    start_urls = ['https://greyhoundbet.racingpost.com/results/blocks.sd?r_date=2021-01-01&blocks=header%2Cmeetings']

    def parse(self, response):
        json_data = response.json()
        # Entries '1' and '2' under 'tracks' each hold a list of races.
        for data in json_data['meetings']['tracks']['1']['races']:
            yield {'race': data['track']}
        for data in json_data['meetings']['tracks']['2']['races']:
            yield {'race': data['track']}
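The two loops in parse hardcode the keys '1' and '2'. If other dates return a different set of keys under 'tracks', a loop over whatever is present may be more robust. The following is only a sketch: iter_track_names is a hypothetical helper, the nesting it expects is inferred from the shell session above, and the sample dictionary is a hand-written stand-in rather than real site data.

```python
from typing import Any, Dict, Iterator


def iter_track_names(json_data: Dict[str, Any]) -> Iterator[str]:
    """Yield the 'track' value of every race under every key in 'tracks'.

    Assumes the layout meetings -> tracks -> <key> -> races -> [{'track': ...}].
    Missing levels are treated as empty instead of raising KeyError.
    """
    tracks = json_data.get("meetings", {}).get("tracks", {})
    for group in tracks.values():
        for race in group.get("races", []):
            name = race.get("track")
            if name is not None:
                yield name


# Hand-written stand-in for response.json(), not real data from the site.
sample = {
    "meetings": {
        "tracks": {
            "1": {"races": [{"track": "Newcastle"}, {"track": "Swindon"}]},
            "2": {"races": [{"track": "Monmore"}]},
        }
    }
}

names = list(iter_track_names(sample))
print(names)
```

Inside the spider, parse could then be reduced to a single loop: for name in iter_track_names(response.json()): yield {'race': name}.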
main.py:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

if __name__ == "__main__":
    spider = 'exampleSpider'  # spider name, looked up in the Scrapy project
    settings = get_project_settings()
    # Send a browser-like User-Agent header with every request.
    settings['USER_AGENT'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    process = CrawlerProcess(settings)
    process.crawl(spider)
    process.start()  # blocks until the crawl finishes