Scrapy Python: can't extract links with a more stable XPath
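I am building a scraper for this website. I am using Python and the Scrapy shell to extract the data I want. The XPath would be: //a[@class="sb-card sb-card-company site-1x1 with-hover"]/@href
Using response.xpath('//a[@class="sb-card sb-card-company site-1x1 with-hover"]/@href')
returns []
I also tried contains(@class, "sb-card-company"),
with the same result. Targeting other containers the same way changes nothing, and trying different pages has no effect either. Using hard-coded nodes does work, but I'm curious what I'm doing wrong.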
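This is not an XPath problem. It is a dynamically loaded content problem: the company cards appear to be added by JavaScript from an API response, so they are not in the HTML that Scrapy downloads.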
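To make the difference concrete, here is a minimal sketch using only the standard library. Both HTML snippets are made up for illustration (they are not copied from the real page), and `card_links` is a hypothetical helper: the same class-based lookup finds the link in browser-rendered markup but returns an empty list for the bare page shell a crawler would typically receive.

```python
from html.parser import HTMLParser


class CardLinkParser(HTMLParser):
    """Collect href values of <a> tags whose class list contains a given token."""

    def __init__(self, class_token):
        super().__init__()
        self.class_token = class_token
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if self.class_token in classes and attr.get("href"):
            self.hrefs.append(attr["href"])


def card_links(html, class_token="sb-card-company"):
    """Hypothetical helper: return card links found in an HTML string."""
    parser = CardLinkParser(class_token)
    parser.feed(html)
    return parser.hrefs


# Assumed example of what the browser shows after JavaScript has run.
RENDERED_HTML = (
    '<div id="app">'
    '<a class="sb-card sb-card-company site-1x1 with-hover" '
    'href="/organization/creditshelf/">creditshelf</a>'
    "</div>"
)

# Assumed example of the raw download: an empty container plus a script tag.
RAW_HTML = '<div id="app"></div><script src="/bundle.js"></script>'

print(card_links(RENDERED_HTML))
print(card_links(RAW_HTML))
```

In the Scrapy shell, `view(response)` opens the page exactly as Scrapy downloaded it, which is a quick way to see whether the elements you are selecting are there at all.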
Here is an example of how to get the links from the JSON data instead:
scrapy shell
In [1]: url='https://www.startbase.de/api/companies/?format=json&display=small&sort=company.startbase_score&sort-direction=desc&page=1&limit=21&filters={%22company.type%22:%22startup%22,%22startup_profile.industry_id%22:[10]}'
In [2]: headers = {
...: "Accept": "application/json",
...: "Accept-Encoding": "gzip, deflate, br",
...: "Accept-Language": "en-US,en;q=0.5",
...: "Cache-Control": "no-cache",
...: "Connection": "keep-alive",
...: "Content-Type": "application/json",
...: "DNT": "1",
...: "Host": "www.startbase.de",
...: "Pragma": "no-cache",
...: "Referer": "https://www.startbase.de/startups/?listOptions%5Bcompany-startup%5D=%7B%22version%22%3A1.3%2C%22sort%22%3A%22company.startbase_score%22%2C%22sortDirection%22%3A%22desc%22%2C%22display%22%3A%22small%22%2C%22itemsPerPage%22%3A21%2C%22page%22%3A1%2C%22userLocation%22%3Anull%2C%22filters%22%3A%7B%22startup_profile.industry_id%22%3A%5B10%5D%7D%7D",
...: "Sec-Fetch-Dest": "empty",
...: "Sec-Fetch-Mode": "cors",
...: "Sec-Fetch-Site": "same-origin",
...: "Sec-GPC": "1",
...: "TE": "trailers",
...: "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
...: "X-KL-Ajax-Request": "Ajax_Request"
...: }
In [3]: req = scrapy.Request(url=url, headers=headers)
In [4]: fetch(req)
2021-10-16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.startbase.de/api/companies/?format=json&display=small&sort=company.startbase_score&sort-direction=desc&page=1&limit=21&filters=%7B%22company.type%22:%22startup%22,%22startup_profile.industry_id%22:[10]%7D> (referer: https://www.startbase.de/startups/?listOptions%5Bcompany-startup%5D=%7B%22version%22%3A1.3%2C%22sort%22%3A%22company.startbase_score%22%2C%22sortDirection%22%3A%22desc%22%2C%22display%22%3A%22small%22%2C%22itemsPerPage%22%3A21%2C%22page%22%3A1%2C%22userLocation%22%3Anull%2C%22filters%22%3A%7B%22startup_profile.industry_id%22%3A%5B10%5D%7D%7D)
In [5]: json_data = response.json()
In [6]: for company in json_data['body']['items']:
...: print(company['company.url'])
...:
/organization/creditshelf/
/organization/amafin-gmbh/
/organization/fincompare/
/organization/epap/
/organization/clearvat/
/organization/51nodes/
/organization/altruja-gmbh/
/organization/flexvelop/
/organization/coin-analyst-ug/
/organization/caya/
/organization/rubarb/
/organization/memrange/
/organization/sevdesk-sevenit/
/organization/getsafe/
/organization/xavin/
/organization/giromatch/
/organization/digi-bel-projekt-von-meeting-minds/
/organization/digioptions/
/organization/trafinscout/
/organization/tangany-gmbh/
/organization/kiwi-financial-living/
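The paths printed above are relative, and the request covers only `page=1`. As a rough sketch of how the same idea could be extended, the helpers below build the API URL for any page and turn the returned paths into absolute links. The function names, `BASE_URL`, and `API_PATH` are hypothetical; the query parameters and the `body` / `items` / `company.url` keys are taken from the session above, and it is an assumption that the API accepts the filters in this fully percent-encoded form and paginates through the `page` parameter.

```python
import json
from urllib.parse import urlencode, urljoin

BASE_URL = "https://www.startbase.de"  # site root, taken from the request above
API_PATH = "/api/companies/"           # endpoint path, taken from the request above


def build_api_url(page, limit=21, industry_id=10):
    """Build the listing API URL for one page (assumes `page` drives pagination)."""
    filters = {
        "company.type": "startup",
        "startup_profile.industry_id": [industry_id],
    }
    params = {
        "format": "json",
        "display": "small",
        "sort": "company.startbase_score",
        "sort-direction": "desc",
        "page": page,
        "limit": limit,
        # Compact JSON so the encoded filter contains no spaces.
        "filters": json.dumps(filters, separators=(",", ":")),
    }
    return urljoin(BASE_URL, API_PATH) + "?" + urlencode(params)


def extract_company_urls(payload):
    """Return absolute company URLs from a decoded API response."""
    items = payload.get("body", {}).get("items", [])
    return [
        urljoin(BASE_URL, item["company.url"])
        for item in items
        if item.get("company.url")
    ]


# Tiny hand-written payload in the same shape as the response above.
sample = {"body": {"items": [{"company.url": "/organization/creditshelf/"}]}}
print(build_api_url(page=2))
print(extract_company_urls(sample))
```

Inside a spider, the callback could pass `response.json()` to `extract_company_urls` and yield a new request for `build_api_url(page + 1)` until `items` comes back empty.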