How to extract specific text with no tag using Python Scrapy? (new problem)
I am using Scrapy to extract target text from HTML like the following:
My Scrapy code is:
    import scrapy
    from scrapy.crawler import CrawlerProcess


    class MmSpider(scrapy.Spider):
        name = 'name'
        start_urls = ['file:///Users/saihhold/Desktop/maimai.mht']

        def parse(self, response):
            for title in response.xpath('//div[@class="media-body"]/div/div[1]'):
                yield {
                    title.xpath('.//text()').getall()
                }


    if __name__ == "__main__":
        process = CrawlerProcess()
        process.crawl(MmSpider)
        process.start()
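Then I run it with this command:
    scrapy runspider mmspider.py -o mm.jl
But the mm.jl file is empty. Is there something wrong with my code or my XPath?
Your code is fine, but the XPath selector is incorrect. The following example shows how to get titles with XPath.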
    import scrapy
    from scrapy.crawler import CrawlerProcess


    class MmSpider(scrapy.Spider):
        name = 'name'
        start_urls = ['https://www.timeout.com/film/best-movies-of-all-time']

        def parse(self, response):
            # Each matching <h3> holds one film title.
            for title in response.xpath('//h3[@class="_h3_cuogz_1"]'):
                yield {
                    # Take the last text node and strip non-breaking spaces.
                    'title': title.xpath('.//text()').getall()[-1].replace('\xa0', '')
                }


    if __name__ == "__main__":
        process = CrawlerProcess()
        process.crawl(MmSpider)
        process.start()
Output:
{'title': '2001: A Space Odyssey (1968)'}
2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
{'title': 'The Godfather (1972)'}
2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
{'title': 'Citizen Kane (1941)'}
2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
{'title': 'Jeanne Dielman, 23, Quai du Commerce, 1080 Bruxelles (1975)'}
2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
{'title': 'Raiders of the Lost Ark (1981)'}
2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
{'title': 'La Dolce Vita (1960)'}
2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
{'title': 'Seven Samurai (1954)'}
2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
{'title': 'In the Mood for Love (2000)'}
2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
{'title': 'There Will Be Blood (2007)'}
... and so on
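Besides the selector, the working spider differs from the original in how it builds each item: it yields a dict keyed by 'title' and keeps only the last text node, with non-breaking spaces removed. The sketch below isolates that step in plain Python so it can be tried without running a crawl. The sample `text_nodes` list and the `build_item` helper are assumptions made for illustration; the real nodes returned by `getall()` on that page may look different.

```python
# Hypothetical text nodes, standing in for title.xpath('.//text()').getall()
# on a single <h3>. The actual page may split the text differently.
text_nodes = ['1.', '\xa02001: A Space Odyssey (1968)']


def build_item(nodes):
    """Build a dict item from the last text node, with NBSPs removed."""
    if not nodes:
        return None  # nothing matched, so there is nothing to yield
    return {'title': nodes[-1].replace('\xa0', '')}


item = build_item(text_nodes)
print(item)

# Braces around a bare list form a set literal, not a dict. A list is
# unhashable, so building the set raises TypeError.
set_literal_failed = False
try:
    bad_item = {text_nodes}
except TypeError as exc:
    set_literal_failed = True
    print('set literal failed:', exc)
```

In the original spider the loop body probably never ran because the selector matched nothing, which would explain an empty mm.jl with no error. Once the selector does match, the keyless `yield { ... }` would likely need a key such as 'title' as well.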