Python Scrapy: getting the article body, extract_first() returns None
I am trying to use Scrapy to get the article body from a news website.
import scrapy
import sys
import json

class ReutersPage(scrapy.Spider):
    name = "reutersPage"
    start_urls = [
        'https://www.reuters.com/article/chile-sqm-stocks/lithium-miner-sqm-shares-up-2-7-pct-chile-court-clears-way-for-tianqi-stake-purchase-idUSC0N1OX01C'
    ]

    def parse(self, response):
        articleBody = response.css('div.StandardArticleBody_body::text').extract_first()
        print('######## Article body ##########')
        print(articleBody)
        yield {
            'body': articleBody
        }
I try to get the text inside the div StandardArticleBody_body, but I always get None.
The output is:
2018-10-26 14:23:44 [scrapy.core.engine] INFO: Spider opened
2018-10-26 14:23:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-10-26 14:23:44 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-10-26 14:23:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.reuters.com/robots.txt> (referer: None)
2018-10-26 14:23:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.reuters.com/article/chile-sqm-stocks/lithium-miner-sqm-shares-up-2-7-pct-chile-court-clears-way-for-tianqi-stake-purchase-idUSC0N1OX01C> (referer: None)
######## Parse article ##########
######## Article body ##########
None
2018-10-26 14:23:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.reuters.com/article/chile-sqm-stocks/lithium-miner-sqm-shares-up-2-7-pct-chile-court-clears-way-for-tianqi-stake-purchase-idUSC0N1OX01C>
{'body': None}
2018-10-26 14:23:45 [scrapy.core.engine] INFO: Closing spider (finished)
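No text belongs directly to the div you selected; it all belongs to the div's descendants. Putting a space between the selector path and ::text selects the text of all descendants, not only the text of the node you selected.
Try this: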
articleBody = response.css('div.StandardArticleBody_body ::text').extract_first()
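This way you select the text of the div's descendants. Note that extract_first() still returns only the first matching text node; call extract() instead to get all of them.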
In [27]: response.css('div.StandardArticleBody_body > p::text').extract()
Out[27]:
['SANTIAGO, Oct 26 (Reuters) - Shares in lithium miner SQM jumped 2.7 percent on Friday after Chile’s Constitutional Court rejected a lawsuit to block Chinese miner Tianqi Lithium Corp’s .1 billion purchase of a stake in the Chilean lithium miner. ',
'SQM’s B-series shares touched 29,400 pesos (.55) at the open of Santiago’s Stock Exchange. ']