How do I grab the headline titles from the Google News webpage with Scrapy?
I saved an offline copy of https://news.google.com/search?q=amazon&hl=en-US&gl=US&ceid=US%3Aen
but can't figure out how to get the titles of the listed articles.
import scrapy


class newsSpider(scrapy.Spider):
    name = "news"
    start_urls = ['file:///127.0.0.1/home/toni/Desktop/crawldeez/googlenewsoffline.html/'
    ]

    def parse(self, response):
        for xrnccd in response.css('a.MQsxIb.xTewfe.R7GTQ.keNKEd.j7vNaf.Cc0Z5d.EjqUne'):
            yield {
                'ipQwMb.ekueJc.RD0gLb': xrnccd.css('h3.ipQwMb.ekueJc.RD0gLb::ipQwMb.ekueJc.RD0gLb').get(),
            }
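The problem seems to be that the page content is rendered dynamically with JavaScript, so it cannot be extracted from the HTML with the css or xpath methods. However, it is present in the response body, so you can extract it with a regular expression. Here is a Scrapy shell session that shows how: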
$ scrapy shell "https://news.google.com/search?q=amazon&hl=en-US&gl=US&ceid=US%3Aen"
...
>>> import re
>>> from pprint import pprint
>>>
>>> titles = re.findall(r'<h3 class="[^"]+?"><a[^>]+?>(.+?)</a>', response.text)
>>> pprint(titles)
['Amazon will no longer sell Chinese goods in China',
'YouTube is finally coming back to Amazon’s Fire TV devices',
'Amazon Plans to Use Digital Media to Expand Its Advertising Business',
'Amazon flooded with fake reviews; Learn how to spot them',
"How To Win in Today's Amazon World",
'Amazon Day: How to schedule Amazon deliveries',
'Bezos Disputes Amazon’s Market Power. But His Merchants Feel the Pinch',
'20 Best Action Movies to Stream on Amazon Prime',
...]
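The same regular expression can be wrapped in a small helper so it is reusable outside the shell. The following is only a sketch: `extract_titles` and the `sample` markup are made up for illustration (the real page's class names differ and can change at any time), and decoding entities with `html.unescape` is an addition that the session above does not perform.

```python
import re
from html import unescape

# Same pattern as in the shell session: capture the link text inside each <h3>.
TITLE_RE = re.compile(r'<h3 class="[^"]+?"><a[^>]+?>(.+?)</a>')


def extract_titles(html_text):
    """Return the headline strings found in html_text, with HTML entities decoded."""
    return [unescape(title) for title in TITLE_RE.findall(html_text)]


# Hypothetical markup, invented for this example; it is not copied from Google News.
sample = (
    '<h3 class="abc def"><a href="./articles/1" class="x">First &amp; foremost</a></h3>'
    '<h3 class="abc def"><a href="./articles/2" class="x">Today&#39;s headline</a></h3>'
)

titles = extract_titles(sample)
print(titles)
```

Inside a spider's `parse` method, the helper could presumably be called as `extract_titles(response.text)`, yielding one item per title. Keep in mind that a regex tied to the page's markup is brittle and will likely need adjusting whenever the markup changes.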