Different output when scraping RSS with requests.get and BeautifulSoup
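I want to scrape data from the source of this link: https://news.ycombinator.com/rss. It contains markup of the form <link>the URL</link> (I could not paste the real markup here).
However, with the code below, printing the link gives '<link/>the URL', and the JSON file has no content for the key 'link'.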
import requests
import bs4
from bs4 import BeautifulSoup
import json
import html5lib

def rss(x):
    r = requests.get(x)
    s = BeautifulSoup(r.content, features='html5lib')
    the_list = []
    for i in s.find_all('item'):
        title = i.find('title').text
        link = i.find('link').text
        date = i.find('pubdate').text
        article = {
            'title' : title,
            'link' : link,
            'date' : date
        }
        the_list.append(article)
    with open('the_list.json','w') as f:
        json.dump(the_list,f)

rss('https://news.ycombinator.com/rss')
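What happens?
As you already mentioned, the structure in the soup is not what you expect. html5lib parses the feed as HTML, where <link> is a void element that cannot have content. The tag is therefore rebuilt as an empty <link/>, the URL becomes a separate text node right after it, and the closing </link> is dropped. That is why the .text attribute of the tag gives you nothing.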
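The behaviour can be reproduced offline. This is a minimal sketch, assuming bs4 and html5lib are installed; the one-item feed in SAMPLE and its example.com URL are made up for illustration and only stand in for the real response.

```python
from bs4 import BeautifulSoup

# Hypothetical one-item feed, standing in for the real RSS response.
SAMPLE = (
    '<rss version="2.0"><channel><item>'
    '<title>Example post</title>'
    '<link>https://example.com/post</link>'
    '<pubDate>Thu, 14 Oct 2021 13:31:43 +0000</pubDate>'
    '</item></channel></rss>'
)

soup = BeautifulSoup(SAMPLE, features='html5lib')
link = soup.find('item').find('link')

print(str(link))               # the tag as the HTML parser rebuilt it
print(repr(link.text))         # the text inside the tag
print(str(link.next_sibling))  # the node that follows the tag
```

With an HTML parser the first line should show an empty <link/>, the second an empty string, and the third the URL.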
The good news is that there is a workaround.
How to fix?
Just select the next_sibling of the <link> element, which holds the URL text:
i.find('link').next_sibling
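Put together, the loop from the question might look like the sketch below. The function name parse_items, the SAMPLE feed, and the fallback to an empty string are my own assumptions, not part of the original code; fetching is left out so the parsing can be tried without a network connection.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical one-item feed, standing in for the real RSS response.
SAMPLE = (
    '<rss version="2.0"><channel><item>'
    '<title>Example post</title>'
    '<link>https://example.com/post</link>'
    '<pubDate>Thu, 14 Oct 2021 13:31:43 +0000</pubDate>'
    '</item></channel></rss>'
)

def parse_items(content):
    """Return a list of {'title', 'link', 'date'} dicts from RSS markup."""
    soup = BeautifulSoup(content, features='html5lib')
    articles = []
    for item in soup.find_all('item'):
        link_tag = item.find('link')
        # html5lib leaves <link> empty, so the URL is the text node after it.
        sibling = link_tag.next_sibling if link_tag is not None else None
        # Assumption: fall back to '' when no text node follows the tag.
        url = str(sibling).strip() if isinstance(sibling, NavigableString) else ''
        articles.append({
            'title': item.find('title').text,
            'link': url,
            'date': item.find('pubdate').text,  # html5lib lowercases tag names
        })
    return articles

articles = parse_items(SAMPLE)
print(articles)
```

For the live feed, passing requests.get(url).content to parse_items and writing the result with json.dump should give a file like the one shown under Output.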
Output
[{"title": "Gitlab from YC to IPO", "link": "https://blog.ycombinator.com/gitlab-from-yc-to-ipo/", "date": "Thu, 14 Oct 2021 13:31:43 +0000"}, {"title": "Apple Joins Blender Development Fund", "link": "https://www.blender.org/press/apple-joins-blender-development-fund/", "date": "Thu, 14 Oct 2021 14:48:59 +0000"}, {"title": "Sunset Geometry (2016)", "link": "https://www.shapeoperator.com/2016/12/12/sunset-geometry/", "date": "Thu, 14 Oct 2021 14:29:08 +0000"}, {"title": "iPhone Macro: A Big Day for Small Things", "link": "https://lux.camera/iphone-macro-camera-a-big-day-for-small-things/", "date": "Mon, 11 Oct 2021 10:22:06 +0000"}, {"title": "Michelin Airless", "link": "https://www.michelin.com/en/innovation/vision-concept/airless/", "date": "Thu, 14 Oct 2021 14:36:58 +0000"}, {"title": "Release (YC W20) Is Hiring \u2013 Product Marketing Manager", "link": "https://releasehub.com/company#hire", "date": "Thu, 14 Oct 2021 17:00:15 +0000"}, {"title": "Global Climate Report \u2013 September 2021", "link": "https://www.ncdc.noaa.gov/sotc/global/202109", "date": "Thu, 14 Oct 2021 14:49:59 +0000"}, {"title": "Esbuild \u2013 An extremely fast JavaScript bundler", "link": "https://esbuild.github.io/", "date": "Thu, 14 Oct 2021 05:07:27 +0000"}, {"title": "Small Language Models Are Also Few-Shot Learners", "link": "https://aclanthology.org/2021.naacl-main.185/", "date": "Tue, 12 Oct 2021 09:59:34 +0000"}, {"title": "Who was Aleph Null? (2013)", "link": "http://bit-player.org/2013/who-was-aleph-null", "date": "Mon, 11 Oct 2021 08:35:29 +0000"}, {"title": "Hands-On Rust: Effective Learning Through 2D Game Development and Play", "link": "https://pragprog.com/titles/hwrust/hands-on-rust/", "date": "Thu, 14 Oct 2021 07:59:24 +0000"}, {"title": "Ask HN: What's the Point of Life?", "link": "https://news.ycombinator.com/item?id=28866558", "date": "Thu, 14 Oct 2021 16:38:15 +0000"}, {"title": "What I wish I knew when learning F#", "link": "https://danielbachler.de/2020/12/23/what-i-wish-i-knew-when-learning-fsharp.html", "date": "Thu, 14 Oct 2021 12:07:40 +0000"}, {"title": "Investing in Startups by Passing the Series 65", "link": "https://www.natecation.com/accredited-investor-investing-startups-series-65/", "date": "Wed, 13 Oct 2021 17:57:25 +0000"}, {"title": "OpenBSD 7.0", "link": "https://www.openbsd.org/70.html", "date": "Thu, 14 Oct 2021 10:24:21 +0000"}, {"title": "Countries are gathering in an effort to stop a biodiversity collapse", "link": "https://www.nytimes.com/2021/10/14/climate/un-biodiversity-conference-climate-change.html", "date": "Thu, 14 Oct 2021 13:32:00 +0000"}, {"title": "Alden Global Capital, the secretive hedge fund gutting newsrooms", "link": "https://www.theatlantic.com/magazine/archive/2021/11/alden-global-capital-killing-americas-newspapers/620171/", "date": "Thu, 14 Oct 2021 15:17:06 +0000"}, {"title": "Child suicides in Japan hit record high", "link": "https://www3.nhk.or.jp/nhkworld/en/news/20211013_19/", "date": "Thu, 14 Oct 2021 08:52:39 +0000"}, {"title": "Every search bar looks like a URL bar to users", "link": "https://shkspr.mobi/blog/2021/10/every-search-bar-looks-like-a-url-bar-to-users/", "date": "Thu, 14 Oct 2021 13:27:58 +0000"}, {"title": "Psychonetics: A nerd's toolset to work with mind and perception", "link": "http://deconcentration-of-attention.com/psychonetics.html", "date": "Tue, 12 Oct 2021 11:28:43 +0000"}, {"title": "FB seals off some internal message boards to prevent leaking, immediately leaked", "link": "https://www.businessinsider.com/facebook-whistleblower-leaks-restricts-staff-access-message-boards-elections-safety-2021-10", "date": "Thu, 14 Oct 2021 11:09:08 +0000"}, {"title": "Working around expired root certificates", "link": "https://scotthelme.co.uk/should-clients-care-about-the-expiration-of-a-root-certificate/", "date": "Mon, 11 Oct 2021 21:27:27 +0000"}, {"title": "An unprecedented wave of online bank fraud is hitting Britain", "link": "https://www.reuters.com/world/uk/welcome-britain-bank-scam-capital-world-2021-10-14/", "date": "Thu, 14 Oct 2021 09:57:39 +0000"}, {"title": "Interoperable Serendipity", "link": "https://noeldemartin.com/blog/interoperable-serendipity", "date": "Wed, 13 Oct 2021 12:02:37 +0000"}, {"title": "Instagram took down post with figure from paper showing male advantage in sports", "link": "https://twitter.com/SwipeWright/status/1448064426670583814", "date": "Thu, 14 Oct 2021 16:36:30 +0000"}, {"title": "IoT hacking and rickrolling my high school district", "link": "https://whitehoodhacker.net/posts/2021-10-04-the-big-rick", "date": "Tue, 12 Oct 2021 19:38:06 +0000"}, {"title": "Boeing says certain 787 parts improperly manufactured", "link": "https://www.reuters.com/business/aerospace-defense/boeing-deals-with-new-defect-787-dreamliner-wsj-2021-10-14/", "date": "Thu, 14 Oct 2021 13:26:46 +0000"}, {"title": "Practice Problems for Hardware Engineers", "link": "https://arxiv.org/abs/2110.06526", "date": "Thu, 14 Oct 2021 03:48:24 +0000"}, {"title": "Interface ergonomics: automation isn't just about time saved", "link": "https://macoy.me/blog/programming/InterfaceFriction", "date": "Wed, 13 Oct 2021 01:05:52 +0000"}, {"title": "Syncthing \u2013 a continuous file synchronization program", "link": "https://syncthing.net/", "date": "Thu, 14 Oct 2021 01:23:19 +0000"}]
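A different route, not the one used in this answer, is to parse the feed as XML so that <link> keeps its text. The sketch below swaps BeautifulSoup for the standard library's xml.etree.ElementTree; the name parse_items_xml and the SAMPLE feed are hypothetical, and it assumes the feed is well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-item feed, standing in for the real RSS response.
SAMPLE = (
    '<rss version="2.0"><channel><item>'
    '<title>Example post</title>'
    '<link>https://example.com/post</link>'
    '<pubDate>Thu, 14 Oct 2021 13:31:43 +0000</pubDate>'
    '</item></channel></rss>'
)

def parse_items_xml(content):
    """Return a list of {'title', 'link', 'date'} dicts using an XML parser."""
    root = ET.fromstring(content)
    return [
        {
            'title': item.findtext('title', default=''),
            'link': item.findtext('link', default=''),
            # XML tag names are case-sensitive, so this is pubDate, not pubdate.
            'date': item.findtext('pubDate', default=''),
        }
        for item in root.iter('item')
    ]

articles_xml = parse_items_xml(SAMPLE)
print(articles_xml)
```

Because an XML parser has no notion of void elements, the URL stays inside <link> and no next_sibling lookup is needed.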