Scraping news titles from news websites
I have been trying to scrape news titles from news websites. For this I came across two Python libraries, namely newspaper and beautifulsoup4. Using the beautifulsoup library I have been able to get all the links from a particular news website that lead to news articles.
With the code below I have been able to extract the title of a news article from a single link.
from newspaper import Article

url = "https://www.ndtv.com/india-news/tamil-nadu-government-reverses-decision-to-reopen-schools-from-november-16-for-classes-9-12-news-agency-pti-2324199"
article = Article(url)
article.download()
article.parse()
print(article.title)
I would like to combine the code from the two libraries, newspaper and beautifulsoup4, so that every link output by the beautifulsoup code is passed as the url to the newspaper library, and I get the titles for all of the links.
Below is the beautifulsoup code with which I can extract all the links to the news articles.
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests
parser = 'html.parser' # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get("https://www.ndtv.com/coronavirus?pfrom=home-mainnavgation")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)
for link in soup.find_all('a', href=True):
    print(link['href'])
Do you mean something like this?
links = []
for link in soup.find_all('a', href=True):
    links.append(link['href'])

for link in links:
    article = Article(link)
    article.download()
    article.parse()
    print(article.title)
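One caveat with the loop above: many of the hrefs on a page like this are relative (e.g. /india-news/...), duplicated, or fragments, and Article.download() will fail on those. A minimal sketch of a pre-filter, assuming you resolve relative links against the page URL with the standard-library urljoin (the function name normalize_links is my own, not part of either library):

```python
from urllib.parse import urljoin

def normalize_links(hrefs, base_url):
    """Resolve relative hrefs against the page URL, keep only http(s)
    links, and drop duplicates while preserving order."""
    seen = set()
    out = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # "/india-news/x" -> full URL
        if absolute.startswith("http") and absolute not in seen:
            seen.add(absolute)
            out.append(absolute)
    return out
```

You would then loop over normalize_links([a['href'] for a in soup.find_all('a', href=True)], resp.url) instead of the raw links list, and it is probably worth wrapping each article.download()/article.parse() pair in a try/except so that one bad link does not abort the whole run.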