Web scraping news articles and keyword search
I have code that fetches the titles of news articles from web pages. I use a for loop in which I get the titles from 4 news websites. I also implemented a word search that tells me the number of articles that use the word "coronavirus". I want the word search to tell me the number of articles containing the word "coronavirus" on each website separately. Right now the output I get is the number of times "coronavirus" is used across all the websites combined. Please help me, I have to submit this project soon.
Here is the code:
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
from newspaper import Article
import requests

URL = ["https://www.timesnownews.com/coronavirus", "https://www.indiatoday.in/coronavirus", "https://www.ndtv.com/coronavirus?pfrom=home-mainnavigation"]

for url in URL:
    parser = 'html.parser'
    resp = requests.get(url)
    http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
    html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
    encoding = html_encoding or http_encoding
    soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)
    links = []
    for link in soup.find_all('a', href=True):
        if "javascript" in link["href"]:
            continue
        links.append(link['href'])
    count = 0
    for link in links:
        try:
            article = Article(link)
            article.download()
            article.parse()
            print(article.title)
            if "COVID" in article.title or "coronavirus" in article.title or "Coronavirus" in article.title or "Covid-19" in article.title or "COVID-19" in article.title:
                count += 1
        except:
            pass

print(" number of articles with the word COVID:")
print(count)
What you are actually getting is only the count for the last website. If you want all of them, append each count to a list, and then you can print the count for each site.
First create an empty list, and append the final count after each iteration:
URL = ["https://www.timesnownews.com/coronavirus", "https://www.indiatoday.in/coronavirus",
       "https://www.ndtv.com/coronavirus?pfrom=home-mainnavigation"]
Url_count = []
for url in URL:
    parser = 'html.parser'
    ...
    ...
        except:
            pass
    Url_count.append(count)
Then you can use zip to print the results:
for url, count in zip(URL, Url_count):
    print("Site:", url, "Count:", count)
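As an aside, the long chain of `or` comparisons on the title can be collapsed into a single case-insensitive check. Below is a minimal, self-contained sketch of the per-site counting pattern, using made-up site names and title lists in place of the live scraping (in the real script those titles come from newspaper's Article.parse()), so that only the counting logic is shown:

```python
# Illustrative data standing in for the scraped article titles of each site.
site_titles = {
    "https://example.com/site-a": ["COVID-19 cases rise", "Sports roundup", "Coronavirus vaccine news"],
    "https://example.com/site-b": ["Election results", "Covid curfew extended"],
}

# Lowercase keywords; one membership test covers COVID, Covid-19, Coronavirus, etc.
KEYWORDS = ("covid", "coronavirus")

def is_covid_title(title):
    # Case-insensitive check replacing the chain of `or` comparisons.
    lowered = title.lower()
    return any(keyword in lowered for keyword in KEYWORDS)

urls = list(site_titles)
# One count per site, in the same order as urls.
url_counts = [sum(is_covid_title(t) for t in site_titles[u]) for u in urls]

for url, count in zip(urls, url_counts):
    print("Site:", url, "Count:", count)
```

Because the counts list is built in the same order as the URL list, zip pairs each site with its own count, which is exactly the per-site output asked for in the question.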