News article extraction using the requests, bs4 and newspaper packages: why doesn't links=soup.select(".r a") find anything? This code was working earlier
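Objective: I am trying to download news articles based on keywords in order to perform sentiment analysis.
This code was working a few months ago, but now it returns nothing. I tried to fix the issue, but links=soup.select(".r a") returns an empty list.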
import pandas as pd
import requests
from bs4 import BeautifulSoup
import string
import nltk
from urllib.request import urlopen
import sys
import webbrowser
import newspaper
import time
from newspaper import Article

Company_name1 = []
Article_number1 = []
Article_Title1 = []
Article_Authors1 = []
Article_pub_date1 = []
Article_Text1 = []
Article_Summary1 = []
Article_Keywords1 = []
Final_dataframe = []


class Newspapr_pd:
    def __init__(self, term):
        self.term = term
        self.subjectivity = 0
        self.sentiment = 0
        self.url = 'https://www.google.com/search?q={0}&safe=active&tbs=qdr:w,sdb:1&tbm=nws&source=lnt&dpr=1'.format(self.term)

    def NewsArticlerun_pd(self):
        response = requests.get(self.url)
        response.raise_for_status()
        # print(response.text)
        soup = BeautifulSoup(response.text, 'html.parser')
        # This is the selector that now returns an empty list
        links = soup.select(".r a")
        numOpen = min(5, len(links))
        Article_number = 0
        for i in range(numOpen):
            # webbrowser.open() returns a bool, so build the URL first and pass the string to Article
            article_url = "https://www.google.com" + links[i].get("href")
            webbrowser.open(article_url)
            # For newspapers in a different language, change the language code
            article = Article(article_url, language="en")  # en for English
            Article_number += 1
            print('*************************************************************************************')
            Article_number1.append(Article_number)
            Company_name1.append(self.term)
            try:
                # Download the article
                article.download()
                # Parse the article
                article.parse()
                # Perform natural language processing (NLP)
                article.nlp()
                # Extract the title
                Article_Title1.append(article.title)
                # Extract the text
                Article_Text1.append(article.text)
                # Extract the author names
                Article_Authors1.append(article.authors)
                # Extract the publication date
                Article_pub_date1.append(article.publish_date)
                # Extract the summary
                Article_Summary1.append(article.summary)
                # Extract the keywords
                Article_Keywords1.append(article.keywords)
            except Exception:
                print('Error in loading page')
                continue
        for art_num, com_name, title, text, auth, pub_dt, summaries, keywds in zip(Article_number1, Company_name1, Article_Title1, Article_Text1, Article_Authors1, Article_pub_date1, Article_Summary1, Article_Keywords1):
            Final_dataframe.append({'Article_link_num': art_num, 'Company_name': com_name, 'Article_Title': title, 'Article_Text': text, 'Article_Author': auth,
                                    'Article_Published_date': pub_dt, 'Article_Summary': summaries, 'Article_Keywords': keywds})


list_of_companies = ['Amazon', 'Jetairways', 'nirav modi']
for i in list_of_companies:
    comp = str('"' + i + '"')
    a = Newspapr_pd(comp)
    a.NewsArticlerun_pd()

Final_new_dataframe = pd.DataFrame(Final_dataframe)
Final_new_dataframe.tail()
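This is a very complex issue, because Google News continually changes its class names. In addition, Google adds various prefixes to article URLs and throws in some hidden ad or social media tags.
The answer below only addresses scraping articles from Google News. More testing is needed to determine how it handles a large number of keywords and changes to the Google News page structure.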
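The URL prefixes mentioned above typically take the form `/url?q=<article URL>&sa=...`. As a minimal sketch of one way to strip that wrapper, the standard library's `urllib.parse` can split the href and read the `q` parameter. The function name `unwrap_google_href` and the sample href are hypothetical, used only for illustration, and the sketch assumes the wrapper keeps this shape.

```python
from urllib.parse import parse_qs, urlsplit


def unwrap_google_href(href):
    """Return the target URL from a '/url?q=<target>&sa=...' href, or None.

    Hypothetical helper: it assumes the article URL is carried in the
    'q' query parameter of a '/url' link.
    """
    parts = urlsplit(href)
    if parts.path != "/url":
        # Not a redirect wrapper (for example an internal '/search?...' link)
        return None
    targets = parse_qs(parts.query).get("q")
    if not targets:
        return None
    target = targets[0]
    # Keep only absolute http(s) links
    if target.startswith(("http://", "https://")):
        return target
    return None


# Made-up href in the assumed wrapper format
sample = "/url?q=https://example.com/news/story%3Fid%3D1&sa=U&ved=abc"
print(unwrap_google_href(sample))
print(unwrap_google_href("/search?q=amazon&tbm=nws"))
```

Note that `parse_qs` also percent-decodes the value, so an encoded `%3F` in the wrapped link comes back as `?`.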
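The Newspaper3k extraction is more complex, because each article can have a different structure. I recommend looking at my Newspaper3k Usage Overview document for details on how to design that part of your code.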
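Because individual articles can fail to download or parse, one possible design is to build one dictionary per article instead of appending to parallel lists, so a failure partway through cannot leave the columns misaligned. The sketch below is an assumption about how that could look, not the approach from the Usage Overview document. The names `extract_article` and `collect_rows` are hypothetical. The `Article` calls are the same ones used in the question, and `nlp()` may additionally require NLTK tokenizer data to be installed.

```python
def extract_article(url, language="en"):
    """Download and parse one article with Newspaper3k and return its fields."""
    # Imported here so collect_rows() can be used with another extractor
    from newspaper import Article

    article = Article(url, language=language)
    article.download()
    article.parse()
    article.nlp()  # fills in summary and keywords
    return {
        "Article_Title": article.title,
        "Article_Text": article.text,
        "Article_Author": article.authors,
        "Article_Published_date": article.publish_date,
        "Article_Summary": article.summary,
        "Article_Keywords": article.keywords,
    }


def collect_rows(urls, extract=extract_article):
    """Return (rows, failed): one dict per article that was extracted,
    plus a list of (url, error message) pairs for the ones that were not."""
    rows, failed = [], []
    for url in urls:
        try:
            fields = extract(url)
        except Exception as exc:  # any download or parse error skips this URL
            failed.append((url, str(exc)))
            continue
        rows.append({"Article_URL": url, **fields})
    return rows, failed
```

The `rows` list can be passed straight to `pd.DataFrame(rows)`, and `failed` shows which URLs need a closer look.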
P.S. I am currently writing a new news scraper, because development of Newspaper3k is dead. I am unsure of the release date of my code.
import requests
import re as regex
from bs4 import BeautifulSoup


def get_google_news_article(search_string):
    articles = []
    url = f'https://www.google.com/search?q={search_string}&safe=active&tbs=qdr:w,sdb:1&tbm=nws&source=lnt&dpr=1'
    response = requests.get(url)
    raw_html = BeautifulSoup(response.text, "lxml")
    main_tag = raw_html.find('div', {'id': 'main'})
    if main_tag is None:
        # The expected results container is missing, so there is nothing to parse
        return articles
    # Result blocks are matched by a class name containing 'xpd'
    for div_tag in main_tag.find_all('div', {'class': regex.compile('xpd')}):
        for a_tag in div_tag.find_all('a', href=True):
            # Skip Google's own internal search links
            if not a_tag.get('href').startswith('/search?'):
                # Skip shopping, social media and reference sites
                none_articles = bool(regex.search('amazon.com|facebook.com|twitter.com|youtube.com|wikipedia.org', a_tag['href']))
                if none_articles is False:
                    if a_tag.get('href').startswith('/url?q='):
                        # Strip the '/url?q=' prefix and everything from '&sa=' onwards
                        find_article = regex.search('(.*)(&sa=)', a_tag.get('href'))
                        if find_article is None:
                            continue
                        article = find_article.group(1).replace('/url?q=', '')
                        if article.startswith('https://'):
                            articles.append(article)
    return articles


list_of_companies = ['amazon', 'jet airways', 'nirav modi']
for company_name in list_of_companies:
    print(company_name)
    search_results = get_google_news_article(company_name)
    # Remove duplicate links and print them in sorted order
    for item in sorted(set(search_results)):
        print(item)
    print('\n')
Here is the output from the code above:
amazon
https://9to5mac.com/2021/11/15/amazon-releases-native-prime-video-app-for-macos-with-purchase-support-and-more/
https://wtvbam.com/2021/11/15/india-police-to-question-amazon-executives-in-probe-over-marijuana-smuggling/
https://www.cnet.com/home/smart-home/all-the-new-amazon-features-for-your-smart-home-alexa-disney-echo/
https://www.cnet.com/tech/amazon-unveils-black-friday-deals-starting-on-nov-25/
https://www.crossroadstoday.com/i/amazons-best-black-friday-deals-for-2021-2/
https://www.reuters.com/technology/ibm-amazon-partner-extend-reach-data-tools-oil-companies-2021-11-15/
https://www.theverge.com/2021/11/15/22783275/amazon-basics-smart-switches-price-release-date-specs
https://www.tomsguide.com/news/amazon-echo-motion-detection
https://www.usatoday.com/story/money/shopping/2021/11/15/amazon-black-friday-2021-deals-online/8623710002/
https://www.winknews.com/2021/11/15/new-amazon-sortation-center-began-operations-monday-could-bring-faster-deliveries/
jet airways
https://economictimes.indiatimes.com/markets/expert-view/first-time-in-two-decades-new-airlines-are-starting-instead-of-closing-down-jyotiraditya-scindia/articleshow/87660724.cms
https://menafn.com/1103125331/Jet-Airways-to-resume-operations-in-Q1-2022
https://simpleflying.com/jet-airways-100-aircraft-5-years/
https://simpleflying.com/jet-airways-q3-loss/
https://www.business-standard.com/article/companies/defunct-carrier-jet-airways-posts-rs-306-cr-loss-in-september-quarter-121110901693_1.html
https://www.business-standard.com/article/markets/stocks-to-watch-ril-aurobindo-bhel-m-m-jet-airways-idfc-powergrid-121110900189_1.html
https://www.financialexpress.com/market/nykaa-hdfc-zee-media-jet-airways-power-grid-berger-paints-petronet-lng-stocks-in-focus/2366063/
https://www.moneycontrol.com/news/business/earnings/jet-airways-standalone-september-2021-net-sales-at-rs-41-02-crore-up-313-51-y-o-y-7702891.html
https://www.spokesman.com/stories/2021/nov/11/boeing-set-to-dent-airbus-india-dominance-with-737/
https://www.timesnownews.com/business-economy/industry/article/times-now-summit-2021-jet-airways-will-make-a-comeback-into-indian-skies-akasa-to-take-off-next-year-says-jyotiraditya-scindia/831090
nirav modi
https://m.republicworld.com/india-news/general-news/piyush-goyal-says-few-rotten-eggs-destroyed-credibility-of-countrys-ca-sector.html
https://www.bulletnews.net/akkad-bakkad-rafu-chakkar-review-the-story-of-robbing-people-by-making-fake-banks/
https://www.daijiworld.com/news/newsDisplay%3FnewsID%3D893048
https://www.devdiscourse.com/article/law-order/1805317-hc-seeks-centres-stand-on-bankers-challenge-to-dismissal-from-service
https://www.geo.tv/latest/381560-arif-naqvis-extradition-case-to-be-heard-after-nirav-modi-case-ruling
https://www.hindustantimes.com/india-news/cbiand-ed-appointments-that-triggered-controversies-101636954580012.html
https://www.law360.com/articles/1439470/suicide-test-ruling-delays-abraaj-founder-s-extradition-case
https://www.moneycontrol.com/news/trends/current-affairs-trends/nirav-modi-extradition-case-outcome-of-appeal-to-also-affect-pakistani-origin-global-financier-facing-16-charges-of-fraud-and-money-laundering-7717231.html
https://www.thehansindia.com/hans/opinion/news-analysis/uniform-law-needed-for-free-exit-of-rich-businessmen-714566
https://www.thenews.com.pk/print/908374-uk-judge-delays-arif-naqvi-s-extradition-to-us