How to organize data in a json file created through webscraping
I'm trying to grab article titles from Yahoo News and organize them into a JSON file. When I dump the data to the JSON file, it comes out hard to read. How should I organize the data — after dumping it, or from the start?
This is a web-scraping project in which I have to grab the top news articles along with their bodies and export them to a JSON file, which can then be sent to someone else's program. At the moment I'm working on getting the titles from the Yahoo Finance homepage.
import requests
import json
from bs4 import BeautifulSoup

# Getting webpage
page = requests.get("https://finance.yahoo.com/")
soup = BeautifulSoup(page.content, 'html.parser')  # creating instance of class to parse the page

# Getting article titles
title = soup.find_all(class_="Mb(5px)")
desc = soup.find_all(class_="Fz(14px) Lh(19px) Fz(13px)--sm1024 Lh(17px)--sm1024 LineClamp(3,57px) LineClamp(3,51px)--sm1024 M(0)")

# Getting article bodies
page2 = requests.get("https://finance.yahoo.com/news/warren-buffett-suggests-read-19th-204800450.html")
soup2 = BeautifulSoup(page2.content, 'html.parser')
body = soup.find_all(class_="canvas-atom canvas-text Mb(1.0em) Mb(0)--sm Mt(0.8em)--sm", id="15")

# Organizing data for export
data = {'title1': title[0].get_text(),
        'title2': title[1].get_text(),
        'title3': title[2].get_text(),
        'title4': title[3].get_text(),
        'title5': title[4].get_text()}

# Exporting the data to results.json
with open("results.json", "w") as write_file:
    json.dump(str(data), write_file)
This is what ends up being written to the JSON file (as of the time of this post):
"{'title1': 'These US taxpayers face higher payments thanks to new law',
'title2': 'These 12 Stocks Are the Best Values in 2019, According to Pros
Who\u2019ve Outsmarted the Market', '\ntitle3': 'The Best Move You Can
Make With Your Investments in 2019, According to 5 Market Professionals',
'title4': 'The auto industry said goodbye to a lot of cars in 2018',
'title5': '7 Stock Picks From Top-Rated Wall Street Analysts'}"
I'd like to write the code so that each article's title is displayed on its own line, and to remove the stray '\' characters that appear in the middle.
I have run your code, but I don't get the same results as you. You defined 'title3' as a fixed key, yet your output shows a '\n' in it, which I don't actually get. By the way, you get the '\' escapes because you didn't encode the output as 'utf8' and make sure ensure_ascii is set to False. I suggest two changes: use the 'lxml' parser instead of 'html.parser', and use this snippet:
with open("results.json", "w", encoding='utf8') as write_file:
    json.dump(str(data), write_file, ensure_ascii=False)
This works perfectly for me: the '\' characters are gone and the ASCII issue is resolved as well.
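For reference, the quoting in the original output comes from passing str(data) to json.dump, which serializes the dict's Python repr as one long JSON string; passing the dict itself produces a real JSON object. A minimal, self-contained sketch (not the scraper itself):

```python
import json

data = {'title1': 'Who’ve Outsmarted the Market'}

# str(data) turns the whole dict repr into a single JSON string,
# escaped with \uXXXX sequences by default
print(json.dumps(str(data)))
# "{'title1': 'Who\u2019ve Outsmarted the Market'}"

# Dumping the dict itself yields a proper JSON object;
# ensure_ascii=False keeps the curly quote readable
print(json.dumps(data, ensure_ascii=False, indent=4))
```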
import requests
import json
from bs4 import BeautifulSoup

# Getting webpage
page = requests.get("https://finance.yahoo.com/")
soup = BeautifulSoup(page.content, 'html.parser')  # creating instance of class to parse the page

# Getting article titles
title = soup.find_all(class_="Mb(5px)")
desc = soup.find_all(class_="Fz(14px) Lh(19px) Fz(13px)--sm1024 Lh(17px)--sm1024 LineClamp(3,57px) LineClamp(3,51px)--sm1024 M(0)")

# Getting article bodies (search soup2, the article page, not the homepage soup)
page2 = requests.get("https://finance.yahoo.com/news/warren-buffett-suggests-read-19th-204800450.html")
soup2 = BeautifulSoup(page2.content, 'html.parser')
body = soup2.find_all(class_="canvas-atom canvas-text Mb(1.0em) Mb(0)--sm Mt(0.8em)--sm", id="15")

title = [x.get_text().strip() for x in title]
limit = len(title)  # change this to 5 if you need only the first 5
data = {"title" + str(i + 1): title[i] for i in range(limit)}

with open("results.json", "w", encoding='utf-8') as write_file:
    write_file.write(json.dumps(data, ensure_ascii=False, indent=4))
results.json:
{
"title1": "These 12 Stocks Are the Best Values in 2019, According to Pros Who’ve Outsmarted the Market",
"title2": "These US taxpayers face higher payments thanks to new law",
"title3": "The Best Move You Can Make With Your Investments in 2019, According to 5 Market Professionals",
"title4": "Cramer Remix: Here's where your first ,000 should be i...",
"title5": "The auto industry said goodbye to a lot of cars in 2018",
"title6": "Ocado Pips Adyen to Take Crown of 2018's Best European Stock",
"title7": "7 Stock Picks From Top-Rated Wall Street Analysts",
"title8": "Buy IBM Stock as It Begins 2019 as the Cheapest Dow Component",
"title9": " Oil Could Be Right Around The Corner",
"title10": "What Is the Highest Credit Score and How Do You Get It?",
"title11": "Silver Price Forecast – Silver markets stall on New Year’s Eve",
"title12": "This Chart Says the S&P 500 Could Rebound in 2019",
"title13": "Should You Buy Some Berkshire Hathaway Stock?",
"title14": "How Much Does a Financial Advisor Cost?",
"title15": "Here Are the World's Biggest Billionaire Winners and Losers of 2018",
"title16": "Tax tips: What you need to know before you file your taxes in 2019",
"title17": "Kevin O’Leary: Make This Your Top New Year’s Resolution",
"title18": "Dakota Access pipeline developer slow to replace some trees",
"title19": "Einhorn's Greenlight Extends Decline to 34% in Worst Year",
"title20": "4 companies to watch in 2019",
"title21": "What Is My Debt-to-Income Ratio?",
"title22": "US recession unlikely, market volatility to continue in 2019, El-Erian says",
"title23": "Fidelity: Ignore stock market turbulence and stick to long-term goals",
"title24": "Tax season: How you can come out a winner",
"title25": "IBD 50 Growth Stocks To Watch"
}
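Since the goal is a file that another program can consume, one further option is a list of objects instead of numbered keys, so the consumer doesn't have to guess how many titleN keys exist. A sketch under the assumption that each article ends up with a title and a body (the keys "title" and "body" are my own naming, not from the code above):

```python
import json

# Hypothetical scraped results; in the real script these would be built
# from the `title` and `body` lists returned by BeautifulSoup
articles = [
    {"title": "First headline", "body": "First article body..."},
    {"title": "Second headline", "body": "Second article body..."},
]

with open("results.json", "w", encoding="utf-8") as f:
    json.dump(articles, f, ensure_ascii=False, indent=4)

# The consuming program can then iterate without guessing key names:
with open("results.json", encoding="utf-8") as f:
    for article in json.load(f):
        print(article["title"])
```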