Trying to read JSON from a URL and parse it into CSV format
I am trying to loop through four URLs in a list, grab the contents of each, and save each one as a separate CSV. I think the code below is close, but it doesn't seem to actually parse the JSON string into a human-readable format. Also, the headers are missing.
Here is the code I hacked together.
import urllib
import requests
import json
import pandas as pd
from pandas.io.json import json_normalize

all_links = ['https://www.baptisthealthsystem.com/docs/global/standard-charges/474131755_abrazomaranahospital_standardcharges.json?sfvrsn=9a27928_2',
             'https://www.baptisthealthsystem.com/docs/global/standard-charges/621861138_abrazocavecreekhospital_standardcharges.json?sfvrsn=674fd6f_2',
             'https://www.baptisthealthsystem.com/docs/global/standard-charges/621809851_abrazomesahospital_standardcharges.json?sfvrsn=13953222_2',
             'https://www.baptisthealthsystem.com/docs/global/standard-charges/621811285_abrazosurprisehospital_standardcharges.json?sfvrsn=c8113dcf_2']

for item in all_links:
    #print(item)
    try:
        length = len(item)
        first_under = item.find('_') + 1
        last_under = item.rfind('?') - 21
        file_name = item[first_under:last_under]
        r = requests.get(item)
        print(r.json)
        df = pd.DataFrame(r)
        df.head()
        DOWNLOAD_PATH = 'C:\\Users\\ryans\\Desktop\\hospital_data\\' + file_name + '.csv'
        #urllib.request.urlretrieve(df,DOWNLOAD_PATH)
        r = requests.get(item)
        with open(DOWNLOAD_PATH,'wb') as f:
            f.write(r.content)
    except Exception as e: print(e)
Here is what the data looks like. Is this correct? I think the data would look much cleaner if it were converted from JSON to CSV.
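For what it's worth, the json_normalize import in the code above points at one way to flatten nested JSON records into tabular rows before writing CSV. A sketch on made-up sample data (the field names here are invented for illustration, not the real feed's schema):

```python
import pandas as pd

# Hypothetical records shaped like nested charge data (field names are invented)
records = [
    {"code": "123", "description": "MRI",   "charge": {"gross": 100.0, "cash": 80.0}},
    {"code": "456", "description": "X-ray", "charge": {"gross": 50.0,  "cash": 40.0}},
]

# pd.json_normalize (the modern spelling of the json_normalize import above)
# flattens nested dicts into dot-separated columns like "charge.gross"
df = pd.json_normalize(records)
df.to_csv("flattened.csv", index=False)
```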
Not sure whether you are asking for opinions here. Do you actually need CSV? Or do you just want to save the data in a readable (and preferably programmatically extractable) format?
Either way, here is how to export one result to a .csv file:
# Required imports
import json
import csv

"""I am assuming you already made the request and got the response"""
data_set = r.json()  # An iterable of python dict objects
DOWNLOAD_PATH = 'C:\\Users\\ryans\\Desktop\\hospital_data\\' + file_name + '.csv'
# (Please take enough care to not reveal personal info above)

# getting the headers
headers = data_set[0].keys()
with open(DOWNLOAD_PATH, "wt", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(headers)           # writing the headers
    for data in data_set:              # writing the values
        writer.writerow(data.values())
And here is how to export one result to a .json file:
import json

data_set = r.json()
DOWNLOAD_PATH = 'C:\\Users\\ryans\\Desktop\\hospital_data\\' + file_name + '.json'
with open(DOWNLOAD_PATH, "wt") as file:
    json.dump(data_set, file)
Use whichever format is more readable for you. Note that the above handles a single request result; you will have to repeat it for each result you pull from the site.
Credit for the CSV approach: GeeksforGeeks
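Since each request has to be exported separately, it may help to wrap the CSV export in a small helper that can be called once per response. A minimal sketch (the function name export_rows_to_csv is mine, and it assumes a non-empty list of flat dicts sharing the same keys):

```python
import csv

def export_rows_to_csv(data_set, path):
    # data_set: a non-empty iterable of flat dicts sharing the same keys
    headers = list(data_set[0].keys())
    with open(path, "wt", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(headers)           # header row
        for row in data_set:
            writer.writerow(row.values())  # one line per record
```

Inside the URL loop, you would then call export_rows_to_csv(r.json(), DOWNLOAD_PATH) once per response.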
You were close; here is what you need to change:
- You can read the json into a pandas dataframe with df = pd.read_json(text, lines=True) - make sure to specify lines=True here, because some of your data contains \n characters
- You can use the same dataframe to output to csv with df.to_csv(file)

All in all, there are a few things in your code that can be removed; for example, you were calling requests.get twice for no reason, which slows your code down considerably.
import requests
import pandas as pd

all_links = ['https://www.baptisthealthsystem.com/docs/global/standard-charges/474131755_abrazomaranahospital_standardcharges.json?sfvrsn=9a27928_2',
             'https://www.baptisthealthsystem.com/docs/global/standard-charges/621861138_abrazocavecreekhospital_standardcharges.json?sfvrsn=674fd6f_2',
             'https://www.baptisthealthsystem.com/docs/global/standard-charges/621809851_abrazomesahospital_standardcharges.json?sfvrsn=13953222_2',
             'https://www.baptisthealthsystem.com/docs/global/standard-charges/621811285_abrazosurprisehospital_standardcharges.json?sfvrsn=c8113dcf_2']

for item in all_links:
    try:
        first_under = item.find('_') + 1
        last_under = item.rfind('?') - 21
        file_name = item[first_under:last_under]
        r = requests.get(item)
        df = pd.read_json(r.text, lines=True)
        DOWNLOAD_PATH = 'C:\\Users\\ryans\\Desktop\\hospital_data\\' + file_name + '.csv'
        df.to_csv(DOWNLOAD_PATH)  # to_csv wants a path or a text-mode handle, not 'wb'
    except Exception as e:
        print(e)
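As a side note, both versions hard-code a Windows path with bare backslashes and slice the filename out of the URL with a magic -21 offset; the standard library can do both more safely. A sketch (the file_stem helper name is mine):

```python
import os
from pathlib import Path
from urllib.parse import urlparse

def file_stem(url):
    # e.g. '.../474131755_abrazomaranahospital_standardcharges.json?sfvrsn=...'
    name = os.path.basename(urlparse(url).path)   # drops directories and the query string
    parts = os.path.splitext(name)[0].split('_')  # drop '.json', split on underscores
    return parts[1]                               # keep just the hospital name

DOWNLOAD_DIR = Path(r'C:\Users\ryans\Desktop\hospital_data')  # raw string: no escaping issues
url = 'https://www.baptisthealthsystem.com/docs/global/standard-charges/474131755_abrazomaranahospital_standardcharges.json?sfvrsn=9a27928_2'
out_path = DOWNLOAD_DIR / (file_stem(url) + '.csv')
```

This keeps working even if the site changes the filename length, which the hard-coded -21 offset would silently break on.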