Trying to read JSON from a URL and parse it into CSV format
I am trying to loop through four URLs in a list, grab the contents of each, and save each one as a separate CSV. I think the code below is close, but it doesn't seem to actually parse the JSON string into a human-readable format. Also, the headers are missing.
Here is the code I hacked together.
import urllib
import requests
import json
import pandas as pd
from pandas.io.json import json_normalize

all_links = ['https://www.baptisthealthsystem.com/docs/global/standard-charges/474131755_abrazomaranahospital_standardcharges.json?sfvrsn=9a27928_2',
             'https://www.baptisthealthsystem.com/docs/global/standard-charges/621861138_abrazocavecreekhospital_standardcharges.json?sfvrsn=674fd6f_2',
             'https://www.baptisthealthsystem.com/docs/global/standard-charges/621809851_abrazomesahospital_standardcharges.json?sfvrsn=13953222_2',
             'https://www.baptisthealthsystem.com/docs/global/standard-charges/621811285_abrazosurprisehospital_standardcharges.json?sfvrsn=c8113dcf_2']

for item in all_links:
    #print(item)
    try:
        length = len(item)
        first_under = item.find('_') + 1
        last_under = item.rfind('?') - 21
        file_name = item[first_under:last_under]
        r = requests.get(item)
        print(r.json)
        df = pd.DataFrame(r)
        df.head()
        DOWNLOAD_PATH = 'C:\\Users\\ryans\\Desktop\\hospital_data\\' + file_name + '.csv'
        #urllib.request.urlretrieve(df,DOWNLOAD_PATH)
        r = requests.get(item)
        with open(DOWNLOAD_PATH,'wb') as f:
            f.write(r.content)
    except Exception as e: print(e)
Here is what the data looks like. Is this correct? I think the data would look much cleaner if it were converted from JSON to CSV.
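For what it's worth, the json_normalize import in the code above points at one way to flatten nested JSON records into tabular rows before writing CSV. A sketch on made-up sample data (the field names here are invented for illustration, not the real feed's schema):

```python
import pandas as pd

# Hypothetical records shaped like nested charge data (field names are invented)
records = [
    {"code": "123", "description": "MRI",   "charge": {"gross": 100.0, "cash": 80.0}},
    {"code": "456", "description": "X-ray", "charge": {"gross": 50.0,  "cash": 40.0}},
]

# pd.json_normalize (the modern spelling of the json_normalize import above)
# flattens nested dicts into dot-separated columns like "charge.gross"
df = pd.json_normalize(records)
df.to_csv("flattened.csv", index=False)
```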
Not sure whether you are asking for opinions here. Do you actually need CSV? Or do you just want to save the data in a readable (and preferably programmatically extractable) format?
Either way, here is how to export one result to a .csv file:
# Required imports
import json
import csv

"""I am assuming you already made the request and got the response"""
data_set = r.json()  # An iterable of python dict objects
DOWNLOAD_PATH = 'C:\\Users\\ryans\\Desktop\\hospital_data\\' + file_name + '.csv'
# (Please take enough care to not reveal personal info above)

# getting the headers
headers = data_set[0].keys()
with open(DOWNLOAD_PATH, "wt", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(headers)           # writing the headers
    for data in data_set:              # writing the values
        writer.writerow(data.values())
And here is how to export one result to a .json file:
import json

data_set = r.json()
DOWNLOAD_PATH = 'C:\\Users\\ryans\\Desktop\\hospital_data\\' + file_name + '.json'
with open(DOWNLOAD_PATH, "wt") as file:
    json.dump(data_set, file)
Use whichever format is more readable for you. Note that the above handles a single request result; you will have to repeat it for each result you pull from the site.
Credit for the CSV approach: GeeksforGeeks
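Since each request has to be exported separately, it may help to wrap the CSV export in a small helper that can be called once per response. A minimal sketch (the function name export_rows_to_csv is mine, and it assumes a non-empty list of flat dicts sharing the same keys):

```python
import csv

def export_rows_to_csv(data_set, path):
    # data_set: a non-empty iterable of flat dicts sharing the same keys
    headers = list(data_set[0].keys())
    with open(path, "wt", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(headers)           # header row
        for row in data_set:
            writer.writerow(row.values())  # one line per record
```

Inside the URL loop, you would then call export_rows_to_csv(r.json(), DOWNLOAD_PATH) once per response.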
You were close; here is what you need to change:
- You can read the json into a pandas dataframe with df = pd.read_json(text, lines=True) - make sure to specify lines=True here, because some of your data contains \n characters
- You can use the same dataframe to output to csv with df.to_csv(file)

All in all, there are a few things in your code that can be removed; for example, you were calling requests.get twice for no reason, which slows your code down considerably.
import requests
import pandas as pd

all_links = ['https://www.baptisthealthsystem.com/docs/global/standard-charges/474131755_abrazomaranahospital_standardcharges.json?sfvrsn=9a27928_2',
             'https://www.baptisthealthsystem.com/docs/global/standard-charges/621861138_abrazocavecreekhospital_standardcharges.json?sfvrsn=674fd6f_2',
             'https://www.baptisthealthsystem.com/docs/global/standard-charges/621809851_abrazomesahospital_standardcharges.json?sfvrsn=13953222_2',
             'https://www.baptisthealthsystem.com/docs/global/standard-charges/621811285_abrazosurprisehospital_standardcharges.json?sfvrsn=c8113dcf_2']

for item in all_links:
    try:
        first_under = item.find('_') + 1
        last_under = item.rfind('?') - 21
        file_name = item[first_under:last_under]
        r = requests.get(item)
        df = pd.read_json(r.text, lines=True)
        DOWNLOAD_PATH = 'C:\\Users\\ryans\\Desktop\\hospital_data\\' + file_name + '.csv'
        df.to_csv(DOWNLOAD_PATH)  # to_csv wants a path or a text-mode handle, not 'wb'
    except Exception as e:
        print(e)
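As a side note, both versions hard-code a Windows path with bare backslashes and slice the filename out of the URL with a magic -21 offset; the standard library can do both more safely. A sketch (the file_stem helper name is mine):

```python
import os
from pathlib import Path
from urllib.parse import urlparse

def file_stem(url):
    # e.g. '.../474131755_abrazomaranahospital_standardcharges.json?sfvrsn=...'
    name = os.path.basename(urlparse(url).path)   # drops directories and the query string
    parts = os.path.splitext(name)[0].split('_')  # drop '.json', split on underscores
    return parts[1]                               # keep just the hospital name

DOWNLOAD_DIR = Path(r'C:\Users\ryans\Desktop\hospital_data')  # raw string: no escaping issues
url = 'https://www.baptisthealthsystem.com/docs/global/standard-charges/474131755_abrazomaranahospital_standardcharges.json?sfvrsn=9a27928_2'
out_path = DOWNLOAD_DIR / (file_stem(url) + '.csv')
```

This keeps working even if the site changes the filename length, which the hard-coded -21 offset would silently break on.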