Read a JSON file and format it as CSV
I have to read a JSON file and extract data from it to generate a CSV file.
The server is Red Hat 7 and Python is 2.7.5.
import time
import os
import sys
import json

with open('abcdc04_abcd11_ig_Host_metrics.json') as data_file:
    data = json.load(data_file)

with open('abcdc04_abcd11_ig_Host_metrics.txt', 'w') as f:
    for row in data:
        symmetrixID = row['symmetrixID']
        HostID = row['HostID']
        HostMBReads = row['HostMBReads']
        timestamp = row['timestamp']
        joined = ",".join([symmetrixID, HostID, HostMBReads, timestamp])
        f.write(joined)
The result is:
Traceback (most recent call last):
  File "./json_scv", line 23, in <module>
    symmetrixID= row['symmetrixID']
TypeError: string indices must be integers
My input JSON file looks like this:
{
    "symmetrixID": "000123401234",
    "HostID": "jupiter_ig",
    "perf_data": [
        {
            "HostMBReads": 0.00024720083,
            "timestamp": 1553637300000,
            "Writes": 0.0,
            "ReadResponseTime": 0.15273508,
            "Reads": 0.06328341,
            "WriteResponseTime": 0.0,
            "ResponseTime": 0.15273508,
            "SyscallCount": 0.09326678,
            "HostMBWrites": 0.0,
            "HostIOs": 0.06328341,
            "MBs": 0.00024720083
        },
        {
            "HostMBReads": 0.0004939684,
            "timestamp": 1553637600000,
            "Writes": 0.0,
            "ReadResponseTime": 0.15828949,
            "Reads": 0.1264559,
            "WriteResponseTime": 0.0,
            "ResponseTime": 0.15828949,
            "SyscallCount": 0.123128116,
            "HostMBWrites": 0.0,
            "HostIOs": 0.1264559,
            "MBs": 0.0004939684
        },
        {
            "HostMBReads": 0.0,
            "timestamp": 1553637900000,
            "Writes": 0.0,
            "ReadResponseTime": 0.0,
            "Reads": 0.0,
            "WriteResponseTime": 0.0,
            "ResponseTime": 0.0,
            "SyscallCount": 0.2,
            "HostMBWrites": 0.0,
            "HostIOs": 0.0,
            "MBs": 0.0
        }
    ],
    "reporting_level": "Host"
}
I want the CSV to look like this:
SymmID,HostName,TimeStamp,HostIOs,HostMBs,ResponseTime,Reads,Writes,HostMBReads,HostMBWrites,ReadResponseTime,WriteResponseTime,SyscallCount
000123401234,jupiter_ig,1553637600000,0.12666667,0.000494792,0.15257895,0.12666667,0,0.000494792,0,0.15257895,0,0.21333334
000123401234,jupiter_ig,1553637600000,0.1264559,0.000493968,0.15828949,0.1264559,0,0.000493968,0,0.15828949,0,0.123128116
000123401234,jupiter_ig,1553637600000,0,0,0,0,0,0,0,0,0,0.2
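Your variable named data ends up being a dictionary, not a list. So when you write for row in data:, you are saying "do the following for each key in the dictionary", not for each item in a list! Dictionaries are unordered in Python 2.7, but whichever key happens to be picked first as row, the statement fails: row is the key itself, a string, and a string cannot be indexed with 'symmetrixID'. For example, if HostID is the first key picked by the loop, row['symmetrixID'] means 'HostID'['symmetrixID'], which raises the TypeError you see.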
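To see this for yourself, here is a minimal sketch that reproduces the error on a trimmed-down copy of the data. The names keys_seen and error_message are made up for illustration, and the inline JSON keeps only one shortened perf_data entry.

```python
import json

# A trimmed-down stand-in for the file contents (one shortened perf_data entry).
data = json.loads("""
{
    "symmetrixID": "000123401234",
    "HostID": "jupiter_ig",
    "perf_data": [{"HostMBReads": 0.00024720083, "timestamp": 1553637300000}],
    "reporting_level": "Host"
}
""")

# Iterating over a dict yields its keys, which here are plain strings.
keys_seen = [row for row in data]

# Indexing one of those key strings with another string raises TypeError.
error_message = None
try:
    keys_seen[0]['symmetrixID']
except TypeError as exc:
    error_message = str(exc)

print(sorted(keys_seen))
print(error_message)
```

The first print lists the four top-level keys, and the second shows the same kind of message as the traceback in the question (the exact wording varies a little between Python versions).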
If you look closely, there is only one list in the dictionary that you can iterate over, and that is data["perf_data"]. So put the loop there.
So, for now, paste your data into a string:
s = """
{
    "symmetrixID": "000123401234",
    "HostID": "jupiter_ig",
    "perf_data": [
        {
            "HostMBReads": 0.00024720083,
            "timestamp": 1553637300000,
            "Writes": 0.0,
            "ReadResponseTime": 0.15273508,
            "Reads": 0.06328341,
            "WriteResponseTime": 0.0,
            "ResponseTime": 0.15273508,
            "SyscallCount": 0.09326678,
            "HostMBWrites": 0.0,
            "HostIOs": 0.06328341,
            "MBs": 0.00024720083
        },
        {
            "HostMBReads": 0.0004939684,
            "timestamp": 1553637600000,
            "Writes": 0.0,
            "ReadResponseTime": 0.15828949,
            "Reads": 0.1264559,
            "WriteResponseTime": 0.0,
            "ResponseTime": 0.15828949,
            "SyscallCount": 0.123128116,
            "HostMBWrites": 0.0,
            "HostIOs": 0.1264559,
            "MBs": 0.0004939684
        },
        {
            "HostMBReads": 0.0,
            "timestamp": 1553637900000,
            "Writes": 0.0,
            "ReadResponseTime": 0.0,
            "Reads": 0.0,
            "WriteResponseTime": 0.0,
            "ResponseTime": 0.0,
            "SyscallCount": 0.2,
            "HostMBWrites": 0.0,
            "HostIOs": 0.0,
            "MBs": 0.0
        }
    ],
    "reporting_level": "Host"
}
"""
Here is how I would pull out the data to be formatted:
import json

data = json.loads(s)
symmetrixID = data['symmetrixID']
HostID = data['HostID']
for row in data['perf_data']:
    HostMBReads = row['HostMBReads']
    timestamp = row['timestamp']
    joined = ",".join([str(c) for c in [symmetrixID, HostID, HostMBReads, timestamp]])
    print(joined)
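Note that I changed your joined expression. join will not work unless you first convert all of those numeric values to strings. In any case, you can replace the print call with whatever write call you need (f.write does not add a newline, so append "\n" to each line yourself).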
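If you want all the columns from your desired output, the standard csv module may be simpler than joining strings by hand, since csv.writer converts numbers to text and ends each line for you. The following is only a sketch under some assumptions: the names PERF_COLUMNS and host_metrics_rows are made up for illustration, the header names are taken from the desired output in the question, and pairing the "HostMBs" column with the "MBs" key is a guess on my part. Only the first perf_data entry is inlined to keep it short.

```python
import csv
import json
import sys

# (CSV header, key inside each perf_data entry). Header names come from the
# desired output in the question; pairing "HostMBs" with "MBs" is an assumption.
PERF_COLUMNS = [
    ("TimeStamp", "timestamp"),
    ("HostIOs", "HostIOs"),
    ("HostMBs", "MBs"),
    ("ResponseTime", "ResponseTime"),
    ("Reads", "Reads"),
    ("Writes", "Writes"),
    ("HostMBReads", "HostMBReads"),
    ("HostMBWrites", "HostMBWrites"),
    ("ReadResponseTime", "ReadResponseTime"),
    ("WriteResponseTime", "WriteResponseTime"),
    ("SyscallCount", "SyscallCount"),
]


def host_metrics_rows(data):
    """Return the header row followed by one row per perf_data entry."""
    rows = [["SymmID", "HostName"] + [header for header, _ in PERF_COLUMNS]]
    for entry in data["perf_data"]:
        # The two top-level IDs are repeated at the start of every row.
        rows.append([data["symmetrixID"], data["HostID"]]
                    + [entry[key] for _, key in PERF_COLUMNS])
    return rows


# First perf_data entry from the question, inlined so the sketch runs by itself.
sample = json.loads("""
{
    "symmetrixID": "000123401234",
    "HostID": "jupiter_ig",
    "perf_data": [
        {
            "HostMBReads": 0.00024720083,
            "timestamp": 1553637300000,
            "Writes": 0.0,
            "ReadResponseTime": 0.15273508,
            "Reads": 0.06328341,
            "WriteResponseTime": 0.0,
            "ResponseTime": 0.15273508,
            "SyscallCount": 0.09326678,
            "HostMBWrites": 0.0,
            "HostIOs": 0.06328341,
            "MBs": 0.00024720083
        }
    ],
    "reporting_level": "Host"
}
""")

# Print the CSV to stdout; pass an open file object instead to write a file.
csv.writer(sys.stdout, lineterminator="\n").writerows(host_metrics_rows(sample))
```

To write to a file instead of stdout, pass the file object to csv.writer. On Python 2.7 the csv documentation asks for the file to be opened in binary mode ('wb'); on Python 3 open it with 'w' and newline=''.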