Why is pandas.read_json modifying the value of long integers?
I don't know why the original values of id_1 & id_2 change when I print them.
I have a json file named test_data.json:
{
    "objects": {
        "value": {
            "1298543947669573634": {
                "timestamp": "Wed Aug 26 08:52:57 +0000 2020",
                "id_1": "1298543947669573634",
                "id_2": "1298519559306190850"
            }
        }
    }
}
Output
python test_data.py
id_1 id_2 timestamp
0 1298543947669573632 1298519559306190848 2020-08-26 08:52:57+00:00
My code test_data.py is
import pandas as pd
import json

file = "test_data.json"

with open(file, "r") as f:
    all_data = json.loads(f.read())

data = pd.read_json(json.dumps(all_data['objects']['value']), orient='index')
data = data.reset_index(drop=True)

print(data.head())
How can I fix this so that the numeric values are interpreted correctly?
- Using python 3.8.5 and pandas 1.1.1
Current implementation
- First, the code reads the file and converts it from a str type to a dict with json.loads
with open(file, "r") as f:
    all_data = json.loads(f.read())
- Then 'value' is converted back to a str
json.dumps(all_data['objects']['value'])
- With orient='index', the keys become the column headers and the values go into the rows.
- This is also the point where the data is converted to an int, and the values change
- My guess is that there is some float conversion issue in this step (see the sketch after the code below)
pd.read_json(json.dumps(all_data['objects']['value']), orient='index')
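That guess is easy to confirm: both IDs are larger than 2**53, the threshold above which a 64-bit float can no longer represent every integer exactly, so the string -> float -> int round-trip rounds them. A minimal sketch (my own illustration, not part of the original code) reproducing the changed values from the output above:
# float64 has a 53-bit mantissa, so at this magnitude doubles are spaced 256 apart
big_id = 1298543947669573634
print(big_id > 2 ** 53)    # True - beyond the exact-integer range of float64
print(int(float(big_id)))  # 1298543947669573632 - matches the changed value in the output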
Updated code
Option 1
- Use pandas.DataFrame.from_dict and then convert the columns to numeric.
file = "test_data.json"
with open (file, "r") as f:
all_data = json.loads(f.read())
# use .from_dict
data = pd.DataFrame.from_dict(all_data['objects']['value'], orient='index')
# convert columns to numeric
data[['id_1', 'id_2']] = data[['id_1', 'id_2']].apply(pd.to_numeric, errors='coerce')
data = data.reset_index(drop=True)
# display(data)
timestamp id_1 id_2
0 Wed Aug 26 08:52:57 +0000 2020 1298543947669573634 1298519559306190850
print(data.info())
[out]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 timestamp 1 non-null object
1 id_1 1 non-null int64
2 id_2 1 non-null int64
dtypes: int64(2), object(1)
memory usage: 152.0+ bytes
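As a side note, from_dict leaves timestamp as a plain object/string column, unlike the read_json output in the question. A minimal sketch for parsing it explicitly, assuming the Twitter-style timestamp format shown in the data:
# parse the Twitter-style timestamps into timezone-aware datetimes
data['timestamp'] = pd.to_datetime(data['timestamp'], format='%a %b %d %H:%M:%S %z %Y')
print(data['timestamp'])  # 0   2020-08-26 08:52:57+00:00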
Option 2
- Use pandas.json_normalize, and then convert the columns to numeric.
file = "test_data.json"
with open (file, "r") as f:
all_data = json.loads(f.read())
# read all_data into a dataframe
df = pd.json_normalize(all_data['objects']['value'])
# rename the columns
df.columns = [x.split('.')[1] for x in df.columns]
# convert to numeric
df[['id_1', 'id_2']] = df[['id_1', 'id_2']].apply(pd.to_numeric, errors='coerce')
# display(df)
timestamp id_1 id_2
0 Wed Aug 26 08:52:57 +0000 2020 1298543947669573634 1298519559306190850
print(df.info())
[out]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 timestamp 1 non-null object
1 id_1 1 non-null int64
2 id_2 1 non-null int64
dtypes: int64(2), object(1)
memory usage: 152.0+ bytes
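If the IDs are only ever used as opaque identifiers and never for arithmetic, a simpler variant is to skip pd.to_numeric and keep them as strings, which avoids any precision concern entirely. A minimal sketch, assuming the same all_data as above:
# keep the IDs as text instead of converting to int64
df = pd.json_normalize(all_data['objects']['value'])
df.columns = [x.split('.')[1] for x in df.columns]
df[['id_1', 'id_2']] = df[['id_1', 'id_2']].astype('string')  # exact text, no rounding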
This is caused by issue 20608 and still exists in the current 1.2.4 release of pandas.
Here is my workaround, which on my data is even slightly faster than read_json:
import pathlib
import pandas as pd


def broken_load_json(path):
    """There's an open issue: https://github.com/pandas-dev/pandas/issues/20608
    about read_csv loading large integers incorrectly because it's converting
    from string to float to int, losing precision."""
    df = pd.read_json(pathlib.Path(path), orient='index')
    return df


def orjson_load_json(path):
    import orjson  # the built-in json module would also work
    with open(path) as f:
        d = orjson.loads(f.read())
    df = pd.DataFrame.from_dict(d, orient='index')  # builds the index from the dict's keys as strings, sadly
    # fix the dtype of the index
    df = df.reset_index()
    df['index'] = df['index'].astype('int64')
    df = df.set_index('index')
    return df
Note that my answer keeps the ID values intact, which is what matters for my use case.
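For completeness, a short usage sketch of orjson_load_json; the file name is hypothetical, and the function expects a flat {"<id>": {...}} mapping, so the question's nested test_data.json would first need all_data['objects']['value'] extracted:
df = orjson_load_json("flat_data.json")  # hypothetical file laid out as {"<id>": {...}, ...}
print(df.index.dtype)  # int64 - the long IDs in the index keep their exact values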