Pandas DataFrame fillna working inconsistently even with inplace set to True
The program retrieves JSON data from a REST API:
import requests
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows',1000)
url = 'http://xxxxxxxxxxx:7180/api/v15/clusters/cluster/services/impala/impalaQueries?from=2018-07-23T14:00:00&to=2018-07-23T21:00:00&filter=(query_state+%3D+%22FINISHED%22)&offset=0&limit=1000'
username = 'xxxxx'
password = 'xxxxx'
result = requests.get(url, auth=(username,password))
outJSON = result.json()
df = pd.io.json.json_normalize(outJSON['queries'])
filename ="tempNew.csv"
df.to_csv(filename)
The CSV data contains empty values for some fields and NaN for a few fields.
Input:
Admitted immediately,,BLAHBLAH,0,NaN,0,0,0,0,0.0,,,,
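I am using fillna to replace all empty values and NaN with 0, because these are numeric fields in the target table.
Code tried:
# Attempt 1: fill column by column
for col in df:
    df[col].fillna(0, inplace=True)
# Attempt 2: fill the whole DataFrame at once
df.fillna(0, inplace=True)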
Output:
'Admitted immediately', '0', 'BLAHBLAH', '0', 'NaN', '0', '0', '0',
'0', '0.0', '0','0','0'
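How can I make sure that all NaN values in my DataFrame are changed to 0? The table they are loaded into rejects the rows because of the NaN values.
I switched from processing the REST API data row by row to a DataFrame, under the impression that the data would be easier to manipulate that way. If fillna won't work, is there a better way to massage the data in the df without iterating row by row?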
Update:
df = pd.io.json.json_normalize(outJSON['queries'])
fname = "WithouFilna_1.csv"
df.to_csv(fname)
df.fillna(0,inplace=True)
filename ="fillna_1.csv"
df.to_csv(filename)
I tried writing the DataFrame to CSV before and after df.fillna. Some fields change, but not all of them.
Before:
859,Unknown,,,2,0,xxxx,RESET_METADATA,,,,,,,,,,,,,,
860,Admitted immediately,0,,1,2,xxxx,,0,,NaN,0,0,,0
861,Admitted immediately,0,,0,0,xxxx,,0,,NaN,0,0,,0
After:
859,Unknown,0,0,2,0,xxxx,RESET_METADATA,0,,0,0,0,0,0,0,0,0,0,0,0
860,Admitted immediately,0,0,1,2,xxx,0,0,,NaN,0,0,0,0,0,0,0,0,0
861,Admitted immediately,0,0,0,0,xxx,0,0,,NaN,0,0,0,0,0,0,0,0,0
df.dtypes Output
attributes.admission_result object
attributes.admission_wait object
attributes.bytes_streamed object
attributes.client_fetch_wait_time object
attributes.client_fetch_wait_time_percentage object
attributes.connected_user object
attributes.ddl_type object
attributes.estimated_per_node_peak_memory object
attributes.file_formats object
attributes.hdfs_average_scan_range object
attributes.hdfs_bytes_read object
attributes.hdfs_bytes_read_from_cache object
attributes.hdfs_bytes_read_from_cache_percentage object
attributes.hdfs_bytes_read_local object
attributes.hdfs_bytes_read_local_percentage object
attributes.hdfs_bytes_read_remote object
attributes.hdfs_bytes_read_remote_percentage object
attributes.hdfs_bytes_read_short_circuit object
attributes.hdfs_bytes_read_short_circuit_percentage object
attributes.hdfs_scanner_average_bytes_read_per_second object
df.values[5:6, :15]
array([['Unknown', nan, nan, '1', '8', 'xxxxx',
'SHOW_DBS', nan, '', nan, nan, nan, nan, nan, nan]], dtype=object)
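One way to check what the cells actually hold is to count real missing values separately from text values. This is only a sketch on a made-up row shaped like the output above, not the actual API data:

```python
import numpy as np
import pandas as pd

# Hypothetical row (object dtype) mixing a real missing value,
# the text 'NaN', an ordinary string, and an empty string.
row = pd.DataFrame([["Unknown", np.nan, "NaN", "1", ""]], dtype=object)

real_missing = int(row.isna().sum().sum())   # cells that fillna would fill
text_nan = int((row == "NaN").sum().sum())   # cells holding the text 'NaN'
empty_text = int((row == "").sum().sum())    # cells holding an empty string

print(real_missing, text_nan, empty_text)
```

Only the cells counted by isna() are treated as missing; the text 'NaN' and the empty string are ordinary string values.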
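The problem was caused by the REST API returning inconsistent data. For the affected fields, the API returned the string 'NaN', which df.fillna(0, inplace=True) does not treat as a missing value.
I solved the problem with: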
df.replace({'NaN': '0'}, regex=True, inplace=True)
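For anyone who wants to reproduce this, here is a minimal sketch on made-up data. The column names and the numeric_cols list are assumptions for illustration, not fields from the actual API. It shows that fillna leaves the text 'NaN' alone. As an alternative to the regex replace, it converts the columns that are meant to be numeric with pd.to_numeric(errors="coerce"), which turns unparseable text into real missing values that fillna can then fill. Note that with regex=True the pattern 'NaN' also matches inside longer strings, so restricting the cleanup to known numeric columns may be safer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: object columns holding real missing values (np.nan),
# empty strings, and the text 'NaN', similar to what the API returned.
df = pd.DataFrame(
    {
        "result": ["Admitted immediately", "Unknown", "Unknown"],
        "wait": ["0", np.nan, "NaN"],
        "bytes": ["12", "", "NaN"],
    },
    dtype=object,
)

# fillna only fills real missing values; the text 'NaN' survives.
filled = df.fillna(0)
print(filled["wait"].tolist())

# Alternative: coerce the columns meant to be numeric, then fill.
numeric_cols = ["wait", "bytes"]  # assumed list of numeric fields
clean = df.copy()
clean[numeric_cols] = (
    clean[numeric_cols].apply(pd.to_numeric, errors="coerce").fillna(0)
)
print(clean)
```

The text column is left untouched because it is not listed in numeric_cols; coercing it would turn every value into 0.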