Pandas DataFrame fillna working inconsistently even with inplace set to True

The program retrieves JSON data from a REST API:
import requests
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows',1000)
url = 'http://xxxxxxxxxxx:7180/api/v15/clusters/cluster/services/impala/impalaQueries?from=2018-07-23T14:00:00&to=2018-07-23T21:00:00&filter=(query_state+%3D+%22FINISHED%22)&offset=0&limit=1000'
username = 'xxxxx'
password = 'xxxxx'
result = requests.get(url, auth=(username,password))
outJSON = result.json()
df = pd.io.json.json_normalize(outJSON['queries'])
filename ="tempNew.csv"
df.to_csv(filename)

The CSV data contains empty values for some fields and NaN for a few others.

Input:

Admitted immediately,,BLAHBLAH,0,NaN,0,0,0,0,0.0,,,,

I am using fillna to replace all empty values and NaN with 0, since these are numeric fields in the target table.

Code tried:

for col in df:
   df[col].fillna(0,inplace=True)

df.fillna(0,inplace=True)

Output:

'Admitted immediately', '0', 'BLAHBLAH', '0', 'NaN', '0', '0', '0', '0', '0.0', '0','0','0'

How can I make sure that all NaN values in my DataFrame are changed to 0? The table the data is loaded into rejects the values because of the NaN.

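I switched from processing the REST API data row by row to a DataFrame, under the impression that data is easier to manipulate with a DataFrame. If fillna is not the right tool, is there a better way to massage the data in the df without iterating row by row?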

Update:

df = pd.io.json.json_normalize(outJSON['queries'])
fname = "WithouFilna_1.csv"
df.to_csv(fname)
df.fillna(0,inplace=True)
filename ="fillna_1.csv"
df.to_csv(filename)

I wrote the DataFrame to CSV before and after calling df.fillna. Some fields change, but not all of them.

Before:

859,Unknown,,,2,0,xxxx,RESET_METADATA,,,,,,,,,,,,,,
860,Admitted immediately,0,,1,2,xxxx,,0,,NaN,0,0,,0
861,Admitted immediately,0,,0,0,xxxx,,0,,NaN,0,0,,0

After:

859,Unknown,0,0,2,0,xxxx,RESET_METADATA,0,,0,0,0,0,0,0,0,0,0,0,0
860,Admitted immediately,0,0,1,2,xxx,0,0,,NaN,0,0,0,0,0,0,0,0,0
861,Admitted immediately,0,0,0,0,xxx,0,0,,NaN,0,0,0,0,0,0,0,0,0

df.dtypes Output

attributes.admission_result                              object
attributes.admission_wait                                object
attributes.bytes_streamed                                object
attributes.client_fetch_wait_time                        object
attributes.client_fetch_wait_time_percentage             object
attributes.connected_user                                object
attributes.ddl_type                                      object
attributes.estimated_per_node_peak_memory                object
attributes.file_formats                                  object
attributes.hdfs_average_scan_range                       object
attributes.hdfs_bytes_read                               object
attributes.hdfs_bytes_read_from_cache                    object
attributes.hdfs_bytes_read_from_cache_percentage         object
attributes.hdfs_bytes_read_local                         object
attributes.hdfs_bytes_read_local_percentage              object
attributes.hdfs_bytes_read_remote                        object
attributes.hdfs_bytes_read_remote_percentage             object
attributes.hdfs_bytes_read_short_circuit                 object
attributes.hdfs_bytes_read_short_circuit_percentage      object
attributes.hdfs_scanner_average_bytes_read_per_second    object

df.values[5:6, :15]

array([['Unknown', nan, nan, '1', '8', 'xxxxx',
        'SHOW_DBS', nan, '', nan, nan, nan, nan, nan, nan]], dtype=object)
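
Since every column is reported as object, one way to see what a column really holds is to count the Python type of each cell. The sketch below uses a small hypothetical sample (the column names and values are made up to resemble the row above, not taken from the real API response). Real missing values show up as float, while text such as 'NaN' or '' shows up as str.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: real NaN, empty strings, and digits stored as text.
sample = pd.DataFrame(
    {
        "admission_wait": [np.nan, "0", "NaN"],
        "file_formats": ["", np.nan, "PARQUET"],
    },
    dtype=object,  # matches the object dtypes reported by df.dtypes
)

# Count the Python type of every cell, column by column.
type_counts = {
    col: sample[col].map(lambda v: type(v).__name__).value_counts().to_dict()
    for col in sample.columns
}
print(type_counts)
```

A column that reports both float and str contains a mix of real missing values and text, which is the situation where fillna appears to work only partially.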

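The problem was caused by the REST API returning inconsistent data. For the affected fields, the API returned the string 'NaN', which this call apparently does not treat as a missing value:

df.fillna(0, inplace=True)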
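
A minimal sketch of that behaviour, using a made-up single-column frame (the column name and values are assumptions for illustration): fillna fills the real missing value but leaves the text 'NaN' alone, because to pandas it is an ordinary string.

```python
import numpy as np
import pandas as pd

# Hypothetical column holding one real NaN, one 'NaN' string, one number as text.
demo = pd.DataFrame({"bytes_streamed": [np.nan, "NaN", "12"]}, dtype=object)

# isna() only flags the real missing value, not the string.
missing_count = int(demo["bytes_streamed"].isna().sum())

# fillna therefore replaces only that one cell.
filled = demo.fillna(0)
print(missing_count)
print(filled["bytes_streamed"].tolist())
```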

I solved this using the following:

df.replace({'NaN': '0'}, regex=True, inplace=True)
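
Note that with regex=True the pattern also matches 'NaN' inside longer strings, and the result is still text ('0'), not a number. As a possible alternative, and only a sketch under assumptions, the numeric columns could be converted with pd.to_numeric before filling. The column names and the numeric_cols list below are hypothetical; in practice the list would have to be chosen to match the target table.

```python
import numpy as np
import pandas as pd

# Hypothetical data resembling the API output: numbers as text, 'NaN' strings,
# empty strings, and real missing values, all in object columns.
raw = pd.DataFrame(
    {
        "admission_result": ["Unknown", "Admitted immediately", "Admitted immediately"],
        "bytes_streamed": [np.nan, "NaN", "12"],
        "hdfs_bytes_read": ["", "NaN", "3.5"],
    },
    dtype=object,
)

# Assumption: these are the columns that are numeric in the target table.
numeric_cols = ["bytes_streamed", "hdfs_bytes_read"]

cleaned = raw.copy()
for col in numeric_cols:
    # errors="coerce" turns anything that is not a usable number into a real NaN,
    # which fillna(0) can then replace.
    cleaned[col] = pd.to_numeric(cleaned[col], errors="coerce").fillna(0)

print(cleaned)
print(cleaned.dtypes)
```

Text columns such as admission_result are left untouched, and the converted columns end up with a float dtype instead of object.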