Pandas: Renaming "Unnamed: *" or "NaN" in data frame

This is my code so far:

import numpy as np
import pandas as pd
df = pd.read_excel(r'file.xlsx', index_col=0)

This is what it looks like:

I want to rename each "Unnamed: *" column to the last valid name that precedes it.

Here is what I tried, and the results:

df.columns = df.columns.str.replace('Unnamed.*', method='ffill')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-253-c868b8bff7c7> in <module>()
----> 1 df.columns = df.columns.str.replace('Unnamed.*', method='ffill')

TypeError: replace() got an unexpected keyword argument 'method'

This "works" if I just do

df.columns = df.columns.str.replace('Unnamed.*', '')

But then I have blank values, or NaN if I replace '' with 'NaN'. So I tried:

df.columns = df.columns.fillna('ffill')

That had no effect. So I tried inplace=True:

df.columns = df.columns.fillna('ffill', inplace=True)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-279-cce486472d5b> in <module>()
----> 1 df.columns = df.columns.fillna('ffill', inplace=True)

TypeError: fillna() got an unexpected keyword argument 'inplace'

Then I tried a different approach:

i = 0
while i < len(df.columns):
    if df.columns[i] == 'NaN':
        df.columns[i] = df.columns[i-1]
    print(df.columns[i])
    i += 1

This gave me the following error:

Oil
158 RGN Mistura
Access West Winter Blend 

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-246-bc8fa6881b1a> in <module>()
      2 while i < len(df.columns):
      3     if df.columns[i] == 'NaN':
----> 4         df.columns[i] = df.columns[i-1]
      5     print(df.columns[i])
      6     i += 1

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
   2048 
   2049     def __setitem__(self, key, value):
-> 2050         raise TypeError("Index does not support mutable operations")
   2051 
   2052     def __getitem__(self, key):

TypeError: Index does not support mutable operations

Something that works:

df.columns = df.columns.where(~df.columns.str.startswith('Unnamed')).to_series().ffill()

Full example:

import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['First', 'Unnamed: 1', 'Unnamed: 2','Second', 'Unnamed: 3'])

df.columns = df.columns.where(~df.columns.str.startswith('Unnamed')).to_series().ffill()

print(df.columns)

Prints:

Index(['First', 'First', 'First', 'Second', 'Second'], dtype='object')
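
The one-liner above can be unpacked into a small helper. This is only a sketch: ffill_unnamed is a hypothetical name, not part of pandas, and it assumes the placeholder labels start with "Unnamed", as in the example. A placeholder with no named column before it is left as NaN.

```python
import pandas as pd

def ffill_unnamed(columns, prefix="Unnamed"):
    """Return labels with placeholder labels replaced by the last real label."""
    labels = pd.Series(list(columns), dtype="object")
    # True where the label is an auto-generated placeholder.
    is_placeholder = labels.astype(str).str.startswith(prefix)
    # Blank out the placeholders (they become NaN), then forward-fill.
    return labels.mask(is_placeholder).ffill().tolist()

df = pd.DataFrame(columns=["First", "Unnamed: 1", "Unnamed: 2", "Second", "Unnamed: 4"])
df.columns = ffill_unnamed(df.columns)
print(list(df.columns))
# ['First', 'First', 'First', 'Second', 'Second']
```

Assigning a plain list back to df.columns avoids mutating the Index in place, which is what raised the "Index does not support mutable operations" error in the question.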

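The problem you are running into comes from the fact that columns and the index are pd.Index objects. The fillna method of a pandas Index does not take the same arguments as the fillna method of a pandas Series or DataFrame. I made a toy example below: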

import pandas as pd
import numpy as np
df = pd.DataFrame(
    {'a': [1], 'Unnamed:1': [1], 'Unnamed:2': [1], 'b': [1], 'Unnamed:3': [1]},
    columns=['a', 'Unnamed:3', 'Unnamed:1', 'b', 'Unnamed:2'])
df 
#   a  Unnamed:3  Unnamed:1  b  Unnamed:2
#0  1          1          1  1          1
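
To make the difference between the two fillna methods concrete, here is a minimal sketch on a hand-built Index (the labels are made up for illustration). Index.fillna treats its argument as the replacement value, so the string 'ffill' is inserted literally, and it only touches genuinely missing entries, never text such as 'Unnamed: 1' or 'NaN'. Forward-filling needs a Series.

```python
import numpy as np
import pandas as pd

idx = pd.Index(["a", np.nan, "b", np.nan], dtype="object")

# Index.fillna takes a fill value, so 'ffill' is used as a literal label.
literal = idx.fillna("ffill")
print(literal.tolist())   # ['a', 'ffill', 'b', 'ffill']

# A Series supports forward-filling; wrap the result back into an Index.
filled = pd.Index(pd.Series(idx).ffill())
print(filled.tolist())    # ['a', 'a', 'b', 'b']
```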

Your original regex did not capture the entire column name, so let's fix that.

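# regex=True is passed explicitly because newer pandas versions treat the pattern as a literal string by default
df.columns.str.replace('Unnamed:*', '', regex=True)
#Index(['a', '3', '1', 'b', '2'], dtype='object')
df.columns.str.replace(r'Unnamed:\d+', '', regex=True)
#Index(['a', '', '', 'b', ''], dtype='object')
df.columns.str.replace('Unnamed:.+', '', regex=True)
#Index(['a', '', '', 'b', ''], dtype='object')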
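
The behaviour of these patterns can be checked on single labels with Python's re module. One caveat, stated as an assumption about the asker's file: the labels shown in the question have a space after the colon ("Unnamed: 3"), while the toy frame here has none, so the last pattern in this sketch allows optional whitespace.

```python
import re

label = "Unnamed:3"

# ':*' means "zero or more colons", so the digit is left behind.
print(repr(re.sub(r"Unnamed:*", "", label)))     # '3'
# '\d+' also consumes the digits.
print(repr(re.sub(r"Unnamed:\d+", "", label)))   # ''

# Allow optional whitespace so labels with a space after the colon match too.
pattern = re.compile(r"^Unnamed:\s*\d+$")
print(bool(pattern.match("Unnamed: 3")))   # True
print(bool(pattern.match("Unnamed:3")))    # True
print(bool(pattern.match("Second")))       # False
```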

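Now let's convert the index to a Series, so that we can use pd.Series.replace with a working regex to turn the offending column names into NaN and then forward-fill them. Finally, we convert the result back to a pd.Index.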

pd.Index(
    pd.Series(
        df.columns
    ).replace(r'Unnamed:\d+', np.nan, regex=True).ffill()  # .ffill() replaces the deprecated fillna(method='ffill')
)
#Index(['a', 'a', 'a', 'b', 'b'], dtype='object')

df.columns = pd.Index(pd.Series(df.columns).replace(r'Unnamed:\d+', np.nan, regex=True).ffill())
df.head() 
#   a  a  a  b  b
#0  1  1  1  1  1
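
One consequence worth knowing: after the forward fill the column labels are no longer unique, so selecting by label returns every column that shares it. A small sketch on a frame shaped like the result above (the values are made up):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5]], columns=["a", "a", "a", "b", "b"])

print(df.columns.is_unique)    # False
# Selecting a duplicated label gives a DataFrame, not a Series.
print(df["a"].shape)           # (1, 3)
# Positional access still addresses one column at a time.
print(df.iloc[:, 1].tolist())  # [2]
```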

I did the following, and I think it preserves the order you are looking for.

import numpy as np
import pandas as pd
df = pd.read_excel('book1.xlsx')
print(df)


    a   b   c  Unnamed: 3  Unnamed: 4   d  Unnamed: 6   e  Unnamed: 8   f
0  34  13  73         nan         nan  87         nan  76         nan  36
1  70  48   1         nan         nan  88         nan   2         nan  77
2  37  62  28         nan         nan   2         nan  53         nan  60
3  17  97  78         nan         nan  69         nan  93         nan  48
4  65  19  96         nan         nan  72         nan   4         nan  57
5  63   6  86         nan         nan  14         nan  20         nan  51
6  10  67  54         nan         nan  52         nan  48         nan  79


df.columns = pd.Series([np.nan if 'Unnamed:' in x else x for x in df.columns.values]).ffill().values.flatten()
print(df)


    a   b   c   c   c   d   d   e   e   f
0  34  13  73 nan nan  87 nan  76 nan  36
1  70  48   1 nan nan  88 nan   2 nan  77
2  37  62  28 nan nan   2 nan  53 nan  60
3  17  97  78 nan nan  69 nan  93 nan  48
4  65  19  96 nan nan  72 nan   4 nan  57
5  63   6  86 nan nan  14 nan  20 nan  51
6  10  67  54 nan nan  52 nan  48 nan  79