Pandas: Renaming "Unnamed: *" or "NaN" in data frame
Here is my code so far:
import numpy as np
import pandas as pd
df = pd.read_excel(r'file.xlsx', index_col=0)
This is what it looks like:
I want to rename the "Unnamed: *" columns to the last valid name.
Here is what I tried, with the results:
df.columns = df.columns.str.replace('Unnamed.*', method='ffill')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-253-c868b8bff7c7> in <module>()
----> 1 df.columns = df.columns.str.replace('Unnamed.*', method='ffill')
TypeError: replace() got an unexpected keyword argument 'method'
This "works" if I just do
df.columns = df.columns.str.replace('Unnamed.*', '')
But then I have blank values, or NaN if I replace '' with 'NaN'. Then I tried:
df.columns = df.columns.fillna('ffill')
That had no effect. So I tried inplace=True:
df.columns = df.columns.fillna('ffill', inplace=True)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-279-cce486472d5b> in <module>()
----> 1 df.columns = df.columns.fillna('ffill', inplace=True)
TypeError: fillna() got an unexpected keyword argument 'inplace'
Then I tried a different approach:
i = 0
while i < len(df.columns):
    if df.columns[i] == 'NaN':
        df.columns[i] = df.columns[i-1]
    print(df.columns[i])
    i += 1
This gave me the following error:
Oil
158 RGN Mistura
Access West Winter Blend
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-246-bc8fa6881b1a> in <module>()
2 while i < len(df.columns):
3 if df.columns[i] == 'NaN':
----> 4 df.columns[i] = df.columns[i-1]
5 print(df.columns[i])
6 i += 1
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
2048
2049 def __setitem__(self, key, value):
-> 2050 raise TypeError("Index does not support mutable operations")
2051
2052 def __getitem__(self, key):
TypeError: Index does not support mutable operations
Something that works:
df.columns = df.columns.where(~df.columns.str.startswith('Unnamed')).to_series().ffill()
Full example:
import numpy as np
import pandas as pd
df = pd.DataFrame(columns=['First', 'Unnamed: 1', 'Unnamed: 2','Second', 'Unnamed: 3'])
df.columns = df.columns.where(~df.columns.str.startswith('Unnamed')).to_series().ffill()
print(df.columns)
Prints:
Index(['First', 'First', 'First', 'Second', 'Second'], dtype='object')
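The problem you are running into comes from the fact that columns and indexes are pd.Index objects. The fillna method of a pandas Index does not take the same arguments as the fillna method of a pandas Series or DataFrame.
I made a toy example below: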
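To make the difference concrete, here is a minimal sketch (the column labels are made up for illustration). It assumes that Index.fillna only accepts a scalar fill value, so the string 'ffill' is inserted literally, whereas a Series offers a true forward fill through .ffill().

```python
import numpy as np
import pandas as pd

# Hypothetical column labels with missing entries.
cols = pd.Index(['a', np.nan, np.nan, 'b', np.nan])

# Index.fillna treats its argument as the fill value, not as a method name.
literal = cols.fillna('ffill')

# Converting to a Series gives access to a real forward fill.
filled = pd.Index(pd.Series(cols).ffill())

print(list(literal))  # ['a', 'ffill', 'ffill', 'b', 'ffill']
print(list(filled))   # ['a', 'a', 'a', 'b', 'b']
```

This also suggests why the attempt in the question appeared to do nothing: the labels there were strings such as '' or 'NaN', not missing values, so there was nothing for fillna to fill.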
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {'a':[1], 'Unnamed:1':[1], 'Unnamed:2':[1], 'b':[1], 'Unnamed:3':[1]},
    columns=['a', 'Unnamed:3', 'Unnamed:1', 'b', 'Unnamed:2'])
df
# a Unnamed:3 Unnamed:1 b Unnamed:2
#0 1 1 1 1 1
Your original regex did not capture the whole column name, so let's fix that.
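# regex=True is passed explicitly: newer pandas versions treat the pattern as a literal string by default
df.columns.str.replace('Unnamed:*', '', regex=True)
#Index(['a', '3', '1', 'b', '2'], dtype='object')
df.columns.str.replace(r'Unnamed:\d+', '', regex=True)
#Index(['a', '', '', 'b', ''], dtype='object')
df.columns.str.replace('Unnamed:.+', '', regex=True)
#Index(['a', '', '', 'b', ''], dtype='object')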
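Now let's convert the index to a Series so that we can use pd.Series's .replace and forward-fill methods, together with a working regex, to replace the problematic column names with the last valid name. Finally we convert the result back to a pd.Index.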
# .ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas versions
pd.Index(
    pd.Series(
        df.columns
    ).replace(r'Unnamed:\d+', np.nan, regex=True).ffill()
)
#Index(['a', 'a', 'a', 'b', 'b'], dtype='object')
df.columns = pd.Index(pd.Series(df.columns).replace(r'Unnamed:\d+', np.nan, regex=True).ffill())
df.head()
# a a a b b
#0 1 1 1 1 1
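If this is needed in more than one place, the same idea could be wrapped in a small helper. The sketch below is only one possible arrangement: the function name ffill_unnamed and its prefix parameter are hypothetical, and it uses Series.mask rather than a regex to blank out the placeholder labels.

```python
import pandas as pd

def ffill_unnamed(columns, prefix='Unnamed'):
    """Return a new Index in which labels starting with `prefix`
    are replaced by the closest preceding real label."""
    labels = pd.Series(list(columns), dtype=object)
    # Boolean mask marking the placeholder labels.
    is_placeholder = labels.astype(str).str.startswith(prefix)
    # mask() sets the placeholders to NaN, ffill() copies the last real label forward.
    return pd.Index(labels.mask(is_placeholder).ffill())

# Made-up frame for illustration.
df = pd.DataFrame([[1, 2, 3, 4, 5]],
                  columns=['a', 'Unnamed: 1', 'Unnamed: 2', 'b', 'Unnamed: 4'])
df.columns = ffill_unnamed(df.columns)
print(list(df.columns))  # ['a', 'a', 'a', 'b', 'b']
```

Note that a placeholder in the very first position has nothing before it to copy, so it would stay NaN.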
I did the following, which I think preserves the order you are looking for.
import numpy as np
import pandas as pd
df = pd.read_excel('book1.xlsx')
print(df)
a b c Unnamed: 3 Unnamed: 4 d Unnamed: 6 e Unnamed: 8 f
0 34 13 73 nan nan 87 nan 76 nan 36
1 70 48 1 nan nan 88 nan 2 nan 77
2 37 62 28 nan nan 2 nan 53 nan 60
3 17 97 78 nan nan 69 nan 93 nan 48
4 65 19 96 nan nan 72 nan 4 nan 57
5 63 6 86 nan nan 14 nan 20 nan 51
6 10 67 54 nan nan 52 nan 48 nan 79
df.columns = pd.Series([np.nan if 'Unnamed:' in x else x for x in df.columns.values]).ffill().values.flatten()
print(df)
a b c c c d d e e f
0 34 13 73 nan nan 87 nan 76 nan 36
1 70 48 1 nan nan 88 nan 2 nan 77
2 37 62 28 nan nan 2 nan 53 nan 60
3 17 97 78 nan nan 69 nan 93 nan 48
4 65 19 96 nan nan 72 nan 4 nan 57
5 63 6 86 nan nan 14 nan 20 nan 51
6 10 67 54 nan nan 52 nan 48 nan 79
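The loop from the question can also be made to work. The TypeError there arises because a pandas Index is immutable, so a single label cannot be assigned in place, but nothing prevents building a new list of names and assigning it to df.columns in one step. A minimal sketch, with made-up data and the assumption that placeholder labels start with 'Unnamed':

```python
import pandas as pd

# Made-up frame for illustration.
df = pd.DataFrame([[1, 2, 3, 4]],
                  columns=['Oil', 'Unnamed: 1', 'Gas', 'Unnamed: 3'])

new_names = []
last_valid = None  # stays None if the first column is itself a placeholder
for name in df.columns:
    if str(name).startswith('Unnamed'):
        # Reuse the most recent real label.
        new_names.append(last_valid)
    else:
        last_valid = name
        new_names.append(name)

# Replace the whole Index at once instead of mutating it element by element.
df.columns = new_names
print(list(df.columns))  # ['Oil', 'Oil', 'Gas', 'Gas']
```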