How to pass another entire column as argument to pandas fillna()
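I would like to fill missing values in one column with values from another column, using the fillna method.
(I read that looping through each row is very bad practice and that it is better to do everything in one go, but I could not find out how to do it with fillna.)
Data before: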
Day Cat1 Cat2
1 cat mouse
2 dog elephant
3 cat giraf
4 NaN ant
Data after:
Day Cat1 Cat2
1 cat mouse
2 dog elephant
3 cat giraf
4 ant ant
You can do
df.Cat1 = np.where(df.Cat1.isnull(), df.Cat2, df.Cat1)
The overall construction on the RHS uses the ternary pattern from the pandas cookbook (which is worth reading in any case). It is a vectorized version of a ? b : c.
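To make the "vectorized a ? b : c" reading concrete, here is a minimal sketch that runs np.where next to the equivalent element-by-element Python expression. The frame is rebuilt from the question's data; the names filled and looped are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Day": [1, 2, 3, 4],
    "Cat1": ["cat", "dog", "cat", np.nan],
    "Cat2": ["mouse", "elephant", "giraf", "ant"],
})

# Vectorized ternary: where Cat1 is missing take Cat2, otherwise keep Cat1.
filled = np.where(df["Cat1"].isnull(), df["Cat2"], df["Cat1"])

# The same logic spelled out one element at a time, for comparison only.
looped = [b if pd.isnull(a) else a for a, b in zip(df["Cat1"], df["Cat2"])]

df["Cat1"] = filled
print(df)
```

Note that np.where works positionally on the underlying values, so both columns are assumed to come from the same frame in the same row order.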
Just use the value parameter instead of method:
In [20]: df
Out[20]:
Cat1 Cat2 Day
0 cat mouse 1
1 dog elephant 2
2 cat giraf 3
3 NaN ant 4
In [21]: df.Cat1 = df.Cat1.fillna(value=df.Cat2)
In [22]: df
Out[22]:
Cat1 Cat2 Day
0 cat mouse 1
1 dog elephant 2
2 cat giraf 3
3 ant ant 4
You can pass this column to fillna (see the docs); it will use those values on matching index labels to fill:
In [17]: df['Cat1'].fillna(df['Cat2'])
Out[17]:
0 cat
1 dog
2 cat
3 ant
Name: Cat1, dtype: object
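"Matching indexes" means the fill values are looked up by label, not by position. A small sketch with two made-up Series (target, source and partial_source are hypothetical names, not from the question) shows the difference:

```python
import numpy as np
import pandas as pd

target = pd.Series(["cat", np.nan, np.nan], index=[0, 1, 2])

# Same length, but the labels are in reverse order.
source = pd.Series(["x", "y", "z"], index=[2, 1, 0])

# Each missing label in target is filled from the same label in source.
result = target.fillna(source)
print(result)

# A label that is absent from the fill Series simply stays missing.
partial_source = pd.Series(["q"], index=[1])
partial = target.fillna(partial_source)
print(partial)
```

When both columns come from the same DataFrame, as in the question, the labels always line up, so this only matters when the fill values come from somewhere else.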
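Here is a more general approach (the fillna method is probably better):
def is_missing(Cat1, Cat2):
    # np.isnan accepts numeric input only; it raises TypeError for strings
    if np.isnan(Cat1):
        return Cat2
    else:
        return Cat1
df['Cat1'] = df.apply(lambda x: is_missing(x['Cat1'], x['Cat2']), axis=1)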
pandas.DataFrame.combine_first also works.
(Note: because "Result index columns will be the union of the respective indexes and columns", you should check that the index and columns match.)
import numpy as np
import pandas as pd
df = pd.DataFrame([["1","cat","mouse"],
["2","dog","elephant"],
["3","cat","giraf"],
["4",np.nan,"ant"]],columns=["Day","Cat1","Cat2"])
In: df["Cat1"].combine_first(df["Cat2"])
Out:
0 cat
1 dog
2 cat
3 ant
Name: Cat1, dtype: object
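The warning about the union of indexes is easiest to see on two Series whose labels only partly overlap. This is a sketch with made-up data (a and b are hypothetical), comparing combine_first with fillna:

```python
import numpy as np
import pandas as pd

a = pd.Series(["cat", np.nan], index=[0, 1])
b = pd.Series(["ant", "bee"], index=[1, 2])

# combine_first keeps every label from either Series.
combined = a.combine_first(b)
print(combined)

# fillna keeps only the labels of the Series it is called on.
filled = a.fillna(b)
print(filled)
```

So with mismatched indexes combine_first can return more rows than the original column, while fillna never does.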
Comparison with the other answers:
%timeit df["Cat1"].combine_first(df["Cat2"])
181 µs ± 11.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit df['Cat1'].fillna(df['Cat2'])
253 µs ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.where(df.Cat1.isnull(), df.Cat2, df.Cat1)
88.1 µs ± 793 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
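The timings above were taken on the four-row example frame, where fixed per-call overhead dominates. A rough sketch for repeating the comparison on a larger frame follows; the frame size, the random data and the name big are assumptions made for illustration, and no particular ranking is implied.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000  # arbitrary size chosen for this sketch

cat2 = pd.Series(rng.choice(["mouse", "elephant", "giraf", "ant"], size=n))
# Blank out roughly a quarter of Cat1 so there is something to fill.
cat1 = pd.Series(rng.choice(["cat", "dog"], size=n)).where(rng.random(n) > 0.25)
big = pd.DataFrame({"Cat1": cat1, "Cat2": cat2})

candidates = {
    "combine_first": lambda: big["Cat1"].combine_first(big["Cat2"]),
    "fillna": lambda: big["Cat1"].fillna(big["Cat2"]),
    "np.where": lambda: np.where(big["Cat1"].isnull(), big["Cat2"], big["Cat1"]),
}

# Check that all three approaches agree before timing them.
results = {name: list(fn()) for name, fn in candidates.items()}

for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=5) / 5
    print(f"{name}: {seconds * 1e3:.2f} ms per call")
```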
I did not use the following method:
def is_missing(Cat1, Cat2):
    if np.isnan(Cat1):
        return Cat2
    else:
        return Cat1
df['Cat1'] = df.apply(lambda x: is_missing(x['Cat1'], x['Cat2']), axis=1)
because it raises an exception:
TypeError: ("ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''", 'occurred at index 0')
This means that np.isnan can be applied to NumPy arrays of a native dtype (such as np.float64), but raises TypeError when applied to object arrays.
So I revised the method:
def is_missing(Cat1, Cat2):
    if pd.isnull(Cat1):
        return Cat2
    else:
        return Cat1
%timeit df.apply(lambda x: is_missing(x['Cat1'],x['Cat2']),axis=1)
701 µs ± 7.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
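The difference between the two checks can be reproduced on plain scalars, without a DataFrame. A minimal sketch (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# np.isnan is fine with a float NaN ...
float_check = bool(np.isnan(np.float64("nan")))

# ... but rejects a string, which is what the Cat1 column mostly holds.
try:
    np.isnan("cat")
    raised = False
except TypeError:
    raised = True

# pd.isnull accepts strings, NaN and None alike.
checks = [bool(pd.isnull(v)) for v in ["cat", np.nan, None]]

print(float_check, raised, checks)
```

This is why the row-wise function only works on a mixed string/NaN column once np.isnan is swapped for pd.isnull.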
I know this is an old question, but I needed to do something similar recently. I was able to use the following:
df = pd.DataFrame([["1","cat","mouse"],
["2","dog","elephant"],
["3","cat","giraf"],
["4",np.nan,"ant"]],columns=["Day","Cat1","Cat2"])
print(df)
Day Cat1 Cat2
0 1 cat mouse
1 2 dog elephant
2 3 cat giraf
3 4 NaN ant
df1 = df.bfill(axis=1).iloc[:, 1]
df1 = df1.to_frame()
print(df1)
Produces:
Cat1
0 cat
1 dog
2 cat
3 ant
Hope this helps!
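One caveat with the bfill(axis=1) approach: it back-fills every column from whatever lies to its right, and iloc[:, 1] depends on the column order. A possible variation, offered only as a sketch, restricts the back-fill to the two columns involved and selects the result by name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([["1", "cat", "mouse"],
                   ["2", "dog", "elephant"],
                   ["3", "cat", "giraf"],
                   ["4", np.nan, "ant"]], columns=["Day", "Cat1", "Cat2"])

# Back-fill along each row, but only across Cat1 and Cat2, so Day is never touched.
filled = df[["Cat1", "Cat2"]].bfill(axis=1)["Cat1"]

df["Cat1"] = filled
print(df)
```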