Python: delete rows where most columns are NaN
I am importing data from Excel where some rows contain notes in one column but are not really part of the dataframe. A dummy example is below:
H1 H2 H3
*highlighted cols are PII
sam red 5
pam blue 3
rod green 11
* this is the end of the data
After importing the above file into dfPA, it looks like:
dfPA:
Index H1 H2 H3
1 *highlighted cols are PII
2 sam red 5
3 pam blue 3
4 rod green 11
5 * this is the end of the data
I want to delete the first and last rows. This is what I did.
#get count of cols in df
input: cntcols = dfPA.shape[1]
output: 3
#get count of cols with nan in df
input: a = dfPA.shape[1] - dfPA.count(axis=1)
output:
0 2
1 3
2 3
4 3
5 2
(where a is a series)
#convert a from series to df
dfa = a.to_frame()
#delete rows where no. of nan's are greater than 'n'
n = 1
for r, row in dfa.iterrows():
    if (cntcols - dfa.iloc[r][0]) > n:
        i = row.name
        dfPA = dfPA.drop(index=i)
This doesn't work. Is there a way to do this?
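You should use the pandas.DataFrame.dropna method. It has a thresh parameter that sets the minimum number of non-NaN values a row or column must have in order to be kept; anything with fewer is dropped.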
Imagine the following dataframe:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame([[1,np.nan,1,np.nan], [1,1,1,1], [1,np.nan,1,1], [np.nan,1,1,1]], columns=list('ABCD'))
>>> df
A B C D
0 1.0 NaN 1 NaN
1 1.0 1.0 1 1.0
2 1.0 NaN 1 1.0
3 NaN 1.0 1 1.0
You can drop every column that contains a NaN:
>>> df.dropna(axis=1)
C
0 1
1 1
2 1
3 1
The thresh parameter defines the minimum number of non-NaN values required to keep a column:
>>> df.dropna(thresh=3, axis=1)
A C D
0 1.0 1 NaN
1 1.0 1 1.0
2 1.0 1 1.0
3 NaN 1 1.0
If you would rather reason in terms of the number of NaN:
# example: drop a column once it has at least 2 NaN
# with axis=1 each column holds len(df) values, so count against the number of rows
>>> df.dropna(thresh=len(df)-(2-1), axis=1)
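Another way to reason in NaN counts, offered as a sketch rather than as part of the original answer, is to count the missing values per column with isna().sum() and keep only the columns below a limit. The name max_nan is hypothetical and chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, np.nan, 1, np.nan], [1, 1, 1, 1], [1, np.nan, 1, 1], [np.nan, 1, 1, 1]],
    columns=list("ABCD"),
)

max_nan = 2  # hypothetical name: a column is dropped once it holds this many NaN

# Number of missing values in each column, as a Series indexed by column label
nan_per_column = df.isna().sum()

# Keep only the columns whose NaN count stays below the limit
kept = df.loc[:, nan_per_column < max_nan]
print(list(kept.columns))
```

Here column B is the only one with two NaN, so it is the only one removed. The boolean mask avoids converting between "number of NaN" and "number of non-NaN" by hand.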
If you need to filter rows instead of columns, remove the axis parameter or use axis=0:
>>> df.dropna(thresh=3)
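Applied to the data in the question, this could look like the sketch below. The dfPA built here is a hand-made reconstruction of the imported sheet (an assumption, since the real file is not shown), and n is the asker's own variable, read as the maximum number of NaN allowed per row.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of dfPA: note rows have text in H1 and NaN elsewhere
dfPA = pd.DataFrame(
    {
        "H1": ["*highlighted cols are PII", "sam", "pam", "rod",
               "* this is the end of the data"],
        "H2": [np.nan, "red", "blue", "green", np.nan],
        "H3": [np.nan, 5, 3, 11, np.nan],
    }
)

n = 1  # maximum number of NaN tolerated in a row

# A row is kept only if it has at least (number of columns - n) non-NaN values
cleaned = dfPA.dropna(thresh=dfPA.shape[1] - n)
print(cleaned)
```

The two note rows have a single non-NaN value each, which is below the threshold of 2, so only the sam, pam and rod rows remain. No explicit loop over the rows is needed.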