Remove rows and ValueError: Arrays were different lengths
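My dataframe has sub-categories, and under each category (cat, dog, bird) statistics are displayed. I need to remove the rows whose stats value is count or freq and keep only the rows holding sd and mean values. Some values are NaN. My code raises a ValueError.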
df:
var stats A B C
cat mean 2 3 4
NaN sd 2 1 3
NaN count 5 2 6
NaN freq 3 1 19
dog mean 8 1 2
NaN sd 2 1 3
NaN count 4 6 1
NaN freq 3 1 19
bird mean 2 3 4
NaN sd 2 1 3
NaN count 5 2 6
NaN freq NaN NaN NaN
My code:
rows = ['count', 'freq']
df = [df.stats != rows]
Expected result:
var stats A B C
cat mean 2 3 4
NaN sd 2 1 3
dog mean 8 1 2
NaN sd 2 1 3
bird mean 2 3 4
NaN sd 2 1 3
Error:
File "pandas/_libs/lib.pyx", line 805, in pandas._libs.lib.vec_compare
(pandas/_libs/lib.c:14288)
ValueError: Arrays were different lengths: 819 vs 9
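I'm not sure how to check the array lengths, but in my Excel spreadsheet all columns and rows have the same length. Is this error caused by the NaN/empty cells in my data?
Thanks!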
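!= doesn't work here: it compares the column element by element against the list, so both sides must have the same length. Use pd.Series.isin to get a mask, then use that mask to filter the dataframe.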
m = ~df.stats.isin(['count', 'freq'])
print(m)
0 True
1 True
2 False
3 False
4 True
5 True
6 False
7 False
8 True
9 True
10 False
11 False
Name: stats, dtype: bool
print(df[m])
var stats A B C
0 cat mean 2.0 3.0 4.0
1 NaN sd 2.0 1.0 3.0
4 dog mean 8.0 1.0 2.0
5 NaN sd 2.0 1.0 3.0
8 bird mean 2.0 3.0 4.0
9 NaN sd 2.0 1.0 3.0
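To make the length mismatch concrete, here is a minimal sketch that rebuilds the sample frame from the question (the construction itself is my assumption about how the data looks once loaded), prints the two lengths being compared, and then applies the isin mask. The NaN cells play no part in the error; the 819 vs 9 in the traceback are presumably just the lengths of the real column and the real list.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question (assumed layout after loading).
df = pd.DataFrame({
    "var": ["cat", np.nan, np.nan, np.nan,
            "dog", np.nan, np.nan, np.nan,
            "bird", np.nan, np.nan, np.nan],
    "stats": ["mean", "sd", "count", "freq"] * 3,
    "A": [2, 2, 5, 3, 8, 2, 4, 3, 2, 2, 5, np.nan],
    "B": [3, 1, 2, 1, 1, 1, 6, 1, 3, 1, 2, np.nan],
    "C": [4, 3, 6, 19, 2, 3, 1, 19, 4, 3, 6, np.nan],
})
rows = ["count", "freq"]

# Checking the lengths: the column has one entry per row, the list has two.
print(len(df["stats"]), len(rows))

# Element-wise comparison against a list of a different length fails.
try:
    df["stats"] != rows
    error = None
except ValueError as exc:
    error = exc
print(type(error).__name__)

# Membership test instead: True where stats is NOT one of the unwanted labels.
mask = ~df["stats"].isin(rows)
filtered = df[mask]
print(filtered)
```

Note that the original line also wrapped the comparison in a plain Python list, df = [...], so even with a valid mask it would not have filtered anything; df[mask] is what does the filtering.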
You can use the SQL-like query() method:
In [163]: df.query("stats not in ['count','freq']")
Out[163]:
var stats A B C
0 cat mean 2.0 3.0 4.0
1 NaN sd 2.0 1.0 3.0
4 dog mean 8.0 1.0 2.0
5 NaN sd 2.0 1.0 3.0
8 bird mean 2.0 3.0 4.0
9 NaN sd 2.0 1.0 3.0
Or, using your rows variable:
In [164]: df.query("stats not in @rows")
Out[164]:
var stats A B C
0 cat mean 2.0 3.0 4.0
1 NaN sd 2.0 1.0 3.0
4 dog mean 8.0 1.0 2.0
5 NaN sd 2.0 1.0 3.0
8 bird mean 2.0 3.0 4.0
9 NaN sd 2.0 1.0 3.0
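For anyone who wants to run the two query() calls outside an IPython session, a self-contained sketch follows. The frame is rebuilt from the question's sample (an assumption about the loaded data), and the @ prefix is how query() refers to a local Python variable.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question (assumed layout after loading).
df = pd.DataFrame({
    "var": ["cat", np.nan, np.nan, np.nan,
            "dog", np.nan, np.nan, np.nan,
            "bird", np.nan, np.nan, np.nan],
    "stats": ["mean", "sd", "count", "freq"] * 3,
    "A": [2, 2, 5, 3, 8, 2, 4, 3, 2, 2, 5, np.nan],
    "B": [3, 1, 2, 1, 1, 1, 6, 1, 3, 1, 2, np.nan],
    "C": [4, 3, 6, 19, 2, 3, 1, 19, 4, 3, 6, np.nan],
})
rows = ["count", "freq"]

# Labels written inline in the query string.
literal = df.query("stats not in ['count', 'freq']")

# Same filter, with the labels taken from the local variable `rows`.
by_name = df.query("stats not in @rows")
print(by_name)
```

The @rows form is handy when the list of labels is built elsewhere in the program, since the query string itself does not have to change.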
Just for fun!
rows = ['count', 'freq']
df.merge(pd.DataFrame(dict(stats=np.setdiff1d(df.stats, rows))))
var stats A B C
0 cat mean 2.0 3.0 4.0
1 dog mean 8.0 1.0 2.0
2 bird mean 2.0 3.0 4.0
3 NaN sd 2.0 1.0 3.0
4 NaN sd 2.0 1.0 3.0
5 NaN sd 2.0 1.0 3.0
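A runnable sketch of the merge trick, again on a frame rebuilt from the question's sample (my assumption). np.setdiff1d returns the sorted unique labels that are in the column but not in rows, and the inner merge keeps only rows matching those labels. The index is renumbered, and the row order of the result may not match the original frame (it can differ between pandas versions), so the sketch does not rely on it.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question (assumed layout after loading).
df = pd.DataFrame({
    "var": ["cat", np.nan, np.nan, np.nan,
            "dog", np.nan, np.nan, np.nan,
            "bird", np.nan, np.nan, np.nan],
    "stats": ["mean", "sd", "count", "freq"] * 3,
    "A": [2, 2, 5, 3, 8, 2, 4, 3, 2, 2, 5, np.nan],
    "B": [3, 1, 2, 1, 1, 1, 6, 1, 3, 1, 2, np.nan],
    "C": [4, 3, 6, 19, 2, 3, 1, 19, 4, 3, 6, np.nan],
})
rows = ["count", "freq"]

# Labels present in the column but not in `rows`.
wanted = np.setdiff1d(df["stats"], rows)

# One-column frame of wanted labels; the default inner merge joins on the
# shared "stats" column and drops everything else.
keep = pd.DataFrame({"stats": wanted})
merged = df.merge(keep)
print(merged)
```

If the original index or row order matters, the isin mask is the safer choice.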
Another fun way, with set_index and drop:
df.set_index('stats').drop(rows).reset_index()
stats var A B C
0 mean cat 2.0 3.0 4.0
1 sd NaN 2.0 1.0 3.0
2 mean dog 8.0 1.0 2.0
3 sd NaN 2.0 1.0 3.0
4 mean bird 2.0 3.0 4.0
5 sd NaN 2.0 1.0 3.0
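Two things to watch with this variant: drop raises a KeyError if one of the labels is not present in the index, and reset_index puts stats in front of var. A small sketch that handles both, assuming the sample frame from the question; the extra "median" label is hypothetical and only there to show errors="ignore" at work.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question (assumed layout after loading).
df = pd.DataFrame({
    "var": ["cat", np.nan, np.nan, np.nan,
            "dog", np.nan, np.nan, np.nan,
            "bird", np.nan, np.nan, np.nan],
    "stats": ["mean", "sd", "count", "freq"] * 3,
    "A": [2, 2, 5, 3, 8, 2, 4, 3, 2, 2, 5, np.nan],
    "B": [3, 1, 2, 1, 1, 1, 6, 1, 3, 1, 2, np.nan],
    "C": [4, 3, 6, 19, 2, 3, 1, 19, 4, 3, 6, np.nan],
})
# "median" is a hypothetical label that does not occur in the data.
rows = ["count", "freq", "median"]

kept = (
    df.set_index("stats")
      .drop(rows, errors="ignore")   # skip labels that are not in the index
      .reset_index()[df.columns]     # restore the original column order
)
print(kept)
```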
Haha :)
df[[x not in rows for x in df.stats]]
Out[520]:
var stats A B C
0 cat mean 2.0 3.0 4.0
1 NaN sd 2.0 1.0 3.0
4 dog mean 8.0 1.0 2.0
5 NaN sd 2.0 1.0 3.0
8 bird mean 2.0 3.0 4.0
9 NaN sd 2.0 1.0 3.0
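As a quick sanity check, the list comprehension can be compared against the isin and query() variants on the same data. This sketch assumes the sample frame from the question; all three keep the original index labels, so DataFrame.equals can compare them directly.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question (assumed layout after loading).
df = pd.DataFrame({
    "var": ["cat", np.nan, np.nan, np.nan,
            "dog", np.nan, np.nan, np.nan,
            "bird", np.nan, np.nan, np.nan],
    "stats": ["mean", "sd", "count", "freq"] * 3,
    "A": [2, 2, 5, 3, 8, 2, 4, 3, 2, 2, 5, np.nan],
    "B": [3, 1, 2, 1, 1, 1, 6, 1, 3, 1, 2, np.nan],
    "C": [4, 3, 6, 19, 2, 3, 1, 19, 4, 3, 6, np.nan],
})
rows = ["count", "freq"]

via_isin = df[~df["stats"].isin(rows)]
via_query = df.query("stats not in @rows")
# Plain Python membership test per element, giving a list of booleans.
via_listcomp = df[[x not in rows for x in df["stats"]]]

same = via_isin.equals(via_query) and via_isin.equals(via_listcomp)
print(same)
```

The list comprehension runs a Python-level loop over every element, so on large frames the isin mask is usually the better default.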