在 pandas 数据帧上应用条件以过滤数组时的 FutureWarning

FutureWarning when applying a condition on a pandas dataframe to filter an array

我已将 PCA 应用于大约 1000 个观测值的数组,但如果原始数组中的某个特征=某物,我只想将观测值保留在新数组中。

我有一个 numpy 数组 df2 和一个数据框 df。我想找到 df2 中的所有行,其中 df.PositionCDM

我的实际数据:

df2

[[ -6.00987823e+00   4.46585005e+00]
 [ -7.09055159e+00   1.89437600e+00]
 [ -5.91044431e+00  -1.97888707e+00]
 [ -4.85698965e+00  -1.09936724e+00]
 [ -4.01780368e-01  -2.57178392e+00]
 [ -2.97351215e+00  -3.15940358e+00]
 [ -4.27973589e+00   2.82707326e+00]
 [  3.95086576e+00   1.08281922e+00]
 [ -2.94075361e+00  -1.95544661e+00]
 [ -4.83788056e+00   2.32369496e+00]
 [ -5.00473716e+00  -3.37680552e-01]
 [ -4.88905829e+00  -1.55527476e+00]
 [ -3.38202709e+00  -1.04402867e+00]
 [ -2.14261510e+00  -5.30757477e-01]
 [  3.00813803e-01  -2.11010985e+00]
 [ -2.67824986e+00  -1.83303905e+00]
 [ -1.64547049e+00  -2.48056250e+00]
 [ -2.92550543e+00  -3.02363170e+00]
 [ -4.01116933e+00   2.90363840e+00]
 [ -1.04571206e+00   7.58064433e-01]
 [  2.34068739e-01  -2.33981296e+00]
 [  3.15597517e+00   1.09429188e+00]
 [ -3.83828970e+00   1.14195305e-01]
 [ -7.33794066e-01  -3.70152816e+00]
 [  8.21789967e-01  -4.77818413e-01]
 [ -3.29257688e+00  -1.61887349e+00]
 [ -4.24297171e+00   2.27187714e+00]
 [  1.45714199e+00  -3.56024788e+00]
 [  1.79855738e+00  -3.71818328e-01]
 [  3.68171085e-01  -3.52961707e+00]
 [  3.77585412e+00  -3.01627595e-01]
 [ -4.21740128e+00  -1.30913719e+00]
 [ -3.85041585e+00  -1.05515969e+00]
 [ -5.01752378e+00   4.67348167e-01]
 [  3.65943448e+00   9.21016483e-01]
 [  3.12159896e+00  -1.25707872e-01]
 [ -4.50219722e+00  -4.06752784e+00]
 [ -3.92172250e+00  -2.88567430e+00]
 [ -2.68908475e-01  -2.17506629e+00]
 [ -1.13728112e+00  -2.66843007e+00]
 [ -8.73467957e-01  -1.24389494e+00]
 [  3.21966300e+00  -1.35271239e-01]
 [ -4.31060796e+00  -1.90505910e+00]
 [  3.73904981e+00   7.70228802e-01]
 [  1.02646986e+00  -5.91828676e-01]
 [  8.43840480e-01  -1.49636218e+00]
 [  1.54065978e+00  -1.65086030e+00]
 [  2.96602068e+00  -7.41024474e-01]
 [  6.53636345e-01   3.04647288e-01]
 [  2.59236989e+00  -6.70435261e-02]
 [  2.00184665e-01  -1.55230314e+00]
 [ -7.29533092e-01  -2.73390749e+00]
 [ -2.93578745e+00  -2.18118257e+00]
 [ -4.37481195e+00   1.02701222e+00]
 [  1.00713302e+00  -1.39943282e+00]
...]


df

(只是在 football/soccer 中的位置 - FB、CB、CDM、CM、AM、FW)

Position
FW
FW
FW
FW
FB
AM
FW
CB
AM
FW
AM
FW
AM
CM
FB
AM
CM
CM
FW
CM
CDM
CB
AM
FB
CDM
FW
FW
CDM
FB
CDM
CB
AM
...
AM

过滤时,我得到这个输出(连同 FutureWarning):

我哪里出错了,我该如何适当地过滤数据?

FutureWarning 可能是您的 numpypandas 版本已过时的结果。您可以使用以下方式升级它们:

pip install --upgrade numpy pandas 

至于过滤,有很多选择。在这里,我用一些虚拟数据提到了每一个。


设置

df
    name colour  a  b  c  d  e  f
0   john    red  1  2  3  4  5  6
1  james    red  2  3  4  5  6  7
2   jane   blue  1  2  3  5  7  8

df2
       0      1
0  0.122  0.222
1  0.343  0.345
2  0.345  0.563

选项 1
boolean indexing

df2[df.colour == 'red']
Out[726]: 
       0       1
0  0.122   0.222
1  0.343   0.345

选项 2
df.eval

df2[df.eval('colour == "red"')]
Out[732]: 
       0       1
0  0.122   0.222
1  0.343   0.345

请注意,即使 df2 是以下形式的 numpy 数组,这两个选项都有效:

array([[ 0.122,  0.222],
       [ 0.343,  0.345],
       [ 0.345,  0.563]])

对于您的实际数据,您需要按照相同的方式做一些事情:

df2

array([[-6.01 ,  4.466],
       [-7.091,  1.894],
       [-5.91 , -1.979],
       [-4.857, -1.099],
       [-0.402, -2.572],
       [-2.974, -3.159],
       [-4.28 ,  2.827],
       [ 3.951,  1.083],
       [-2.941, -1.955],
       [-4.838,  2.324],
       [-5.005, -0.338],
       [-4.889, -1.555],
       [-3.382, -1.044],
       [-2.143, -0.531],
       [ 0.301, -2.11 ],
       [-2.678, -1.833],
       [-1.645, -2.481],
       [-2.926, -3.024],
       [-4.011,  2.904],
       [-1.046,  0.758],
       [ 0.234, -2.34 ],
       [ 3.156,  1.094],
       [-3.838,  0.114],
       [-0.734, -3.702],
       [ 0.822, -0.478],
       [-3.293, -1.619],
       [-4.243,  2.272],
       [ 1.457, -3.56 ],
       [ 1.799, -0.372],
       [ 0.368, -3.53 ],
       [ 3.776, -0.302],
       [-4.217, -1.309]])

df

   Position
0        FW
1        FW
2        FW
3        FW
4        FB
5        AM
6        FW
7        CB
8        AM
9        FW
10       AM
11       FW
12       AM
13       CM
14       FB
15       AM
16       CM
17       CM
18       FW
19       CM
20      CDM
21       CB
22       AM
23       FB
24      CDM
25       FW
26       FW
27      CDM
28       FB
29      CDM
30       CB
31       AM

df2[df.Position == 'CDM']

array([[ 0.234, -2.34 ],
       [ 0.822, -0.478],
       [ 1.457, -3.56 ],
       [ 0.368, -3.53 ]])

我觉得你需要boolean indexing:

from sklearn.decomposition import PCA
import pandas as pd

d = {'d': [4, 5, 5],
     'a': [1, 2, 1], 
     'name': ['john', 'james', 'jane'], 
     'e': [5, 6, 7],
     'f': [6, 7, 8], 'c': [3, 4, 3], 
     'b': [2, 3, 2], 
     'colour': ['red', 'red', 'blue']}
cols = ['name', 'colour', 'a', 'b', 'c', 'd', 'e', 'f']
df = pd.DataFrame(d, columns = cols)
print (df)
    name colour  a  b  c  d  e  f
0   john    red  1  2  3  4  5  6
1  james    red  2  3  4  5  6  7
2   jane   blue  1  2  3  5  7  8

#create mask by condition
mask = df['colour'] == 'red'
#for multiple values
#mask = df['colour'].isin(['red', 'green', 'blue'])
print (mask)
0     True
1     True
2    False
Name: colour, dtype: bool

#filter only numeric values and convert to numpy array
arr = df.drop(['name','colour'], axis=1).values
print (arr)
[[1 2 3 4 5 6]
 [2 3 4 5 6 7]
 [1 2 3 5 7 8]]

pca = PCA(n_components=5)
pca.fit(arr)
print (pca.components_ )
[[-0.0463861  -0.0463861  -0.0463861  -0.35279184 -0.65919758 -0.65919758]
 [ 0.55515147  0.55515147  0.55515147  0.21897879 -0.11719389 -0.11719389]
 [ 0.62531284 -0.13184966 -0.136648   -0.71363037  0.17840759  0.17840759]]

#filter by condition
arr1 = pca.components_ [mask]
print (arr1)
[[-0.0463861  -0.0463861  -0.0463861  -0.35279184 -0.65919758 -0.65919758]
 [ 0.55515147  0.55515147  0.55515147  0.21897879 -0.11719389 -0.11719389]]