FutureWarning when applying a condition on a pandas dataframe to filter an array
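I have applied PCA to an array of about 1000 observations, but I only want to keep an observation in the new array if a particular feature in the original data equals a given value.
I have a numpy array df2 and a DataFrame df. I want to find all rows in df2 where df.Position is CDM.
My actual data: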
df2
[[ -6.00987823e+00 4.46585005e+00]
[ -7.09055159e+00 1.89437600e+00]
[ -5.91044431e+00 -1.97888707e+00]
[ -4.85698965e+00 -1.09936724e+00]
[ -4.01780368e-01 -2.57178392e+00]
[ -2.97351215e+00 -3.15940358e+00]
[ -4.27973589e+00 2.82707326e+00]
[ 3.95086576e+00 1.08281922e+00]
[ -2.94075361e+00 -1.95544661e+00]
[ -4.83788056e+00 2.32369496e+00]
[ -5.00473716e+00 -3.37680552e-01]
[ -4.88905829e+00 -1.55527476e+00]
[ -3.38202709e+00 -1.04402867e+00]
[ -2.14261510e+00 -5.30757477e-01]
[ 3.00813803e-01 -2.11010985e+00]
[ -2.67824986e+00 -1.83303905e+00]
[ -1.64547049e+00 -2.48056250e+00]
[ -2.92550543e+00 -3.02363170e+00]
[ -4.01116933e+00 2.90363840e+00]
[ -1.04571206e+00 7.58064433e-01]
[ 2.34068739e-01 -2.33981296e+00]
[ 3.15597517e+00 1.09429188e+00]
[ -3.83828970e+00 1.14195305e-01]
[ -7.33794066e-01 -3.70152816e+00]
[ 8.21789967e-01 -4.77818413e-01]
[ -3.29257688e+00 -1.61887349e+00]
[ -4.24297171e+00 2.27187714e+00]
[ 1.45714199e+00 -3.56024788e+00]
[ 1.79855738e+00 -3.71818328e-01]
[ 3.68171085e-01 -3.52961707e+00]
[ 3.77585412e+00 -3.01627595e-01]
[ -4.21740128e+00 -1.30913719e+00]
[ -3.85041585e+00 -1.05515969e+00]
[ -5.01752378e+00 4.67348167e-01]
[ 3.65943448e+00 9.21016483e-01]
[ 3.12159896e+00 -1.25707872e-01]
[ -4.50219722e+00 -4.06752784e+00]
[ -3.92172250e+00 -2.88567430e+00]
[ -2.68908475e-01 -2.17506629e+00]
[ -1.13728112e+00 -2.66843007e+00]
[ -8.73467957e-01 -1.24389494e+00]
[ 3.21966300e+00 -1.35271239e-01]
[ -4.31060796e+00 -1.90505910e+00]
[ 3.73904981e+00 7.70228802e-01]
[ 1.02646986e+00 -5.91828676e-01]
[ 8.43840480e-01 -1.49636218e+00]
[ 1.54065978e+00 -1.65086030e+00]
[ 2.96602068e+00 -7.41024474e-01]
[ 6.53636345e-01 3.04647288e-01]
[ 2.59236989e+00 -6.70435261e-02]
[ 2.00184665e-01 -1.55230314e+00]
[ -7.29533092e-01 -2.73390749e+00]
[ -2.93578745e+00 -2.18118257e+00]
[ -4.37481195e+00 1.02701222e+00]
[ 1.00713302e+00 -1.39943282e+00]
...]
df
(just positions in football/soccer: FB, CB, CDM, CM, AM, FW)
Position
FW
FW
FW
FW
FB
AM
FW
CB
AM
FW
AM
FW
AM
CM
FB
AM
CM
CM
FW
CM
CDM
CB
AM
FB
CDM
FW
FW
CDM
FB
CDM
CB
AM
...
AM
When filtering, I get this output (along with a FutureWarning):
Where am I going wrong, and how should I filter the data properly?
The FutureWarning is probably the result of your numpy and pandas versions being out of date. You can upgrade them with:
pip install --upgrade numpy pandas
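Before upgrading, it may help to check which versions are actually installed. A minimal sketch (the `versions` dict is only an illustrative name):

```python
import numpy as np
import pandas as pd

# Collect the installed versions so you can tell whether an upgrade is needed.
versions = {"numpy": np.__version__, "pandas": pd.__version__}
for name, version in versions.items():
    print(f"{name}: {version}")
```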
As for the filtering, there are several options. I describe each of them below using some dummy data.
Setup
df
name colour a b c d e f
0 john red 1 2 3 4 5 6
1 james red 2 3 4 5 6 7
2 jane blue 1 2 3 5 7 8
df2
0 1
0 0.122 0.222
1 0.343 0.345
2 0.345 0.563
Option 1: boolean indexing
df2[df.colour == 'red']
Out[726]:
0 1
0 0.122 0.222
1 0.343 0.345
Option 2: df.eval
df2[df.eval('colour == "red"')]
Out[732]:
0 1
0 0.122 0.222
1 0.343 0.345
Note that both options also work when df2 is a numpy array of the following form:
array([[ 0.122, 0.222],
[ 0.343, 0.345],
[ 0.345, 0.563]])
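One caveat when the target is a numpy array: numpy applies the boolean mask by position and ignores the pandas index, so both objects need the same number of rows in the same order. A small sketch with the dummy data above, assuming the names `arr`, `mask` and `filtered` (they are illustrative, not from the original post):

```python
import numpy as np
import pandas as pd

# Dummy data from the setup above.
df = pd.DataFrame({"name": ["john", "james", "jane"],
                   "colour": ["red", "red", "blue"]})
arr = np.array([[0.122, 0.222],
                [0.343, 0.345],
                [0.345, 0.563]])

# The mask is applied positionally, so the row counts must match.
assert len(df) == arr.shape[0]

# Convert the boolean Series to a plain boolean ndarray before indexing.
mask = (df["colour"] == "red").to_numpy()
filtered = arr[mask]
print(filtered)
```

Converting with `.to_numpy()` is optional here, but it makes the positional behaviour explicit.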
For your actual data, you need to do something along the same lines:
df2
array([[-6.01 , 4.466],
[-7.091, 1.894],
[-5.91 , -1.979],
[-4.857, -1.099],
[-0.402, -2.572],
[-2.974, -3.159],
[-4.28 , 2.827],
[ 3.951, 1.083],
[-2.941, -1.955],
[-4.838, 2.324],
[-5.005, -0.338],
[-4.889, -1.555],
[-3.382, -1.044],
[-2.143, -0.531],
[ 0.301, -2.11 ],
[-2.678, -1.833],
[-1.645, -2.481],
[-2.926, -3.024],
[-4.011, 2.904],
[-1.046, 0.758],
[ 0.234, -2.34 ],
[ 3.156, 1.094],
[-3.838, 0.114],
[-0.734, -3.702],
[ 0.822, -0.478],
[-3.293, -1.619],
[-4.243, 2.272],
[ 1.457, -3.56 ],
[ 1.799, -0.372],
[ 0.368, -3.53 ],
[ 3.776, -0.302],
[-4.217, -1.309]])
df
Position
0 FW
1 FW
2 FW
3 FW
4 FB
5 AM
6 FW
7 CB
8 AM
9 FW
10 AM
11 FW
12 AM
13 CM
14 FB
15 AM
16 CM
17 CM
18 FW
19 CM
20 CDM
21 CB
22 AM
23 FB
24 CDM
25 FW
26 FW
27 CDM
28 FB
29 CDM
30 CB
31 AM
df2[df.Position == 'CDM']
array([[ 0.234, -2.34 ],
[ 0.822, -0.478],
[ 1.457, -3.56 ],
[ 0.368, -3.53 ]])
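If you ever need to keep more than one position, `Series.isin` builds the same kind of mask. The sketch below uses only rows 20 to 24 of the data above; the variable names `positions`, `scores` and `kept` are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Rows 20-24 of the data shown above.
positions = pd.DataFrame({"Position": ["CDM", "CB", "AM", "FB", "CDM"]})
scores = np.array([[ 0.234, -2.34 ],
                   [ 3.156,  1.094],
                   [-3.838,  0.114],
                   [-0.734, -3.702],
                   [ 0.822, -0.478]])

# True where the position is any of the listed values.
mask = positions["Position"].isin(["CDM", "CB"]).to_numpy()
kept = scores[mask]
print(kept)
```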
I think you need boolean indexing:
from sklearn.decomposition import PCA
import pandas as pd
d = {'d': [4, 5, 5],
'a': [1, 2, 1],
'name': ['john', 'james', 'jane'],
'e': [5, 6, 7],
'f': [6, 7, 8], 'c': [3, 4, 3],
'b': [2, 3, 2],
'colour': ['red', 'red', 'blue']}
cols = ['name', 'colour', 'a', 'b', 'c', 'd', 'e', 'f']
df = pd.DataFrame(d, columns = cols)
print (df)
name colour a b c d e f
0 john red 1 2 3 4 5 6
1 james red 2 3 4 5 6 7
2 jane blue 1 2 3 5 7 8
#create mask by condition
mask = df['colour'] == 'red'
#for multiple values
#mask = df['colour'].isin(['red', 'green', 'blue'])
print (mask)
0 True
1 True
2 False
Name: colour, dtype: bool
#filter only numeric values and convert to numpy array
arr = df.drop(['name','colour'], axis=1).values
print (arr)
[[1 2 3 4 5 6]
[2 3 4 5 6 7]
[1 2 3 5 7 8]]
#n_components cannot exceed min(n_samples, n_features), which is 3 here
pca = PCA(n_components=3)
pca.fit(arr)
print (pca.components_ )
[[-0.0463861 -0.0463861 -0.0463861 -0.35279184 -0.65919758 -0.65919758]
[ 0.55515147 0.55515147 0.55515147 0.21897879 -0.11719389 -0.11719389]
[ 0.62531284 -0.13184966 -0.136648 -0.71363037 0.17840759 0.17840759]]
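#filter by condition
#note: components_ has one row per component, not per observation;
#to filter observations, index the output of pca.transform(arr) instead
arr1 = pca.components_[mask.values]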
print (arr1)
[[-0.0463861 -0.0463861 -0.0463861 -0.35279184 -0.65919758 -0.65919758]
[ 0.55515147 0.55515147 0.55515147 0.21897879 -0.11719389 -0.11719389]]
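Since the question is about keeping observations rather than components, it is probably the transformed data that should be filtered. A hedged sketch, assuming the same dummy frame and using `fit_transform` to get one row per observation (the names `scores` and `red_scores` are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({"colour": ["red", "red", "blue"],
                   "a": [1, 2, 1], "b": [2, 3, 2], "c": [3, 4, 3],
                   "d": [4, 5, 5], "e": [5, 6, 7], "f": [6, 7, 8]})

# Keep only the numeric columns for PCA.
arr = df.drop(columns="colour").to_numpy()

# n_components may not exceed min(n_samples, n_features), which is 3 here.
pca = PCA(n_components=2)
scores = pca.fit_transform(arr)  # shape: (n_samples, n_components)

# Build the mask from the original frame and apply it to the PCA output.
mask = (df["colour"] == "red").to_numpy()
red_scores = scores[mask]
print(red_scores.shape)  # (2, 2)
```

This works because `fit_transform` preserves the row order of `arr`, so row i of `scores` still corresponds to row i of `df`.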