KeyError when trying to randomize a column of a dataframe
Minimal example:
Consider this dataframe temp:
>>> import numpy as np
>>> import pandas as pd
>>> temp = pd.DataFrame({"A":[1,2,3,4,5,6,7,8,9,10],"B":[2,3,4,5,6,7,8,9,10,11],"C":[3,4,5,6,7,8,9,10,11,12]})
>>> temp
A B C
0 1 2 3
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
5 6 7 8
6 7 8 9
7 8 9 10
8 9 10 11
9 10 11 12
Now, try to shuffle each column, one at a time, in a for loop.
>>> for i in temp.columns:
...     np.random.shuffle(temp.loc[:,i])
...     print(temp)
...
A B C
0 8 2 3
1 3 3 4
2 9 4 5
3 6 5 6
4 4 6 7
5 10 7 8
6 7 8 9
7 1 9 10
8 2 10 11
9 5 11 12
A B C
0 8 7 3
1 3 9 4
2 9 8 5
3 6 10 6
4 4 4 7
5 10 11 8
6 7 5 9
7 1 3 10
8 2 2 11
9 5 6 12
A B C
0 8 7 6
1 3 9 8
2 9 8 4
3 6 10 10
4 4 4 7
5 10 11 11
6 7 5 5
7 1 3 3
8 2 2 12
9 5 6 9
This works fine.
Specific example:
Now, if I want to take a part of this dataframe for training and testing purposes, I use the train_test_split function from sklearn.model_selection.
>>> from sklearn.model_selection import train_test_split
>>> temp = pd.DataFrame({"A":[1,2,3,4,5,6,7,8,9,10],"B":[2,3,4,5,6,7,8,9,10,11],"C":[3,4,5,6,7,8,9,10,11,12]})
>>> y = [i for i in range(16,26)]
>>> len(y)
10
>>> X_train,X_test,y_train,y_test = train_test_split(temp,y,test_size=0.2)
>>> X_train
A B C
2 3 4 5
6 7 8 9
8 9 10 11
0 1 2 3
7 8 9 10
3 4 5 6
1 2 3 4
9 10 11 12
Now that we have the X_train dataframe, let's shuffle each of its columns:
>>> for i in X_train.columns:
...     np.random.shuffle(X_train.loc[:,i])
...     print(X_train)
...
Unfortunately, this results in an error.
Error:
sys:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "mtrand.pyx", line 4852, in mtrand.RandomState.shuffle
File "mtrand.pyx", line 4855, in mtrand.RandomState.shuffle
File "C:\Users\H.P\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\series.py", line 623, in __getitem__
result = self.index.get_value(self, key)
File "C:\Users\H.P\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\indexes\base.py", line 2560, in get_value
tz=getattr(series.dtype, 'tz', None))
File "pandas\_libs\index.pyx", line 83, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 91, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 811, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 817, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 4
Tracking down the problem and its solution:
Searching for SettingWithCopyWarning, I found this question, where the first answer contains the following lines:
However it could create a copy which updates a copy of data['amount']
which you would not see. Then you would be wondering why it is not
updating.
But if that is the case, why does the code work in the first case?
The answer also says:
Pandas returns a copy of an object in almost all method calls. The
inplace operations are a convience operation which work, but in
general are not clear that data is being modified and could
potentially work on copies.
So we can use np.random.permutation instead of np.random.shuffle, as shown in this answer. So:
>>> for i in X_train.columns:
...     X_train.loc[:,i] = np.random.permutation(X_train.loc[:,i])
...     print(X_train)
...
However, I get the SettingWithCopyWarning again, along with the result.
C:\Users\H.P\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\indexing.py:621: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self.obj[item_labels[indexer[info_axis]]] = value
A B C
2 10 4 5
6 9 8 9
8 2 10 11
0 8 2 3
7 1 9 10
3 3 5 6
1 4 3 4
9 7 11 12
A B C
2 10 5 5
6 9 11 9
8 2 4 11
0 8 9 3
7 1 3 10
3 3 8 6
1 4 10 4
9 7 2 12
A B C
2 10 5 10
6 9 11 5
8 2 4 11
0 8 9 3
7 1 3 4
3 3 8 6
1 4 10 12
9 7 2 9
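This could be a workaround.
Questions:
- Why does the code work for the first case, and not the second case, when I use train_test_split?
- Why do I still get the SettingWithCopyWarning when I'm not using the inplace shuffler np.random.shuffle?
Requests for Suggestions:
- Is there a better (easy to use/error free/faster) method to do column shuffling?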
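1. Why does the code work for the first case, and not the second case, when I use train_test_split?
Because train_test_split shuffles the rows of X_train, so the index of each column is no longer a range but an arbitrary set of values. You can see this by checking the indexes of temp and X_train: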
X_train.index
Int64Index([6, 8, 9, 5, 0, 2, 3, 4], dtype='int64')
temp.index
RangeIndex(start=0, stop=10, step=1)
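To make the index difference concrete, here is a minimal sketch showing that s[k] on a Series with a non-contiguous integer index looks up the label k, not the k-th position. The Series s and the helper lookup_raises_keyerror are illustrative names, not part of the question:

```python
import pandas as pd


def lookup_raises_keyerror(series, key):
    """Return True if series[key] raises KeyError, False otherwise."""
    try:
        series[key]
    except KeyError:
        return True
    return False


# The index mimics a frame whose rows were shuffled and partly removed.
s = pd.Series([30, 70, 90, 10], index=[2, 6, 8, 0])

missing_label = lookup_raises_keyerror(s, 1)   # label 1 is not in the index
present_label = lookup_raises_keyerror(s, 2)   # label 2 is in the index
second_by_position = s.iloc[1]                 # .iloc is explicitly positional
```

A function that walks over positions 0..n-1 with plain square brackets will therefore fail as soon as it asks for a label that is absent, which matches the KeyError in the traceback.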
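In the first case, unlike the second, the columns can safely be treated as arrays. np.random.shuffle swaps elements by integer position (x[i], x[j]), but on a Series with an integer index such a lookup is label-based, and a label such as 4 is missing from X_train because that row went to the test set, hence KeyError: 4. If you change the code in the second case to
for i in X_train.columns:
    np.random.shuffle(X_train.loc[:,i].values)
    print(X_train)
this does not cause an error.
Note that in the case you provided, each column is shuffled independently, i.e. the values belonging to one data point (row) get mixed up.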
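Shuffling .values in place relies on that array being a writable view of the frame's data, which, as far as I know, is not guaranteed across pandas versions. A more defensive sketch works on an explicit copy and assigns a permuted array back to each column. The function name shuffle_each_column and the seed parameter are illustrative, and np.random.default_rng assumes NumPy 1.17 or later:

```python
import numpy as np
import pandas as pd


def shuffle_each_column(df, seed=None):
    """Return a copy of df in which every column is permuted independently."""
    rng = np.random.default_rng(seed)
    out = df.copy()  # never touch the caller's frame
    for col in out.columns:
        # to_numpy() drops the index, so the permutation is purely positional.
        out[col] = rng.permutation(out[col].to_numpy())
    return out


temp = pd.DataFrame({"A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                     "B": [2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
                     "C": [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]})
# Same row labels as the X_train shown in the question.
X_part = temp.iloc[[2, 6, 8, 0, 7, 3, 1, 9]]
shuffled = shuffle_each_column(X_part, seed=0)
```

The row labels of the result are unchanged; only the values inside each column are rearranged, and the input frame is left as it was.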
2. Why do I still get the SettingWithCopyWarning when I'm not using the inplace shuffler np.random.shuffle?
I do not get the warning when using the latest version of pandas (0.22.0).
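If the warning does show up on another version, a common way to avoid it is to take an explicit copy of the subset before assigning to it. This is a minimal sketch under the assumption that the warning arises because X_train is derived from temp by row selection; the iloc call only stands in for train_test_split so the example does not need scikit-learn:

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({"A": range(1, 11), "B": range(2, 12), "C": range(3, 13)})

# Stand-in for train_test_split: select rows, then copy explicitly so that
# X_train owns its data and is no longer tied to temp.
X_train = temp.iloc[[2, 6, 8, 0, 7, 3, 1, 9]].copy()

for col in X_train.columns:
    X_train[col] = np.random.permutation(X_train[col].to_numpy())
```

After the copy, assignments to X_train cannot affect temp, so there is nothing ambiguous for pandas to warn about.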
Requests for Suggestions:
- Is there a better (easy to use/error free/faster) method to do column shuffling?
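I suggest using sample with axis=1, which shuffles the order of the columns (each column moves as a whole, so the rows stay intact). The number of samples should equal the number of columns, i.e. X_train.shape[1]: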
X_train = X_train.sample(X_train.shape[1],axis=1)
In []: X_train.sample(X_train.shape[1],axis=1)
Out[]:
B A C
6 8 7 9
9 11 10 12
8 10 9 11
4 6 5 7
5 7 6 8
0 2 1 3
2 4 3 5
3 5 4 6
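As a small check of what sample does along axis=1, the sketch below reorders the columns of a three-row frame and then puts them back in the original order. The variable names reordered and restored are illustrative, and random_state is fixed only to make the run repeatable:

```python
import pandas as pd

temp = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4], "C": [3, 4, 5]})

# Draw all columns without replacement, i.e. a random column order.
reordered = temp.sample(n=temp.shape[1], axis=1, random_state=0)

# Selecting the columns in their original order recovers the frame,
# so no value has moved to a different row.
restored = reordered[temp.columns]
```

So this approach changes the column order only; it does not permute values within a column the way the loops above do.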
I also ran into this problem with train_test_split. I used this instead:
np.random.shuffle(x.iloc[:, i].values)
Not sure why it works, but it seems to fix the problem.