When I use pd.crosstab it keeps showing AssertionError
When I use pd.crosstab to build a confusion matrix, it keeps showing
AssertionError: arrays and names must have the same length
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix
import random

df = pd.read_csv('C:\Users\liukevin\Desktop\winequality-red.csv',sep=';', usecols=['fixed acidity','volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide','density','pH','sulphates','alcohol','quality'])
Q=[]
for i in range(len(df)):
    if df['quality'][i]<=5:
        Q.append('Low')
    else:
        Q.append('High')
del df['quality']

test_number=sorted(random.sample(xrange(len(df)), int(len(df)*0.25)))
train_number=[]
temp=[]
for i in range(len(df)):
    temp.append(i)
train_number=list(set(temp)-set(test_number))

distance_all=[]
for i in range(len(test_number)):
    distance_sep=[]
    for j in range(len(train_number)):
        distance=pow(df['fixed acidity'][test_number[i]]-df['fixed acidity'][train_number[j]],2)+\
                 pow(df['volatile acidity'][test_number[i]]-df['volatile acidity'][train_number[j]],2)+\
                 pow(df['citric acid'][test_number[i]]-df['citric acid'][train_number[j]],2)+\
                 pow(df['residual sugar'][test_number[i]]-df['residual sugar'][train_number[j]],2)+\
                 pow(df['chlorides'][test_number[i]]-df['chlorides'][train_number[j]],2)+\
                 pow(df['free sulfur dioxide'][test_number[i]]-df['free sulfur dioxide'][train_number[j]],2)+\
                 pow(df['total sulfur dioxide'][test_number[i]]-df['total sulfur dioxide'][train_number[j]],2)+\
                 pow(df['density'][test_number[i]]-df['density'][train_number[j]],2)+\
                 pow(df['pH'][test_number[i]]-df['pH'][train_number[j]],2)+\
                 pow(df['sulphates'][test_number[i]]-df['sulphates'][train_number[j]],2)+\
                 pow(df['alcohol'][test_number[i]]-df['alcohol'][train_number[j]],2)
        distance_sep.append(distance)
    distance_all.append(distance_sep)

for round in range(5):
    K=2*round+1
    select_neighbor_all=[]
    for i in range(len(test_number)):
        select_neighbor_sep=np.argsort(distance_all[i])[:K]
        select_neighbor_all.append(select_neighbor_sep)
    prediction=[]
    Q_test=[]
    for i in range(len(test_number)):
        Q_test.append(Q[test_number[i]])
        #original data
        Low_count=0
        for j in range(K):
            if Q[train_number[select_neighbor_all[i][j]]]=='Low':
                Low_count+=1
        if Low_count>(K/2):
            prediction.append('Low')
        else:
            prediction.append('High')
    print pd.crosstab(Q_test, prediction, rownames=['Actual'], colnames=['Predicted'], margins=True)
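But aren't Q_test and prediction the same length?
I guess the problem might be the "names" part of "arrays and names must have the same length", since I am not quite sure what it means.
(The Q_test and prediction arrays contain only the two values 'Low' and 'High'.)
select_neighbor_all holds the K nearest neighbors I selected for the i-th test sample.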
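It looks like you may not be providing all the data pd.crosstab needs to perform its calculation.
Take a look at this example. Here we provide one index array and two column category arrays, along with the row names and column names: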
>>> index = np.array(["foo", "foo", "foo", "foo", "bar", "bar",
...                   "bar", "bar", "foo", "foo", "foo"], dtype=object)
>>> col_category_1 = np.array(["one", "one", "one", "two", "one", "one",
...                            "one", "two", "two", "two", "one"], dtype=object)
>>> col_category_2 = np.array(["dull", "dull", "shiny", "dull", "dull", "shiny",
...                            "shiny", "dull", "shiny", "shiny", "shiny"],
...                           dtype=object)
>>> # Notice the index AND the columns provided as a list
>>> pd.crosstab(index, [col_category_1, col_category_2],
...             rownames=['a'], colnames=['b', 'c'])
b   one        two
c   dull shiny dull shiny
a
bar    1     2    1     0
foo    2     2    1     2
For more details, see the pandas documentation for pd.crosstab:
index : array-like, Series, or list of arrays/Series
    Values to group by in the rows
columns : array-like, Series, or list of arrays/Series
    Values to group by in the columns
rownames : sequence, default None
    If passed, must match number of row arrays passed
colnames : sequence, default None
    If passed, must match number of column arrays passed
If you edit the following line so that it receives the correct inputs, that should resolve your problem...
# You will need to provide an index and columns...
# Here, 'Q_test' is being interpreted as your index
# 'prediction' is being used as a column...
pd.crosstab(Q_test, prediction,
            rownames=['Actual'],
            colnames=['Predicted'],
            margins=True)
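To make the rownames/colnames rule from the documentation concrete, here is a minimal sketch. The arrays rows, cols_1 and cols_2 are made up for illustration and are not from the question. It assumes a pandas version in which a names/arrays count mismatch raises the AssertionError seen in the question; ValueError is also caught in case another version reports it differently.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row array and two column arrays.
rows = np.array(["foo", "foo", "bar", "bar"], dtype=object)
cols_1 = np.array(["one", "two", "one", "two"], dtype=object)
cols_2 = np.array(["dull", "shiny", "shiny", "dull"], dtype=object)

# One row array needs one row name; two column arrays need two column names.
ok = pd.crosstab(rows, [cols_1, cols_2], rownames=["a"], colnames=["b", "c"])

# Two column arrays but only one column name: the counts do not match.
try:
    pd.crosstab(rows, [cols_1, cols_2], rownames=["a"], colnames=["b"])
    mismatch_error = None
except (AssertionError, ValueError) as exc:
    mismatch_error = str(exc)

print(ok)
print(mismatch_error)
```

So "names" in the error message refers to rownames and colnames, and "arrays" to the number of grouping arrays pandas believes it received, not to the number of elements in each array.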
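I just spent some time on this problem. In my case, pandas crosstab did not seem to work with plain lists.
If you convert the lists to numpy arrays, it should work fine.
So in your case it would be:
pd.crosstab(np.array(Q_test), np.array(prediction), rownames=['Actual'],
            colnames=['Predicted'], margins=True)
An example: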
>>> import pandas as pd
>>> import numpy as np
>>> classifications = ['foo', 'bar', 'foo', 'bar']
>>> predictions = ['foo', 'foo', 'bar', 'bar']
>>> pd.crosstab(classifications, predictions, rownames=['Actual'], colnames=['Predicted'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/bastian/miniconda3/envs/machine_learning/lib/python3.6/site-packages/pandas/core/reshape/pivot.py", line 563, in crosstab
rownames = _get_names(index, rownames, prefix="row")
File "/home/bastian/miniconda3/envs/machine_learning/lib/python3.6/site-packages/pandas/core/reshape/pivot.py", line 703, in _get_names
raise AssertionError("arrays and names must have the same length")
AssertionError: arrays and names must have the same length
>>> pd.crosstab(np.array(classifications), np.array(predictions), rownames=['Actual'], colnames=['Predicted'])
Predicted  bar  foo
Actual
bar          1    1
foo          1    1
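I think this happens because some operations (for example, multiplication) behave differently on lists than on numpy arrays. Judging from the traceback, a more likely cause is that this pandas version treats a plain Python list as a list of separate arrays, one per element, so the four-element list is checked against the single row name and the lengths do not match.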
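Another way to avoid any ambiguity about how a plain list is interpreted is to pass named pd.Series objects, in which case crosstab takes the row and column labels from the Series names. This is only a sketch: the actual and predicted lists below are made-up stand-ins for Q_test and prediction, not data from the question.

```python
import pandas as pd

# Hypothetical stand-ins for Q_test and prediction from the question.
actual = ["Low", "High", "Low", "High", "Low"]
predicted = ["Low", "Low", "Low", "High", "High"]

# Named Series are unambiguous one-dimensional inputs; their names
# become the row and column labels, so rownames/colnames are not needed.
table = pd.crosstab(
    pd.Series(actual, name="Actual"),
    pd.Series(predicted, name="Predicted"),
    margins=True,
)
print(table)
```

With margins=True the table gains an 'All' row and column holding the totals, which is handy for checking that every test sample was counted.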