When I use pd.crosstab it keeps showing AssertionError

When I use pd.crosstab to build a confusion matrix, it keeps showing:

AssertionError: arrays and names must have the same length

import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix
import random

df = pd.read_csv('C:\Users\liukevin\Desktop\winequality-red.csv',sep=';', usecols=['fixed acidity','volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide','density','pH','sulphates','alcohol','quality'])

Q=[]

for i in range(len(df)):
    if df['quality'][i]<=5:
        Q.append('Low')
    else:
        Q.append('High')

del df['quality']
test_number=sorted(random.sample(xrange(len(df)), int(len(df)*0.25)))
train_number=[]
temp=[]
for i in range(len(df)):
    temp.append(i)
train_number=list(set(temp)-set(test_number))

distance_all=[]
for i in range(len(test_number)):
    distance_sep=[]
    for j in range(len(train_number)):
        distance=pow(df['fixed acidity'][test_number[i]]-df['fixed acidity'][train_number[j]],2)+\
        pow(df['volatile acidity'][test_number[i]]-df['volatile acidity'][train_number[j]],2)+\
        pow(df['citric acid'][test_number[i]]-df['citric acid'][train_number[j]],2)+\
        pow(df['residual sugar'][test_number[i]]-df['residual sugar'][train_number[j]],2)+\
        pow(df['chlorides'][test_number[i]]-df['chlorides'][train_number[j]],2)+\
        pow(df['free sulfur dioxide'][test_number[i]]-df['free sulfur dioxide'][train_number[j]],2)+\
        pow(df['total sulfur dioxide'][test_number[i]]-df['total sulfur dioxide'][train_number[j]],2)+\
        pow(df['density'][test_number[i]]-df['density'][train_number[j]],2)+\
        pow(df['pH'][test_number[i]]-df['pH'][train_number[j]],2)+\
        pow(df['sulphates'][test_number[i]]-df['sulphates'][train_number[j]],2)+\
        pow(df['alcohol'][test_number[i]]-df['alcohol'][train_number[j]],2)
        distance_sep.append(distance)
    distance_all.append(distance_sep)

for round in range(5):
    K=2*round+1

    select_neighbor_all=[]
    for i in range(len(test_number)):
        select_neighbor_sep=np.argsort(distance_all[i])[:K]
        select_neighbor_all.append(select_neighbor_sep)

    prediction=[]
    Q_test=[]
    for i in range(len(test_number)):
        Q_test.append(Q[test_number[i]])
        #original data
        Low_count=0
        for j in range(K):
            if Q[train_number[select_neighbor_all[i][j]]]=='Low':
                Low_count+=1
        if Low_count>(K/2):
            prediction.append('Low')
        else:
            prediction.append('High')

    print pd.crosstab(Q_test, prediction, rownames=['Actual'], colnames=['Predicted'], margins=True)

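But aren't Q_test and prediction the same length? I suspect the problem is the "names must have the same length" part, since I am not sure what it means. (The Q_test and prediction arrays contain only the binary elements 'Low' and 'High'.) select_neighbor_all holds the K nearest neighbors I selected for the i-th test sample.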
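For readers who want to reproduce the setup without the CSV file, the nearest-neighbor selection and majority vote described above can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the feature matrix `X`, the labels, and the random seed are made up and are not the wine data, and the vectorized distance replaces the nested loops from the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data standing in for the wine features and the Low/High labels
X = rng.normal(size=(40, 3))
labels = np.where(X[:, 0] > 0, 'High', 'Low')

# 25% of the rows as test set, the rest as training set
test_idx = np.sort(rng.choice(len(X), size=len(X) // 4, replace=False))
train_idx = np.setdiff1d(np.arange(len(X)), test_idx)

# Squared Euclidean distances, shape (n_test, n_train)
diff = X[test_idx][:, None, :] - X[train_idx][None, :, :]
dist = (diff ** 2).sum(axis=2)

K = 3
# Positions (within train_idx) of the K closest training rows per test row
nearest = np.argsort(dist, axis=1)[:, :K]

# Majority vote: predict 'Low' when more than half of the K neighbors are 'Low'
low_votes = (labels[train_idx][nearest] == 'Low').sum(axis=1)
prediction = np.where(low_votes > K / 2, 'Low', 'High')
Q_test = labels[test_idx]
```

Here `Q_test` and `prediction` are NumPy arrays of equal length rather than Python lists.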

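It looks like you may not be giving pd.crosstab all the data it needs to perform the necessary computation:

Take a look at this example. Here we provide an index and two column categories, along with row and column names: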

>>> index = np.array(["foo", "foo", "foo", "foo", "bar", "bar",
...                   "bar", "bar", "foo", "foo", "foo"], dtype=object)
>>> col_category_1 = np.array(["one", "one", "one", "two", "one", "one",
...                            "one", "two", "two", "two", "one"], dtype=object)
>>> col_category_2 = np.array(["dull", "dull", "shiny", "dull", "dull", "shiny",
...                            "shiny", "dull", "shiny", "shiny", "shiny"],
...                            dtype=object)


# Notice the index AND the columns (provided as a list)
>>> pd.crosstab(index, [col_category_1, col_category_2],
...             rownames=['a'], colnames=['b', 'c'])
b   one        two
c   dull shiny dull shiny
a
bar    1     2    1     0
foo    2     2    1     2

For more details, see the pandas documentation for pd.crosstab:

index : array-like, Series, or list of arrays/Series. Values to group by in the rows.

columns : array-like, Series, or list of arrays/Series. Values to group by in the columns.

rownames : sequence, default None. If passed, must match the number of row arrays passed.

colnames : sequence, default None. If passed, must match the number of column arrays passed.
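To make the rownames/colnames rule concrete, here is a minimal sketch with made-up data (the arrays `actual` and `predicted` and the extra name are hypothetical). One row array with one name passes the length check; one row array with two names does not. The exception type is assumed to be AssertionError, as in the question's error message, and ValueError is caught as well in case a pandas version differs.

```python
import numpy as np
import pandas as pd

actual = np.array(['Low', 'High', 'Low', 'High'])     # hypothetical labels
predicted = np.array(['Low', 'Low', 'High', 'High'])  # hypothetical predictions

# One row array and one row name: the lengths match
table = pd.crosstab(actual, predicted,
                    rownames=['Actual'], colnames=['Predicted'])

# One row array but two row names: the lengths do not match
try:
    pd.crosstab(actual, predicted,
                rownames=['Actual', 'Extra'], colnames=['Predicted'])
    mismatch_error = None
except (AssertionError, ValueError) as exc:
    mismatch_error = exc
```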

If you edit the following line so that it receives the correct inputs, that should solve your problem...

# You will need to provide an index and columns...
# Here, 'Q_test' is being interpreted as your index
# 'prediction' is being used as a column... 
pd.crosstab(Q_test, prediction, 
            rownames=['Actual'], 
            colnames=['Predicted'],
            margins=True)

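I just spent some time working through this problem. In my case, pandas crosstab does not seem to work with plain lists.

If you convert the lists to numpy arrays, it should work fine.

So in your case it would be: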

pd.crosstab(np.array(Q_test), np.array(prediction), rownames=['Actual'],
            colnames=['Predicted'], margins=True)
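Another option, sketched here under the assumption that wrapping the lists in pandas Series avoids the problem in the same way as numpy arrays, is to name the Series and let crosstab pick up the labels from them. The sample values below are made up for illustration.

```python
import pandas as pd

Q_test = ['Low', 'High', 'Low', 'High', 'Low']      # hypothetical actual labels
prediction = ['Low', 'Low', 'Low', 'High', 'High']  # hypothetical predictions

# Named Series: no rownames/colnames arguments are passed
actual = pd.Series(Q_test, name='Actual')
predicted = pd.Series(prediction, name='Predicted')

confusion = pd.crosstab(actual, predicted, margins=True)
```

With this form there is no separate list of names to keep in sync with the number of arrays.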

An example:

>>> import pandas as pd
>>> import numpy as np
>>> classifications = ['foo', 'bar', 'foo', 'bar']
>>> predictions = ['foo', 'foo', 'bar', 'bar']
>>> pd.crosstab(classifications, predictions, rownames=['Actual'], colnames=['Predicted'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bastian/miniconda3/envs/machine_learning/lib/python3.6/site-packages/pandas/core/reshape/pivot.py", line 563, in crosstab
    rownames = _get_names(index, rownames, prefix="row")
  File "/home/bastian/miniconda3/envs/machine_learning/lib/python3.6/site-packages/pandas/core/reshape/pivot.py", line 703, in _get_names
    raise AssertionError("arrays and names must have the same length")
AssertionError: arrays and names must have the same length
>>> pd.crosstab(np.array(classifications), np.array(predictions), rownames=['Actual'], colnames=['Predicted'])
Predicted  bar  foo
Actual             
bar          1    1
foo          1    1

I think this happens because some operations (multiplication, for example) behave differently on lists than on numpy arrays.
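The list-versus-array difference mentioned here is easy to see in a small sketch (the values are made up). A more direct cause, judging from the `index` description quoted earlier ("list of arrays/Series"), may be that the pandas version in the traceback treated a plain list as a list of separate row arrays, so four elements were compared against one name. That is an inference from the error message, not something confirmed in this thread.

```python
import numpy as np

values = [1, 2, 3]

# Multiplying a list repeats it
list_result = values * 2              # [1, 2, 3, 1, 2, 3]

# Multiplying an array scales each element
array_result = np.array(values) * 2   # array([2, 4, 6])

# Counts the length check would compare if each list element
# were taken as its own array (assumption, see the note above)
classifications = ['foo', 'bar', 'foo', 'bar']
n_arrays_if_list_of_arrays = len(classifications)  # 4
n_names = len(['Actual'])                          # 1
```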