Shape error when concatenating columns after principal component analysis on CSV data

Question

I am applying PCA to my CSV data. After normalization, PCA seems to work. I want to plot the projection using 4 components, but I get this error:

   type         x         y  ...             fx             fy   fz
0     0 -0.639547 -1.013450  ... -8.600000e-231 -1.390000e-230  0.0
0     1 -0.497006 -2.311890  ...   0.000000e+00   0.000000e+00  0.0
1     0  0.154376 -0.873189  ...  1.150000e-228 -1.480000e-226  0.0
1     1 -0.342055 -2.179370  ...   0.000000e+00   0.000000e+00  0.0
2     0  0.312719 -0.872756  ... -2.370000e-221  2.420000e-221  0.0

[5 rows x 10 columns]

(1047064, 10)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-28-0b631a51ce61> in <module>()
     33 
     34 
---> 35 finalDf = pd.concat([principalDf, df[['type']]], axis = 1)

4 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/internals/managers.py in _verify_integrity(self)
    327         for block in self.blocks:
    328             if block.shape[1:] != mgr_shape[1:]:
--> 329                 raise construction_error(tot_items, block.shape[1:], self.axes)
    330         if len(self.items) != tot_items:
    331             raise AssertionError(

ValueError: Shape of passed values is (2617660, 5), indices imply (1570596, 5)

Here is my code:

import sys
import pandas as pd
import pylab as pl
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


df1=pd.read_csv('./data/1.csv')
df2=pd.read_csv('./data/2.csv')
df = pd.concat([df1, df2], axis=0).sort_index()
print(df.head())
print(df.shape)

features = ['x', 'y', 'z', 'vx', 'vy', 'vz', 'fx', 'fy', 'fz']
# Separating out the features
x = df.loc[:, features].values
# Separating out the target
y = df.loc[:,['type']].values
# Standardizing the features
x = StandardScaler().fit_transform(x)

pca = PCA(n_components=4)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data = principalComponents
             , columns = ['pcc1','pcc2','pcc3', 'pcc4'])


finalDf = pd.concat([principalDf, df[['type']]], axis = 1)

I think I am making a mistake when concatenating my components with df['type'].

Is there a way to get rid of this error?

Thanks.

Answer 1

The index of df is not the same as the index of principalDf. Using a short version of your data, we have

df.index
Int64Index([0, 0, 1, 1, 2, 2, 3, 3, 4, 4], dtype='int64')

and

principalDf.index
RangeIndex(start=0, stop=10, step=1)

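With axis=1, pd.concat aligns rows by index label. Every label in df appears twice (once from each file), while principalDf has a fresh 0..n-1 index, so the alignment fails and concat gets confused. You can fix this by resetting the index early on: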

...
df = pd.concat([df1, df2], axis=0).sort_index().reset_index() # note reset_index() added
...
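To make the mismatch and the fix concrete, here is a minimal sketch on made-up data. The two small frames are hypothetical stand-ins for 1.csv and 2.csv, and a random array stands in for the output of pca.fit_transform(x), so only the index handling is illustrated. It passes drop=True to reset_index(), which discards the old labels instead of keeping them as an extra 'index' column; the plain reset_index() above also works because the features are selected by name.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for ./data/1.csv and ./data/2.csv
df1 = pd.DataFrame({"type": [0, 0, 0], "x": [0.1, 0.2, 0.3]})
df2 = pd.DataFrame({"type": [1, 1, 1], "x": [1.1, 1.2, 1.3]})

# Stacking two frames keeps each frame's own 0..n-1 labels,
# so every label ends up in the index twice.
stacked = pd.concat([df1, df2], axis=0).sort_index()
print(stacked.index.tolist())
print(stacked.index.is_unique)

# Placeholder for pca.fit_transform(x): one row per row of `stacked`.
components = np.random.default_rng(0).normal(size=(len(stacked), 2))
principalDf = pd.DataFrame(components, columns=["pcc1", "pcc2"])

# Option 1: give df a unique 0..n-1 index so both frames align row by row.
df = stacked.reset_index(drop=True)
finalDf = pd.concat([principalDf, df[["type"]]], axis=1)

# Option 2: skip index alignment entirely and attach the column by position.
finalDf_alt = principalDf.assign(type=stacked["type"].to_numpy())

print(finalDf.shape)
print(finalDf_alt.shape)
```

Both options rely on the rows of principalDf being in the same order as the rows of df, which holds here because the components are computed from df without reordering.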
