How to handle Series and arrays with pandas and numpy together?

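I am new to Python, and I am very confused by all these data types (Series, array, list, etc.). This may be a very open-ended question. I'd like to get a sense of the general practice when coding in Python for data analysis.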

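A lot of reading material suggests that numpy and pandas are the two modules I need for data analysis. However, I find it difficult and odd that they operate on and generate two different data types, namely Series and arrays. Is it normal/natural that one needs to convert one of the data types to the other before any kind of data manipulation? I'd like to know what you would do. Many thanks.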

For example:

 import pandas as pd
 import numpy as np

 # create some data
 df = pd.DataFrame(np.random.randn(10, 3), columns=['a', 'b', 'c'])
 x = np.random.randn(10, 1)

 # data manipulation
 A = df['a']

 # Question 1:
 # If I want to perform a element by element addition between x and A
 # How should I do?  Simple x + A doesn't work but it seems strange to 
 # me that if I have to convert the data type everytime 

 # Question 2:
 # I'd like to combine to two columns together
 # concatenate or hstack both don't work

Your arrays/Series also need to have the same shape:

In [98]: A.shape
Out[98]: (10,)

In [99]: x.shape
Out[99]: (10, 1)

You can use reshape(-1) to flatten the column vector into a 1-D array:

In [100]: x.reshape(-1).shape
Out[100]: (10,)

Then you can add it to the pd.Series A:

In [61]: A + x.reshape(-1)
Out[61]:
0   -1.186957
1   -0.165563
2    0.882490
3    4.544357
4    2.698414
5    0.396110
6   -0.199209
7    3.282942
8    2.448213
9   -0.543727
Name: a, dtype: float64
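
To see why the shapes matter, here is a minimal numpy-only sketch. The names `a` and `col` are illustrative stand-ins, not from the question. Under numpy's broadcasting rules, adding a (10,) array to a (10, 1) array gives a (10, 10) result rather than an element-wise sum, which is why x should be flattened first.

```python
import numpy as np

a = np.arange(10.0)                    # shape (10,), stands in for A.values
col = np.arange(10.0).reshape(10, 1)   # shape (10, 1), stands in for x

broadcast = a + col                    # every element paired with every other
elementwise = a + col.reshape(-1)      # flatten first: true element-wise sum

print(broadcast.shape)    # (10, 10)
print(elementwise.shape)  # (10,)
```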

For your second question, you need to reshape your Series A into a column vector. You can use reshape for that:

In [97]: np.hstack([A.values.reshape(A.size,1), x])
Out[97]:
array([[ 0.3158111 , -1.50276813],
       [-1.09532212,  0.92975954],
       [-0.77048623,  1.65297592],
       [ 2.14690242,  2.39745455],
       [ 1.63367806,  1.06473634],
       [ 0.09134512,  0.3047644 ],
       [ 0.02019805, -0.21940726],
       [ 0.87008192,  2.41286007],
       [ 1.25315724,  1.19505578],
       [-0.60156045,  0.05783343]])
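
As a possible alternative, `np.column_stack` treats 1-D inputs as columns, so the Series values need no manual reshape. This is only a sketch that recreates data like the question's. `Series.to_numpy()` assumes a reasonably recent pandas (older code uses `.values`).

```python
import numpy as np
import pandas as pd

A = pd.Series(np.random.randn(10), name='a')
x = np.random.randn(10, 1)

# column_stack turns the 1-D array into a column and keeps x as it is
stacked = np.column_stack([A.to_numpy(), x])

print(stacked.shape)  # (10, 2)
```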

If you want a pd.DataFrame instead, you can use pd.concat:

In [108]: pd.concat([A, pd.Series(x.reshape(-1))], axis=1)
Out[108]:
          a         0
0  0.315811 -1.502768
1 -1.095322  0.929760
2 -0.770486  1.652976
3  2.146902  2.397455
4  1.633678  1.064736
5  0.091345  0.304764
6  0.020198 -0.219407
7  0.870082  2.412860
8  1.253157  1.195056
9 -0.601560  0.057833

EDIT

From the docs, regarding reshape(-1):

newshape : int or tuple of ints
The new shape should be compatible with the original shape. If an integer, then the result will be a 1-D array of that length. One shape dimension can be -1. In this case, the value is inferred from the length of the array and remaining dimensions.
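
A small sketch of what that passage means in practice, using a made-up 6-element array rather than the data from the question:

```python
import numpy as np

x = np.arange(6).reshape(6, 1)   # column vector, shape (6, 1)

flat = x.reshape(-1)             # -1 alone: flatten to 1-D, shape (6,)
column = flat.reshape(-1, 1)     # row count inferred from the size, shape (6, 1)
grid = flat.reshape(2, -1)       # column count inferred: 6 / 2 = 3, shape (2, 3)

print(flat.shape, column.shape, grid.shape)  # (6,) (6, 1) (2, 3)
```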

Is it normal/natural that one needs to convert either one of the data type to another one before any kind of data manipulation?

Sometimes yes, sometimes no. When in doubt, do it.

That said, remember the Zen of Python:

  • Explicit is better than implicit.
  • In the face of ambiguity, refuse the temptation to guess.

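Even though some APIs will do their best to convert types for you (numpy and pandas are quite good at this), explicit type conversion can make your code more readable and easier to debug.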
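
One way to make the conversions explicit might look like the sketch below. The variable names `a_values` and `x_series` are hypothetical, and `Series.to_numpy()` assumes a reasonably recent pandas (older code uses `.values`).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 3), columns=['a', 'b', 'c'])
x = np.random.randn(10, 1)
A = df['a']

# Series -> ndarray: drop the index and work with the raw values
a_values = A.to_numpy()                                   # shape (10,)

# ndarray -> Series: flatten and reuse A's index so the labels line up
x_series = pd.Series(x.ravel(), index=A.index, name='x')

total = A + x_series                                      # Series, aligned by label
```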

Question 1: If I want to perform a element by element addition between x and A How should I do? Simple x + A doesn't work but it seems strange to me that if I have to convert the data type everytime

In this case you don't have to convert the data type, but you do need compatible shapes.

>>> print(A.shape)
(10,)
>>> print(x.shape)
(10, 1)
>>> print(A + x.reshape(10))
0   -0.207131
1   -2.117012
2    0.925545
3   -2.187705
4    1.226458
5    2.144904
6   -0.956781
7    1.956246
8    0.060132
9    1.332417
Name: a, dtype: float64
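
One related point worth keeping in mind: adding a plain array to a Series matches values by position, while adding two Series matches them by index label. A toy sketch with made-up values (`s` and `t` are not from the question):

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=[0, 1, 2])
t = pd.Series([10, 20, 30], index=[2, 1, 0])   # same labels, reversed order

by_label = s + t                # Series + Series: matched on index labels
by_position = s + t.to_numpy()  # Series + array: matched on position

print(by_label.sort_index().tolist())  # [31, 22, 13]
print(by_position.tolist())            # [11, 22, 33]
```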

Question 2: I'd like to combine to two columns together concatenate or hstack both don't work

It's not clear what the desired output is, but I think this is again a shape problem rather than a type problem. Here is one option, the pandas way:

>>> print(pd.concat([A, pd.Series(x.reshape(10))], axis=1))
          a         0
0 -0.158667 -0.048463
1 -0.847246 -1.269765
2 -0.128232  1.053778
3 -1.316113 -0.871593
4  1.057044  0.169414
5  3.188343 -1.043439
6 -0.032524 -0.924257
7  1.412443  0.543803
8 -0.730386  0.790519
9  0.289796  1.042621
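
If the column label 0 is unwanted, one option is to name the new Series, or to attach x directly as a column. This is a sketch that recreates data like the question's. The column name 'x' is an arbitrary choice.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 3), columns=['a', 'b', 'c'])
x = np.random.randn(10, 1)

# Name the Series so the second column is labelled 'x' instead of 0
pair = pd.concat([df['a'], pd.Series(x.ravel(), index=df.index, name='x')], axis=1)

# Or add x as a new column on a copy of the original DataFrame
with_x = df.assign(x=x.ravel())

print(list(pair.columns))    # ['a', 'x']
print(list(with_x.columns))  # ['a', 'b', 'c', 'x']
```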