How to unpack the columns of a pandas DataFrame to multiple variables

Question

Lists or numpy arrays can be unpacked to multiple variables if the dimensions match. For a 2xN array, the following will work:

import numpy as np 
a,b =          [[1,2,3],[4,5,6]]
a,b = np.array([[1,2,3],[4,5,6]])
# result: a=[1,2,3],   b=[4,5,6]

How can I achieve similar behavior for the columns of a pandas DataFrame? Extending the example above:

import pandas as pd 
df = pd.DataFrame([[1,2,3],[4,5,6]])
df.columns = ['A','B','C']    # Rename cols and
df.index = ['i', 'ii']        # rows for clarity

The following does not work as expected:

a,b = df.T
# result: a='i',   b='ii'
a,b,c = df
# result: a='A',   b='B',   c='C'

However, what I would like to get is:

a,b,c = unpack(df)
# result: a=df['A'], b=df['B'], c=df['C']

Is a function like unpack already available in pandas? Or can it be mimicked in a simple way?

Answer 1

I just figured out that the following works, which is already close to what I am trying to achieve:

a,b,c = df.T.values        # Common
a,b,c = df.T.to_numpy()    # Recommended
# a,b,c = df.T.as_matrix() # Deprecated, removed in pandas 1.0

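Details: As always, things are a little more complicated than one might think. Note that a pd.DataFrame stores its columns separately as Series. Calling df.values (or better: df.to_numpy()) is potentially expensive, as it combines the columns into a single ndarray, which likely involves copying and type conversions. Also, the resulting container has a single dtype that is able to accommodate all data in the data frame.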
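The dtype effect described above can be seen in a minimal sketch. The two-column frame below is a made-up example (not the frame from the question), chosen so that the columns have different dtypes:

```python
import pandas as pd

# Hypothetical frame: one integer column and one float column
df = pd.DataFrame({"A": [1, 2], "B": [0.5, 1.5]})
print(df.dtypes)            # A is an integer dtype, B is float64

# Combining the columns into one ndarray forces a single common dtype
a, b = df.T.to_numpy()
print(a.dtype, b.dtype)     # float64 float64
print(a)                    # the integers of column A are now floats
```

The integer column comes back as floats because a single ndarray can hold only one dtype; the per-column dtype is lost.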

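In summary, the above approach loses the per-column dtype information and is potentially expensive. It is technically cleaner to iterate over the columns in one of the following ways (there are more options):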

# The following alternatives create VIEWS!
a,b,c = (v for _,v in df.items())      # returns pd.Series
a,b,c = (df[c] for c in df)            # returns pd.Series

Note that the code above creates views! Modifying the data may trigger a SettingWithCopyWarning.

a.iloc[0] = "blabla"    # may emit SettingWithCopyWarning

If you want to modify the unpacked variables, you have to copy the columns.

# The following alternatives create COPIES!
a,b,c = (v.copy() for _,v in df.items())      # returns pd.Series
a,b,c = (df[c].copy() for c in df)            # returns pd.Series
a,b,c = (df[c].to_numpy(copy=True) for c in df)  # returns np.ndarray
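
As a quick check of the copy-based variants, the following minimal sketch rebuilds the frame from the question, modifies one unpacked column, and confirms that the original DataFrame is left untouched:

```python
import pandas as pd

# Same data as in the question
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                  columns=["A", "B", "C"], index=["i", "ii"])

# Unpack explicit copies of the columns
a, b, c = (df[col].copy() for col in df)

# Changing the copy does not touch the original frame
a.iloc[0] = 100
print(a.tolist())          # [100, 4]
print(df["A"].tolist())    # [1, 4]
```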

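While this is cleaner, it requires more characters. I personally do not recommend the above approach for production code. But to avoid typing (e.g., in interactive shell sessions), it is still a fair option...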

# More verbose and explicit alternatives
a,b,c = df["the first col"], df["the second col"], df["the third col"]
a,b,c = df.iloc[:,0], df.iloc[:,1], df.iloc[:,2]
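
For completeness, the unpack function asked for in the question could be sketched as a small helper. This is not part of pandas; the name unpack and the copy parameter are assumptions made for illustration. Columns are selected by position so that duplicate column labels still yield one Series each.

```python
import pandas as pd

def unpack(df, copy=True):
    """Hypothetical helper (not a pandas API): return the columns of df
    as a tuple of Series, copied by default so they can be modified."""
    columns = (df.iloc[:, i] for i in range(df.shape[1]))
    return tuple(col.copy() if copy else col for col in columns)

# Usage with the frame from the question
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                  columns=["A", "B", "C"], index=["i", "ii"])
a, b, c = unpack(df)
print(a.name, b.name, c.name)   # A B C
```

Returning a tuple rather than a generator lets the result be indexed or reused, at the cost of materializing all columns at once.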

