Dynamically accessing a pandas dataframe column

Question

Consider this simple example:

import pandas as pd

df = pd.DataFrame({'one' : [1,2,3],
                   'two' : [1,0,0]})

df 
Out[9]: 
   one  two
0    1    1
1    2    0
2    3    0

I want to write a function that takes the dataframe df and a column name mycol as input.

This works:

df.groupby('one').two.sum()
Out[10]: 
one
1    1
2    0
3    0
Name: two, dtype: int64

This also works:

def okidoki(df, mycol):
    return df.groupby('one')[mycol].sum()

okidoki(df, 'two')
Out[11]: 
one
1    1
2    0
3    0
Name: two, dtype: int64

But this fails:

def megabug(df,mycol):
    return df.groupby('one').mycol.sum()

megabug(df, 'two')
 AttributeError: 'DataFrameGroupBy' object has no attribute 'mycol'

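What is the problem here?

I am worried that okidoki relies on some kind of chaining that might introduce subtle bugs (https://pandas.pydata.org/pandas-docs/stable/indexing.html#why-does-assignment-fail-when-using-chained-indexing).

How can I keep the syntax groupby('one').mycol? Can the mycol string be converted into something that would work that way? Thanks!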

Answer 1

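I think you need [] to select the column by name. That is the general solution for selecting columns, because selecting by attribute has many exceptions. Quoting the pandas documentation: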

You can use this access only if the index element is a valid python identifier, e.g. s.1 is not allowed. See here for an explanation of valid identifiers.

The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed.

Similarly, the attribute will not be available if it conflicts with any of the following list: index, major_axis, minor_axis, items, labels.

In any of these cases, standard indexing will still work, e.g. s['1'], s['min'], and s['index'] will access the corresponding element or column.

def megabug(df,mycol):
    return df.groupby('one')[mycol].sum()

print (megabug(df, 'two'))

one
1    1
2    0
3    0
Name: two, dtype: int64
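As a rough illustration of the exceptions quoted above, here is a minimal sketch. The column names 'min' and '1' are made up for the demonstration and are not part of the original question.

```python
import pandas as pd

# Hypothetical frame: one column clashes with a method name ('min'),
# one is not a valid Python identifier ('1'), one is ordinary ('two').
demo = pd.DataFrame({'min': [3, 1, 2],
                     '1': [10, 20, 30],
                     'two': [1, 0, 0]})

# Attribute access finds the DataFrame.min method, not the 'min' column.
attr_result = demo.min
print(callable(attr_result))

# demo.1 would be a SyntaxError, so it cannot even be written.
# Bracket indexing works for every column name.
min_col = demo['min']
one_col = demo['1']
two_col = demo.two  # attribute access is fine for a plain identifier

print(min_col.tolist())
print(one_col.tolist())
```

Bracket indexing behaves the same way for all three columns, which is why it is the safer default inside a function that receives the column name as an argument.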

Answer 2

You are passing a string as the second argument. In effect, you are trying to do something like this:

df.'two'

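That is invalid syntax. If you want to access a column dynamically, you need to use index notation [...], because the dot/attribute accessor notation does not work for dynamic access.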
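To make the point concrete, here is a small sketch showing that the name after the dot is taken literally and never replaced by the value of the variable. The extra column named 'mycol' and the function name literal_lookup are assumptions added purely for the demonstration.

```python
import pandas as pd

# Same data as in the question, plus a hypothetical column literally named 'mycol'.
df_demo = pd.DataFrame({'one': [1, 2, 3],
                        'two': [1, 0, 0],
                        'mycol': [7, 8, 9]})

def literal_lookup(df, mycol):
    # The argument is ignored here: .mycol always means the column called 'mycol'.
    return df.groupby('one').mycol.sum()

result = literal_lookup(df_demo, 'two')
print(result.name)      # the selected column is 'mycol', not 'two'
print(result.tolist())
```

In the original frame there is no column called 'mycol', which is why the same code raises AttributeError instead.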

Dynamic attribute access is possible in itself. For example, you can use getattr (but I don't recommend this, it's an antipattern):

In [674]: df
Out[674]: 
   one  two
0    1    1
1    2    0
2    3    0

In [675]: getattr(df, 'one')
Out[675]: 
0    1
1    2
2    3
Name: one, dtype: int64

Selecting by attribute dynamically from a groupby call works the same way, for example:

In [677]: getattr(df.groupby('one'), mycol).sum() 
Out[677]: 
one
1    1
2    0
3    0
Name: two, dtype: int64

But don't do this. It is a horrible antipattern, and far harder to read than df.groupby('one')[mycol].sum().
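Readability aside, getattr can also pick up the wrong thing when a column name collides with a groupby method. A minimal sketch, assuming a made-up column named 'count' and a hypothetical helper sum_by_group:

```python
import pandas as pd

# Hypothetical frame whose value column shares its name with GroupBy.count.
clash = pd.DataFrame({'one': [1, 1, 2],
                      'count': [5, 6, 7]})

grouped = clash.groupby('one')

# getattr returns the bound method GroupBy.count, not the 'count' column.
by_attr = getattr(grouped, 'count')
print(callable(by_attr))

def sum_by_group(df, key, mycol):
    # Bracket indexing always selects the column, whatever its name is.
    return df.groupby(key)[mycol].sum()

sums = sum_by_group(clash, 'one', 'count')
print(sums.tolist())
```

Passing the group key as a parameter as well is only a design suggestion here. The function in the question hard-codes 'one'.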
