Dataframe list comprehension "zip(...)": loop through chosen df columns efficiently with just a list of column name strings

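This is just a nitpicky syntax question...

I have a dataframe, and I want to use a list comprehension to evaluate a function that takes a lot of columns.

I know I can do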

df['result_col'] = [some_func(*var) for var in zip(df['col_1'], df['col_2'],... ,df['col_n'])]

I would like to do something like this

df['result_col'] = [some_func(*var) for var in zip(df[['col_1', 'col_2',... ,'col_n']])]

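i.e. without having to write df n times. I can't for the life of me figure out the syntax.

This should work, but honestly, the OP worked it out as well, so +1 to the OP :)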

df['result_col'] = [some_func(*var) for var in zip(*[df[col] for col in ['col_1', 'col_2',... ,'col_n']])]
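For a concrete, runnable version of the line above, here is a minimal sketch. The frame, the column names, and `some_func` (a three-argument sum) are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical data and function, used only to illustrate the pattern.
df = pd.DataFrame({
    "col_1": [1, 2, 3],
    "col_2": [10, 20, 30],
    "col_3": [100, 200, 300],
})
cols = ["col_1", "col_2", "col_3"]


def some_func(a, b, c):
    return a + b + c


# zip(*...) unpacks the list of Series so that zip receives one iterable
# per column, just like the hand-written zip(df['col_1'], df['col_2'], ...).
df["result_col"] = [some_func(*var) for var in zip(*[df[col] for col in cols])]

# An alternative that also avoids repeating df: plain tuples per row.
alt = [some_func(*row) for row in df[cols].itertuples(index=False, name=None)]
```

Both forms need only the list `cols`, so adding or removing a column means editing one list.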

As mentioned in the comments above, you should use apply instead

df['result_col'] = df.apply(lambda x: some_func(*tuple(x.values)), axis=1)
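Note that the apply call above passes every column of df to some_func. If only a subset should be used, one option is to select those columns first. A small sketch, where the data and the two-argument `some_func` are hypothetical:

```python
import pandas as pd

# Hypothetical frame with one column that should NOT reach some_func.
df = pd.DataFrame({"col_1": [1, 2], "col_2": [3, 4], "other": ["x", "y"]})
cols = ["col_1", "col_2"]


def some_func(a, b):
    return a * b


# With axis=1 each x is a row Series restricted to cols; *x unpacks its values.
df["result_col"] = df[cols].apply(lambda x: some_func(*x), axis=1)
```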

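df.apply() is almost as slow as df.iterrows(), and neither is recommended; see How to iterate over rows in a DataFrame in Pandas --> search for "An Obvious Example" by @cs95 and look at the comparison graph. Since the fastest methods (vectorization, Cython routines) are not easy to implement, the third-best option, a list comprehension, is usually the best practical solution: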

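# print 3rd col
# zip() over a single 2D array yields 1-tuples, each wrapping one row array,
# so *row passes that whole row array to some_func.
# some_func only prints and returns None, so result_col is filled with None.
def some_func(row):
    print(row[2])


df['result_col'] = [some_func(*row) for row in zip(df[['col_1', 'col_2',... ,'col_n']].to_numpy())]

# print 3rd col
# Same result: take the row array out of the 1-tuple with row[0].
def some_func(row):
    print(row[2])

df['result_col'] = [some_func(row[0]) for row in zip(df[['col_1', 'col_2',... ,'col_n']].to_numpy())]

# print 3rd col
# Same result: pass only the third value of the row to some_func.
def some_func(x):
    print(x)

df['result_col'] = [some_func(row[0][2]) for row in zip(df[['col_1', 'col_2',... ,'col_n']].to_numpy())]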
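A note on why the calls above index row[0]: passing a single 2D array to zip yields 1-tuples, each wrapping one row, so the zip can be dropped and the array iterated directly. A small sketch with made-up data and a hypothetical `some_func` that returns a value instead of printing it:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"col_1": [1, 2], "col_2": [3, 4], "col_3": [5, 6]})
cols = ["col_1", "col_2", "col_3"]
arr = df[cols].to_numpy()

# zip over ONE iterable produces 1-tuples: (row_array,)
first = next(iter(zip(arr)))


def some_func(row):
    return row[2]  # value from the third selected column


# Iterating the 2D array directly already yields one row array at a time.
df["result_col"] = [some_func(row) for row in arr]
```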

See also:

  • Memory efficient way for list comprehension of pandas dataframe using multiple columns

EDIT:

Please use df.iloc and df.loc instead of df[[...]], see Selecting multiple columns in a pandas dataframe
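As a sketch of that suggestion, here is the same kind of selection written with df.loc (by label) and df.iloc (by position). The data, column names, and `some_func` are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"col_1": [1, 2], "col_2": [3, 4], "col_3": [5, 6]})
cols = ["col_1", "col_3"]


def some_func(a, b):
    return b - a


# Label-based selection of all rows and the listed columns.
rows = df.loc[:, cols].to_numpy()
df["result_col"] = [some_func(*row) for row in rows]

# Position-based selection of the same two columns.
by_pos = df.iloc[:, [0, 2]].to_numpy()
```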