Dataframe list comprehension "zip(...)": loop through chosen df columns efficiently with just a list of column name strings
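This is just a nitpicky syntax question...
I have a dataframe, and I want to use a list comprehension to evaluate a function that takes a lot of columns.
I know I can do this: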
df['result_col'] = [some_func(*var) for var in zip(df['col_1'], df['col_2'],... ,df['col_n'])]
I would like to do something like this:
df['result_col'] = [some_func(*var) for var in zip(df[['col_1', 'col_2',... ,'col_n']])]
i.e. without having to write df n times. I cannot for the life of me figure out the syntax.
This should work, but honestly, the OP figured it out as well, so +1 OP :)
df['result_col'] = [some_func(*var) for var in zip(*[df[col] for col in ['col_1', 'col_2',... ,'col_n']])]
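To see the idiom run end to end, here is a minimal sketch. The sample data, the `cols` list and the body of `some_func` are made up for illustration; only the `zip(*[...])` pattern is taken from the line above.

```python
import pandas as pd

# Hypothetical sample data; the column names mirror the question's placeholders.
df = pd.DataFrame({
    "col_1": [1, 2, 3],
    "col_2": [10, 20, 30],
    "col_3": [100, 200, 300],
})

def some_func(a, b, c):
    # Stand-in for the real function: one positional argument per column.
    return a + b + c

# The only place the column names are spelled out.
cols = ["col_1", "col_2", "col_3"]

# zip(*...) pairs the i-th element of every selected column,
# producing one tuple per row that is then unpacked into some_func.
df["result_col"] = [some_func(*var) for var in zip(*[df[col] for col in cols])]
```

The number of names in `cols` has to match the number of positional parameters that `some_func` accepts, since each row tuple is unpacked with `*var`.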
As mentioned in the comments above, you should use apply instead:
df['result_col'] = df.apply(lambda x: some_func(*tuple(x.values)), axis=1)
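One caveat with the apply one-liner: called on the whole frame, it passes every column of each row to `some_func`, not only the chosen ones. A sketch of restricting it to a list of column names, assuming hypothetical sample data and a made-up two-argument `some_func`:

```python
import pandas as pd

# Hypothetical sample data with one column that should NOT reach some_func.
df = pd.DataFrame({
    "col_1": [1, 2, 3],
    "col_2": [10, 20, 30],
    "unused": [7, 8, 9],
})

def some_func(a, b):
    # Stand-in for the real function.
    return a * b

cols = ["col_1", "col_2"]

# Select the chosen columns first, then apply row-wise (axis=1).
# Each x is a Series holding one row; *x unpacks its values in column order.
df["result_col"] = df[cols].apply(lambda x: some_func(*x), axis=1)
```

This keeps the column names in a single list, but it is still a row-wise apply, so it does not address any speed concerns.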
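df.apply() is almost as slow as df.iterrows(), and neither is recommended; see How to iterate over rows in a DataFrame in Pandas --> search for "An Obvious Example" in @cs95's answer and look at the comparison graph. Since the fastest methods (vectorization, Cython routines) are not easy to implement, the third-best option, and usually the best practical one, is a list comprehension: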
# print 3rd col
def some_func(row):
    print(row[2])
df['result_col'] = [some_func(*row) for row in zip(df[['col_1', 'col_2',... ,'col_n']].to_numpy())]
or
# print 3rd col
def some_func(row):
    print(row[2])
df['result_col'] = [some_func(row[0]) for row in zip(df[['col_1', 'col_2',... ,'col_n']].to_numpy())]
or
# print 3rd col
def some_func(x):
    print(x)
df['result_col'] = [some_func(row[0][2]) for row in zip(df[['col_1', 'col_2',... ,'col_n']].to_numpy())]
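In the variants above, `zip(...)` receives a single 2-D array, so every item it yields is a 1-tuple wrapping one row array, which is why `*row` or `row[0]` is needed. The sketch below shows that wrapping and then iterates the array directly, which gives the row without the extra layer. The sample data and the helper `third_value` are hypothetical, and the helper returns the value instead of printing it so that the new column is actually filled.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "col_1": [1, 2, 3],
    "col_2": [10, 20, 30],
    "col_3": [100, 200, 300],
})
cols = ["col_1", "col_2", "col_3"]

def third_value(row):
    # Hypothetical helper: return (rather than print) the third selected column.
    return row[2]

# zip() over one 2-D array yields 1-tuples, each holding a single row array.
wrapped = list(zip(df[cols].to_numpy()))

# Iterating the array itself yields the row arrays directly, no zip required.
df["result_col"] = [third_value(row) for row in df[cols].to_numpy()]
```

If the selected columns have different dtypes, `to_numpy()` has to find a common dtype for the whole array (often `object`), so zipping the individual columns may be the safer choice in that case.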
See also:
- Memory efficient way for list comprehension of pandas dataframe using multiple columns
Edit:
Please use df.iloc and df.loc instead of df[[...]], see Selecting multiple columns in a pandas dataframe
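As a rough illustration of that suggestion, the sketch below selects the columns with `df.loc` (by label) and `df.iloc` (by position) and feeds the rows into the list comprehension through `itertuples`. The sample data and `some_func` are hypothetical, and the use of `itertuples(index=False, name=None)` is my own choice for producing plain row tuples, not something the answer prescribes.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "col_1": [1, 2, 3],
    "col_2": [10, 20, 30],
    "col_3": [100, 200, 300],
})

def some_func(a, b, c):
    # Stand-in for the real function.
    return a + b + c

cols = ["col_1", "col_2", "col_3"]

# Label-based selection: all rows, the listed columns.
by_label = df.loc[:, cols]

# Position-based selection of the same three columns.
by_position = df.iloc[:, [0, 1, 2]]

# itertuples(index=False, name=None) yields one plain tuple per row.
df["result_col"] = [
    some_func(*row) for row in by_label.itertuples(index=False, name=None)
]
```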