Python datatable - apply lambda to multiple columns

Question

I would like to apply a function to multiple columns of a datatable Frame in Python. With R's data.table, one would do:

# columns to apply function to
x <- c('col_1', 'col_2')

# apply
df[, (x) := lapply(.SD, function(x) as.Date(x, "%Y-%m-%d")), .SDcols=x]

How can I do the same with Python's datatable? I have some familiarity with apply and lambda in pandas, for example:

import pandas as pd

# create dummy data
df = pd.DataFrame({'col_1': ['2021-12-01']
                   , 'col_2': ['2021-12-02']
                   , 'col_3': ['foobar']
                   }
                  )

# columns to apply function to
x = ['col_1', 'col_2']

# apply
df[x] = df[x].apply(lambda x: pd.to_datetime(x, format='%Y-%m-%d'))

But what is the equivalent in Python's datatable? This assumes I stick with apply and lambda. Thanks.

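Edit: I have changed the example from a UDF to the standard function pd.to_datetime because, as some of you mentioned, the former is not possible while the latter is. Please feel free to use any example to illustrate apply with datatable. Thanks.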

Answer 1

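I recently opened a PR that shows ways to transform columns in datatable; it should be merged soon. Feel free to comment on it and update it.

For your question, you can use direct assignment or the update method: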

from datatable import dt, f, update, Type, as_type

DT0 = dt.Frame({'col_1': ['2021-12-01']
                   , 'col_2': ['2021-12-02']
                   , 'col_3': ['foobar']
                   }
                  )

cols = ['col_1', 'col_2']

DT0
   | col_1       col_2       col_3 
   | str32       str32       str32 
-- + ----------  ----------  ------
 0 | 2021-12-01  2021-12-02  foobar
[1 row x 3 columns]

Via reassignment:

DT = DT0.copy()

DT[:, cols] = DT[:, as_type(f[cols], Type.date32)]

DT
   | col_1       col_2       col_3 
   | date32      date32      str32 
-- + ----------  ----------  ------
 0 | 2021-12-01  2021-12-02  foobar
[1 row x 3 columns]

With direct assignment, you can assign the f-expressions to the columns; this works only for single-column assignment:

DT = DT0.copy()

DT['col_1'] = as_type(f.col_1, Type.date32)

DT['col_2'] = as_type(f.col_2, Type.date32)

DT
 
   | col_1       col_2       col_3 
   | date32      date32      str32 
-- + ----------  ----------  ------
 0 | 2021-12-01  2021-12-02  foobar
[1 row x 3 columns]

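The update function works as well; I like it especially for SQL-window-like operations, where I do not want the order of the columns to change (datatable sorts when doing a groupby):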

DT = DT0.copy()

DT[:, update(col_1 = dt.as_type(f.col_1, Type.date32), 
             col_2 = dt.as_type(f.col_2, Type.date32))]
DT
   | col_1       col_2       col_3 
   | date32      date32      str32 
-- + ----------  ----------  ------
 0 | 2021-12-01  2021-12-02  foobar
[1 row x 3 columns]

Note that update works in place; no reassignment is needed. For multiple columns, a dictionary can help automate the process:

columns = {col : as_type(f[col], Type.date32) for col in cols}

print(columns)
{'col_1': FExpr<as_type(f['col_1'], date32)>,
 'col_2': FExpr<as_type(f['col_2'], date32)>}

# unpack the dictionary within the datatable brackets
DT = DT0.copy()
DT[:, update(**columns)]

DT
   | col_1       col_2       col_3 
   | date32      date32      str32 
-- + ----------  ----------  ------
 0 | 2021-12-01  2021-12-02  foobar
[1 row x 3 columns]
