How to flatten array types inside a pandas DataFrame with an sklearn transformer?

I have a pandas DataFrame with both scalar and array columns, for example:

df = pd.DataFrame({
  "scalar": [1, 2, 3, 4],
  "array": [[10,20], [30,40], [50, 60], [70, 80]],
})

I want to write an sklearn transformer that flattens it, so that:

transformer = ???
transformer.fit_transform(df)
===>
[[1 10 20]
 [2 30 40]
 [3 50 60]
 [4 70 80]]

How can I do this?

Since this is a stateless transformation, you can use FunctionTransformer to build a transformer from a plain function.

import pandas as pd
import numpy as np
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({
  "scalar": [1, 2, 3, 4],
  "array": [[10,20], [30,40], [50, 60], [70, 80]],
})


def flatten_df_rows(df):
    def flatten(row):
        # flatten lists recursively 
        for val in row:
            if isinstance(val, list):
                yield from flatten(val)
            else:
                yield val
    # flatten each row of the df recursively           
    return np.array([list(flatten(row)) for row in df.values.tolist()])

transform = FunctionTransformer(flatten_df_rows)
out = transform.fit_transform(df)

Output:

>>> out 

array([[ 1, 10, 20],
       [ 2, 30, 40],
       [ 3, 50, 60],
       [ 4, 70, 80]])
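
Note that the recursive version above only unpacks Python lists, so cells holding tuples or NumPy arrays would be left as they are. As a possible alternative, here is a minimal sketch that works column by column instead. It assumes every array-like cell in a given column has the same length; flatten_df_columns is a hypothetical helper name, not part of sklearn or pandas.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def flatten_df_columns(df):
    # Hypothetical helper: turn each column into a 2-D block and
    # concatenate the blocks horizontally.
    blocks = []
    for name in df.columns:
        # A scalar column becomes shape (n_rows,); a column of
        # equal-length lists/tuples/arrays becomes shape (n_rows, k).
        block = np.array(df[name].tolist())
        # Force every block to 2-D so that hstack lines them up by row.
        blocks.append(block.reshape(len(df), -1))
    return np.hstack(blocks)


df = pd.DataFrame({
    "scalar": [1, 2, 3, 4],
    # Cells are NumPy arrays here rather than lists (an assumption made
    # for illustration).
    "array": [np.array([10, 20]), np.array([30, 40]),
              np.array([50, 60]), np.array([70, 80])],
})

transform = FunctionTransformer(flatten_df_columns)
out = transform.fit_transform(df)
```

Because the result is still a FunctionTransformer, it can be dropped into a Pipeline like any other step. Whether the per-row or the per-column version is the better fit depends on whether the arrays can be nested or ragged; the column-wise sketch handles neither.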