In Pandas, how can a DataFrame be binned by two columns, with the other columns changed to the means within those bins?

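I have used UMAP to project the standard iris dataset down to two dimensions, and added the UMAP dimensions (the x and y positions in the 2D plot) as columns of the dataframe: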

import numpy as np
import math
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris
import umap # pip install umap-learn

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = pd.Series(iris.target).map(dict(zip(range(3), iris.target_names)))

_umap = umap.UMAP().fit_transform(iris.data)
iris_df['UMAP_x'] = _umap[:,0]
iris_df['UMAP_y'] = _umap[:,1]
iris_df.head()

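I want to bin both the UMAP_x and UMAP_y columns into 25 bins each, and then have the other columns in the dataframe changed to the mean of each column within each bin. How can this be done? It feels like cut or resample might lead to the answer, but I'm not sure how.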

You can use cut to define the bins, then use groupby and transform to calculate the mean for each bin.

import numpy as np
import math
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris
import umap

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = pd.Series(iris.target).map(dict(zip(range(3), iris.target_names)))

_umap = umap.UMAP().fit_transform(iris.data)
iris_df['UMAP_x'] = _umap[:,0]
iris_df['UMAP_y'] = _umap[:,1]

# Split UMAP_x and UMAP_y into 25 equal-width bins each
iris_df['UMAP_x_bin'] = pd.cut(iris_df['UMAP_x'], bins=25)
iris_df['UMAP_y_bin'] = pd.cut(iris_df['UMAP_y'], bins=25)

# Mean UMAP coordinate within each per-axis bin, broadcast back to every row
# (observed=True restricts the grouping to bins that actually contain points)
iris_df['UMAP_x_mean'] = iris_df.groupby('UMAP_x_bin', observed=True)['UMAP_x'].transform('mean')
iris_df['UMAP_y_mean'] = iris_df.groupby('UMAP_y_bin', observed=True)['UMAP_y'].transform('mean')

iris_df.head()
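
The code above averages each UMAP coordinate within its own axis bin. To replace the other columns with their means inside each two-dimensional cell, as the question asks, one option is to group by both bin columns at once. The following is a minimal sketch, not a definitive implementation. It uses synthetic data so that it runs without umap-learn; the frame df and the columns feature_a and feature_b are hypothetical stand-ins for iris_df and its measurement columns.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for iris_df (assumption): two coordinate columns
# plus two numeric value columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "UMAP_x": rng.uniform(0, 10, size=200),
    "UMAP_y": rng.uniform(0, 10, size=200),
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
})

# Bin each coordinate into 25 equal-width intervals.
df["UMAP_x_bin"] = pd.cut(df["UMAP_x"], bins=25)
df["UMAP_y_bin"] = pd.cut(df["UMAP_y"], bins=25)

bin_cols = ["UMAP_x_bin", "UMAP_y_bin"]
value_cols = ["feature_a", "feature_b"]

# Option 1: keep one row per observation and replace each value
# with the mean of its (x bin, y bin) cell.
binned = df.copy()
binned[value_cols] = (
    df.groupby(bin_cols, observed=True)[value_cols].transform("mean")
)

# Option 2: collapse to one row per occupied (x bin, y bin) cell.
cell_means = (
    df.groupby(bin_cols, observed=True)[value_cols]
    .mean()
    .reset_index()
)

print(binned.head())
print(cell_means.head())
```

With the real data, value_cols would be list(iris.feature_names). A text column such as species has no mean, so it would need to be left out or summarised with a different aggregate.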