How to find repeated rows in a pandas DataFrame for a specific column, and modify values by adding a counter?

Question

For simplicity, consider a DataFrame with 2 columns. The first column is label, which has the same value for some observations in the dataset.

Sample dataset:

import pandas as pd
  
data = [('A', 28),
        ('B', 32),
        ('B', 32),
        ('C', 25),
        ('D', 25),
        ('D', 40),
        ('E', 32) ]

data_df = pd.DataFrame(data, columns = ['label', 'num'])

For the label column, I want to find the rows that share the same value and rewrite each of those values as value_counter, like this:

label   num
A        28
B_1      32 
B_2      32
C        25
D_1      25
D_2      40
E        32

I tried using pandas groupby, but I don't know which transform I have to use.

Thanks for your help.

Answer 1

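You can create an empty dictionary that maps each label to its count (keys and values respectively). Then, depending on whether a label is new or has already been seen, you either append it to the output unchanged or increment its count and append the label together with that count.

The last step is to use this new list as the new label column: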

labels = data_df['label'].tolist()
new_labels = []
label_c = {}

# iterate through the labels in order
for val in labels:
    if val not in label_c:     # first time this label is seen
        label_c[val] = 0       # start its counter
        new_labels.append(val) # keep the first occurrence unchanged
    else:                      # label has been seen before
        label_c[val] += 1      # increment its count
        new_labels.append(f"{val}_{label_c[val]}") # append the label with its count

data_df['label'] = new_labels

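Output (the first occurrence keeps its original label, so the numbering differs slightly from the format in the question):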

>>> print(data_df)

  label  num
0     A   28
1     B   32
2   B_1   32
3     C   25
4     D   25
5   D_1   40
6     E   32
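
If every occurrence of a repeated label should carry a suffix (B_1, B_2, as in the question), one possible variant is to count the totals first and then number each occurrence. This is only a sketch and not part of the original answer; the helper name add_counters is made up for illustration.

```python
from collections import Counter

import pandas as pd


def add_counters(labels):
    """Suffix every occurrence of a repeated label with _1, _2, ...

    Hypothetical helper for illustration only.
    """
    totals = Counter(labels)   # how often each label occurs overall
    seen = Counter()           # how often each label has been seen so far
    out = []
    for val in labels:
        if totals[val] == 1:   # unique label: keep it as is
            out.append(val)
        else:                  # repeated label: number every occurrence from 1
            seen[val] += 1
            out.append(f"{val}_{seen[val]}")
    return out


data_df = pd.DataFrame(
    [('A', 28), ('B', 32), ('B', 32), ('C', 25), ('D', 25), ('D', 40), ('E', 32)],
    columns=['label', 'num'],
)
data_df['label'] = add_counters(data_df['label'].tolist())
print(data_df['label'].tolist())
# ['A', 'B_1', 'B_2', 'C', 'D_1', 'D_2', 'E']
```

The extra pass over the labels is what lets the loop know, at the first B, that a second B is coming.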

Answer 2

You can use:

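import numpy as np
# 1-based position of each row within its label group
s = data_df.groupby('label').cumcount() + 1
# keep=False marks every row of a repeated label, so only those get a suffix
data_df['label'] = np.where(data_df.duplicated(subset='label', keep=False),
                            data_df['label'] + '_' + s.astype(str), data_df['label'])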

Output:

  label  num
0     A   28
1   B_1   32
2   B_2   32
3     C   25
4   D_1   25
5   D_2   40
6     E   32
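
To see why this works, it may help to look at the intermediate pieces. The sketch below rebuilds the sample data and swaps duplicated for groupby(...).transform('size'), which is probably the kind of transform the question was asking about. It is an alternative under that assumption, not the answer's own code, and the names grouped, counter and sizes are only illustrative.

```python
import pandas as pd

data_df = pd.DataFrame(
    [('A', 28), ('B', 32), ('B', 32), ('C', 25), ('D', 25), ('D', 40), ('E', 32)],
    columns=['label', 'num'],
)

grouped = data_df.groupby('label')['label']
counter = grouped.cumcount() + 1     # 1-based position of each row within its group
sizes = grouped.transform('size')    # group size, broadcast back to every row

print(counter.tolist())  # [1, 1, 2, 1, 1, 2, 1]
print(sizes.tolist())    # [1, 2, 2, 1, 2, 2, 1]

# Series.where keeps the original label where the group has a single row
# and takes the suffixed label everywhere else.
suffixed = data_df['label'] + '_' + counter.astype(str)
data_df['label'] = data_df['label'].where(sizes == 1, suffixed)
print(data_df['label'].tolist())
# ['A', 'B_1', 'B_2', 'C', 'D_1', 'D_2', 'E']
```

Rows whose group size is 1 are exactly the rows that duplicated(subset='label', keep=False) leaves unmarked, so both versions suffix the same rows.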
