Replacing duplicate values in a dataframe

Question

I have the following dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 5, 5, 6, 1, 1, 2, 2, 5, 9, 1, 2, 3, 3, 3, 5]})
print(df)

This prints a dataframe with a single 'ID' column holding the 18 values above.

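I want to replace the repeated values in the 'ID' column with the lowest value that has not been used yet. However, consecutive identical values should be treated as one group, and their values should change in the same way. For example, the first two values are both 1. They are consecutive, so they form one group, and the second 1 should not be replaced with 2. Rows 14-16 are three consecutive 3s. The value 3 has already been used to replace values further up, so these three values need to be replaced. But they are consecutive, hence one group, and should all get the same replacement value. The expected result should make this clearer: the 'ID' column becomes 1, 1, 2, 5, 5, 6, 3, 3, 4, 4, 7, 9, 8, 10, 11, 11, 11, 12.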

Answer 1

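I came up with a way to get the result using for loops and a dictionary. To be fair, I expected it to be harder. The code looks a bit complicated at first, but it is not. There may be a way to do this with several logical vectors, but I don't know it.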

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 5, 5, 6, 1, 1, 2, 2, 5, 9, 1, 2, 3, 3, 3, 5]})
print(df)
####################
diffs = np.diff(df.ID) # differences ID(k) - ID(k-1)
uniq = sorted(pd.unique(df.ID)) # unique values in ID column
ids = df.ID.to_numpy().copy() # editable copy, written back to df at the end

# dict with range of numbers from min to a heuristic max
# (the upper bound is a guess and may be too small for other inputs)
a = range(uniq[0],uniq[-1]*int(df.shape[0]/len(uniq))) # range values
d = {k: False for k in a} # Fill dict
d[ids[0]] = True # Set first value in col as True

for m in range(1,df.shape[0]):
    # Find a value different from previous one
    # therefore, beginning of new subgroup
    if diffs[m-1] != 0:
        # Check if value was before in the ID column
        if d.get(ids[m], False):
            # Get the lowest value which wasn't used
            lowest = [k for k, v in d.items() if not v][0]
            # Loop over the subgroup (whose differences are 0)
            for n in range(m+1,df.shape[0]):
                if diffs[n-1] != 0: # If a new subgroup starts
                    break          # then stop looping
            else:
                # No break: the subgroup runs to the end of the column
                n = df.shape[0]
            # Replace the subgroup with the lowest value
            ids[m:n] = lowest # n is one past the last index of the subgroup
    # Update dictionary for values retrieved from ID column
    d[ids[m]] = True

df['ID'] = ids # write the result back without chained assignment
print(df)

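So what you need is to treat your ID column as a sequence of subgroups, or separate arrays, then check different conditions and act accordingly. You can think of the column as a set of several arrays: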

[1, 1 | 2 | 5, 5 | 6 | 1, 1 | 2, 2 | 5 | 9 | 1 | 2 | 3, 3, 3 | 5]

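What you need to do is find the boundaries of each subgroup and check the conditions for it (1. whether its value has appeared before, 2. which is the lowest number we have not used yet). If we compute the difference between each value and the previous one, we can tell where the subgroups are: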

diffs = np.diff(df.ID) # differences ID(k) - ID(k-1)
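To make the link between the differences and the subgroups concrete, here is a small sketch. The names `ids`, `starts`, and `runs` are illustrative and not part of the code above. A non-zero difference marks the first row of a new subgroup, so the start indices can be read straight off `diffs`:

```python
import numpy as np

ids = np.array([1, 1, 2, 5, 5, 6, 1, 1, 2, 2, 5, 9, 1, 2, 3, 3, 3, 5])
diffs = np.diff(ids)  # diffs[k] = ids[k+1] - ids[k]

# Row 0 always starts a subgroup; row k+1 starts one when diffs[k] != 0
starts = np.concatenate(([0], np.flatnonzero(diffs != 0) + 1))

# Split the column at every start except the first to get the subgroups
runs = [run.tolist() for run in np.split(ids, starts[1:])]

print(starts.tolist())  # [0, 2, 3, 5, 6, 8, 10, 11, 12, 13, 14, 17]
print(runs)
```

The twelve lists in `runs` are the same subgroups as in the bar-separated listing above.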

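We can keep track of the conditions with a dictionary whose keys are the integers in the array, plus some larger ones we might need, and whose values record whether we have already used them (True or False).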

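For this we need the maximum value of the ID column. However, the dictionary has to contain more numbers than the column does (in your example, max(input) = 9 and max(output) = 12). You could pick the upper limit arbitrarily. I chose to scale the maximum by the ratio between the number of rows and the number of unique values in the column (the last argument to range in a = ...). For this example that is 9 * int(18 / 7) = 18, so the keys run from 1 to 17. This is only a heuristic, so it may be too small for other inputs.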

uniq = sorted(pd.unique(df.ID)) # unique values in ID column
# dict with range of numbers from min to max in ID col
d = {}
a = range(uniq[0],uniq[-1]*int(df.shape[0]/len(uniq)))
d = {a[k]:False for k in range(len(a))}
d[df.ID[0]] = True # Set first value in col as True
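The size of the dictionary is a guess, and a guess that is too small would leave no unused key to pick. One way to avoid guessing, sketched here under the assumption that the IDs are integers: each subgroup marks at most one new value as used, so the lowest unused value can never exceed the smallest ID plus the number of subgroups. The names `n_groups`, `lo`, `hi`, and `upper` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 5, 5, 6, 1, 1, 2, 2, 5, 9, 1, 2, 3, 3, 3, 5]})
ids = df['ID'].to_numpy()

# Number of subgroups = number of value changes + 1
n_groups = int(np.count_nonzero(np.diff(ids))) + 1

lo, hi = int(ids.min()), int(ids.max())
# Keys must cover every original value (up to hi) and every possible
# replacement (at most lo + n_groups, since each subgroup uses one value)
upper = max(hi, lo + n_groups)
d = {k: False for k in range(lo, upper + 1)}

print(n_groups, upper, len(d))  # 12 13 13
```

For this example the bound is 13, which is enough to hold the largest replacement value, 12.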

The last part of the code is a main for loop containing a few if statements and another for loop. It works like this:

# 1. Loop over the ID column.
# 2. Check whether ID[m] differs from the previous value (diff != 0), which
#    marks the start of a new subgroup.
# 3. Check whether the value ID[m] has already appeared in the ID column.
# 4. If so, take the lowest unused value (the first key in the dict that is
#    still False) and assign it to the whole subgroup.
# 5. Because of how step 4 finds the end of a subgroup, a subgroup at the very
#    end of the column needs special handling, so there is a small extra
#    check for that case.
# 6. Update the dict every time a new value shows up.
I'm sure there are many ways to shorten this code. But it should work on larger dataframes under the same conditions.
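One possible shorter variant of the same steps, as a sketch rather than a drop-in replacement. It assumes integer IDs and searches for replacements upward from the column minimum, as the dictionary above does. The function name `replace_repeated_runs` and its local names are hypothetical. It replaces the dictionary with a set of used values and a candidate counter, so no upper bound has to be guessed:

```python
import pandas as pd


def replace_repeated_runs(ids: pd.Series) -> pd.Series:
    """Give each run that repeats an earlier value the lowest unused integer."""
    run = ids.ne(ids.shift()).cumsum()   # run label per row: 1, 2, 3, ...
    first = ids.groupby(run).first()     # original value of each run, in order
    used = set()                         # values already present in the output
    candidate = int(ids.min())           # lowest value that might still be free
    new_value = {}                       # run label -> value to write
    for label, value in first.items():
        value = int(value)
        if value in used:
            # Repeated value: advance to the lowest unused integer
            while candidate in used:
                candidate += 1
            value = candidate
        used.add(value)
        new_value[label] = value
    return run.map(new_value)


df = pd.DataFrame({'ID': [1, 1, 2, 5, 5, 6, 1, 1, 2, 2, 5, 9, 1, 2, 3, 3, 3, 5]})
result = replace_repeated_runs(df['ID'])
print(result.tolist())
# [1, 1, 2, 5, 5, 6, 3, 3, 4, 4, 7, 9, 8, 10, 11, 11, 11, 12]
```

The Python loop runs once per subgroup rather than once per row, and the final `map` writes each subgroup's value in one step.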

Answer 2

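import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 5, 5, 6, 1, 1, 2, 2, 5, 9, 1, 2, 3, 3, 3, 5]})


def fun():
    # Coroutine: receives one group at a time via send() and yields it back
    v, dub = 1, set()  # v: next candidate value, dub: values already used
    d = yield
    while True:
        num = d['ID'].iloc[0]
        if num in dub:
            # Repeated value: advance to the lowest unused candidate
            while v in dub:
                v += 1
            num = v
            d = d.assign(ID=num)
        dub.add(num)
        d = yield d


f = fun()
next(f)  # prime the generator so it is waiting at the first yield

# Label each run of consecutive equal values, then process the runs in order
run = df['ID'].diff().ne(0).cumsum().rename('run')
df = df.groupby(run, group_keys=False).apply(lambda x: f.send(x))
print(df)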

Output: the 'ID' column becomes 1, 1, 2, 5, 5, 6, 3, 3, 4, 4, 7, 9, 8, 10, 11, 11, 11, 12.
