Delete sequential rows within a multi-index dynamically

I have a df:

          pageid
sid vid
 1  ABC     dog
    ABC     dog
    ABC     dog
    ABC     dog
 2  DEF     cat
    DEF     cat
    DEF     pig
    DEF     cat
 3  GHI     pig
    GHI     cat
    GHI     dog
    GHI     dog

Constructor:

import pandas as pd

i = pd.MultiIndex.from_arrays(
    [[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
     ['ABC', 'ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'DEF', 'DEF', 'GHI', 'GHI',
      'GHI', 'GHI']],
    names=('sid', 'vid')
)

df = pd.DataFrame({
    'pageid': ['dog', 'dog', 'dog', 'dog', 'cat', 'cat', 'pig', 'cat',
               'pig', 'cat', 'dog', 'dog']
}, index=i)

I want to drop duplicates from the pageid column if they occur within a single session (sid), and only when they are consecutive n times. The only examples I have found use .shift(), which works nicely as long as I don't have to worry about n > 1 duplicates. Unfortunately, in some cases I get something like n = 30 consecutive duplicates.
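For reference, a minimal sketch of the .shift() pattern I mean (variable names are my own; it keeps only the first row of each run of adjacent pageid values by comparing each row to the one directly above it):

```python
import pandas as pd

i = pd.MultiIndex.from_arrays(
    [[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
     ['ABC'] * 4 + ['DEF'] * 4 + ['GHI'] * 4],
    names=('sid', 'vid'))
df = pd.DataFrame(
    {'pageid': ['dog', 'dog', 'dog', 'dog', 'cat', 'cat', 'pig', 'cat',
                'pig', 'cat', 'dog', 'dog']},
    index=i)

# keep a row only when its pageid differs from the row directly above;
# note this compares values only and ignores session (sid) boundaries
mask = df['pageid'].ne(df['pageid'].shift())
out = df[mask.values]
```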

Before:

          pageid
sid vid
 1  ABC     dog
    ABC     dog
    ABC     dog
    ABC     dog
 2  DEF     cat
    DEF     cat
    DEF     pig
    DEF     cat
 3  GHI     pig
    GHI     cat
    GHI     dog
    GHI     dog

After:

          pageid
sid vid
 1  ABC     dog
 2  DEF     cat
    DEF     pig
    DEF     cat
 3  GHI     pig
    GHI     cat
    GHI     dog

Global duplicates

You can reset_index and compute duplicated:

df[~df.reset_index().duplicated().values]

Output:

        pageid
sid vid       
1   ABC    dog
2   DEF    cat
    DEF    pig
3   GHI    pig
    GHI    cat
    GHI    dog
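Equivalently (an alternative sketch, not from the answer above), the same result can be obtained by round-tripping through drop_duplicates and restoring the index:

```python
import pandas as pd

i = pd.MultiIndex.from_arrays(
    [[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
     ['ABC'] * 4 + ['DEF'] * 4 + ['GHI'] * 4],
    names=('sid', 'vid'))
df = pd.DataFrame(
    {'pageid': ['dog', 'dog', 'dog', 'dog', 'cat', 'cat', 'pig', 'cat',
                'pig', 'cat', 'dog', 'dog']},
    index=i)

# drop rows where the whole (sid, vid, pageid) combination repeats
# anywhere in the frame, then restore the MultiIndex
out = df.reset_index().drop_duplicates().set_index(['sid', 'vid'])
```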

Consecutive duplicates

df2 = df[['pageid']].reset_index()
df[~df2.eq(df2.shift()).all(1).values]

Output:

        pageid
sid vid       
1   ABC    dog
2   DEF    cat
    DEF    pig
    DEF    cat
3   GHI    pig
    GHI    cat
    GHI    dog
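The same result can also be reached without resetting the index, via a per-group shift (a sketch, assuming you prefer to stay on the MultiIndex):

```python
import pandas as pd

i = pd.MultiIndex.from_arrays(
    [[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
     ['ABC'] * 4 + ['DEF'] * 4 + ['GHI'] * 4],
    names=('sid', 'vid'))
df = pd.DataFrame(
    {'pageid': ['dog', 'dog', 'dog', 'dog', 'cat', 'cat', 'pig', 'cat',
                'pig', 'cat', 'dog', 'dog']},
    index=i)

# compare each pageid to the previous one within its (sid, vid) group;
# the first row of each group compares against NaN and is always kept
prev = df.groupby(level=['sid', 'vid'])['pageid'].shift()
out = df[df['pageid'].ne(prev).values]
```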

Consecutive duplicates with a threshold

thresh = 3

df2 = df[['pageid']].reset_index()
m = df2.eq(df2.shift()).all(1).groupby(df.set_index('pageid', append=True).index).cumsum()
df.loc[m.lt(thresh).values]

Output (example threshold: 3):

        pageid
sid vid       
1   ABC    dog
    ABC    dog
    ABC    dog
2   DEF    cat
    DEF    cat
    DEF    pig
    DEF    cat
3   GHI    pig
    GHI    cat
    GHI    dog
    GHI    dog
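To unpack the mask: m is a running count, per (sid, vid, pageid) combination, of how many times a row has matched the row directly above it, and rows survive while that count stays below thresh. A condensed sketch of the same steps (with an explicit int cast added for safety on boolean cumsum):

```python
import pandas as pd

i = pd.MultiIndex.from_arrays(
    [[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
     ['ABC'] * 4 + ['DEF'] * 4 + ['GHI'] * 4],
    names=('sid', 'vid'))
df = pd.DataFrame(
    {'pageid': ['dog', 'dog', 'dog', 'dog', 'cat', 'cat', 'pig', 'cat',
                'pig', 'cat', 'dog', 'dog']},
    index=i)

thresh = 3
df2 = df[['pageid']].reset_index()

# True where a row equals the row directly above it (index and value)
is_rep = df2.eq(df2.shift()).all(1)

# running repeat count per (sid, vid, pageid) combination
key = df.set_index('pageid', append=True).index
m = is_rep.astype(int).groupby(key).cumsum()

# keep rows whose running repeat count is still below the threshold
out = df.loc[m.lt(thresh).values]
```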

I think you can use shift on the groupby, then rolling().sum() on the groupby again:

# lazy groupby
groups = df.groupby(level=[0,1])

# if this is equal to the previous data
df['shifted'] = groups['pageid'].shift() == df['pageid']

# threshold
thresh = 2
mask = groups['shifted'].rolling(thresh).sum().fillna(0) < thresh

df.loc[mask.values]

Output:

        pageid  shifted
sid vid                
1   ABC    dog    False
    ABC    dog     True
2   DEF    cat    False
    DEF    cat     True
    DEF    pig    False
    DEF    cat    False
3   GHI    pig    False
    GHI    cat    False
    GHI    dog    False
    GHI    dog     True
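One caveat with the output above: the shifted helper column remains in the result. A self-contained sketch of the same idea that drops it afterwards (again with thresh = 2, so at most two consecutive copies per session survive; the int cast is my addition for safety in the rolling sum):

```python
import pandas as pd

i = pd.MultiIndex.from_arrays(
    [[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
     ['ABC'] * 4 + ['DEF'] * 4 + ['GHI'] * 4],
    names=('sid', 'vid'))
df = pd.DataFrame(
    {'pageid': ['dog', 'dog', 'dog', 'dog', 'cat', 'cat', 'pig', 'cat',
                'pig', 'cat', 'dog', 'dog']},
    index=i)

# True where a row repeats the previous pageid within its (sid, vid) group
df['shifted'] = df.groupby(level=[0, 1])['pageid'].shift() == df['pageid']

# a row is dropped once the last `thresh` flags are all True,
# i.e. from the (thresh + 1)-th consecutive copy onwards
thresh = 2
mask = (df['shifted'].astype(int)
          .groupby(level=[0, 1])
          .rolling(thresh).sum().fillna(0) < thresh)

out = df.loc[mask.values].drop(columns='shifted')
```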

If you can work out the order of pageid within each multi-index, one option is to step through each element, keeping state on whether it is the same as the element before it. For example:

class Duplicated():
    def __init__(self):
        self.last = None
        
    def is_duplicate(self, x):
        if x == self.last:
            return True
        
        else:
            self.last = x
            return False
        
i = pd.MultiIndex.from_arrays(
    [[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
     ['ABC', 'ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'DEF', 'DEF', 'GHI', 'GHI',
      'GHI', 'GHI']],
    names=('sid', 'vid')
)

df = pd.DataFrame({
    'pageid': ['dog', 'dog', 'dog', 'dog', 'cat', 'cat', 'pig', 'cat',
               'pig', 'cat', 'dog', 'dog']
}, index=i)

dupe_checker = Duplicated()

# items() yields (index, value) pairs, so a row only counts as a duplicate
# when both its (sid, vid) index and its pageid match the previous row
df['duped'] = [dupe_checker.is_duplicate(x) for x in df['pageid'].items()]
df

Then you can simply drop the duplicated rows.

df = df[~df['duped']]
df.drop(columns='duped', inplace=True)

This gives

        pageid
sid vid       
1   ABC    dog
2   DEF    cat
    DEF    pig
    DEF    cat
3   GHI    pig
    GHI    cat
    GHI    dog
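The same stateful idea can also be written without a class (a sketch of my own, not from the answer above), by comparing adjacent (index, value) pairs directly:

```python
import pandas as pd

i = pd.MultiIndex.from_arrays(
    [[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
     ['ABC'] * 4 + ['DEF'] * 4 + ['GHI'] * 4],
    names=('sid', 'vid'))
df = pd.DataFrame(
    {'pageid': ['dog', 'dog', 'dog', 'dog', 'cat', 'cat', 'pig', 'cat',
                'pig', 'cat', 'dog', 'dog']},
    index=i)

# pair each row's (sid, vid) index entry with its pageid, so a run
# ends whenever either the session or the value changes
pairs = list(zip(df.index, df['pageid']))

# keep the first row, then every row that differs from its predecessor
mask = [True] + [cur != prev for prev, cur in zip(pairs, pairs[1:])]
out = df[mask]
```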