Delete sequential rows within a multi-index dynamically
I have a df:
pageid
sid vid
1 ABC dog
ABC dog
ABC dog
ABC dog
2 DEF cat
DEF cat
DEF pig
DEF cat
3 GHI pig
GHI cat
GHI dog
GHI dog
Constructor:
import pandas as pd
i = pd.MultiIndex.from_arrays(
    [[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
     ['ABC', 'ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'DEF', 'DEF', 'GHI', 'GHI',
      'GHI', 'GHI']],
    names=('sid', 'vid')
)
df = pd.DataFrame({
    'pageid': ['dog', 'dog', 'dog', 'dog', 'cat', 'cat', 'pig', 'cat',
               'pig', 'cat', 'dog', 'dog']
}, index=i)
I want to remove duplicates from the pageid column if they occur within one session (sid), and if and only if they are sequential n times. The only examples I've found use .shift(), which works fine if I don't have to worry about n > 1 duplicates. Unfortunately, in some cases I get something like n = 30 sequential duplicates.
Before:
pageid
sid vid
1 ABC dog
ABC dog
ABC dog
ABC dog
2 DEF cat
DEF cat
DEF pig
DEF cat
3 GHI pig
GHI cat
GHI dog
GHI dog
After:
pageid
sid vid
1 ABC dog
2 DEF cat
DEF pig
DEF cat
3 GHI pig
GHI cat
GHI dog
Global duplicates
You can reset_index and compute duplicated:
df[~df.reset_index().duplicated().values]
Output:
pageid
sid vid
1 ABC dog
2 DEF cat
DEF pig
3 GHI pig
GHI cat
GHI dog
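If you would rather keep the last occurrence of each duplicate instead of the first, `duplicated` accepts pandas' standard `keep` parameter (a small sketch re-using the question's constructor):

```python
import pandas as pd

i = pd.MultiIndex.from_arrays(
    [[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
     ['ABC'] * 4 + ['DEF'] * 4 + ['GHI'] * 4],
    names=('sid', 'vid'))
df = pd.DataFrame({'pageid': ['dog', 'dog', 'dog', 'dog', 'cat', 'cat', 'pig',
                              'cat', 'pig', 'cat', 'dog', 'dog']}, index=i)

# keep='last' flags every occurrence of a (sid, vid, pageid) row except the last
out = df[~df.reset_index().duplicated(keep='last').values]
```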
Consecutive duplicates
df2 = df[['pageid']].reset_index()
# a row is a consecutive duplicate only if sid, vid and pageid all equal the previous row
df[~df2.eq(df2.shift()).all(1).values]
Output:
pageid
sid vid
1 ABC dog
2 DEF cat
DEF pig
DEF cat
3 GHI pig
GHI cat
GHI dog
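The same filter can also be written with a group-wise shift, which makes the "previous row within the same (sid, vid) session" comparison explicit (an equivalent sketch on the question's data):

```python
import pandas as pd

i = pd.MultiIndex.from_arrays(
    [[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
     ['ABC'] * 4 + ['DEF'] * 4 + ['GHI'] * 4],
    names=('sid', 'vid'))
df = pd.DataFrame({'pageid': ['dog', 'dog', 'dog', 'dog', 'cat', 'cat', 'pig',
                              'cat', 'pig', 'cat', 'dog', 'dog']}, index=i)

# drop a row when its pageid equals the previous pageid in its (sid, vid) group;
# shift() yields NaN at each group start, so the first row of a group is always kept
m = df['pageid'].eq(df.groupby(level=[0, 1])['pageid'].shift())
out = df[~m.values]
```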
Consecutive duplicates with a threshold
thresh = 3
df2 = df[['pageid']].reset_index()
# cumulatively count consecutive duplicates per (sid, vid, pageid) key
m = df2.eq(df2.shift()).all(1).groupby(df.set_index('pageid', append=True).index).cumsum()
df.loc[m.lt(thresh).values]
Output (example threshold: 3):
pageid
sid vid
1 ABC dog
ABC dog
ABC dog
2 DEF cat
DEF cat
DEF pig
DEF cat
3 GHI pig
GHI cat
GHI dog
GHI dog
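One caveat: the cumsum above accumulates per (sid, vid, pageid) key across the whole group, so a value whose run is interrupted and later resumes keeps its earlier count. A run-based variant (run_id and pos_in_run are names I made up) restarts the counter at every new run, and gives the same output on this data:

```python
import pandas as pd

i = pd.MultiIndex.from_arrays(
    [[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
     ['ABC'] * 4 + ['DEF'] * 4 + ['GHI'] * 4],
    names=('sid', 'vid'))
df = pd.DataFrame({'pageid': ['dog', 'dog', 'dog', 'dog', 'cat', 'cat', 'pig',
                              'cat', 'pig', 'cat', 'dog', 'dog']}, index=i)

thresh = 3
s = df['pageid']
# a new run starts whenever the value differs from the previous one in its group
new_run = s.ne(df.groupby(level=[0, 1])['pageid'].shift())
run_id = new_run.cumsum()                         # unique label per consecutive run
pos_in_run = s.groupby(run_id.values).cumcount()  # 0, 1, 2, ... within each run
out = df[(pos_in_run < thresh).values]
```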
I think you can use shift on the groupby, then rolling().sum() on the groupby again:
# lazy groupby on the two index levels
groups = df.groupby(level=[0, 1])
# True where a row equals the previous row within its (sid, vid) group
df['shifted'] = groups['pageid'].shift() == df['pageid']
# keep a row only while fewer than `thresh` of the last `thresh` rows were duplicates
thresh = 2
mask = groups['shifted'].rolling(thresh).sum().fillna(0) < thresh
df.loc[mask.values]
Output:
pageid shifted
sid vid
1 ABC dog False
ABC dog True
2 DEF cat False
DEF cat True
DEF pig False
DEF cat False
3 GHI pig False
GHI cat False
GHI dog False
GHI dog True
If you can work out the order of the pageids within each multi-index, one option is to step through each element and keep state on whether it is the same as the element before it. For example:
class Duplicated:
    def __init__(self):
        self.last = None

    def is_duplicate(self, x):
        if x == self.last:
            return True
        else:
            self.last = x
            return False
i = pd.MultiIndex.from_arrays(
    [[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
     ['ABC'] * 4 + ['DEF'] * 4 + ['GHI'] * 4],
    names=('sid', 'vid'))
df = pd.DataFrame({'pageid': ['dog', 'dog', 'dog', 'dog', 'cat', 'cat', 'pig',
                              'cat', 'pig', 'cat', 'dog', 'dog']}, index=i)
dupe_checker = Duplicated()
# iterate over (index, value) pairs so equal values only match within the same
# (sid, vid) group (Series.iteritems was removed in pandas 2.0; use items())
df['duped'] = [dupe_checker.is_duplicate(x) for x in df['pageid'].items()]
df
Then you can simply drop the duplicated rows.
df = df[~df['duped']]
df.drop(columns='duped', inplace=True)
Giving
pageid
sid vid
1 ABC dog
2 DEF cat
DEF pig
DEF cat
3 GHI pig
GHI cat
GHI dog
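If you need the question's threshold n rather than dropping every consecutive duplicate, the same stateful idea extends with a counter (RunCounter and n are my own hypothetical names, not part of the answer above):

```python
import pandas as pd

class RunCounter:
    """Count how many times the current item has repeated consecutively."""
    def __init__(self):
        self.last = None
        self.count = 0

    def seen(self, x):
        if x == self.last:
            self.count += 1
        else:
            self.last = x
            self.count = 0
        return self.count

i = pd.MultiIndex.from_arrays(
    [[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
     ['ABC'] * 4 + ['DEF'] * 4 + ['GHI'] * 4],
    names=('sid', 'vid'))
df = pd.DataFrame({'pageid': ['dog', 'dog', 'dog', 'dog', 'cat', 'cat', 'pig',
                              'cat', 'pig', 'cat', 'dog', 'dog']}, index=i)

n = 3  # keep at most n consecutive copies of a pageid per (sid, vid) session
rc = RunCounter()
# iterate over (index, value) pairs so runs reset when the session key changes
kept = df[[rc.seen(x) < n for x in df['pageid'].items()]]
```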