Finding rows with duplicate values and consecutive duplicates in a pandas DataFrame (Python)

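I want to add a function to the pandas DataFrame below that shows the rows whose Open, High, Low, and Close values are all the same. There are 2 instances of 2 consecutive duplicates in the High column: one starts at 2 and ends at 3, the other starts at 4 and ends at 5. How can I code this?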

import pandas as pd
import numpy as np
import time
import datetime

A =[[1645661520000, 37352.0, 37376.5, 37352.0, 37376.0, 15.56119087], 
[1645661580000, 37376.0, 37414.0, 37376.0, 37414.0, 49.38248589], 
[1645661640000, 37414.0, 37414.0, 37350.0, 37350.0, 45.70306699], 
[1645661700000, 37350.0, 37374.0, 37350.0, 37373.5, 14.4306948], 
[1645661760000, 37373.5, 37388.0, 37373.5, 37388.0, 3.59340947], 
[1645661820000, 37388.0, 37388.0, 37388.0, 37388.0, 21.45525727]]

column_names = ["Unix","Open", "High","Low", "Close", "Volume"]
df = pd.DataFrame(A, columns=column_names)
#Dates = Local_timezone(df["Unix"].to_numpy()/1000)
df.insert(1,"Date", pd.to_datetime(df["Unix"].to_numpy()/1000,unit='s'))

Expected output

Rows with all duplicate values: 6

Consecutive duplicate values in columns:
Column   Max Duplicates     Start       End 
Open     0                  -           -
High     2                 [2,4]        [3,5]
Low      2                 3             4
Close    2                 4             5

Here is my approach. Keep in mind that you can change the start and end logic by adjusting how .ne, .eq, and .shift are combined:

print(df)


            Unix                Date     Open     High      Low    Close  \
0  1645661520000 2022-02-24 00:12:00  37352.0  37376.5  37352.0  37376.0   
1  1645661580000 2022-02-24 00:13:00  37376.0  37414.0  37376.0  37414.0   
2  1645661640000 2022-02-24 00:14:00  37414.0  37414.0  37350.0  37350.0   
3  1645661700000 2022-02-24 00:15:00  37350.0  37374.0  37350.0  37373.5   
4  1645661760000 2022-02-24 00:16:00  37373.5  37388.0  37373.5  37388.0   
5  1645661820000 2022-02-24 00:17:00  37388.0  37388.0  37388.0  37388.0   

      Volume  
0  15.561191  
1  49.382486  
2  45.703067  
3  14.430695  
4   3.593409  
5  21.455257  


cols = ["Open", "High","Low", "Close"]
df2 = df.ne(df.shift()).cumsum()[cols].melt()
max_dups = \
pd.crosstab(df2['variable'], 
            df2['value'])\
.max(axis=1)\
.where(lambda x: x.gt(1), 0)\
.rename('Max Duplicates')
#variable
#Close    2
#High     2
#Low      2
#Open     0
#Name: Max Duplicates, dtype: int64
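
To make the ne/shift/cumsum step easier to follow, here is a minimal sketch of the same idea on a single column. It uses the High values from the question; the names high, is_new_run, run_id and run_lengths are only illustrative and are not part of the code above, and groupby().size() is swapped in for the crosstab.

```python
import pandas as pd

# The High column from the question's data.
high = pd.Series(
    [37376.5, 37414.0, 37414.0, 37374.0, 37388.0, 37388.0], name="High"
)

# True wherever a value differs from the one before it (row 0 is always True,
# because it is compared with the NaN introduced by shift()).
is_new_run = high.ne(high.shift())

# The cumulative sum of those flags gives each run of equal values its own id.
run_id = is_new_run.cumsum()

# Number of rows in each run, and the length of the longest run.
run_lengths = high.groupby(run_id).size()
longest_run = int(run_lengths.max())

print(run_id.tolist())  # [1, 2, 2, 3, 4, 4]
print(longest_run)      # 2
```

A longest run of 1 means the column has no consecutive duplicates, which is why the code above maps anything not greater than 1 to 0.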



start = df.ne(df.shift()) & df.eq(df.shift(-1))
end = df.eq(df.shift()) & df.ne(df.shift(-1))

f = lambda x: x.index[x].tolist() if np.any(x) else np.nan
df_result = \
pd.concat([start.apply(f).dropna().rename('Start'),
           end.apply(f).dropna().rename('End'),
           max_dups],axis=1)\
.reindex(cols).rename_axis(index='Column').fillna('-')
print(df_result)
#alternative to apply
#start.reset_index().melt('index').loc[lambda x: x['value']].groupby('variable')['index'].agg(list)


         Start     End  Max Duplicates
Column                                
Open         -       -               0
High    [1, 4]  [2, 5]               2
Low        [2]     [3]               2
Close      [4]     [5]               2
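
The code above covers the consecutive duplicates per column, but not the "Rows with all duplicate values" line from the expected output. One possible way to get it, as a sketch and assuming that line means rows whose Open, High, Low and Close are all identical, is to count the distinct values in each row. The names ohlc, all_equal and equal_rows are hypothetical, and only the four price columns are rebuilt here to keep the example self-contained.

```python
import pandas as pd

# Open, High, Low, Close values copied from the question's data.
ohlc = pd.DataFrame(
    [
        [37352.0, 37376.5, 37352.0, 37376.0],
        [37376.0, 37414.0, 37376.0, 37414.0],
        [37414.0, 37414.0, 37350.0, 37350.0],
        [37350.0, 37374.0, 37350.0, 37373.5],
        [37373.5, 37388.0, 37373.5, 37388.0],
        [37388.0, 37388.0, 37388.0, 37388.0],
    ],
    columns=["Open", "High", "Low", "Close"],
)

# A row has four identical prices exactly when it holds one distinct value.
all_equal = ohlc.nunique(axis=1).eq(1)

# Index labels (0-based here) of the rows that satisfy the condition.
equal_rows = ohlc.index[all_equal].tolist()

print("Rows with all duplicate values:", equal_rows)  # [5]
```

The result uses the 0-based index of the DataFrame, so the last row is reported as 5. If the 6 in the expected output counts rows from 1, adding 1 to each label would reproduce it.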