Looking for duplicates and consecutive duplicates in pandas dataframes (Python)
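I want to add a function to the pandas dataframe below that shows the rows where the Open, High, Low and Close values are all the same. There are also 2 instances in the High column with 2 consecutive duplicates, one starting at 2 and ending at 3, and the other starting at 4 and ending at 5. How can I code this?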
import pandas as pd
import numpy as np
import time
import datetime
A =[[1645661520000, 37352.0, 37376.5, 37352.0, 37376.0, 15.56119087],
[1645661580000, 37376.0, 37414.0, 37376.0, 37414.0, 49.38248589],
[1645661640000, 37414.0, 37414.0, 37350.0, 37350.0, 45.70306699],
[1645661700000, 37350.0, 37374.0, 37350.0, 37373.5, 14.4306948],
[1645661760000, 37373.5, 37388.0, 37373.5, 37388.0, 3.59340947],
[1645661820000, 37388.0, 37388.0, 37388.0, 37388.0, 21.45525727]]
column_names = ["Unix","Open", "High","Low", "Close", "Volume"]
df = pd.DataFrame(A, columns=column_names)
#Dates = Local_timezone(df["Unix"].to_numpy()/1000)
df.insert(1,"Date", pd.to_datetime(df["Unix"].to_numpy()/1000,unit='s'))
Expected output
Rows with all duplicate values: 6
Consecutive duplicate values in columns:
Column Max Duplicates Start End
Open 0 - -
High 2 [2,4] [3,5]
Low 2 3 4
Close 2 4 5
Here is my approach. Keep in mind that you can change the start and end logic by using .ne, .eq and shift:
print(df)
Unix Date Open High Low Close \
0 1645661520000 2022-02-24 00:12:00 37352.0 37376.5 37352.0 37376.0
1 1645661580000 2022-02-24 00:13:00 37376.0 37414.0 37376.0 37414.0
2 1645661640000 2022-02-24 00:14:00 37414.0 37414.0 37350.0 37350.0
3 1645661700000 2022-02-24 00:15:00 37350.0 37374.0 37350.0 37373.5
4 1645661760000 2022-02-24 00:16:00 37373.5 37388.0 37373.5 37388.0
5 1645661820000 2022-02-24 00:17:00 37388.0 37388.0 37388.0 37388.0
Volume
0 15.561191
1 49.382486
2 45.703067
3 14.430695
4 3.593409
5 21.455257
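cols = ["Open", "High", "Low", "Close"]
# Label each run of equal consecutive values per column: the counter
# increases whenever a value differs from the one in the row above.
df2 = df.ne(df.shift()).cumsum()[cols].melt()
# Count the rows in each run and keep the largest count per column.
# A run of length 1 is not a duplicate, so it is reported as 0.
max_dups = (
    pd.crosstab(df2['variable'], df2['value'])
    .max(axis=1)
    .where(lambda x: x.gt(1), 0)
    .rename('Max Duplicates')
)
#variable
#Close    2
#High     2
#Low      2
#Open     0
#Name: Max Duplicates, dtype: int64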
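To make the ne/shift/cumsum step easier to follow, here is a minimal sketch of the same idea on the High column alone. The names `high`, `changed`, `run_id` and `run_lengths` are mine and are not part of the code above; `value_counts` stands in for the crosstab, which does the same counting for all columns at once.

```python
import pandas as pd

# The High column from the question's data.
high = pd.Series([37376.5, 37414.0, 37414.0, 37374.0, 37388.0, 37388.0], name="High")

# True wherever the value differs from the row above
# (row 0 always counts as a change because shift() gives NaN there).
changed = high.ne(high.shift())

# The cumulative sum of the change flags gives one integer label per run.
run_id = changed.cumsum()

# Number of rows in each run, ordered by run label, and the longest run.
run_lengths = run_id.value_counts().sort_index()
longest = int(run_lengths.max())

print(run_id.tolist())       # [1, 2, 2, 3, 4, 4]
print(run_lengths.tolist())  # [1, 2, 1, 2]
print(longest)               # 2
```

Rows 1-2 and rows 4-5 share a label, which is why High ends up with two runs of length 2.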
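# A run starts where a value differs from the previous row but equals the next one,
# and ends where it equals the previous row but differs from the next one.
start = df.ne(df.shift()) & df.eq(df.shift(-1))
end = df.eq(df.shift()) & df.ne(df.shift(-1))
# Collect the index labels flagged True in each column (NaN if there are none).
f = lambda x: x.index[x].tolist() if np.any(x) else np.nan
df_result = (
    pd.concat([start.apply(f).dropna().rename('Start'),
               end.apply(f).dropna().rename('End'),
               max_dups], axis=1)
    .reindex(cols).rename_axis(index='Column').fillna('-')
)
print(df_result)
#alternative to apply
#start.reset_index().melt('index').loc[lambda x: x['value']].groupby('variable')['index'].agg(list)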
Start End Max Duplicates
Column
Open - - 0
High [1, 4] [2, 5] 2
Low [2] [3] 2
Close [4] [5] 2
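The result above covers the per-column part of the expected output but not its first line ("Rows with all duplicate values"). A minimal sketch for that part follows. It assumes "all duplicate values" means Open, High, Low and Close are equal within a row, and that the 6 in the expected output is a 1-based row number. The names `prices`, `all_equal` and `flat_rows` are hypothetical.

```python
import pandas as pd

cols = ["Open", "High", "Low", "Close"]
# Same price columns as in the question (Unix, Date and Volume omitted).
prices = pd.DataFrame(
    [[37352.0, 37376.5, 37352.0, 37376.0],
     [37376.0, 37414.0, 37376.0, 37414.0],
     [37414.0, 37414.0, 37350.0, 37350.0],
     [37350.0, 37374.0, 37350.0, 37373.5],
     [37373.5, 37388.0, 37373.5, 37388.0],
     [37388.0, 37388.0, 37388.0, 37388.0]],
    columns=cols,
)

# A row qualifies when it holds a single distinct value across the four columns.
all_equal = prices.nunique(axis=1).eq(1)

# 0-based index labels, plus 1-based row numbers (assumed to be what the question wants).
flat_rows = prices.index[all_equal].tolist()
flat_rows_1based = [i + 1 for i in flat_rows]

print("Rows with all duplicate values:", flat_rows_1based)  # Rows with all duplicate values: [6]
```

With the original dataframe the same check would be `df[cols].nunique(axis=1).eq(1)`.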