Python - Select all rows where there is data for each hour in a day
I have a DataFrame containing a series of dates:
Year Month Day Hour
2020 12 3 22
2021 1 1 0
2021 1 1 1
2021 1 1 2
...
2021 1 1 23
2021 1 2 1
2021 1 2 3
...
I want to return all rows for the dates that have data for all 24 hours of the day. In the example above, I only want to return these rows:
2021 1 1 0
2021 1 1 1
2021 1 1 2
...
2021 1 1 23
My dataset is very long. Any help would be appreciated. Thanks.
import pandas as pd
import random as rd
# generate dummy data (random, so some dates are not valid calendar dates)
sz = 40000
df = pd.DataFrame()
df['Y'] = [rd.randint(2020, 2021) for _ in range(sz)]
df['M'] = [rd.randint(1, 12) for _ in range(sz)]
df['D'] = [rd.randint(1, 31) for _ in range(sz)]
df['H'] = [rd.randint(0, 23) for _ in range(sz)]
# reference sequence of the 24 hours in a day
h24 = list(range(24))
# group by date and check whether the group contains all 24 distinct hours;
# groups that do not are returned as None and dropped,
# the remaining groups are exploded into one row per hour
df = (
    df.groupby(by=['Y', 'M', 'D'])
    .apply(lambda x: h24 if x['H'].nunique() == 24 else None)
    .dropna()
    .explode()
    .reset_index()
    .rename(columns={0: "H"})
)
print(df)
Prints:
Y M D H
0 2020 1 3 0
1 2020 1 3 1
2 2020 1 3 2
3 2020 1 3 3
4 2020 1 3 4
... ... .. .. ..
1363 2021 12 11 19
1364 2021 12 11 20
1365 2021 12 11 21
1366 2021 12 11 22
1367 2021 12 11 23
[1368 rows x 4 columns]
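Note that the approach above rebuilds one row per hour for each complete date, so it does not return the original rows themselves (duplicates and any extra columns are lost). As a possible alternative, here is a minimal sketch that filters the original rows instead, using `groupby(...).transform("nunique")`. It uses the column names from the question and assumes `Hour` only holds integers from 0 to 23; the small frame and the names `hours_per_day` and `complete_days` are made up for illustration.

```python
import pandas as pd

# Small deterministic frame modelled on the question:
# 2021-01-01 has all 24 hours, the other two dates do not.
rows = [(2020, 12, 3, 22)]
rows += [(2021, 1, 1, h) for h in range(24)]
rows += [(2021, 1, 2, 1), (2021, 1, 2, 3)]
df = pd.DataFrame(rows, columns=["Year", "Month", "Day", "Hour"])

# For every row, count the distinct hours present on that row's date.
# transform() returns a Series aligned with df, so it can be used as a mask.
hours_per_day = df.groupby(["Year", "Month", "Day"])["Hour"].transform("nunique")

# Keep only the original rows whose date covers all 24 hours.
complete_days = df[hours_per_day == 24]
print(complete_days)
```

Because the mask is computed with a built-in aggregation rather than a Python lambda per group, this should scale reasonably to a long dataset. If `Hour` could contain values outside 0 to 23, the check would need to be tightened, for example by first filtering with `df["Hour"].between(0, 23)`.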