Identify records that are present in one set of years and not in another set of years
I am trying to label rows based on ID and year: if an ID appears in the years [2017, 2018, 2019] and does not appear in [2020, 2021, 2022], it should be labeled 1, otherwise 0.
df1 = pd.DataFrame({'ID': ['AX1', 'Ax1', 'AX1','AX1','AX1','AX1','AX2','AX2','AX2','AX3','AX3','AX4','AX4','AX4'],'year':[2017,2018,2019,2020,2021,2022,2019,2020,2022,2019,2020,2017,2018,2019]})
ID year
0 AX1 2017
1 Ax1 2018
2 AX1 2019
3 AX1 2020
4 AX1 2021
5 AX1 2022
6 AX2 2019
7 AX2 2020
8 AX2 2022
9 AX3 2019
10 AX3 2020
11 AX4 2017
12 AX4 2018
13 AX4 2019
Expected output:
ID year label
0 AX1 2017 0
1 Ax1 2018 0
2 AX1 2019 0
3 AX1 2020 0
4 AX1 2021 0
5 AX1 2022 0
6 AX2 2019 0
7 AX2 2020 0
8 AX2 2022 0
9 AX3 2019 0
10 AX3 2020 0
11 AX4 2017 1
12 AX4 2018 1
13 AX4 2019 1
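In the example above, ID AX4 is labeled 1 because it is the only ID that appears in the first set of years [2017, 2018, 2019] and does not appear in the second set [2020, 2021, 2022].
How can I achieve this?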
Use:
df1 = pd.DataFrame({'ID': ['AX1', 'AX1', 'AX1','AX1','AX1','AX1','AX2','AX2','AX2','AX3','AX3','AX4','AX4','AX4'],'year':[2017,2018,2019,2020,2021,2022,2019,2020,2022,2019,2020,2017,2018,2019]})
# find group level labels by checking if all of 2017-19 and none of 2020-22 exist for each ID
gr_lbl = df1.groupby('ID')['year'].apply(lambda g: {2017,2018,2019}.issubset(g) and not bool({2020,2021,2022}.intersection(g)))*1
# map group level labels to ID
df1['labels'] = df1.ID.map(gr_lbl)
df1
import pandas as pd
df1 = pd.DataFrame({'ID': ['AX1', 'Ax1', 'AX1','AX1','AX1','AX1','AX2','AX2','AX2','AX3','AX3','AX4','AX4','AX4'],'year':[2017,2018,2019,2020,2021,2022,2019,2020,2022,2019,2020,2017,2018,2019]})
# IDs seen in any of 2017-2019, and IDs seen in any of 2020-2022 (case-insensitive)
include = set()
exclude = set()
for ID, year in zip(df1['ID'], df1['year']):
    if year in [2017,2018,2019]:
        include.add(ID.upper())
    if year in [2020,2021,2022]:
        exclude.add(ID.upper())
# label 1 only for IDs in the first set that never appear in the second
df1['label'] = [int(x.upper() in include - exclude) for x in df1['ID']]
print(df1)
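The loop above can also be written without iterating in Python. This is a minimal sketch, assuming (as that loop does) that IDs are compared case-insensitively and that appearing in any one of 2017-2019 is enough. The names key, has_first and has_second are introduced here only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'ID': ['AX1', 'Ax1', 'AX1', 'AX1', 'AX1', 'AX1', 'AX2', 'AX2', 'AX2',
           'AX3', 'AX3', 'AX4', 'AX4', 'AX4'],
    'year': [2017, 2018, 2019, 2020, 2021, 2022, 2019, 2020, 2022,
             2019, 2020, 2017, 2018, 2019],
})

# Assumption: 'Ax1' and 'AX1' are the same ID, so group on the upper-cased key.
key = df1['ID'].str.upper()

# Row-level flags: does this row's year fall in the first / second set?
in_first = df1['year'].isin([2017, 2018, 2019])
in_second = df1['year'].isin([2020, 2021, 2022])

# Broadcast "any row of this ID is flagged" back to every row of that ID.
has_first = in_first.groupby(key).transform('any')
has_second = in_second.groupby(key).transform('any')

df1['label'] = (has_first & ~has_second).astype(int)
print(df1)
```

If every one of 2017-2019 must be present rather than any one of them, the set-based approaches in the other answers are a closer fit.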
Create a Series by aggregating the years of each ID into a set, then compare with set.issubset, and finally map the result to a new column:
y1 = set([2017,2018,2019])
y2 = set([2020,2021,2022])
s = df1.groupby('ID')['year'].agg(set)
df1['label'] = df1['ID'].map((s.map(y1.issubset) & ~s.map(y2.issubset)).astype(int))
print (df1)
ID year label
0 AX1 2017 0
1 Ax1 2018 0
2 AX1 2019 0
3 AX1 2020 0
4 AX1 2021 0
5 AX1 2022 0
6 AX2 2019 0
7 AX2 2020 0
8 AX2 2022 0
9 AX3 2019 0
10 AX3 2020 0
11 AX4 2017 1
12 AX4 2018 1
13 AX4 2019 1
Details:
print (df1.groupby('ID')['year'].agg(set))
ID
AX1 {2017, 2019, 2020, 2021, 2022}
AX2 {2019, 2020, 2022}
AX3 {2019, 2020}
AX4 {2017, 2018, 2019}
Ax1 {2018}
Name: year, dtype: object
print ((s.map(y1.issubset) & ~s.map(y2.issubset)).astype(int))
ID
AX1 0
AX2 0
AX3 0
AX4 1
Ax1 0
Name: year, dtype: int32
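One caveat worth checking against your own data: ~s.map(y2.issubset) only excludes IDs that contain all three of 2020-2022. If the intent is that none of those years may appear, set.isdisjoint is one way to express it. The sketch below uses a small made-up frame (demo, with hypothetical IDs B1 and B2) to show where the two conditions differ; on the sample data in the question they give the same labels.

```python
import pandas as pd

# Hypothetical data: B1 has all of 2017-2019 plus one later year, B2 has only 2017-2019.
demo = pd.DataFrame({
    'ID': ['B1', 'B1', 'B1', 'B1', 'B2', 'B2', 'B2'],
    'year': [2017, 2018, 2019, 2020, 2017, 2018, 2019],
})

y1 = {2017, 2018, 2019}
y2 = {2020, 2021, 2022}
s = demo.groupby('ID')['year'].agg(set)

# "not all of the later years present": B1 still passes, since it lacks 2021 and 2022.
not_all_later = (s.map(y1.issubset) & ~s.map(y2.issubset)).astype(int)

# "none of the later years present": B1 fails because it contains 2020.
none_later = (s.map(y1.issubset) & s.map(y2.isdisjoint)).astype(int)

print(not_all_later)
print(none_later)
```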