Checking if one dataframe column is a subset of another column
I have a dataframe with the columns Enrolled_Months and Eligible_Months, built as follows:
import pandas as pd

month_list1 = [
    [(1, 2018), (2, 2018), (3, 2019)],
    [(7, 2018), (8, 2018), (10, 2018)],
    [(4, 2018), (5, 2018), (7, 2018)],
    [(1, 2019), (2, 2019), (4, 2019)]
]
month_list2 = [
    [(2, 2018), (3, 2019)],
    [(7, 2018), (8, 2018)],
    [(2, 2018), (3, 2019)],
    [(10, 2018), (11, 2019)]
]
EID = [1, 2, 3, 4]
df = pd.DataFrame({
    'EID': EID,
    'Enrolled_Months': month_list1,
    'Eligible_Months': month_list2
})
df
Out[6]:
EID Enrolled_Months Eligible_Months
0 1 [(1, 2018), (2, 2018), (3, 2019)] [(2, 2018), (3, 2019)]
1 2 [(7, 2018), (8, 2018), (10, 2018)] [(7, 2018), (8, 2018)]
2 3 [(4, 2018), (5, 2018), (7, 2018)] [(2, 2018), (3, 2019)]
3 4 [(1, 2019), (2, 2019), (4, 2019)] [(10, 2018), (11, 2019)]
I want to create a new column called Check that is True if Enrolled_Months contains every element of Eligible_Months. My desired output is:
Out[8]:
EID Enrolled_Months Eligible_Months Check
0 1 [(1, 2018), (2, 2018), (3, 2019)] [(2, 2018), (3, 2019)] True
1 2 [(7, 2018), (8, 2018), (10, 2018)] [(7, 2018), (8, 2018)] True
2 3 [(4, 2018), (5, 2018), (7, 2018)] [(2, 2018), (3, 2019)] False
3 4 [(1, 2019), (2, 2019), (4, 2019)] [(10, 2018), (11, 2019)] False
I tried the following:
df['Check'] = set(df['Eligible_Months']).issubset(df['Enrolled_Months'])
but ended up with the error TypeError: unhashable type: 'list'.
Any ideas on how to achieve this?
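Side note: the Enrolled_Months data was originally in a very different format, with each month in its own binary column and a separate Year column specifying the year (really bad design, imo). I created the list columns because I thought they would be easier to work with, but let me know if the original format is better suited to what I'm trying to achieve.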
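You can use a couple of explodes, then eval and any. Note that any reports True as soon as the two lists share a single element, which matches the subset test on the sample data but is not equivalent to it in general: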
df['Check'] = df.explode('Eligible_Months').explode('Enrolled_Months').eval('Enrolled_Months == Eligible_Months').groupby(level=0).any()
Output:
>>> df
EID Enrolled_Months Eligible_Months Check
0 1 [(1, 2018), (2, 2018), (3, 2019)] [(2, 2018), (3, 2019)] True
1 2 [(7, 2018), (8, 2018), (10, 2018)] [(7, 2018), (8, 2018)] True
2 3 [(4, 2018), (5, 2018), (7, 2018)] [(2, 2018), (3, 2019)] False
3 4 [(1, 2019), (2, 2019), (4, 2019)] [(10, 2018), (11, 2019)] False
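Because groupby(level=0).any() only tests for overlap, a row whose lists partially overlap would also be flagged. Below is a minimal sketch of a stricter variant. It assumes a two-stage reduction: any over the enrolled months for each eligible month, then all over the eligible months. The pos and match helper columns and the two-row sample frame are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: row 0 is a true subset, row 1 only partially overlaps.
df = pd.DataFrame({
    'Enrolled_Months': [[(1, 2018), (2, 2018)], [(1, 2018), (2, 2018)]],
    'Eligible_Months': [[(2, 2018)], [(2, 2018), (9, 2018)]],
})

# One row per eligible month; 'pos' numbers the eligible months within each original row.
step = df.explode('Eligible_Months')
step['pos'] = step.groupby(level=0).cumcount()

# One row per (eligible month, enrolled month) pair; keep the original row label as a column.
pairs = step.explode('Enrolled_Months').reset_index()
pairs['match'] = [
    enrolled == eligible
    for enrolled, eligible in zip(pairs['Enrolled_Months'], pairs['Eligible_Months'])
]

# Overlap test (what a single .any() gives): True if any pair matches.
overlap = pairs.groupby('index')['match'].any()

# Subset test: each eligible month must match some enrolled month.
found = pairs.groupby(['index', 'pos'])['match'].any()
df['Check'] = found.groupby(level='index').all()
```

Here overlap is True for both rows, while Check is True only for row 0. Rows with an empty Eligible_Months list explode to NaN and come out False in this sketch, whereas set().issubset(...) returns True.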
You can use df.apply() to create the new column:
df['Check'] = df.apply(
    lambda row: set(row['Eligible_Months']).issubset(row['Enrolled_Months']),
    axis=1
)
This outputs:
EID Enrolled_Months Eligible_Months Check
0 1 [(1, 2018), (2, 2018), (3, 2019)] [(2, 2018), (3, 2019)] True
1 2 [(7, 2018), (8, 2018), (10, 2018)] [(7, 2018), (8, 2018)] True
2 3 [(4, 2018), (5, 2018), (7, 2018)] [(2, 2018), (3, 2019)] False
3 4 [(1, 2019), (2, 2019), (4, 2019)] [(10, 2018), (11, 2019)] False
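For context on why the original attempt failed while the row-wise version works: set(df['Eligible_Months']) iterates over the whole column, whose cells are lists, and lists cannot be hashed. Inside a single row the elements are tuples, which are hashable. A small sketch with made-up values:

```python
import pandas as pd

col = pd.Series([[(2, 2018), (3, 2019)], [(7, 2018), (8, 2018)]])

# Building a set from the whole column hashes each cell, and each cell is a list.
try:
    set(col)
    error = None
except TypeError as exc:
    error = str(exc)  # the message mentions the unhashable list type

# Building a set from one cell hashes the tuples inside it, which works.
cell_set = set(col.iloc[0])
```

The column-level call also would not have produced one result per row; it asks a single question about the column as a whole, so a row-wise computation is needed either way.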
A list comprehension works fine:
df.assign(check=[set(l).issuperset(r)
                 for l, r in
                 zip(df.Enrolled_Months, df.Eligible_Months)])
EID Enrolled_Months Eligible_Months check
0 1 [(1, 2018), (2, 2018), (3, 2019)] [(2, 2018), (3, 2019)] True
1 2 [(7, 2018), (8, 2018), (10, 2018)] [(7, 2018), (8, 2018)] True
2 3 [(4, 2018), (5, 2018), (7, 2018)] [(2, 2018), (3, 2019)] False
3 4 [(1, 2019), (2, 2019), (4, 2019)] [(10, 2018), (11, 2019)] False
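Keep in mind that df.assign returns a new frame and leaves df itself untouched, and that the column above is named check in lowercase. A minimal sketch of storing the result on df under the name used in the question, with the <= subset operator on sets as an equivalent spelling; the sample frame is abbreviated to two rows:

```python
import pandas as pd

df = pd.DataFrame({
    'EID': [1, 3],
    'Enrolled_Months': [[(1, 2018), (2, 2018), (3, 2019)],
                        [(4, 2018), (5, 2018), (7, 2018)]],
    'Eligible_Months': [[(2, 2018), (3, 2019)],
                        [(2, 2018), (3, 2019)]],
})

# set(a) <= set(b) is True when every element of a is also in b.
df['Check'] = [
    set(eligible) <= set(enrolled)
    for enrolled, eligible in zip(df['Enrolled_Months'], df['Eligible_Months'])
]
```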