Compare Sets contained in dataframe row wise

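I wonder if you could help me with the following problem.

I have a dataframe with 2 columns containing sets of strings [column 1: reference strings, column 2: strings to check], and my goal is to get a new column with the difference between these columns. That is, the new column should contain only the strings from column 2 that are not present in column 1.

My input: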

    import pandas as pd

    data = [
        ['Aa', {'Fatty alcohols', 'Agarofuran sesquiterpenoids', 'Phenylalanine-derived alkaloids', 'Pyridine alkaloids'}, {'Luis', 'Polyamines', 'Fatty alcohols', 'Agarofuran sesquiterpenoids', 'Phenylalanine-derived alkaloids', 'Pyridine alkaloids'}],
        ['Bb', {'Agarofuran sesquiterpenoids', 'Lupane triterpenoids', 'Stigmastane steroids'}, {'Agarofuran sesquiterpenoids', 'Lupane triterpenoids', 'Stigmastane steroids', 'Miscellaneous meroterpenoids'}],
        ['Cc', {'Pyridine alkaloids'}, {'Luis', 'Pyridine alkaloids'}],
        ['Dd', {'Luis', 'Polyamines'}, {'Marco'}],
        ['Ee', {'Friedelane triterpenoids', 'Cucurbitane triterpenoids'}, {'Friedelane triterpenoids', 'Cucurbitane triterpenoids', 'Ansa macrolides'}],
    ]
    df = pd.DataFrame(data, columns=["Species", "Reference", "To_check"])
    df

What I would like to get is:

    data = [
        ['Aa', {'Fatty alcohols', 'Agarofuran sesquiterpenoids', 'Phenylalanine-derived alkaloids', 'Pyridine alkaloids'}, {'Luis', 'Polyamines', 'Fatty alcohols', 'Agarofuran sesquiterpenoids', 'Phenylalanine-derived alkaloids', 'Pyridine alkaloids'}, {'Luis', 'Polyamines'}],
        ['Bb', {'Agarofuran sesquiterpenoids', 'Lupane triterpenoids', 'Stigmastane steroids'}, {'Agarofuran sesquiterpenoids', 'Lupane triterpenoids', 'Stigmastane steroids', 'Miscellaneous meroterpenoids'}, {'Miscellaneous meroterpenoids'}],
        ['Cc', {'Pyridine alkaloids'}, {'Luis', 'Pyridine alkaloids'}, {'Luis'}],
        ['Dd', {'Luis', 'Polyamines'}, {'Marco'}, {'Marco'}],
        ['Ee', {'Friedelane triterpenoids', 'Cucurbitane triterpenoids'}, {'Friedelane triterpenoids', 'Cucurbitane triterpenoids', 'Ansa macrolides'}, {'Ansa macrolides'}],
    ]
    df_out = pd.DataFrame(data, columns=["Species", "Reference", "To_check", "New"])
    df_out

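This is what I have tried so far, but it is not what I want, so I am asking for your help:

    df_diff = df[~df['To_check'].isin(df['Reference'])]
    df_diff

This gives me a reduced dataframe containing only the rows that have differences, and if I try to create a new column with the differences, I get an error:

    df['New'] = df[~df['To_check'].isin(df['Reference'])]
    ValueError: Wrong number of items passed 3, placement implies 1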

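So far I have used sets to hold the strings in the columns, but I guess it would be the same if I used lists.

So, how can I get the new column containing the result? I would also like to know whether isin() actually compares row-wise, or whether another approach is more suitable.

I need to keep the DATAFRAME as it is... just add the new column

Thanks!!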

Use:

    df["New"] = df["To_check"] - df["Reference"]
    print(df)

Output

      Species     Reference      To_check        New
    0      Aa  {B, A, C, D}  {A, C, D, E}        {E}
    1      Bb  {B, A, C, D}  {B, A, C, D}         {}
    2      Cc  {F, A, G, E}  {B, A, D, E}     {B, D}
    3      Dd        {A, G}  {A, C, D, E}  {D, C, E}
    4      Ee        {D, C}        {D, E}        {E}

The expression:

    df["To_check"] - df["Reference"]

computes the element-wise difference between the values of df["To_check"] and df["Reference"]. It is similar to:

    # notice that this could be an alternative solution
    df["New"] = [check - reference for check, reference in zip(df["To_check"], df["Reference"])]
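As a quick check that the two forms agree, here is a minimal, self-contained sketch. It uses only two rows taken from the question's data (Cc and Dd), and the variable names by_operator and by_comprehension are made up for illustration:

```python
import pandas as pd

# Two rows borrowed from the question's data, kept small for readability.
df = pd.DataFrame(
    {
        "Species": ["Cc", "Dd"],
        "Reference": [{"Pyridine alkaloids"}, {"Luis", "Polyamines"}],
        "To_check": [{"Luis", "Pyridine alkaloids"}, {"Marco"}],
    }
)

# Operator form: on object columns, "-" is applied to each pair of sets.
by_operator = df["To_check"] - df["Reference"]

# Explicit form: the same set difference, computed row by row.
by_comprehension = [
    check - reference
    for check, reference in zip(df["To_check"], df["Reference"])
]

# The original columns are untouched; only a new column is added.
df["New"] = by_operator
print(df)
```

Both by_operator and by_comprehension should hold the same sets here, and the Reference and To_check columns stay as they were.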

The difference of sets is defined as (from the documentation):

Return a new set with elements in the set that are not in the others.
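To make that definition concrete, and to touch on the question of lists versus sets, here is a small sketch in plain Python. The sample values come from the question's Dd row plus one extra element added for illustration:

```python
reference = {"Luis", "Polyamines"}
to_check = {"Marco", "Luis"}  # "Luis" added here for illustration

# Elements of to_check that are not in reference.
print(to_check - reference)

# The method form is equivalent, and it also accepts any iterable.
print(to_check.difference(reference))
print(to_check.difference(["Luis", "Polyamines"]))

# The difference is not symmetric: swapping the operands changes the result.
print(reference - to_check)

# Lists do not support "-", so list columns would need converting to sets first.
try:
    ["Marco", "Luis"] - ["Luis"]
except TypeError as exc:
    list_error = exc
    print("lists cannot be subtracted:", exc)
```

So the `-` approach relies on the column values being sets; with lists, something like set(check) - set(reference) inside the comprehension would be one way to get the same result.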