Series of Multiple Lengths

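The following code checks for partial matches between two columns and adds a note column indicating whether a partial match exists (it works great!):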

import pandas as pd
import numpy as np

x = {'Non-Suffix' : ['1234569', '1234554', '1234567', '1234568','Hello'], 'Suffix' : ['1234567:C', '1234568:VXCF', 'ABCDEFU', '1234569-01', '1234554-01:XC']}
x = pd.DataFrame({k: pd.Series(v) for k, v in x.items()})
x['"Non-Suffix" Partial Match in "Suffix"?'] = x['Non-Suffix'].apply(lambda v: x['Suffix'].str.contains(v).any()).replace({True: '--'}).replace({False: 'Add to Suffix'}).replace({np.nan: '--'})
x['"Suffix" Partial Match in "Non-Suffix"?'] = x['Suffix'].str.contains('|'.join(x['Non-Suffix'])).replace({True: '--'}).replace({False: 'Remove from Suffix'}).replace({np.nan: '--'})
x
#code breaks if anything is added to 'Suffix' column

In practice, though, I won't always be comparing columns of the same length; most of the time the lengths will differ. If I add a value to the Suffix column ('HelloAdele'), the code breaks:

x = {'Non-Suffix' : ['1234569', '1234554', '1234567', '1234568','Hello'], 'Suffix' : ['1234567:C', '1234568:VXCF', 'ABCDEFU', '1234569-01', '1234554-01:XC','HelloAdele']}
x = pd.DataFrame({k: pd.Series(v) for k, v in x.items()})
x['"Non-Suffix" Partial Match in "Suffix"?'] = x['Non-Suffix'].apply(lambda v: x['Suffix'].str.contains(v).any()).replace({True: '--'}).replace({False: 'Add to Suffix'}).replace({np.nan: '--'})
x['"Suffix" Partial Match in "Non-Suffix"?'] = x['Suffix'].str.contains('|'.join(x['Non-Suffix'])).replace({True: '--'}).replace({False: 'Remove from Suffix'}).replace({np.nan: '--'})
x
#code breaks if anything is added to 'Suffix' column

This is the error, which confirms that the columns have different lengths:

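I would like to be able to add something (such as 'HelloAdele' in the Suffix column) without the code breaking. Note: I can add values to the Non-Suffix column, but not to the Suffix column. Any tips on how to overcome this are much appreciated!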
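A minimal sketch of what probably goes wrong, assuming the failure comes from the padding that pandas applies when the two lists differ in length: the shorter column gets a NaN in the extra row, and NaN is a float rather than a string. The variable names `data` and `join_failed` are introduced here for illustration only.

```python
import pandas as pd

data = {
    'Non-Suffix': ['1234569', '1234554', '1234567', '1234568', 'Hello'],
    'Suffix': ['1234567:C', '1234568:VXCF', 'ABCDEFU', '1234569-01',
               '1234554-01:XC', 'HelloAdele'],
}
x = pd.DataFrame({k: pd.Series(v) for k, v in data.items()})

# The shorter column is padded with a missing value in the last row.
print(x['Non-Suffix'].isna().tolist())
# [False, False, False, False, False, True]

# Building the regex pattern fails because str.join() only accepts strings,
# and the padded NaN is a float.
try:
    '|'.join(x['Non-Suffix'])
    join_failed = False
except TypeError as exc:
    join_failed = True
    print('TypeError:', exc)
```

If this is the cause, it would also explain the asymmetry described above: both expressions use the Non-Suffix values as search patterns, so a NaN in Non-Suffix breaks them, while a NaN in Suffix is only searched in and is handled by the final `.replace({np.nan: '--'})`.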

Updated to make sure that a NaN in the Non-Suffix column does not produce an 'Add to Suffix' value.

I think something like this should work:

import pandas as pd
import numpy as np

x = {'Non-Suffix' : ['1234569', '1234554', '1234567', '1234568','Hello'], 'Suffix' : ['1234567:C', '1234568:VXCF', 'ABCDEFU', '1234569-01', '1234554-01:XC','HelloAdele']}
x = pd.DataFrame({k: pd.Series(v) for k, v in x.items()})
print()
print(x)
x['"Non-Suffix" Partial Match in "Suffix"?'] = x['Non-Suffix'].apply(
    lambda v: np.nan if v is np.nan else x['Suffix'].str.contains(v).any()).replace({True: '--'}).replace({False: 'Add to Suffix'}).replace({np.nan: '--'})
x['"Suffix" Partial Match in "Non-Suffix"?'] = x['Suffix'].str.contains('|'.join(
    y for y in x['Non-Suffix'] if y is not np.nan)).replace({True: '--'}).replace({False: 'Remove from Suffix'}).replace({np.nan: '--'})
print(x)

We basically special-case NaN in Non-Suffix and set the result to np.nan (which is later replaced with '--'), and for Suffix we skip NaN values when building the pattern to match against.
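As a possible variation on the same idea, here is a sketch that uses `pd.isna` instead of the identity check against `np.nan` and treats the Non-Suffix values as literal text rather than regular expressions. The helper name `label_non_suffix` and the intermediate variables are hypothetical, and whether literal matching is wanted depends on the real data.

```python
import re

import numpy as np
import pandas as pd

data = {
    'Non-Suffix': ['1234569', '1234554', '1234567', '1234568', 'Hello'],
    'Suffix': ['1234567:C', '1234568:VXCF', 'ABCDEFU', '1234569-01',
               '1234554-01:XC', 'HelloAdele'],
}
x = pd.DataFrame({k: pd.Series(v) for k, v in data.items()})

# Work with the real values only; the padding NaN is dropped.
non_suffix = x['Non-Suffix'].dropna()
suffix = x['Suffix'].dropna()


def label_non_suffix(v):
    """Return '--' for padding rows and for values found in any Suffix entry."""
    if pd.isna(v):
        return '--'
    # regex=False searches for v as a plain substring.
    found = suffix.str.contains(v, regex=False).any()
    return '--' if found else 'Add to Suffix'


x['"Non-Suffix" Partial Match in "Suffix"?'] = x['Non-Suffix'].apply(label_non_suffix)

# Escape each value so characters such as '.' or '+' are matched literally.
pattern = '|'.join(re.escape(v) for v in non_suffix)
# na=True marks padding rows in Suffix as matched, so they get '--' as well.
matched = x['Suffix'].str.contains(pattern, na=True)
x['"Suffix" Partial Match in "Non-Suffix"?'] = np.where(
    matched, '--', 'Remove from Suffix')

print(x)
```

With the sample data this should give the same labels as the code above; the difference only shows up when a value contains regex metacharacters or when the missing value is not the exact `np.nan` object.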

Input:

  Non-Suffix         Suffix
0    1234569      1234567:C
1    1234554   1234568:VXCF
2    1234567        ABCDEFU
3    1234568     1234569-01
4      Hello  1234554-01:XC
5        NaN     HelloAdele

Output:

  Non-Suffix         Suffix "Non-Suffix" Partial Match in "Suffix"? "Suffix" Partial Match in "Non-Suffix"?
0    1234569      1234567:C                                      --                                      --
1    1234554   1234568:VXCF                                      --                                      --
2    1234567        ABCDEFU                                      --                      Remove from Suffix
3    1234568     1234569-01                                      --                                      --
4      Hello  1234554-01:XC                                      --                                      --
5        NaN     HelloAdele                                      --                                      --