Series of Multiple Lengths
The following code checks for partial matches and adds a comment indicating whether a partial match exists (it works well!):
import pandas as pd
import numpy as np
x = {'Non-Suffix' : ['1234569', '1234554', '1234567', '1234568','Hello'], 'Suffix' : ['1234567:C', '1234568:VXCF', 'ABCDEFU', '1234569-01', '1234554-01:XC']}
x = pd.DataFrame({k: pd.Series(v) for k, v in x.items()})
x['"Non-Suffix" Partial Match in "Suffix"?'] = x['Non-Suffix'].apply(lambda v: x['Suffix'].str.contains(v).any()).replace({True: '--'}).replace({False: 'Add to Suffix'}).replace({np.nan: '--'})
x['"Suffix" Partial Match in "Non-Suffix"?'] = x['Suffix'].str.contains('|'.join(x['Non-Suffix'])).replace({True: '--'}).replace({False: 'Remove from Suffix'}).replace({np.nan: '--'})
x
#code breaks if anything is added to 'Suffix' column
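In practice, however, the columns being compared will not always be the same length; in fact, most of the time they will differ. If I add a value ('HelloAdele') to the Suffix column, the code breaks: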
x = {'Non-Suffix' : ['1234569', '1234554', '1234567', '1234568','Hello'], 'Suffix' : ['1234567:C', '1234568:VXCF', 'ABCDEFU', '1234569-01', '1234554-01:XC','HelloAdele']}
x = pd.DataFrame({k: pd.Series(v) for k, v in x.items()})
x['"Non-Suffix" Partial Match in "Suffix"?'] = x['Non-Suffix'].apply(lambda v: x['Suffix'].str.contains(v).any()).replace({True: '--'}).replace({False: 'Add to Suffix'}).replace({np.nan: '--'})
x['"Suffix" Partial Match in "Non-Suffix"?'] = x['Suffix'].str.contains('|'.join(x['Non-Suffix'])).replace({True: '--'}).replace({False: 'Remove from Suffix'}).replace({np.nan: '--'})
x
#code breaks if anything is added to 'Suffix' column
The resulting error confirms that the columns differ in length.
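The failure can be reproduced in isolation. This is a minimal sketch, assuming the error comes from the NaN padding that pandas adds to the shorter column; the smaller data dict and the names df and padded are illustrative and do not appear in the code above.

```python
import pandas as pd

# Same construction as above, but with a shorter 'Non-Suffix' column.
data = {'Non-Suffix': ['1234569', 'Hello'],
        'Suffix': ['1234567:C', '1234569-01', 'HelloAdele']}
df = pd.DataFrame({k: pd.Series(v) for k, v in data.items()})

# The shorter column is padded with NaN, which is not a string.
padded = df['Non-Suffix'].iloc[-1]
print(repr(padded), pd.isna(padded))

# str.join only accepts strings, so the NaN padding raises TypeError.
try:
    '|'.join(df['Non-Suffix'])
except TypeError as exc:
    print('join failed:', exc)
```

Adding values to Non-Suffix instead pads the Suffix column, which never reaches str.join and is tolerated by str.contains, so that direction does not fail.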
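I would like to be able to add something (such as 'HelloAdele' in the Suffix column) without the code breaking. Note: I can add values to the Non-Suffix column, but not to the Suffix column. Any tips on how to overcome this are greatly appreciated!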
Updated to make sure that NaN in the Non-Suffix column does not result in an 'Add to Suffix' value.
I think something like this should work:
import pandas as pd
import numpy as np
x = {'Non-Suffix' : ['1234569', '1234554', '1234567', '1234568','Hello'], 'Suffix' : ['1234567:C', '1234568:VXCF', 'ABCDEFU', '1234569-01', '1234554-01:XC','HelloAdele']}
x = pd.DataFrame({k: pd.Series(v) for k, v in x.items()})
print()
print(x)
# Treat the NaN padding in 'Non-Suffix' as "no comment needed"
x['"Non-Suffix" Partial Match in "Suffix"?'] = x['Non-Suffix'].apply(
    lambda v: np.nan if pd.isna(v) else x['Suffix'].str.contains(v).any()).replace({True: '--'}).replace({False: 'Add to Suffix'}).replace({np.nan: '--'})
# Skip the NaN padding when building the alternation pattern
x['"Suffix" Partial Match in "Non-Suffix"?'] = x['Suffix'].str.contains('|'.join(
    y for y in x['Non-Suffix'] if not pd.isna(y))).replace({True: '--'}).replace({False: 'Remove from Suffix'}).replace({np.nan: '--'})
print(x)
We basically special-case NaN in Non-Suffix and set the result to np.nan (which is later replaced with '--'), and for Suffix we skip NaN values when building the pattern to match.
Input:
  Non-Suffix         Suffix
0    1234569      1234567:C
1    1234554   1234568:VXCF
2    1234567        ABCDEFU
3    1234568     1234569-01
4      Hello  1234554-01:XC
5        NaN     HelloAdele
Output:
  Non-Suffix         Suffix  "Non-Suffix" Partial Match in "Suffix"?  "Suffix" Partial Match in "Non-Suffix"?
0    1234569      1234567:C                                       --                                       --
1    1234554   1234568:VXCF                                       --                                       --
2    1234567        ABCDEFU                                       --                       Remove from Suffix
3    1234568     1234569-01                                       --                                       --
4      Hello  1234554-01:XC                                       --                                       --
5        NaN     HelloAdele                                       --                                       --
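The same idea can be wrapped in a function that handles padding in either column. This is only a sketch: the name flag_partial_matches and its left/right parameters are hypothetical, and it assumes every non-missing cell is a string. Unlike the code above, it escapes the values with re.escape, so characters such as '.' or '(' are matched literally and not as regex syntax.

```python
import re

import numpy as np
import pandas as pd


def flag_partial_matches(df, left='Non-Suffix', right='Suffix'):
    """Hypothetical helper: add the two comment columns to a copy of df."""
    out = df.copy()
    left_vals = out[left].dropna()
    right_vals = out[right].dropna()

    # For each left value: is it a literal substring of any right value?
    # NaN padding is reported as '--' (nothing to add).
    left_ok = out[left].map(
        lambda v: pd.isna(v) or right_vals.str.contains(v, regex=False).any()
    ).astype(bool)
    out[f'"{left}" Partial Match in "{right}"?'] = np.where(
        left_ok, '--', f'Add to {right}')

    # For each right value: does it contain any of the left values?
    if left_vals.empty:
        # No values to search for, so only the padding gets '--'.
        right_ok = out[right].isna()
    else:
        pattern = '|'.join(re.escape(v) for v in left_vals)
        right_ok = out[right].str.contains(pattern, na=True).astype(bool)
    out[f'"{right}" Partial Match in "{left}"?'] = np.where(
        right_ok, '--', f'Remove from {right}')
    return out


data = {'Non-Suffix': ['1234569', '1234554', '1234567', '1234568', 'Hello'],
        'Suffix': ['1234567:C', '1234568:VXCF', 'ABCDEFU', '1234569-01',
                   '1234554-01:XC', 'HelloAdele']}
x = pd.DataFrame({k: pd.Series(v) for k, v in data.items()})
result = flag_partial_matches(x)
print(result)
```

Using np.where on a boolean mask avoids the chained replace calls. Dropping the NaN values up front means neither column's length matters.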