Pandas - Remove part of string in column that is already in another column

I have this DataFrame:

import pandas as pd
dfA = pd.DataFrame({
            'A': ['abc','ghi','mno', 'stu'],
            'B': ['abcdef', 'jklghi', 'mnopqr', 'vwxstu']
         })
dfA

I want to get this DataFrame:

dfB = pd.DataFrame({
            'A': ['abc','ghi','mno', 'stu'],
            'B': ['abcdef', 'jklghi', 'mnopqr', 'vwxstu'],
            'C': ['def', 'jkl', 'pqr', 'vwx'],
         })
dfB

Column 'C' must contain the substring of column 'B' that is not in column 'A'.

I tried copying column 'B' to 'C' and then using df.replace() as shown below, but it does not work:

dfA = pd.DataFrame({
            'A': ['abc','ghi','mno', 'stu'],
            'B': ['abcdef', 'jklghi', 'mnopqr', 'vwxstu']
         })
dfA.loc[:,'C'] = dfA['B']

dfA['C'].replace(dfA['B'], '', regex=True)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_1611271772080.py in <cell line: 7>()
      5 dfA.loc[:,'C'] = dfA['B']
      6 
----> 7 dfA['C'].replace(dfA['B'], '', regex=True)

~\Anaconda3\envs\py310\lib\site-packages\pandas\core\series.py in replace(self, to_replace, value, inplace, limit, regex, method)
   4958         method: str | lib.NoDefault = lib.no_default,
   4959     ):
-> 4960         return super().replace(
   4961             to_replace=to_replace,
   4962             value=value,

~\Anaconda3\envs\py310\lib\site-packages\pandas\core\generic.py in replace(self, to_replace, value, inplace, limit, regex, method)
   6677                     # Operate column-wise
   6678                     if self.ndim == 1:
-> 6679                         raise ValueError(
   6680                             "Series.replace cannot use dict-like to_replace "
   6681                             "and non-None value"

ValueError: Series.replace cannot use dict-like to_replace and non-None value

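Also, the string in 'A' is always a prefix or suffix of the string in column 'B', so column 'C' will be the corresponding suffix or prefix. In other words, 'B' = 'A'+'C' or 'C'+'A'. I also tried using - as a "de-concatenation" operator, but that does not work either.

Do you know how I should do this?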

You need to loop here.
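Judging from the traceback in the question, one likely reason the attempt fails is that Series.replace treats a Series passed as to_replace as dict-like, and it rejects that combination with a scalar value instead of pairing the two columns row by row. A quick check, as a sketch using the public pandas.api.types.is_dict_like helper:

```python
import pandas as pd
from pandas.api.types import is_dict_like

dfA = pd.DataFrame({
    'A': ['abc', 'ghi', 'mno', 'stu'],
    'B': ['abcdef', 'jklghi', 'mnopqr', 'vwxstu'],
})

# A Series exposes keys(), __getitem__ and __contains__,
# so pandas classifies it as dict-like.
series_is_dict_like = is_dict_like(dfA['B'])
# A plain string has no keys() method, so it is not dict-like.
string_is_dict_like = is_dict_like('abc')

print(series_is_dict_like)  # True
print(string_is_dict_like)  # False
```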

You can use re.sub:

import re
# re.escape makes each value of 'A' match literally rather than as a regex pattern
dfA['C'] = [re.sub(re.escape(a), '', b) for a,b in zip(dfA['A'], dfA['B'])]

Or str.replace:

dfA['C'] = [b.replace(a, '') for a,b in zip(dfA['A'], dfA['B'])]

Output:

     A       B    C
0  abc  abcdef  def
1  ghi  jklghi  jkl
2  mno  mnopqr  pqr
3  stu  vwxstu  vwx
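
Because the question states that the string in 'A' is always a prefix or a suffix of the string in 'B', a stricter variant could strip it only at the ends, so that an accidental match in the middle of 'B' is left alone. This is a minimal sketch, not part of the answer above: the helper name strip_affix is hypothetical, and str.removeprefix / str.removesuffix require Python 3.9 or later.

```python
import pandas as pd

dfA = pd.DataFrame({
    'A': ['abc', 'ghi', 'mno', 'stu'],
    'B': ['abcdef', 'jklghi', 'mnopqr', 'vwxstu'],
})

def strip_affix(a: str, b: str) -> str:
    """Hypothetical helper: drop `a` from the start or the end of `b`."""
    if b.startswith(a):
        return b.removeprefix(a)
    # removesuffix returns b unchanged when b does not end with a
    return b.removesuffix(a)

# Pair the two columns row by row, as in the loops above
dfA['C'] = [strip_affix(a, b) for a, b in zip(dfA['A'], dfA['B'])]
print(dfA)
```

For the sample data this gives the same column 'C' as the loops above. The difference only shows when 'A' also occurs inside 'B': for a = 'ab' and b = 'xabyab', b.replace(a, '') returns 'xy', while strip_affix returns 'xaby'.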