Regaining original line breaks pandas \n

Question

Background

I have the following sample df

import pandas as pd
df = pd.DataFrame({'Text' : ['\n[STUFF]\nBut the here is \n\nBase ID : 00000 Date is Here \nfollow\n',
                             '\n[OTHER]\n\n\nFound Tom Dub \nhere\n  BATH # : E12-34567 MR # 000',
                             '\n[ANY]\nJane Ja So so \nBase ID : 11111 Date\n\n\n hey the \n\n  \n    \n\n\n'],
                   'Alt_Text' : ['[STUFF]But the here is Base ID : *A* Date is Here follow',
                                 '[OTHER]Found *B* *B* here BATH # : *A* MR # *C*',
                                 '[ANY]*B* *B*So so Base ID : *A* Date hey the '],
                   'ID': [1,2,3]
                  })

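Goal

1) Create a new column New_Text 2) that regains the original line breaks \n present in the Text column but contains the content of the Alt_Text column

Example

Text column, row 0: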

\n[STUFF]\nBut the here is \n\nBase ID : 00000 Date is Here \nfollow\n

Alt_Text column, row 0:

[STUFF]But the here is Base ID : *A* Date is Here follow

Wanted

\n[STUFF]\nBut the here is \n\nBase ID : *A*  Date is Here \nfollow\n

Desired output

   Text Alt_Text ID New_Text 
0                   \n[STUFF]\nBut the here is \n\nBase ID :  *A*  Date is Here \nfollow\n  
1                   \n[OTHER]\n\n\nFound *B* *B*  \nhere\n BATH # : *A*  MR # *C*   
2                   \n[ANY]\nJ*B* *B* So so \nBase ID : *A*  Date\n\n\n hey the \n\n \n \n\n\n

Tried

I have looked around, including "Read Excel data using Pandas and retaining the line break of a cell value" and many other questions, but none seem to do what I am after.

Question

How do I achieve my desired output?

Answer 1

We regex split Text and Alt_Text, using capturing parentheses in the pattern:

If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
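As a small illustration of that behaviour, the sketch below applies the same pattern to a shortened piece of row 0 from each column. The variable names are chosen here for illustration only. The point to notice is that the captured tokens land at the odd positions and the separators (spaces, brackets, line breaks) at the even positions, so two strings with the same tokens split into lists of the same length.

```python
import re

# Same pattern as in the answer: a "token" is a run of characters that are
# not a space, a newline, or a square bracket.
token = re.compile(r'([^ \n\[\]]+)')

text_parts = token.split('\n[STUFF]\nBut the')
alt_parts = token.split('[STUFF]But the')

print(text_parts)  # ['\n[', 'STUFF', ']\n', 'But', ' ', 'the', '']
print(alt_parts)   # ['[', 'STUFF', ']', 'But', ' ', 'the', '']
```

Only the separators differ between the two lists, which is what makes a position-by-position merge possible.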

Then we zip the two lists, take every piece that contains a line break from Text and every other piece from Alt_Text, and join the resulting list into New_Text:

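import re

def insert_line_breaks(text, alt_text):
    # Split on runs of non-space, non-newline, non-bracket characters; the
    # capturing group keeps those runs in the result next to the separators.
    regex = re.compile(r'([^ \n\[\]]+)')
    text = regex.split(text)
    alt_text = regex.split(alt_text)
    # Keep a piece from Text only when it carries a line break.
    return ''.join([t if '\n' in t else a for t, a in zip(text, alt_text)])

df['New_Text'] = df.apply(lambda r: insert_line_breaks(r.Text, r.Alt_Text), axis=1)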

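I assume there should be a space between the second *B* and So in the last row of Alt_Text, and that the J before the first *B* in the desired output is just a typo. In that case we get: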

>>> df.New_Text
0            \n[STUFF]\nBut the here is \n\nBase ID : *A* Date is Here \nfollow\n
1                    \n[OTHER]\n\n\nFound *B* *B* \nhere\n  BATH # : *A* MR # *C*
2    \n[ANY]\n*B* *B* So so \nBase ID : *A* Date\n\n\n hey the \n\n  \n    \n\n\n
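The assumption about the missing space matters because zip stops at the shorter list, so a row whose token counts differ would be merged out of step and cut short without any error. A possible guard is sketched below; insert_line_breaks_checked is a hypothetical name introduced here, not part of the answer, and raising ValueError is just one way to surface the problem.

```python
import re

TOKEN = re.compile(r'([^ \n\[\]]+)')

def insert_line_breaks_checked(text, alt_text):
    """Hypothetical variant of insert_line_breaks that refuses misaligned rows."""
    text_parts = TOKEN.split(text)
    alt_parts = TOKEN.split(alt_text)
    # Different lengths mean the two strings do not hold the same number of tokens.
    if len(text_parts) != len(alt_parts):
        raise ValueError(
            f'token mismatch: {len(text_parts)} pieces in text, '
            f'{len(alt_parts)} in alt_text'
        )
    return ''.join(t if '\n' in t else a for t, a in zip(text_parts, alt_parts))

text = '\n[ANY]\nJane Ja So so \nBase ID : 11111 Date\n\n\n hey the \n\n  \n    \n\n\n'
alt_as_posted = '[ANY]*B* *B*So so Base ID : *A* Date hey the '
alt_with_space = '[ANY]*B* *B* So so Base ID : *A* Date hey the '

try:
    insert_line_breaks_checked(text, alt_as_posted)
except ValueError as exc:
    print(exc)  # the row as posted has one token fewer than Text

print(repr(insert_line_breaks_checked(text, alt_with_space)))
```

With the space added, the third row merges cleanly; as posted, it is rejected rather than silently truncated.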
