Python create a new column with extracted regex until \n from a dataframe

I have a dataframe that looks like this:

data = {'c1':['Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n', 
              'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 3\n', 
              'Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 02\n \t\n \tError executing the statement: error statement2\n', 
              'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1\n'],
        'c2':["one", "two", "three", "four"]}

I want to create two new columns, c3 and c4, so that in the end my dataframe looks like this:

data = {'c1':['Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1', 
              'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 3', 
              'Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 02\n \t\n \tError executing the statement: error statement2', 
              'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1'],
        'c3':['this is problem type 01', 
              'this is problem type 01', 
              'this is problem type 02', 
              'this is problem type 04'],
        'c4':['Error executing the statement: error statement 1', 
              'Error executing the statement: error statement 3', 
              'Error executing the statement: error statement2', 
              'Error executing the statement: error statement1'],
        'c2':["one", "two", "three", "four"]}

This is what I have so far. I am trying to extract everything after "Thrown: lib:" up to the first \n, but it does not work.

df = pd.DataFrame(data)
df['exception'] = df['c1'].str.extract(r'Thrown: lib: (.*(?:\r?\n.*)*)', expand=False)
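
The attempt above captures too much because the group (.*(?:\r?\n.*)*) explicitly continues across line breaks, so it runs to the end of the string. A minimal sketch, assuming the goal is only the remainder of the "Thrown: lib:" line, swaps in a character class that cannot cross a newline. The two-row sample is a shortened copy of the data in the question, and the column is named c3 to match the desired result:

```python
import pandas as pd

df = pd.DataFrame({
    'c1': ['Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n',
           'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1\n'],
    'c2': ['one', 'four'],
})

# [^\r\n]+ cannot match a line break, so the capture stops at the first \n
df['c3'] = df['c1'].str.extract(r'Thrown: lib: ([^\r\n]+)', expand=False)

print(df['c3'].tolist())
```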

I would use the re package:

import re
# re.findall returns a list of matches, so take the first one to get a plain string
data['c3'] = [re.findall("Thrown: lib: ([^\n]+)", x)[0] for x in data['c1']]
data['c4'] = [re.split("\n", x)[3].strip() for x in data['c1']]
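  • The first pattern extracts everything between Thrown: lib: and the first newline
  • The second pattern assumes the relevant message is always the 4th token when the string is split on \n, which appears to be the case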
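
The two patterns above can also be sketched directly on a DataFrame with pandas' vectorized string methods. This assumes a DataFrame named df that holds the c1 column from the question (only two rows are reproduced here), and it carries the same assumption that the message is always the 4th token:

```python
import pandas as pd

df = pd.DataFrame({
    'c1': ['Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n',
           'Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 02\n \t\n \tError executing the statement: error statement2\n'],
    'c2': ['one', 'three'],
})

# Everything between "Thrown: lib: " and the first newline
df['c3'] = df['c1'].str.extract(r'Thrown: lib: ([^\n]+)', expand=False)

# 4th token (index 3) after splitting on newlines, with the leading " \t" stripped
df['c4'] = df['c1'].str.split('\n').str[3].str.strip()

print(df[['c3', 'c4']].to_string())
```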

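Follow-up to the question below: the pattern for data['c4'] relies on the message always being the 4th token when the string is split on "\n", that is, after 3 newlines.
Now, if the delimiter of interest is "\n \t\n", you can modify the pattern: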

data['c4'] = [re.split("\n \t\n", x)[1].strip() for x in data['c1']]

data['c4'] = [re.findall(".*?\n \t\n(.*)", x)[0].strip() for x in data['c1']]

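The last approach is preferable. Note that, as written, both raise an IndexError when the delimiter is missing: split then returns a single-element list, and findall returns an empty one.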
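
Indexing the result of re.split or re.findall raises an IndexError when the delimiter is not present. A minimal sketch of a variant that returns None instead, using re.search; the helper name message_after_delimiter and the sample strings are hypothetical and not part of the original answer:

```python
import re

def message_after_delimiter(text, delimiter="\n \t\n"):
    """Return the stripped line that follows `delimiter`, or None if it is absent."""
    # re.escape makes the delimiter literal; (.*) stops at the next newline
    match = re.search(re.escape(delimiter) + r"(.*)", text)
    return match.group(1).strip() if match else None

good = 'Level: X\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n'
bad = 'Level: X\n no delimiter here\n'

print(message_after_delimiter(good))
print(message_after_delimiter(bad))
```

With a helper like this, rows that lack the delimiter end up as missing values rather than aborting the whole list comprehension.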

It can probably be done in one line, but something like this works:

import re
import pandas as pd


data = {'c1':['Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n', 
              'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 3\n', 
              'Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 02\n \t\n \tError executing the statement: error statement2\n', 
              'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1\n'],
        'c2':["one", "two", "three", "four"]}



df = pd.DataFrame(data)

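# Raw strings avoid invalid-escape warnings for \d and \s
pattern1 = r'Thrown: lib: ([a-zA-Z\d\s]*)\n'
# \s also matches newlines, so the capture can overshoot; .str.strip() trims the excess
df['c3'] = df['c1'].str.extract(pattern1, expand=False).str.strip()

# Column 0 of the result is the repeated "\n<whitespace>\t" delimiter, column 1 is the message
pattern2 = r'(\n\s\t){1,}(.*)\n'
df['c4'] = df['c1'].str.extract(pattern2, expand=True)[1]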

Output:

print(df.to_string())
                                                                                                                           c1     c2                       c3                                                c4
0  Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n    one  this is problem type 01  Error executing the statement: error statement 1
1  Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 3\n    two  this is problem type 01  Error executing the statement: error statement 3
2   Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 02\n \t\n \tError executing the statement: error statement2\n  three  this is problem type 02   Error executing the statement: error statement2
3   Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1\n   four  this is problem type 04   Error executing the statement: error statement1
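
As for doing it in one step, a possible sketch uses a single str.extract call with two named groups, so that pandas names the resulting columns c3 and c4. The combined pattern is an assumption based on the sample data (the message is taken to be the first non-blank line after the "Thrown: lib:" line) and is not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    'c1': ['Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n',
           'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1\n'],
    'c2': ['one', 'four'],
})

# c3: rest of the "Thrown: lib:" line; \s+ skips the blank " \t" line and indentation;
# c4: the next line up to its newline
pattern = r'Thrown: lib: (?P<c3>[^\n]+)\s+(?P<c4>[^\n]+)'

# Named groups become column names; join aligns the new columns on the index
df = df.join(df['c1'].str.extract(pattern))

print(df[['c2', 'c3', 'c4']].to_string())
```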