Python: create a new column with regex-extracted text up to \n from a DataFrame
I have a dataframe that looks like this:
data = {'c1':['Level: LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n',
'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 3\n',
'Level: LOGGING_ONLY\n Thrown: lib: this is problem type 02\n \t\n \tError executing the statement: error statement2\n',
'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1\n'],
'c2':["one", "two", "three", "four"]}
I want to create a regex that extracts everything after "Thrown: lib: " up to the first \n. I will call this "group 01". So I would end up with this:
data = {'c3':['this is problem type 01',
'this is problem type 01',
'this is problem type 02',
'this is problem type 04']}
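Then I want to create a regex that extracts everything after "group 01" (the previous regex), ignoring the \t and \n between the sentences, up to the next \n. So I would end up with this: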
data = {'c4':['Error executing the statement: error statement 1',
'Error executing the statement: error statement 3',
'Error executing the statement: error statement2',
'Error executing the statement: error statement1']}
In the end I want my dataframe to look like this:
data = {'c1':['Level: LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1',
'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 3',
'Level: LOGGING_ONLY\n Thrown: lib: this is problem type 02\n \t\n \tError executing the statement: error statement2',
'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1'],
'c3':['this is problem type 01',
'this is problem type 01',
'this is problem type 02',
'this is problem type 04'],
'c4':['Error executing the statement: error statement 1',
'Error executing the statement: error statement 3',
'Error executing the statement: error statement2',
'Error executing the statement: error statement1'],
'c2':["one", "two", "three", "four"]}
This is what I have so far. I am trying to extract from "Thrown: lib: " up to the first \n, but it does not work.
df = pd.DataFrame(data)
df['exception'] = df['c1'].str.extract(r'Thrown: lib: (.*(?:\r?\n.*)*)', expand=False)
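A likely reason the attempt above captures too much is the trailing (?:\r?\n.*)* group, which keeps consuming every following line. The sketch below compares the two patterns on a single sample row using re.search; Series.str.extract is built on the same re engine, so it should behave the same way. The variable names are only illustrative.

```python
import re

row = 'Level: LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n'

# Original attempt: (?:\r?\n.*)* keeps consuming every following line.
too_much = re.search(r'Thrown: lib: (.*(?:\r?\n.*)*)', row).group(1)

# "." does not match "\n" by default, so a plain (.*) stops at the first newline.
first_line = re.search(r'Thrown: lib: (.*)', row).group(1)

print(repr(too_much))
print(repr(first_line))
```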
I would use the re package:
import re
# [0] unwraps the single match; findall alone returns a list per row
data['c3'] = [re.findall("Thrown: lib: ([^\n]+)", x)[0] for x in data['c1']]
data['c4'] = [re.split("\n", x)[3].strip() for x in data['c1']]
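- The first pattern extracts everything between "Thrown: lib: " and the first newline
- The second pattern assumes the relevant message is always the 4th token when the text is split on \n, which seems to be the case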
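To make the "4th token" assumption concrete, here is a small sketch that splits one sample row on newlines and inspects the pieces. It only illustrates the indexing; the variable names are not part of the answer's code.

```python
import re

row = 'Level: LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n'

tokens = re.split("\n", row)
for i, token in enumerate(tokens):
    print(i, repr(token))

# Index 3 (the 4th token) holds the message, still padded with " \t".
message = tokens[3].strip()
print(message)
```

The trailing \n produces an empty final token, and the " \t" line is token 2, which is why the message lands at index 3.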
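Follow-up, in response to the question below: the pattern for data['c4'] relies on the message always being the 4th piece when the text is split on "\n", that is, it always follows the third newline.
Now, if the delimiter of interest is "\n \t\n", you can modify the pattern: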
data['c4'] = [re.split("\n \t\n", x)[1].strip() for x in data['c1']]
or
data['c4'] = [re.findall(".*?\n \t\n(.*)", x)[0].strip() for x in data['c1']]
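The last approach is preferable, because if split does not find the delimiter, indexing [1] raises an IndexError. Note that re.findall(...)[0] also raises an IndexError when nothing matches, so guard against an empty result if some rows may lack the delimiter.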
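If some rows might not contain the "\n \t\n" delimiter, one option is to check the match before indexing. The helper below is a hypothetical sketch (extract_message and its default parameter are not part of the answer above) that returns a fallback value instead of raising.

```python
import re

def extract_message(text, default=None):
    # Return the line that follows the newline-space-tab-newline separator,
    # or `default` when the separator is not present in `text`.
    match = re.search(r"\n \t\n(.*)", text)
    return match.group(1).strip() if match else default

row = 'Level: LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n'
print(extract_message(row))
print(extract_message('no separator here'))
```

It could be applied to the original data with a list comprehension such as [extract_message(x) for x in data['c1']].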
This could probably be done in one line, but something like this works:
import re
import pandas as pd
data = {'c1':['Level: LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n',
'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 3\n',
'Level: LOGGING_ONLY\n Thrown: lib: this is problem type 02\n \t\n \tError executing the statement: error statement2\n',
'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1\n'],
'c2':["one", "two", "three", "four"]}
df = pd.DataFrame(data)
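# raw strings avoid invalid-escape warnings for \d and \s
pattern1 = r'Thrown: lib: ([a-zA-Z\d\s]*)\n'
# \s also matches \n and \t, so strip() removes the captured trailing whitespace
df['c3'] = df['c1'].str.extract(pattern1, expand=False).str.strip()
# group 0 is the repeated "\n \t" separator, group 1 is the message
pattern2 = r'(\n\s\t){1,}(.*)\n'
df['c4'] = df['c1'].str.extract(pattern2, expand=True)[1]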
Output:
print(df.to_string())
c1 c2 c3 c4
0 Level: LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n one this is problem type 01 Error executing the statement: error statement 1
1 Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 3\n two this is problem type 01 Error executing the statement: error statement 3
2 Level: LOGGING_ONLY\n Thrown: lib: this is problem type 02\n \t\n \tError executing the statement: error statement2\n three this is problem type 02 Error executing the statement: error statement2
3 Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1\n four this is problem type 04 Error executing the statement: error statement1
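As a variation on the approach above, both columns can be pulled out with a single str.extract call by using named groups, whose names become the column names. This is only a sketch on two of the sample rows; the combined pattern is an assumption that the message is the first non-blank line after the "Thrown: lib: " line.

```python
import pandas as pd

df = pd.DataFrame({
    'c1': ['Level: LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n',
           'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1\n'],
    'c2': ['one', 'four'],
})

# c3: rest of the "Thrown: lib: " line; \s* then skips blank lines and
# leading whitespace; c4: the next line up to its newline.
pattern = r'Thrown: lib: (?P<c3>[^\n]*)\n\s*(?P<c4>[^\n]*)'

# With named groups, str.extract returns a DataFrame with columns c3 and c4.
df = df.join(df['c1'].str.extract(pattern))
print(df[['c2', 'c3', 'c4']].to_string())
```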