How to remove lines with a pattern if the next line matches the same pattern?

Question

I have a dataframe in which one column holds the log of a ticket, one ticket per row. Here is a sample log:

99645,
\Submitted',
 '\Modifications made 2015/01/01',
 'x_change0:   -->  info0',
 'y_status1:   -->  info1',
 'z_change2:   -->  info2',
 'y_change3:   -->  info3',
 '\Modifications made 2015/01/03',
 '\Modifications made 2015/01/05',
 '\Modifications made 2015/01/07',
 'w_change0:   -->  info0',
 'a_status1:   -->  info1',
 '\Modifications made 2015/01/07',
.
.
.

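I want to delete every "Modifications made" line that is not followed by any changes. The following regular expression matches what I am looking for (tested on RegEx101):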

pattern = '(?sm)Modifications\s*((?!Modifications\s*).)*'
re.findall(pattern, dataframe['log'])

Expected result for each cell of dataframe['log']:

Modifications made 2015/01/01',
'change0:   -->  info0',
'change1:   -->  info1',
'change2:   -->  info2',
'change3:   -->  info3',
'Modifications made 2015/01/07',
'change0:   -->  info0',
'change1:   -->  info1',
'

How can I remove the unwanted lines inside a cell? Or how can I replace the string in a cell with the filtered string?

Answer 1

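Use pd.Series.shift together with str.startswith to build the filter: shift(-1) lines each row up with the line that follows it, so a "Modifications" row can be tested against its successor.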

Initial dataframe:

In [87]: df                                                                                                    
Out[87]: 
                                  log
0   '\Modifications made 2015/01/01',
1            'change0:   -->  info0',
2            'change1:   -->  info1',
3            'change2:   -->  info2',
4            'change3:   -->  info3',
5   '\Modifications made 2015/01/03',
6   '\Modifications made 2015/01/05',
7   '\Modifications made 2015/01/07',
8            'change0:   -->  info0',
9            'change1:   -->  info1',
10  '\Modifications made 2015/01/07',

Drop the rows that meet the condition (add the inplace=True parameter to modify the dataframe in place):

In [88]: df.drop(df[(df.log.str.startswith("'\Modifications")) & ((df.log.shift(-1).str.startswith("'\Modificat
    ...: ions")) | (~df.log.shift(-1).str.startswith("'change", na=False)) | df.log.shift(-1).isna())].index)  
Out[88]: 
                                 log
0  '\Modifications made 2015/01/01',
1           'change0:   -->  info0',
2           'change1:   -->  info1',
3           'change2:   -->  info2',
4           'change3:   -->  info3',
7  '\Modifications made 2015/01/07',
8           'change0:   -->  info0',
9           'change1:   -->  info1',
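
The same filter can also be written with named boolean masks, which may be easier to read than the one-liner. This is a minimal sketch, assuming one log line per row in a column named log, as in the frame above. The names is_header, next_is_change, and orphan_header are introduced here for illustration only. The three-way condition above collapses to "the next line is not a change line", because a following header and a missing next line are both covered by that test.

```python
import pandas as pd

# Sample frame rebuilt from the answer: one log line per row.
lines = [
    r"'\Modifications made 2015/01/01',",
    "'change0:   -->  info0',",
    "'change1:   -->  info1',",
    "'change2:   -->  info2',",
    "'change3:   -->  info3',",
    r"'\Modifications made 2015/01/03',",
    r"'\Modifications made 2015/01/05',",
    r"'\Modifications made 2015/01/07',",
    "'change0:   -->  info0',",
    "'change1:   -->  info1',",
    r"'\Modifications made 2015/01/07',",
]
df = pd.DataFrame({"log": lines})

# A header is a "Modifications made" line (plain prefix test, not a regex).
is_header = df["log"].str.startswith(r"'\Modifications")

# shift(-1) pairs each row with the line after it; the last row has no
# successor, and na=False treats that missing value as "not a change line".
next_is_change = df["log"].shift(-1).str.startswith("'change", na=False)

# Drop headers that are not followed by a change line.
orphan_header = is_header & ~next_is_change
filtered = df[~orphan_header]
print(filtered)
```

Indexing with the negated mask returns a new frame and leaves df untouched, so no inplace argument is needed.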

Answer 2

Solved with @Code Maniac's regex solution: (?sm)Modifications[^,]+,(?:(?!^\s*'\Modifications).)*\b

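Store the matches for each cell's string in a new column:

import re

# Python's re module rejects the unknown escape \M, so the literal backslash
# in the log text is written as \\ inside the raw-string pattern.
pattern = re.compile(r"(?sm)Modifications[^,]+,(?:(?!^\s*'\\Modifications).)*\b")

# For each cell, collect every "Modifications" block that is followed by
# changes; the new column holds one list of matches per ticket.
df['tickethist'] = df['log'].astype(str).apply(pattern.findall)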
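
To see what the pattern keeps and to turn the matches back into a single filtered string, here is a minimal sketch on one cell. It assumes the cell holds the log as a multi-line string shaped like the sample in the question. The sample text is shortened, and the helper filter_log is a hypothetical name, not part of the original answer. The backslash before Modifications is escaped because Python's re raises an error on the unknown escape \M.

```python
import re

# Hypothetical cell content: one ticket log stored as a multi-line string.
log = r"""99645,
'\Submitted',
 '\Modifications made 2015/01/01',
 'x_change0:   -->  info0',
 'y_status1:   -->  info1',
 '\Modifications made 2015/01/03',
 '\Modifications made 2015/01/05',
 '\Modifications made 2015/01/07',
 'w_change0:   -->  info0',
 '\Modifications made 2015/01/07',
"""

# The tempered dot consumes text until a line that starts a new
# "Modifications" entry; the final \b forces the match to end on a word
# character, which fails when a header has no change lines after it.
pattern = re.compile(r"(?sm)Modifications[^,]+,(?:(?!^\s*'\\Modifications).)*\b")


def filter_log(text):
    """Return only the modification blocks that are followed by changes."""
    return "\n".join(pattern.findall(text))


matches = pattern.findall(log)
filtered = filter_log(log)
print(filtered)
```

With a dataframe, the same helper could be applied per cell, for example df['log'] = df['log'].astype(str).apply(filter_log), which replaces each cell with its filtered string instead of a list of matches.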

Tags: python, regex, text, text-mining, pandas