Remove space between abbreviated letters in a string column

Question

I have a pandas DataFrame as follows:

import pandas as pd
import numpy as np

d = {'col1': ['I called the c. i. a', 'the house is e. m',
 'this is an e. u. call!','how is the p. o. r going?']}
df = pd.DataFrame(data=d)

I have already removed the punctuation and the spaces between the abbreviated letters:

df['col1'] = df['col1'].str.replace(r'[^\w\s]', '', regex=True)
df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)', '', regex=True)

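The output is, for example, 'I called the cia', but what I want is 'I called the CIA'. So basically I would like the abbreviations to be uppercase. I tried the following, without success: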

df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)'.upper(),'')

or

df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)',''.upper())

Answer 1

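pandas.Series.str.replace accepts a callable as its second argument, just as the second argument of re.sub does. Using that, you can first uppercase the abbreviations as follows: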

import pandas as pd
def make_upper(m):  # where m is re.Match object
    return m.group(0).upper()
d = {'col1': ['I called the c. i. a', 'the house is e. m', 'this is an e. u. call!','how is the p. o. r going?']}
df = pd.DataFrame(data=d)
df['col1'] = df['col1'].str.replace(r'\b\w\.?\b', make_upper, regex=True)  # a callable replacement requires regex=True
print(df)

Output

                        col1
0       I called the C. I. A
1          the house is E. M
2     this is an E. U. call!
3  how is the P. O. R going?

Then you can process the result further with the code you already have:

df['col1'] = df['col1'].str.replace(r'[^\w\s]', '', regex=True)
df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)', '', regex=True)
print(df)

Output

                   col1
0      I called the CIA
1       the house is EM
2    this is an EU call
3  how is the POR going

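If you run into cases this does not cover, you can refine the pattern I used (r'\b\w\.?\b'). It uses word boundaries (\b) and a literal dot (\.), so it finds any single word character (\w) optionally (?) followed by a dot.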
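As one possible refinement, note that the pattern above also uppercases ordinary one-letter words such as "a". The sketch below uses plain re.sub to compare it with a narrower pattern. The name `narrow` and its pattern are assumptions made for illustration, not part of the original answer, and the narrower pattern may still need adjusting for your data.

```python
import re

def make_upper(m):  # m is an re.Match object
    return m.group(0).upper()

# Pattern from the answer: every single-character word is uppercased.
broad = r'\b\w\.?\b'

# Hypothetical narrower pattern (an assumption, not from the answer):
# a single letter followed by one or more ". letter" groups.
narrow = r'\b\w(?:\.\s*\w)+\b'

text = 'a call to the c. i. a'
broad_result = re.sub(broad, make_upper, text)
narrow_result = re.sub(narrow, make_upper, text)

print(broad_result)   # A call to the C. I. A
print(narrow_result)  # a call to the C. I. A
```

The same pattern and callable can be passed to Series.str.replace with regex=True.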

Answer 2

You need to use a function as the replacement. Try this to uppercase the acronyms and strip their spaces and dots in one pass:

def my_replace(match):
    match = match.group()
    return match.replace('.', '').replace(' ', '').upper()

df['col1'] = df['col1'].str.replace(r'\b[\w](\.\s[\w])+\b[\.]*', my_replace, regex=True)
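To see what this replacement function produces, here is a minimal sketch that applies the same pattern to the sample strings with plain re.sub instead of pandas. The names `pattern`, `samples` and `results` are introduced here only for illustration.

```python
import re

def my_replace(match):
    text = match.group()
    # Drop the dots and spaces inside the matched acronym, then uppercase it.
    return text.replace('.', '').replace(' ', '').upper()

# Same pattern as in the answer above.
pattern = r'\b[\w](\.\s[\w])+\b[\.]*'

samples = [
    'I called the c. i. a',
    'the house is e. m',
    'this is an e. u. call!',
    'how is the p. o. r going?',
]
results = [re.sub(pattern, my_replace, s) for s in samples]

for line in results:
    print(line)
# I called the CIA
# the house is EM
# this is an EU call!
# how is the POR going?
```

Punctuation outside the acronyms ('!' and '?') is left in place, so the punctuation-stripping step from the question is still needed if you want it removed.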

python

abbreviation

uppercase

pandas