pandas column modification with regular expression

Question

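I want to fix some string entries in a pandas Series so that every value matching the pattern '0x.202' (the last digit of the year is missing) gets a zero appended, making it a full date in 'mm.yyyy' format. This is the pattern I came up with: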

pattern = '\d*\.202(?:$|\W)'

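It is meant to match exactly 2 digits, then a dot, then exactly 202 at the end. Can you help me replace the strings in the Series while keeping the original index?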

My current approach is:

date = df['Calendar Year/Month'].astype('str')
pattern = re.compile('\d*\.202(?:$|\W)')
date.str.replace(pattern, pattern.pattern + '0', regex=True)

But I get an error:

error: bad escape \d at position 0
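
For context, the error seems to come from the second argument rather than the pattern itself: the replacement is parsed as a replacement template, and `\d` is not a valid escape there. A minimal sketch with the standard `re` module, assuming a lookahead variant of the pattern and a hypothetical helper `append_zero` (neither is from the question):

```python
import re

# Lookahead variant of the question's pattern (an assumption): the character
# after "202" is checked but not consumed, so it survives the substitution.
pattern = re.compile(r'\d*\.202(?=$|\W)')

# Reusing the pattern text as the replacement reproduces the error,
# because "\d" is not a valid escape in a replacement template.
try:
    pattern.sub(pattern.pattern + '0', '10.202')
    error_message = None
except re.error as exc:
    error_message = str(exc)

def append_zero(match):
    """Hypothetical helper: return the matched text with a trailing zero."""
    return match.group(0) + '0'

# A callable replacement avoids template escapes altogether.
fixed = [pattern.sub(append_zero, s) for s in ['10.202', '5.202 abc', '11.2021']]
print(error_message)
print(fixed)
```

`Series.str.replace` also accepts a callable as the replacement when `regex=True`, so the same helper should carry over to the pandas call.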

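Edit: Sorry for the lack of detail. I forgot to mention that pandas misinterpreted the dates as floats, which is why the 2020 dates are not shown in full (5.2020 becomes 5.202, for example, because the trailing zero is dropped). So the expression I ended up using is: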

date = df['Year/Month'].astype('str')
date = date.apply(lambda _: _ if _[-1] == '1' or _[-1] == '9' else f'{_}0')

This way only 'xx.202' is edited, and dates such as 'xx.2021' and 'xx.2019' are left alone. Thanks for the help, everyone!
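
Since the values are really floats, one alternative is to format them with four decimal places instead of patching the strings afterwards. A small sketch, assuming every value has the shape month.year; the helper name `format_month_year` is made up for illustration:

```python
def format_month_year(value: float) -> str:
    """Hypothetical helper: render a month.year float with four decimals."""
    return f"{value:.4f}"

# The float conversion is what drops the trailing zero.
lossy = str(float("5.2020"))

# Fixed-width formatting restores the zero and leaves other years unchanged.
restored = [format_month_year(v) for v in (5.2020, 10.2019, 12.2021)]
print(lossy)
print(restored)
```

The helper could be passed to `Series.map` on the float column. Reading the column as text in the first place (for example with `dtype=str` in `pd.read_csv`, if the data comes from a CSV file) would avoid the problem entirely.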

Answer 1

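Do you have to use a regex here? If not, this will work (it appends a 0 unless the string is already 7 characters long, i.e. a full 'mm.yyyy' value; note that this assumes two-digit months).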

df["Calendar Year/Month"].apply(lambda _: _ if len(_)==7 else f'{_}0')

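Or maybe this (append a 0 if the last character is '2'; note that a full year ending in 2, such as 2022, would also be changed):

df["Calendar Year/Month"].apply(lambda _: f'{_}0' if _[-1] == '2' else _)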
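
Both checks above depend on the string length or the last character alone. A stricter variant, sketched here with a hypothetical `pad_year` helper, appends the zero only when the whole value looks like a one- or two-digit month followed by '.202':

```python
import re

# One or two month digits, a dot, and the truncated year "202".
SHORT_2020 = re.compile(r'\d{1,2}\.202')

def pad_year(value: str) -> str:
    """Hypothetical helper: append '0' only to truncated 2020 dates."""
    return value + '0' if SHORT_2020.fullmatch(value) else value

values = ['5.202', '10.202', '5.2021', '10.2019', '2.2022']
padded = [pad_year(v) for v in values]
print(padded)
```

Because `fullmatch` must consume the entire string, values such as '2.2022' or '10.2019' are returned unchanged. The helper can be used with `Series.apply` in the same way as the lambdas above.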

Answer 2

I would do a str.replace:

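import pandas as pd
df = pd.DataFrame({'Year/Month': ['10.202 abc', 'abc 1.202']})
# regex=True is needed on pandas 2.0+, where str.replace defaults to literal matching
df['Year/Month'].str.replace(r'(\d*\.202)\b', r'\g<1>0', regex=True)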

Output:

0    10.2020 abc
1     abc 1.2020
Name: Year/Month, dtype: object
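
A note on the `\g<1>0` replacement used in this answer: the explicit form matters because `\10` would be read as a reference to group 10 rather than group 1 followed by a literal zero. A minimal illustration with the standard `re` module (the variable names are only for this sketch):

```python
import re

pattern = re.compile(r'(\d*\.202)\b')
text = '10.202 abc'

# \g<1>0 means "group 1, then a literal 0".
with_explicit_ref = pattern.sub(r'\g<1>0', text)

# \10 is parsed as group 10, which this pattern does not define.
try:
    pattern.sub(r'\10', text)
    ambiguous_error = None
except re.error as exc:
    ambiguous_error = str(exc)

print(with_explicit_ref)
print(ambiguous_error)
```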
