Python

Question

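I've imported a CSV of roughly 17,000 rows into a pandas DataFrame. One date column was imported as int64 because the data quality is poor. Example dates include 11969, 12132001, 1022013, and so on. So what I think I want to do is retrieve the last 4 digits from the date column.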

The code I'm using is:

import re

test_str = str(df['Date'])
flags = re.MULTILINE
p = r'\d{4}$'
result = re.findall(p, test_str, flags)

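When I print(result), only 60 of the 17,000 values are returned. I assumed it was only evaluating unique values, but after a lot of googling I can't figure it out. Any ideas on how to solve this?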

Answer 1

Your approach does seem to work (at least for the examples you provided):

import re
import pandas as pd
rng = pd.Series([11969, 12132001, 1022013, 1022013])
test_str = str(rng)
flags = re.MULTILINE
p = r'\d{4}$'
result = re.findall(p, test_str, flags)
print(result)
# ['1969', '2001', '2013', '2013'] # not just unique values

However, converting a pandas Series to a string like this is an odd approach, and it does not take advantage of pandas' built-in structure.
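A likely explanation for the count of 60, assuming default pandas display options: str() on a Series returns the same truncated text that print() would show, not every value. For a long Series, rows beyond display.max_rows (60 by default) are elided, so the regex only sees the rows that were displayed. Here is a minimal sketch with a made-up 17000-row Series; the data and the names dates, text, and matches are illustrative only:

```python
import re
import pandas as pd

# Hypothetical stand-in for the 17000-row Date column
dates = pd.Series([12132001] * 17000, name="Date")

# 60 is the pandas default for display.max_rows; it is set explicitly
# here so the sketch does not depend on local settings.
with pd.option_context("display.max_rows", 60):
    text = str(dates)  # truncated display text, not all values

matches = re.findall(r"\d{4}$", text, re.MULTILINE)
print(len(matches))  # at most 60, far fewer than 17000

# Operating on the values themselves covers every row
years = dates % 10000
print(len(years))  # 17000
```

The exact number of displayed rows can vary with the pandas version and display settings, but it never exceeds display.max_rows, which would account for the 60 results.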

You might consider doing this instead:

df['year_int'] = df['Date'] % 10000

which takes the last four digits if df['Date'] is int64. Or this:

df['year_str'] = df['Date'].apply(lambda x: str(x)[-4:])

if you would rather convert to a string and then take the last four characters.

print(df)
#        Date  year_int year_str
# 0     11969      1969     1969
# 1  12132001      2001     2001
# 2   1022013      2013     2013
# 3   1022013      2013     2013
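For reference, the two approaches can be put together in one self-contained sketch. The small DataFrame below is built from the example values in the question, and the vectorized `.astype(str).str[-4:]` form is offered as an assumed equivalent of the `apply` with a lambda, not as the only way to do it:

```python
import pandas as pd

# Sample frame built from the example values in the question
df = pd.DataFrame({"Date": [11969, 12132001, 1022013, 1022013]})

# Integer approach: the remainder after dividing by 10000 keeps the last four digits
df["year_int"] = df["Date"] % 10000

# String approach: cast to str, then slice the last four characters
df["year_str"] = df["Date"].astype(str).str[-4:]

print(df)
print(df.dtypes)  # year_int stays an integer column, year_str holds strings
```

The integer version keeps a numeric column, which is convenient for filtering or sorting by year, while the string version preserves any leading zeros in the last four characters.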

Python - Retrieve and replace based on a regex

python-2.7

pandas

jupyter-notebook