How to split year from date and make a new column; how to deal with leap years

Question

I am very new to coding (this is the first code I have written).

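I have multiple csv files, all with the same headers. The files contain hourly ozone concentrations for every day of the year, one file per year (ranging from 2009 to 2020). I have a column 'date' that contains year-month-day, and a time column (0-23). I want to separate the year from month-day, combine the hour with month-day and make that the index, and then merge the other csv files into a single dataframe.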

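In addition, I need to average the data values for each hour of each day across all 10 years; however, three of my files include a leap day (an extra 24 values). I would appreciate any advice on how to account for leap years. I assume I would need to add the leap day to the files that lack it, fill it with null values, and then remove the null values (but that seems circular).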

Also, if you have any tips on how to simplify my process, please feel free to share!

Thanks in advance for your help.

Update: I tried Rookie's suggestion below, but after importing the csv data I get an error message:

import pandas as pd
import os

path = "C:/Users/heath/Documents/ARB project Spring2020/ozone/SJV/SKNP"

df = pd.DataFrame()
for file in os.listdir(path):
    df_temp = pd.read_csv(os.path.join(path, file))
    df = pd.concat((df, df_temp), axis = 0)

First, I got the error message OSError: Initializing from file failed. I tried to fix it by adding engine = 'python' as suggested elsewhere, but now I get PermissionError: [Errno 13] Permission denied: 'C:/Users/heath/Documents/ARB project Spring2020/ozone/SJV/SKNP\.ipynb_checkpoints'

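Please help, I don't know what else to do. I edited the permissions so that everyone has read and write access. However, I still get the "permission denied" error when I import the csv on Windows.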

Answer 1

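First, once the columns are in a pandas DataFrame, you want to determine the types of the columns you are working with. This can be done with the dtypes attribute. For example, if your DataFrame is df, you can run df.dtypes, which tells you the type of each column. If you see object, pandas is treating the values as strings (sequences of characters rather than actual date or time values). If you see datetime64[ns], pandas knows this is a datetime value (date and time combined). If you see timedelta64[ns], pandas knows this is a time difference (more on this later).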

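If the dtype is object, let's convert the columns to datetime64[ns] so that pandas knows we are working with date/time values. This can be done by simple reassignment. For example, if the date format is YYYY-mm-dd (2020-06-04), we can convert the date column as follows (assuming your date column is named "Date"). Refer to the strftime documentation for other format codes.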

df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")
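As a quick check of this conversion, here is a minimal sketch that compares the column's dtype before and after. The sample rows are made up for illustration; only the column name "Date" follows the answer.

```python
import pandas as pd

# Made-up sample rows; only the column name "Date" follows the answer above.
df_demo = pd.DataFrame({"Date": ["2012-02-28", "2012-02-29", "2012-03-01"]})
dtype_before = df_demo["Date"].dtype  # a string/object dtype before conversion

df_demo["Date"] = pd.to_datetime(df_demo["Date"], format="%Y-%m-%d")
dtype_after = df_demo["Date"].dtype  # a datetime64 dtype after conversion

print(dtype_before, "->", dtype_after)
```

If the second dtype printed is still a string type, the format string probably does not match how the dates are written in the file.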

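The time column is a bit trickier. Pandas has no dtype for a time of day on its own, so we need to convert the time to timedelta64[ns]. If the time format is hh:mm:ss (e.g. "21:02:24"), we can convert the object column as follows: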

df["Time"] = pd.to_timedelta(df["Time"])

If the format is different, you will first need to convert the strings to hh:mm:ss format.
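In the question, the time column holds the hour of day as an integer (0-23) rather than an hh:mm:ss string. In that case one option, sketched here with made-up values, is to pass unit="h" to pd.to_timedelta so the integers are read as hours. Without a unit, pandas interprets plain integers as nanoseconds.

```python
import pandas as pd

# Made-up rows; "Time" holds the hour of day as an integer (0-23), as in the question.
df_demo = pd.DataFrame({"Time": [0, 13, 21]})

# unit="h" tells pandas the integers are hours; without it they are read as nanoseconds.
df_demo["Time"] = pd.to_timedelta(df_demo["Time"], unit="h")

print(df_demo["Time"].tolist())
```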

To combine the columns, we can now simply add them:

df["DateTime"] = df["Date"] + df["Time"]

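To create the formatted datetime column you mentioned, you can create a new column of formatted strings. The following gives "06-04 21", meaning 9 PM on June 4. The strftime format codes let you produce whatever format you want.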

df["Formatted_DateTime"] = df["DateTime"].dt.strftime("%m-%d %H")
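The question also asks for the year in its own column and for the month-day-hour value as the index. A minimal sketch of both, assuming a "DateTime" column like the one built above; the column name "Year" and the sample rows are made up for illustration:

```python
import pandas as pd

# Made-up rows covering two years; "DateTime" follows the answer above.
df_demo = pd.DataFrame({
    "DateTime": pd.to_datetime(["2019-06-04 21:00", "2020-06-04 21:00"]),
})

df_demo["Year"] = df_demo["DateTime"].dt.year  # year as its own integer column
df_demo["Formatted_DateTime"] = df_demo["DateTime"].dt.strftime("%m-%d %H")

# Optional: use the month-day-hour key as the index, as the question describes.
df_demo = df_demo.set_index("Formatted_DateTime")
print(df_demo)
```

The index is deliberately not unique here: the same month-day-hour key appears once per year, which is what makes the later averaging across years possible.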

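You need to do this for every file. I recommend a for loop here. Below is a complete code snippet. It will obviously vary depending on your column types, file names, and so on.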

import os # module to iterate over the files
import pandas as pd

base_path = "path/to/directory" # This is the directory path where all your files are stored

# It will be faster to read in all files at once THEN format the date
frames = []
for file in os.listdir(base_path):
    if not file.endswith(".csv"): # Skip non-CSV entries such as the .ipynb_checkpoints folder
        continue
    frames.append(pd.read_csv(os.path.join(base_path, file))) # This will read every csv file in the base_path directory
df = pd.concat(frames, axis=0, ignore_index=True) # Concatenating (merging) the files

# Formatting the data
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d") # Date conversion
df["Time"] = pd.to_timedelta(df["Time"]) # Time conversion
df["DateTime"] = df["Date"] + df["Time"] # Combine date and time to single column
df["Formatted_DateTime"] = df["DateTime"].dt.strftime("%m-%d %H") # Format the datetime values

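Now that everything is formatted, the averaging part is easy. Since you are only interested in the average for each month-day-hour, we can use the groupby functionality.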

df_group = df.groupby(["Formatted_DateTime"]) # This will group your data by unique values of the "Formatted_DateTime" column
df_average = df_group.mean(numeric_only=True) # This will average the numeric columns within each group (Feb 29 hours are averaged over the leap years only)

It is always good to check your work!

print(df_average.head(5)) # This will print the first 5 rows (hours) of averaged values
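To see how grouping on month-day-hour treats the leap day, the sketch below uses made-up hourly data for one leap year and one non-leap year. The column name "ozone" and the constant values are hypothetical. It counts how many years contribute to each group, and then shows one possible alternative: dropping Feb 29 before averaging so that every remaining group spans the same set of years.

```python
import pandas as pd

# Made-up hourly data for one leap year (2012) and one non-leap year (2013).
# "ozone" is a hypothetical column name for the measured values.
stamps = pd.date_range("2012-01-01", "2013-12-31 23:00", freq=pd.Timedelta(hours=1))
df_demo = pd.DataFrame({"DateTime": stamps, "ozone": 1.0})
df_demo["Formatted_DateTime"] = df_demo["DateTime"].dt.strftime("%m-%d %H")

# How many years contribute to each month-day-hour group?
counts = df_demo.groupby("Formatted_DateTime")["ozone"].count()
print(counts.loc["02-28 12"], counts.loc["02-29 12"])

# Optional alternative: drop Feb 29 so every remaining group spans the same years.
is_leap_day = (df_demo["DateTime"].dt.month == 2) & (df_demo["DateTime"].dt.day == 29)
df_no_leap = df_demo[~is_leap_day]
df_average_demo = df_no_leap.groupby("Formatted_DateTime")["ozone"].mean()
print(len(df_average_demo))
```

Either way, no null values need to be added to the non-leap years: groupby simply averages whatever rows exist for each key.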

Tags: csv, datetime, merging-data, leap-year, pandas