Add a month column to each Excel file, then merge all files into a single .csv file

Question

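I'm new to Python and am asking for your help with a task at work.

I have 12 monthly Excel files in the same folder, each with these columns: Product_Name, Quantity, and Total_Value.

What I want to do, but don't know how, is:

Add a Month column to each file, holding the date that appears in that file's name
Merge those Excel files into a single file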

For example:

January-21.xls:

Product_Name (type:string)	Quantity (type:float)	Total_Value (type:float)	Month (type:Date)
Product A	10	250	"File Name" (January-21)
Product B	20	500	"File Name" (January-21)
Product C	15	400	"File Name" (January-21)

February-21.xls:

Product_Name (type:string)	Quantity (type:float)	Total_Value (type:float)	Month (type:Date)
Product A	40	800	"File Name" (February-21)
Product B	25	700	"File Name" (February-21)
Product C	30	500	"File Name" (February-21)

After merging:

Product_Name (type:string)	Quantity (type:float)	Total_Value (type:float)	Month (type:Date)
Product A	10	250	"File Name" (January-21)
Product B	20	500	"File Name" (January-21)
Product C	15	400	"File Name" (January-21)
Product A	40	800	"File Name" (February-21)
Product B	25	700	"File Name" (February-21)
Product C	30	500	"File Name" (February-21)

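Is that possible? Sorry for my English; I'm not a native speaker.

Thank you very much for your help!

Edit 1

This is how I currently merge the files, create the CSV file, and load the data into a pandas DataFrame: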


import pandas as pd
import os

path = "/content/drive/MyDrive/Colab_Notebooks/sq_datas"
files = [file for file in os.listdir(path) if not file.startswith('.')] # Ignore hidden files

all_months_data = pd.DataFrame()

for file in files:
    current_data = pd.read_excel(path+"/"+file)
    all_months_data = pd.concat([all_months_data, current_data])
    
all_months_data.to_csv("/content/drive/MyDrive/Colab_Notebooks/sq_datas/all_months.csv", index=False)

So my main problem is writing a loop that adds the Month column to each file before they are all merged into one.

Answer 1

At a basic level, you first need to read the Excel files, for example with pandas.read_excel:

import pandas as pd

jan21_df = pd.read_excel('January-21.xls')
feb21_df = pd.read_excel('February-21.xls')

You listed the Month column as type:Date. To add a date column to each DataFrame:

jan21_df['Month'] = pd.to_datetime('2021-01-01')
feb21_df['Month'] = pd.to_datetime('2021-02-01')

But if you want the file name as a string instead:

jan21_df['Month'] = "File Name (January-21)"
feb21_df['Month'] = "File Name (February-21)"

Then combine the two DataFrames:

combined = pd.concat([jan21_df, feb21_df])

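This is a proof of concept. Depending on your requirements, there are ways to automate it further.

EDIT: Based on the edit in the original post, a small addition to the loop: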

for file in files:
    current_data = pd.read_excel(path+"/"+file)
    current_data['Month'] = file
    all_months_data = pd.concat([all_months_data, current_data])
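
The loop above stores the raw file name (extension included) in the Month column. If a real date is preferred, the name can be parsed first. The following is only a sketch: the helper name month_from_filename is made up for illustration, and it assumes every file is named like "January-21.xls" with an English month name (the %B format code depends on the active locale).

```python
import os
from datetime import datetime


def month_from_filename(filename):
    """Turn a name such as 'January-21.xls' into datetime(2021, 1, 1)."""
    # Drop the extension: 'January-21.xls' -> 'January-21'
    stem, _extension = os.path.splitext(filename)
    # %B = full month name, %y = two-digit year; the day defaults to 1.
    # Raises ValueError if the name does not follow this pattern.
    return datetime.strptime(stem, "%B-%y")
```

Inside the loop, the assignment would then become current_data['Month'] = month_from_filename(file), and pandas should store the column as a datetime type.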

Answer 2

This is very similar to what I do in my day job. Here is how I would approach your problem:

from pathlib import Path

import pandas as pd

path = Path("/content/drive/MyDrive/Colab_Notebooks/sq_datas")
all_data = []

for file in path.glob("*.xls"):
    # Parse the month from the file's name
    # month will be something like "January" and "February"
    # year will be something like "20" and "21"
    # date will be something like pd.Timestamp("2021-01-01")
    month, year = file.stem.split("-")
    date = pd.Timestamp(f"{month} 1, 20{year}")
    
    # Read data from the current file
    current_data = pd.read_excel(file).assign(Month=date)

    # Append the data to the list
    all_data.append(current_data)

# Combine all data from the list into a single DataFrame
all_data = pd.concat(all_data)
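
The snippet above stops at the combined DataFrame, while the question also asks for a .csv file. One possible way to wrap the whole job in a function is sketched below. The function name combine_monthly_files and its reader parameter are assumptions made for this illustration (the parameter only exists so the Excel reader can be swapped out), and the file names are assumed to follow the "January-21.xls" pattern with English month names.

```python
from pathlib import Path

import pandas as pd


def combine_monthly_files(folder, output_csv, reader=pd.read_excel):
    """Read every '<Month>-<yy>.xls' file in `folder`, tag each row with
    the month taken from the file name, and write everything to `output_csv`."""
    frames = []
    for file in Path(folder).glob("*.xls"):
        # 'January-21' -> Timestamp('2021-01-01'); other names raise ValueError
        month = pd.to_datetime(file.stem, format="%B-%y")
        frames.append(reader(file).assign(Month=month))

    # glob() gives no ordering guarantee, so sort chronologically;
    # mergesort is stable and keeps the row order within each file
    combined = pd.concat(frames, ignore_index=True)
    combined = combined.sort_values("Month", kind="mergesort").reset_index(drop=True)

    combined.to_csv(output_csv, index=False)
    return combined
```

With the folder from the question, a call might look like combine_monthly_files(path, path / "all_months.csv"). Because only "*.xls" files are globbed, a CSV saved into the same folder is not picked up again on a later run.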

python

csv

glob

date

pandas