Iterating through files and passing each one through a function

Question

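I'm trying to build a budget calculator to practice Python. At the moment I'm trying to iterate over the files in a directory and pass each one through a function that extracts the data I need into a DataFrame (ready for calculations).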

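I have managed to write the function that cleans up the data, as well as a for loop that iterates over the files. However, I can't figure out how to append to the DataFrame on each iteration.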

import os
import pandas as pd
from tabula import read_pdf

#Where to look
os.chdir(r"C:\relevant\directory")
cwd = os.getcwd()

#key variables
main_df = pd.DataFrame()
pay_slip = {}
master_df = pd.DataFrame()

#Iterate over files
for file in os.listdir():
    slip_content = read_pdf(file)
    pay_slip[file] = slip_content

#Data clean up function
def get_key_info(pay_slip):
    read_dictionary = pay_slip.get(file)
    salary_str = read_dictionary["Employee"].iloc[2]
    pay_after_tax_str = read_dictionary["Tax Period"].iloc[14]
    date_format = read_dictionary["Pay Date"].iloc[0]
    salary = int(float(salary_str[1:].replace(",", "")))
    pay = int(float(pay_after_tax_str[1:].replace(",", "")))
    deductions = (salary - pay)
    df = pd.DataFrame([
        [date_format, salary, pay, deductions]
        ],
        columns=["Payment date", "Salary before tax", "take home pay", "total deductions"])
    return df

print(get_key_info(pay_slip))

When I run this code, only one file is added to the DataFrame rather than all of them.

Thanks in advance for your help.

Answer 1

You never loop over the pay_slip dictionary: get_key_info is called once and looks up a single file.


for file in os.listdir():
    slip_content = read_pdf(file)
    pay_slip[file] = slip_content

#Data clean up function
def get_key_info(pay_slip):
    read_dictionary = pay_slip.get(file)  # <= where is the `file` variable assigned?
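
To illustrate the point, here is a minimal sketch that uses made-up file names and plain strings in place of real PDFs (both are assumptions for the example). Once a for loop finishes, its loop variable keeps the last value it held, so a function that reads the global `file` only ever sees the last entry.

```python
# Stand-in for the parsed payslips; the file names and contents are made up.
pay_slip = {}
for file in ["jan.pdf", "feb.pdf", "mar.pdf"]:
    pay_slip[file] = f"contents of {file}"

def get_last_only(pay_slip):
    # `file` is not a parameter here: Python falls back to the global
    # left over from the loop above, which is still "mar.pdf".
    return pay_slip.get(file)

def get_all(pay_slip):
    # Looping inside the function visits every entry instead.
    return [pay_slip[name] for name in pay_slip]

last_only = get_last_only(pay_slip)
everything = get_all(pay_slip)
print(last_only)
print(everything)
```

This would explain why only one payslip shows up in the output of the question's code.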

Answer 2

Thanks for the help, Florian. I have fixed my directory loop as you suggested.

However, I couldn't iterate over the dictionary, because it is unhashable.
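
For anyone hitting the same error: iterating over a dictionary does not by itself require anything to be hashable. An "unhashable type" TypeError usually means an unhashable object (a list, a dict, or a DataFrame) was used as a key. The sketch below is only an illustration; it uses plain dicts with invented values as stand-ins for the parsed payslips, and the helper `to_int` is a hypothetical name.

```python
# Stand-ins for parsed payslips, keyed by file name (all values invented).
pay_slip = {
    "jan.pdf": {"salary": "$2,500.00", "pay": "$2,000.00"},
    "feb.pdf": {"salary": "$2,500.00", "pay": "$1,950.50"},
}

def to_int(amount):
    # Drop the currency symbol and thousands separators, then truncate to int,
    # mirroring the int(float(...)) conversion used in the question.
    return int(float(amount[1:].replace(",", "")))

summary = {}
for name, slip in pay_slip.items():  # .items() yields (key, value) pairs
    summary[name] = to_int(slip["salary"]) - to_int(slip["pay"])

print(summary)

# Using an unhashable object (here a list) as a key raises the TypeError.
try:
    bad = {["jan.pdf"]: 1}
except TypeError as err:
    message = str(err)
    print(message)
```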

I'll post my code below in case anyone else runs into the same problem.

import os
import pandas as pd
from tabula import read_pdf

#Where to look
os.chdir(r"C:\relevant\directory")
cwd = os.getcwd()

#Data clean up function: takes one parsed payslip, returns a one-row DataFrame
def get_key_info(get_data):
    salary_str = get_data["Employee"].iloc[2]
    pay_after_tax_str = get_data["Tax Period"].iloc[14]
    date_format = get_data["Pay Date"].iloc[0]
    # Strip the currency symbol and thousands separators before converting
    salary = int(float(salary_str[1:].replace(",", "")))
    pay = int(float(pay_after_tax_str[1:].replace(",", "")))
    deductions = (salary - pay)
    df = pd.DataFrame([
        [date_format, salary, pay, deductions]
        ],
        columns=["Payment date", "Salary before tax", "take home pay", "total deductions"])
    return df

#Iterate over files, collecting one row per payslip
rows = []
for f in os.listdir():
    get_data = read_pdf(f)
    rows.append(get_key_info(get_data))

#DataFrame.append was removed in pandas 2.0, so combine the rows with pd.concat
master_df = pd.concat(rows, ignore_index=True)

print(master_df)

Here the variable get_data is reassigned on each iteration of the for loop, and the row extracted from it is then added to master_df.
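
As a side note, `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, and `os.listdir()` returns every entry in the directory, not only PDFs. The sketch below shows one way to handle both, under stated assumptions: `fake_read_pdf` is a hypothetical stand-in for tabula's `read_pdf`, the file names and amounts are invented, and `get_key_info` is simplified to read from a plain dict.

```python
import pandas as pd

def get_key_info(table):
    # Hypothetical, simplified stand-in for the clean-up function above:
    # it expects a dict with "date", "salary" and "pay" entries.
    salary = int(float(table["salary"][1:].replace(",", "")))
    pay = int(float(table["pay"][1:].replace(",", "")))
    return pd.DataFrame(
        [[table["date"], salary, pay, salary - pay]],
        columns=["Payment date", "Salary before tax", "take home pay", "total deductions"],
    )

def fake_read_pdf(name):
    # Stand-in for tabula's read_pdf so the sketch runs without real PDFs.
    data = {
        "jan.pdf": {"date": "31/01/2020", "salary": "$2,500.00", "pay": "$2,000.00"},
        "feb.pdf": {"date": "28/02/2020", "salary": "$2,500.00", "pay": "$1,950.50"},
    }
    return data[name]

file_names = ["jan.pdf", "notes.txt", "feb.pdf"]  # invented directory listing

# Keep only PDFs, build one row per file, then concatenate once at the end.
rows = [
    get_key_info(fake_read_pdf(name))
    for name in file_names
    if name.lower().endswith(".pdf")
]
master_df = pd.concat(rows, ignore_index=True)
print(master_df)
```

Collecting the rows in a list and calling `pd.concat` once also avoids copying the growing DataFrame on every iteration.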

Tags: python, pandas, tabula