How to create a for-loop that does not overwrite the output xml?

Question

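I have a huge directory of PDF files that I need to parse into XML files. Those XML files then need to be converted to xlsx (using a pandas DataFrame). I have already written the code for the latter and it works, but I am stuck on this for-loop.

Here is the loop: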

import io
from xml.etree import ElementTree
from pprint import pprint
import os
from os.path import isfile, join
import pandas as pd
from os import listdir

directory = '/jupyter/pdf_script_test/pdf_files_test'
i = 1
for filename in os.listdir(directory):
    print(filename)
    if filename.endswith('.pdf'):
        pathname = os.path.join(directory, filename)
        # attempt to assign variable name
        filename = 'new_output%s' %i
        os.system('dumppdf.py -a' + pathname + '>new_output.xml')
        i = i + 1
else:
    print('invalid pdf file')

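I can see right away that every iteration of the loop overwrites "new_output.xml", so only the output of the last PDF processed survives. I have been trying to find a way to assign a variable name to each output file, or to create a nested loop that solves the problem. My biggest question is how to incorporate dumppdf.py into this loop.

Perhaps a nested loop that looks something like this: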

# code from above here...
data = os.system('dumppdf.py -a' + pathname) # etc..
with open('data' + str(i) + '.xml', 'w') as outfile:
    f.write()

Answer 1

Here is how I ended up solving the problem:

import io
from xml.etree import ElementTree
from pprint import pprint
import os
from os.path import isfile, join
import pandas as pd
from os import listdir

directory = '/jupyter/pdf_script_test/pdf_files_test/'
patient_number = 1  # counter that gives each output file a unique name

# Loop over the PDF files and write a new .xml for each one.
for filename in os.listdir(directory):
    print(filename)
    if filename.endswith(".pdf"):
        pathname = os.path.join(directory, filename)
        # Run dumppdf.py on the PDF and redirect its output to <patient_number>.xml.
        # os.system returns the exit status of the command, not its output.
        status = os.system('dumppdf.py -a ' + pathname + ' > ' + str(patient_number) + '.xml')
        patient_number = patient_number + 1
    else:
        print("invalid pdf file")
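
A possible variation on the loop above, in case it is useful: building the command with string concatenation breaks on file names that contain spaces, and the numeric output names lose the link back to the source PDF. The sketch below swaps os.system for subprocess.run and names each XML file after its PDF. The function name convert_all, the out_dir argument and the command parameter are my own illustrative choices, not part of dumppdf.py; it assumes dumppdf.py is on the PATH and accepts -a as in the code above.

```python
import subprocess
import sys
from pathlib import Path


def convert_all(pdf_dir, out_dir, command=("dumppdf.py", "-a")):
    """Run `command <pdf>` for each PDF in pdf_dir and write one XML file per PDF.

    convert_all, out_dir and command are hypothetical names for this sketch.
    """
    pdf_dir, out_dir = Path(pdf_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        # Derive the output name from the PDF name so nothing is overwritten.
        xml_path = out_dir / (pdf_path.stem + ".xml")
        with open(xml_path, "wb") as outfile:
            # Passing a list avoids shell quoting problems with spaces in paths.
            result = subprocess.run([*command, str(pdf_path)], stdout=outfile)
        if result.returncode != 0:
            # Remove the partial output and report the failure.
            xml_path.unlink()
            print(f"command failed for {pdf_path.name}", file=sys.stderr)
            continue
        written.append(xml_path)
    return written
```

Calling convert_all('/jupyter/pdf_script_test/pdf_files_test/', 'xml_out') would then produce one .xml per PDF in a folder named xml_out (a made-up name), and the returned list of paths can be fed straight into the pandas step.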
