How to create a for-loop that does not overwrite the output XML?
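I have a huge directory of PDF files that I need to parse into XML files. Those XML files then need to be converted to xlsx (using a pandas DataFrame). I have already written the code for the latter and it works, but I am stuck on this for-loop.
Here is the loop: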
import io
from xml.etree import ElementTree
from pprint import pprint
import os
from os.path import isfile, join
import pandas as pd
from os import listdir
directory = '/jupyter/pdf_script_test/pdf_files_test'
i = 1
for filename in os.listdir(directory):
    print(filename)
    if filename.endswith('.pdf'):
        pathname = os.path.join(directory, filename)
        # attempt to assign variable name
        filename = 'new_output%s' %i
        os.system('dumppdf.py -a' + pathname + '>new_output.xml')
        i = i + 1
    else:
        print('invalid pdf file')
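So I can quickly see that on every iteration of the loop, it overwrites the "new_output.xml" written for the previous PDF file. I tried to find a way to assign a variable name, or to create a nested loop that would help solve the problem. My biggest issue is how to incorporate dumppdf.py into this loop.
Maybe a nested loop that looks something like this: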
# code from above here...
data = os.system('dumppdf.py -a' + pathname) # etc..
with open('data' + str(i) + '.xml', 'w') as outfile:
    f.write()
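One caveat with the idea above: `os.system` returns the command's exit status, not its output, so `data` would hold an integer rather than XML. A minimal sketch of capturing the output instead, using `subprocess.run` from the standard library. The helper name `run_and_capture` is hypothetical, and the Python one-liner below is only a stand-in for `dumppdf.py -a <pathname>`.

```python
import subprocess
import sys


def run_and_capture(args):
    """Run a command (given as a list of arguments) and return its stdout as text."""
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout


# Stand-in command; inside the loop this would be ['dumppdf.py', '-a', pathname].
xml_text = run_and_capture([sys.executable, '-c', 'print("<pdf></pdf>")'])

i = 1
# Numbered file name, as in the sketch above, so each iteration gets its own file.
with open('data' + str(i) + '.xml', 'w') as outfile:
    outfile.write(xml_text)
```

Passing the arguments as a list also sidesteps the missing space after `-a` in the string-concatenated command.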
Here is how I ended up solving the problem:
import io
from xml.etree import ElementTree
from pprint import pprint
import os
from os.path import isfile, join
import pandas as pd
from os import listdir
directory = '/jupyter/pdf_script_test/pdf_files_test/'
patient_number = 1 #variable name
#loop over pdf files and write a new .xml for each pdf file.
for filename in os.listdir(directory):
    print(filename)
    if filename.endswith(".pdf"):
        pathname = os.path.join(directory, filename)
        #Run dumppdf.py on the pdf file and write the .xml using the assigned variable
        data = os.system('dumppdf.py -a ' + pathname + ' > ' + str(patient_number) + '.xml')
        patient_number = patient_number + 1
    else:
        print("invalid pdf file")
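A variation on the same idea, offered as a hedged sketch rather than a drop-in replacement: it names each XML file after its PDF instead of a counter, and passes the arguments as a list to `subprocess.run` so that paths containing spaces need no shell quoting. The function name `convert_pdfs` and its `out_dir` and `command` parameters are assumptions made for illustration; it also assumes `dumppdf.py` is executable on the `PATH`, as in the loop above.

```python
import subprocess
from pathlib import Path


def convert_pdfs(directory, out_dir, command=("dumppdf.py", "-a")):
    """Run `command <pdf>` for every PDF in `directory`, writing one XML file per PDF."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        # report.pdf -> report.xml, so every PDF gets its own output file.
        xml_path = out_dir / (pdf_path.stem + ".xml")
        with open(xml_path, "w") as outfile:
            # stdout is sent straight into the file; check=True raises on failure.
            subprocess.run([*command, str(pdf_path)], stdout=outfile, check=True)
        written.append(xml_path)
    return written
```

It could be called as `convert_pdfs('/jupyter/pdf_script_test/pdf_files_test', 'xml_output')`, where `xml_output` is a made-up output folder. Keeping the PDF's name in the XML file name makes it easier to trace a row in the final spreadsheet back to its source file.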