How to read the contents of txt files in different directories and rename other files accordingly
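I have just started using Python 3 and ran into the following problem:
For my thesis I downloaded a large number of PDFs from different journals, but they are all named after their DOI instead of in the format "Author (Year) - Title".
The articles are stored in separate directories by journal name and volume, for example: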
/Journal 1/
    /Vol. 1/
        file1.pdf
        file1.txt
        file2.pdf
        file2.txt
        filen.pdf
        filen.txt
    /Vol. 2/
        file1.pdf
        file1.txt
/Journal 2/
    ...
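Since I don't know how to read the contents of a PDF with Python, I wrote a very short bash script that converts the PDFs to plain TXT files. Each pdf and its txt file share the same name and differ only in the file extension.
I want to rename all the PDF files, and fortunately the running text of every file contains a string I can use. This variable string sits between two fixed strings:
"Cite this article as: " AUTHOR/YEAR/TITLE ", Journal name".
How can I get Python to go into each directory, read the contents of the TXT/PDF, extract the variable string between the two fixed strings, and then rename the corresponding PDF file?
I would be very grateful if anyone knows how to do this with Python 3.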
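The extraction step asked about here can be sketched on its own. This is only a minimal illustration, assuming the two fixed strings appear literally in the text; the helper name `extract_between` and the sample sentence are made up for the example and are not taken from any of the actual articles.

```python
import re


def extract_between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    # re.escape treats both markers as literal text; (.*?) is non-greedy, so the
    # match stops at the first occurrence of `end`; re.DOTALL lets it span lines.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


# Hypothetical sample text, invented for illustration only.
sample = "... Cite this article as: Doe (2010) Some Title, Journal name, 17(2) ..."
print(extract_between(sample, "Cite this article as: ", ", Journal name"))
```

If the converted text contains stray line breaks or missing spaces inside the markers, a hand-written pattern with `\s*` between the words is probably more forgiving than escaped literals.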
Finally got it working:
#__author__ = 'Telefonmann'
# -*- coding: utf-8 -*-
import os, re, ntpath, shutil

for root, dirs, files in os.walk(os.getcwd()):
    for file in files: # loops through directories and files
        if file.endswith('.txt'): # only processes txt files
            full_path = ntpath.splitdrive(ntpath.join(root, file))[1]
            # builds correct path under Win 7 (and probably other NT systems)
            with open(full_path, 'r', encoding='utf-8') as f:
                content = f.read().replace('\n', '') # remove newline
            r = re.compile(r'To\s*cite\s*this\s*article:\s*(.*?),\s*Journal\s*of\s*Quantitative\s*Linguistics\s*,')
            # raw string, so the \s escapes reach the regex engine unchanged
            m = r.search(content)
            # finds substring in between "To cite this article: " and "Journal of Quantitative Linguistics,"
            # also finds typos like "Journal ofQuantitative ..."
            if m:
                full_title = m.group(1)
                print("full_title: {0}".format(full_title))
                full_title = (full_title.replace('<','') # removes/replaces forbidden characters in Windows file names
                              .replace('>','')
                              .replace(':',' -')
                              .replace('"','')
                              .replace('/','')
                              .replace('\\','') # a backslash has to be escaped inside a string literal
                              .replace('|','')
                              .replace('?','')
                              .replace('*',''))
                pdf_name = ntpath.splitext(full_path)[0] + '.pdf'
                # since txt and pdf files only differ in their extension, I swap just the extension;
                # a plain replace('txt','pdf') would also change "txt" anywhere else in the path
                print('File: '+ file)
                print('Full Path: ' + full_path)
                print('Full Title: ' + full_title)
                print('PDF Name: ' + pdf_name)
                print('....................................')
                # for troubleshooting
                dirname = ntpath.dirname(pdf_name)
                new_path = ntpath.join(dirname, "{0}.pdf".format(full_title))
                if ntpath.exists(pdf_name): # the txt was just opened, so the pdf is the path worth checking
                    print("all paths found")
                    shutil.copy(pdf_name, new_path)
                    # makes a copy of the pdf file with the new name in the respective directory
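A side note on the renaming step: the file-name clean-up and the txt-to-pdf path can also be sketched with `pathlib` and a single character class. This is my own substitution, not what the script above uses, and the helper names `safe_title` and `target_pdf` as well as the sample path and title are hypothetical.

```python
import re
from pathlib import Path

# Characters that Windows does not allow in file names (the colon is handled separately).
FORBIDDEN = re.compile(r'[<>"/\\|?*]')


def safe_title(title):
    """Turn an extracted citation into a string usable as a Windows file name."""
    title = title.replace(":", " -")   # keep a readable separator instead of the colon
    title = FORBIDDEN.sub("", title)   # drop the remaining forbidden characters
    return " ".join(title.split())     # collapse runs of whitespace


def target_pdf(txt_path, title):
    """Return (existing PDF path, new PDF path) for a given TXT file."""
    pdf_path = Path(txt_path).with_suffix(".pdf")   # swaps only the extension
    new_path = pdf_path.with_name(safe_title(title) + ".pdf")
    return pdf_path, new_path


# Hypothetical example values, invented for illustration only.
old, new = target_pdf("Journal 1/Vol. 1/file1.txt", 'Doe (2010): What is "txt"?')
print(old)
print(new)
```

`with_suffix` only touches the extension of the last path component, so a directory or file name that happens to contain "txt" is left alone. The actual copy would still be done with `shutil.copy(old, new)` after checking that `old` exists.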