How to read the contents of txt files in different directories and rename other files accordingly

Question

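I have just started using Python 3 and ran into the following problem:

I downloaded a large number of PDFs from various journals for my thesis, but they are all named by their DOI rather than in the format "Author (Year) - Title". The papers are stored in separate directories by journal name and volume, for example: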

/Journal 1/
    /Vol. 1/
        file1.pdf
        file1.txt
        file2.pdf
        file2.txt
        filen.pdf
        filen.txt
    /Vol. 2/
        file1.pdf
        file1.txt
/Journal 2/
    ...

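Because I don't know how to read the contents of a PDF with Python, I wrote a very short bash script that converts the PDFs to plain TXT files. The pdf and txt files have the same name and differ only in their file extension.

I want to rename all the PDF files, and luckily the running text of each file contains a string I can use. This variable string sits between two static strings: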

"Cite this article as: " AUTHOR/YEAR/TITLE ", Journal name".

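How can I get Python to go into each directory, read the contents of the TXT/PDF, extract the variable string between the two fixed strings, and then rename the corresponding PDF file?

I would be very grateful if anyone knows how to do this with Python 3.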

Answer 1

I finally got it working:

#__author__ = 'Telefonmann'
# -*- coding: utf-8 -*-

import os, re, ntpath, shutil

for root, dirs, files in os.walk(os.getcwd()):
    for file in files: # loops through directories and files
        if file.endswith('.txt'): # only processes txt files
            full_path = ntpath.splitdrive(ntpath.join(root, file))[1]
            # builds correct path under Win 7 (and probably other NT systems)

            with open(full_path, 'r', encoding='utf-8') as f:
                content = f.read().replace('\n', '') # remove newline

                # raw string so that \s reaches the regex engine unchanged
                r = re.compile(r'To\s*cite\s*this\s*article:\s*(.*?),\s*Journal\s*of\s*Quantitative\s*Linguistics\s*,')
                m = r.search(content)
                # finds substring in between "To cite this article: " and "Journal of Quantitative Linguistics,"
                # also finds typos like "Journal ofQuantitative ..."

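                if not m:
                    # no citation line in this file: skip it, otherwise full_title
                    # would be undefined or left over from the previous file
                    continue
                full_title = m.group(1)

            print("full_title: {0}".format(full_title))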
            full_title = (full_title.replace('<','') # removes/replaces forbidden characters in Windows file names
                .replace('>','')
                .replace(':',' -')
                .replace('"','')
                .replace('/','')
                .replace('\\','')
                .replace('|','')
                .replace('?','')
                .replace('*',''))

            pdf_name = ntpath.splitext(full_path)[0] + '.pdf'
            # txt and pdf files only differ in their extension, so swap just the extension
            # (a plain replace('txt','pdf') would also change "txt" anywhere else in the path)

            print('File: '+ file)
            print('Full Path: ' + full_path)
            print('Full Title: ' + full_title)
            print('PDF Name: ' + pdf_name)
            print('....................................')
            # for troubleshooting

            dirname = ntpath.dirname(pdf_name)
            new_path = ntpath.join(dirname, "{0}.pdf".format(full_title))

            if ntpath.exists(pdf_name): # the txt file was just read, so check the pdf instead
                print("all paths found")
                shutil.copy(pdf_name, new_path)
                # makes a copy of the pdf file with the new name in the respective directory
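
The same idea can be sketched with `pathlib`, which avoids the Windows-specific `ntpath` calls. This is only a rough sketch under some assumptions: the marker strings are the ones from the question ("Cite this article as: " and ", Journal"), the TXT files are UTF-8, and the helper names `extract_title` and `rename_pdfs` are made up for illustration. Unlike the script above, it renames the PDF in place instead of copying it, and it skips any file whose new name is already taken.

```python
import re
from pathlib import Path

# Assumed markers, taken from the question; adjust them to the real text.
CITE_RE = re.compile(r'Cite\s*this\s*article\s*as:\s*(.*?),\s*Journal', re.DOTALL)

# Characters Windows forbids in file names: ':' becomes ' -', the rest are dropped.
FORBIDDEN = str.maketrans({':': ' -', **dict.fromkeys('<>"/\\|?*', '')})


def extract_title(text):
    """Return the cleaned text between the two markers, or None if not found."""
    match = CITE_RE.search(text)
    if match is None:
        return None
    # Collapse line breaks and repeated spaces left over from the PDF conversion.
    title = ' '.join(match.group(1).split())
    return title.translate(FORBIDDEN).strip()


def rename_pdfs(top):
    """Rename each PDF under `top` whose sibling TXT contains a citation line."""
    renamed = []
    # sorted() materialises the listing before any file is renamed.
    for txt_path in sorted(Path(top).rglob('*.txt')):
        pdf_path = txt_path.with_suffix('.pdf')
        if not pdf_path.exists():
            continue
        title = extract_title(txt_path.read_text(encoding='utf-8', errors='replace'))
        if not title:
            continue
        new_path = pdf_path.with_name(title + '.pdf')
        if new_path.exists():
            continue  # never overwrite an existing file
        pdf_path.rename(new_path)
        renamed.append(new_path)
    return renamed
```

Calling `rename_pdfs('.')` from the top-level directory would process every journal and volume folder below it. It may be worth trying it on a copy of the data first, since the rename is not undone automatically.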

python

pdf

iteration

rename