How to extract text from one file based on text in another file using Python

Question

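I'm new to Python. I searched the other questions here but couldn't find the exact situation I'm running into.

I need to read the contents of file A and pull the matching lines from file B.

I know how to do this in PowerShell, but it is very slow on large files, and I'm trying to learn Python.

File A contains only loan numbers, 8 to 10 digits per line. File A can contain anywhere from 1 to thousands of lines.

File B can also contain 1 to thousands of lines and has more data on each line, but every line starts with the same 8 to 10 digit loan number.

I need to read file A, find the matching lines in file B, and write those matching lines to a new file (all of them are text files).

Sample contents of file A (no spaces, one loan per line):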

272991
272992
272993

Sample contents of file B:

272991~20210129~\Serv1\LOC75309753066182991.pdf~0
272992~20210129~\Serv1\LOC75309753066182992.pdf~0
272993~20210129~\Serv1\LOC75309753066182993.pdf~0

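Could anyone help by pointing me in the right direction or, better yet, providing a working solution?

Here is what I have tried so far, but all it does is create a new PulledLoans.txt file with nothing in it: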

import os
# os.system('cls')
os.chdir(r'C:\Temp')
print(os.getcwd())
# read file
loanFile = 'Loans.txt'
SourceFile = 'Orig.txt'
NewFile = 'PulledLoans.txt'

with open(loanFile, 'r') as content, open(SourceFile, 'r') as source:
    # print(content.readlines())
    for loan in content:
        # print(loan, end='')
        if loan in source:
            print('found loan')

with open(SourceFile) as dfile, open(loanFile) as ifile:
    lines = "\n".join(set(dfile.read().splitlines()) & set(ifile.read().splitlines()))
    print(lines)
    
with open(NewFile, 'w') as ofile:
    ofile.write(lines)

Answer 1

First, read everything in fileB into a dictionary whose keys are the identifiers and whose values are the full lines:

file_b_data = dict()

with open("fileB") as f_b:
    for line in f_b:
        line = line.strip() # Remove whitespace at start and end
        if not line:
            continue # If the line is blank, skip

        row = line.split("~") # Split by ~
        identifier = row[0]   # First element is the identifier
        file_b_data[identifier] = line # Set the value of the dictionary
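To make the dictionary step concrete, here is a minimal sketch that runs the same logic on in-memory sample data. The helper name `build_index` and the `io.StringIO` stand-in for fileB are my own choices for illustration, not part of the original files. One thing to keep in mind: if fileB has several lines with the same loan number, a plain dict keeps only the last one.

```python
import io

def build_index(lines, sep="~"):
    """Map the first field of each non-blank line to the full line."""
    index = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        identifier = line.split(sep, 1)[0]  # text before the first separator
        index[identifier] = line  # a repeated identifier overwrites the earlier line
    return index

# Sample data shaped like file B in the question (hypothetical stand-in for the real file)
sample_b = io.StringIO(
    "272991~20210129~\\Serv1\\LOC75309753066182991.pdf~0\n"
    "\n"
    "272992~20210129~\\Serv1\\LOC75309753066182992.pdf~0\n"
)
index = build_index(sample_b)
print(sorted(index))  # ['272991', '272992']
```

Lookups such as `"272991" in index` are then constant-time on average, which is what makes this faster than rescanning fileB for every loan.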

Next, read the lines of fileA and look up the matching values in the dictionary:

with open("fileA") as f_a, open("outfile", "w") as f_w:
    for identifier in f_a:
        identifier = identifier.strip()
        if not identifier:
            continue
        if identifier in file_b_data: # Check that the identifier exists in previously read data
            out_line = file_b_data[identifier] + "\n" # Get the value from the dict
            f_w.write(out_line) # Write it to output file
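A variant worth considering, sketched here under my own assumptions: load the loan numbers from fileA into a set and stream fileB once. This is not the dict approach above but a swapped-in alternative. It keeps fileB's line order, keeps repeated loan numbers, and never holds all of fileB in memory. The function name `pull_matching_lines` and the temporary demo files are hypothetical.

```python
import os
import tempfile

def pull_matching_lines(ids_path, source_path, out_path, sep="~"):
    """Copy lines of source_path whose first field appears in ids_path."""
    with open(ids_path) as f_ids:
        # Set of non-blank identifiers for fast membership tests
        wanted = {line.strip() for line in f_ids if line.strip()}

    written = 0
    with open(source_path) as f_src, open(out_path, "w") as f_out:
        for line in f_src:
            identifier = line.split(sep, 1)[0].strip()
            if identifier in wanted:
                # Make sure every written line ends with a newline
                f_out.write(line if line.endswith("\n") else line + "\n")
                written += 1
    return written

# Demo with temporary files standing in for fileA and fileB
workdir = tempfile.mkdtemp()
ids_path = os.path.join(workdir, "fileA.txt")
source_path = os.path.join(workdir, "fileB.txt")
out_path = os.path.join(workdir, "outfile.txt")

with open(ids_path, "w") as f:
    f.write("272991\n272993\n")
with open(source_path, "w") as f:
    f.write("272991~20210129~\\Serv1\\LOC75309753066182991.pdf~0\n")
    f.write("272992~20210129~\\Serv1\\LOC75309753066182992.pdf~0\n")
    f.write("272993~20210129~\\Serv1\\LOC75309753066182993.pdf~0\n")

count = pull_matching_lines(ids_path, source_path, out_path)
with open(out_path) as f:
    result = f.read()
print(count)  # 2
```

The output follows fileB's order rather than fileA's. If the order of fileA matters, the dict approach above is the better fit.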

Alternatively, you can use the pandas module to read all of fileA and fileB into dataframes and then find the right rows.

import pandas as pd

file_b_data = pd.read_csv("fileB.txt", sep="~", names=["identifier", "date", "path", "something"], index_col=0)

This gives us this dataframe:

identifier  date      path                             something
272991      20210129  \Serv1\LOC75309753066182991.pdf  0
272992      20210129  \Serv1\LOC75309753066182992.pdf  0
272993      20210129  \Serv1\LOC75309753066182993.pdf  0

The same for fileA (I removed 272992 to show that it really works):

file_a_data = pd.read_csv("fileA.txt", names=["identifier"])

This gives us:

   identifier
0      272991
1      272993

Then, look up these indices in file_b_data:

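wanted_ids = file_a_data['identifier']  # column name must match the one passed to read_csv
wanted_rows = file_b_data.loc[wanted_ids, :]  # recent pandas raises KeyError if an ID is not in fileB
wanted_rows.to_csv("out_file.txt", sep="~", header=False)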

This writes the following file (note that the 272992 line is missing because it is not in fileA):

272991~20210129~\Serv1\LOC75309753066182991.pdf~0
272993~20210129~\Serv1\LOC75309753066182993.pdf~0
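One caveat with the `.loc` lookup: in recent pandas versions it raises a `KeyError` when fileA contains a loan number that does not appear in fileB. A possible workaround, sketched here with in-memory sample data instead of the real files, is to filter with `Series.isin`. Reading everything with `dtype=str` is my own assumption, so that loan numbers with leading zeros survive unchanged. The ID 999999 is made up to show the missing-loan case.

```python
import io
import pandas as pd

# In-memory stand-ins for fileB and fileA
file_b_text = (
    "272991~20210129~\\Serv1\\LOC75309753066182991.pdf~0\n"
    "272992~20210129~\\Serv1\\LOC75309753066182992.pdf~0\n"
    "272993~20210129~\\Serv1\\LOC75309753066182993.pdf~0\n"
)
file_a_text = "272991\n272993\n999999\n"  # 999999 has no match in file B

columns = ["identifier", "date", "path", "something"]
file_b_data = pd.read_csv(io.StringIO(file_b_text), sep="~", names=columns, dtype=str)
file_a_data = pd.read_csv(io.StringIO(file_a_text), names=["identifier"], dtype=str)

# Boolean mask: True for rows of file B whose identifier is listed in file A
mask = file_b_data["identifier"].isin(file_a_data["identifier"])
wanted_rows = file_b_data[mask]

# Write in the original "~"-separated layout, without header or index column
out = io.StringIO()
wanted_rows.to_csv(out, sep="~", header=False, index=False)
matched_lines = out.getvalue().splitlines()
print(len(matched_lines))  # 2
```

With real files, the `io.StringIO(...)` arguments would be replaced by the file paths, and `out` by the output file name.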

Tags: python, iteration, matching, string-matching