How to extract text from a file by id?
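I have two files. One contains ids, and the other contains a sentence for each id, but the ids differ slightly between the files, as in this example:
File 1: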
111_3232
111_ewe2
111_3434
222_3843h
222_39092
File 2:
111 some_text_1 some_text_1
222 some_text_2 some_text_2
I need to produce a file that pairs each id with its sentence, like this:
111_3232 some_text_1 some_text_1
111_ewe2 some_text_1 some_text_1
111_3434 some_text_1 some_text_1
222_3843h some_text_2 some_text_2
222_39092 some_text_2 some_text_2
I tried this code:
import os
f = open("id","r")
ff = open("result","w")
fff = open("sentences.txt","r")
List = fff.readlines()
i = 0
for line_id in f.readlines():
    for line_sentence in range(len(List)):
        if line_id in List[i]:
            ff.write(line_sentence)
        else:
            i += 1
But I got:
if line_id in List[i]:
IndexError: list index out of range
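because I get the whole line from file 2, not just the id. Is there a better way to do this than what I did?
Edit
I tried using pandas, but I'm not very familiar with it: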
df = pd.read_csv('sentence.csv')
for line_id in f.readline():
    for line_2 in df.iloc[:, 0]:
        for (idx, row) in df.iterrows():
            if line_id in line_2:
                ff.write(str(row) + '\n')
            else:
                ff.write("empty" + '\n')
But I got the wrong data, because I couldn't match up the correct rows.
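One way to get the result is to store each id from the sentences file together with its sentence as a key-value pair in a dictionary, then iterate over the contents of the id file and look up each file_id by its prefix.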
sentences_dict = {}
# read all sentences into a dictionary as key value pair
with open("sentences.txt", "r") as sentences_file:
    for line in sentences_file.read().splitlines():
        split_lines = line.split(" ")
        sentences_dict.update({split_lines[0].strip(): " ".join(split_lines[1:])})
result_file = open("result.txt", "w")
# iterate over id file and match the starting text
with open("id.txt", "r") as id_file:
    for file_id in id_file.read().splitlines():
        # the part before "_" is the key used in sentences.txt
        txt = sentences_dict.get(file_id.split("_")[0], "")
        # a space separates the id from its sentence
        result_file.write(f"{file_id} {txt}\n")
result_file.close()
Make sure you always close files explicitly, unless you open them with the with keyword.
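As a minimal sketch of the same dictionary lookup with every file opened through with, so nothing has to be closed by hand. The function name merge_ids and its three path parameters are made up for illustration, and the sketch assumes each sentence line starts with the id followed by a single space:

```python
def merge_ids(id_path, sentences_path, result_path):
    """Write '<full_id> <sentence>' for each line of the id file (hypothetical helper)."""
    sentences = {}
    with open(sentences_path, "r") as sentences_file:
        for line in sentences_file:
            # split once: first token is the key, the remainder is the sentence
            key, _, text = line.strip().partition(" ")
            if key:  # skip blank lines
                sentences[key] = text

    # both files are closed automatically when the with block ends
    with open(id_path, "r") as id_file, open(result_path, "w") as result_file:
        for file_id in id_file.read().split():
            # the part before "_" is the key; ids without a sentence get no text
            text = sentences.get(file_id.split("_")[0], "")
            result_file.write(f"{file_id} {text}".rstrip() + "\n")
```

Calling merge_ids("id.txt", "sentences.txt", "result.txt") would write one line per id. An id whose prefix is missing from the sentences file is written on its own, without trailing text.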
A basic approach:
with open('file1.txt', 'r') as fd1, open('file2.txt', 'r') as fd2:
    lines1 = fd1.read().split() # remove \n
    lines2 = fd2.readlines()
new_text = ''
for l1 in lines1:
    for id_, t1, t2 in (l.split() for l in lines2):
        if l1.startswith(id_):
            new_text += f'{l1} {t1} {t2}\n'
with open('file3.txt', 'w') as fd:
    fd.write(new_text.strip())