How to read a large text file into a DataFrame from a .txt file in Python
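I have a large text file containing the names of several different people along with long paragraphs of statements. The file is in .txt format, and I am trying to split the names and statements into two separate columns of a DataFrame.
The data is in this format: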
Harvey: I’m inclined to give you a shot. But what if I decide to go the other way?
Mike: I’d say that’s fair. Sometimes I like to hang out with people who aren’t that bright, you know, just to see how the other half lives.
Mike in the club
(mike speaking to jessica.)
Jessica: How are you mike?
Mike: good!
.....
....
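and so on.
The text file has a length of 4 million.
In the output, I need a DataFrame with a name column containing each speaker's name and a statement column containing that person's statement.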
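If the format is always "name: single line with no colon", you can try the following (a multi-character separator needs the Python parsing engine):
import pandas as pd
df = pd.read_csv('untitled.txt', sep=': ', header=None, engine='python')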
Or do it manually:
import pandas as pd

file_contents = []
current_name = ""
current_dialogue = ""
with open("untitled.txt", "r") as f:
    for line in f:
        splitted_line = line.split(": ")
        if len(splitted_line) > 1:
            # This line starts a new "name: ..." entry,
            # so first save the dialogue collected so far.
            if current_name:
                file_contents.append([current_name, current_dialogue])
            # Then switch to the name just encountered.
            current_name = splitted_line.pop(0)
            current_dialogue = ""
        # Lines without "name: " continue the current dialogue.
        current_dialogue += ": ".join(splitted_line)
# Add the last dialogue.
file_contents.append([current_name, current_dialogue])
df = pd.DataFrame(file_contents, columns=["name", "statement"])
df
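The same loop can also be sketched as a generator, so that a file with millions of lines is processed one line at a time. This is only a minimal sketch: the function name `iter_dialogues` and the sample lines are made up for illustration, and it assumes that any line without a "name: " prefix (such as the stage directions in the question) belongs to the previous speaker's statement. A continuation line that happens to contain ": " would still be mistaken for a new speaker.

```python
def iter_dialogues(lines):
    """Yield (name, statement) pairs from an iterable of text lines."""
    current_name = None
    current_parts = []
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        name, sep, rest = line.partition(": ")
        if sep:
            # A new "name: text" line: emit the previous speaker's statement.
            if current_name is not None:
                yield current_name, " ".join(current_parts)
            current_name, current_parts = name, [rest]
        elif current_name is not None:
            # No "name: " prefix: treat it as a continuation (assumption).
            current_parts.append(line)
    # Emit the final statement, if any speaker was seen.
    if current_name is not None:
        yield current_name, " ".join(current_parts)


# Hypothetical sample modeled on the data in the question.
sample_lines = [
    "Harvey: I'm inclined to give you a shot.\n",
    "Mike: I'd say that's fair.\n",
    "Mike in the club\n",
    "(mike speaking to jessica.)\n",
    "Jessica: How are you mike?\n",
    "Mike: good!\n",
]
rows = list(iter_dialogues(sample_lines))
```

With a real file, the open file object can be passed in place of `sample_lines`, and the resulting list of pairs can then be handed to `pd.DataFrame(rows, columns=["name", "statement"])`.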
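If you read the file line by line, you can use something like this to separate the speaker from the spoken text without using regular expressions.
def find_speaker_and_text_from_line(line):
    # Split on the first ": " only; any later ": " stays in the text.
    name, _, rest = line.partition(": ")
    return name, rest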
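One thing to watch for: for a line with no ": " in it, such as "Mike in the club", the function above returns the whole line as the name and an empty text. A possible variant, sketched here under the assumption that such lines should be reported as having no speaker, returns `None` for the name instead. The name `split_speaker` and the example lines are hypothetical.

```python
def split_speaker(line):
    """Return (name, text), or (None, line) when there is no "name: " prefix."""
    line = line.strip()
    name, sep, text = line.partition(": ")
    if not sep:
        # No speaker prefix found; hand the line back unchanged.
        return None, line
    return name, text


examples = [
    "Mike: good!\n",
    "Mike in the club\n",
    "Harvey: one thing: listen.\n",
]
results = [split_speaker(line) for line in examples]
```

The caller can then decide whether a line without a speaker should be appended to the previous statement or dropped.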