Convert column data from text file into nested lists in Python?
I have a txt file that contains sentences and labels written in columns, as follows:
O are
O there
O any
O good
B-GENRE romantic
I-GENRE comedies
O out
B-YEAR right
I-YEAR now

O show
O me
O a
O movie
O about
B-PLOT cars
I-PLOT that
I-PLOT talk
I want to read the data from this txt file into two nested lists.
The desired output should look like:
labels = [['O','O','O','O','B-GENRE','I-GENRE','O','B-YEAR','I-YEAR'],['O','O','O','O','O','B-PLOT','I-PLOT','I-PLOT']]
sentences = [['are','there','any','good','romantic','comedies','out','right','now'],['show','me','a','movie','about','cars','that','talk']]
I tried the following:
with open("engtrain.bio.txt", "r") as f:
    lsta = []
    for line in f:
        lsta.append([x for x in line.replace("\n", "").split()])
But I get the following output:
[['O', 'are'],
['O', 'there'],
['O', 'any'],
['O', 'good'],
['B-GENRE', 'romantic'],
['I-GENRE', 'comedies'],
['O', 'out'],
['B-YEAR', 'right'],
['I-YEAR', 'now'],
[],
['O', 'show'],
['O', 'me'],
['O', 'a'],
['O', 'movie'],
['O', 'about'],
['B-PLOT', 'cars'],
['I-PLOT', 'that'],
['I-PLOT', 'talk']]
Update: I also tried the following:
with open("engtest.bio.txt", "r") as f:
    lines = f.readlines()
    labels = []
    sentences = []
    for l in lines:
        as_list = l.split("\t")
        labels.append(as_list[0])
        sentences.append(as_list[1].replace("\n", ""))
Unfortunately, I still get an error:
IndexError Traceback (most recent call last)
<ipython-input-66-63c266df6ace> in <module>()
6 as_list = l.strip().split("\t")
7 labels.append(as_list[0])
----> 8 sentences.append(as_list[1].replace("\n", ""))
IndexError: list index out of range
The original data comes from this link (engtest.bio or engtrain.bio): https://groups.csail.mit.edu/sls/downloads/movie/
Can you help me?
Thanks in advance.
all_labels, all_sentences = [], []
with open('inp', 'r') as f:
    lines = f.readlines()
    lines.append('')  # make sure we process the last sentence
    labels, sentences = [], []
    for line in lines:
        line = line.strip()
        if not line:  # detect the end of a sentence
            if len(labels):  # make sure we got some words here
                all_labels.append(labels)
                all_sentences.append(sentences)
                labels, sentences = [], []
            continue
        # extend the current sentence
        label, sentence = line.split()
        labels.append(label)
        sentences.append(sentence)
print(all_labels)
print(all_sentences)
Iterate over each line and split it on the tab character, starting a new inner list whenever a blank line is reached:
labels = [[]]
sentences = [[]]
with open('engtrain.bio', 'r') as f:
    for line in f.readlines():
        line = line.rstrip()
        if line:
            label, sentence = line.split('\t')
            labels[-1].append(label)
            sentences[-1].append(sentence)
        else:
            labels.append([])
            sentences.append([])
Output of labels:
[['O', 'O', 'O', 'B-ACTOR', 'I-ACTOR'], ['O', 'O', 'O', 'O', 'B-ACTOR', 'I-ACTOR', 'O', 'O', 'B-YEAR'] ...
Output of sentences:
[['what', 'movies', 'star', 'bruce', 'willis'], ['show', 'me', 'films', 'with', 'drew', 'barrymore', 'from', 'the', '1980s'] ...
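The lines in the file can be logically grouped into sections, separated by blank lines. So you actually have a two-level data structure: you need to process a list of sections, and within each section you need to process a list of lines. The text file itself is just a flat list of lines, so we need to reconstruct the two levels.
This is a very common pattern, so here is a way of coding it that you can reuse no matter what you need to do within each section: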
labels = []
sentences = []
# Prepare next section
inner_labels = []
inner_sentences = []
with open('engtrain.bio.txt') as f:
    for line in f.readlines():
        if len(line.strip()) == 0:
            # Finish previous section
            labels.append(inner_labels)
            sentences.append(inner_sentences)
            # Prepare next section
            inner_labels = []
            inner_sentences = []
            continue
        # Process line in section
        l, s = line.strip().split()
        inner_labels.append(l)
        inner_sentences.append(s)
# Finish previous section
labels.append(inner_labels)
sentences.append(inner_sentences)
To reuse it in a different situation, just redefine "Prepare next section", "Process line in section", and "Finish previous section".
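One way to make that reuse concrete is to move the section-splitting into a generator, so the per-section work lives entirely in the caller. The following is only a sketch: the name `iter_sections` and the in-memory `sample` lines are made up for illustration, and unlike the loop above it skips empty sections (for example those caused by consecutive or trailing blank lines).

```python
from typing import Iterable, Iterator, List


def iter_sections(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield lists of stripped, non-blank lines, split at blank lines."""
    section: List[str] = []      # prepare next section
    for line in lines:
        line = line.strip()
        if not line:
            if section:          # finish previous section (skip empty ones)
                yield section
            section = []         # prepare next section
            continue
        section.append(line)     # process line in section
    if section:                  # finish the last section
        yield section


# Hypothetical sample standing in for the lines of the real file.
sample = [
    "O\tshow\n", "O\tme\n", "\n",
    "B-PLOT\tcars\n", "I-PLOT\tthat\n",
]

labels, sentences = [], []
for section in iter_sections(sample):
    pairs = [line.split() for line in section]
    labels.append([label for label, _ in pairs])
    sentences.append([word for _, word in pairs])

print(labels)
print(sentences)
```

Because an open file object is itself an iterable of lines, `iter_sections(f)` should work the same way inside a `with open(...) as f:` block.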
There may be more Pythonic ways to preprocess the list of lines and so on, but this is a reliable pattern that gets the job done.
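As one possible example of a more compact approach, `itertools.groupby` can group consecutive lines by whether they are blank. This is a sketch under the assumption that every non-blank line holds exactly one label and one word separated by whitespace; the function name `read_bio` and the `io.StringIO` sample are made up for illustration and stand in for the real file.

```python
import io
from itertools import groupby


def read_bio(lines):
    """Split an iterable of 'label word' lines into per-sentence lists."""
    labels, sentences = [], []
    stripped = (line.strip() for line in lines)
    # Consecutive lines share a group while bool(line) stays the same,
    # so each run of non-blank lines becomes one sentence.
    for has_text, group in groupby(stripped, key=bool):
        if not has_text:
            continue  # a run of blank lines only separates sentences
        pairs = [line.split() for line in group]
        labels.append([pair[0] for pair in pairs])
        sentences.append([pair[1] for pair in pairs])
    return labels, sentences


# Hypothetical in-memory stand-in for open('engtrain.bio').
sample_file = io.StringIO("O\tare\nO\tthere\n\nB-PLOT\tcars\nI-PLOT\tthat\n")
bio_labels, bio_sentences = read_bio(sample_file)
print(bio_labels)
print(bio_sentences)
```

Runs of several blank lines collapse into a single separator here, so no empty inner lists are produced.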