Convert column data from text file into nested lists in Python?

I have a txt file containing sentences and their labels written in columns, one word per line, like this:

O   are
O   there
O   any
O   good
B-GENRE romantic
I-GENRE comedies
O   out
B-YEAR  right
I-YEAR  now

O   show
O   me
O   a
O   movie
O   about
B-PLOT  cars
I-PLOT  that
I-PLOT  talk

I want to read the data from this txt file into two nested lists. The desired output should look like this:

labels = [['O','O','O','O','B-GENRE','I-GENRE','O','B-YEAR','I-YEAR'],['O','O','O','O','O','B-PLOT','I-PLOT','I-PLOT']]
sentences = [['are','there','any','good','romantic','comedies','out','right','now'],['show','me','a','movie','about','cars','that','talk']]

I tried the following:

with open("engtrain.bio.txt", "r") as f:
  lsta = []
  for line in f:
    lsta.append([x for x in line.replace("\n", "").split()])

But I get the following output:

[['O', 'are'],
 ['O', 'there'],
 ['O', 'any'],
 ['O', 'good'],
 ['B-GENRE', 'romantic'],
 ['I-GENRE', 'comedies'],
 ['O', 'out'],
 ['B-YEAR', 'right'],
 ['I-YEAR', 'now'],
 [],
 ['O', 'show'],
 ['O', 'me'],
 ['O', 'a'],
 ['O', 'movie'],
 ['O', 'about'],
 ['B-PLOT', 'cars'],
 ['I-PLOT', 'that'],
 ['I-PLOT', 'talk']]

Update: I also tried the following:

with open("engtest.bio.txt", "r") as f:
  lines = f.readlines()
  labels = []
  sentences = []
  for l in lines:
    as_list = l.split("\t")
    labels.append(as_list[0])
    sentences.append(as_list[1].replace("\n", ""))

Unfortunately, this still raises an error:

IndexError                                Traceback (most recent call last)
<ipython-input-66-63c266df6ace> in <module>()
      6     as_list = l.strip().split("\t")
      7     labels.append(as_list[0])
----> 8     sentences.append(as_list[1].replace("\n", ""))

IndexError: list index out of range

The original data comes from this link (engtest.bio or engtrain.bio): https://groups.csail.mit.edu/sls/downloads/movie/

Can you help me?

Thanks in advance.

all_labels, all_sentences = [], []
with open('inp', 'r') as f:
    lines = f.readlines()
    lines.append('') # make sure we process the last sentence
    labels, sentences = [], []
    for line in lines:
        line = line.strip()
        if not line: # detect the end of a sentence
            if len(labels): # make sure we got some words here
                all_labels.append(labels)
                all_sentences.append(sentences)
                labels, sentences = [], []
            continue
        # extend the current sentence
        label, sentence = line.split()
        labels.append(label)
        sentences.append(sentence)

print(all_labels)
print(all_sentences)

Iterate over each line and split it on the tab character:

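labels = [[]]
sentences = [[]]
with open('engtrain.bio', 'r') as f:
    for line in f:
        line = line.rstrip()
        if line:
            label, sentence = line.split('\t')
            labels[-1].append(label)
            sentences[-1].append(sentence)
        elif labels[-1]:
            # blank line: start a new sentence (repeated blank lines are ignored)
            labels.append([])
            sentences.append([])

# drop the empty trailing entry left behind when the file ends with a blank line
if not labels[-1]:
    labels.pop()
    sentences.pop()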

Output of labels:

[['O', 'O', 'O', 'B-ACTOR', 'I-ACTOR'], ['O', 'O', 'O', 'O', 'B-ACTOR', 'I-ACTOR', 'O', 'O', 'B-YEAR'] ...

Output of sentences:

[['what', 'movies', 'star', 'bruce', 'willis'], ['show', 'me', 'films', 'with', 'drew', 'barrymore', 'from', 'the', '1980s'] ...

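The lines in the file can be logically grouped into sections, separated by blank lines. So you really have a two-level data structure: you need to process a list of sections, and within each section you need to process a list of lines. The text file itself is just a flat list of lines, so we need to reconstruct the two levels.

This is a very general pattern, so here is a way of coding it that you can reuse no matter what you need to do within each section: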

labels = []
sentences = []

# Prepare next section
inner_labels = []
inner_sentences = []

with open('engtrain.bio.txt') as f:
    for line in f:
        if len(line.strip()) == 0:
            # Finish previous section (skip empty ones caused by repeated blank lines)
            if inner_labels:
                labels.append(inner_labels)
                sentences.append(inner_sentences)
            # Prepare next section
            inner_labels = []
            inner_sentences = []
            continue
        # Process line in section
        l, s = line.strip().split()
        inner_labels.append(l)
        inner_sentences.append(s)

# Finish previous section (it is empty if the file ends with a blank line)
if inner_labels:
    labels.append(inner_labels)
    sentences.append(inner_sentences)

To reuse this in a different situation, just redefine the "Prepare next section", "Process line in section" and "Finish previous section" steps.
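One way to make that reuse concrete is to move the section-splitting into a small generator, so the per-section work lives in an ordinary loop. This is only a sketch: `iter_sections` is a hypothetical helper name, not something from a library, and the `sample` list stands in for the lines of the real file.

```python
def iter_sections(lines):
    """Yield each blank-line-separated section as a list of stripped lines."""
    section = []  # prepare next section
    for line in lines:
        line = line.strip()
        if not line:
            if section:  # finish previous section, ignoring repeated blank lines
                yield section
                section = []
            continue
        section.append(line)  # process line in section
    if section:  # finish the last section when there is no trailing blank line
        yield section


# Stand-in for the file contents; with a real file, pass the open file object.
sample = ["O\tshow\n", "O\tme\n", "\n", "B-PLOT\tcars\n", "I-PLOT\tthat\n"]

labels, sentences = [], []
for section in iter_sections(sample):
    pairs = [line.split() for line in section]
    labels.append([label for label, _ in pairs])
    sentences.append([word for _, word in pairs])

print(labels)     # [['O', 'O'], ['B-PLOT', 'I-PLOT']]
print(sentences)  # [['show', 'me'], ['cars', 'that']]
```

With the real data you would call `iter_sections(f)` inside a `with open(...) as f:` block, since a file object iterates over its lines.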

There may be more Pythonic ways to preprocess the list of lines and so on, but this is a reliable pattern that gets the job done.
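As one possible example of such a more Pythonic variant, `itertools.groupby` can split the lines into runs of blank and non-blank lines, and `zip(*...)` can then separate the two columns. This is a sketch under the assumption that every non-blank line holds exactly one label and one word; the `sample` list is a stand-in for the file's lines.

```python
from itertools import groupby

# Stand-in for the file contents (a shortened version of the question's data).
sample = [
    "O\tare\n", "O\tthere\n", "B-GENRE\tromantic\n", "I-GENRE\tcomedies\n",
    "\n",
    "O\tshow\n", "O\tme\n", "B-PLOT\tcars\n",
    "\n",
]

labels, sentences = [], []
stripped = (line.strip() for line in sample)
# key=bool groups consecutive lines by whether they are non-empty.
for has_text, group in groupby(stripped, key=bool):
    if not has_text:
        continue  # a run of blank lines only separates sentences
    # Transpose [label, word] pairs into one tuple of labels and one of words.
    label_col, word_col = zip(*(line.split() for line in group))
    labels.append(list(label_col))
    sentences.append(list(word_col))

print(labels)     # [['O', 'O', 'B-GENRE', 'I-GENRE'], ['O', 'O', 'B-PLOT']]
print(sentences)  # [['are', 'there', 'romantic', 'comedies'], ['show', 'me', 'cars']]
```

Because blank runs are skipped as whole groups, repeated or trailing blank lines do not produce empty sentences.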