How to group sentences from a text file into one structure?

Question

Given this text data: https://drive.google.com/file/d/1p34ChEAC9R7HnkyllnpCLCYrIevP4u8T/view?usp=sharing

I want to create a structure of this form:

{
  'tokens': ['Setelah', 'melalui', 'proses', 'telepon', 'yang', 'panjang', 'tutup', 'sudah', 'kartu', 'kredit', 'bca', 'Ribet'],
  'tag': ['O', 'B', 'B', 'I', 'O', 'O', 'B', 'O', 'B', 'I', 'I', 'B']
}
{
  'tokens': ['@HaloBCA', 'Saya', 'mencoba', 'mengakses', 'menu', 'm-BCA', 'saya', 'namun', 'saya', 'mendapat', 'respons', 'Fasilitas', 'Mobile', 'Banking', 'terblokir', 'bagimana', 'sih', 'padahal', 'saya', 'baru', 'coba', 'akses', 'lo'],
  'tag': ['B', 'O', 'O', 'B', 'B', 'I', 'O', 'O', 'O', 'B', 'I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
}

This is what I tried, using a dictionary:

f = open("a_testdata.txt", "r")
dicts = {}
tokens = []
tags = []

for line in f:
  if len(line.strip()) != 0:
    fields = line.split('\t')
    text = fields[0]
    tag = fields[1].strip()
    tokens.append(text)
    tags.append(tag)
    dicts['token'] = tokens
    dicts['tag'] = tags
  else:
    tokens = []
    tags = []

for key, value in dicts.items():
  print(key, value)

This only outputs the last sentence:

token ['@HaloBCA', 'Saya', 'mencoba', 'mengakses', 'menu', 'm-BCA', 'saya', 'namun', 'saya', 'mendapat', 'respons', 'Fasilitas', 'Mobile', 'Banking', 'terblokir', 'bagimana', 'sih', 'padahal', 'saya', 'baru', 'coba', 'akses', 'lo']
tag ['B', 'O', 'O', 'B', 'B', 'I', 'O', 'O', 'O', 'B', 'I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

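My question is: if a dictionary cannot be used for this, how can I combine these sentences (each sentence is separated by a blank line; see the text file) into one structure? And if it is possible, how could I use a DataFrame?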

Answer 1

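You need a list of dictionaries, because a single dictionary cannot hold the same key twice.
Before resetting the token/tag lists, you need to save them to the output list and then start a fresh dict.
Edge case: if the file does not end with a blank line, the last sentence is still pending when the loop finishes and would not be added to the list, so it has to be appended after the loop.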

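output = []
dicts = {}
tokens = []
tags = []

# Use a context manager so the file is closed automatically
with open("a_testdata.txt", "r") as f:
  for line in f:
    if len(line.strip()) != 0:
      fields = line.split('\t')
      text = fields[0]
      tag = fields[1].strip()
      tokens.append(text)
      tags.append(tag)
    elif tokens:
      # Blank line: the current sentence is complete, so save it and start a new one
      # (the tokens check skips repeated blank lines)
      dicts['token'] = tokens
      dicts['tag'] = tags
      output.append(dicts)
      dicts = {}
      tokens = []
      tags = []

# Edge case: no trailing blank line, so the last sentence is still pending
if tokens:
  dicts['token'] = tokens
  dicts['tag'] = tags
  output.append(dicts)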

for item in output:
  for key, value in item.items():
    print(key, value)
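
As an alternative, the same grouping can be sketched with itertools.groupby from the standard library, which splits the lines into runs of blank and non-blank lines. This is only a minimal sketch: the function name group_sentences and the sample lines are made up for illustration, and it assumes every non-blank line has the form token<TAB>tag. It uses the 'tokens' key from the structure shown in the question.

```python
from itertools import groupby

def group_sentences(lines):
    """Group tab-separated token/tag lines into one dict per sentence."""
    sentences = []
    # groupby yields consecutive runs of blank lines and non-blank lines
    for is_blank, block in groupby(lines, key=lambda line: not line.strip()):
        if is_blank:
            continue  # runs of blank lines only separate sentences
        pairs = [line.strip().split('\t') for line in block]
        sentences.append({
            'tokens': [pair[0] for pair in pairs],
            'tag': [pair[1] for pair in pairs],
        })
    return sentences

# Hypothetical sample in the assumed format (token<TAB>tag, blank line between sentences)
sample = [
    "Setelah\tO\n",
    "melalui\tB\n",
    "\n",
    "@HaloBCA\tB\n",
    "Saya\tO\n",
]
result = group_sentences(sample)
print(result)
```

With the real file, the open file object can be passed directly, for example group_sentences(f) inside a with open("a_testdata.txt") as f: block. Because each run of non-blank lines is emitted as it ends, a missing trailing blank line needs no special handling. If a DataFrame is wanted, passing the resulting list of dicts to pandas.DataFrame should give one row per sentence, with the token and tag lists as column values.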
