Reading a file where a prefixed entry spans multiple lines

Hello, I want to clean up a text file that contains transcripts.

Here is a small excerpt that I copied and pasted:

*CHI:   and when he went to sleep one night , somehow the frog escaped from
    the jar while he was sleeping .
%mor:   coord|and conj|when pro:sub|he v|go&PAST prep|to n|sleep
    pro:indef|one n|night cm|cm adv|somehow det:art|the n|frog
    v|escape-PAST prep|from det:art|the n|jar conj|while pro:sub|he
    aux|be&PAST&13S part|sleep-PRESP .
%gra:   1|4|LINK 2|4|LINK 3|4|SUBJ 4|0|ROOT 5|4|JCT 6|5|POBJ 7|13|LINK
    8|13|SUBJ 9|8|LP 10|13|JCT 11|12|DET 12|13|SUBJ 13|6|CMOD 14|13|JCT 15|16|DET
    16|14|POBJ 17|20|LINK 18|20|SUBJ 19|20|AUX 20|13|CJCT 21|4|PUNCT
*INV:   0 [=! gasps] .
*CHI:   when the boy woke up he noticed that the frog had disappeared .
%mor:   conj|when det:art|the n|boy v|wake&PAST adv|up pro:sub|he
    v|notice-PAST pro:rel|that det:art|the n|frog aux|have&PAST
    dis#part|appear-PASTP .

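Basically, I only want to read the content with the prefix *CHI:, but I want every line of what the child says, including the continuation lines. This is my code so far: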

def read_file(name):
    file = open(name,"r",encoding = "UTF-8")

    content = file.readlines()

    file.close()

    return content


def extract_file(text):
    clean = []
    for line in text:
        if line.startswith("*CHI:"):
            line = line.replace('\t','')
            clean.append(line)
    return clean

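But this only reads the line that carries the prefix and does not continue to the end of the utterance. It stops after the \n.

So when I run this I get

and when he went to sleep one night , somehow the frog escaped from\n

instead of

and when he went to sleep one night , somehow the frog escaped from the jar while he was sleeping .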

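One solution is to use a boolean that tells you whether lines starting with a tab should be read, and to append each such line to the last entry in the clean list.

Here is how your extract_file function would look.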

def extract_file(text):
    clean = []
    read_tab_line = False
    for line in text:
        if line.startswith("*CHI:"):
            read_tab_line = True # we want to read the following tab lines
            clean.append(line)

        elif read_tab_line and line.startswith("\t"):
            clean[-1] += line
        else:
            read_tab_line = False # we do not want to read the following tab lines

    return clean
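
If the goal is a single clean string per utterance, the same flag idea can be extended to join the continuation lines and normalize the whitespace. This is only a minimal sketch, assuming continuation lines always start with a tab as in the excerpt; `extract_utterances` and its `prefix` parameter are hypothetical names, not part of the original code.

```python
def extract_utterances(lines, prefix="*CHI:"):
    """Return one whitespace-normalized string per prefixed utterance."""
    utterances = []
    in_utterance = False
    for line in lines:
        if line.startswith(prefix):
            in_utterance = True
            # Drop the prefix; split() with no arguments also discards tabs and newlines.
            utterances.append(line[len(prefix):].split())
        elif in_utterance and line.startswith("\t"):
            # Continuation line: add its words to the current utterance.
            utterances[-1].extend(line.split())
        else:
            # Any other line (e.g. %mor:, *INV:) ends the current utterance.
            in_utterance = False
    return [" ".join(words) for words in utterances]


sample = [
    "*CHI:\tand when he went to sleep one night , somehow the frog escaped from\n",
    "\tthe jar while he was sleeping .\n",
    "%mor:\tcoord|and conj|when pro:sub|he\n",
    "\tpro:indef|one n|night\n",
    "*INV:\t0 [=! gasps] .\n",
    "*CHI:\twhen the boy woke up he noticed that the frog had disappeared .\n",
]
print(extract_utterances(sample))
```

Collecting words in a list and joining them at the end avoids stray tabs and newlines in the result, at the cost of collapsing any runs of spaces inside the utterance.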

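You are trying to process a multi-line format one line at a time. You can of course do that by, say, setting an indicator in your if statement and clearing it once the entry ends: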

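def extract_file(text):
  clean = []
  append = False  # nothing is collected until the first *CHI: line is seen
  for line in text:
    if line.startswith("*CHI:"):
      append = True
    elif not line.startswith('\t'):
      append = False  # a line without a leading tab ends the current entry
    if append:
      line = line.replace('\t','')
      clean.append(line)
  return clean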

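Another approach is to read the whole file into a variable data (alternatively, you could use mmap) and then use a regular expression to extract the data of interest: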

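import re

def extract_file(name):
  with open(name,"r",encoding = "UTF-8") as file:
    data = file.read()
  # Each match runs from a *CHI: line up to the next line that does not
  # start with a tab, or to the end of the file.
  matches = re.findall(r"^\*CHI:.*?(?=^[^\t]|\Z)", data, re.M | re.S)
  return [m.replace('\t','') for m in matches]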
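
To try the regular-expression idea without touching a file, the pattern can be applied to an in-memory string. The sketch below uses a slightly different pattern that matches a *CHI: line followed by any number of tab-indented lines, so it does not need re.S. The names `CHI_BLOCK` and `extract_from_text` are hypothetical, and it assumes continuation lines start with a tab and that line endings are plain \n.

```python
import re

# A *CHI: line plus every directly following line that starts with a tab.
CHI_BLOCK = re.compile(r"^\*CHI:.*(?:\n\t.*)*", re.MULTILINE)


def extract_from_text(data):
    """Return each *CHI: utterance in data as one whitespace-normalized string."""
    utterances = []
    for match in CHI_BLOCK.finditer(data):
        body = match.group()[len("*CHI:"):]  # drop the speaker prefix
        utterances.append(" ".join(body.split()))
    return utterances


data = (
    "*CHI:\tand when he went to sleep one night , somehow the frog escaped from\n"
    "\tthe jar while he was sleeping .\n"
    "%mor:\tcoord|and conj|when\n"
    "\tpro:indef|one n|night\n"
    "*INV:\t0 [=! gasps] .\n"
    "*CHI:\twhen the boy woke up he noticed that the frog had disappeared .\n"
)
for utterance in extract_from_text(data):
    print(utterance)
```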