How can I repeatedly parse text in a text file between two strings?

I have a text file that contains a table like this:

---
Title of my file
Subtitle of my file
---

+------+-------------------+------+
|  a   |        aa         | aaa  |
|  b   |        bb         | bbb  |
|  c   |        cc         | ccc  |
|  d   |        dd         | ddd  |      # Section 1
|  e   |        ee         | eee  |
|  f   |        ff         | fff  |
+======+===================+======+
|  g   |        gg         | ggg  |
|  h   |        hh         | hhh  |
|  i   |        ii         | iii  |      # Section 2
|  j   |        jj         | jjj  |
|  k   |        kk         | kkk  |
|  l   |        ll         | lll  |
+------+-------------------+------+

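I am trying to parse it with Python so that each section ends up in its own list, section_1_list and section_2_list, with each list holding the rows of that section. For example, section_1_list would be: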

section_1_list = [
    "|  a   |        aa         | aaa  |",
    "|  b   |        bb         | bbb  |",
    "|  c   |        cc         | ccc  |",
    "|  d   |        dd         | ddd  |",
    "|  e   |        ee         | eee  |",
    "|  f   |        ff         | fff  |"
]

Note that the dividing lines are not included.

So my question is: how can I write my loop so that it ignores the dividing lines and collects the remaining lines into their own lists?

**What I have tried:**

Extract Values between two strings in a text file using python

Python read specific lines of text between two strings

**What I currently have:**

with open(txt_file_path) as f:
    lines = f.readlines()

row_start = False

for line in lines:
    if "-----" in line or "=====" in line:
        block_text = []
        row_start = not row_start

    while row_start == True:
        block_text.append(line)

Edit: I say "repeatedly" in the title because the text file contains about 16 of these blocks.

Try the following approach.

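  1. Read the file contents.
  2. Remove the first and last lines of the table, i.e. the dashed border lines (using re).
  3. Split the data on the row separator inside the table (using re).
  4. Split each block on newlines to get the expected lists.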

See the following code:

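import re

with open(txt_file_path, "r") as f:
    data = f.read()

# Step 2: remove the +---+ border lines (first and last line of the table).
data = re.sub(r"^\+[-+]+\+[ \t]*\n?", "", data, flags=re.MULTILINE)
# Step 3: split on the +===+ separator lines between sections.
block_text = re.split(r"^\+[=+]+\+[ \t]*\n?", data, flags=re.MULTILINE)
# Step 4: split each block into lines, keeping only table rows (drops titles and blanks).
block_text = [[ln for ln in text.split("\n") if ln.startswith("|")] for text in block_text]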
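
To try the four steps without a file on disk, here is a minimal self-contained sketch that runs them on an in-memory string. The function name `split_table_sections` and the `sample` text are made up for illustration, and the patterns assume that every border or separator line starts and ends with `+` and that every table row starts with `|`.

```python
import re

def split_table_sections(data):
    """Return one list of row lines per section of the table in `data`."""
    # Drop the +---+ border lines (first and last line of the table).
    data = re.sub(r"^\+[-+]+\+[ \t]*\n?", "", data, flags=re.MULTILINE)
    # Split on the +===+ separator lines between sections.
    blocks = re.split(r"^\+[=+]+\+[ \t]*\n?", data, flags=re.MULTILINE)
    # Keep only the table rows, which drops the title and blank lines.
    return [[ln for ln in block.splitlines() if ln.startswith("|")]
            for block in blocks]

sample = """\
---
Title
---

+---+----+
| a | aa |
| b | bb |
+===+====+
| c | cc |
+---+----+
"""

sections = split_table_sections(sample)
print(sections)
```

Anchoring the patterns to whole lines with `re.MULTILINE` is what keeps the `---` lines around the title from being treated as table borders.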

I would do it like this:

from pprint import pprint

file_contents = """\
---
Title of my file
Subtitle of my file
---

+------+-------------------+------+
|  a   |        aa         | aaa  |
|  b   |        bb         | bbb  |
|  c   |        cc         | ccc  |
|  d   |        dd         | ddd  |      # Section 1
|  e   |        ee         | eee  |
|  f   |        ff         | fff  |
+======+===================+======+
|  g   |        gg         | ggg  |
|  h   |        hh         | hhh  |
|  i   |        ii         | iii  |      # Section 2
|  j   |        jj         | jjj  |
|  k   |        kk         | kkk  |
|  l   |        ll         | lll  |
+------+-------------------+------+\
"""
lines = file_contents.split('\n')

# TODO update as needed
start_end_line_prefixes = ('+---', '+===')

sections = []
curr_section = None

for line in lines:
    if any(line.startswith(prefix) for prefix in start_end_line_prefixes):
        curr_section = []
        sections.append(curr_section)
    elif curr_section is not None:
        curr_section.append(line)

# Remove empty list in last index (if needed)
if sections and not sections[-1]:
    sections.pop()

pprint(sections)

Output:

[['|  a   |        aa         | aaa  |',
  '|  b   |        bb         | bbb  |',
  '|  c   |        cc         | ccc  |',
  '|  d   |        dd         | ddd  |      # Section 1',
  '|  e   |        ee         | eee  |',
  '|  f   |        ff         | fff  |'],
 ['|  g   |        gg         | ggg  |',
  '|  h   |        hh         | hhh  |',
  '|  i   |        ii         | iii  |      # Section 2',
  '|  j   |        jj         | jjj  |',
  '|  k   |        kk         | kkk  |',
  '|  l   |        ll         | lll  |']]
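
Since the question mentions about 16 such blocks, another option is to group consecutive row lines instead of tracking the separators. Below is a minimal sketch using `itertools.groupby`. The helper name `row_sections` and the sample lines are made up here, and the sketch assumes that every table row starts with `|` and that anything else (separators, titles, blank lines) marks a section boundary.

```python
from itertools import groupby

def row_sections(lines):
    """Group consecutive lines that start with '|' into separate lists."""
    # Strip trailing newlines so the function also works on an open file object.
    stripped = (line.rstrip("\n") for line in lines)
    return [list(group)
            for is_row, group in groupby(stripped, key=lambda l: l.startswith("|"))
            if is_row]

sample_lines = [
    "Title of table 1",
    "+---+----+",
    "| a | aa |",
    "| b | bb |",
    "+===+====+",
    "| c | cc |",
    "+---+----+",
    "",
    "Title of table 2",
    "+---+----+",
    "| d | dd |",
    "+---+----+",
]

result = row_sections(sample_lines)
print(result)
```

Because `row_sections` accepts any iterable of lines, an open file object can be passed to it directly inside a `with open(txt_file_path) as f:` block.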