Having difficulty extracting test titles and test results from a list of text

Question

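I have a list of text lines made up of alternating section headers and section content. I want to parse it line by line and identify each section and its associated content (eventually putting them together in a dictionary).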

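The trouble I'm having is figuring out how to parse the lines into pairs purely by walking the list and looking for headers. Every time I try I get very close, but somehow my sections end up misaligned.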

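I think my algorithm should go like this:

(0) Assume the search starts with no header identified; anything seen is therefore ignored until a section header is encountered.

(1) Once "in" a section (i.e., a section header has been encountered), accumulate all following section content until a new section header is seen.

(2) When a new section header is encountered, all following lines should be treated as part of the new section.

(3) Some sections may consist of a header only, so their content is empty. Others may span one or more lines.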

In other words, given this:

garbage
Section-A-Header
section A content line 1
section A content line 2
section A content line 3
Section-B-Header
section B content line 1
section B content line 2
Section-C-Header
Section-D-Header
section D content line 1
section D content line 2
section D content line 3

...I would like to be able to build:

{Section-A-Header: section A content line 1 + section A content line 2 + section A content line 3}
{Section-B-Header: section B content line 1 + section B content line 2}
{Section-C-Header: None}
{Section-D-Header: section D content line 1 + section D content line 2 + section D content line 3}

Can anyone help me come up with a solid implementation?

Update: sample data from the real code I'm working with is located at .

Answer 1

I'm not sure exactly what problem you are running into.

Here is some pseudocode you can borrow from:


def is_section_header(line):
    # Placeholder logic based on the sample data; adjust to your real headers
    return line.startswith("Section-")


last_header = None
output = {}
with open("sections.txt", "r") as file:
    for line in file:
        if is_section_header(line):
            # Start a new section with empty content
            last_header = line.strip()
            output[last_header] = ""
        elif last_header is not None:
            # Lines before the first header are ignored
            output[last_header] += line

print(output)
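
The header check is the part that depends on the real data. As a hedged sketch, a stricter predicate for the sample in the question could use a regular expression. It assumes every header is a whole line of the form `Section-<name>-Header`; the pattern and the name `HEADER_PATTERN` are illustrative, not something the question specifies.

```python
import re

# Assumption: a header is a whole line of the form "Section-<name>-Header".
HEADER_PATTERN = re.compile(r"Section-\w+-Header")


def is_section_header(line):
    """Return True if the stripped line looks like a section header."""
    return HEADER_PATTERN.fullmatch(line.strip()) is not None


sample = [
    "garbage",
    "Section-A-Header\n",
    "section A content line 1\n",
    "Section-C-Header",
]
flags = [is_section_header(line) for line in sample]
print(flags)
```

Using `fullmatch` on the stripped line keeps a content line that merely mentions a header name from being treated as a header.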

Answer 2

This would be my approach:

result = dict()

with open('foo.txt') as foo:
    section = None
    for line in map(str.strip, foo):
        # identify start of section
        if line.startswith('Section-'):
            section = line
            result[section] = None
        else:
            if section:
                if result[section]:
                    result[section].append(line)
                else:
                    result[section] = [line]

Result:

{
  "Section-A-Header": [
    "section A content line 1",
    "section A content line 2",
    "section A content line 3"
  ],
  "Section-B-Header": [
    "section B content line 1",
    "section B content line 2"
  ],
  "Section-C-Header": None,
  "Section-D-Header": [
    "section D content line 1",
    "section D content line 2",
    "section D content line 3"
  ]
}

Note:

It is written this way only because the OP wants None for empty sections.
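
If one string per section is preferred, as in the desired output in the question, the lists can be joined afterwards. This is a minimal sketch, not part of the original answer: `join_sections` and its `separator` parameter are hypothetical names, and a newline is assumed to be the wanted separator.

```python
def join_sections(result, separator="\n"):
    """Return a new dict with each list of lines joined into one string.

    Sections without content (None or an empty list) stay None.
    """
    return {
        header: separator.join(lines) if lines else None
        for header, lines in result.items()
    }


result = {
    "Section-A-Header": ["section A content line 1", "section A content line 2"],
    "Section-C-Header": None,
}
joined = join_sections(result)
print(joined)
```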

Answer 3

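Some people wanted to see a sample of the actual data I'm working with (I was trying to avoid that, because it is far more complex than the sample data I gave above). This data is output by Pytest during a test run, as it is sent to the console, so it has ANSI escape codes embedded in most lines of text. I didn't include it earlier because my difficulty was not in parsing the text, but in creating the overall algorithm that walks through the output line by line.

This is part of a Pytest plugin I'm developing that provides an auto-launching text user interface, which will hopefully make it easier to analyze Pytest's verbose output.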

======================================================================================== FAILURES ========================================================================================
[31m[1m______________________________________________________________________________________ test_b_fail _______________________________________________________________________________________[0m

    [94mdef[39;49;00m [92mtest_b_fail[39;49;00m():
>       [94massert[39;49;00m [94m0[39;49;00m
[1m[31mE       assert 0[0m

[1m[31mtests/test_pytest_fold_1.py[0m:26: AssertionError
[31m[1m___________________________________________________________________________ test_g_eval_parameterized[6*9-42] ____________________________________________________________________________[0m

test_input = '6*9', expected = 42

    [37m@pytest[39;49;00m.mark.parametrize([33m"[39;49;00m[33mtest_input, expected[39;49;00m[33m"[39;49;00m, [([33m"[39;49;00m[33m3+5[39;49;00m[33m"[39;49;00m, [94m8[39;49;00m), ([33m"[39;49;00m[33m2+4[39;49;00m[33m"[39;49;00m, [94m6[39;49;00m), ([33m"[39;49;00m[33m6*9[39;49;00m[33m"[39;49;00m, [94m42[39;49;00m)])
    [94mdef[39;49;00m [92mtest_g_eval_parameterized[39;49;00m(test_input, expected):
>       [94massert[39;49;00m [96meval[39;49;00m(test_input) == expected
[1m[31mE       AssertionError: assert 54 == 42[0m
[1m[31mE        +  where 54 = eval('6*9')[0m

[1m[31mtests/test_pytest_fold_1.py[0m:48: AssertionError

The code I finally got working is based on Phenomenal One's answer. My regex definition is:

r"\x1b\[31m\x1b\[1m__+\W(\S+)\W__+\x1b\[0m"

...and the processing code is:

def _get_tracebacks(self, section_name: str, regex: str) -> dict:

    last_header = ""
    output = {}

    lines = re.split("\n", self.Sections[section_name].content)
    for line in lines:
        result = re.search(regex, line)
        if result:
            last_header = result.groups()[0]
            output[last_header] = ""
        else:
            if not last_header:
                continue
            existing_data = output[last_header]
            output[last_header] = existing_data + "\n" + line

    return output
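
To try this outside the plugin, here is a hedged standalone sketch of the same loop. It assumes the section text is passed in directly instead of being read from `self.Sections`; `get_tracebacks`, `HEADER_REGEX`, and the shortened sample text are illustrative stand-ins, not the plugin's actual API or output.

```python
import re

# Regex from the answer: captures the test name between the underscore rules.
HEADER_REGEX = r"\x1b\[31m\x1b\[1m__+\W(\S+)\W__+\x1b\[0m"


def get_tracebacks(content, regex=HEADER_REGEX):
    """Hypothetical standalone variant that takes the section text directly."""
    output = {}
    last_header = ""
    for line in content.split("\n"):
        result = re.search(regex, line)
        if result:
            # A new test header starts a new, initially empty entry.
            last_header = result.group(1)
            output[last_header] = ""
        elif last_header:
            # Lines before the first header are skipped.
            output[last_header] += "\n" + line
    return output


# Shortened, hand-written stand-in for the Pytest output shown earlier.
sample = "\n".join([
    "===== FAILURES =====",
    "\x1b[31m\x1b[1m_____ test_b_fail _____\x1b[0m",
    "",
    ">       assert 0",
    "\x1b[31m\x1b[1m_____ test_g_eval_parameterized[6*9-42] _____\x1b[0m",
    "E       AssertionError: assert 54 == 42",
])
tracebacks = get_tracebacks(sample)
print(list(tracebacks))
```

Each value starts with a newline because content is appended as `"\n" + line`; call `.lstrip("\n")` on the values if that leading newline is unwanted.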

Thanks to everyone who took part in this discussion!


python

algorithm