Parse only selected records from empty-line separated file

I have a file with the following structure:

SE|text|Baz
SE|entity|Bla
SE|relation|Bla
SE|relation|Foo

SE|text|Bla
SE|entity|Foo

SE|text|Zoo
SE|relation|Bla
SE|relation|Baz

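Records (i.e. blocks) are separated by an empty line. Every line in a block starts with the SE tag. The text tag always appears on the first line of each block.

I would like to know how to properly extract only the blocks that contain a relation tag, which is not necessarily present in every block. My attempt is pasted below: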

from itertools import groupby
with open('test.txt') as f:
    for nonempty, group in groupby(f, bool):
        if nonempty:
            process_block() ## ?

The desired output is a JSON dump:

{
    "result": [
        {
            "text": "Baz", 
            "relation": ["Bla","Foo"]
        },
        {
            "text": "Zoo", 
            "relation": ["Bla","Baz"]
        }

    ]
}

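As mentioned in the comments, you cannot store the same key twice in a dictionary. You can read the file, split it into blocks at '\n\n', split the blocks into lines at '\n', and split the lines into fields at '|'.

You can then put the result into a suitable data structure and serialize it to a string with the json module: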

Create the data file:

with open("f.txt", "w") as f:
    f.write('''SE|text|Baz
SE|entity|Bla
SE|relation|Bla
SE|relation|Foo

SE|text|Bla
SE|entity|Foo

SE|text|Zoo
SE|relation|Bla
SE|relation|Baz''')

Read the data and process it:

with open("f.txt") as f:
    all_text = f.read()
    as_blocks = all_text.split("\n\n")
    # skip SE when splitting and filter only with |relation|
    with_relation = [[k.split("|")[1:]
                      for k in b.split("\n")]
                     for b in as_blocks if "|relation|" in b]

    print(with_relation)

Create a suitable data structure, grouping repeated keys into a list:

result = []
for inner in with_relation:
    result.append({})
    for k,v in inner:
        # add as simple key
        if k not in result[-1]:
            result[-1][k] = v

        # got key 2nd time, read it as list
        elif k in result[-1] and not isinstance(result[-1][k], list):
            result[-1][k] = [result[-1][k], v]

        # got it a 3rd+ time, add to list
        else:
            result[-1][k].append(v)

print(result)

Create JSON from the data structure:

import json

print(json.dumps({"result": result}, indent=4))

Output:

# with_relation
[[['text', 'Baz'], ['entity', 'Bla'], ['relation', 'Bla'], ['relation', 'Foo']], 
 [['text', 'Zoo'], ['relation', 'Bla'], ['relation', 'Baz']]]

# result
[{'text': 'Baz', 'entity': 'Bla', 'relation': ['Bla', 'Foo']}, 
 {'text': 'Zoo', 'relation': ['Bla', 'Baz']}]

# json string
{
    "result": [
        {
            "text": "Baz",
            "entity": "Bla",
            "relation": [
                "Bla",
                "Foo"
            ]
        },
        {
            "text": "Zoo",
            "relation": [
                "Bla",
                "Baz"
            ]
        }
    ]
}
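
As an alternative to splitting on '\n\n', the groupby idea from the question can probably be completed along the following lines. This is only a sketch: the function name relation_blocks and the inline sample list are made up for illustration, and it assumes every non-empty line has exactly three |-separated fields. Note that groupby(f, bool) on raw file lines does not separate blocks, because an "empty" line is still '\n', which is truthy; stripping the lines first makes bool a usable key.

```python
import json
from itertools import groupby


def relation_blocks(lines):
    """Yield {'text': ..., 'relation': [...]} for each block with a relation line."""
    # Strip first: a blank line read from a file is "\n", which is truthy.
    stripped = (line.strip() for line in lines)
    for nonempty, group in groupby(stripped, bool):
        if not nonempty:
            continue  # this group is the run of blank separator lines
        fields = [line.split("|") for line in group]
        relations = [f[2] for f in fields if f[1] == "relation"]
        if relations:
            # The text tag is expected on the first line; fall back to None.
            text = next((f[2] for f in fields if f[1] == "text"), None)
            yield {"text": text, "relation": relations}


# Hypothetical in-memory stand-in for the lines of the input file.
sample = [
    "SE|text|Baz\n", "SE|entity|Bla\n", "SE|relation|Bla\n", "SE|relation|Foo\n",
    "\n",
    "SE|text|Bla\n", "SE|entity|Foo\n",
    "\n",
    "SE|text|Zoo\n", "SE|relation|Bla\n", "SE|relation|Baz\n",
]

groupby_result = {"result": list(relation_blocks(sample))}
print(json.dumps(groupby_result, indent=4))
```

Because relation is always collected into a list and only text and relation are kept, this variant matches the structure asked for in the question; passing an open file object instead of sample should work the same way.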

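I came up with a solution in pure Python that returns a block if it contains the value at any position. This could most likely be done more elegantly with a proper framework such as pandas.

from pprint import pprint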

fname = 'ex.txt'

# extract blocks
with open(fname, 'r') as f:
    blocks = [[]]
    for line in f:
        if not line.strip():  # blank (or whitespace-only) line starts a new block
            blocks.append([])
        else:
            blocks[-1] += [line.strip().split('|')]

# remove blocks that don't contain 'relation'
blocks = [block for block in blocks
          if any('relation' == x[1] for x in block)]

pprint(blocks)
# [[['SE', 'text', 'Baz'],
#   ['SE', 'entity', 'Bla'],
#   ['SE', 'relation', 'Bla'],
#   ['SE', 'relation', 'Foo']],
#  [['SE', 'text', 'Zoo'], ['SE', 'relation', 'Bla'], ['SE', 'relation', 'Baz']]]


# To export to proper json format the following can be done
import pandas as pd
import json
results = []
for block in blocks:
    df = pd.DataFrame(block)
    json_dict = {}
    json_dict['text'] = list(df[2][df[1] == 'text'])
    json_dict['relation'] = list(df[2][df[1] == 'relation'])
    results.append(json_dict)
print(json.dumps(results))
# [{"text": ["Baz"], "relation": ["Bla", "Foo"]}, {"text": ["Zoo"], "relation": ["Bla", "Baz"]}]

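Let's go through it:

  1. Read the file into a list, with blocks separated by empty lines and columns separated by the | character.
  2. Go through each block in the list and discard any block that does not contain relation.
  3. Print the output.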
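
The steps above can also be sketched as two small functions, which makes each step testable on its own. The names read_blocks, keep_with_tag and example_lines are hypothetical, and the sketch assumes each non-empty line has at least two |-separated fields.

```python
def read_blocks(lines):
    """Step 1: split lines into blocks at blank lines, and rows at '|'."""
    blocks = [[]]
    for line in lines:
        if not line.strip():
            blocks.append([])  # blank line: start a new block
        else:
            blocks[-1].append(line.strip().split("|"))
    # Drop empty blocks caused by leading, trailing or repeated blank lines.
    return [block for block in blocks if block]


def keep_with_tag(blocks, tag="relation"):
    """Step 2: keep only blocks where some row has the tag in its second column."""
    return [block for block in blocks if any(row[1] == tag for row in block)]


# Hypothetical in-memory stand-in for the lines of the input file.
example_lines = [
    "SE|text|Baz\n", "SE|entity|Bla\n", "SE|relation|Bla\n", "SE|relation|Foo\n",
    "\n",
    "SE|text|Bla\n", "SE|entity|Foo\n",
    "\n",
    "SE|text|Zoo\n", "SE|relation|Bla\n", "SE|relation|Baz\n",
]

kept = keep_with_tag(read_blocks(example_lines))

# Step 3: print the output.
for block in kept:
    print(block)
```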

In my opinion, this is a very good case for a small parser.
This solution uses a PEG parser called parsimonious, but you could just as well use another one:

from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor
import json

data = """
SE|text|Baz
SE|entity|Bla
SE|relation|Bla
SE|relation|Foo

SE|text|Bla
SE|entity|Foo

SE|text|Zoo
SE|relation|Bla
SE|relation|Baz
"""


class TagVisitor(NodeVisitor):
    grammar = Grammar(r"""
        content = (ws / block)+

        block   = line+
        line    = ~".+" nl?
        nl      = ~"[\n\r]"
        ws      = ~"\s+"
    """)

    def generic_visit(self, node, visited_children):
        return visited_children or node

    def visit_content(self, node, visited_children):
        filtered = [child[0] for child in visited_children if isinstance(child[0], dict)]
        return {"result": filtered}

    def visit_block(self, node, visited_children):
        text, relations = None, []
        for child in visited_children:
            if child[1] == "text" and not text:
                text = child[2].strip()
            elif child[1] == "relation":
                relations.append(child[2])

        if relations:
            return {"text": text, "relation": relations}

    def visit_line(self, node, visited_children):
        tag1, tag2, text = node.text.split("|")
        return tag1, tag2, text.strip()


tv = TagVisitor()
result = tv.parse(data)

print(json.dumps(result))

This yields

{"result": 
    [{"text": "Baz", "relation": ["Bla", "Foo"]}, 
     {"text": "Zoo", "relation": ["Bla", "Baz"]}]
}

The idea is to express a grammar, build an abstract syntax tree from it, and return the block's content in a suitable data format.