Using Python to identify text within a polygon stored in a text file

Question

Suppose I have this text:

+------------------+
|                  |
|     (txta)       |
|                  |
|     A   B   C    |
+------------------+
|                  |
|     (txtb)       |
|                  |
|     B       B    |
+------------------+

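I need to read this text file and produce the output below. The idea is to identify the text that belongs to each rectangle, then count how many As, Bs, and Cs the rectangle contains. For the input above, the output would look like this: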

txta A 1 B 1 C 1  
txtb A 0 B 2 C 0

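The output can be in any format (list, dictionary, etc.), as long as the name in parentheses inside each rectangle is identified (txta and txtb in this example) and the numbers of As, Bs, and Cs are counted and reported.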

I don't know what to try.

Answer 1

Regex is one way to attack this sort of thing, but you know what they say about regular expressions:

Some people, when confronted with a problem, think “I know, I'll use regular expressions.”
Now they have two problems.

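Now, I'm no regex guru, and of course there are always many ways to do things. Depending on what the rest of your data looks like, this may break, but if text is the string you shared, then this works: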

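import re
from collections import Counter

with open('path/to/file.txt') as f:
    text = f.read()

data = {}
# Split on the horizontal border; each chunk is the inside of one rectangle.
for t in text.split('+------------------+'):
    # The rectangle's name is the word inside the parentheses.
    match = re.search(r"\((\w+)\)", t)
    if match is None:
        continue  # empty chunk or trailing newline, not a rectangle
    key = match.group(1)
    # A letter counts only when it directly follows whitespace,
    # which skips the name because it follows "(".
    letters = re.findall(r"\s(\w)", t)
    data[key] = Counter(letters)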

After this, data looks like

{'txta': Counter({'A': 1, 'B': 1, 'C': 1}), 'txtb': Counter({'B': 2})}

You should be able to dig out what you need from there, for example

for record, counter in data.items():
    counts = [f"{a} {counter.get(a, 0)}" for a in 'ABC']
    print(record, ' '.join(counts))

which produces:

txta A 1 B 1 C 1
txtb A 0 B 2 C 0
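
The split above depends on every border being exactly 18 dashes wide. If the width might vary, a sketch along the following lines splits on any `+---...---+` line instead. The helper name `count_letters`, its `letters` parameter, and the `sample` string are illustrative assumptions, not part of the original answer.

```python
import re
from collections import Counter

def count_letters(text, letters="ABC"):
    """Return {name: Counter} for each bordered rectangle in text.

    Hypothetical helper: assumes each rectangle holds one "(name)" and
    that the letters to count stand alone, separated by whitespace.
    """
    wanted = set(letters)
    data = {}
    # Split on any line made of '+', dashes, '+', regardless of width.
    for chunk in re.split(r"^\+-+\+[ \t]*$", text, flags=re.MULTILINE):
        name = re.search(r"\((\w+)\)", chunk)
        if name is None:
            continue  # nothing between two borders, or a trailing newline
        # Blank out the name and the side borders, then count the tokens.
        body = chunk.replace(name.group(0), " ").replace("|", " ")
        tokens = [tok for tok in body.split() if tok in wanted]
        data[name.group(1)] = Counter(tokens)
    return data

sample = """\
+------------------+
|     (txta)       |
|     A   B   C    |
+----------+
|  (txtb)  |
|  B    B  |
+----------+
"""
result = count_letters(sample)
```

Calling `count_letters(text)` on the file contents from the question should give the same two counters as the regex version, so the reporting loop can be reused unchanged.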

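Don't worry about those Counter things; they're essentially just dictionaries. If you think the regex looks weird, yes, it always looks like that. Some people like websites like this to help work things out.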
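
To make the "essentially just dictionaries" point concrete, here is a small sketch of how a Counter behaves. The variable names are only for illustration.

```python
from collections import Counter

counter = Counter(["B", "B"])

# A Counter is a dict subclass, so the usual dict operations work.
is_dict = isinstance(counter, dict)
plain = dict(counter)  # an ordinary dict with the same counts

# Unlike a plain dict, a missing key reads as 0 instead of raising
# KeyError, and the lookup does not add the key.
missing = counter["A"]
still_absent = "A" not in counter
```

This is why `counter.get(a, 0)` in the loop above could also be written as `counter[a]`.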
