Fetch specific protobuf members

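I want to get an array of the values on every line that starts with text: (each one sits directly before an asset_performance_label line).

I saw this post, but I'm not sure how to apply it.

Should I convert the proto to a string, as I tried below?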

    text = extract_text_from_proto(r"(\w+)text:(\w+)asset_performance_label:", '''[pinned_field: HEADLINE_1
    text: "5 Best Products"
    asset_performance_label: PENDING
    policy_summary_info
    {
        review_status: REVIEWED
        approval_status: APPROVED
    }
    , pinned_field: HEADLINE_1
    text: "10 Best Products 2021"
    asset_performance_label: PENDING
    policy_summary_info
    {
        review_status: REVIEWED
        approval_status: APPROVED
    }''')


def extract_text_from_proto(regex, proto_string):
    regex = re.escape(regex)
    result_array = [m.group() for m in re.finditer(regex, proto_string)]
    return result_array
    # return [extract_text(each_item, regex) for each_item in proto],


def extract_text(regex, item):
    m = re.match(regex, str(item))
    if m is None:
        # text = "MISSING TEXT"
        raise Exception("Ad is missing text")
    else:
        text = m.group(2)
    return text

Expected result: ["5 Best Products", "10 Best Products 2021"]

What if I also want to match an (optional) pinned_field: (word)? The result could then be: ['HEADLINE_1: 5 Best Products', 'HEADLINE_1: 10 Best Products 2021', 'some_text_without_pinned_field']

You can use a single capture group and match asset_performance_label on the following line. Use re.findall to return the group values.

\btext:\s*"([^"]+)"\n\s*asset_performance_label\b

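The pattern matches

  • \btext:\s*" matches text: preceded by a word boundary (\b) to prevent partial matches, then optional whitespace and the opening double quote
  • ([^"]+) captures group 1, matching 1 or more characters other than a double quote
  • "\n\s* matches the closing double quote, a newline, and optional whitespace
  • asset_performance_label\b matches asset_performance_label followed by a word boundary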
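
To see what the leading word boundary buys you, here is a small sketch comparing the pattern with and without it. The `subtext` field is a hypothetical name invented for this illustration; it does not appear in the original data.

```python
import re

# Pattern from the answer, with the leading \b.
with_boundary = r'\btext:\s*"([^"]+)"\n\s*asset_performance_label\b'
# Same pattern with the leading \b removed, for comparison only.
without_boundary = r'text:\s*"([^"]+)"\n\s*asset_performance_label\b'

# Hypothetical input: `subtext` is an invented field used to trigger a partial match.
sample = (
    'subtext: "skip me"\n'
    'asset_performance_label: PENDING\n'
    'text: "keep me"\n'
    'asset_performance_label: PENDING'
)

strict = re.findall(with_boundary, sample)
loose = re.findall(without_boundary, sample)

print(strict)  # ['keep me']
print(loose)   # ['skip me', 'keep me']
```

Without `\b`, the `text:` inside `subtext:` is matched as well, which is the partial match the boundary is there to prevent.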

For example

import re

def extract_text_from_proto(regex, proto_string):
    return re.findall(regex, proto_string)

text = extract_text_from_proto(r'\btext:\s*"([^"]+)"\n\s*asset_performance_label\b', '''[pinned_field: HEADLINE_1
    text: "5 Best Products"
    asset_performance_label: PENDING
    policy_summary_info
    {
        review_status: REVIEWED
        approval_status: APPROVED
    }
    , pinned_field: HEADLINE_1
    text: "10 Best Products 2021"
    asset_performance_label: PENDING
    policy_summary_info
    {
        review_status: REVIEWED
        approval_status: APPROVED
    }''')


print(text)

Output

['5 Best Products', '10 Best Products 2021']
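
For the follow-up about an optional pinned_field, one possible approach is to put an optional non-capturing group in front of the same pattern. This is only a sketch: it assumes that pinned_field, when present, sits directly before the text: line with nothing but whitespace in between, and the function name `extract_labeled_text` and the third sample entry are made up for illustration.

```python
import re

PATTERN = re.compile(
    r'(?:\bpinned_field:\s*(\w+)\s+)?'  # optional; group 1 is the pinned field name
    r'\btext:\s*"([^"]+)"\n\s*'         # group 2 is the quoted text
    r'asset_performance_label\b'
)

def extract_labeled_text(proto_string):
    """Return 'FIELD: text' when a pinned_field precedes the text, else just the text."""
    results = []
    for m in PATTERN.finditer(proto_string):
        pinned, text = m.group(1), m.group(2)
        results.append(f"{pinned}: {text}" if pinned else text)
    return results

# Sample data modeled on the question; the middle entry without a
# pinned_field is an assumed shape, not taken from real API output.
sample = '''[pinned_field: HEADLINE_1
    text: "5 Best Products"
    asset_performance_label: PENDING
    , text: "some_text_without_pinned_field"
    asset_performance_label: PENDING
    , pinned_field: HEADLINE_1
    text: "10 Best Products 2021"
    asset_performance_label: PENDING'''

print(extract_labeled_text(sample))
# ['HEADLINE_1: 5 Best Products', 'some_text_without_pinned_field', 'HEADLINE_1: 10 Best Products 2021']
```

re.finditer is used here instead of re.findall because, with two groups, findall would return tuples containing an empty string for the group that did not participate, whereas the match object lets you test group 1 for None directly.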