How to retrieve information in the first section of the raw data only by regular expressions?

Question

Below is a sample of the raw data that my code will process with regular expressions:

raw_data = '''
name        :   John
age         :   26
gender      :   male
occupation  :   teacher

Father
---------------------
name        :   Bill
age         :   52
gender      :   male

Mother
---------------------
name        :   Mary
age         :   48
gender      :   female
'''

I want to extract the following part of the raw data and store it in a dictionary:

dict(name = 'John', age = 26, gender = 'male', occupation = 'teacher')

However, when I run the code below, it does not work as I expected:

import re
p = re.compile('[^-]*?^([^:\-]+?):([^\r\n]*?)$', re.M)
rets = p.findall(raw_data)

infoAboutJohnAsDict = {}

if rets != []:
  for ret in rets:
    infoAboutJohnAsDict[ret[0]] = ret[1]
else:
  print("Not match.")

print(f'rets = {rets}')
print(f'infoAboutJohnAsDict = {infoAboutJohnAsDict}')

Can anyone advise me on how to modify my code so that it does what I intend?

Answer 1

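Here is one approach using regular expressions. First, use re.sub to trim off the latter part of the input that you don't want, i.e. everything from the first titled section (a word followed by a line of dashes) onward. Then use re.findall to find all of John's key-value pairs and convert them to a dictionary. Note that every value comes back as a string, so age is '26' rather than 26.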

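import re

# raw_data is the string from the question.
# Drop everything from the first "Title / dashed line" pair to the end of the input.
raw_data = re.sub(r'\s+\w+\s+-+.*', '', raw_data, flags=re.S)
# Capture each "key : value" pair that remains.
matches = re.findall(r'(\w+)\s*:\s*(\w+)', raw_data)
# findall returns (key, value) tuples, which dict() accepts directly.
d = dict(matches)

print(d)
# {'name': 'John', 'age': '26', 'gender': 'male', 'occupation': 'teacher'}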
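The question asks for age as the integer 26, and the approach above also overwrites raw_data. A minimal sketch of a variant that addresses both follows. It assumes that every later section starts with a one-word title on its own line followed by a line of dashes, as in the sample. The helper name first_section is hypothetical, and treating any all-digit value as an integer is an assumption rather than something the question specifies.

```python
import re

raw_data = '''
name        :   John
age         :   26
gender      :   male
occupation  :   teacher

Father
---------------------
name        :   Bill
age         :   52
gender      :   male

Mother
---------------------
name        :   Mary
age         :   48
gender      :   female
'''

def first_section(text):
    """Return the key-value pairs that precede the first titled section."""
    # Split once at the first "Title" line that is followed by a dashed line,
    # and keep only what comes before it. The input string is not modified.
    head = re.split(r'^\w+[ \t]*\n-+[ \t]*$', text, maxsplit=1, flags=re.M)[0]
    info = {}
    # Match one "key : value" pair per line, trimming surrounding spaces.
    for key, value in re.findall(r'^(\w+)[ \t]*:[ \t]*(.*?)[ \t]*$', head, flags=re.M):
        # Assumption: values made only of digits are meant to be integers.
        info[key] = int(value) if value.isdigit() else value
    return info

info_about_john = first_section(raw_data)
print(info_about_john)
```

Using [ \t]* instead of \s* around the colon keeps each match on a single line, so a key with an empty value cannot swallow the next line.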


python

regex