Problem with some missing values with re module

Question

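I'm currently working on a parsing function that uses regular expressions, but the function I wrote cannot handle missing data. My code is based on the code at https://www.vipinajayakumar.com/parsing-text-with-python/, and I'm parsing structured text fields where each line has the form: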

someField = someValue

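The file I'm parsing may contain some fields that have no value in someValue. Since I build a row afterwards, as in the link, how can I handle missing values when the file I parse contains no value for some field?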

Edit:

Let me give an example. Suppose the .txt file I need to parse contains these two fields:

Height=176

Weight=75.9

and I use only this part of the code:

import os
import re 
import pandas as pd



def parse_line(line, curr_dict):
    """
    The function parse_line(line) does a parse on the input line. This function is taken from the tutorial on parsing files available at https://www.vipinajayakumar.com/parsing-text-with-python/ 
    """
    for key, rx in curr_dict.items():
        match = rx.search(line)
        if match:
            return key, match
    # if there are no matches
    return None, None

test_dict={
    'Weight' : re.compile(r'Weight=(?P<Weight>\d+[.]\d+)\n'),
    'Height' : re.compile(r'Height=(?P<Height>\d+)\n'), 
}

# txt_file is the path to the .txt file being parsed (defined elsewhere)
with open(txt_file, 'r') as f:
    new_line = f.readline()
    while new_line:
        key, match = parse_line(new_line, test_dict)
        if key:
            if key == 'Weight':
                Weight = match.group('Weight')
                Weight = float(Weight)
            if key == 'Height':
                Height = match.group('Height')
        new_line = f.readline()
row = {'Height': Height,
       'Weight': Weight,
       }
df = pd.DataFrame(row, index=[1])

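If all the fields in the file are complete, as shown above, there is no problem, but if, for example:

Height=

Weight=75.9

I get an error, because the value in Height= is missing.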

Answer 1

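Assuming that someValue is simply blank when it is missing, you just need to add a ? after the group that captures someValue.

A quick example based on what you provided: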

re.compile(r'Weight=(?P<Weight>\d+[.]\d+)?\n')

Note the question mark before the newline.

If there is actually a space rather than nothing, try

re.compile(r'Weight=(?P<Weight>\d+[.]\d+)? ?\n')

From the docs:

?

Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.
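To see what the optional group does in practice, here is a minimal sketch using the Weight pattern from above. The sample strings are made up for illustration; they are not taken from your file.

```python
import re

# Same pattern as above, with the capture group made optional.
weight_rx = re.compile(r'Weight=(?P<Weight>\d+[.]\d+)?\n')

# Illustrative input lines (assumed, not from the real file).
full = weight_rx.search('Weight=75.9\n')
empty = weight_rx.search('Weight=\n')

# Both lines match, but the group only has a value in the first case.
print(full.group('Weight'))   # 75.9
print(empty.group('Weight'))  # None
```

So the line is still recognised as a Weight line, and it is the named group, not the match object, that comes back empty.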

Because match.group() returns None for the group when the value is empty, you should also add a value check in your code:

if key:
    if key == 'Weight':
        Weight = match.group('Weight')
        if Weight is not None:
            Weight = float(Weight)
        else:
            Weight = 0.0

    if key == 'Height':
        Height = match.group('Height')
        if Height is not None:
            Height = float(Height)
        else:
            Height = 0.0
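If 0.0 could be mistaken for a real measurement, one possible variation is to default every field to NaN instead. The sketch below is only an illustration under that assumption: parse_lines, patterns and sample are hypothetical names, and it reads from a list of lines rather than your file so that it runs on its own.

```python
import math
import re

# Patterns with optional groups, tolerating one trailing space.
patterns = {
    'Weight': re.compile(r'Weight=(?P<Weight>\d+[.]\d+)? ?\n'),
    'Height': re.compile(r'Height=(?P<Height>\d+)? ?\n'),
}


def parse_lines(lines, patterns):
    """Return one value per field, using NaN where the value is absent."""
    # Start every field as NaN so the row can always be built,
    # even if a field never appears in the input at all.
    row = {key: math.nan for key in patterns}
    for line in lines:
        for key, rx in patterns.items():
            match = rx.search(line)
            if match:
                value = match.group(key)
                # The group is None when the line has no value.
                if value is not None:
                    row[key] = float(value)
                break
    return row


# Illustrative input: Height has no value.
sample = ['Height=\n', 'Weight=75.9\n']
row = parse_lines(sample, patterns)
print(row)  # {'Weight': 75.9, 'Height': nan}
```

The resulting dict can be passed to pd.DataFrame(row, index=[1]) as in your code, and pandas will treat the NaN entry as a missing value.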


python

text-parsing