Data Extraction - Can this Regex be made better?

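I have a C program that decodes data from an APRSIS server. It runs fine on a GNU/Linux machine.

I created this regex to extract the weather data. It is long. Here are some sample data records and the regex:

Data records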

KG7FOQ-13>APTT4,HARIN,WIDE1*,WIDE21,qAO,WEBER:!4227.10N/11422.32W_217/010g015t047r000p000P025h76b10078TU2k
WA6MHA-11>APOTW1,WIDE1-1,WIDE2-1,qAR,N6LXX-10:!3410.50N/11828.90W_182/009g012t070P000h30b10220V126OTW1
KM6AHX-12>APOTU0,N6EX-5,qAR,N6LXX-10:!3411.20N/11813.02W_264/002g010t062p001h61T2WX
KM6AHX-12>APOTU0,N6EX-1*,qAR,VINCNT:!3411.20N/11813.02W_189/010g008t061p001h59T2WX
WA6MHA-11>APOTW1,WIDE1-1,WIDE2-1,qAR,K6LOT-10:!3410.50N/11828.90W_127/008g014t070P000h30b10220V127OTW1
K6OUA-11>APOTW1,WA6ZSN,WIDE2,qAR,N6LXX-10:!3417.39N/11849.36W_225/003g005t066V133P000h45b10138OTW1
KM6AHX-12>APOTU0,N6EX-1*,qAR,VINCNT:!3411.20N/11813.02W_234/005g008t060p001h59T2WX
AD6NH>APJYC1,TCPIP*,qAC,T2CAWEST:=3352.28N/11749.75W_000/000t065h48b10206 /A=259 https://www.ka2ddo.org/ka2ddo/YAAC.html
KM6AHX-12>APOTU0,N6EX-1*,qAR,VINCNT:!3411.20N/11813.02W_170/004g013t060p001h60T2WX
WA6MHA-11>APOTW1,WIDE1-1,WIDE2-1,qAR,N6LXX-10:!3410.50N/11828.90W_120/005g012t069P000h30b10220V127OTW1
K9COE-11>APOTW1,W6SCE-10,qAR,N6LXX-10:!3414.63N/11846.70W_105/007g007t065P035h51b10191OTW1
KM6AHX-12>APOTU0,N6EX-5*,qAR,K6LOT-10:!3411.20N/11813.02W_002/001g013t060p001h60T2WX
KM6AHX-12>APOTU0,N6EX-1*,qAR,VINCNT:!3411.20N/11813.02W_358/003g013t060p001h60T2WX
WA6MHA-11>APOTW1,WIDE1-1,WIDE2-1,qAR,K6LOT-10:!3410.50N/11828.90W_115/004g013t069P000h30b10220V126OTW1

Regex

":[!=][0-9.NS]*/[0-9.EW]*_([0-9]{3})/([0-9]{3})([tphbcsLls#grPV][0-9 .]{2,5})?"
    "([tphbcsLls#grPV][0-9 .]{2,5})?([tphbcsLls#grPV][0-9 .]{2,5})?([tphbcsLls#grPV][0-9 .]{2,5})?"
    "([tphbcsLls#grPV][0-9 .]{2,5})?([tphbcsLls#grPV][0-9 .]{2,5})?([tphbcsLls#grPV][0-9 .]{2,5})?"
    "([tphbcsLls#grPV][0-9 .]{2,5})?([tphbcsLls#grPV][0-9 .]{2,5})?([tphbcsLls#grPV][0-9 .]{2,5})?"
    "([tphbcsLls#grPV][0-9 .]{2,5})?([tphbcsLls#grPV][0-9 .]{2,5})?([tphbcsLls#grPV][0-9 .]{2,5})?"
    "([tphbcsLls#grPV][0-9 .]{2,5})?([tphbcsLls#grPV][0-9 .]{2,5})?.*$"

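A given data record may not contain all of the possible data types ([tphbcsLls#grPV]), and the order is not guaranteed either.

Is there a better way? This seems a bit brute force.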

Chuck Brand

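You can split the parsing into two steps:

  1. Validate the string, and collect all of the x123-type patterns into a single capture group
  2. Split the x123-type patterns into individual items

Step 1: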

":[!=][0-9.NS]*\/[0-9.EW]*_([0-9]{3})\/([0-9]{3})((?:[tphbcsLls#grPV][0-9 .]{2,5})+)"

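Explanation of the regex:

  • :[!=][0-9.NS]*\/[0-9.EW]*_ - expected pattern that selects the right record type
  • ([0-9]{3}) - capture group 1
  • \/ - a slash
  • ([0-9]{3}) - capture group 2
  • ( - start of capture group 3
    • (?: - start of non-capturing group
      • [tphbcsLls#grPV][0-9 .]{2,5} - expected pattern (repeated)
    • )+ - end of non-capturing group, repeated one or more times
  • ) - end of capture group 3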

Resulting capture groups for the input KG7FOQ-13>APTT4,HARIN,WIDE1*,WIDE21,qAO,WEBER:!4227.10N/11422.32W_217/010g015t047r000p000P025h76b10078TU2k:

  • "217" - capture group 1
  • "010" - capture group 2
  • "g015t047r000p000P025h76b10078" - capture group 3

Step 2: Now take the result of capture group 3 and split it:

"(?=[tphbcsLls#grPV])"

Explanation of the split regex:

  • (?= - start of positive lookahead:
  • [tphbcsLls#grPV] - one of these characters
  • ) - end of positive lookahead

Result of the split:

  • ["g015", "t047", "r000", "p000", "P025", "h76", "b10078"]

Edit: After learning that positive lookahead is not available: you can use a match with the global flag instead of a split to get the array of items:

/[tphbcsLls#grPV][^tphbcsLls#grPV]*/g

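Explanation of the match regex:

  • [tphbcsLls#grPV] - scan for a start letter
  • [^tphbcsLls#grPV]* - grab all characters that are not a start letter
  • rinse and repeat using the g global flag

Result of the match: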

  • ["g015", "t047", "r000", "p000", "P025", "h76", "b10078"]

My answer is similar to Peter's: first extract the "data" as one long string, then find all of the sub-data items in it.

I have implemented it in Java.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class So66537099 {
    public static void main(String[] args) {
        final String[] lines = (
                "KG7FOQ-13>APTT4,HARIN,WIDE1*,WIDE21,qAO,WEBER:!4227.10N/11422.32W_217/010g015t047r000p000P025h76b10078TU2k\n" +
                "..."
        ).split("\n");

        final Pattern PATTERN1 = Pattern.compile(".*?:[!=][0-9.NS]*/[0-9.EW]*_([0-9]{3})/([0-9]{3})((?:[tphbcsLl#grPV][0-9 .]{2,5})*).*?");
        final Pattern PATTERN2 = Pattern.compile("[tphbcsLl#grPV][0-9 .]{2,5}");
        for (final String line : lines) {
            System.out.println("line = " + line);
            final Matcher m1 = PATTERN1.matcher(line);
            if (m1.matches()) {
                System.out.println("matches");
                System.out.println("m1.group(1) = " + m1.group(1));
                System.out.println("m1.group(2) = " + m1.group(2));
                final String data = m1.group(3);
                System.out.println("m1.group(3) = " + data);
                if (!data.isEmpty()) {
                    final Matcher m2 = PATTERN2.matcher(data);
                    while (m2.find()) {
                        System.out.println("... m2.group() = " + m2.group());
                    }
                }
            } else {
                System.out.println("doesn't match");
            }
        }
        System.out.println();
    }
}

Output:

line = KG7FOQ-13>APTT4,HARIN,WIDE1*,WIDE21,qAO,WEBER:!4227.10N/11422.32W_217/010g015t047r000p000P025h76b10078TU2k
matches
m1.group(1) = 217
m1.group(2) = 010
m1.group(3) = g015t047r000p000P025h76b10078
... m2.group() = g015
... m2.group() = t047
... m2.group() = r000
... m2.group() = p000
... m2.group() = P025
... m2.group() = h76
... m2.group() = b10078
...

This is what I came up with based on Peter Thoeney's input.

// gcc -Wall -std=c99 -o RME RME.c && ./RME
// Source: https://gist.github.com/ianmackinnon/3294587

#include <stdio.h>
#include <string.h>
#include <regex.h>

#define    NUMBER_OF_GROUPS     4 //groups in your regex + 1
#define    NUMBER_OF_WX_GROUPS 14

char    WXsource[64];
char    source[] = "KG7FOQ-13>APTT4,HARIN,WIDE1*,WIDE2-1,qAO,WEBER"
                  ":!4227.10N/11422.32W_217/010g015t047r000p000P025h76b10078TU2k";

char    *regexString1 = ":[!=][0-9.NS]*.[0-9.EW]*_([0-9]{3})/([0-9]{3})([tphbcsLls#grPV0-9 .]+).{4}$";
char    *regexString2 = "([tphbcsLls#grPV][^tphbcsLls#grPV]*)";
char    WXDataArray[NUMBER_OF_WX_GROUPS][16];
int     numberDecodedGroups=0;
int     sourceOffset=0;
int     start;
int     count;

regex_t     regexCompiled1;
regex_t     regexCompiled2;
regmatch_t  groupArray[NUMBER_OF_GROUPS];

int main ()
    {
    //the first regex matches the coords, wind data,
    //the entire string of weather data, and weather station ID.
    //It captures the wind data and weather data.
    if (regcomp(&regexCompiled1, regexString1, REG_EXTENDED|REG_NEWLINE))
        {
        printf("Could not compile regular expression 1.\n");
        return(1);
        }

    //The second regex parses the weather data into the individual
    //items. It requires multiple calls to accomplish the process.
    if (regcomp(&regexCompiled2, regexString2, REG_EXTENDED|REG_NEWLINE))
        {
        printf("Could not compile regular expression 2.\n");
        return(1);
        }

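    //first extraction. The weather data is in group 3.
    //regexec returns non-zero when the record does not match; in that
    //case groupArray is not valid and must not be used.
    if (regexec(&regexCompiled1, source, NUMBER_OF_GROUPS, groupArray, 0))
        {
        printf("Record does not match.\n");
        return(1);
        }

    start = groupArray[3].rm_so;                //start of weather data
    count = groupArray[3].rm_eo-start;          //bytes of weather data

    //guard against overflowing WXsource
    if (count >= (int)sizeof(WXsource))
        count = sizeof(WXsource) - 1;

    //create a null terminated string of the weather data
    memcpy(&WXsource[0], &source[start], count);
    WXsource[count]=0;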

    //this loop iterates for each entry in the weather data. With each loop
    //the starting point is incremented by the length of the data just
    //extracted. Each string is null terminated in an array.
    //the regex looks for one of the field identifier characters followed
    //by as many characters as it can grab that are NOT field identifiers.
    for(int matchIndex=0; matchIndex < NUMBER_OF_WX_GROUPS; matchIndex++)
        {
        //find the data item. sourceOffset moves the beginning of the source
        //string by the size of the previous extracted data item.
        if (regexec(&regexCompiled2, &WXsource[sourceOffset], NUMBER_OF_GROUPS, groupArray, 0))
            break;

        //start of entry. Should always be 0
        start = groupArray[1].rm_so;

        //eo ends up being the count.
        count = groupArray[1].rm_eo;

        //copy the sub-string to the output array
        memcpy(&WXDataArray[matchIndex][0], &WXsource[sourceOffset], count);

        //add the null termination
        WXDataArray[matchIndex][count]=0;

        //increment sourceOffset
        sourceOffset += groupArray[1].rm_eo;

        //increment the number of fields extracted
        numberDecodedGroups++;
        }

    for(int Index = 0; Index < numberDecodedGroups; Index++)
        printf("%s\n", &WXDataArray[Index][0]);

    return(0);
    }