Data Extraction - Can this Regex be made better?
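I have a C program that decodes data from an APRS-IS server. It runs fine on a GNU/Linux machine.
I created this regex to extract the weather data. It is long. Here are some sample data records and the regex:
Data records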
KG7FOQ-13>APTT4,HARIN,WIDE1*,WIDE21,qAO,WEBER:!4227.10N/11422.32W_217/010g015t047r000p000P025h76b10078TU2k
WA6MHA-11>APOTW1,WIDE1-1,WIDE2-1,qAR,N6LXX-10:!3410.50N/11828.90W_182/009g012t070P000h30b10220V126OTW1
KM6AHX-12>APOTU0,N6EX-5,qAR,N6LXX-10:!3411.20N/11813.02W_264/002g010t062p001h61T2WX
KM6AHX-12>APOTU0,N6EX-1*,qAR,VINCNT:!3411.20N/11813.02W_189/010g008t061p001h59T2WX
WA6MHA-11>APOTW1,WIDE1-1,WIDE2-1,qAR,K6LOT-10:!3410.50N/11828.90W_127/008g014t070P000h30b10220V127OTW1
K6OUA-11>APOTW1,WA6ZSN,WIDE2,qAR,N6LXX-10:!3417.39N/11849.36W_225/003g005t066V133P000h45b10138OTW1
KM6AHX-12>APOTU0,N6EX-1*,qAR,VINCNT:!3411.20N/11813.02W_234/005g008t060p001h59T2WX
AD6NH>APJYC1,TCPIP*,qAC,T2CAWEST:=3352.28N/11749.75W_000/000t065h48b10206 /A=259 https://www.ka2ddo.org/ka2ddo/YAAC.html
KM6AHX-12>APOTU0,N6EX-1*,qAR,VINCNT:!3411.20N/11813.02W_170/004g013t060p001h60T2WX
WA6MHA-11>APOTW1,WIDE1-1,WIDE2-1,qAR,N6LXX-10:!3410.50N/11828.90W_120/005g012t069P000h30b10220V127OTW1
K9COE-11>APOTW1,W6SCE-10,qAR,N6LXX-10:!3414.63N/11846.70W_105/007g007t065P035h51b10191OTW1
KM6AHX-12>APOTU0,N6EX-5*,qAR,K6LOT-10:!3411.20N/11813.02W_002/001g013t060p001h60T2WX
KM6AHX-12>APOTU0,N6EX-1*,qAR,VINCNT:!3411.20N/11813.02W_358/003g013t060p001h60T2WX
WA6MHA-11>APOTW1,WIDE1-1,WIDE2-1,qAR,K6LOT-10:!3410.50N/11828.90W_115/004g013t069P000h30b10220V126OTW1
Regex
":[!=][0-9.NS]*/[0-9.EW]*_([0-9]{3})/([0-9]{3})([tphbcsLls#grPV][0-9 .]{2,5})?"
"([tphbcsLls#grPV][0-9 .]{2,5})?([tphbcsLls#grPV][0-9 .]{2,5})?([tphbcsLls#grPV][0-9 .]{2,5})?"
"([tphbcsLls#grPV][0-9 .]{2,5})?([tphbcsLls#grPV][0-9 .]{2,5})?([tphbcsLls#grPV][0-9 .]{2,5})?"
"([tphbcsLls#grPV][0-9 .]{2,5})?([tphbcsLls#grPV][0-9 .]{2,5})?([tphbcsLls#grPV][0-9 .]{2,5})?"
"([tphbcsLls#grPV][0-9 .]{2,5})?([tphbcsLls#grPV][0-9 .]{2,5})?([tphbcsLls#grPV][0-9 .]{2,5})?"
"([tphbcsLls#grPV][0-9 .]{2,5})?([tphbcsLls#grPV][0-9 .]{2,5})?.*$"
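A given data record may not contain every possible data type ([tphbcsLls#grPV]), and the order is not guaranteed.
Is there a better way? This feels a bit brute force.
Chuck Bland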
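You can split the parsing into two steps:
- Validate the string and group all x123 type patterns into a single capture group
- Split the x123 type patterns into individual items
Step 1: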
":[!=][0-9.NS]*\/[0-9.EW]*_([0-9]{3})\/([0-9]{3})((?:[tphbcsLls#grPV][0-9 .]{2,5})+)"
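Explanation of the regex:
- :[!=][0-9.NS]*\/[0-9.EW]*_ - expected pattern that selects the right record type
- ([0-9]{3}) - capture group 1
- \/ - slash
- ([0-9]{3}) - capture group 2
- ( - start of capture group 3
- (?: - start of non-capturing group
- [tphbcsLls#grPV][0-9 .]{2,5} - expected pattern (repeated)
- )+ - end of non-capturing group, repeated one or more times
- ) - end of capture group 3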
Resulting capture groups for the input KG7FOQ-13>APTT4,HARIN,WIDE1*,WIDE21,qAO,WEBER:!4227.10N/11422.32W_217/010g015t047r000p000P025h76b10078TU2k:
- "217" - capture group 1
- "010" - capture group 2
- "g015t047r000p000P025h76b10078" - capture group 3
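Since the question is about a C program, here is a minimal sketch of step 1 using the POSIX regex functions from <regex.h>. It is an illustration, not the answerer's code. POSIX extended regular expressions do not define the (?: ... ) syntax, so the sketch assumes it is acceptable to turn the inner group into an ordinary capture group (it becomes group 4 and is ignored). The function name wx_step1 and its parameters are made up for this example.

```c
#include <regex.h>
#include <string.h>

/* Hypothetical helper (not from the answer): applies the step-1 pattern to
 * one record and copies capture groups 1, 2 and 3 into the caller's buffers.
 * Returns 0 on success, -1 if the record does not match or wx is too small. */
int wx_step1(const char *record, char g1[4], char g2[4],
             char *wx, size_t wx_size)
{
    /* Same pattern as step 1, but "(?:" is written as "(" because POSIX
     * ERE has no non-capturing groups. No "\/" escape is needed in C. */
    static const char *pattern =
        ":[!=][0-9.NS]*/[0-9.EW]*_([0-9]{3})/([0-9]{3})"
        "(([tphbcsLls#grPV][0-9 .]{2,5})+)";
    regex_t re;
    regmatch_t m[5];            /* whole match + 4 groups */
    int rc = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, record, 5, m, 0) == 0) {
        size_t len = (size_t)(m[3].rm_eo - m[3].rm_so);
        if (len < wx_size) {
            /* groups 1 and 2 are always exactly three digits */
            memcpy(g1, record + m[1].rm_so, 3);
            g1[3] = '\0';
            memcpy(g2, record + m[2].rm_so, 3);
            g2[3] = '\0';
            memcpy(wx, record + m[3].rm_so, len);
            wx[len] = '\0';
            rc = 0;
        }
    }
    regfree(&re);
    return rc;
}
```

Called with the sample record above and a 64-byte wx buffer, this should leave "217" in g1, "010" in g2 and "g015t047r000p000P025h76b10078" in wx. Compiling the pattern on every call keeps the sketch short; a real decoder would probably compile it once.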
Step 2: Now take the result of capture group 3 and split it:
"(?=[tphbcsLls#grPV])"
Explanation of the split regex:
- (?= - start of positive lookahead
- [tphbcsLls#grPV] - one of these characters
- ) - end of positive lookahead
Result of the split:
["g015", "t047", "r000", "p000", "P025", "h76", "b10078"]
Edit: After learning that positive lookahead is not available: you can use match with the global flag instead of split to get the array of items:
/[tphbcsLls#grPV][^tphbcsLls#grPV]*/g
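Explanation of the match regex:
- [tphbcsLls#grPV] - scan for a starting letter
- [^tphbcsLls#grPV]* - grab every character that is not a starting letter
- rinse and repeat using the g global flag
Result of the match: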
["g015", "t047", "r000", "p000", "P025", "h76", "b10078"]
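C has no global match flag, so the same "starting letter, then everything up to the next starting letter" rule has to be applied in a loop. The sketch below swaps the regex for the standard strcspn function, which is a different technique from the one in this answer but yields the same items for the sample data. The names wx_split and WX_ITEM_LEN are hypothetical.

```c
#include <string.h>

#define WX_ITEM_LEN 16   /* assumed upper bound for one item, incl. '\0' */

/* Hypothetical helper: splits a weather block such as "g015t047h76" into
 * items that each begin with a field identifier letter.
 * Returns the number of items stored in out. */
int wx_split(const char *wx, char out[][WX_ITEM_LEN], int max_items)
{
    static const char ids[] = "tphbcsLl#grPV";   /* field identifiers */
    int n = 0;
    /* skip anything before the first identifier */
    const char *p = wx + strcspn(wx, ids);

    while (*p != '\0' && n < max_items) {
        /* one item = identifier + all following non-identifier chars */
        size_t len = 1 + strcspn(p + 1, ids);
        if (len >= WX_ITEM_LEN)
            break;               /* would overflow the item buffer */
        memcpy(out[n], p, len);
        out[n][len] = '\0';
        n++;
        p += len;
    }
    return n;
}
```

For "g015t047r000p000P025h76b10078" and a 14-entry array this should return 7, with the same items as the array shown above. Unlike step 1, it does not check that each item has two to five value characters, so it is best run only on text that step 1 has already validated.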
My answer is similar to Peter's.
First extract the "data" as one long string, then find all the sub-items inside it.
I have implemented it in Java.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class So66537099 {
    public static void main(String[] args) {
        final String[] lines = (
            "KG7FOQ-13>APTT4,HARIN,WIDE1*,WIDE21,qAO,WEBER:!4227.10N/11422.32W_217/010g015t047r000p000P025h76b10078TU2k\n" +
            "..."
        ).split("\n");
        final Pattern PATTERN1 = Pattern.compile(".*?:[!=][0-9.NS]*/[0-9.EW]*_([0-9]{3})/([0-9]{3})((?:[tphbcsLl#grPV][0-9 .]{2,5})*).*?");
        final Pattern PATTERN2 = Pattern.compile("[tphbcsLl#grPV][0-9 .]{2,5}");
        for (final String line : lines) {
            System.out.println("line = " + line);
            final Matcher m1 = PATTERN1.matcher(line);
            if (m1.matches()) {
                System.out.println("matches");
                System.out.println("m1.group(1) = " + m1.group(1));
                System.out.println("m1.group(2) = " + m1.group(2));
                final String data = m1.group(3);
                System.out.println("m1.group(3) = " + data);
                if (!data.isEmpty()) {
                    final Matcher m2 = PATTERN2.matcher(data);
                    while (m2.find()) {
                        System.out.println("... m2.group() = " + m2.group());
                    }
                }
            } else {
                System.out.println("doesn't match");
            }
        }
        System.out.println();
    }
}
Output:
line = KG7FOQ-13>APTT4,HARIN,WIDE1*,WIDE21,qAO,WEBER:!4227.10N/11422.32W_217/010g015t047r000p000P025h76b10078TU2k
matches
m1.group(1) = 217
m1.group(2) = 010
m1.group(3) = g015t047r000p000P025h76b10078
... m2.group() = g015
... m2.group() = t047
... m2.group() = r000
... m2.group() = p000
... m2.group() = P025
... m2.group() = h76
... m2.group() = b10078
...
Here is what I came up with based on Peter Thoeney's input.
// gcc -Wall -std=c99 -o RME RME.c && ./RME
// Source: https://gist.github.com/ianmackinnon/3294587
#include <stdio.h>
#include <string.h>
#include <regex.h>

#define NUMBER_OF_GROUPS 4 //groups in your regex + 1
#define NUMBER_OF_WX_GROUPS 14

char WXsource[64];
char source[] = "KG7FOQ-13>APTT4,HARIN,WIDE1*,WIDE2-1,qAO,WEBER"
                ":!4227.10N/11422.32W_217/010g015t047r000p000P025h76b10078TU2k";
char *regexString1 = ":[!=][0-9.NS]*.[0-9.EW]*_([0-9]{3})/([0-9]{3})([tphbcsLls#grPV0-9 .]+).{4}$";
char *regexString2 = "([tphbcsLls#grPV][^tphbcsLls#grPV]*)";
char WXDataArray[NUMBER_OF_WX_GROUPS][16];
int numberDecodedGroups=0;
int sourceOffset=0;
int start;
int count;
regex_t regexCompiled1;
regex_t regexCompiled2;
regmatch_t groupArray[NUMBER_OF_GROUPS];

int main ()
{
    //the first regex matches the coords, wind data,
    //the entire string of weather data, and weather station ID.
    //It captures the wind data and weather data.
    if (regcomp(&regexCompiled1, regexString1, REG_EXTENDED|REG_NEWLINE))
    {
        printf("Could not compile regular expression 1.\n");
        return(1);
    }
    //The second regex parses the weather data into the individual
    //items. It requires multiple calls to accomplish the process.
    if (regcomp(&regexCompiled2, regexString2, REG_EXTENDED|REG_NEWLINE))
    {
        printf("Could not compile regular expression 2.\n");
        return(1);
    }
    //first extraction. The weather data is in group 3.
    //regexec returns nonzero when the record does not match; in that
    //case groupArray holds no usable offsets.
    if (regexec(&regexCompiled1, source, NUMBER_OF_GROUPS, groupArray, 0))
    {
        printf("Record does not match regular expression 1.\n");
        return(1);
    }
    start = groupArray[3].rm_so;       //start of weather data
    count = groupArray[3].rm_eo-start; //bytes of weather data
    //refuse weather data that would overflow WXsource
    if (count >= (int)sizeof(WXsource))
    {
        printf("Weather data too long.\n");
        return(1);
    }
    //create a null terminated string of the weather data
    memcpy(&WXsource[0], &source[start], count);
    WXsource[count]=0;
    //this loop iterates for each entry in the weather data. With each loop
    //the starting point is incremented by the length of the data just
    //extracted. Each string is null terminated in an array.
    //the regex looks for one of the field identifier characters followed
    //by as many characters as it can grab that are NOT field identifiers.
    for(int matchIndex=0; matchIndex < NUMBER_OF_WX_GROUPS; matchIndex++)
    {
        //find the data item. sourceOffset moves the beginning of the source
        //string by the size of the previous extracted data item.
        if (regexec(&regexCompiled2, &WXsource[sourceOffset], NUMBER_OF_GROUPS, groupArray, 0))
            break;
        //start of entry. Should always be 0
        start = groupArray[1].rm_so;
        //eo ends up being the count.
        count = groupArray[1].rm_eo;
        //stop rather than overflow one 16-byte output entry
        if (count >= (int)sizeof(WXDataArray[0]))
            break;
        //copy the sub-string to the output array
        memcpy(&WXDataArray[matchIndex][0], &WXsource[sourceOffset], count);
        //add the null termination
        WXDataArray[matchIndex][count]=0;
        //increment sourceOffset
        sourceOffset += groupArray[1].rm_eo;
        //increment the number of fields extracted
        numberDecodedGroups++;
    }
    for(int Index = 0; Index < numberDecodedGroups; Index++)
        printf("%s\n", &WXDataArray[Index][0]);
    //release the compiled patterns
    regfree(&regexCompiled1);
    regfree(&regexCompiled2);
    return(0);
}