Extracting information from Whois with RegExp

How can I extract multiple sections from a Whois query result?

I have an array that holds the results of several Whois lookups (collected in a foreach loop).

For example, what if I want everything from the "domain...." line to the ">>> Last update of WHOIS database:" line? How would I do that?

Whois is run with the exec command:

foreach ($query as $domain) {
    $scanUrl = 'whois '.$domain->url;
    exec($scanUrl, $output);
}

Whois works fine, and I can get the creation date, expiry date and registrar with preg_grep:
    $domainCreated  = preg_grep('/created/', $output);
    $domainExpires  = preg_grep('/expires/', $output);
    $domainRegistrar  = preg_grep('/registrar..........:/', $output);

But I need to get multi-line sections from the array, for example from the "domain...." line to the ">>> Last update of WHOIS database:" line.

All Whois results are in one array. The results look like this:

Array
(
[0] =>
[1] => domain.............: iltalehti.fi
[2] => status.............: Registered
[3] => created............: 1.1.1991 00:00:00
[4] => expires............: 31.8.2022 00:00:00
[5] => available..........: 30.9.2022 00:00:00
[6] => modified...........: 6.9.2017
[7] => holder transfer....: 13.7.2013
[8] => RegistryLock.......: no
[9] =>
[10] => Nameservers
[11] =>
[12] => nserver............: a.ns-sec.com [Technical Error]
[13] => nserver............: d.ns-sec.org [OK]
[14] => nserver............: c.ns-sec.fi [178.217.128.53] [2001:67c:224:53::53:1] [OK]
[15] => nserver............: b.ns-sec.net [OK]
[16] =>
[17] => DNSSEC
[18] =>
[19] => dnssec.............: no
[20] =>
[21] => Holder
[22] =>
[23] => name...............: Alma Media Oyj
[24] => register number....: 1944757-4
[25] => address............: PL 140
[26] => address............: 00101
[27] => address............: Helsinki
[28] => country............: Finland
[29] => phone..............: +358 10 665 000
[30] => holder email.......:
[31] =>
[32] => Registrar
[33] =>
[34] => registrar..........: Cybercom Finland Oy
[35] => www................: www.cybercom.com
[36] =>
[37] => >>> Last update of WHOIS database: 24.3.2020 12:45:05 (EET) <<<
[38] =>
[39] =>
[40] => Copyright (c) Finnish Transport and Communications Agency Traficom
[41] =>
[42] =>
[43] => domain.............: yle.fi
[44] => status.............: Registered
[45] => created............: 1.1.1991 00:00:00
[46] => expires............: 31.8.2020 00:00:00
[47] => available..........: 30.9.2020 00:00:00
[48] => modified...........: 16.1.2018
[49] => RegistryLock.......: no
[50] =>
[51] => Nameservers
[52] =>
[53] => nserver............: ns-997.awsdns-60.net [OK]
[54] => nserver............: ns-1394.awsdns-46.org [OK]
[55] => nserver............: ns-1882.awsdns-43.co.uk [OK]
[56] => nserver............: ns-76.awsdns-09.com [OK]
[57] =>
[58] => DNSSEC
[59] =>
[60] => dnssec.............: no
[61] =>
[62] => Holder
[63] =>
[64] => name...............: Yleisradio Oy
[65] => register number....: 0215438-8
[66] => address............: Radiokatu 5
[67] => address............: 00024
[68] => address............: Yleisradio
[69] => country............: Finland
[70] => phone..............: +358914801
[71] => holder email.......:
[72] =>
[73] => Registrar
[74] =>
[75] => registrar..........: Yleisradio Oy
[76] =>
[77] => >>> Last update of WHOIS database: 24.3.2020 12:45:12 (EET) <<<
[78] =>
[79] =>
[80] => Copyright (c) Finnish Transport and Communications Agency Traficom
[81] =>
[82] =>
[83] => domain.............: is.fi
[84] => status.............: Registered
[85] => created............: 12.9.2016 10:01:17
[86] => expires............: 12.9.2020 10:01:17
[87] => available..........: 12.10.2020 10:01:17
[88] => modified...........: 17.9.2017
[89] => holder transfer....: 3.2.2017
[90] => RegistryLock.......: no
[91] =>
[92] => Nameservers
[93] =>
[94] => nserver............: ns-2017.awsdns-60.co.uk [OK]
[95] => nserver............: ns-824.awsdns-39.net [OK]
[96] => nserver............: ns-111.awsdns-13.com [OK]
[97] => nserver............: ns-1159.awsdns-16.org [OK]
[98] =>
[99] => DNSSEC
[100] =>
[101] => dnssec.............: no
[102] =>
[103] => Holder
[104] =>
[105] => name...............: Sanoma Media Finland Oy
[106] => register number....: 1515901-4
[107] => address............: Töölönlahdenkatu 2
[108] => address............: 00100
[109] => address............: Helsinki
[110] => country............: Finland
[111] => phone..............: +35891221
[112] => holder email.......:
[113] =>
[114] => Registrar
[115] =>
[116] => registrar..........: Sanoma Oyj
[117] =>
[118] => >>> Last update of WHOIS database: 24.3.2020 12:46:59 (EET) <<<
[119] =>
[120] =>
[121] => Copyright (c) Finnish Transport and Communications Agency Traficom
[122] =>
[123] =>
[124] => domain.............: hs.fi
[125] => status.............: Registered
[126] => created............: 10.7.2009 00:00:00
[127] => expires............: 14.7.2020 11:17:58
[128] => available..........: 14.8.2020 11:17:58
[129] => modified...........: 7.9.2017
[130] => RegistryLock.......: no
[131] =>
[132] => Nameservers
[133] =>
[134] => nserver............: ns-83.awsdns-10.com [OK]
[135] => nserver............: ns-1635.awsdns-12.co.uk [OK]
[136] => nserver............: ns-1461.awsdns-54.org [OK]
[137] => nserver............: ns-678.awsdns-20.net [OK]
[138] =>
[139] => DNSSEC
[140] =>
[141] => dnssec.............: no
[142] =>
[143] => Holder
[144] =>
[145] => name...............: Sanoma Media Finland Oy / Helsingin Sanomat
[146] => register number....: 1515901-4
[147] => address............: Töölönlahdenkatu 2
[148] => address............: 00100
[149] => address............: Helsinki
[150] => country............: Finland
[151] => phone..............: +35891221
[152] => holder email.......:
[153] =>
[154] => Registrar
[155] =>
[156] => registrar..........: Sanoma Oyj
[157] =>
[158] => >>> Last update of WHOIS database: 24.3.2020 12:45:20 (EET) <<<
[159] =>
[160] =>
[161] => Copyright (c) Finnish Transport and Communications Agency Traficom
[162] =>
)

I tried something like:

$domainRawScan = preg_grep('/\bdomain\b.*\b>>> Last update of WHOIS database:\b/', $output);

But I'm new to RegExp and find the syntax quite confusing. Any help would be greatly appreciated.

One way to handle this is to take the $output array returned by the exec command and join it back into a single string:

$text = implode("\n", $output);

Then use preg_match_all to get all the keywords and values:

preg_match_all('/^(.*?)\.*: (.+)/m', $text, $matches);

Then $matches[1][n] holds keyword n and $matches[2][n] holds value n.

Regex Demo

^             # Start of line in multiline mode
(             # Start of capture group 1
   .*?        # Lazily match 0 or more characters (as few as possible) until ...
)             # End of capture group 1
\.*           # Match 0 or more periods
:             # Match a colon followed by a space
(             # Start of capture group 2
   .+         # Match 1 or more characters up to but not including a newline
)             # End of capture group 2
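
To see what the pattern captures, here is a minimal sketch that runs it on a few lines copied from the Whois output in the question. The sample lines and the variable name $pairs are only for illustration; array_combine is used to pair keyword n with value n.

```php
<?php
// Sample lines copied from the Whois output shown in the question.
$text = implode("\n", [
    'domain.............: iltalehti.fi',
    'status.............: Registered',
    'created............: 1.1.1991 00:00:00',
    'holder email.......:',
    'Nameservers',
    '>>> Last update of WHOIS database: 24.3.2020 12:45:05 (EET) <<<',
]);

preg_match_all('/^(.*?)\.*: (.+)/m', $text, $matches);

// $matches[1] holds the keywords, $matches[2] the values, in the same order.
$pairs = array_combine($matches[1], $matches[2]);
print_r($pairs);
```

Two things to be aware of: a line with an empty value such as "holder email.......:" is skipped because the pattern requires a space and at least one character after the colon, and the ">>> Last update of WHOIS database:" line also matches, since it contains a colon followed by a space.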

Update

Each pass through the loop processes one domain and its keyword/value pairs. How you handle them is up to you.

foreach ($query as $domain) {
    $scanUrl = 'whois '. $domain->url;
    $output = []; // start with an empty array
    exec($scanUrl, $output);
    $text = implode("\n", $output);
    preg_match_all('/^(.*?)\.*: (.+)/m', $text, $matches);
    $n = count($matches[1]); // number of keyword/value pairs
    for ($i = 0; $i < $n; $i++) {
        // display next keyword/value pair:
        echo $matches[1][$i], "->", $matches[2][$i], "\n";
    }
}
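
If you would rather keep the pairs than echo them, one possibility is to collect them into an associative array per domain. The sketch below is only an illustration: parseWhoisText is a hypothetical helper name, and the sample text stands in for the string built from exec's output. Keywords that occur more than once (nserver, address) are gathered into lists so later values do not overwrite earlier ones.

```php
<?php
// Hypothetical helper: turn one domain's Whois text into an associative array.
function parseWhoisText(string $text): array
{
    $grouped = [];
    preg_match_all('/^(.*?)\.*: (.+)/m', $text, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        // $m[1] is the keyword, $m[2] the value; repeated keywords accumulate.
        $grouped[$m[1]][] = trim($m[2]);
    }

    $record = [];
    foreach ($grouped as $key => $values) {
        // Keep a plain string when a keyword occurred only once.
        $record[$key] = count($values) === 1 ? $values[0] : $values;
    }
    return $record;
}

// Sample lines copied from the Whois output shown in the question.
$sample = implode("\n", [
    'domain.............: yle.fi',
    'nserver............: ns-997.awsdns-60.net [OK]',
    'nserver............: ns-1394.awsdns-46.org [OK]',
    'address............: Radiokatu 5',
    'address............: 00024',
    'registrar..........: Yleisradio Oy',
]);

$record = parseWhoisText($sample);
print_r($record);
```

Inside the loop above you could then write something like $results[$domain->url] = parseWhoisText($text); where $results is likewise an assumed name.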

Update 2

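Rather than joining the array of lines returned by exec into a single string and running preg_match_all, which gives you an array of matches, it may be more convenient to call preg_match on each individual output line from exec: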

foreach ($query as $domain) {
    $scanUrl = 'whois '. $domain->url;
    $output = []; // start with an empty array
    exec($scanUrl, $output);
    foreach ($output as $line) {
        if (preg_match('/^(.*?)\.*: (.+)/', $line, $matches)) {
            echo $matches[1], "->", $matches[2], "\n";
        }
    }
}
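
If you keep the loop from the question, where every lookup is appended to the same $output array, the sections from each "domain...." line to the matching ">>> Last update of WHOIS database:" line can also be cut out by walking the lines once. This is a rough sketch under that assumption; splitWhoisSections is a hypothetical helper name and the sample array is a shortened copy of the output in the question.

```php
<?php
// Hypothetical helper: split a combined Whois line array into one
// block of lines per domain.
function splitWhoisSections(array $lines): array
{
    $sections = [];
    $current = null; // lines of the open section, or null when outside one
    foreach ($lines as $line) {
        if (preg_match('/^domain\.*:/', $line)) {
            $current = []; // a "domain....:" line opens a new section
        }
        if ($current !== null) {
            $current[] = $line;
            if (strpos($line, '>>> Last update of WHOIS database:') === 0) {
                $sections[] = $current; // the "Last update" line closes it
                $current = null;
            }
        }
    }
    return $sections;
}

// Shortened sample of the combined array shown in the question.
$output = [
    '',
    'domain.............: iltalehti.fi',
    'status.............: Registered',
    '>>> Last update of WHOIS database: 24.3.2020 12:45:05 (EET) <<<',
    '',
    'Copyright (c) Finnish Transport and Communications Agency Traficom',
    '',
    'domain.............: yle.fi',
    'status.............: Registered',
    '>>> Last update of WHOIS database: 24.3.2020 12:45:12 (EET) <<<',
];

$sections = splitWhoisSections($output);
echo count($sections), "\n";   // 2
echo $sections[1][0], "\n";    // domain.............: yle.fi
```

Each element of $sections is then a plain array of lines for one domain, which can be passed through preg_match line by line as shown above. Lines between sections, such as the copyright notice, are dropped.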