Best way to pull repeated table data from a pcap http file (could awk handle the disruptive breaks)?
I am collecting data readings from my solar PV system. The web client will chart one day's data at a time - I want to collect a whole year or two in one file to look at patterns etc.
So far I use Wireshark to capture the traffic into a cap file; the data I want then has to be filtered out from the headers and some retransmitted packets.
The data of interest is being sent to a js application, and I want to extract the data that repeats in each packet, as date time=watts, see the samples below...
I'm hoping to use AWK to parse the data into an array keyed by date and time, then print it back out to a file. That would remove the duplicates from retransmitted packets and sort the data. Ideally I'd also drop the unwanted decimals in the watts field.
These samples have been passed through strings to remove the binary data in the cap file. Could awk handle that better?
The regular packet breaks can interrupt a field anywhere - in this sample, 2018 was the end of one packet and 18 the start of the next. The text between the data lines is not consistent, although there may be something more consistent in the binary file.
So the rules need to be:
- ignore everything until
{"1":"{
- parse 4n-2n-2n space 2n:2n space real_nb comma (ignoring any other line breaks or characters)
- stop collecting at
}","0":"2018-01-01"}
noting that the closing date differs for each day! (A rough awk sketch of these rules follows.)
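Here is a rough sketch of how I imagine awk could apply those rules (assuming GNU awk, and assuming each mid-table break is the /[CB marker line plus one junk line, which holds in my samples). The keyed array would also take care of the dedupe and the rounding:
awk '
  /\/\[CB/    { skip = 2 }                  # packet-break marker: drop it...
  skip > 0    { skip--; next }              # ...and the junk line that follows
  /\{"1":"\{/ { collecting = 1; buf = "" }  # rule 1: start at {"1":"{
  collecting  { buf = buf $0 }              # glue lines to heal mid-field breaks
  collecting && /\}","0":/ {
      collecting = 0
      sub(/.*\{"1":"\{/, "", buf)           # strip the leading JSON wrapper
      sub(/\}","0":.*/, "", buf)            # rule 3: stop at }","0":"date"}
      n = split(buf, rec, ", ")             # rule 2: comma-separated records
      for (i = 1; i <= n; i++)
          if (split(rec[i], kv, "=") == 2)
              watts[kv[1]] = kv[2]          # keyed by date time; retransmits overwrite
  }
  END { for (k in watts) printf "%s %d\n", k, watts[k] + 0.5 }  # rounded watts
' sample.txt | sort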
Here are 2 sample blocks. The first shows the strings around a table block; that day's table has been shortened by several cuts.
The second block is just one day's full table data without the surrounding context.
(I added one line break for visual separation. Note the break inside 76.549995, which would ideally be rounded to 77.)
Path=/
/[CB
$e/N
{"1":"{2018-01-08 08:50=4.5, 2018-01-08 08:55=9.5, 2018-01-08 11:30=76
/[CB
$e/QM
.549995, 2018-01-08 11:35=73.9, 2018-01-08 11:40=65.93333, 2018-01-08 15:30=2.25, 2018-01-08 15:40=0.0}","0":"2018-01-08"}
/[CB
$e/Vq
XT2P
HTTP/1.1 200 OK
{"1":"{2018-01-01 08:15=9.5, 2018-01-01 08:20=22.0, 2018-01-01 08:25=29.4, 2018-01-01 08:30=30.150002, 2018-01-01 08:35=35.3, 2018-01-01 08:40=42.0, 2018-01-01 08:45=77.5, 2018-01-01 08:50=62.6, 2018-01-01 08:55=62.6, 2018-01-01 09:00=75.4, 2018-01-01 09:05=61.199997, 2018-01-01 09:10=57.85, 2018-01-01 09:15=45.7, 2018-01-01 09:20=44.266666, 2018-01-01 09:25=47.2, 2018-01-01 09:30=46.8, 2018-01-01 09:35=53.2, 2018-01-01 09:40=58.2, 2018-01-01 09:45=55.600002, 2018-01-01 09:50=56.733337, 2018-01-01 09:55=62.0, 2018-01-01 10:00=66.3, 2018-01-01 10:05=62.466663, 2018-01-01 10:10=62.699997, 2018-01-01 10:15=70.3, 2018-01-01 10:20=87.1, 2018-01-01 10:25=88.24999, 2018-01-01 10:30=102.5, 2018-01-01 10:35=95.46667, 2018-01-01 10:40=100.73334, 2018-01-01 10:45=100.700005, 2018-01-01 10:50=102.06667, 2018-01-01 10:55=116.4, 20
/[CB
X7BP
18-01-01 11:00=126.7, 2018-01-01 11:05=125.166664, 2018-01-01 11:10=128.26666, 2018-01-01 11:15=125.43333, 2018-01-01 11:20=119.666664, 2018-01-01 11:25=116.649994, 2018-01-01 11:30=94.700005, 2018-01-01 11:35=101.7, 2018-01-01 11:40=95.13333, 2018-01-01 11:45=98.76666, 2018-01-01 11:50=98.466675, 2018-01-01 11:55=92.43334, 2018-01-01 12:00=85.96667, 2018-01-01 12:05=77.833336, 2018-01-01 12:10=75.95, 2018-01-01 12:15=67.75, 2018-01-01 12:20=57.699997, 2018-01-01 12:25=74.2, 2018-01-01 12:30=87.1, 2018-01-01 12:35=77.6, 2018-01-01 12:40=74.1, 2018-01-01 12:45=63.36667, 2018-01-01 12:50=59.300003, 2018-01-01 12:55=76.9, 2018-01-01 13:00=66.6, 2018-01-01 13:05=203.4, 2018-01-01 13:10=203.45, 2018-01-01 13:15=203.45, 2018-01-01 13:20=157.3, 2018-01-01 13:25=101.333336, 2018-01-01 13:30=96.45, 2018-01-01 13:35=81.3, 2018-01-01 13:40=93.7, 2018-01-01 13:45=127.9, 2018-01-01 13:50=176.1, 2018-01-01 13:55=152.0, 2018-01-01 14:00=169.6, 2018-01-01 14:05=203.2, 2018-01-01 14:10=257.5, 2018-01-01 14:15=261.30002, 2018-01-01 14:20=261.3, 2018-01-01 14:25=218.13335, 2018-01-01 14:30=385.5, 2018-01-01 14:35=287.5, 2018-01-01 14:40=248.35002, 2018-01-01 14:45=98.2, 2018-01-01 14:50=136.2, 2018-01-01 14:55=160.0, 2018-01-01 15:00=148.1
/[CB
X7BP
, 2018-01-01 15:05=133.59999, 2018-01-01 15:10=93.3, 2018-01-01 15:15=79.25, 2018-01-01 15:20=44.300003, 2018-01-01 15:25=36.56667, 2018-01-01 15:30=43.8, 2018-01-01 15:35=39.3, 2018-01-01 15:40=39.5, 2018-01-01 15:45=33.05, 2018-01-01 15:50=28.649998, 2018-01-01 15:55=26.65, 2018-01-01 16:00=16.55, 2018-01-01 16:05=7.5, 2018-01-01 16:10=0.0}","0":"2018-01-01"}
I will have a few thousand lines of source data with 40-100k date_time data points; can a keyed array handle that? Should I define the comma as my record separator? (I'm not sure whether a comma can ever appear in the packet/line-break text...)
Is there a better, simpler solution?
So far I've been working on a couple of sample months with a text editor to test my analysis ideas, but that is too slow and clunky for the full data set.
My ideal output would look like this (from my text-edited sample, so the data differs):
06/11/18 11:20 799
06/11/18 11:25 744
06/11/18 11:30 720
06/11/18 11:35 681
06/11/18 11:40 543
06/11/18 11:45 350
06/11/18 11:50 274
06/11/18 11:55 230
06/11/18 12:00 286
06/11/18 12:05 435
06/11/18 12:10 544
06/11/18 12:15 899
06/11/18 12:20 1187
06/11/18 12:25 1575
06/11/18 12:30 1362
06/11/18 12:35 1423
Perhaps Python would be a better fit, but that is a bigger learning curve from a lower starting knowledge point for me...
Here is my start: it gets most of the data about right, but it does not handle the records split across 2 packets, or the trailing }"
awk 'BEGIN{RS=","}; ($1~"^201"){if (NF=2) {split($2,X,"=");print $1, X[1], X[2]}}' sample.txt
Output:
2018-01-06 15:30 39.033333
2018-01-06 15:35 34.9
2018-01-06 15:40 24.25
2018-01-06 15 NB lost data at packet break as line not starting 201
2018-01-06 15:50 0.0
2018-01-06 15:55 0.0}" NB failed to remove trailer
2018-01-07 08:25 7.8
2018-01-07 08:30 23.7
Just noticed my text-edited version reformats the dates to dd/mm/yy, while awk keeps the input date format. The spreadsheet reads either, so I don't care!
For the record, running my awk on the binary cap file seems to work just the same as on the strings output file.
Real data, output from strings:
Mac OS X 10.11.6, build 15G22010 (Darwin 15.6.0)
Dumpcap (Wireshark) 2.6.5 (v2.6.5-0-gf766965a)
host 47.91.67.66
Mac OS X 10.11.6, build 15G22010 (Darwin 15.6.0)
.#/[CB
HTTP/1.1 200 OK
Date: Tue, 12 Nov 2019 16:15:11 GMT
Content-Type: application/json;charset=UTF-8
Content-Length: 2432
Connection: keep-alive
Accept-Charset: big5, big5-hkscs, euc-jp, euc-kr, gb18030, gb2312, gbk, ibm-thai, ibm00858, ibm01140, ibm01141, ibm01142, ibm01143, ibm01144, ibm01145, ibm01146, ibm01147, ibm01148, ibm01149, ibm037, ibm1026, ibm1047, ibm273, ibm277, ibm278, ibm280, ibm284, ibm285, ibm290, ibm297, ibm420, ibm424, ibm437, ibm500, ibm775, ibm850, ibm852, ibm855, ibm857, ibm860, ibm861, ibm862, ibm863, ibm864, ibm865, ibm866, ibm868, ibm869, ibm870, ibm871, ibm918, iso-2022-cn, iso-2022-jp, iso-2022-jp-2, iso-2022-kr, iso-8859-1, iso-8859-13, iso-8859-15, iso-8859-2, iso-8859-3, iso-8859-4, iso-8859-5, iso-8859-6, iso-8859-7, iso-8859-8, iso-8859-9, jis_x0201, jis_x0212-1990, koi8-r, koi8-u, shift_jis, tis-620, us-ascii, utf-16, utf-16be, utf-16le, utf-32, utf-32be, utf-32le, utf-8, windows-1250, windows-1251, windows-1252, windows-1253, windows-1254, windows-1255, windows-1256, windows-1257, windows-1258, windows-31j, x-big5-hkscs-2001, x-big5-solaris, x-euc-jp-linux, x-euc-tw, x-eucjp-open, x-ibm1006, x-ibm1025, x-ibm1046, x-ibm1097, x-ibm1098, x-ibm1112, x-ibm1122, x-ibm1123, x-ibm1124, x-ibm13
/v/[CB
X7BP
64, x-ibm1381, x-ibm1383, x-ibm300, x-ibm33722, x-ibm737, x-ibm833, x-ibm834, x-ibm856, x-ibm874, x-ibm875, x-ibm921, x-ibm922, x-ibm930, x-ibm933, x-ibm935, x-ibm937, x-ibm939, x-ibm942, x-ibm942c, x-ibm943, x-ibm943c, x-ibm948, x-ibm949, x-ibm949c, x-ibm950, x-ibm964, x-ibm970, x-iscii91, x-iso-2022-cn-cns, x-iso-2022-cn-gb, x-iso-8859-11, x-jis0208, x-jisautodetect, x-johab, x-macarabic, x-maccentraleurope, x-maccroatian, x-maccyrillic, x-macdingbat, x-macgreek, x-machebrew, x-maciceland, x-macroman, x-macromania, x-macsymbol, x-macthai, x-macturkish, x-macukraine, x-ms932_0213, x-ms950-hkscs, x-ms950-hkscs-xp, x-mswin-936, x-pck, x-sjis_0213, x-utf-16le-bom, x-utf-32be-bom, x-utf-32le-bom, x-windows-50220, x-windows-50221, x-windows-874, x-windows-949, x-windows-950, x-windows-iso2022jp
Set-Cookie: SERVERID=dfd94e11c720d0a37cf8b7c8c0cc0c75|1573575311|1573575148;Path=/
/[CB
X7BP
{"1":"{2018-01-01 08:15=9.5, 2018-01-01 08:20=22.0, 2018-01-01 08:25=29.4, 2018-01-01 08:30=30.150002, 2018-01-01 08:35=35.3, 2018-01-01 08:40=42.0, 2018-01-01 08:45=77.5, 2018-01-01 08:50=62.6, 2018-01-01 08:55=62.6, 2018-01-01 09:00=75.4, 2018-01-01 09:05=61.199997, 2018-01-01 09:10=57.85, 2018-01-01 09:15=45.7, 2018-01-01 09:20=44.266666, 2018-01-01 09:25=47.2, 2018-01-01 09:30=46.8, 2018-01-01 09:35=53.2, 2018-01-01 09:40=58.2, 2018-01-01 09:45=55.600002, 2018-01-01 09:50=56.733337, 2018-01-01 09:55=62.0, 2018-01-01 10:00=66.3, 2018-01-01 10:05=62.466663, 2018-01-01 10:10=62.699997, 2018-01-01 10:15=70.3, 2018-01-01 10:20=87.1, 2018-01-01 10:25=88.24999, 2018-01-01 10:30=102.5, 2018-01-01 10:35=95.46667, 2018-01-01 10:40=100.73334, 2018-01-01 10:45=100.700005, 2018-01-01 10:50=102.06667, 2018-01-01 10:55=116.4, 20
/[CB
X7BP
18-01-01 11:00=126.7, 2018-01-01 11:05=125.166664, 2018-01-01 11:10=128.26666, 2018-01-01 11:15=125.43333, 2018-01-01 11:20=119.666664, 2018-01-01 11:25=116.649994, 2018-01-01 11:30=94.700005, 2018-01-01 11:35=101.7, 2018-01-01 11:40=95.13333, 2018-01-01 11:45=98.76666, 2018-01-01 11:50=98.466675, 2018-01-01 11:55=92.43334, 2018-01-01 12:00=85.96667, 2018-01-01 12:05=77.833336, 2018-01-01 12:10=75.95, 2018-01-01 12:15=67.75, 2018-01-01 12:20=57.699997, 2018-01-01 12:25=74.2, 2018-01-01 12:30=87.1, 2018-01-01 12:35=77.6, 2018-01-01 12:40=74.1, 2018-01-01 12:45=63.36667, 2018-01-01 12:50=59.300003, 2018-01-01 12:55=76.9, 2018-01-01 13:00=66.6, 2018-01-01 13:05=203.4, 2018-01-01 13:10=203.45, 2018-01-01 13:15=203.45, 2018-01-01 13:20=157.3, 2018-01-01 13:25=101.333336, 2018-01-01 13:30=96.45, 2018-01-01 13:35=81.3, 2018-01-01 13:40=93.7, 2018-01-01 13:45=127.9, 2018-01-01 13:50=176.1, 2018-01-01 13:55=152.0, 2018-01-01 14:00=169.6, 2018-01-01 14:05=203.2, 2018-01-01 14:10=257.5, 2018-01-01 14:15=261.30002, 2018-01-01 14:20=261.3, 2018-01-01 14:25=218.13335, 2018-01-01 14:30=385.5, 2018-01-01 14:35=287.5, 2018-01-01 14:40=248.35002, 2018-01-01 14:45=98.2, 2018-01-01 14:50=136.2, 2018-01-01 14:55=160.0, 2018-01-01 15:00=148.1
/[CB
X7BP
, 2018-01-01 15:05=133.59999, 2018-01-01 15:10=93.3, 2018-01-01 15:15=79.25, 2018-01-01 15:20=44.300003, 2018-01-01 15:25=36.56667, 2018-01-01 15:30=43.8, 2018-01-01 15:35=39.3, 2018-01-01 15:40=39.5, 2018-01-01 15:45=33.05, 2018-01-01 15:50=28.649998, 2018-01-01 15:55=26.65, 2018-01-01 16:00=16.55, 2018-01-01 16:05=7.5, 2018-01-01 16:10=0.0}","0":"2018-01-01"}
/[CB
HTTP/1.1 200 OK
Date: Tue, 12 Nov 2019 16:15:14 GMT
Content-Type: application/json;charset=UTF-8
Content-Length: 2184
Connection: keep-alive
Accept-Charset: big5, big5-hkscs, euc-jp, euc-kr, gb18030, gb2312, gbk, ibm-thai, ibm00858, ibm01140, ibm01141, ibm01142, ibm01143, ibm01144, ibm01145, ibm01146, ibm01147, ibm01148, ibm01149, ibm037, ibm1026, ibm1047, ibm273, ibm277, ibm278, ibm280, ibm284, ibm285, ibm290, ibm297, ibm420, ibm424, ibm437, ibm500, ibm775, ibm850, ibm852, ibm855, ibm857, ibm860, ibm861, ibm862, ibm863, ibm864, ibm865, ibm866, ibm868, ibm869, ibm870, ibm871, ibm918, iso-2022-cn, iso-2022-jp, iso-2022-jp-2, iso-2022-kr, iso-8859-1, iso-8859-13, iso-8859-15, iso-8859-2, iso-8859-3, iso-8859-4, iso-8859-5, iso-8859-6, iso-8859-7, iso-8859-8, iso-8859-9, jis_x0201, jis_x0212-1990, koi8-r, koi8-u, shift_jis, tis-620, us-ascii, utf-16, utf-16be, utf-16le, utf-32, utf-32be, utf-32le, utf-8, windows-1250, windows-1251, windows-1252, windows-1253, windows-1254, windows-1255, windows-1256, windows-1257, windows-1258, windows-31j, x-big5-hkscs-2001, x-big5-solaris, x-euc-jp-linux, x-euc-tw, x-eucjp-open, x-ibm1006, x-ibm1025, x-ibm1046, x-ibm1097, x-ibm1098, x-ibm1112, x-ibm1122, x-ibm1123, x-ibm1124, x-ibm13
/q/[CB
64, x-ibm1381, x-ibm1383, x-ibm300, x-ibm33722, x-ibm737, x-ibm833, x-ibm834, x-ibm856, x-ibm874, x-ibm875, x-ibm921, x-ibm922, x-ibm930, x-ibm933, x-ibm935, x-ibm937, x-ibm939, x-ibm942, x-ibm942c, x-ibm943, x-ibm943c, x-ibm948, x-ibm949, x-ibm949c, x-ibm950, x-ibm964, x-ibm970, x-iscii91, x-iso-2022-cn-cns, x-iso-2022-cn-gb, x-iso-8859-11, x-jis0208, x-jisautodetect, x-johab, x-macarabic, x-maccentraleurope, x-maccroatian, x-maccyrillic, x-macdingbat, x-macgreek, x-machebrew, x-maciceland, x-macroman, x-macromania, x-macsymbol, x-macthai, x-macturkish, x-macukraine, x-ms932_0213, x-ms950-hkscs, x-ms950-hkscs-xp, x-mswin-936, x-pck, x-sjis_0213, x-utf-16le-bom, x-utf-32be-bom, x-utf-32le-bom, x-windows-50220, x-windows-50221, x-windows-874, x-windows-949, x-windows-950, x-windows-iso2022jp
Set-Cookie: SERVERID=dfd94e11c720d0a37cf8b7c8c0cc0c75|1573575314|1573575148;Path=/
/[CB
{"1":"{2018-01-02 08:35=0.0, 2018-01-02 08:40=6.6, 2018-01-02 08:45=6.35, 2018-01-02 08:50=7.8, 2018-01-02 08:55=6.9, 2018-01-02 09:00=12.2, 2018-01-02 09:05=18.3, 2018-01-02 09:10=25.9, 2018-01-02 09:15=26.15, 2018-01-02 09:20=40.0, 2018-01-02 09:25=36.45, 2018-01-02 09:30=36.450005, 2018-01-02 09:35=30.633333, 2018-01-02 09:40=41.4, 2018-01-02 09:45=44.1, 2018-01-02 09:50=53.9, 2018-01-02 09:55=66.2, 2018-01-02 10:00=75.6, 2018-01-02 10:05=70.1, 2018-01-02 10:10=72.05, 2018-01-02 10:15=54.0, 2018-01-02 10:20=40.55, 2018-01-02 10:25=40.549995, 2018-01-02 10:30=31.699997, 2018-01-02 10:35=33.8, 2018-01-02 10:40=47.6, 2018-01-02 10:45=40.699997, 2018-01-02 10:50=36.65, 2018-01-02 10:55=19.55, 2018-01-02 11:00=12.1, 2018-01-02 11:05=9.549999, 2018-01-02 11:10=25.9, 2018-01-02 11:15=30.0, 2018-01-02 11:20=52.3, 2018-01-0
/[CB
2 11:25=63.3, 2018-01-02 11:30=97.1, 2018-01-02 11:35=147.7, 2018-01-02 11:40=163.8, 2018-01-02 11:45=186.8, 2018-01-02 11:50=241.0, 2018-01-02 11:55=289.9, 2018-01-02 12:00=265.45, 2018-01-02 12:05=247.70001, 2018-01-02 12:10=204.5, 2018-01-02 12:15=206.59999, 2018-01-02 12:20=207.83333, 2018-01-02 12:25=201.36665, 2018-01-02 12:30=189.93333, 2018-01-02 12:35=185.30002, 2018-01-02 12:40=151.65, 2018-01-02 12:45=222.9, 2018-01-02 12:50=197.65, 2018-01-02 12:55=199.46667, 2018-01-02 13:00=254.3, 2018-01-02 13:05=337.7, 2018-01-02 13:10=296.06668, 2018-01-02 13:15=308.80002, 2018-01-02 13:20=314.9, 2018-01-02 13:25=348.0, 2018-01-02 13:30=378.6, 2018-01-02 13:35=356.06665, 2018-01-02 13:40=360.1, 2018-01-02 13:45=287.86667, 2018-01-02 13:50=262.6, 2018-01-02 13:55=265.80002, 2018-01-02 14:00=256.53333, 2018-01-02 14:05=251.90001, 2018-01-02 14:10=158.45, 2018-01-02 14:15=117.0, 2018-01-02 14:20=99.5, 2018-01-02 14:25=91.25, 2018-01-02 14:30=94.1, 2018-01-02 14:35=95.55, 2018-01-02 14:40=91.666664, 2018-01-02 14:45=87.23334, 2018-01-02 14:50=81.66667, 2018-01-02 14:55=79.166664, 2018-01-02 15:00=75.333336, 2018-01-02 15:05=72.850006, 2018-01-02 15:10=60.300003, 2018-01-02 15:15=43.75, 2018-01-02 15:20=30.0, 2018-01-02 15:25
2t/[CB
=18.2, 2018-01-02 15:30=11.0, 2018-01-02 15:35=7.0, 2018-01-02 15:40=3.3, 2018-01-02 15:45=1.55}","0":"2018-01-02"}
/[CB
X>~P
Get the full file of a month or so data here
`https://www.dropbox.com/s/3vb6g9ywlgt7isw/dayData2.txt?dl=1`
The following code uses GNU sed. It recreates the input as a stream, a here-document delimited with END_OF_INPUT, and adds some comments along the way:
cat <<'END_OF_INPUT' |
Path=/
/[CB
$e/N
{"1":"{2018-01-08 08:50=4.5, 2018-01-08 08:55=9.5, 2018-01-08 11:30=76
/[CB
$e/QM
.549995, 2018-01-08 11:35=73.9, 2018-01-08 11:40=65.93333, 2018-01-08 15:30=2.25, 2018-01-08 15:40=0.0}","0":"2018-01-08"}
/[CB
$e/Vq
XT2P
HTTP/1.1 200 OK
{"1":"{2018-01-01 08:15=9.5, 2018-01-01 08:20=22.0, 2018-01-01 08:25=29.4, 2018-01-01 08:30=30.150002, 2018-01-01 08:35=35.3, 2018-01-01 08:40=42.0, 2018-01-01 08:45=77.5, 2018-01-01 08:50=62.6, 2018-01-01 08:55=62.6, 2018-01-01 09:00=75.4, 2018-01-01 09:05=61.199997, 2018-01-01 09:10=57.85, 2018-01-01 09:15=45.7, 2018-01-01 09:20=44.266666, 2018-01-01 09:25=47.2, 2018-01-01 09:30=46.8, 2018-01-01 09:35=53.2, 2018-01-01 09:40=58.2, 2018-01-01 09:45=55.600002, 2018-01-01 09:50=56.733337, 2018-01-01 09:55=62.0, 2018-01-01 10:00=66.3, 2018-01-01 10:05=62.466663, 2018-01-01 10:10=62.699997, 2018-01-01 10:15=70.3, 2018-01-01 10:20=87.1, 2018-01-01 10:25=88.24999, 2018-01-01 10:30=102.5, 2018-01-01 10:35=95.46667, 2018-01-01 10:40=100.73334, 2018-01-01 10:45=100.700005, 2018-01-01 10:50=102.06667, 2018-01-01 10:55=116.4, 20
/[CB
X7BP
18-01-01 11:00=126.7, 2018-01-01 11:05=125.166664, 2018-01-01 11:10=128.26666, 2018-01-01 11:15=125.43333, 2018-01-01 11:20=119.666664, 2018-01-01 11:25=116.649994, 2018-01-01 11:30=94.700005, 2018-01-01 11:35=101.7, 2018-01-01 11:40=95.13333, 2018-01-01 11:45=98.76666, 2018-01-01 11:50=98.466675, 2018-01-01 11:55=92.43334, 2018-01-01 12:00=85.96667, 2018-01-01 12:05=77.833336, 2018-01-01 12:10=75.95, 2018-01-01 12:15=67.75, 2018-01-01 12:20=57.699997, 2018-01-01 12:25=74.2, 2018-01-01 12:30=87.1, 2018-01-01 12:35=77.6, 2018-01-01 12:40=74.1, 2018-01-01 12:45=63.36667, 2018-01-01 12:50=59.300003, 2018-01-01 12:55=76.9, 2018-01-01 13:00=66.6, 2018-01-01 13:05=203.4, 2018-01-01 13:10=203.45, 2018-01-01 13:15=203.45, 2018-01-01 13:20=157.3, 2018-01-01 13:25=101.333336, 2018-01-01 13:30=96.45, 2018-01-01 13:35=81.3, 2018-01-01 13:40=93.7, 2018-01-01 13:45=127.9, 2018-01-01 13:50=176.1, 2018-01-01 13:55=152.0, 2018-01-01 14:00=169.6, 2018-01-01 14:05=203.2, 2018-01-01 14:10=257.5, 2018-01-01 14:15=261.30002, 2018-01-01 14:20=261.3, 2018-01-01 14:25=218.13335, 2018-01-01 14:30=385.5, 2018-01-01 14:35=287.5, 2018-01-01 14:40=248.35002, 2018-01-01 14:45=98.2, 2018-01-01 14:50=136.2, 2018-01-01 14:55=160.0, 2018-01-01 15:00=148.1
/[CB
X7BP
, 2018-01-01 15:05=133.59999, 2018-01-01 15:10=93.3, 2018-01-01 15:15=79.25, 2018-01-01 15:20=44.300003, 2018-01-01 15:25=36.56667, 2018-01-01 15:30=43.8, 2018-01-01 15:35=39.3, 2018-01-01 15:40=39.5, 2018-01-01 15:45=33.05, 2018-01-01 15:50=28.649998, 2018-01-01 15:55=26.65, 2018-01-01 16:00=16.55, 2018-01-01 16:05=7.5, 2018-01-01 16:10=0.0}","0":"2018-01-01"}
END_OF_INPUT
# preprocessing
# ignore until {"1":"{
# stop collecting at }","0":"2018-01-01"}
sed -E -n '/\{"1":"\{/,/\}","0":"[0-9]{4}-[0-9]{2}-[0-9]{2}"\}/p' |
# remove the /[CB + the next line + one newline more
sed -E '/\/\[CB/{N;d;n;}' |
# we should get nice {"1":.....} lines here
# elements are separated by comma
# so we can just be cruel
tr ',' '\n' |
# now each line will have one date
# so for each data in line
# output it in our format(TM)
sed -E -n '
/.*[0-9]{2}([0-9]{2})-([0-9]{2})-([0-9]{2}) ([0-9]{2}:[0-9]{2})=([0-9]*.[0-9]*).*/{
s!!\3/\2/\1 \4 \5!
p
}
'
Will output:
08/01/18 08:50 4.5
08/01/18 08:55 9.5
08/01/18 11:30 76
08/01/18 11:35 73.9
08/01/18 11:40 65.93333
08/01/18 15:30 2.25
08/01/18 15:40 0.0
01/01/18 08:15 9.5
01/01/18 08:20 22.0
01/01/18 08:25 29.4
01/01/18 08:30 30.150002
01/01/18 08:35 35.3
01/01/18 08:40 42.0
01/01/18 08:45 77.5
01/01/18 08:50 62.6
01/01/18 08:55 62.6
01/01/18 09:00 75.4
01/01/18 09:05 61.199997
01/01/18 09:10 57.85
01/01/18 09:15 45.7
01/01/18 09:20 44.266666
01/01/18 09:25 47.2
01/01/18 09:30 46.8
01/01/18 09:35 53.2
01/01/18 09:40 58.2
01/01/18 09:45 55.600002
01/01/18 09:50 56.733337
01/01/18 09:55 62.0
01/01/18 10:00 66.3
01/01/18 10:05 62.466663
01/01/18 10:10 62.699997
01/01/18 10:15 70.3
01/01/18 10:20 87.1
01/01/18 10:25 88.24999
01/01/18 10:30 102.5
01/01/18 10:35 95.46667
01/01/18 10:40 100.73334
01/01/18 10:45 100.700005
01/01/18 10:50 102.06667
01/01/18 10:55 116.4
01/01/18 11:05 125.166664
01/01/18 11:10 128.26666
01/01/18 11:15 125.43333
01/01/18 11:20 119.666664
01/01/18 11:25 116.649994
01/01/18 11:30 94.700005
01/01/18 11:35 101.7
01/01/18 11:40 95.13333
01/01/18 11:45 98.76666
01/01/18 11:50 98.466675
01/01/18 11:55 92.43334
01/01/18 12:00 85.96667
01/01/18 12:05 77.833336
01/01/18 12:10 75.95
01/01/18 12:15 67.75
01/01/18 12:20 57.699997
01/01/18 12:25 74.2
01/01/18 12:30 87.1
01/01/18 12:35 77.6
01/01/18 12:40 74.1
01/01/18 12:45 63.36667
01/01/18 12:50 59.300003
01/01/18 12:55 76.9
01/01/18 13:00 66.6
01/01/18 13:05 203.4
01/01/18 13:10 203.45
01/01/18 13:15 203.45
01/01/18 13:20 157.3
01/01/18 13:25 101.333336
01/01/18 13:30 96.45
01/01/18 13:35 81.3
01/01/18 13:40 93.7
01/01/18 13:45 127.9
01/01/18 13:50 176.1
01/01/18 13:55 152.0
01/01/18 14:00 169.6
01/01/18 14:05 203.2
01/01/18 14:10 257.5
01/01/18 14:15 261.30002
01/01/18 14:20 261.3
01/01/18 14:25 218.13335
01/01/18 14:30 385.5
01/01/18 14:35 287.5
01/01/18 14:40 248.35002
01/01/18 14:45 98.2
01/01/18 14:50 136.2
01/01/18 14:55 160.0
01/01/18 15:00 148.1
01/01/18 15:05 133.59999
01/01/18 15:10 93.3
01/01/18 15:15 79.25
01/01/18 15:20 44.300003
01/01/18 15:25 36.56667
01/01/18 15:30 43.8
01/01/18 15:35 39.3
01/01/18 15:40 39.5
01/01/18 15:45 33.05
01/01/18 15:50 28.649998
01/01/18 15:55 26.65
01/01/18 16:00 16.55
01/01/18 16:05 7.5
01/01/18 16:10 0.0
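For reuse, the same four stages could be pointed at the strings output of a real capture instead of the here-document - a sketch, with the file names assumed:
strings dayData2.cap |
sed -E -n '/\{"1":"\{/,/\}","0":"[0-9]{4}-[0-9]{2}-[0-9]{2}"\}/p' |
sed -E '/\/\[CB/{N;d;n;}' |
tr ',' '\n' |
sed -E -n '/.*[0-9]{2}([0-9]{2})-([0-9]{2})-([0-9]{2}) ([0-9]{2}:[0-9]{2})=([0-9]*.[0-9]*).*/{s!!\3/\2/\1 \4 \5!;p}' > readings.txt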
The same in a single GNU sed command (GNU sed supports \n as the newline character):
sed -E -n '
/\{"1":"\{/,/\}","0":"[0-9]{4}-[0-9]{2}-[0-9]{2}"\}/{
# remove the /[CB + the next line + one newline more
/\/\[CB/{N;d;n;}
: loop
/([^\n]*)[0-9]{2}([0-9]{2})-([0-9]{2})-([0-9]{2}) ([0-9]{2}:[0-9]{2})=([0-9]*.[0-9]*)([^\n]*)/{
# put the interesting string on the end of the pattern space
s!!\1\n\4/\3/\2 \5 \6!
# again, until nothing interesting is found
b loop
}
# remove everything in front of the newline that we did not parse
s/[^\n]*\n//
# output
p
} '
KamilCuk provided the best solution. The first solution, with the series of commands, does the best job, but as it stands it is inconvenient to use, and it does not operate on the binary cap file.
The combined sed command, his solution 2, does not work as well - probably because it processes one line at a time, so the multi-line problems are not well handled. Maybe it could be fixed if the read point could back up a line, or save the remainder of the last line and include it with the next one.
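One way to express that hold-back in awk (my sketch of the idea, not KamilCuk's code): keep everything after the last comma of each line and prepend it to the next line, so a record split by a packet break is re-assembled before parsing. It is rough - header lines containing commas still leak a little junk - but it shows the mechanism:
awk '
  /\/\[CB/ { skip = 2 }        # drop the break marker line...
  skip > 0 { skip--; next }    # ...and the junk line after it
  {
      $0 = carry $0            # prepend the remainder of the previous line
      carry = ""
      # hold back anything after the last comma - it may be a split record
      if ($0 !~ /\}","0":/ && match($0, /,[^,]*$/)) {
          carry = substr($0, RSTART + 1)
          $0 = substr($0, 1, RSTART)
      }
      print                    # healed lines, ready for the one-liner below
  }
' dayData2.txt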
My own quick and lossy effort (more below) is a handy one-liner. It works on the binary cap file, which would let it accept a pipe from tcpdump or ngrep - useful options too.
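The invocations I have in mind look like this (extract.awk holding the one-liner below is just my naming):
sudo tcpdump -A host 47.91.67.66 | awk -f extract.awk       # live capture, payloads as ASCII
strings dayData2.cap | awk -f extract.awk > year2018.txt    # existing cap file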
Comparing my lossy solution: it loses the roughly 1% of data points that are split across IP packets, and it also lets me reject the ~1% of packets that record only 0.0 watts while the PV system is shut down.
For my goal - analysing trends and probabilities in power output by time of day and season (I'll work in 15 or 30 minute buckets, and will also combine days by weeks_from_solstice, e.g. 14-7 days before and 7-14 days after 21 December) - losing a few readings does not matter. Dropping the zeros at the end of the day actually improves my analysis.
So next time I process a data sample from an IP capture, I think I'll use:
awk 'BEGIN{RS=","}; ($1~"^201"){if (NF=2) {split($2,X,"="); if (0+X[2] > 0) {split($1,D,"-");print D[3]"/"D[2]"/"substr(D[1],3,2), X[1], 0+X[2]}}}'
The 0+X[2] is needed because some records end in 0.0}" - the arithmetic yields the number 0 from that and discards the }".
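A quick illustration of that coercion - awk takes the leading numeric prefix of a string and ignores the rest:
echo '0.0}"'   | awk '{print 0+$1}'    # 0
echo '26.65}"' | awk '{print 0+$1}'    # 26.65
echo 'null'    | awk '{print 0+$1}'    # 0 - the "null" records mentioned below vanish too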
If I don't want to reformat the date (Excel is happy with 2018-01-31), the command is simpler:
awk 'BEGIN{RS=","}; ($1~"^201"){if (NF=2) {split($2,X,"="); if (0+X[2] > 0) {print $1, X[1], 0+X[2]}}}'
The command also drops the responses where the database had no data (comms outage, or purged), for which the IP source sends "2017-12-25 10:10 null" - 0+"null" gives 0, so those records fail the > 0 test too.