R: splitting a text file into a workable data frame

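I am working with a text file that contains localization data. Every 5 minutes there is a batch of multiple reports, which may result in a calculated zone. If the zone is resolved, the log outputs an identified room ID (4260 and 4256 in the example):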

[08/14/2021 05:05:59 600] - TagId: 4194912 Identified RoomId:4260
[08/14/2021 05:05:59 616] - TagId: 4194912 Last Monitorid:4195283
[08/14/2021 05:05:59 631] - TagId: 4194912 After RoomId:2199
[08/14/2021 05:05:59 631] - Localization RoomId: 2199
[08/14/2021 05:05:59 663] - TagId: 4194912 Reporting RoomId:2199
[08/14/2021 05:05:59 663] - MacId: F0_5C_19_C6_88_A4 RSSI: -72
[08/14/2021 05:05:59 678] - MacId: F0_5C_19_C7_86_54 RSSI: -82
[08/14/2021 05:05:59 678] - MacId: F0_5C_19_C6_89_3C RSSI: -45
[08/14/2021 05:05:59 694] - MacId: F0_5C_19_C6_88_22 RSSI: -80
[08/14/2021 05:05:59 709] - MacId: F0_5C_19_C6_88_12 RSSI: -60
[08/14/2021 05:05:59 709] - MacId: F0_5C_19_C6_88_A8 RSSI: -83
[08/14/2021 05:05:59 709] - MacId: F0_5C_19_C6_88_90 RSSI: -89
[08/14/2021 05:05:59 709] - MacId: F0_5C_19_C6_88_2E RSSI: -54
[08/14/2021 05:05:59 913] - MacId: 40_E3_D6_CA_56_5C RSSI: -92
[08/14/2021 05:05:59 913] - MacId: F0_5C_19_C6_88_52 RSSI: -92
[08/14/2021 05:05:59 928] - MacId: F0_5C_19_C6_88_B8 RSSI: -80
[08/14/2021 05:06:00 288] - MacId: F0_5C_19_C6_88_A4 RSSI: -72
[08/14/2021 05:06:00 288] - MacId: F0_5C_19_C7_86_54 RSSI: -82
[08/14/2021 05:06:00 288] - MacId: 40_E3_D6_CA_57_0A RSSI: -90
[08/14/2021 05:06:00 288] - MacId: F0_5C_19_C6_89_3C RSSI: -45
[08/14/2021 05:06:00 413] - MacId: F0_5C_19_C6_88_90 RSSI: -90
[08/14/2021 05:06:00 413] - MacId: F0_5C_19_C6_88_12 RSSI: -60
[08/14/2021 05:06:00 428] - MacId: F0_5C_19_C6_88_22 RSSI: -80
[08/14/2021 05:06:00 428] - MacId: F0_5C_19_C6_88_A8 RSSI: -83
[08/14/2021 05:06:00 428] - MacId: F0_5C_19_C6_88_2E RSSI: -55
[08/14/2021 05:11:00 974] - MacId: F0_5C_19_C6_88_A4 RSSI: -72
[08/14/2021 05:11:01 006] - TagId: 4194912 Identified RoomId:4256
[08/14/2021 05:11:01 021] - TagId: 4194912 Last Monitorid:4195283
[08/14/2021 05:11:01 037] - TagId: 4194912 After RoomId:2199
[08/14/2021 05:11:01 052] - Localization RoomId: 2199
[08/14/2021 05:11:01 084] - TagId: 4194912 Reporting RoomId:2199
[08/14/2021 05:11:01 084] - MacId: F0_5C_19_C7_86_54 RSSI: -83
[08/14/2021 05:11:01 084] - MacId: F0_5C_19_C6_88_78 RSSI: -90
[08/14/2021 05:11:01 099] - MacId: F0_5C_19_C6_89_3C RSSI: -45
[08/14/2021 05:11:01 349] - MacId: F0_5C_19_C6_88_12 RSSI: -60
[08/14/2021 05:11:01 349] - MacId: F0_5C_19_C6_88_2E RSSI: -55
[08/14/2021 05:11:01 349] - MacId: F0_5C_19_C6_88_A8 RSSI: -84
[08/14/2021 05:11:01 349] - MacId: F0_5C_19_C6_88_90 RSSI: -89
[08/14/2021 05:11:01 365] - MacId: F0_5C_19_C6_88_22 RSSI: -80
[08/14/2021 05:11:01 474] - MacId: 40_E3_D6_CA_56_5C RSSI: -93
[08/14/2021 05:11:01 490] - MacId: F0_5C_19_C6_88_52 RSSI: -90
[08/14/2021 05:11:01 490] - MacId: F0_5C_19_C6_88_BE RSSI: -89
[08/14/2021 05:11:01 802] - MacId: F0_5C_19_C6_88_A4 RSSI: -72
[08/14/2021 05:11:01 802] - MacId: 40_E3_D6_CA_57_0A RSSI: -90
[08/14/2021 05:11:01 802] - MacId: F0_5C_19_C6_89_3C RSSI: -45
[08/14/2021 05:11:01 802] - MacId: F0_5C_19_C6_88_78 RSSI: -89
[08/14/2021 05:11:01 802] - MacId: F0_5C_19_C7_86_54 RSSI: -82
[08/14/2021 05:11:02 006] - MacId: F0_5C_19_C6_88_90 RSSI: -89
[08/14/2021 05:11:02 006] - MacId: F0_5C_19_C6_88_22 RSSI: -80
[08/14/2021 05:11:02 021] - MacId: F0_5C_19_C6_88_A8 RSSI: -84
[08/14/2021 05:11:02 021] - MacId: F0_5C_19_C6_88_2E RSSI: -55
[08/14/2021 05:11:02 021] - MacId: F0_5C_19_C6_88_12 RSSI: -60
[08/14/2021 05:11:02 115] - MacId: F0_5C_19_C6_88_52 RSSI: -91
[08/14/2021 05:11:02 115] - MacId: F0_5C_19_C6_88_BE RSSI: -88

I would like to get the data in the following form:

If the RoomId is not resolved within the 5-minute time frame (in the raw text file), the RoomId column can simply be NA.

A very helpful member has already shown, in a previous post, how to split the columns correctly.

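So the main question is: how can I structure this raw text file, in a way similar to the image, into a workable data frame, even though the lines in the raw text file are not all alike?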

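Assuming the data is stored in a text file named 'temp.txt', you can read it with readLines. Keep only the lines that have MacId and RSSI values, plus the 'Identified RoomId' lines, which are needed to get the RoomId. Split the data into sets, one per 'Identified RoomId' line. From each set, extract Datetime, MacId and RSSI using the code from the previous post, and extract the room ID by removing everything up to and including 'RoomId:'. You can then combine the output into one data frame.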

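# Read the raw log, one element per line
data <- readLines('temp.txt')
# Keep only the MacId/RSSI lines and the 'Identified RoomId' lines
req_data <- grep('MacId.*RSSI|Identified RoomId', data, value = TRUE)
# Start a new set at every 'Identified' line, then parse each set
result <- do.call(rbind, by(req_data, cumsum(grepl('Identified', req_data)),
          function(x) {
  # The first line of each set is its 'Identified RoomId' line
  room_id <- sub('.*RoomId:\\s*', '', x[1])
  # The remaining lines of the set are the MacId/RSSI rows
  cbind(strcapture('\\[(.*)\\] - MacId: (.*) RSSI: (.*)', x[-1],
             proto = list(Datetime = character(), MacId = character(),
                          RSSI = numeric())), RoomId = room_id)
}))
rownames(result) <- NULL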
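
To see how the sets are formed, here is a minimal sketch on a small vector. The `lines` values are shortened, made-up stand-ins for the filtered log lines, not the real data. `grepl` flags each 'Identified' line and `cumsum` turns those flags into a running group number, so every line gets the number of the most recent 'Identified' line above it.

```r
# Shortened, made-up stand-ins for the filtered log lines
lines <- c("Identified RoomId:4260", "MacId: A RSSI: -72", "MacId: B RSSI: -82",
           "Identified RoomId:4256", "MacId: C RSSI: -45")

# TRUE on each 'Identified' line; cumsum() turns that into a group counter
grp <- cumsum(grepl("Identified", lines))
grp

# Each set starts with its 'Identified' line, followed by its MacId lines
sets <- split(lines, grp)
sets
```

Note that any MacId lines that appear before the first 'Identified' line would fall into group 0, which has no room line of its own.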

For the shared text data, I get the following output:

result
                  Datetime             MacId RSSI RoomId
1  08/14/2021 05:05:59 663 F0_5C_19_C6_88_A4  -72   4260
2  08/14/2021 05:05:59 678 F0_5C_19_C7_86_54  -82   4260
3  08/14/2021 05:05:59 678 F0_5C_19_C6_89_3C  -45   4260
4  08/14/2021 05:05:59 694 F0_5C_19_C6_88_22  -80   4260
5  08/14/2021 05:05:59 709 F0_5C_19_C6_88_12  -60   4260
6  08/14/2021 05:05:59 709 F0_5C_19_C6_88_A8  -83   4260
7  08/14/2021 05:05:59 709 F0_5C_19_C6_88_90  -89   4260
8  08/14/2021 05:05:59 709 F0_5C_19_C6_88_2E  -54   4260
9  08/14/2021 05:05:59 913 40_E3_D6_CA_56_5C  -92   4260
10 08/14/2021 05:05:59 913 F0_5C_19_C6_88_52  -92   4260
11 08/14/2021 05:05:59 928 F0_5C_19_C6_88_B8  -80   4260
12 08/14/2021 05:06:00 288 F0_5C_19_C6_88_A4  -72   4260
13 08/14/2021 05:06:00 288 F0_5C_19_C7_86_54  -82   4260
14 08/14/2021 05:06:00 288 40_E3_D6_CA_57_0A  -90   4260
15 08/14/2021 05:06:00 288 F0_5C_19_C6_89_3C  -45   4260
16 08/14/2021 05:06:00 413 F0_5C_19_C6_88_90  -90   4260
17 08/14/2021 05:06:00 413 F0_5C_19_C6_88_12  -60   4260
18 08/14/2021 05:06:00 428 F0_5C_19_C6_88_22  -80   4260
19 08/14/2021 05:06:00 428 F0_5C_19_C6_88_A8  -83   4260
20 08/14/2021 05:06:00 428 F0_5C_19_C6_88_2E  -55   4260
21 08/14/2021 05:11:00 974 F0_5C_19_C6_88_A4  -72   4260
22 08/14/2021 05:11:01 084 F0_5C_19_C7_86_54  -83   4256
23 08/14/2021 05:11:01 084 F0_5C_19_C6_88_78  -90   4256
24 08/14/2021 05:11:01 099 F0_5C_19_C6_89_3C  -45   4256
25 08/14/2021 05:11:01 349 F0_5C_19_C6_88_12  -60   4256
26 08/14/2021 05:11:01 349 F0_5C_19_C6_88_2E  -55   4256
27 08/14/2021 05:11:01 349 F0_5C_19_C6_88_A8  -84   4256
28 08/14/2021 05:11:01 349 F0_5C_19_C6_88_90  -89   4256
29 08/14/2021 05:11:01 365 F0_5C_19_C6_88_22  -80   4256
30 08/14/2021 05:11:01 474 40_E3_D6_CA_56_5C  -93   4256
31 08/14/2021 05:11:01 490 F0_5C_19_C6_88_52  -90   4256
32 08/14/2021 05:11:01 490 F0_5C_19_C6_88_BE  -89   4256
33 08/14/2021 05:11:01 802 F0_5C_19_C6_88_A4  -72   4256
34 08/14/2021 05:11:01 802 40_E3_D6_CA_57_0A  -90   4256
35 08/14/2021 05:11:01 802 F0_5C_19_C6_89_3C  -45   4256
36 08/14/2021 05:11:01 802 F0_5C_19_C6_88_78  -89   4256
37 08/14/2021 05:11:01 802 F0_5C_19_C7_86_54  -82   4256
38 08/14/2021 05:11:02 006 F0_5C_19_C6_88_90  -89   4256
39 08/14/2021 05:11:02 006 F0_5C_19_C6_88_22  -80   4256
40 08/14/2021 05:11:02 021 F0_5C_19_C6_88_A8  -84   4256
41 08/14/2021 05:11:02 021 F0_5C_19_C6_88_2E  -55   4256
42 08/14/2021 05:11:02 021 F0_5C_19_C6_88_12  -60   4256
43 08/14/2021 05:11:02 115 F0_5C_19_C6_88_52  -91   4256
44 08/14/2021 05:11:02 115 F0_5C_19_C6_88_BE  -88   4256
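
In the output above, row 21 (05:11:00 974) is logged just before the second 'Identified' line and therefore carries 4260, because the sets are split at 'Identified' lines rather than by time. If you would rather group by time window and get NA when no room was identified, one possible variation is sketched below. It is only a sketch under assumptions: the 60-second gap threshold (`gap_secs`) is my own choice, not something taken from the log format, and the last line of the stand-in data is hypothetical, added only to show the NA case.

```r
# Stand-in for `req_data`; the first six lines are from the log above,
# the last one is hypothetical (a batch with no 'Identified' line)
req_data <- c(
  "[08/14/2021 05:05:59 600] - TagId: 4194912 Identified RoomId:4260",
  "[08/14/2021 05:05:59 663] - MacId: F0_5C_19_C6_88_A4 RSSI: -72",
  "[08/14/2021 05:06:00 288] - MacId: F0_5C_19_C7_86_54 RSSI: -82",
  "[08/14/2021 05:11:00 974] - MacId: F0_5C_19_C6_88_A4 RSSI: -72",
  "[08/14/2021 05:11:01 006] - TagId: 4194912 Identified RoomId:4256",
  "[08/14/2021 05:11:01 084] - MacId: F0_5C_19_C7_86_54 RSSI: -83",
  "[08/14/2021 05:16:01 100] - MacId: F0_5C_19_C6_88_A4 RSSI: -72"
)

# Characters 2 to 20 hold "mm/dd/yyyy HH:MM:SS"; milliseconds are ignored here
time <- as.POSIXct(substr(req_data, 2, 20),
                   format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Assumed threshold: a gap of more than 60 seconds starts a new batch
gap_secs <- 60
batch <- cumsum(c(TRUE, diff(as.numeric(time)) > gap_secs))

# Room ID on 'Identified' lines, NA on all other lines
is_room <- grepl("Identified RoomId", req_data)
room <- ifelse(is_room, sub(".*RoomId:\\s*", "", req_data), NA_character_)

# Spread the first room ID found in each batch to all its lines (NA if none)
room_by_batch <- ave(room, batch, FUN = function(r) {
  found <- r[!is.na(r)]
  rep(if (length(found)) found[1] else NA_character_, length(r))
})

# Keep only the MacId rows and parse them as in the answer
keep <- !is_room
result2 <- cbind(
  strcapture("\\[(.*)\\] - MacId: (.*) RSSI: (.*)", req_data[keep],
             proto = list(Datetime = character(), MacId = character(),
                          RSSI = numeric())),
  RoomId = room_by_batch[keep]
)
result2
```

With this grouping the 05:11:00 974 row is assigned to the second batch, and the hypothetical last row gets NA. Whether a fixed gap is the right way to delimit batches depends on how regular the 5-minute reports really are.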