Re-arrange and aggregate R Rows

Question

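Edit: I have cleaned up the question to narrow its scope.

I am trying to aggregate a data frame into the form below, but I am stuck.

This is ISDN log output from a phone system, so it contains calls that happen concurrently throughout the log. The calls are inbound rather than outbound.

The data frame looks like this: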

    "V1" "V2""V3""V4"   "V5"        "V6"        "V7"                   "V8"
    "1" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189056:" "Oct  2 00:00:01.326 AEDST: ISDN Se0/0/0:15 Q931: RX <- SETUP pd = 8  callref = 0x174E "
    "2" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189057:" "  Bearer Capability i = 0x8090A3 "
    "3" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189058:" "      Standard = CCITT "
    "4" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189059:" "      Transfer Capability = Speech  "
    "5" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189060:" "      Transfer Mode = Circuit "
    "6" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189061:" "      Transfer Rate = 64 kbit/s "
    "7" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189062:" "  Channel ID i = 0xA1839B "
    "8" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189063:" "      Preferred, Channel 27 "
    "9" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189064:" "  Calling Party Number i = 0x2183, '00123456789' "
    "10" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189065:" "     Plan:ISDN, Type:National "
    "11" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189066:" " Called Party Number i = 0xC1, '0123456' "
    "12" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189067:" "     Plan:ISDN, Type:Subscriber(local) "
    "13" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189068:" " Sending Complete"
    "14" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189069:" "Oct  2 00:00:01.334 AEDST: ISDN Se0/0/0:15 Q931: TX -> CALL_PROC pd = 8  callref = 0x974E "
    "15" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189070:" " Channel ID i = 0xA9839B "
    "16" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189071:" "     Exclusive, Channel 27"
    "17" "Oct" "" "2" "00:00:02" "10.20.5.31" "82189072:" "Oct  2 00:00:01.350 AEDST: ISDN Se0/0/0:15 Q931: TX -> ALERTING pd = 8  callref = 0x974E "
    "18" "Oct" "" "2" "00:00:02" "10.20.5.31" "82189073:" " Progress Ind i = 0x8088 - In-band info or appropriate now available "
    "19" "Oct" "" "2" "00:00:02" "10.20.5.31" "82189074:" "Oct  2 00:00:01.358 AEDST: ISDN Se0/0/0:15 Q931: TX -> CONNECT pd = 8  callref = 0x974E"
    "20" "Oct" "" "2" "00:00:02" "10.20.5.31" "82189075:" "Oct  2 00:00:01.382 AEDST: ISDN Se0/0/0:15 Q931: RX <- CONNECT_ACK pd = 8  callref = 0x174E"
    "21" "Oct" "" "2" "00:00:19" "10.20.5.30" "81488302:" "Oct  2 00:00:18.210 AEDST: ISDN Se0/0/0:15 Q931: TX -> DISCONNECT pd = 8  callref = 0x9AC7 "
    "22" "Oct" "" "2" "00:00:19" "10.20.5.30" "81488303:" " Cause i = 0x8090 - Normal call clearing"
    "23" "Oct" "" "2" "00:00:19" "10.20.5.30" "81488304:" "Oct  2 00:00:18.290 AEDST: ISDN Se0/0/0:15 Q931: RX <- RELEASE pd = 8  callref = 0x1AC7"
    "24" "Oct" "" "2" "00:00:19" "10.20.5.30" "81488305:" "Oct  2 00:00:18.314 AEDST: ISDN Se0/0/0:15 Q931: TX -> RELEASE_COMP pd = 8  callref = 0x9AC7"
    "25" "Oct" "" "2" "00:00:21" "10.20.5.31" "82189076:" "Oct  2 00:00:21.053 AEDST: ISDN Se0/1/0:15 Q931: RX <- SETUP pd = 8  callref = 0x093A "

I would like the data set to look like this:

    "V1" "V2""V3""V4"   "V5"        "V6"        "V7"    "UniqueId"       "V8"
    "1" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189056:" "0x174E" "Oct  2 00:00:01.326 AEDST: ISDN Se0/0/0:15 Q931: RX <- SETUP pd = 8  callref = 0x174E "
    "2" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189057:" "0x174E" " Bearer Capability i = 0x8090A3 "
    "3" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189058:" "0x174E" "      Standard = CCITT "
   ....
    "21" "Oct" "" "2" "00:00:19" "10.20.5.30" "81488302:" "0x9AC7" "Oct  2 00:00:18.210 AEDST: ISDN Se0/0/0:15 Q931: TX -> DISCONNECT pd = 8  callref = 0x9AC7 "

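To reiterate:

The call reference, written as callref (for example 0x174E), is the only way to identify a unique call in this data set. It becomes the new column (UniqueId) in the requested data frame.
Every row below it gets the same callref id in the new column, until a row is reached that carries a callref of its own (either the same one or a different one).
Bonus points for anyone who can also collapse those rows into a single row each time a callref appears. Note that this can happen in several different states (the row containing the callref may also contain TX -> CALL_PROC, TX -> ALERTING, TX -> CONNECT, RX <- CONNECT_ACK and a few others).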

For example, here I have merged the V8 column of rows 1, 2 and 3, because they belong to the same callref:

    "V1" "V2""V3""V4"   "V5"        "V6"        "V7"    "UniqueId"       "V8"
    "1" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189056:" "0x174E" "Oct  2 00:00:01.326 AEDST: ISDN Se0/0/0:15 Q931: RX <- SETUP pd = 8  callref = 0x174E \n Bearer Capability i = 0x8090A3 \n Standard = CCITT"

Any answers would be greatly appreciated.

Answer 1

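This answer is a bit messy, but I did my best.

You can skip my read.fwf step, since you did the same thing with str_split. I just wanted to get the data into a workable format.

I first read the data in, separating out some of the columns.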

    example1 <- read.fwf("ex.csv", widths = c(1, 6, 10, 10, 10, 1000), strip.white = TRUE)

Then I converted everything to character strings instead of factors, dropped the first row (the headers), and renamed the columns.

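    library(dplyr)  # mutate(), slice(), select(), group_by(), summarise(), left_join()
    library(tidyr)  # separate()

    example <- example1 %>%
      mutate(across(everything(), as.character)) %>%  # strings, not factors
      slice(-1) %>%                                   # drop the header row
      select(Date = 2,                                # keep columns 2 to 6 under new names
             Time = 3,
             IP = 4,
             id = 5,
             Description = 6)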

Then I found the row indices where callref appears and grouped the rows into blocks of text, each block starting at a callref row.

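    # Row indices of the lines that carry a callref
    x <- grep("callref", example$Description)

    example <- example %>%
      mutate(callref = as.integer(grepl("callref", Description)),
             # Label every row with the index of the callref row that opens its block;
             # this assumes the first row contains a callref (x[1] == 1)
             group = rep(x, diff(c(x, n() + 1))))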
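To make the grouping step above easier to follow, here is a small base R sketch on a toy vector. The vector desc is a shortened, hypothetical stand-in for example$Description, and the cumsum variant is an alternative I am suggesting, not part of the original steps.

```r
# Hypothetical, shortened stand-in for example$Description
desc <- c("Q931: RX <- SETUP pd = 8  callref = 0x174E",
          "Bearer Capability i = 0x8090A3",
          "Standard = CCITT",
          "Q931: TX -> CALL_PROC pd = 8  callref = 0x974E",
          "Channel ID i = 0xA9839B")

has_ref <- grepl("callref", desc)  # TRUE on the rows that open a block
x <- which(has_ref)                # 1 4

# Block sizes: distance to the next callref row, or to the end for the last block
sizes <- diff(c(x, length(desc) + 1))  # 3 2
group_rep <- rep(x, sizes)             # 1 1 1 4 4

# Alternative labelling with a running count of callref rows; rows that come
# before the first callref (none here) would fall into group 0
group_cumsum <- cumsum(has_ref)        # 1 1 1 2 2
```

Either labelling works as a grouping key. The running count has the small advantage that it does not require the first row to contain a callref.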

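Once the example data frame is grouped, I summarise the text by pasting together the descriptions within each group. I think this is the main thing you wanted to do?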

    example2 <- example %>%
      group_by(group) %>%
      summarise(text = paste(Description, collapse = "*"))
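The collapsed example in the question joins the lines with a newline rather than "*". As a rough illustration of the same collapse step with that separator, here is a base R sketch on toy data. The description and group vectors are hypothetical stand-ins, and tapply is used only to keep the sketch free of package dependencies.

```r
# Hypothetical toy rows: a description and the block each row belongs to
description <- c("RX <- SETUP pd = 8  callref = 0x174E",
                 "Bearer Capability i = 0x8090A3",
                 "Standard = CCITT",
                 "TX -> CALL_PROC pd = 8  callref = 0x974E",
                 "Channel ID i = 0xA9839B")
group <- c(1, 1, 1, 4, 4)

# One string per block, with the lines of the block joined by a newline
joined <- tapply(description, group, paste, collapse = "\n")

collapsed <- data.frame(group = as.numeric(names(joined)),
                        text = as.vector(joined),
                        stringsAsFactors = FALSE)
```

In the dplyr version, the equivalent change would be to pass collapse = "\n" to paste inside summarise.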

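After that I join it back onto the main example data frame and use separate to pull out some of the important information. This way we get RX_TX as well as the callref id. If needed, you can split out any other important information; I would then suggest using the spread function from tidyr to turn that information into columns, so that you can clean it up further for analysis.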

    example3 <- example %>%
      filter(callref == 1) %>%                # one row per callref line
      left_join(example2, by = "group") %>%   # attach the collapsed block text
      select(-Description) %>%
      rename(Description = text) %>%
      separate(Description, into = c("firstpart", "RX_TX"), sep = "Q931: ") %>%
      separate(RX_TX, into = c("RX_TX", "Info"), sep = "pd = 8") %>%
      # pull the hex callref id (e.g. 0x174E) out of the remaining text
      mutate(Call_Ref = sub(".*callref = (0x[0-9A-Fa-f]+).*", "\\1", Info))
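The steps above give one collapsed row per callref line. For the uncollapsed layout the question asks for first (every row kept, with the callref copied down into a UniqueId column), a minimal base R sketch might look like the following. The vector v8 is a shortened, hypothetical stand-in for the V8 column, and the regular expression assumes the callref is always written as "callref = 0x" followed by hex digits.

```r
# Hypothetical, shortened stand-in for the V8 column
v8 <- c("Q931: RX <- SETUP pd = 8  callref = 0x174E ",
        "  Bearer Capability i = 0x8090A3 ",
        "      Standard = CCITT ",
        "Q931: TX -> DISCONNECT pd = 8  callref = 0x9AC7 ",
        " Cause i = 0x8090 - Normal call clearing")

has_ref <- grepl("callref = ", v8, fixed = TRUE)

# callref on the rows that carry one, NA on the continuation rows
ref <- rep(NA_character_, length(v8))
ref[has_ref] <- sub(".*callref = (0x[0-9A-Fa-f]+).*", "\\1", v8[has_ref])

# Carry the last seen callref forward; rows before the first callref stay NA
UniqueId <- c(NA_character_, ref[has_ref])[cumsum(has_ref) + 1]

result <- data.frame(UniqueId = UniqueId, V8 = v8, stringsAsFactors = FALSE)
```

Inside a dplyr pipeline, the same carry-forward can be done by creating the column with NA on the continuation rows and then calling tidyr's fill on it, which fills downwards by default.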

Tags: r, dataframe, tidyr