Unlisting columns by groups
I have a data frame in the following format:
id | name | logs
---+--------------------+-----------------------------------------
84 | "zibaroo" | "C47931038"
12 | "fabien kelyarsky" | c("C47331040", "B19412225", "B18511449")
96 | "mitra lutsko" | c("F19712226", "A18311450")
34 | "PaulSandoz" | "A47431044"
65 | "BeamVision" | "D47531045"
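As you can see, each cell of the "logs" column contains a character vector.
Is there an efficient way to convert the data frame to long format (one observation per row) without the intermediate step of splitting "logs" into several columns?
This matters because the dataset is very large and the number of logs per person appears to be arbitrary.
In other words, I need the following: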
id | name | log
---+--------------------+------------
84 | "zibaroo" | "C47931038"
12 | "fabien kelyarsky" | "C47331040"
12 | "fabien kelyarsky" | "B19412225"
12 | "fabien kelyarsky" | "B18511449"
96 | "mitra lutsko" | "F19712226"
96 | "mitra lutsko" | "A18311450"
34 | "PaulSandoz" | "A47431044"
65 | "BeamVision" | "D47531045"
Here is the dput output for a sample of the real data frame:
structure(list(id = 148:157, name = c("avihil1", "Niarfe", "doug henderson",
"nick tan", "madisp", "woodbusy", "kevinhcross", "cylol", "andrewarrow",
"gstavrev"), logs = list("Z47331572", "Z47031573", c("F47531574",
"B195945", "D186871", "S192939", "S182865", "G19539045"), c("A47231575",
"A190933", "C181859"), "F47431576", c("B47231577", "D193936",
"Q184862"), "Y47331579", c("A47531580", "Z195944", "B185870"),
"N47731581", "E47231582")), .Names = c("id", "name", "logs"
), row.names = 149:158, class = "data.frame")
This is a perfect case for tidyr:
library(tidyr)
library(dplyr)  # provides the %>% pipe
# dat is the data frame from the question
dat %>% unnest(logs)
Using listCol_l from splitstackshape could be a good option, since the "logs" column in the data.frame is a list:
library(splitstackshape)
listCol_l(df, 'logs')
# id name logs_ul
#1: 148 avihil1 Z47331572
#2: 149 Niarfe Z47031573
#3: 150 doug henderson F47531574
#4: 150 doug henderson B195945
#5: 150 doug henderson D186871
#6: 150 doug henderson S192939
#7: 150 doug henderson S182865
#8: 150 doug henderson G19539045
#9: 151 nick tan A47231575
#10: 151 nick tan A190933
#11: 151 nick tan C181859
#12: 152 madisp F47431576
#13: 153 woodbusy B47231577
#14: 153 woodbusy D193936
#15: 153 woodbusy Q184862
#16: 154 kevinhcross Y47331579
#17: 155 cylol A47531580
#18: 155 cylol Z195944
#19: 155 cylol B185870
#20: 156 andrewarrow N47731581
#21: 157 gstavrev E47231582
Just to show another option, using data.table:
library(data.table)
setDT(df)[, .(logs = unlist(logs)), by = .(id, name)]
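For comparison, the same reshaping can be sketched in base R without any packages. This is only an illustration, not one of the answers above: it rebuilds the first four rows of the dput sample, and the names `n` and `long` are hypothetical. The idea is to repeat each `id` and `name` once per log entry and flatten the list column with `unlist()`.

```r
# First four rows of the sample data from the question's dput output
df <- data.frame(id = 148:151,
                 name = c("avihil1", "Niarfe", "doug henderson", "nick tan"),
                 stringsAsFactors = FALSE)
df$logs <- list("Z47331572", "Z47031573",
                c("F47531574", "B195945", "D186871",
                  "S192939", "S182865", "G19539045"),
                c("A47231575", "A190933", "C181859"))

# Number of log entries in each row (lengths() requires R >= 3.2.0)
n <- lengths(df$logs)

# Repeat id and name once per log entry, then flatten the list column
long <- data.frame(id   = rep(df$id, times = n),
                   name = rep(df$name, times = n),
                   log  = unlist(df$logs, use.names = FALSE),
                   stringsAsFactors = FALSE)
```

This assumes every element of `logs` is a non-empty character vector. A row whose list element is empty (`character(0)` or `NULL`) would be dropped from the result.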