From long to wide with an unbalanced dataset and one row per observation
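I have tried many approaches to reshape my data from long to wide, but I cannot get one row per observation. The result contains many NA values because my data are unbalanced (I cannot simply shift all the values up one row, and so on).
Here is part of my data: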
structure(list(employees = c(384, 432, 624, 334, 356, 338, 348,
1122, 1110, 1492), profit_margin = c(-0.14684, -0.85298, -0.58792,
-0.38872, -1.30312, -0.86866, -0.6363, -1.925, 0.567, 3.984),
RD_expenses = c(8946.414554, 9977.75638, 43326.90616, 48870.14658,
35022.10866, 39584.25952, 32259.2173, 6303.95, 6812.46, 14993.39
), RD_intensity = c(7.10910850621956, 8.98811378416267, 15.6492601635234,
17.6773777378817, 13.1744528168514, 14.3544852219875, 11.2624231565094,
0.500071500320608, 0.559723756230354, 1.36999818636439),
sales = c(125844.3945, 111010.5704, 276862.329, 276455.8596,
265833.4972, 275762.3064, 286432.2966, 1260609.732, 1217111.106,
1094409.478), treated = c("1", "1", "1", "1", "1", "1", "1",
"1", "1", "1"), year = c(2013L, 2014L, 2015L, 2016L, 2017L,
2018L, 2019L, 2015L, 2016L, 2017L), id = c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L), company = c("ALLERGAN PUBLIC LIMITED COMPANY",
"ALLERGAN PUBLIC LIMITED COMPANY", "ALLERGAN PUBLIC LIMITED COMPANY",
"ALLERGAN PUBLIC LIMITED COMPANY", "ALLERGAN PUBLIC LIMITED COMPANY",
"ALLERGAN PUBLIC LIMITED COMPANY", "ALLERGAN PUBLIC LIMITED COMPANY",
"ALPINE ELECTRONICS, INC.", "ALPINE ELECTRONICS, INC.", "ALPINE ELECTRONICS, INC."
)), row.names = c(NA, -10L), class = c("data.table", "data.frame"
))
I tried this:
test %>% group_by(id, company) %>% dplyr::mutate(row = row_number()) %>% tidyr::pivot_wider(names_from = year, values_from = c("employees", "profit_margin", "RD_expenses", "RD_intensity", "sales", "treated"))
But this gives many NA values instead of one row per observation, as shown below:
1 ALLERGAN PUBLIC LIMITED COMPANY 1 384 NA NA NA NA
1 ALLERGAN PUBLIC LIMITED COMPANY 2 NA 432 NA NA NA
1 ALLERGAN PUBLIC LIMITED COMPANY 3 NA NA 624 NA NA
1 ALLERGAN PUBLIC LIMITED COMPANY 4 NA NA NA 334 NA
1 ALLERGAN PUBLIC LIMITED COMPANY 5 NA NA NA NA 356
1 ALLERGAN PUBLIC LIMITED COMPANY 6 NA NA NA NA NA
1 ALLERGAN PUBLIC LIMITED COMPANY 7 NA NA NA NA NA
2 ALPINE ELECTRONICS, INC. 1 NA NA 1122 NA NA
2 ALPINE ELECTRONICS, INC. 2 NA NA NA 1110 NA
2 ALPINE ELECTRONICS, INC. 3 NA NA NA NA 1492
In addition, I do not have exactly 7 observations for every company, which makes this a bit harder.
I also tried this:
test %>%
group_by(id) %>%
dplyr::mutate(Visit = 1:n()) %>%
gather("employees", "profit_margin", "RD_expenses", "RD_intensity", "sales", "treated", "year", key = variable, value = number) %>%
unite(combi, variable, Visit) %>%
spread(combi, number)
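But this gives an even stranger result, with columns going up to _31, even though the maximum number of observations for one company (or id) is 7.
Any ideas? I need this in order to use matching!
Thanks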
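I think you can skip creating the `row` column entirely. Any column not named in `names_from` or `values_from` is treated by `pivot_wider()` as an identifier, and since `row` takes a different value on every line within a company, it keeps one output row per input row.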
tidyr::pivot_wider(df, names_from = year,
values_from = c(employees, profit_margin, RD_expenses, RD_intensity, sales, treated))
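To see why the `row` column matters, here is a minimal sketch on a cut-down version of the question's data (only `employees` and `sales`, five rows). The object names `with_row`, `long_form` and `wide` are just for illustration, and I build the row counter with base `ave()` instead of `dplyr` to keep the example short.

```r
library(tidyr)

# Subset of the question's data: two companies, unbalanced years.
test <- data.frame(
  id        = c(1L, 1L, 1L, 2L, 2L),
  company   = c(rep("ALLERGAN PUBLIC LIMITED COMPANY", 3),
                rep("ALPINE ELECTRONICS, INC.", 2)),
  year      = c(2013L, 2014L, 2015L, 2015L, 2016L),
  employees = c(384, 432, 624, 1122, 1110),
  sales     = c(125844.3945, 111010.5704, 276862.329,
                1260609.732, 1217111.106)
)

# With a per-company row counter, (id, company, row) identifies every
# input row, so nothing can be collapsed: the output stays "long".
with_row  <- transform(test, row = ave(year, id, FUN = seq_along))
long_form <- pivot_wider(with_row, names_from = year,
                         values_from = c(employees, sales))

# Without the counter, only (id, company) identify a row, so each
# company collapses to a single row; missing years become NA.
wide <- pivot_wider(test, names_from = year,
                    values_from = c(employees, sales))

nrow(long_form)  # one row per original observation
nrow(wide)       # one row per company
```

The wide columns are named `employees_2013`, `sales_2013`, and so on. Years a company does not have (for example 2013 for id 2) simply stay `NA`, which is expected with an unbalanced panel.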
You can just use the `reshape()` function from base R.
reshape(d, direction = "wide", timevar = "year", idvar = c("id", "company"))
Any year for which a company has no data will be filled with `NA`s. Include any time-invariant variables (e.g., country or strategy, if measured) in `idvar`.
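As a rough sketch of that last point, here is the same call on a small subset of the question's data. I am assuming `treated` is constant within each company (it is in the sample shown, but that is an assumption about the full data), so it goes into `idvar` rather than being spread across years. The names `d` and `wide` are only illustrative.

```r
# Subset of the question's data: two companies, unbalanced years.
d <- data.frame(
  id        = c(1L, 1L, 1L, 2L, 2L),
  company   = c(rep("ALLERGAN PUBLIC LIMITED COMPANY", 3),
                rep("ALPINE ELECTRONICS, INC.", 2)),
  treated   = c("1", "1", "1", "1", "1"),  # assumed time-invariant
  year      = c(2013L, 2014L, 2015L, 2015L, 2016L),
  employees = c(384, 432, 624, 1122, 1110),
  sales     = c(125844.3945, 111010.5704, 276862.329,
                1260609.732, 1217111.106)
)

# Every column not listed in idvar or timevar is treated as
# time-varying and gets one column per year (employees.2013, ...).
wide <- reshape(d, direction = "wide", timevar = "year",
                idvar = c("id", "company", "treated"))

nrow(wide)   # one row per company
names(wide)  # treated stays a single column
```

If `treated` were left out of `idvar`, you would instead get `treated.2013`, `treated.2014`, and so on, which is usually not what you want for a matching variable.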