Sequentially rename duplicate values in a character variable before reshaping it with dcast
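I am scraping car information from a website, but the data I get from it is inconsistent and not very clean. I am trying to clean it up and organize it into a data frame.
For example: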
dd <- data.frame(measure = c("wheel", "wheel", "length", "width", "wheel", "width"), value = 1:6, model = "a", stringsAsFactors = FALSE)
dd
  measure value model
1   wheel     1     a
2   wheel     2     a
3  length     3     a
4   width     4     a
5   wheel     5     a
6   width     6     a
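In this example, I have 3 values for wheel and 2 values for width. In my real data, it is not always the same measure that is duplicated: a measure may or may not be duplicated, and it may be duplicated more than once.
I need to reshape this table to one row per model, but I don't want to aggregate the values that share a common measure. To be precise, I want the table to become: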
  model length wheel wheel1 wheel2 width width1
1     a      3     1      2      5     4      6
This was obtained using dcast on manually modified data:
library(reshape2)
res <- data.frame(measure = c("wheel", "wheel1", "length", "width", "wheel2", "width1"), value = 1:6, model = "a", stringsAsFactors = FALSE)
dcast(res, model ~ measure)
I need a way either to make dcast not aggregate over measure, or to automatically modify dd so that it becomes res.
I tried something ugly that is not quite what I need:
dd[duplicated(dd$measure), "measure"] <- paste0(dd[duplicated(dd$measure), "measure"], 1:3)
dd
  measure value model
1   wheel     1     a
2  wheel1     2     a
3  length     3     a
4   width     4     a
5  wheel2     5     a
6  width3     6     a
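This code does not work because the duplicated width gets the suffix 3 instead of 1 (the target name is width1): the counter runs across all duplicates rather than restarting for each measure. It also does not adapt to another table, for example: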
dd2 <- data.frame(measure = c("wheel", "wheel", "length", "width", "wheel"), value = 1:5, model = "a", stringsAsFactors = FALSE)
dd2[duplicated(dd2$measure), "measure"] <- paste0(dd2[duplicated(dd2$measure), "measure"], 1:3)
Error in `[<-.data.frame`(`*tmp*`, duplicated(dd2$measure), "measure", :
  replacement has 3 rows, data has 2
In any case, how can I dynamically modify my variable measure so that all of its values are unique?
You can use dplyr::mutate as follows:
library(dplyr)

dd <- dd %>%
  group_by(model, measure) %>%
  # the first occurrence keeps its name; later ones get the suffix 1, 2, ...
  mutate(measure2 = paste0(measure, ifelse(row_number() > 1, row_number() - 1, ""))) %>%
  ungroup() %>%
  mutate(measure = measure2) %>%
  select(measure, model, value)
dd
# A tibble: 6 x 3
  measure model value
  <chr>   <chr> <int>
1 wheel   a         1
2 wheel1  a         2
3 length  a         3
4 width   a         4
5 wheel2  a         5
6 width1  a         6
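To connect this back to the question, here is a minimal sketch of the full round trip: the same renaming step followed by dcast. It assumes dplyr and reshape2 are installed; the names renamed and wide are only illustrative, and value.var = "value" is spelled out so dcast does not have to guess the value column.

```r
library(dplyr)
library(reshape2)

dd <- data.frame(
  measure = c("wheel", "wheel", "length", "width", "wheel", "width"),
  value = 1:6,
  model = "a",
  stringsAsFactors = FALSE
)

# Number repeated measures within each model (same logic as the answer above).
# A helper column is used because a grouping column cannot be mutated directly.
renamed <- dd %>%
  group_by(model, measure) %>%
  mutate(measure2 = paste0(measure, ifelse(row_number() > 1, row_number() - 1, ""))) %>%
  ungroup() %>%
  mutate(measure = measure2) %>%
  select(model, measure, value)

# Every (model, measure) pair is now unique, so dcast has nothing to aggregate.
wide <- dcast(as.data.frame(renamed), model ~ measure, value.var = "value")
wide
```

Because each model/measure combination occurs only once after renaming, dcast should no longer fall back to an aggregation function.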
Another tidyverse possibility is:
library(dplyr)
library(tidyr)

dd %>%
  arrange(model, measure) %>%
  group_by(model, measure) %>%
  # number every occurrence, including the first: wheel_1, wheel_2, ...
  mutate(var = paste(measure, seq_along(measure), sep = "_")) %>%
  ungroup() %>%
  select(-measure) %>%
  spread(var, value)
  model length_1 wheel_1 wheel_2 wheel_3 width_1 width_2
  <chr>    <int>   <int>   <int>   <int>   <int>   <int>
1 a            3       1       2       5       4       6
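In tidyr 1.0.0 and later, spread is superseded by pivot_wider. A rough equivalent of the pipeline above might look like the sketch below; it assumes a recent tidyr, and wide is just an illustrative name. Note that pivot_wider keeps the new columns in order of first appearance rather than sorting them, so the column order can differ from the spread output.

```r
library(dplyr)
library(tidyr)

dd <- data.frame(
  measure = c("wheel", "wheel", "length", "width", "wheel", "width"),
  value = 1:6,
  model = "a",
  stringsAsFactors = FALSE
)

wide <- dd %>%
  group_by(model, measure) %>%
  # number every occurrence of a measure within its model
  mutate(var = paste(measure, row_number(), sep = "_")) %>%
  ungroup() %>%
  select(-measure) %>%
  # one column per numbered measure, one row per model
  pivot_wider(names_from = var, values_from = value)
wide
```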
You can also renumber the values with sapply:
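# `<<-` updates dd in the enclosing (global) environment; invisible() hides sapply's return value
invisible(sapply(unique(dd$measure), function(x) {
  z <- dd$measure[dd$measure %in% x]
  # only measures that occur more than once get a numeric suffix
  if (length(z) > 1)
    dd$measure[dd$measure %in% x] <<- paste0(z, ".", seq_along(z))
}))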
Then use reshape afterwards:
reshape(dd, direction = "wide", timevar = "measure", idvar = "model")
#   model value.wheel.1 value.wheel.2 value.length value.width.1 value.wheel.3 value.width.2
# 1     a             1             2            3             4             5             6
Data
dd <- structure(list(measure = c("wheel", "wheel", "length", "width", "wheel", "width"),
value = 1:6, model = c("a", "a", "a", "a", "a", "a")),
class = "data.frame", row.names = c(NA, -6L))
make.unique does exactly this:
dd$measure <- make.unique(dd$measure, sep = "")
dd
#   measure value model
# 1   wheel     1     a
# 2  wheel1     2     a
# 3  length     3     a
# 4   width     4     a
# 5  wheel2     5     a
# 6  width1     6     a
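One caveat: make.unique works on the whole vector, so with several models the numbering would carry over from one model to the next. A possible workaround, sketched here in base R only, is to apply it within each model using ave. The two-model data frame dd3 is a made-up example and not part of the question.

```r
# Hypothetical data with two models (not from the original question)
dd3 <- data.frame(
  measure = c("wheel", "wheel", "length", "wheel", "width", "width"),
  value = 1:6,
  model = c("a", "a", "a", "b", "b", "b"),
  stringsAsFactors = FALSE
)

# Apply make.unique separately within each model so the suffixes restart per model
dd3$measure <- ave(dd3$measure, dd3$model,
                   FUN = function(x) make.unique(x, sep = ""))

# One row per model; measures a model does not have become NA
wide <- reshape(dd3, direction = "wide", timevar = "measure", idvar = "model")
wide
```

Without the grouping, the first wheel of model b would be renamed wheel2 and end up in a different column from the first wheel of model a.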