Sequentially rename duplicate values in a character variable before reshaping it with dcast

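I am scraping car information from a website, but the data I get is inconsistent and not very clean. I am trying to clean it up and organize it into a data frame.

For example:

dd <- data.frame(measure = c("wheel", "wheel", "length", "width", "wheel", "width"), value = 1:6, model = "a", stringsAsFactors = FALSE)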
dd
  measure value model
1   wheel     1     a
2   wheel     2     a
3  length     3     a
4   width     4     a
5   wheel     5     a
6   width     6     a

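In this example I have 3 values for wheel and 2 values for width. In my real data the duplicated measure is not always the same one: a measure may or may not be duplicated, and it may be repeated more than once.

I need to reshape this table to one row per model, but I don't want to aggregate values that share a common measure. To be precise, I want the table to become: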

  model length wheel wheel1 wheel2 width width1
1     a      3     1      2      5     4      6

This was obtained by running dcast on manually modified data:

library(reshape2)
res <- data.frame(measure = c("wheel", "wheel1", "length", "width", "wheel2", "width1"), value = 1:6, model = "a", stringsAsFactors = FALSE)
dcast(res, model ~ measure, value.var = "value")

I need a way either to make dcast not aggregate measure, or to turn dd into res automatically.

I tried something ugly that is not quite what I need:

dd[duplicated(dd$measure), "measure"] <- paste0(dd[duplicated(dd$measure), "measure"] , 1:3)
dd
  measure value model
1   wheel     1     a
2  wheel1     2     a
3  length     3     a
4   width     4     a
5  wheel2     5     a
6  width3     6     a

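This code does not work, because width gets the suffix 3 instead of the 1 it should get (width1). It also does not adapt to another table, for example:

dd2 <- data.frame(measure = c("wheel", "wheel", "length", "width", "wheel"), value = 1:5, model = "a", stringsAsFactors = FALSE)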
dd2[duplicated(dd2$measure), "measure"] <- paste0(dd2[duplicated(dd2$measure), "measure"] , 1:3)
Error in `[<-.data.frame`(`*tmp*`, duplicated(dd2$measure), "measure",  : 
  replacement has 3 rows, data has 2

In any case, how can I dynamically modify my measure variable so that every value is unique?

You can use dplyr::mutate as follows:

library(dplyr)

dd <- dd %>%
  group_by(model, measure) %>%
  mutate(measure2 = paste0(measure, ifelse(row_number() > 1, row_number() - 1, ""))) %>%
  ungroup() %>%
  mutate(measure = measure2) %>%
  select(measure, model, value)
dd
# A tibble: 6 x 3
  measure model value
  <chr>   <chr> <int>
1 wheel   a         1
2 wheel1  a         2
3 length  a         3
4 width   a         4
5 wheel2  a         5
6 width1  a         6
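
The answer above stops at the renamed long table. To get the one-row-per-model layout asked for in the question, the result still has to be cast to wide format. A minimal sketch, assuming a fresh copy of the question's dd and that dplyr and reshape2 are installed; the names renamed and wide are only illustrative:

```r
library(dplyr)
library(reshape2)

# Fresh copy of the example data from the question
dd <- data.frame(measure = c("wheel", "wheel", "length", "width", "wheel", "width"),
                 value = 1:6, model = "a", stringsAsFactors = FALSE)

# Number repeats within each (model, measure) group; the first occurrence
# keeps its plain name. A new column is used because dplyr does not allow
# modifying a grouping variable inside mutate().
renamed <- dd %>%
  group_by(model, measure) %>%
  mutate(measure2 = paste0(measure, ifelse(row_number() > 1, row_number() - 1, ""))) %>%
  ungroup()

# Cast to one row per model; as.data.frame() hands dcast a plain data frame
wide <- dcast(as.data.frame(renamed), model ~ measure2, value.var = "value")
wide
```

Because every measure2 value is now unique within a model, dcast has nothing to aggregate and each original value ends up in its own column.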

Another tidyverse possibility is:

library(dplyr)
library(tidyr)

dd %>%
 arrange(model, measure) %>%
 group_by(model, measure) %>%
 mutate(var = paste(measure, seq_along(measure), sep = "_")) %>%
 ungroup() %>%
 select(-measure) %>%
 spread(var, value)

  model length_1 wheel_1 wheel_2 wheel_3 width_1 width_2
  <chr>    <int>   <int>   <int>   <int>   <int>   <int>
1 a            3       1       2       5       4       6
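
In current tidyr releases spread() is superseded by pivot_wider(), so the same idea can also be written as below. This is a sketch rather than a drop-in replacement: it assumes tidyr 1.0.0 or later and the question's dd, and the name wide is only illustrative.

```r
library(dplyr)
library(tidyr)

# Fresh copy of the example data from the question
dd <- data.frame(measure = c("wheel", "wheel", "length", "width", "wheel", "width"),
                 value = 1:6, model = "a", stringsAsFactors = FALSE)

wide <- dd %>%
  arrange(model, measure) %>%
  group_by(model, measure) %>%
  # Suffix every value with its position inside the (model, measure) group
  mutate(var = paste(measure, row_number(), sep = "_")) %>%
  ungroup() %>%
  select(-measure) %>%
  # One column per unique name, one row per model
  pivot_wider(names_from = var, values_from = value)
wide
```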

You could also renumber the values with sapply:

# Append a running index to every measure that occurs more than once
sapply(unique(dd$measure), function(x) {
  z <- dd$measure[dd$measure %in% x]
  if (length(z) > 1)
    dd$measure[dd$measure %in% x] <<- paste0(z, ".", seq_along(z))
})

and then use reshape:

reshape(dd, direction="wide", timevar="measure", idvar="model")
#   model value.wheel.1 value.wheel.2 value.length value.width.1 value.wheel.3 value.width.2
# 1     a             1             2            3             4             5             6

Data

dd <- structure(list(measure = c("wheel", "wheel", "length", "width", "wheel", "width"), 
                     value = 1:6, model = c("a", "a", "a", "a", "a", "a")), 
                class = "data.frame", row.names = c(NA, -6L))
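
The sapply renumbering above modifies dd through `<<-`, which only works when dd lives in the enclosing environment. As an alternative sketch in base R, the same suffixes can be computed without side effects using ave(). The helper names idx and n are illustrative, and grouping by model as well as measure is an assumption made here so that the numbering restarts for each model:

```r
dd <- data.frame(measure = c("wheel", "wheel", "length", "width", "wheel", "width"),
                 value = 1:6, model = "a", stringsAsFactors = FALSE)

# Running position of each row inside its (model, measure) group
idx <- ave(seq_along(dd$measure), dd$model, dd$measure, FUN = seq_along)
# Number of rows in each row's group
n <- ave(seq_along(dd$measure), dd$model, dd$measure, FUN = length)

# Only measures that occur more than once get a ".<index>" suffix
dd$measure <- ifelse(n > 1, paste0(dd$measure, ".", idx), dd$measure)

wide <- reshape(dd, direction = "wide", timevar = "measure", idvar = "model")
wide
```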

make.unique does exactly this:

dd$measure <- make.unique(dd$measure, sep = "")
dd
#    measure value model
# 1   wheel     1     a
# 2  wheel1     2     a
# 3  length     3     a
# 4   width     4     a
# 5  wheel2     5     a
# 6  width1     6     a
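
Note that make.unique deduplicates across the whole vector, so with several models in one data frame the numbering would carry over from one model to the next. A minimal sketch of one way to restart it per model, using ave(); the two-model data frame dd3 is made up for illustration and is not from the question:

```r
# Hypothetical data with two models
dd3 <- data.frame(measure = c("wheel", "wheel", "width", "wheel", "wheel"),
                  value = 1:5,
                  model = c("a", "a", "a", "b", "b"),
                  stringsAsFactors = FALSE)

# Apply make.unique separately to the measures of each model
dd3$measure <- ave(dd3$measure, dd3$model,
                   FUN = function(x) make.unique(x, sep = ""))
dd3
```

With the numbering restarted per model, both models share the same column names (wheel, wheel1, ...) after casting to wide format.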