Spreading a dataset on selected columns without any identifier
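I want to spread a dataset using a few selected columns, where there is no unique identifier for the rows. For this, I am using the publicly available iris dataset.
I tried dropping the unneeded columns first, then keeping only the unique rows, and then applying spread() to the result.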
library(dplyr)
library(tidyr)
# positional arguments
iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>%
  spread(Species, Sepal.Length)
# same call with named arguments
iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>%
  spread(key = Species, value = Sepal.Length)
But it gives the following duplicate identifiers error:
Error: Duplicate identifiers for rows (1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12, 13, 14, 15), (16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27,
28, 29, 30, 31, 32, 33, 34, 35, 36), (37, 38, 39, 40, 41, 42, 43, 44,
45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57)
Using row_number(), I created a unique identifier to use when spreading the data, which avoids the duplicate identifiers error.
iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>%
  mutate(row = row_number()) %>% spread(Species, Sepal.Length)
This gives the following output:
# row setosa versicolor virginica
# 1 1 5.1 NA NA
# 2 2 4.9 NA NA
# 3 3 4.7 NA NA
# ...
# 16 16 NA 7.0 NA
# 17 17 NA 6.4 NA
# 18 18 NA 6.9 NA
# ...
# 37 37 NA NA 6.3
# 38 38 NA NA 5.8
# 39 39 NA NA 7.1
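However, because of the row numbers there are a lot of NAs, which is not what I expected. I tried to exclude the row numbers to get the expected values, but that did not work.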
iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>%
  mutate(row = row_number()) %>% spread(Species, Sepal.Length, -row)
iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>%
  mutate(row = row_number()) %>% spread(Species, Sepal.Length, -one_of(row))
Expected output:
tmp <- iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>%
  mutate(row = row_number()) %>% spread(Species, Sepal.Length)
cbind(setosa = unique(tmp$setosa), versicolor = unique(tmp$versicolor), virginica = unique(tmp$virginica))
# setosa versicolor virginica
# [1,] 5.1 7.0 6.3
# [2,] 4.9 6.4 5.8
# [3,] 4.7 6.9 7.1
# [4,] 4.6 5.5 6.5
# [5,] 5.0 6.5 7.6
# [6,] 5.4 5.7 4.9
# [7,] 4.4 6.3 7.3
# [8,] 4.8 4.9 6.7
# [9,] 4.3 6.6 7.2
# [10,] 5.8 5.2 6.4
# [11,] 5.7 5.0 6.8
# [12,] 5.2 5.9 5.7
# [13,] 5.5 6.0 7.7
# [14,] 4.5 6.1 6.0
# [15,] 5.3 5.6 6.9
# [16,] 5.1 6.7 5.6
# [17,] 4.9 5.8 6.2
# [18,] 4.7 6.2 6.1
# [19,] 4.6 6.8 7.4
# [20,] 5.0 5.4 7.9
# [21,] 5.4 5.1 5.9
library(dplyr)
library(tidyr)
as_tibble(iris) %>%                      # tbl_df() is deprecated; as_tibble() is the current equivalent
  select(Species, Sepal.Length) %>%      # select columns of interest
  group_by(Species) %>%                  # for each species
  mutate(id = row_number()) %>%          # create a row identifier within the species
  spread(Species, Sepal.Length)          # reshape dataset
# # A tibble: 50 x 4
# id setosa versicolor virginica
# * <int> <dbl> <dbl> <dbl>
# 1 1 5.1 7.0 6.3
# 2 2 4.9 6.4 5.8
# 3 3 4.7 6.9 7.1
# 4 4 4.6 5.5 6.3
# 5 5 5.0 6.5 6.5
# 6 6 5.4 5.7 7.6
# 7 7 4.6 6.3 4.9
# 8 8 5.0 4.9 7.3
# 9 9 4.4 6.6 6.7
# 10 10 4.9 5.2 7.2
# # ... with 40 more rows
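Pay close attention to how you create and use the row identifier. The code above simply relies on the order of the dataset. If you reorder the data in some way, you will get different combinations of values in each row. Check the code below: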
as_tibble(iris) %>%                      # tbl_df() is deprecated; as_tibble() is the current equivalent
  arrange(desc(Sepal.Length)) %>%        # order your values descending
  select(Species, Sepal.Length) %>%      # select columns of interest
  group_by(Species) %>%                  # for each species
  mutate(id = row_number()) %>%          # create a row identifier within the species
  spread(Species, Sepal.Length)          # reshape dataset
# # A tibble: 50 x 4
# id setosa versicolor virginica
# * <int> <dbl> <dbl> <dbl>
# 1 1 5.8 7.0 7.9
# 2 2 5.7 6.9 7.7
# 3 3 5.7 6.8 7.7
# 4 4 5.5 6.7 7.7
# 5 5 5.5 6.7 7.7
# 6 6 5.4 6.7 7.6
# 7 7 5.4 6.6 7.4
# 8 8 5.4 6.6 7.3
# 9 9 5.4 6.5 7.2
# 10 10 5.4 6.4 7.2
# # ... with 40 more rows
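The difference from the previous version is that arrange(desc(Sepal.Length)) ensures the higher values end up in the top rows (descending order).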
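In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(), so the same per-species identifier approach could be sketched as below. This is a minimal sketch and not part of the original answer: the helper name spread_by_species and the object names wide_all and wide_unique are made up for illustration, and the distinct() step is an assumption about what the question is after (only the unique Sepal.Length values per species).

```r
library(dplyr)
library(tidyr)

# Hypothetical helper (name chosen for illustration): number the rows
# within each species, then reshape to one column per species.
spread_by_species <- function(df) {
  df %>%
    group_by(Species) %>%
    mutate(id = row_number()) %>%   # per-species row identifier
    ungroup() %>%
    pivot_wider(names_from = Species, values_from = Sepal.Length)
}

# All 50 values per species, in dataset order
wide_all <- iris %>%
  select(Species, Sepal.Length) %>%
  spread_by_species()

# Only the distinct values per species (assumed goal of the question);
# species with fewer distinct values are padded with NA at the bottom
wide_unique <- iris %>%
  select(Species, Sepal.Length) %>%
  distinct() %>%
  spread_by_species()
```

wide_all has 50 rows, matching the answer above. wide_unique has 21 rows, and its setosa column ends in NA because that species has only 15 distinct Sepal.Length values. The id column can be dropped afterwards with select(-id) if it is not needed.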