How to sample rows from a data frame that was previously grouped by a specific column, according to different conditions?
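I have a data frame 'df' containing species names, IDs, classes, and sizes.
I am trying to group the data frame by species and then sample three rows from each grouped species.
I would like to sample the three rows based primarily on class. If a species has 3 rows with "1_no", 1 row with "2_yes", and one more row with another class, I want to keep the first 3, because priority goes to the smallest number in class, and then to "yes" over "no". So if a species has both a "3_yes" row and a "3_no" row, the "3_yes" row is the one that should stay in the data frame.
However, if a species such as "Eutrigla gurnardus" has only "1_yes" in every row, I would like to sample three rows of that species at random.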
Species            | ID | class | size
---------------------------------------
Tilapia guineensis |  1 | 1_yes |  400
Tilapia guineensis |  1 | 1_no  |  300
Tilapia guineensis |  1 | 2_no  |  700
Tilapia guineensis |  1 | 3_yes |  900
Tilapia guineensis |  1 | 3_yes |  900
Tilapia zillii     |  2 | 2_yes |  600
Tilapia zillii     |  2 | 2_no  |  200
Tilapia zillii     |  2 | 1_yes |  500
Tilapia zillii     |  2 | 3_no  |  200
Tilapia zillii     |  2 | 2_yes |  500
Eutrigla gurnardus |  5 | 1_yes |  100
Eutrigla gurnardus |  5 | 1_yes |  200
Eutrigla gurnardus |  5 | 1_yes |  100
Eutrigla gurnardus |  5 | 1_yes |  200
Sprattus sprattus  |  6 | 4_no  |  300
Sprattus sprattus  |  6 | 3_yes |  400
Sprattus sprattus  |  6 | 4_yes |  300
Sprattus sprattus  |  6 | 5_yes |  400
My expected output would look like this:
Species            | ID | class | size
---------------------------------------
Tilapia guineensis |  1 | 1_yes |  400
Tilapia guineensis |  1 | 1_no  |  300
Tilapia guineensis |  1 | 2_no  |  700
Tilapia zillii     |  2 | 2_yes |  600
Tilapia zillii     |  2 | 1_yes |  500
Tilapia zillii     |  2 | 2_yes |  500
Eutrigla gurnardus |  5 | 1_yes |  100
Eutrigla gurnardus |  5 | 1_yes |  100
Eutrigla gurnardus |  5 | 1_yes |  200
Sprattus sprattus  |  6 | 4_no  |  300
Sprattus sprattus  |  6 | 3_yes |  400
Sprattus sprattus  |  6 | 4_yes |  300
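The priority rule described in the question (smallest number first, then "yes" before "no") can be read as a two-part sort key. The following is a minimal base R sketch of that reading; the helper name `class_key` and the sample vector `classes` are made up for illustration and are not part of the original post.

```r
# Hypothetical helper (not from the original post): split a class string such
# as "3_yes" into a numeric part and a flag that is FALSE for "yes".
class_key <- function(class) {
  parts <- strsplit(trimws(class), "_", fixed = TRUE)
  data.frame(
    number  = as.integer(vapply(parts, `[`, character(1), 1)),
    # FALSE sorts before TRUE, so "yes" rows come before "no" rows
    not_yes = vapply(parts, `[`, character(1), 2) != "yes"
  )
}

# Illustrative class values, deliberately out of order
classes <- c("3_no", "1_no", "2_yes", "3_yes", "1_yes")
key <- class_key(classes)

# Sort by number first, then put "yes" ahead of "no"
ranked <- classes[order(key$number, key$not_yes)]
print(ranked)
#> [1] "1_yes" "1_no"  "2_yes" "3_yes" "3_no"
```

Under this reading, keeping three rows per species amounts to taking the first three rows of each species after sorting by this key.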
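You can put the data in random order first and then arrange it again by the two components of class. Rows that tie on both components stay in their shuffled order, so taking the first 3 rows of each Species selects by priority and breaks ties at random.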
df <- structure(list(Species = c("Tilapia guineensis", "Tilapia guineensis",
"Tilapia guineensis", "Tilapia guineensis", "Tilapia guineensis",
"Tilapia zillii", "Tilapia zillii", "Tilapia zillii", "Tilapia zillii",
"Tilapia zillii", "Eutrigla gurnardus", "Eutrigla gurnardus",
"Eutrigla gurnardus", "Eutrigla gurnardus", "Sprattus sprattus",
"Sprattus sprattus", "Sprattus sprattus", "Sprattus sprattus"
), ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 5, 5, 5, 5, 6, 6, 6,
6), class = c("1_yes", "1_no", "2_no", "3_yes", "3_yes", "2_yes",
"2_no", "1_yes", "3_no", "2_yes", "1_yes", "1_yes", "1_yes",
"1_yes", "4_no", "3_yes", "4_yes", "5_yes"), size = c(400, 300,
700, 900, 900, 600, 200, 500, 200, 500, 100, 200, 100, 200, 300,
400, 300, 400)), class = c("data.frame"), row.names = c(NA,
-18L))
library(dplyr)
library(tidyr)
df %>%
  # split class into its two components (number and yes/no), keeping class
  separate(class, into = c("number", "yesno"),
           remove = FALSE, convert = TRUE) %>%
  group_by(Species) %>%
  # shuffle the rows within each species
  slice_sample(prop = 1) %>%
  # sort by number (1, 2, 3, ...), then "yes" before "no";
  # tied rows keep their shuffled order
  arrange(number, desc(yesno)) %>%
  # take the first 3 rows of each species
  slice_head(n = 3) %>%
  # drop the helper columns
  select(-number, -yesno)
#> # A tibble: 12 × 4
#> # Groups:   Species [4]
#>    Species               ID class  size
#>    <chr>              <dbl> <chr> <dbl>
#>  1 Eutrigla gurnardus     5 1_yes   200
#>  2 Eutrigla gurnardus     5 1_yes   100
#>  3 Eutrigla gurnardus     5 1_yes   200
#>  4 Sprattus sprattus      6 3_yes   400
#>  5 Sprattus sprattus      6 4_yes   300
#>  6 Sprattus sprattus      6 4_no    300
#>  7 Tilapia guineensis     1 1_yes   400
#>  8 Tilapia guineensis     1 1_no    300
#>  9 Tilapia guineensis     1 2_no    700
#> 10 Tilapia zillii         2 1_yes   500
#> 11 Tilapia zillii         2 2_yes   500
#> 12 Tilapia zillii         2 2_yes   600
Created on 2022-05-26 by the reprex package (v2.0.1)
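For readers who want the same idea without dplyr or tidyr, here is a rough base R sketch. It is not the answer's code: the function name `sample_by_priority`, the random tie-break column, the two-species toy data, and the fixed seed are all assumptions added for illustration. The seed is there only so that the random draw can be repeated.

```r
# Toy subset of the question's data (two species only, to keep the sketch short)
df <- data.frame(
  Species = rep(c("Tilapia guineensis", "Eutrigla gurnardus"), times = c(5, 4)),
  class   = c("1_yes", "1_no", "2_no", "3_yes", "3_yes",
              "1_yes", "1_yes", "1_yes", "1_yes"),
  size    = c(400, 300, 700, 900, 900, 100, 200, 100, 200),
  stringsAsFactors = FALSE
)

# Hypothetical helper, not part of the original answer
sample_by_priority <- function(data, n = 3) {
  parts   <- strsplit(data$class, "_", fixed = TRUE)
  number  <- as.integer(vapply(parts, `[`, character(1), 1))
  not_yes <- vapply(parts, `[`, character(1), 2) != "yes"
  # Random tie-breaker: rows with identical priority end up in random order
  tie <- runif(nrow(data))
  sorted <- data[order(data$Species, number, not_yes, tie), , drop = FALSE]
  # Position of each row within its species after sorting
  position <- ave(seq_len(nrow(sorted)), sorted$Species, FUN = seq_along)
  sorted[position <= n, , drop = FALSE]
}

set.seed(2022)  # assumption: fixed seed, used only to make the draw repeatable
result <- sample_by_priority(df)
print(result)
```

For "Tilapia guineensis" the three kept rows are fully determined by the priority rule, while for "Eutrigla gurnardus" every row ties, so which three of the four rows are kept depends on the seed.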