按 R 中具有相同组 ID 的另一个数据框中的值过滤分组数据框
Filter grouped dataframe by value in another dataframe with same group IDs in R
我希望有人能帮助我解决这个问题,我正在使用 R 中的大型数据集。我有一个数据框,其中包含数千棵树 ID、测量年份和给定测量年份的直径。我想过滤此数据框,以便只保留相邻树死亡之前的行。我有第二个数据框,其中包含树 ID 和相邻树死亡的年份,以提供过滤依据的年份。
原始df的一小部分(仅4棵树):
tree_id year diameter
1 T1 1978 48.2
2 T1 1990 48.6
3 T1 1995 49.0
4 T1 2002 49.6
5 T1 2008 50.3
6 T1 2012 50.4
7 T1 2017 50.6
8 T2 1978 76.3
9 T2 1984 76.8
10 T2 1990 77.3
11 T2 1995 78.7
12 T2 2002 79.5
13 T2 2008 80.6
14 T2 2012 81.1
15 T2 2017 81.6
16 T3 1978 15.7
17 T3 1984 16.5
18 T3 1990 17.7
19 T3 1995 18.3
20 T3 2002 19.3
21 T3 2008 20.0
22 T3 2012 20.0
23 T3 2017 20.2
24 T4 1978 50.5
25 T4 1984 51.2
26 T4 1990 51.9
27 T4 1995 52.5
28 T4 2002 53.2
29 T4 2008 54.8
30 T4 2012 53.7
31 T4 2017 54.0
这里是数据框,我想通过以下方式过滤上面的内容:
tree_id neb_death
1 T1 2002
2 T2 2008
3 T3 1995
4 T4 2012
例如,我只想在更大的数据框中保留 tree_id = T1 的行,测量年份在 2002 年之前。我非常感谢任何使用基本 R 或 dplyr 方法的帮助。
谢谢!
我们可以先 left_join
通过 tree_id
然后 filter
:
library(dplyr)
left_join(df, df1, by="tree_id") %>%
filter(year < neb_death) %>%
select(-neb_death)
输出:
tree_id year diameter
<chr> <int> <dbl>
1 T1 1978 48.2
2 T1 1990 48.6
3 T1 1995 49
4 T2 1978 76.3
5 T2 1984 76.8
6 T2 1990 77.3
7 T2 1995 78.7
8 T2 2002 79.5
9 T3 1978 15.7
10 T3 1984 16.5
11 T3 1990 17.7
12 T4 1978 50.5
13 T4 1984 51.2
14 T4 1990 51.9
15 T4 1995 52.5
16 T4 2002 53.2
17 T4 2008 54.8
您可以使用 data.table 加入匹配的 tree_id
和 year < neb_death
。如果第一个table是df
,第二个是df2
:
library(data.table)
setDT(df)
setDT(df2)
df[df2, on = .(tree_id, year < neb_death)]
#> tree_id year diameter
#> <char> <int> <num>
#> 1: T1 2002 48.2
#> 2: T1 2002 48.6
#> 3: T1 2002 49.0
#> 4: T2 2008 76.3
#> 5: T2 2008 76.8
#> 6: T2 2008 77.3
#> 7: T2 2008 78.7
#> 8: T2 2008 79.5
#> 9: T3 1995 15.7
#> 10: T3 1995 16.5
#> 11: T3 1995 17.7
#> 12: T4 2012 50.5
#> 13: T4 2012 51.2
#> 14: T4 2012 51.9
#> 15: T4 2012 52.5
#> 16: T4 2012 53.2
#> 17: T4 2012 54.8
由 reprex package (v2.0.1)
于 2021-12-20 创建
使用的数据
df <- structure(list(tree_id = c("T1", "T1", "T1", "T1", "T1", "T1",
"T1", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T3", "T3",
"T3", "T3", "T3", "T3", "T3", "T3", "T4", "T4", "T4", "T4", "T4",
"T4", "T4", "T4"), year = c(1978L, 1990L, 1995L, 2002L, 2008L,
2012L, 2017L, 1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L,
2017L, 1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L, 2017L,
1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L, 2017L), diameter = c(48.2,
48.6, 49, 49.6, 50.3, 50.4, 50.6, 76.3, 76.8, 77.3, 78.7, 79.5,
80.6, 81.1, 81.6, 15.7, 16.5, 17.7, 18.3, 19.3, 20, 20, 20.2,
50.5, 51.2, 51.9, 52.5, 53.2, 54.8, 53.7, 54)), row.names = c(NA,
-31L), class = "data.frame")
df2 <- structure(list(tree_id = c("T1", "T2", "T3", "T4"), neb_death = c(2002L,
2008L, 1995L, 2012L)), row.names = c(NA, -4L), class = "data.frame")
使用基数 R,我们可以在 df2
上按行循环,识别 df1
中不需要的观察结果并删除它们。
df1[-unlist(apply(df2, 1, \(x) which(df1$tree_id == x[1] & df1$year >= x[2]))), ]
# tree_id year diameter
# 1 T1 1978 48.2
# 2 T1 1990 48.6
# 3 T1 1995 49.0
# 8 T2 1978 76.3
# 9 T2 1984 76.8
# 10 T2 1990 77.3
# 11 T2 1995 78.7
# 12 T2 2002 79.5
# 16 T3 1978 15.7
# 17 T3 1984 16.5
# 18 T3 1990 17.7
# 24 T4 1978 50.5
# 25 T4 1984 51.2
# 26 T4 1990 51.9
# 27 T4 1995 52.5
# 28 T4 2002 53.2
# 29 T4 2008 54.8
数据:
df1 <- structure(list(tree_id = c("T1", "T1", "T1", "T1", "T1", "T1",
"T1", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T3", "T3",
"T3", "T3", "T3", "T3", "T3", "T3", "T4", "T4", "T4", "T4", "T4",
"T4", "T4", "T4"), year = c(1978L, 1990L, 1995L, 2002L, 2008L,
2012L, 2017L, 1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L,
2017L, 1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L, 2017L,
1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L, 2017L), diameter = c(48.2,
48.6, 49, 49.6, 50.3, 50.4, 50.6, 76.3, 76.8, 77.3, 78.7, 79.5,
80.6, 81.1, 81.6, 15.7, 16.5, 17.7, 18.3, 19.3, 20, 20, 20.2,
50.5, 51.2, 51.9, 52.5, 53.2, 54.8, 53.7, 54)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24",
"25", "26", "27", "28", "29", "30", "31"))
df2 <- structure(list(tree_id = c("T1", "T2", "T3", "T4"), neb_death = c(2002L,
2008L, 1995L, 2012L)), class = "data.frame", row.names = c("1",
"2", "3", "4"))
我希望有人能帮助我解决这个问题,我正在使用 R 中的大型数据集。我有一个数据框,其中包含数千棵树 ID、测量年份和给定测量年份的直径。我想过滤此数据框,以便只保留相邻树死亡之前的行。我有第二个数据框,其中包含树 ID 和相邻树死亡的年份,以提供过滤依据的年份。
原始df的一小部分(仅4棵树):
tree_id year diameter
1 T1 1978 48.2
2 T1 1990 48.6
3 T1 1995 49.0
4 T1 2002 49.6
5 T1 2008 50.3
6 T1 2012 50.4
7 T1 2017 50.6
8 T2 1978 76.3
9 T2 1984 76.8
10 T2 1990 77.3
11 T2 1995 78.7
12 T2 2002 79.5
13 T2 2008 80.6
14 T2 2012 81.1
15 T2 2017 81.6
16 T3 1978 15.7
17 T3 1984 16.5
18 T3 1990 17.7
19 T3 1995 18.3
20 T3 2002 19.3
21 T3 2008 20.0
22 T3 2012 20.0
23 T3 2017 20.2
24 T4 1978 50.5
25 T4 1984 51.2
26 T4 1990 51.9
27 T4 1995 52.5
28 T4 2002 53.2
29 T4 2008 54.8
30 T4 2012 53.7
31 T4 2017 54.0
这里是数据框,我想通过以下方式过滤上面的内容:
tree_id neb_death
1 T1 2002
2 T2 2008
3 T3 1995
4 T4 2012
例如,我只想在更大的数据框中保留 tree_id = T1 的行,测量年份在 2002 年之前。我非常感谢任何使用基本 R 或 dplyr 方法的帮助。 谢谢!
我们可以先 left_join
通过 tree_id
然后 filter
:
library(dplyr)
left_join(df, df1, by="tree_id") %>%
filter(year < neb_death) %>%
select(-neb_death)
输出:
tree_id year diameter
<chr> <int> <dbl>
1 T1 1978 48.2
2 T1 1990 48.6
3 T1 1995 49
4 T2 1978 76.3
5 T2 1984 76.8
6 T2 1990 77.3
7 T2 1995 78.7
8 T2 2002 79.5
9 T3 1978 15.7
10 T3 1984 16.5
11 T3 1990 17.7
12 T4 1978 50.5
13 T4 1984 51.2
14 T4 1990 51.9
15 T4 1995 52.5
16 T4 2002 53.2
17 T4 2008 54.8
您可以使用 data.table 加入匹配的 tree_id
和 year < neb_death
。如果第一个table是df
,第二个是df2
:
library(data.table)
setDT(df)
setDT(df2)
df[df2, on = .(tree_id, year < neb_death)]
#> tree_id year diameter
#> <char> <int> <num>
#> 1: T1 2002 48.2
#> 2: T1 2002 48.6
#> 3: T1 2002 49.0
#> 4: T2 2008 76.3
#> 5: T2 2008 76.8
#> 6: T2 2008 77.3
#> 7: T2 2008 78.7
#> 8: T2 2008 79.5
#> 9: T3 1995 15.7
#> 10: T3 1995 16.5
#> 11: T3 1995 17.7
#> 12: T4 2012 50.5
#> 13: T4 2012 51.2
#> 14: T4 2012 51.9
#> 15: T4 2012 52.5
#> 16: T4 2012 53.2
#> 17: T4 2012 54.8
由 reprex package (v2.0.1)
于 2021-12-20 创建使用的数据
df <- structure(list(tree_id = c("T1", "T1", "T1", "T1", "T1", "T1",
"T1", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T3", "T3",
"T3", "T3", "T3", "T3", "T3", "T3", "T4", "T4", "T4", "T4", "T4",
"T4", "T4", "T4"), year = c(1978L, 1990L, 1995L, 2002L, 2008L,
2012L, 2017L, 1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L,
2017L, 1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L, 2017L,
1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L, 2017L), diameter = c(48.2,
48.6, 49, 49.6, 50.3, 50.4, 50.6, 76.3, 76.8, 77.3, 78.7, 79.5,
80.6, 81.1, 81.6, 15.7, 16.5, 17.7, 18.3, 19.3, 20, 20, 20.2,
50.5, 51.2, 51.9, 52.5, 53.2, 54.8, 53.7, 54)), row.names = c(NA,
-31L), class = "data.frame")
df2 <- structure(list(tree_id = c("T1", "T2", "T3", "T4"), neb_death = c(2002L,
2008L, 1995L, 2012L)), row.names = c(NA, -4L), class = "data.frame")
使用基数 R,我们可以在 df2
上按行循环,识别 df1
中不需要的观察结果并删除它们。
df1[-unlist(apply(df2, 1, \(x) which(df1$tree_id == x[1] & df1$year >= x[2]))), ]
# tree_id year diameter
# 1 T1 1978 48.2
# 2 T1 1990 48.6
# 3 T1 1995 49.0
# 8 T2 1978 76.3
# 9 T2 1984 76.8
# 10 T2 1990 77.3
# 11 T2 1995 78.7
# 12 T2 2002 79.5
# 16 T3 1978 15.7
# 17 T3 1984 16.5
# 18 T3 1990 17.7
# 24 T4 1978 50.5
# 25 T4 1984 51.2
# 26 T4 1990 51.9
# 27 T4 1995 52.5
# 28 T4 2002 53.2
# 29 T4 2008 54.8
数据:
df1 <- structure(list(tree_id = c("T1", "T1", "T1", "T1", "T1", "T1",
"T1", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T3", "T3",
"T3", "T3", "T3", "T3", "T3", "T3", "T4", "T4", "T4", "T4", "T4",
"T4", "T4", "T4"), year = c(1978L, 1990L, 1995L, 2002L, 2008L,
2012L, 2017L, 1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L,
2017L, 1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L, 2017L,
1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L, 2017L), diameter = c(48.2,
48.6, 49, 49.6, 50.3, 50.4, 50.6, 76.3, 76.8, 77.3, 78.7, 79.5,
80.6, 81.1, 81.6, 15.7, 16.5, 17.7, 18.3, 19.3, 20, 20, 20.2,
50.5, 51.2, 51.9, 52.5, 53.2, 54.8, 53.7, 54)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24",
"25", "26", "27", "28", "29", "30", "31"))
df2 <- structure(list(tree_id = c("T1", "T2", "T3", "T4"), neb_death = c(2002L,
2008L, 1995L, 2012L)), class = "data.frame", row.names = c("1",
"2", "3", "4"))