How to take the most recent row of data in R?
If I have this data frame:
tibble(
period = c("2010END", "2011END",
"2010Q1","2010Q2","2010Q3","2010Q4","2010END",
"2011Q1","2011Q2","2011Q3","2011Q4","2011END",
"2011END","2012END"),
date = c('31-12-2010','31-12-2011', '30-04-2010','31-07-2010','30-09-2010','30-11-2010', '31-12-2010',
'30-04-2011','31-07-2011','30-09-2011','30-11-2011', '31-12-2011',
'31-12-2011', '31-12-2012'),
website = c(
"google",
"google",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"youtube",
"youtube"
),
values = c(1, 2, 1, 2, 3, NA, 5, NA, NA, NA, NA, 10, 20, NA)
)
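How would I go about creating a column that identifies the most recent non-NA rows for each period-and-website group?
So the final output would look like this: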
tibble(
period = c("2010END", "2011END",
"2010Q1","2010Q2","2010Q3","2010Q4","2010END",
"2011Q1","2011Q2","2011Q3","2011Q4","2011END",
"2011END","2012END"),
date = c('31-12-2010','31-12-2011', '30-04-2010','31-07-2010','30-09-2010','30-11-2010', '31-12-2010',
'30-04-2011','31-07-2011','30-09-2011','30-11-2011', '31-12-2011',
'31-12-2011', '31-12-2012'),
website = c(
"google",
"google",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"youtube",
"youtube"
),
values = c(1, 2, 1, 2, 3, NA, 5, NA, NA, NA, NA, 10, 20, NA),
most_recent = c('no','yes', 'no', 'no', 'no', 'no', 'no','yes','yes','yes','yes','yes','yes','no')
)
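With the rows sorted by most recent date, I am trying to find where the first non-NA value occurs for each period-and-website group, and then flag every row for that period and website as 'yes' in the most_recent column.
So you have the following:
- google 2011END is the most recent value by date, so it is yes
- facebook 2011Q1 through 2011END are yes, because 2011END has a non-NA value and is the most recent date
- youtube 2011END is yes, because it is the first non-NA value when sorted by date; 2012 has no value, so it is no
I know it involves a group by, but I'm not sure where to go from there.
This has been updated for clarity.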
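The code below selects the most recent non-NA row for each website.
Since this is not exactly your expected result, please clarify your question if necessary, as suggested in the comments.
library(data.table)  # assumes `data` is the question's table converted with setDT(data)
data[, most_recent := fifelse(!is.na(values) & date == .SD[!is.na(values), max(date)], 'yes', 'no'), by = website][]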
period date website values most_recent
1: 2010END 31-12-2010 google 1 no
2: 2011END 31-12-2011 google 2 yes
3: 2010Q1 30-04-2010 facebook 1 no
4: 2010Q2 31-07-2010 facebook 2 no
5: 2010Q3 30-09-2010 facebook 3 no
6: 2010Q4 30-11-2010 facebook NA no
7: 2010END 31-12-2010 facebook 5 no
8: 2011Q1 30-04-2011 facebook NA no
9: 2011Q2 31-07-2011 facebook NA no
10: 2011Q3 30-09-2011 facebook NA no
11: 2011Q4 30-11-2011 facebook NA no
12: 2011END 31-12-2011 facebook 10 yes
13: 2011END 31-12-2011 youtube 20 yes
14: 2012END 31-12-2012 youtube NA no
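The output above flags only the single latest non-NA row per website, whereas the expected column in the question also flags the NA quarters of facebook's 2011. One possible reading of the question is that the "period group" is the leading four-digit year of `period`, and that every row of the latest year containing a non-NA value should be flagged. The sketch below follows that reading in base R; the year-prefix rule, the `year` column and the helper `latest_year_with_value` are assumptions made for illustration and are not part of the original posts.

```r
# Rebuild the example data as a plain data.frame (same values as in the question).
df <- data.frame(
  period = c("2010END", "2011END",
             "2010Q1", "2010Q2", "2010Q3", "2010Q4", "2010END",
             "2011Q1", "2011Q2", "2011Q3", "2011Q4", "2011END",
             "2011END", "2012END"),
  date = c("31-12-2010", "31-12-2011",
           "30-04-2010", "31-07-2010", "30-09-2010", "30-11-2010", "31-12-2010",
           "30-04-2011", "31-07-2011", "30-09-2011", "30-11-2011", "31-12-2011",
           "31-12-2011", "31-12-2012"),
  website = c(rep("google", 2), rep("facebook", 10), rep("youtube", 2)),
  values = c(1, 2, 1, 2, 3, NA, 5, NA, NA, NA, NA, 10, 20, NA),
  stringsAsFactors = FALSE
)

# Assumption: the "period group" is the leading four-digit year of `period`.
df$year <- as.integer(substr(df$period, 1, 4))

# Hypothetical helper: latest year within one website that has a non-NA value.
# Returns NA when the website has no non-NA values at all.
latest_year_with_value <- function(year, values) {
  ok <- year[!is.na(values)]
  if (length(ok) == 0) NA_integer_ else max(ok)
}

# One latest year per website, as a vector named by website.
latest <- sapply(split(df, df$website),
                 function(g) latest_year_with_value(g$year, g$values))

# Broadcast each website's latest year back to its rows, then flag matching rows.
target <- unname(latest[df$website])
df$most_recent <- ifelse(!is.na(target) & df$year == target, "yes", "no")

print(df[, c("period", "website", "values", "most_recent")])
```

On the sample data this should reproduce the `most_recent` column given in the question. The same grouping logic could be written with `by = website` in data.table or `group_by(website)` in dplyr; base R is used here only to keep the sketch free of package dependencies.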