How to include missing data points in R
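This question is a spin-off of my previous post.
I have a large data frame (900k rows) about mergers and acquisitions (M&As).
The df has four columns: date (the year the M&A was completed), target_nation (the country of the company that was merged/acquired), acquiror_nation (the country of the acquiring company), and big_corp_TF (whether the acquiror was a big corporation, where TRUE means it was). Here is a sample of my data: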
> df <- structure(list(date = c(2000L, 2000L, 2001L, 2001L, 2001L, 2002L,
2002L, 2002L, 2003L, 2003L, 2004L, 2004L, 2004L, 2006L, 2006L
), target_nation = c("Uganda", "Uganda", "Uganda", "Uganda",
"Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Uganda",
"Uganda", "Uganda", "Uganda", "Uganda"), acquiror_nation = c("France",
"Germany", "France", "France", "Germany", "France", "France",
"Germany", "Germany", "Germany", "France", "France", "Germany",
"France", "France"), big_corp_TF = c(TRUE, FALSE, TRUE, FALSE, FALSE,
TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)), row.names = c(NA,
-15L))
> df
    date target_nation acquiror_nation big_corp_TF
 1: 2000        Uganda          France        TRUE
 2: 2000        Uganda         Germany       FALSE
 3: 2001        Uganda          France        TRUE
 4: 2001        Uganda          France       FALSE
 5: 2001        Uganda         Germany       FALSE
 6: 2002        Uganda          France        TRUE
 7: 2002        Uganda          France        TRUE
 8: 2002        Uganda         Germany        TRUE
 9: 2003        Uganda         Germany        TRUE
10: 2003        Uganda         Germany       FALSE
11: 2004        Uganda          France        TRUE
12: 2004        Uganda          France       FALSE
13: 2004        Uganda         Germany        TRUE
14: 2006        Uganda          France        TRUE
15: 2006        Uganda          France        TRUE
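Note: France has no rows in 2003, and there are no rows at all for 2005.
From these data, I want to create a new variable giving the share of M&As completed by big corporations from a specific acquiror nation, computed as a 2-year average. (In my actual exercise I will compute 5-year averages, but let's keep things simple here.) So there would be one new variable for big French corporations and another for big German corporations.
I was advised to use the following code: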
library(runner)
library(tidyverse)
df <- df %>% as.data.frame()
param <- 'France'
df %>%
  group_by(date, target_nation) %>%
  mutate(n1 = n()) %>%
  group_by(date, target_nation, acquiror_nation) %>%
  summarise(n1 = mean(n1),
            n2 = sum(big_corp_TF), .groups = 'drop') %>%
  filter(acquiror_nation == param) %>%
  mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))
This outputs the following tibble:
   date target_nation acquiror_nation    n1    n2 share
  <int> <chr>         <chr>           <dbl> <int> <dbl>
1  2000 Uganda        France              2     1   0.5
2  2001 Uganda        France              3     1   0.4
3  2002 Uganda        France              3     2   0.5
4  2004 Uganda        France              3     1   0.5
5  2006 Uganda        France              2     2   0.6
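Note: there are no results for France for 2003 and 2005. I would like to have results for those years (since we are computing 2-year averages, we should be able to get them). Also, the share for 2006 is actually incorrect: it should be 1, because the average should use the 2005 values (0) rather than the 2004 values.
I would like to get the following tibble: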
   date target_nation acquiror_nation    n1    n2 share
  <int> <chr>         <chr>           <dbl> <int> <dbl>
1  2000 Uganda        France              2     1  0.5
2  2001 Uganda        France              3     1  0.4
3  2002 Uganda        France              3     2  0.5
4  2003 Uganda        France              2     0  0.4
5  2004 Uganda        France              3     1  0.2
6  2005 Uganda        France              0     0  0.33
7  2006 Uganda        France              2     2  1.0
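Note: the result for 2006 is also different (since the two-year average now uses 2005 rather than 2004).
I know this is a problem with the raw data: some data points are simply missing. However, adding them to the original dataset seems very inconvenient; it would be better to add them midway, for example after n1 and n2 have been computed. But what is the most convenient way to do this?
Any advice is greatly appreciated.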
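To make the problem concrete, here is a minimal base R sketch of why the 2006 share comes out as 0.6. It assumes that a window of k = 2 without a date index simply covers the current row and the previous row, which is consistent with the output shown in the question. The helpers roll_rows and roll_years are hypothetical names written for this illustration; they are not part of runner.

```r
# France rows from the summarised table in the question
years <- c(2000, 2001, 2002, 2004, 2006)
n1    <- c(2, 3, 3, 3, 2)   # all deals completed in that year
n2    <- c(1, 1, 2, 1, 2)   # deals by big French acquirors

# Hypothetical helper: sum over the current and the previous (k - 1) ROWS
roll_rows <- function(x, k) {
  sapply(seq_along(x), function(i) sum(x[max(1, i - k + 1):i]))
}

# Hypothetical helper: sum over rows whose YEAR lies in (year - k, year]
roll_years <- function(x, year, k) {
  sapply(seq_along(x), function(i) sum(x[year > year[i] - k & year <= year[i]]))
}

share_rows  <- roll_rows(n2, 2) / roll_rows(n1, 2)
share_years <- roll_years(n2, years, 2) / roll_years(n1, years, 2)

print(data.frame(years, share_rows, share_years))
```

The row-based column reproduces 0.6 for 2006, because 2004 is treated as the "previous year". The calendar-based column gives 1 for 2006, but it still has no rows for 2003 and 2005 and it misses the two deals of 2003 in the 2004 denominator, so the gaps have to be filled before the France rows are filtered out.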
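# Summarise as in the question, but stop before computing share
df2 = df %>%
  group_by(date, target_nation) %>%
  mutate(n1 = n()) %>%
  group_by(date, target_nation, acquiror_nation) %>%
  summarise(n1 = mean(n1),
            n2 = sum(big_corp_TF), .groups = 'drop') %>%
  filter(acquiror_nation == param)
# Years between the first and last date that have no row in df2
dates = seq(min(df2$date), max(df2$date), by = 1)
dates = setdiff(dates, df2$date)
# One placeholder row per missing year (copies of the last row, zeroed out)
df3 = df2[rep(nrow(df2), each = length(dates)), ]
df3$n1 = 0; df3$n2 = 0; df3$date = dates
# Append the placeholder rows, restore chronological order, compute share
df2 = arrange(rbind(df2,df3), date)
df2 = df2 %>% mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))
df2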
# A tibble: 7 x 6
   date target_nation acquiror_nation    n1    n2 share
  <dbl> <fct>         <fct>           <dbl> <dbl> <dbl>
1  2000 Uganda        France              2     1 0.5
2  2001 Uganda        France              3     1 0.4
3  2002 Uganda        France              3     2 0.5
4  2003 Uganda        France              0     0 0.667
5  2004 Uganda        France              3     1 0.333
6  2005 Uganda        France              0     0 0.333
7  2006 Uganda        France              2     2 1
Explanation
First, create df2 from your df, but without computing share. Then create a sequence of dates from the minimum to the maximum:
dates = seq(min(df2$date), max(df2$date), by = 1)
Keep only those that are missing from df2:
dates = setdiff(dates, df2$date)
Create one row for each missing date and set n1 and n2 to 0:
df3 = df2[rep(nrow(df2), each = length(dates)), ]
df3$n1 = 0; df3$n2 = 0; df3$date = dates
Combine the rows and sort by date:
df2 = arrange(rbind(df2,df3), date)
Finally, compute share:
df2 = df2 %>% mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))
Sorry that this does not follow tidyverse syntax.
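One thing to watch in the output above: n1 for 2003 is set to 0, although two (German) deals were completed that year, so the 2003 and 2004 shares (0.667 and 0.333) differ from the 0.4 and 0.2 the question asks for. A possible variation, sketched here in base R, builds one row per calendar year first and counts n1 and n2 from the full data. It assumes a single target_nation, as in the example data, and roll2 is a hypothetical helper for a 2-year window.

```r
# Example data from the question
df <- data.frame(
  date = c(2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002,
           2003, 2003, 2004, 2004, 2004, 2006, 2006),
  target_nation = "Uganda",
  acquiror_nation = c("France", "Germany", "France", "France", "Germany",
                      "France", "France", "Germany", "Germany", "Germany",
                      "France", "France", "Germany", "France", "France"),
  big_corp_TF = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE,
                  TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)
)
param <- "France"

# One row per calendar year, including years without any deal
out <- data.frame(date = seq(min(df$date), max(df$date), by = 1))

# n1: all deals in the year; n2: deals by big acquirors from `param`
out$n1 <- sapply(out$date, function(y) sum(df$date == y))
out$n2 <- sapply(out$date, function(y)
  sum(df$big_corp_TF[df$date == y & df$acquiror_nation == param]))

# Hypothetical helper: current value plus the previous one
# (valid as a 2-year window only because the rows are consecutive years)
roll2 <- function(x) x + c(0, head(x, -1))
out$share <- roll2(out$n2) / roll2(out$n1)
print(out)
```

With several target nations, the same counting would have to be done per target nation, which is where a grouped tidyverse pipeline becomes more convenient.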
Use tidyr::complete with its nesting and fill arguments. The complete code could look like this:
param <- 'France'
df %>%
  mutate(d = 1) %>%
  complete(date = seq(min(date), max(date), 1), nesting(target_nation, acquiror_nation),
           fill = list(d = 0, big_corp_TF = FALSE)) %>%
  group_by(date, target_nation) %>%
  mutate(n1 = sum(d)) %>%
  group_by(date, target_nation, acquiror_nation) %>%
  summarise(n1 = mean(n1),
            n2 = sum(big_corp_TF), .groups = 'drop') %>%
  filter(acquiror_nation == param) %>%
  mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))
# A tibble: 7 x 6
   date target_nation acquiror_nation    n1    n2 share
  <dbl> <chr>         <chr>           <dbl> <int> <dbl>
1  2000 Uganda        France              2     1 0.5
2  2001 Uganda        France              3     1 0.4
3  2002 Uganda        France              3     2 0.5
4  2003 Uganda        France              2     0 0.4
5  2004 Uganda        France              3     1 0.2
6  2005 Uganda        France              0     0 0.333
7  2006 Uganda        France              2     2 1
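Since the question mentions wanting one variable per acquiror nation and, eventually, 5-year windows, the same pipeline can be sketched with a grouped mutate instead of the filter on param. This is an untested-at-scale variation, not part of the original answer; the names window and shares are assumptions introduced here, and it relies on the completed data having exactly one row per year within each group.

```r
library(dplyr)
library(tidyr)
library(runner)

# Example data from the question
df <- data.frame(
  date = c(2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002,
           2003, 2003, 2004, 2004, 2004, 2006, 2006),
  target_nation = "Uganda",
  acquiror_nation = c("France", "Germany", "France", "France", "Germany",
                      "France", "France", "Germany", "Germany", "Germany",
                      "France", "France", "Germany", "France", "France"),
  big_corp_TF = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE,
                  TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)
)

window <- 2  # window length in years; the real exercise would use 5

shares <- df %>%
  mutate(d = 1) %>%
  # add a zero-deal row for every missing (year, target, acquiror) combination
  complete(date = seq(min(date), max(date), 1),
           nesting(target_nation, acquiror_nation),
           fill = list(d = 0, big_corp_TF = FALSE)) %>%
  group_by(date, target_nation) %>%
  mutate(n1 = sum(d)) %>%
  group_by(date, target_nation, acquiror_nation) %>%
  summarise(n1 = mean(n1), n2 = sum(big_corp_TF), .groups = "drop") %>%
  # rolling sums are computed separately for each acquiror nation
  group_by(target_nation, acquiror_nation) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(share = sum_run(n2, k = window) / sum_run(n1, k = window)) %>%
  ungroup()

print(shares)
```

The France rows should match the table above, and the Germany rows are produced in the same pass. On the full 900k-row data it may be worth checking the size of the completed table first, since complete adds a row for every missing combination.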