How to calculate a two-year moving average in R
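I have a large data frame (900k rows) about mergers and acquisitions (M&As).
The df has four columns: date (the year the M&A was completed), target_nation (the nation of the company that was merged/acquired), acquiror_nation (the nation of the acquiring company), and big_corp_TF (whether the acquiror was a big corporation; TRUE means it was).
Here is a sample of my df: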
> df <- structure(list(date = c(2000L, 2000L, 2001L, 2001L, 2001L, 2002L,
2002L, 2002L), target_nation = c("Uganda", "Uganda", "Uganda",
"Uganda", "Uganda", "Uganda", "Uganda", "Uganda"), acquiror_nation = c("France",
"Germany", "France", "France", "Germany", "France", "France",
"Germany"), big_corp_TF = c(TRUE, FALSE, TRUE, FALSE, FALSE,
TRUE, TRUE, TRUE)), row.names = c(NA, -8L))
> df
date target_nation acquiror_nation big_corp_TF
1: 2000 Uganda France TRUE
2: 2000 Uganda Germany FALSE
3: 2001 Uganda France TRUE
4: 2001 Uganda France FALSE
5: 2001 Uganda Germany FALSE
6: 2002 Uganda France TRUE
7: 2002 Uganda France TRUE
8: 2002 Uganda Germany TRUE
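From these data, I want to create a new variable giving the share of M&As carried out by big corporations of a specific acquiror nation, computed over a 2-year window. (In my actual exercise I will use a 5-year window, but let's keep things simple here.) So there would be one new variable for big corporations from France and another for big corporations from Germany.
What I have managed so far is to 1) count the total number of M&As in a given target_nation in a given year; and 2) count the number of M&As by big corporations of a given acquiror_nation in a given target_nation in a given year. I joined the two data frames to make it easier to compute the share I want. Here is the code I used and the resulting df: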
##counting total rows for target nations
df2 <- df %>%
group_by(date, target_nation) %>%
count(target_nation)
##counting total rows conducted by small or big corps for certain acquiror nations
df3 <- df %>%
group_by(date, target_nation, acquiror_nation) %>%
count(big_corp_TF)
##selecting rows that were conducted by big corps
df33 <- df3 %>%
filter(big_corp_TF == TRUE)
##merging df2 and df33
df4 <- df2 %>%
left_join(df33, by = c("date" = "date", "target_nation" = "target_nation"))
df4 <- as.data.frame(df4)
> df4
date target_nation n.x acquiror_nation big_corp_TF n.y
1 2000 Uganda 2 France TRUE 1
2 2001 Uganda 3 France TRUE 1
3 2002 Uganda 3 France TRUE 2
4 2002 Uganda 3 Germany TRUE 1
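Here n.x is the total number of M&As (rows) in a given target_nation in a given year; n.y is the number of M&As (rows) carried out by big corporations of a given acquiror_nation in that target_nation and year.
With this new data frame df4, I can now easily calculate the share of M&As in a given target_nation in a given year that were carried out by big corporations of a given acquiror_nation. For example, let's compute this share for France: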
df5 <- df4 %>%
filter(acquiror_nation == "France") %>%
mutate(France_bigcorp_share_1year = n.y / n.x)
date target_nation n.x acquiror_nation big_corp_TF n.y France_bigcorp_share_1year
1 2000 Uganda 2 France TRUE 1 0.5000000
2 2001 Uganda 3 France TRUE 1 0.3333333
3 2002 Uganda 3 France TRUE 2 0.6666667
However, I cannot figure out how to compute this share as a 2-year average.
This is what the desired variable would look like:
date target_nation n.x acquiror_nation big_corp_TF n.y France_bigcorp_share_2years
1 2000 Uganda 2 France TRUE 1 0.5000000
2 2001 Uganda 3 France TRUE 1 0.4000000
3 2002 Uganda 3 France TRUE 2 0.5000000
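Note that the share for 2000 stays the same, because there is no previous year to form a 2-year average; for 2001 it becomes 0.4 (since (1+1)/(2+3) = 0.4); for 2002 it becomes 0.5 (since (1+2)/(3+3) = 0.5).
Do you know how to write code that computes the two-year average share? I think I need a for loop here, but I don't know how to set it up. Any suggestions would be much appreciated.
--
EDIT: AnilGoyal's code works perfectly with the sample data, but my actual data is evidently messier, so I wonder whether there is a solution to the problem I ran into.
My actual dataset sometimes skips a year, or sometimes lacks an acquiror_nation that appeared in earlier rows. Please see this more accurate sample of my actual data: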
> df_new <- structure(list(date = c(2000L, 2000L, 2001L, 2001L, 2001L, 2002L,
2002L, 2002L, 2003L, 2003L, 2004L, 2004L, 2004L, 2006L, 2006L
), target_nation = c("Uganda", "Uganda", "Uganda", "Uganda",
"Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Uganda",
"Uganda", "Uganda", "Uganda", "Uganda"), acquiror_nation = c("France",
"Germany", "France", "France", "Germany", "France", "France",
"Germany", "Germany", "Germany", "France", "France", "Germany",
"France", "France"), big_corp_TF = c(TRUE, FALSE, TRUE, FALSE, FALSE,
TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)), row.names = c(NA,
-15L))
> df_new
date target_nation acquiror_nation big_corp_TF
1: 2000 Uganda France TRUE
2: 2000 Uganda Germany FALSE
3: 2001 Uganda France TRUE
4: 2001 Uganda France FALSE
5: 2001 Uganda Germany FALSE
6: 2002 Uganda France TRUE
7: 2002 Uganda France TRUE
8: 2002 Uganda Germany TRUE
9: 2003 Uganda Germany TRUE
10: 2003 Uganda Germany FALSE
11: 2004 Uganda France TRUE
12: 2004 Uganda France FALSE
13: 2004 Uganda Germany TRUE
14: 2006 Uganda France TRUE
15: 2006 Uganda France TRUE
Note: there are no rows for France in 2003, and there is no year 2005 at all.
If I run Anil's first code, the result is the following tibble:
date target_nation acquiror_nation n1 n2 share
<int> <chr> <chr> <dbl> <int> <dbl>
1 2000 Uganda France 2 1 0.5
2 2001 Uganda France 3 1 0.4
3 2002 Uganda France 3 2 0.5
4 2004 Uganda France 3 1 0.5
5 2006 Uganda France 2 2 0.6
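Note: there are no results for France in 2003 and 2005; I would like to have them (since we are computing a 2-year average, we should be able to get results for 2003 and 2005). Also, the share for 2006 is actually incorrect: it should be 1 (the average should use the values for 2005, which are 0, rather than those for 2004).
I would like to obtain the following tibble: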
date target_nation acquiror_nation n1 n2 share
<int> <chr> <chr> <dbl> <int> <dbl>
1 2000 Uganda France 2 1 0.5
2 2001 Uganda France 3 1 0.4
3 2002 Uganda France 3 2 0.5
4 2003 Uganda France 2 0 0.4
5 2004 Uganda France 3 1 0.2
6 2005 Uganda France 0 0 0.33
7 2006 Uganda France 2 2 1.0
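Note that the result for 2006 is also different (because the two-year average now uses 2005 instead of 2004).
Do you think there is a way to output the desired tibble? I understand this is a problem with the original data: it simply lacks certain data points. However, adding them to the original dataset seems very inconvenient; it would probably be better to add them midway, e.g. after n1 and n2 have been computed. But what is the most convenient way to do that?
EDIT2: Anil's new code works well with the data sample above, but runs into problems with a more complex data sample (i.e. one that includes multiple target_nation values). Here is a shorter but more complex data example: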
> df_new_complex <- structure(list(date = c(2000L, 2000L, 2001L, 2001L, 2001L, 2003L,
2003L, 1999L, 2001L, 2002L, 2002L), target_nation = c("Uganda",
"Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Mozambique",
"Mozambique", "Mozambique", "Mozambique"), acquiror_nation = c("France",
"Germany", "France", "France", "Germany", "Germany", "Germany",
"Germany", "France", "France", "Germany"), big_corp_TF = c(TRUE,
FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE
)), row.names = c(NA, -11L))
> df_new_complex
date target_nation acquiror_nation big_corp_TF
1: 2000 Uganda France TRUE
2: 2000 Uganda Germany FALSE
3: 2001 Uganda France TRUE
4: 2001 Uganda France FALSE
5: 2001 Uganda Germany FALSE
6: 2003 Uganda Germany TRUE
7: 2003 Uganda Germany FALSE
8: 1999 Mozambique Germany FALSE
9: 2001 Mozambique France TRUE
10: 2002 Mozambique France FALSE
11: 2002 Mozambique Germany TRUE
As you can see, this data sample contains two target_nation values. Anil's code, with param <- c("France", "Germany"), produces the following tibble:
date target_nation acquiror_nation n1 n2 share
<dbl> <chr> <chr> <dbl> <int> <dbl>
1 1999 Mozambique France 1 0 0
2 1999 Mozambique Germany 1 0 0
3 1999 Uganda France 0 0 0
4 1999 Uganda Germany 0 0 0
5 2000 Mozambique France 0 0 0
6 2000 Mozambique Germany 0 0 0
7 2000 Uganda France 2 1 0.25
8 2000 Uganda Germany 2 0 0.167
9 2001 Mozambique France 1 1 0.4
10 2001 Mozambique Germany 1 0 0.333
11 2001 Uganda France 3 1 0.333
12 2001 Uganda Germany 3 0 0.25
13 2002 Mozambique France 2 0 0.2
14 2002 Mozambique Germany 2 1 0.25
15 2002 Uganda France 0 0 0.25
16 2002 Uganda Germany 0 0 0.25
17 2003 Mozambique France 0 0 0.25
18 2003 Mozambique Germany 0 0 0.25
19 2003 Uganda France 2 0 0.167
20 2003 Uganda Germany 2 1 0.25
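The undesirable part here is that the code created the year 1999 for Uganda and the year 2003 for Mozambique (the latter is not much of a problem). In 1999 there were no investments in Uganda, as the data sample shows, so it makes no sense to have values for it (it could be NA, or absent altogether). Mozambique also had no investments in 2003, so I don't want shares calculated for Mozambique in that year.
I found a workaround in which I filter for a specific target nation early in the code, like this: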
correct1 <- df_new_complex %>%
filter(target_nation == "Mozambique") %>%
mutate(d = 1) %>% ...
#I do the same for another target_nation
correct2 <- df_new_complex %>%
filter(target_nation == "Uganda") %>%
mutate(d = 1) %>% ...
#I then use rbind
correct <- rbind(correct1, correct2)
#which produces the desired tibble (without a year 2003 for Mozambique and 1999 for Uganda).
> correct
date target_nation acquiror_nation n1 n2 share
<dbl> <chr> <chr> <dbl> <int> <dbl>
1 1999 Mozambique France 1 0 0
2 1999 Mozambique Germany 1 0 0
3 2000 Mozambique France 0 0 0
4 2000 Mozambique Germany 0 0 0
5 2001 Mozambique France 1 1 1
6 2001 Mozambique Germany 1 0 0
7 2002 Mozambique France 2 0 0.33
8 2002 Mozambique Germany 2 1 0.333
9 2000 Uganda France 2 1 0.5
10 2000 Uganda Germany 2 0 0.25
11 2001 Uganda France 3 1 0.286
12 2001 Uganda Germany 3 0 0.2
13 2002 Uganda France 0 0 0.167
14 2002 Uganda Germany 0 0 0.167
15 2003 Uganda France 2 0 0
16 2003 Uganda Germany 2 1 0.25
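What would be a faster way to do this? I have a list of the desired target_nation values. Perhaps one could write a loop that does the calculation for one target_nation, then for another, then rbinds them, then handles another, rbinds again, and so on. Or is there a better way?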
Using the package runner, you can do something like this:
df <- structure(list(date = c(2000L, 2000L, 2001L, 2001L, 2001L, 2002L,
2002L, 2002L), target_nation = c("Uganda", "Uganda", "Uganda",
"Uganda", "Uganda", "Uganda", "Uganda", "Uganda"), acquiror_nation = c("France",
"Germany", "France", "France", "Germany", "France", "France",
"Germany"), big_corp_TF = c(TRUE, FALSE, TRUE, FALSE, FALSE,
TRUE, TRUE, TRUE)), row.names = c(NA, -8L))
library(runner)
library(tidyverse)
df <- df %>% as.data.frame()
param <- 'France'
df %>%
group_by(date, target_nation) %>%
mutate(n1 = n()) %>%
group_by(date, target_nation, acquiror_nation) %>%
summarise(n1 = mean(n1),
n2 = sum(big_corp_TF), .groups = 'drop') %>%
filter(acquiror_nation == param) %>%
mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))
#> # A tibble: 3 x 6
#> date target_nation acquiror_nation n1 n2 share
#> <int> <chr> <chr> <dbl> <int> <dbl>
#> 1 2000 Uganda France 2 1 0.5
#> 2 2001 Uganda France 3 1 0.4
#> 3 2002 Uganda France 3 2 0.5
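For intuition, the following base R sketch reproduces by hand what the two k = 2 running sums amount to on the three France rows above. The helper name roll2 is hypothetical and written only for this illustration; it is not part of runner. It assumes the rows are already sorted by year with no gaps.

```r
# Yearly counts for France in Uganda, copied from the tibble above.
n1 <- c(2, 3, 3)  # all deals in the target nation per year
n2 <- c(1, 1, 2)  # deals by big French acquirers per year

# roll2() is a hypothetical helper: each value plus the one before it.
# The first value has no predecessor, so it is kept unchanged.
roll2 <- function(x) x + c(0, head(x, -1))

# Pooled two-year share: sum of big-corp deals over sum of all deals.
share <- roll2(n2) / roll2(n1)
print(share)
```

Note that this is a pooled ratio (sum over sum), not the mean of the two yearly shares, which matches the (1+1)/(2+3) arithmetic in the question.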
You can even do this for all countries at once:
df %>%
group_by(date, target_nation) %>%
mutate(n1 = n()) %>%
group_by(date, target_nation, acquiror_nation) %>%
summarise(n1 = mean(n1),
n2 = sum(big_corp_TF), .groups = 'drop') %>%
group_by(acquiror_nation) %>%
mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))
#> # A tibble: 6 x 6
#> # Groups: acquiror_nation [2]
#> date target_nation acquiror_nation n1 n2 share
#> <int> <chr> <chr> <dbl> <int> <dbl>
#> 1 2000 Uganda France 2 1 0.5
#> 2 2000 Uganda Germany 2 0 0
#> 3 2001 Uganda France 3 1 0.4
#> 4 2001 Uganda Germany 3 0 0
#> 5 2002 Uganda France 3 2 0.5
#> 6 2002 Uganda Germany 3 1 0.167
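Given the modified scenario, you need to do two things:
- Include the argument idx = date in both sum_run calls. This corrects the output as required, but will not include shares for missing rows/years.
- To include the missing years as well, you need tidyr::complete, as shown below: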
param <- 'France'
df_new %>%
mutate(d = 1) %>%
complete(date = seq(min(date), max(date), 1), nesting(target_nation, acquiror_nation),
fill = list(d =0, big_corp_TF = FALSE)) %>%
group_by(date, target_nation) %>%
mutate(n1 = sum(d)) %>%
group_by(date, target_nation, acquiror_nation) %>%
summarise(n1 = mean(n1),
n2 = sum(big_corp_TF), .groups = 'drop') %>%
filter(acquiror_nation == param) %>%
mutate(share = sum_run(n2, k=2, idx = date)/sum_run(n1, k=2, idx = date))
# A tibble: 7 x 6
date target_nation acquiror_nation n1 n2 share
<dbl> <chr> <chr> <dbl> <int> <dbl>
1 2000 Uganda France 2 1 0.5
2 2001 Uganda France 3 1 0.4
3 2002 Uganda France 3 2 0.5
4 2003 Uganda France 2 0 0.4
5 2004 Uganda France 3 1 0.2
6 2005 Uganda France 0 0 0.333
7 2006 Uganda France 2 2 1
As above, you can do this for all countries at once (replace the filter with group_by):
df_new %>%
mutate(d = 1) %>%
complete(date = seq(min(date), max(date), 1), nesting(target_nation, acquiror_nation),
fill = list(d =0, big_corp_TF = FALSE)) %>%
group_by(date, target_nation) %>%
mutate(n1 = sum(d)) %>%
group_by(date, target_nation, acquiror_nation) %>%
summarise(n1 = mean(n1),
n2 = sum(big_corp_TF), .groups = 'drop') %>%
group_by(acquiror_nation) %>%
mutate(share = sum_run(n2, k=2, idx = date)/sum_run(n1, k=2, idx = date))
# A tibble: 14 x 6
# Groups: acquiror_nation [2]
date target_nation acquiror_nation n1 n2 share
<dbl> <chr> <chr> <dbl> <int> <dbl>
1 2000 Uganda France 2 1 0.5
2 2000 Uganda Germany 2 0 0
3 2001 Uganda France 3 1 0.4
4 2001 Uganda Germany 3 0 0
5 2002 Uganda France 3 2 0.5
6 2002 Uganda Germany 3 1 0.167
7 2003 Uganda France 2 0 0.4
8 2003 Uganda Germany 2 1 0.4
9 2004 Uganda France 3 1 0.2
10 2004 Uganda Germany 3 1 0.4
11 2005 Uganda France 0 0 0.333
12 2005 Uganda Germany 0 0 0.333
13 2006 Uganda France 2 2 1
14 2006 Uganda Germany 2 0 0
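To see why both steps matter, here is a small base R imitation of a row-based window versus a calendar-based one, using the France rows of df_new before any missing years are filled in. The helper window_sum is hypothetical, and the sketch assumes that the two-year window for year y covers the years y-1 and y; it does not call runner itself.

```r
# France rows from df_new after summarising, with no 2003 or 2005 row.
year <- c(2000, 2001, 2002, 2004, 2006)
n1   <- c(2, 3, 3, 3, 2)  # all deals in Uganda in that year
n2   <- c(1, 1, 2, 1, 2)  # deals by big French acquirers in that year

# Row-based window: current row plus the previous row, whatever its year.
prev <- function(x) c(0, head(x, -1))
share_rows <- (n2 + prev(n2)) / (n1 + prev(n1))

# window_sum() is a hypothetical helper: for each year y, sum the values
# whose year lies in the last k calendar years, i.e. in (y - k, y].
window_sum <- function(x, yr, k = 2) {
  vapply(yr, function(y) sum(x[yr > y - k & yr <= y]), numeric(1))
}
share_years <- window_sum(n2, year) / window_sum(n1, year)

print(data.frame(year, share_rows, share_years))
```

In this sketch the row-based shares for 2004 and 2006 pair each year with a non-adjacent one, while the calendar-based version gives 1 for 2006. The 2004 value still comes out as 1/3 rather than the desired 0.2, because France has no 2003 row to carry that year's total of 2 deals; filling in the missing rows first, as the complete step does, is what supplies it.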
Further edit
- This is easy: remove target_nation from nesting and add a group_by before complete.
Simple, isn't it?
df_new_complex %>%
mutate(d = 1) %>%
group_by(target_nation) %>%
complete(date = seq(min(date), max(date), 1), nesting(acquiror_nation),
fill = list(d =0, big_corp_TF = FALSE)) %>%
group_by(date, target_nation) %>%
mutate(n1 = sum(d)) %>%
group_by(date, target_nation, acquiror_nation) %>%
summarise(n1 = mean(n1),
n2 = sum(big_corp_TF), .groups = 'drop') %>%
group_by(acquiror_nation) %>%
mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))
# A tibble: 16 x 6
# Groups: acquiror_nation [2]
date target_nation acquiror_nation n1 n2 share
<dbl> <chr> <chr> <dbl> <int> <dbl>
1 1999 Mozambique France 1 0 0
2 1999 Mozambique Germany 1 0 0
3 2000 Mozambique France 0 0 0
4 2000 Mozambique Germany 0 0 0
5 2000 Uganda France 2 1 0.5
6 2000 Uganda Germany 2 0 0
7 2001 Mozambique France 1 1 0.667
8 2001 Mozambique Germany 1 0 0
9 2001 Uganda France 3 1 0.5
10 2001 Uganda Germany 3 0 0
11 2002 Mozambique France 2 0 0.2
12 2002 Mozambique Germany 2 1 0.2
13 2002 Uganda France 0 0 0
14 2002 Uganda Germany 0 0 0.5
15 2003 Uganda France 2 0 0
16 2003 Uganda Germany 2 1 0.5
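One thing to watch in the output above: the windows run over consecutive rows within each acquiror_nation, so a row for one target nation can pick up counts from the other. For example, the 0.667 for Mozambique/France in 2001 equals (1+1)/(1+2), which pools the preceding Uganda/France row for 2000. A possible variant, sketched here under the assumption that shares should be computed separately for each target_nation, groups by both nations and passes idx = date again. The data frame is rebuilt with data.frame() so the block runs on its own.

```r
library(dplyr)
library(tidyr)
library(runner)

# Same rows as df_new_complex in the question, built as a plain data frame.
df_new_complex <- data.frame(
  date = c(2000L, 2000L, 2001L, 2001L, 2001L, 2003L, 2003L,
           1999L, 2001L, 2002L, 2002L),
  target_nation = c(rep("Uganda", 7), rep("Mozambique", 4)),
  acquiror_nation = c("France", "Germany", "France", "France", "Germany",
                      "Germany", "Germany", "Germany", "France", "France",
                      "Germany"),
  big_corp_TF = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE,
                  FALSE, TRUE, FALSE, TRUE)
)

res <- df_new_complex %>%
  mutate(d = 1) %>%
  # Fill in missing years separately for each target nation.
  group_by(target_nation) %>%
  complete(date = seq(min(date), max(date), 1), nesting(acquiror_nation),
           fill = list(d = 0, big_corp_TF = FALSE)) %>%
  group_by(date, target_nation) %>%
  mutate(n1 = sum(d)) %>%
  group_by(date, target_nation, acquiror_nation) %>%
  summarise(n1 = mean(n1), n2 = sum(big_corp_TF), .groups = "drop") %>%
  # Assumption: each window stays inside one target/acquiror pair.
  arrange(target_nation, acquiror_nation, date) %>%
  group_by(target_nation, acquiror_nation) %>%
  mutate(share = sum_run(n2, k = 2, idx = date) /
                 sum_run(n1, k = 2, idx = date)) %>%
  ungroup()

print(res)
```

Worked by hand from the yearly counts, the Uganda/France shares under this grouping should be 0.5, 0.4, 0.333 and 0 for 2000 to 2003. If a window contains no deals at all, the division gives NaN, which may need separate handling on the real data.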
I noticed that you have deleted the original question.
In my solution, bigcorp_share_2years can be computed directly even without rows for 2003 and 2005.
library(data.table)
df_new <- structure(list(date = c(2000L, 2000L, 2001L, 2001L, 2001L, 2002L,
2002L, 2002L, 2003L, 2003L, 2004L, 2004L, 2004L, 2006L, 2006L
), target_nation = c("Uganda", "Uganda", "Uganda", "Uganda",
"Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Uganda",
"Uganda", "Uganda", "Uganda", "Uganda"), acquiror_nation = c("France",
"Germany", "France", "France", "Germany", "France", "France",
"Germany", "Germany", "Germany", "France", "France", "Germany",
"France", "France"), big_corp_TF = c(TRUE, FALSE, TRUE, FALSE, FALSE,
TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)), row.names = c(NA,
-15L))
setDT(df_new)
# NY is the total observation number for two consecutive years.
this = 0
df_new[, NR := .N,by = date] # NR is each group's length
df_new[, NY := { last = this; this = last(NR); last + this }, by = date]
# special handling for a year whose preceding year is absent from the data, e.g. 2006.
df_new[, NY := ifelse( (date - 1) %in% date, NY, NR)]
# snx: count big_corp_TF for acquiror_nation, which will be used to calculate NX
df_new[, snx := sum(big_corp_TF), by = .(date,acquiror_nation)]
# df2: drop column big_corp_TF (column 4), then keep only unique rows
df2 <- df_new[,c(1:3,5:7)][,unique(.SD)]
# roll count for two consecutive years
df2[, NX := frollsum(snx,2),by=.(acquiror_nation)]
df2[, NX := ifelse( (date - 1) %in% date, NX, snx),acquiror_nation][]
#> date target_nation acquiror_nation NR NY snx NX
#> 1: 2000 Uganda France 2 2 1 1
#> 2: 2000 Uganda Germany 2 2 0 0
#> 3: 2001 Uganda France 3 5 1 2
#> 4: 2001 Uganda Germany 3 5 0 0
#> 5: 2002 Uganda France 3 6 2 3
#> 6: 2002 Uganda Germany 3 6 1 1
#> 7: 2003 Uganda Germany 2 5 1 2
#> 8: 2004 Uganda France 3 5 1 1
#> 9: 2004 Uganda Germany 3 5 1 2
#> 10: 2006 Uganda France 2 2 2 2
df2[, bigcorp_share_2years := NX/NY]
df2[, .(date,target_nation,NY,NX,bigcorp_share_2years),by=.(acquiror_nation)]
#> acquiror_nation date target_nation NY NX bigcorp_share_2years
#> 1: France 2000 Uganda 2 1 0.5000000
#> 2: France 2001 Uganda 5 2 0.4000000
#> 3: France 2002 Uganda 6 3 0.5000000
#> 4: France 2004 Uganda 5 1 0.2000000
#> 5: France 2006 Uganda 2 2 1.0000000
#> 6: Germany 2000 Uganda 2 0 0.0000000
#> 7: Germany 2001 Uganda 5 0 0.0000000
#> 8: Germany 2002 Uganda 6 1 0.1666667
#> 9: Germany 2003 Uganda 5 2 0.4000000
#> 10: Germany 2004 Uganda 5 2 0.4000000
Created on 2021-05-03 by the reprex package (v2.0.0)