Aggregating firm-specific data at the industry level based on SIC codes
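I have about 250,000 rows of firm-specific annual data (2000-2019), along with each firm's industry SIC code. The goal is to sum the values in each variable column for each individual SIC code, by year. The first few rows of the data look like this: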
>head(compustat)
gvkey datadate fyear indfmt consol popsrc datafmt curcd at capx ceq emp ni revt xrd costat sic
1 1004 20000531 1999 INDL C D STD USD 740.998 22.344 339.515 2.9 35.163 1024.333 NA A 5080
2 1004 20010531 2000 INDL C D STD USD 701.854 13.134 340.212 2.5 18.531 874.255 NA A 5080
3 1004 20020531 2001 INDL C D STD USD 710.199 12.112 310.235 2.2 -58.939 638.721 NA A 5080
4 1004 20030531 2002 INDL C D STD USD 686.621 9.930 294.988 2.1 -12.410 606.337 NA A 5080
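For the columns at, capx, ceq, emp, ni, revt and xrd, I want the total across all firms that share the same SIC code, for each year. My output would therefore be the total value of every variable within the same SIC industry, for each year from 2000 to 2019.
Can someone help me achieve this?
Thanks,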
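Try this tidyverse solution. The strategy is to select the variables you need, set group_by(), and then use summarise_all() to compute the sums. The data you shared is small, but the same code should work on your larger dataset. Here is the code: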
library(tidyverse)
#Code
df %>%
#Filter years
filter(fyear>=2000 & fyear<=2019) %>%
#Select variables
select(sic,fyear,at,capx,ceq,emp,ni,revt,xrd) %>%
#Group by sic and year
group_by(sic,fyear) %>%
#Compute total
summarise_all(sum, na.rm = TRUE)
Output:
# A tibble: 3 x 9
# Groups: sic [1]
sic fyear at capx ceq emp ni revt xrd
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 5080 2000 702. 13.1 340. 2.5 18.5 874. 0
2 5080 2001 710. 12.1 310. 2.2 -58.9 639. 0
3 5080 2002 687. 9.93 295. 2.1 -12.4 606. 0
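A side note on the xrd column above: it shows 0 rather than NA because sum() with na.rm = TRUE returns 0 when every value in a group is missing. If you would rather keep NA for industry-years with no reported R&D at all, a small helper along these lines might do. The name sum_or_na is made up for this sketch and is not part of dplyr or base R.

```r
# Hypothetical helper: sum the observed values, but return NA
# when a group has no observed values at all.
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

all_missing  <- sum(c(NA, NA), na.rm = TRUE)  # plain sum() gives 0 here
kept_missing <- sum_or_na(c(NA, NA))          # helper keeps NA
partial      <- sum_or_na(c(1.5, NA, 2))      # observed values are still summed
```

It could be passed to summarise_all() in place of sum, without the na.rm argument.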
Data used:
#Data
df <- structure(list(gvkey = c(1004L, 1004L, 1004L, 1004L), datadate = c(20000531L,
20010531L, 20020531L, 20030531L), fyear = 1999:2002, indfmt = c("INDL",
"INDL", "INDL", "INDL"), consol = c("C", "C", "C", "C"), popsrc = c("D",
"D", "D", "D"), datafmt = c("STD", "STD", "STD", "STD"), curcd = c("USD",
"USD", "USD", "USD"), at = c(740.998, 701.854, 710.199, 686.621
), capx = c(22.344, 13.134, 12.112, 9.93), ceq = c(339.515, 340.212,
310.235, 294.988), emp = c(2.9, 2.5, 2.2, 2.1), ni = c(35.163,
18.531, -58.939, -12.41), revt = c(1024.333, 874.255, 638.721,
606.337), xrd = c(NA, NA, NA, NA), costat = c("A", "A", "A",
"A"), sic = c(5080L, 5080L, 5080L, 5080L)), class = "data.frame", row.names = c("1",
"2", "3", "4"))
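For what it's worth, summarise_all() is marked as superseded in dplyr 1.0.0 and later, where across() is the suggested replacement. Below is a minimal sketch of the same select, group and sum strategy written that way. The toy data frame and the name industry_totals are made up for illustration, with only two value columns to keep it short.

```r
library(dplyr)

# Toy data (made up for illustration): two firms in SIC 5080, one in SIC 2834.
toy <- data.frame(
  gvkey = c(1004, 1004, 2001, 3005),
  fyear = c(2000, 2001, 2000, 2000),
  sic   = c(5080, 5080, 5080, 2834),
  at    = c(700, 710, 300, 50),
  xrd   = c(NA, NA, 10, 5)
)

industry_totals <- toy %>%
  # Keep fiscal years 2000 to 2019
  filter(between(fyear, 2000, 2019)) %>%
  # One group per industry and year
  group_by(sic, fyear) %>%
  # Sum each listed column, ignoring missing values
  summarise(across(c(at, xrd), ~ sum(.x, na.rm = TRUE)), .groups = "drop")

print(industry_totals)
```

With the real data you would list all seven columns inside across(), for example c(at, capx, ceq, emp, ni, revt, xrd).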
You can do this with the dplyr library.
Suppose you have a data frame dw like this:
dw <- read.table(header=T, text='
gvkey datadate fyear indfmt consol popsrc datafmt curcd at capx ceq emp ni revt xrd costat sic
1004 20000531 1999 INDL C D STD USD 740.998 22.344 339.515 2.9 35.163 1024.333 NA A 5080
1004 20010531 2000 INDL C D STD USD 701.854 13.134 340.212 2.5 18.531 874.255 NA A 5080
1004 20020531 2001 INDL C D STD USD 710.199 12.112 310.235 2.2 -58.939 638.721 NA A 5080
1004 20010531 2000 INDL C D STD USD 701.854 13.134 340.212 2.5 18.531 874.255 NA A 5080
1004 20020531 2008 INDL C D STD USD 710.199 12.112 310.235 2.2 -58.939 638.721 NA A 5080
1004 20030531 2002 INDL C D STD USD 686.621 9.930 294.988 2.1 -12.410 606.337 NA A 5080
1004 20030531 2002 INDL C D STD USD 686.621 9.930 294.988 2.1 -12.410 606.337 NA A 5080
')
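The code below groups by sic and fyear, then keeps the rows where fyear is 2000 or later. Note that it sums capx, ceq, emp, ni, revt and xrd but not at, and that without na.rm = TRUE any group containing an NA sums to NA: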
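library(dplyr)
# Sum each variable within every (sic, fyear) group
df <- dw %>%
  group_by(sic, fyear) %>%
  summarise(capx = sum(capx), ceq = sum(ceq), emp = sum(emp),
            ni = sum(ni), revt = sum(revt), xrd = sum(xrd)) %>%
  as.data.frame()
df <- df[df$fyear >= 2000, ]  # keep fiscal years from 2000 onward
print(df)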
The final output is:
sic fyear capx ceq emp ni revt xrd
5080 2000 26.268 680.424 5.0 37.062 1748.510 NA
5080 2001 12.112 310.235 2.2 -58.939 638.721 NA
5080 2002 19.860 589.976 4.2 -24.820 1212.674 NA
5080 2008 12.112 310.235 2.2 -58.939 638.721 NA
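If you would rather not depend on dplyr, the same grouping can probably be done with base R's aggregate(). This is only a sketch on made-up toy data. The names toy, vars, recent and totals are illustrative, and na.rm = TRUE is passed through to sum() so that missing values are skipped instead of turning the whole total into NA.

```r
# Toy data (made up for illustration), one industry over three fiscal years.
toy <- data.frame(
  fyear = c(1999, 2000, 2000, 2001),
  sic   = c(5080, 5080, 5080, 5080),
  capx  = c(22, 13, 13, 12),
  xrd   = c(NA, NA, 4, NA)
)

vars <- c("capx", "xrd")  # columns to total

# Keep fiscal years 2000 to 2019 before aggregating
recent <- toy[toy$fyear >= 2000 & toy$fyear <= 2019, ]

# Sum each column within every (sic, fyear) group; na.rm is forwarded to sum()
totals <- aggregate(recent[vars],
                    by = list(sic = recent$sic, fyear = recent$fyear),
                    FUN = sum, na.rm = TRUE)

print(totals)
```

The data-frame method is used here instead of the formula interface because the formula method drops rows containing NA by default, which would discard most rows when a column such as xrd is largely missing.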