How do I remove data that is not relevant for my research?
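I am new to R. For an exam project I have chosen to focus on only part of my dataset, which covers US companies. I am only interested in companies in the "Finance and Insurance" and "Real Estate and Rental and Leasing" sectors. The sector is given by the North American Industry Classification System (NAICS) code: it is the first two digits of the six-digit code.
As I said, I am new to R, but I have been trying to figure this out for a long time. To me, the most sensible approach is to create a new column with a binary variable indicating whether a company belongs to one of these two sectors, and then exclude data on that basis. However, I have not managed to create this new column.
I would appreciate any help on how to do this, either by creating the binary variable or by excluding the irrelevant data directly.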
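#### Packages ####
library(dplyr)    # inner_join(), group_by(), summarise(), %>%
library(ggplot2)  # ggplot(), geom_col(), labs()
#### Data ####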
lobby_clean
compusat
politicians
#### Convert "gvkey" (first column) from character to numeric so the join keys match ####
lobby_clean[[1]] <- as.numeric(lobby_clean[[1]])
#### Merge the different datasets into one ####
lobby_compusat <- inner_join(lobby_clean, compusat, by = "gvkey")
lobby_compusat_politician <- inner_join(lobby_compusat, politicians, by = "gvkey")
#### Group by year ####
mean_expend_by_year <- lobby_compusat_politician %>%
  group_by(year.x) %>%
  summarise(mean_expend = mean(expend))
#### Plot the development of average lobbying expenditure over the years, across all companies ####
lobbying_development <- ggplot(data = mean_expend_by_year,
                               mapping = aes(x = year.x, y = mean_expend)) +
  geom_col() +
  labs(title = "Development in lobbying expenditure over time",
       x = "Year", y = "Average lobbying expenditures")
print(lobbying_development)
#### Exclude data that does not belong to the relevant sectors ####
#### Relevant sectors are:
#### "Finance and Insurance", code starts with: 52
#### "Real Estate and Rental and Leasing", code starts with: 53
## Create a new column, based on the first two digits of "naics", that defines the sector to which the company belongs ##
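You are using a mix of tidyverse and base R code, but I'll give some hints using the tidyverse. In general, it helps if you give us more to work with: even a small sample of your data would be useful.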
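One common way to share a small sample is `dput()`, which prints R code that rebuilds the object, so others can paste it straight into their session. The sketch below uses a made-up data frame; `firms_sample` and its values are purely illustrative and are not the asker's data.

```r
# Hypothetical stand-in for the real data: names and values are invented
firms_sample <- data.frame(
  gvkey = c(1001, 1002, 1003),
  naics = c("522110", "531120", "334111"),
  stringsAsFactors = FALSE
)

# dput() writes out R code that recreates the object;
# head() keeps the sample small enough to paste into a question
sample_rows <- head(firms_sample, 3)
sample_text <- capture.output(dput(sample_rows))
cat(sample_text, sep = "\n")
```

With the real data this would be something like `dput(head(lobby_compusat_politician, 10))`, pasted into the question.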
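To extract the first two digits of the NAICS code, you can add a mutate statement such as
library(tidyverse)
# str_sub() returns a character string, so sector will be "52", "53", ...
df <- df %>% mutate(sector = str_sub(naics, start = 1, end = 2))
Then you can filter to keep only the two sectors you are interested in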
df <- df %>% filter(sector %in% c("52", "53") )
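If you would rather have the binary indicator column described in the question, the same two steps can be combined. This is a minimal sketch on made-up data: `firms`, `relevant_sectors`, and `in_scope` are illustrative names, and it assumes the code column is called `naics` as in the comments of the question.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data: gvkey and naics values are invented
firms <- tibble(
  gvkey = c(1001, 1002, 1003, 1004),
  naics = c("522110", "531120", "334111", NA)
)

# 52 = Finance and Insurance, 53 = Real Estate and Rental and Leasing
relevant_sectors <- c("52", "53")

firms_flagged <- firms %>%
  mutate(
    # as.character() guards against naics being stored as a number
    sector = str_sub(as.character(naics), start = 1, end = 2),
    # 1 if the firm is in one of the two sectors, 0 otherwise
    # (a missing code gives 0, because NA %in% ... is FALSE)
    in_scope = as.integer(sector %in% relevant_sectors)
  )

# Keep only the flagged firms
firms_relevant <- firms_flagged %>% filter(in_scope == 1)
```

Keeping the indicator as its own column means you can check how many firms fall in each group, for example with `count(firms_flagged, in_scope)`, before dropping anything.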
Hopefully this points you in the right direction.