How do I remove data that is not relevant for my research?

Question

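I am new to R. I am working on an exam project in which I am only interested in part of my dataset. The dataset covers US companies, and I only want the companies in the "Finance and Insurance" and "Real Estate and Rental and Leasing" sectors. The sector is given by the North American Industry Classification System (NAICS) code: it is the first two digits of the six-digit code.

As I said, I am new to R, but I have been trying to figure this out for a long time. The approach that makes the most sense to me is to create a new column holding a binary variable that indicates whether a company belongs to one of the two sectors, and then exclude rows on that basis. However, I have not been able to create this new column.

I would appreciate any help on how to do this, either by creating the binary variable or by simply excluding the irrelevant data.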

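#### Packages ####

library(dplyr)    # inner_join(), group_by(), summarise(), %>%
library(ggplot2)  # ggplot(), geom_col(), labs()

#### Data ####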

lobby_clean 

compusat 

politicians


#### Clean the "gvkey" for characters and convert to integers ####

lobby_clean[[1]] <- as.numeric(lobby_clean[[1]])

#### Merge the different datasets into one ####

lobby_compusat<-inner_join(lobby_clean, compusat, by ="gvkey")

lobby_compusat_politician <- inner_join(lobby_compusat, politicians, by="gvkey")

#### Group by year ####

mean_expend_by_year <- lobby_compusat_politician %>% 
  group_by(year.x) %>% 
  summarise(mean_expend=mean(expend))

#### Construct a plot of the data showing the development of the lobbying expenditures over the years among all companies ####

lobbying_development <- ggplot(data = mean_expend_by_year,mapping=aes(x=year.x,y=mean_expend))+
  geom_col() +
  labs(title = "Development in lobbying expenditure over time", x="Year", y="Average lobbying expenditures")

show(lobbying_development)

#### Exclude data that does not belong to the relevant sectors ####
#### Relevant sectors are:
#### "Finance and Insurance", code starts with: 52
#### "Real Estate and Rental and Leasing", code starts with: 53

## Create a new column based on the first two digits in "naics" that defines the sector to which the company belongs ##

Answer 1

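You are using a mix of tidyverse and base R code, but I will give some pointers using the tidyverse. In general, it helps if you give us more to work with; even a small snippet of your data would be useful.

To extract the first two digits from the North American Industry Classification code, you can add a mutate statement, e.g.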

library(tidyverse)
df <- df %>% mutate(sector = str_sub(naics, start = 1, end = 2))
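
To see what that mutate step produces, here is a minimal sketch on made-up data. The data frame `toy` and its `gvkey` and `naics` values are hypothetical, and the sketch assumes the codes are stored as numbers, so they are converted to character explicitly before slicing.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data; the gvkey and naics values are made up for illustration
toy <- data.frame(
  gvkey = c(1001, 1002, 1003, 1004),
  naics = c(522110, 531120, 334111, 524126)
)

# Convert the numeric code to text, then keep characters 1 to 2 as the sector
toy <- toy %>%
  mutate(sector = str_sub(as.character(naics), start = 1, end = 2))

print(toy)
```

Note that `sector` ends up as a character column, which is why the sector codes are compared as quoted strings rather than numbers.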

Then you can filter to keep only the two sectors you are interested in

df <- df %>% filter(sector %in% c("52", "53") )
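
If you would rather have the binary variable described in the question, a sketch along these lines may work. The data frame `toy`, its values, and the column name `in_scope` are hypothetical, and the codes are assumed to be stored as character strings here.

```r
library(dplyr)

# Hypothetical toy data; values are made up for illustration
toy <- data.frame(
  gvkey = c(1001, 1002, 1003, 1004),
  naics = c("522110", "531120", "334111", "524126")
)

toy <- toy %>%
  mutate(
    sector   = substr(naics, 1, 2),                    # first two digits of the code
    in_scope = as.integer(sector %in% c("52", "53"))   # 1 = relevant sector, 0 = otherwise
  )

# Keep only the rows flagged as relevant
toy_relevant <- toy %>% filter(in_scope == 1)

print(toy_relevant)
```

In the question's pipeline, the same steps would presumably be applied to `lobby_compusat_politician` before the `group_by()` step, provided the merged data contains a `naics` column.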

Hopefully this gets you moving in the right direction.


r

dataframe

exclude