How to narrow down a data frame in R

Question

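Please pardon the less-than-perfect title; I had some trouble working out how to phrase this problem.

Below is some manually created data. There are three fields: state, codetype, and code. The reason for asking is that I am trying to join a more extensive version of this table to a data frame of 1.6 million rows, and I am running out of memory. My thought is to greatly reduce the number of rows in this table, industry, before the join.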

state <- c(32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32)
codetype <- c(10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10)
code <- c(522,523,524,532,533,534,544,545,546,551,552,552,561,562,563,571,572,573,574)



industry = data.frame(state,codetype,code)

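The desired result is a two-step operation. First, I would shorten the six-digit codes to two digits (the sample codes above have only three digits). This is done via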

library(dplyr)
industry <- industry %>% mutate(twodigit = substr(code, 1, 2))

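This produces a fourth column, twodigit. There are currently 19 rows, but twodigit has only 6 unique values: 52, 53, 54, 55, 56, 57. How do I tell it to drop every row whose twodigit value has already appeared?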

Answer 1

We can use distinct and set .keep_all = TRUE to keep all the other columns

library(dplyr)
industry %>%
   distinct(twodigit, .keep_all = TRUE)
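
As a self-contained sketch of the distinct approach, assuming the sample data from the question (the name industry_small is only illustrative and does not appear in the original post):

```r
library(dplyr)

# Sample data from the question
industry <- data.frame(
  state    = rep(32, 19),
  codetype = rep(10, 19),
  code     = c(522, 523, 524, 532, 533, 534, 544, 545, 546, 551,
               552, 552, 561, 562, 563, 571, 572, 573, 574)
)

# substr() coerces the numeric code to character, so twodigit is a character column
industry <- industry %>% mutate(twodigit = substr(code, 1, 2))

# Keep the first row for each twodigit value, retaining every other column
industry_small <- industry %>% distinct(twodigit, .keep_all = TRUE)

nrow(industry_small)
```

Because twodigit comes out as character, the key column in the larger table would presumably need the same type before joining.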

Another option is to use duplicated inside filter

industry %>%
    filter(!duplicated(twodigit))
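
To see why this works, here is a small base R sketch on a made-up vector (the values are illustrative, not the full data from the question). duplicated() returns a logical vector that marks each value already seen earlier, so negating it keeps only first occurrences:

```r
# Illustrative vector of two-digit prefixes
twodigit <- c("52", "52", "53", "53", "54")

# duplicated() flags every repeat of a value that appeared earlier in the vector
dup <- duplicated(twodigit)
dup
#> [1] FALSE  TRUE FALSE  TRUE FALSE

# Negating the flags keeps only the first occurrence of each value
twodigit[!dup]
#> [1] "52" "53" "54"
```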

For efficiency, a data.table approach may be worth using

library(data.table)
setDT(industry)[!duplicated(substr(code, 1, 2))]
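
Note that setDT converts industry to a data.table by reference. If the original data.frame should stay untouched, one possible variant (a sketch on a shortened, illustrative sample; industry_df, industry_dt and industry_small are hypothetical names) is to copy with as.data.table and deduplicate with unique and its by argument:

```r
library(data.table)

# Shortened illustrative sample in the same shape as the question's data
industry_df <- data.frame(
  state    = rep(32, 5),
  codetype = rep(10, 5),
  code     = c(522, 523, 532, 533, 544)
)

# as.data.table() makes a copy, so industry_df itself stays a plain data.frame
industry_dt <- as.data.table(industry_df)

# Add the two-digit prefix by reference, then keep the first row per prefix
industry_dt[, twodigit := substr(code, 1, 2)]
industry_small <- unique(industry_dt, by = "twodigit")
```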

Answer 2

Using unique():

library(tidyverse)

state <- c(32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32)
codetype <- c(10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10)
code <- c(522,523,524,532,533,534,544,545,546,551,552,552,561,562,563,571,572,573,574)
industry = data.frame(state,codetype,code)
industry<-industry %>% mutate(twodigit = substr(code,1,2))


unique(industry$twodigit) %>%
    map_dfr(~filter(industry, twodigit == .x)[1, ])
#>   state codetype code twodigit
#> 1    32       10  522       52
#> 2    32       10  532       53
#> 3    32       10  544       54
#> 4    32       10  551       55
#> 5    32       10  561       56
#> 6    32       10  571       57

Created on 2021-06-10 by the reprex package (v2.0.0)

