How to summarise values in a column with non-exact match in R?

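I have a data.table with more than ten thousand rows. I want to count how many times each value occurs in one column, but using non-exact matching. The data looks like this: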

dt1 <- data.table (place = c("a north", "a south", "b south", "a north", "c west", "b north", "c south", "a west", "b west"))

     place
1: a north
2: a south
3: b south
4: a north
5:  c west
6: b north
7: c south
8:  a west
9:  b west

I just want to count how many times "a", "b" and "c" occur, regardless of the word that follows. I would like the result to look like this:

   a b c
1: 4 3 2

I tried summarise, charmatch and pmatch, but none of them worked. Can anyone help?

It all depends on how much the values of place can vary and on what other cases you need to handle.

You can split the column in two, then group and count:

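library(dplyr)
library(tidyr)

# Split "place" into letter and direction, count each letter, reshape to one row
dt1 %>%
  separate(place, into = c('letter', 'direction')) %>%
  group_by(letter) %>%
  count() %>%
  pivot_wider(names_from = letter, values_from = n)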
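
To illustrate the caveat about how much place can vary, here is a minimal base R sketch. The vector below is hypothetical data (it is not from the question) in which one prefix is longer than a single character; it compares taking the first character with taking the whole first word.

```r
# Hypothetical data, not from the question: "ab" is a two-character prefix
place <- c("ab north", "a south", "ab west", "a north")

# Taking only the first character merges "ab" into "a"
first_char <- substr(place, 1, 1)
by_char <- table(first_char)

# Dropping everything from the first space onward keeps the whole first word
first_word <- sub(" .*", "", place)
by_word <- table(first_word)

print(by_char)
print(by_word)
```

If the prefixes are always a single letter, as in the question, both approaches should give the same counts.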

You can use mutate() with substr() to create a new column containing only the part of the string you want, then use count() to count the occurrences, like this:

library("data.table")
library("dplyr")

dt1 <- data.table(place = c("a north", "a south", "b south", "a north", "c west", "b north", "c south", "a west", "b west"))

dt1 |>
  mutate(first_letter = substr(place,1,1)) |>
  count(first_letter)

Output:

   first_letter n
1:            a 4
2:            b 3
3:            c 2

If you want a different kind of match, you may need to use regular expressions and case_when inside mutate.
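
As a rough sketch of what that could look like: the grouping rules below (keeping "a" on its own and merging "b" and "c" into one label) are made up for illustration, and a plain data.frame is used so that only dplyr is needed.

```r
library(dplyr)

dt1 <- data.frame(place = c("a north", "a south", "b south", "a north",
                            "c west", "b north", "c south", "a west", "b west"))

res <- dt1 |>
  mutate(group = case_when(
    grepl("^a ", place)    ~ "a",       # starts with "a" followed by a space
    grepl("^[bc] ", place) ~ "b_or_c",  # hypothetical merged category
    TRUE                   ~ "other"    # anything that matches no rule
  )) |>
  count(group)

print(res)
```

The conditions are evaluated in order, so the first rule that matches a row decides its label.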

You can try a pure data.table solution:

dt1[, .(var = sub(" .*", "", place))
  ][, .(cnt = .N), by = var
  ][, data.table::transpose(.SD, make.names = 'var')]

   a b c
1: 4 3 2

A simple pure data.table solution:

library(data.table)

dt1[,lapply(.SD, substr,1,1)][,.N, by = place]
#>    place N
#> 1:     a 4
#> 2:     b 3
#> 3:     c 2

If you need the result as a named vector:

res <- dt1[,lapply(.SD, substr,1,1)][,.N, by = place]$N
names(res) <- dt1[,lapply(.SD, substr,1,1)][,.N, by = place]$place

res
#> a b c 
#> 4 3 2

Created on 2021-10-11 by the reprex package (v2.0.1)

Using table together with trimws from base R:

table(trimws(dt1$place, whitespace = "\\s+.*"))

a b c 
4 3 2 

A simpler approach for beginners:

library("data.table")

library("dplyr")

dt1 <- data.table(place = c("a north", "a south", "b south", "a north", 
                            "c west", "b north", "c south", "a west", "b west"))

answer <- cbind(a = sum(startsWith(dt1$place, "a")),
                b = sum(startsWith(dt1$place, "b")),
                c = sum(startsWith(dt1$place, "c")))
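
If there are more than a few prefixes, writing one sum(startsWith(...)) per letter gets repetitive. A minimal base R sketch of the same idea, looping over a vector of prefixes (the names prefixes, counts and wide are chosen here for illustration):

```r
place <- c("a north", "a south", "b south", "a north",
           "c west", "b north", "c south", "a west", "b west")

# Prefixes to count; extend this vector as needed
prefixes <- c("a", "b", "c")

# For each prefix, count the elements of place that start with it
counts <- vapply(prefixes, function(p) sum(startsWith(place, p)), integer(1))

# One-row layout, as requested in the question
wide <- as.data.frame(t(counts))

print(counts)
print(wide)
```

Note that startsWith() only tests the leading characters, so the prefix "a" would also match a value such as "ab north".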