How to summarise values in a column with non-exact match in R?
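I have a data.table with more than ten thousand rows. I want to count how many times each value occurs in one column, but using non-exact matching.
The data looks like this: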
dt1 <- data.table (place = c("a north", "a south", "b south", "a north", "c west", "b north", "c south", "a west", "b west"))
place
1: a north
2: a south
3: b south
4: a north
5: c west
6: b north
7: c south
8: a west
9: b west
I just want to count how many times "a", "b" and "c" occur, regardless of the word that follows. I would like the result to look like this:
a b c
1: 4 3 2
I tried summarise, charmatch and pmatch, but they did not work. Can anyone help?
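It all depends on how much the values in place vary and on what the other cases look like.
You can separate the column into two, then count by the first part:
library(dplyr)
library(tidyr)
# split "a north" into letter = "a" and direction = "north"
separate(dt1, place, into = c('letter', 'direction')) %>%
  count(letter) %>%
  pivot_wider(names_from = letter, values_from = n)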
You can use mutate() and substr() to create a new column containing only the part of the string you want, then use count() to count the occurrences, like this:
library("data.table")
library("dplyr")
dt1 <- data.table(place = c("a north", "a south", "b south", "a north", "c west", "b north", "c south", "a west", "b west"))
dt1 |>
mutate(first_letter = substr(place,1,1)) |>
count(first_letter)
Output:
first_letter n
1: a 4
2: b 3
3: c 2
If you want a different kind of match, you may need to use a regular expression and case_when inside mutate.
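A minimal sketch of that idea, assuming each group is identified by a prefix followed by a space. The column name group, the patterns and the "other" fallback label are illustrative choices, not part of the original answer:

```r
library(data.table)
library(dplyr)

dt1 <- data.table(place = c("a north", "a south", "b south", "a north",
                            "c west", "b north", "c south", "a west", "b west"))

counts <- dt1 |>
  mutate(group = case_when(
    grepl("^a ", place) ~ "a",   # starts with "a" followed by a space
    grepl("^b ", place) ~ "b",
    grepl("^c ", place) ~ "c",
    TRUE ~ "other"               # anything that matches none of the patterns
  )) |>
  count(group)

counts
```

Each grepl() pattern can be replaced by whatever rule defines a group in your data; rows that match no rule are counted under "other" instead of being dropped silently.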
You can try a pure data.table solution:
# keep only the text before the first space, count, then reshape to one row
dt1[, .(var = sub(" .*", "", place))
][, .(cnt = .N), by = var
][, data.table::transpose(.SD, make.names = "var")]
a b c
1: 4 3 2
A simple pure data.table solution:
library(data.table)
dt1[,lapply(.SD, substr,1,1)][,.N, by = place]
#> place N
#> 1: a 4
#> 2: b 3
#> 3: c 2
If you need the result as a named vector:
res <- dt1[,lapply(.SD, substr,1,1)][,.N, by = place]$N
names(res) <- dt1[,lapply(.SD, substr,1,1)][,.N, by = place]$place
res
#> a b c
#> 4 3 2
Created on 2021-10-11 by the reprex package (v2.0.1)
Use table combined with trimws from base R:
# the backslash must be doubled inside an R string
table(trimws(dt1$place, whitespace = "\\s+.*"))
a b c
4 3 2
A simpler approach for beginners:
library("data.table")
library("dplyr")
dt1 <- data.table(place = c("a north", "a south", "b south", "a north",
"c west", "b north", "c south", "a west", "b west"))
answer <- cbind(a = sum(startsWith(dt1$place, "a")),
                b = sum(startsWith(dt1$place, "b")),
                c = sum(startsWith(dt1$place, "c")))
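The same startsWith() idea can be written without repeating one line per letter. This is only a sketch: the prefixes vector is an assumed list of the groups you care about, and a trailing space is appended so that a prefix such as "a" would not also match a longer name beginning with "a".

```r
library(data.table)

dt1 <- data.table(place = c("a north", "a south", "b south", "a north",
                            "c west", "b north", "c south", "a west", "b west"))

# hypothetical list of prefixes to count
prefixes <- c("a", "b", "c")

# for each prefix, count the rows whose place starts with "<prefix> "
answer <- vapply(
  prefixes,
  function(p) sum(startsWith(dt1$place, paste0(p, " "))),
  integer(1)
)

answer
```

Because vapply() keeps the names of a character input, the result is a named integer vector, so answer["a"] gives the count for "a".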