How can I count unique two-word phrases that are separated by commas within a cell in R?
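I have a data frame that contains different locations (Location) and the animal species found at each location (Spp). The species are coded using their unique genus-species names. I want to know how often each unique genus-species name occurs across the data frame.
Example data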
df1 <- data.frame(matrix(ncol = 2, nrow = 3))
x <- c("Location","Spp")
colnames(df1) <- x
df1$Location <- seq(1,3,1)
df1[1,2] <- c("Genus1 species1")
df1[2,2] <- c("Genus1 species1, Genus1 species2")
df1[3,2] <- c("Genus1 species1, Genus1 species2, Genus2 species1")
The output should look like this
Spp              Freq
Genus1 species1  3
Genus1 species2  2
Genus2 species1  1
I have tried using the corpus package to solve this, but I can only get it to count unique words rather than unique genus-species phrases.
library(tm)
library(corpus)
library(dplyr)
text <- df1[,2]
docs <- Corpus(VectorSource(text))
docs <- docs %>%
tm_map(removePunctuation)
dtm <- TermDocumentMatrix(docs)
matrix <- as.matrix(dtm)
words <- sort(rowSums(matrix), decreasing = TRUE)
words ### only provides count of unique individual Genus and species words. I want similar but need to keep Genus and species together.
Here is a quick solution:
df1 <- data.frame(matrix(ncol = 2, nrow = 3))
x <- c("Location","Spp")
colnames(df1) <- x
df1$Location <- seq(1,3,1)
df1[1,2] <- c("Genus1 species1")
df1[2,2] <- c("Genus1 species1, Genus1 species2")
df1[3,2] <- c("Genus1 species1, Genus1 species2, Genus2 species1")
table(unlist(strsplit(df1$Spp,', ')))
#>
#> Genus1 species1 Genus1 species2 Genus2 species1
#>               3               2               1
Created on 2021-10-04 by the reprex package (v2.0.1)
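The one-liner above returns a named table rather than the two-column Spp / Freq layout shown in the question. A minimal base R sketch of one way to get that shape follows. The names `phrases` and `freq` are made up for illustration, and splitting on `",\\s*"` plus `trimws()` is an assumption to tolerate inconsistent spacing after commas, not something the original data requires.

```r
# Rebuild the sample data from the question.
df1 <- data.frame(
  Location = 1:3,
  Spp = c("Genus1 species1",
          "Genus1 species1, Genus1 species2",
          "Genus1 species1, Genus1 species2, Genus2 species1"),
  stringsAsFactors = FALSE
)

# Split each cell on a comma followed by optional whitespace,
# flatten to one vector, and trim any leftover spaces.
phrases <- trimws(unlist(strsplit(df1$Spp, ",\\s*")))

# Naming the table dimension "Spp" makes as.data.frame() produce
# columns Spp and Freq directly.
freq <- as.data.frame(table(Spp = phrases), stringsAsFactors = FALSE)

# Sort so the most frequent genus-species name comes first.
freq <- freq[order(-freq$Freq), ]
freq
```

Because each phrase is kept whole between commas, the genus and species words are never counted separately, which is what the tm approach in the question was doing.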
We can use separate_rows and count
library(dplyr)
library(tidyr)
df1 %>%
  # the backslash must be doubled inside an R string: ",\\s+" is the regex ,\s+
  separate_rows(Spp, sep = ",\\s+") %>%
  count(Spp, name = 'Freq')
# A tibble: 3 × 2
  Spp              Freq
  <chr>           <int>
1 Genus1 species1     3
2 Genus1 species2     2
3 Genus2 species1     1