How to count unique strings and put the count in another column in R
I have a dataset with more than 12,000 records, as shown below, and I need to count the distinct strings in each row. The dataset looks like this:
Drugs Gender year
met,met,sulp,DPP M 2020
met and sulp and DPP M 2021
SGLT SGLT SGLT M 2018
Incretin, AGI, AGI F 2019
THK, USP F 2013
I need output like the following. Please advise.
Drugs number of drugs Gender year
met,met,sulp,DPP 3 M 2020
met and sulp and DPP 3 M 2021
SGLT SGLT SGLT 1 M 2018
Incretin, AGI, AGI 2 F 2019
THK, USP 2 F 2013
Thanks in advance.
You can use stringr::str_count to count the number of 'DRUG' values.
library(stringr)
df$num_drugs <- str_count(df$Drugs, regex('DRUG', ignore_case = TRUE))
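To count unique values, you can use:
# split on commas (with optional trailing whitespace), then count distinct pieces per row
df$num_drugs <- sapply(strsplit(df$Drugs, ',\\s*'), function(x) length(unique(x)))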
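The comma split above only covers rows that use commas. As a rough sketch for the mixed separators in the question's sample (commas, the word "and", plain spaces), something like the following base R helper might work. The name count_unique_drugs is hypothetical and not part of the original answer, and it assumes drug names themselves never contain spaces.

```r
# Hypothetical helper (an assumption, not from the original answer):
# count the distinct drug names in each element of a character vector.
count_unique_drugs <- function(x) {
  # Split on commas, the standalone word "and", or runs of whitespace
  parts <- strsplit(trimws(x), "\\s*,\\s*|\\s+and\\s+|\\s+", perl = TRUE)
  # Drop empty pieces, then count the distinct names per row
  vapply(parts, function(p) length(unique(p[nzchar(p)])), integer(1))
}

# Sample data copied from the question
df <- data.frame(
  Drugs = c("met,met,sulp,DPP", "met and sulp and DPP", "SGLT SGLT SGLT",
            "Incretin, AGI, AGI", "THK, USP"),
  Gender = c("M", "M", "M", "F", "F"),
  year = c(2020, 2021, 2018, 2019, 2013),
  stringsAsFactors = FALSE
)

df$num_drugs <- count_unique_drugs(df$Drugs)
```

On the sample rows this gives 3, 3, 1, 2, 2, which matches the output requested in the question.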
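Update after the input was changed:
Thanks to Rui Barradas for the support!
First we build a vector containing the elements to be counted. This could be done more elegantly.
Then we count using a regular expression: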
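library(tidyr)
library(dplyr)
library(stringr)  # needed for str_trim()

# Collect every distinct drug name that appears in the column
df1 <- df %>%
  select(Drugs) %>%
  separate_rows(Drugs, sep = ",") %>%
  separate_rows(Drugs, sep = " and ") %>%
  separate_rows(Drugs, sep = " ") %>%
  mutate(Drugs = str_trim(Drugs)) %>%
  distinct(Drugs) %>%
  filter(Drugs != "")

# Build one alternation pattern, e.g. "met|sulp|DPP|..."
my_expression <- paste(df1$Drugs, collapse = "|")

# Count the pattern matches in each row and place the result in column 2
df %>%
  mutate(number = lengths(gregexpr(my_expression, Drugs)), .before = 2)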
Output:
Drugs number Gender year
<chr> <int> <chr> <chr>
1 met,met,sulp,DPP 4 M 2020
2 met and sulp and DPP 3 M 2021
3 SGLT SGLT SGLT 3 M 2018
4 Incretin, AGI, AGI 3 F 2019
5 THK, USP 2 F 2013
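The output above counts every match, so repeated names within a row are included (4 for the first row, 3 for the SGLT row). If distinct names are wanted, one possible variation is to extract the matches and deduplicate them per row. This is only a sketch: the pattern is written out by hand here as an assumption, and the column name number_unique is made up for illustration.

```r
# Sample Drugs column copied from the question
df <- data.frame(
  Drugs = c("met,met,sulp,DPP", "met and sulp and DPP", "SGLT SGLT SGLT",
            "Incretin, AGI, AGI", "THK, USP"),
  stringsAsFactors = FALSE
)

# Alternation of the known drug names (hand-written here as an assumption;
# the answer above builds the same kind of pattern from the data)
my_expression <- "met|sulp|DPP|SGLT|Incretin|AGI|THK|USP"

# Extract all matches per row, then count only the distinct ones
matches <- regmatches(df$Drugs, gregexpr(my_expression, df$Drugs))
df$number_unique <- lengths(lapply(matches, unique))
```

Because the alternation matches substrings, a short name could also match inside a longer one, so wrapping each name in word boundaries may be worth considering for real data.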
If your data is messier and may contain stray leading or trailing spaces, I suggest this approach:
library(tidyverse)
df <- read.table(header = TRUE, text = "Drugs Gender year
'met,met,sulp,DPP ' M 2020
'met and sulp and DPP ' M 2021
'SGLT SGLT SGLT ' M 2018
'Incretin, AGI, AGI ' F 2019
'THK, USP' F 2013")
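# Trim each entry, replace " and " and any non-word characters with a space,
# split on whitespace, then count the distinct names per row
df %>%
  mutate(number_of_drugs = map_int(
    str_split(gsub('\\sand\\s|\\W+', ' ', str_trim(Drugs)), '\\s+'),
    ~ length(unique(.x))
  ))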
#> Drugs Gender year number_of_drugs
#> 1 met,met,sulp,DPP M 2020 3
#> 2 met and sulp and DPP M 2021 3
#> 3 SGLT SGLT SGLT M 2018 1
#> 4 Incretin, AGI, AGI F 2019 2
#> 5 THK, USP F 2013 2
Created on 2021-07-29 by the reprex package (v2.0.0)