How to split a string column at multiple points in R?
I have a genetic dataset of variant IDs:
VARIANT_ID
01_1254436_A_G_1
02_2254436_A_G_1
03_3255436_A_G_1
10_10344745_A_G_1
11_11256437_A_G_1
11_11343426_A_G_1
12_12222431_A_G_1
14_14200436_A_G_1
15_15256789_A_G_1
I want to create a new column containing the part of each ID between the first _ and the last _, so the desired output is:
VARIANT_ID newcol
01_1254436_A_G_1 1254436_A_G
02_2254436_A_G_1 2254436_A_G
03_3255436_A_G_1 3255436_A_G
10_10344745_A_G_1 10344745_A_G
11_11256437_A_G_1 11256437_A_G
11_11343426_A_G_1 11343426_A_G
12_12222431_A_G_1 12222431_A_G
14_14200436_A_G_1 14200436_A_G
15_15256789_A_G_1 15256789_A_G
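I couldn't find a similar question for R, so I'm not sure how to approach this. I have already tried str_split_fixed(),
but that didn't work. Any help on which function to try would be much appreciated.
Input data: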
dput(df)
structure(list(VARIANT_ID = c("01_1254436_A_G_1", "02_2254436_A_G_1",
"03_3255436_A_G_1", "10_10344745_A_G_1", "11_11256437_A_G_1",
"11_11343426_A_G_1", "12_12222431_A_G_1", "14_14200436_A_G_1",
"15_15256789_A_G_1")), row.names = c(NA, -9L), class = c("data.table",
"data.frame"))
We can use a simple regular expression to strip the leading and trailing pieces:
library(dplyr)
df %>%
  mutate(split_string = stringr::str_replace_all(VARIANT_ID, "^\\d{1,}_|_\\d+$", ""))
Or:
df %>%
  mutate(split_string = stringr::str_replace_all(VARIANT_ID,
    "^\\d{1,}_(?=\\d{2,})|_\\d$", ""))
Result:
VARIANT_ID split_string
1: 01_1254436_A_G_1 1254436_A_G
2: 02_2254436_A_G_1 2254436_A_G
3: 03_3255436_A_G_1 3255436_A_G
4: 10_10344745_A_G_1 10344745_A_G
5: 11_11256437_A_G_1 11256437_A_G
6: 11_11343426_A_G_1 11343426_A_G
7: 12_12222431_A_G_1 12222431_A_G
8: 14_14200436_A_G_1 14200436_A_G
9: 15_15256789_A_G_1 15256789_A_G
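To see what each half of the first pattern removes, here is a minimal base R sketch on a single ID. The variable names (id, no_prefix, no_suffix) are only illustrative, and it uses sub() rather than the stringr call from the answer.

```r
# One example ID taken from the question's data
id <- "10_10344745_A_G_1"

# First alternative: leading digits plus the first underscore
no_prefix <- sub("^\\d+_", "", id)

# Second alternative: last underscore plus the trailing digits
no_suffix <- sub("_\\d+$", "", no_prefix)

print(no_prefix)
print(no_suffix)
```

Because the two alternatives are anchored at opposite ends of the string (^ and $), combining them with | in a single replace-all call removes both pieces in one pass.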
An option with extract:
library(tidyr)
extract(df, VARIANT_ID, into = 'newcol', '^\\d+_(.*)_\\d+$', remove = FALSE)
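If adding packages is not desirable, the same idea can be sketched in base R with a capture group. This is only an illustration on a two-row stand-in for df, and it assumes that "between the first _ and the last _" should hold whatever the outer fields contain, so it matches on underscores rather than digits.

```r
# Small stand-in for the question's df (two of the original rows)
df <- data.frame(VARIANT_ID = c("01_1254436_A_G_1", "10_10344745_A_G_1"),
                 stringsAsFactors = FALSE)

# [^_]*_   : everything up to and including the first underscore
# (.*)     : the part to keep (greedy, so it stretches to the last underscore)
# _[^_]*$  : the last underscore and whatever follows it
df$newcol <- sub("^[^_]*_(.*)_[^_]*$", "\\1", df$VARIANT_ID)

print(df)
```

An ID with fewer than two underscores would not match the pattern and would be returned unchanged by sub(), so it may be worth checking for such rows first.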