How to split a character vector based on a numeric vector of positions
I want to split a character vector into substrings, using a second, numeric vector that holds the split points:
vec <- "LAYRVCMTNEGHPWVSLVVQKTRLQISQDPSLNYEYLPTMGLKSFIQASLALLFGKHSQAIVENRVGGVHTVGDSGAFQLGVQFLRAWHKDARIVYIISSQKELHGLVFQDMGFTVYEYSVWDPKKLCMDPDILLNVVEQIPHGCVLVMGNIIDCKLTPSGWAKLMSM"
split.points <- c(25, 32, 55, 90, 124)
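I want to cut the character vector above into six substrings at the positions given in the split.points vector.
It sounds simple, but the split commands I know of either work only with a specific regular expression (pattern) or produce substrings of a fixed length.
Any help would be much appreciated.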
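We can use separate from tidyr, which treats a numeric sep as the character positions to split at:
library(tidyverse)
# tibble() replaces the deprecated data_frame()
tibble(vec) %>%
separate(vec, into = paste0('vec', 1:6), sep = split.points) %>%
unlist(use.names = FALSE)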
#[1] "LAYRVCMTNEGHPWVSLVVQKTRLQ" "ISQDPSL" "NYEYLPTMGLKSFIQASLALLFG"
#[4] "KHSQAIVENRVGGVHTVGDSGAFQLGVQFLRAWHK" "DARIVYIISSQKELHGLVFQDMGFTVYEYSVWDP"
#[6] "KKLCMDPDILLNVVEQIPHGCVLVMGNIIDCKLTPSGWAKLMSM"
A base R option would be substr:
unname(mapply(substr, vec, start = c(1, split.points+1), stop = c(split.points, nchar(vec))))
#[1] "LAYRVCMTNEGHPWVSLVVQKTRLQ" "ISQDPSL" "NYEYLPTMGLKSFIQASLALLFG"
#[4] "KHSQAIVENRVGGVHTVGDSGAFQLGVQFLRAWHK" "DARIVYIISSQKELHGLVFQDMGFTVYEYSVWDP" "KKLCMDPDILLNVVEQIPHGCVLVMGNIIDCKLTPSGWAKLMSM"
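To see why c(1, split.points + 1) and c(split.points, nchar(vec)) are the right bounds, it can help to lay them out side by side. The following is only a small sketch; the names starts, stops and bounds are mine and do not appear in the answer.

```r
vec <- "LAYRVCMTNEGHPWVSLVVQKTRLQISQDPSLNYEYLPTMGLKSFIQASLALLFGKHSQAIVENRVGGVHTVGDSGAFQLGVQFLRAWHKDARIVYIISSQKELHGLVFQDMGFTVYEYSVWDPKKLCMDPDILLNVVEQIPHGCVLVMGNIIDCKLTPSGWAKLMSM"
split.points <- c(25, 32, 55, 90, 124)

# Each piece starts one character after the previous split point ...
starts <- c(1, split.points + 1)
# ... and ends at the next split point, or at the end of the string.
stops <- c(split.points, nchar(vec))

# One row per piece (illustrative names, not from the answer).
bounds <- data.frame(start = starts, stop = stops, width = stops - starts + 1)
print(bounds)

# Sanity check: the widths should add up to the full length of the string.
sum(bounds$width) == nchar(vec)
```

If the last line is not TRUE, the pieces overlap or leave a gap, which usually means a split point lies outside the string.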
We can try substring:
substring(
vec,
c(1, split.points + 1),
c(split.points, nchar(vec))
)
# [1] "LAYRVCMTNEGHPWVSLVVQKTRLQ" "ISQDPSL"
# [3] "NYEYLPTMGLKSFIQASLALLFG" "KHSQAIVENRVGGVHTVGDSGAFQLGVQFLRAWHK"
# [5] "DARIVYIISSQKELHGLVFQDMGFTVYEYSVWDP" "KKLCMDPDILLNVVEQIPHGCVLVMGNIIDCKLTPSGWAKLMSM"
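Because substring recycles its text argument to match the start and stop vectors, no mapply is needed. If this comes up often, the call could be wrapped in a small helper. The function below is a hypothetical sketch (split_at is my name, not a base R function), and it assumes that positions outside the string should simply be ignored.

```r
# Hypothetical helper: split one string after each position in `pos`.
split_at <- function(x, pos) {
  stopifnot(is.character(x), length(x) == 1L)
  # Sort and de-duplicate so the pieces come out in order.
  pos <- sort(unique(pos))
  # Assumption: keep only split points that fall strictly inside the string.
  pos <- pos[pos >= 1 & pos < nchar(x)]
  substring(x, c(1, pos + 1), c(pos, nchar(x)))
}

split_at("abcdefghij", c(3, 7))
# [1] "abc"  "defg" "hij"
```

With the data from the question, split_at(vec, split.points) should return the same six pieces as the substring call above.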
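Another option is to use read.fwf. Note that the widths below describe only the first five fields, so the sixth substring (from position 125 onward) is not returned: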
unlist(read.fwf(textConnection(vec),
widths = c(25, diff(split.points)),
as.is = TRUE),
use.names = FALSE)
This gives:
[1] "LAYRVCMTNEGHPWVSLVVQKTRLQ"
[2] "ISQDPSL"
[3] "NYEYLPTMGLKSFIQASLALLFG"
[4] "KHSQAIVENRVGGVHTVGDSGAFQLGVQFLRAWHK"
[5] "DARIVYIISSQKELHGLVFQDMGFTVYEYSVWDP"
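The call above passes five widths and therefore returns five pieces. One way to get all six, sketched here as a variation rather than as part of the original answer, is to add 0 and nchar(vec) as outer boundaries before taking the differences. The names widths and pieces are mine.

```r
vec <- "LAYRVCMTNEGHPWVSLVVQKTRLQISQDPSLNYEYLPTMGLKSFIQASLALLFGKHSQAIVENRVGGVHTVGDSGAFQLGVQFLRAWHKDARIVYIISSQKELHGLVFQDMGFTVYEYSVWDPKKLCMDPDILLNVVEQIPHGCVLVMGNIIDCKLTPSGWAKLMSM"
split.points <- c(25, 32, 55, 90, 124)

# Widths of all six fields: differences between consecutive boundaries,
# with 0 and nchar(vec) added as the outer boundaries.
widths <- diff(c(0, split.points, nchar(vec)))

pieces <- unlist(
  read.fwf(textConnection(vec), widths = widths, as.is = TRUE),
  use.names = FALSE
)

length(pieces)
# [1] 6
```

Pasting the pieces back together with paste(pieces, collapse = "") should reproduce vec, which is a quick way to confirm that nothing was dropped.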
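I would not be surprised if your character vector originated from a data file. In that case read.fwf is particularly useful, because it can read every line of the file into its own row. An example: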
vec2 <- "LAYRVCMTNEGHPWVSLVVQKTRLQISQDPSLNYEYLPTMGLKSFIQASLALLFGKHSQAIVENRVGGVHTVGDSGAFQLGVQFLRAWHKDARIVYIISSQKELHGLVFQDMGFTVYEYSVWDPKKLCMDPDILLNVVEQIPHGCVLVMGNIIDCKLTPSGWAKLMSM
LAYRVCMTNEGHPWVSLVVQKTRLQISQDPSLNYEYLPTMGLKSFIQASLALLFGKHSQAIVENRVGGVHTVGDSGAFQLGVQFLRAWHKDARIVYIISSQKELHGLVFQDMGFTVYEYSVWDPKKLCMDPDILLNVVEQIPHGCVLVMGNIIDCKLTPSGWAKLMSM"
read.fwf(textConnection(vec2),
widths = c(25, diff(split.points)),
as.is=TRUE)
This gives:
                         V1      V2                      V3                                  V4                                 V5
1 LAYRVCMTNEGHPWVSLVVQKTRLQ ISQDPSL NYEYLPTMGLKSFIQASLALLFG KHSQAIVENRVGGVHTVGDSGAFQLGVQFLRAWHK DARIVYIISSQKELHGLVFQDMGFTVYEYSVWDP
2 LAYRVCMTNEGHPWVSLVVQKTRLQ ISQDPSL NYEYLPTMGLKSFIQASLALLFG KHSQAIVENRVGGVHTVGDSGAFQLGVQFLRAWHK DARIVYIISSQKELHGLVFQDMGFTVYEYSVWDP
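If the records are already in an R character vector rather than in a file, a similar table can be built with substring alone. This is a minimal sketch on made-up toy data; records, pos and pieces are illustrative names, and it assumes every record shares the same split positions.

```r
# Two toy records with the same fixed-width layout (made-up data).
records <- c("abcdefghij", "ABCDEFGHIJ")
pos <- c(3, 7)

# Split each record at the same positions; the result is a list with
# one character vector of pieces per record.
pieces <- lapply(records, function(x) {
  substring(x, c(1, pos + 1), c(pos, nchar(x)))
})

# Bind the pieces row-wise into a character matrix, one row per record.
m <- do.call(rbind, pieces)
m
#      [,1]  [,2]   [,3]
# [1,] "abc" "defg" "hij"
# [2,] "ABC" "DEFG" "HIJ"
```

Unlike the read.fwf call above, this keeps the final field without computing any widths, since the last stop is always nchar(x).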