How to extract all words after the nth word from a string in R?
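The first column of my data.frame consists of strings, and the second column is a unique key.
I want to extract all words after the nth word from each string; if a string has <= n words, I want the entire string.
My data.frame has more than 10k rows, so I would like to know whether there is a fast way to do this other than a for loop.
Thanks.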
How about the following:
# Generate some sample data
library(tidyverse)
df <- data.frame(
    one = c("Entries from row one", "Entries from row two", "Entries from row three"),
    two = runif(3))
# Define function to extract all words after the n-th word
# (or return the full string if n >= # of words in string)
crop_string <- function(ss, n) {
    # Split on runs of whitespace; note the doubled backslash required in an R string.
    # vapply returns a plain character vector (lapply would create a list column).
    vapply(strsplit(as.character(ss), "\\s+"), function(v) {
        if (length(v) > n) paste(v[(n + 1):length(v)], collapse = " ")
        else paste(v, collapse = " ")
    }, character(1))
}
# Let's crop strings from column one by removing the first 3 words (n = 3)
n <- 3
df %>%
    mutate(words_after_n = crop_string(one, n))
#                     one       two words_after_n
#1   Entries from row one 0.5120053           one
#2   Entries from row two 0.1873522           two
#3 Entries from row three 0.0725107         three
# If n >= # of words, return the full string
n <- 10
df %>%
    mutate(words_after_n = crop_string(one, n))
#                     one       two          words_after_n
#1   Entries from row one 0.9363278   Entries from row one
#2   Entries from row two 0.3024628   Entries from row two
#3 Entries from row three 0.6666226 Entries from row three
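Since the question asks for something fast on 10k+ rows, a fully vectorized variant may also be worth trying. The following is only a sketch under stated assumptions: `words_after_n` is a hypothetical helper name (not part of the answer above), words are taken to be runs of non-whitespace characters, and leading/trailing whitespace is trimmed first. It uses a single `sub()` call with a Perl-style regex instead of splitting each string.

```r
# Hypothetical helper (name is an assumption, not from the original answer):
# drop the first n whitespace-separated words of every string in x.
words_after_n <- function(x, n) {
  x <- trimws(as.character(x))
  # Match exactly n "word followed by whitespace" groups at the start.
  # If a string has <= n words the pattern cannot match, so sub() leaves it unchanged.
  pattern <- sprintf("^(?:\\S+\\s+){%d}", as.integer(n))
  sub(pattern, "", x, perl = TRUE)
}

x <- c("Entries from row one", "Entries from row two", "Entries from row three")
words_after_n(x, 3)   # "one" "two" "three"
words_after_n(x, 10)  # every string is returned unchanged
```

Because `sub()` is vectorized, the helper can be applied directly to a column, e.g. `df$words_after_n <- words_after_n(df$one, 3)`.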
Here I am using nchar(), so make sure your data has already been converted to character:
as.character(YOUR_DATA)
as.character(sapply(YOUR_DATA, function(x, y) {
    if (nchar(x) >= y) {
        substr(x, y, nchar(x))
    }
    else {x}
}, y = nth_data_you_want))
Suppose the data looks like this:
"gene@seq"
"Cblb@TAGTCCCGAAGGCATCCCGA"
"Fbxo27@CCCACGTGTTCTCCGGCATC"
"Fbxo11@GGAATATACGTCCACGAGAA"
"Pwp1@GCCCGACCCAGGCACCGCCT"
Using 10 as the nth value, the result is:
"gene@seq"
"CCCGAAGGCATCCCGA"
"CACGTGTTCTCCGGCATC"
"AATATACGTCCACGAGAA"
"GACCCAGGCACCGCCT"