R: split a column in a data frame based on square brackets
I have a data frame:
x <- data.frame(a = letters[1:7], b = letters[2:8],
c = c("bla bla [ text1 ]", "bla bla [text2]", "how how [text3 ]",
"wow wow [ text4a ] [ text4b ]", "ba ba [ text5a ][ text5b]",
"my text A", "my text B"), stringsAsFactors = FALSE)
x
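I want to split column c based on the text between square brackets [...]. If column c contains only one pair of square brackets, I want the bracketed string to go into the next column. If column c contains two strings enclosed in [ and ], I want only the string between the last [ ] to go into the new column.
Here is my approach. It looks complicated, and I am using a loop. Is there a more parsimonious way to do this?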
library(stringr)
# Count the opening square brackets "[" in column c
# (the bracket must be escaped as "\\[" inside an R string):
sqrbrack_count <- str_count(x$c, pattern = "\\[")
# Create a new column:
x$newcolumn <- NA
for (i in seq_len(nrow(x))) { # loop through the rows of x
  if (sqrbrack_count[i] == 0) next # do nothing if there are no square brackets
  minilist <- str_split_fixed(x[i, "c"], pattern = "\\[", n = Inf) # split the string at each "["
  if (sqrbrack_count[i] == 1) { # exactly one "["
    x[i, "c"] <- minilist[1]
    x[i, "newcolumn"] <- minilist[2]
  } else { # more than one "["
    x[i, "c"] <- paste(minilist[1:2], collapse = "+")
    x[i, "newcolumn"] <- minilist[3]
  }
}
# Remove the remaining closing square brackets we no longer need:
x$c <- str_replace(x$c, pattern = " \\]", replacement = "")
x$c <- str_replace(x$c, pattern = "\\]", replacement = "")
x$newcolumn <- str_replace(x$newcolumn, pattern = " \\]", replacement = "")
x$newcolumn <- str_replace(x$newcolumn, pattern = "\\]", replacement = "")
x
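The code below is a bit shorter and may be easier to follow, since most of the complex logic happens in two lines. I have added comments above those two lines; I think the rest is self-explanatory.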
library(plyr)
# find all strings between the characters '[' and ']'
# (lookbehind/lookahead need perl = TRUE; brackets are escaped as "\\[" and "\\]")
strmatches = lapply(1:nrow(x), function(y) {regmatches(x$c[y], gregexpr("(?<=\\[).*?(?=\\])", x$c[y], perl = TRUE))[[1]]})
# parse these to a data frame called 'new_cols' (columns V1, V2; NA where a row has fewer matches)
new_cols = rbind.fill(lapply(strmatches, function(x) {as.data.frame(t(x), stringsAsFactors = FALSE)}))
df = cbind(x, new_cols)
df$c = gsub("\\[.*$", "", x$c) # only keep everything before the first '['
# rows with two bracketed strings: append the first one to c, move the second one into V1
df$c[!is.na(df$V2)] = paste0(df$c[!is.na(df$V2)], '+', df$V1[!is.na(df$V2)])
df$V1[!is.na(df$V2)] = df$V2[!is.na(df$V2)]
df$V2 = NULL
colnames(df)[colnames(df) == "V1"] = "newcolumn"
Output:
  a b                 c     V1
1 a b           bla bla  text1
2 b c           bla bla  text2
3 c d           how how  text3
4 d e  wow wow + text4a text4b
5 e f    ba ba + text5a text5b
6 f g         my text A   <NA>
7 g h         my text B   <NA>
Hope this helps!
PS: This matches your expected output, but you may want to add some str_trim calls in there.
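As a rough illustration of the trimming suggested in the PS, here is a minimal sketch that does the whole split in a vectorized way with base R only. It is not the code from the answer above: it uses trimws() in place of str_trim, and the " + " separator and the helper names pieces and prefix are assumptions made for this example.

```r
# Sample data from the question
x <- data.frame(a = letters[1:7], b = letters[2:8],
                c = c("bla bla [ text1 ]", "bla bla [text2]", "how how [text3 ]",
                      "wow wow [ text4a ] [ text4b ]", "ba ba [ text5a ][ text5b]",
                      "my text A", "my text B"), stringsAsFactors = FALSE)

# All bracketed strings per row, as a list of character vectors
# (character(0) for rows without brackets); regmatches() is vectorized
pieces <- regmatches(x$c, gregexpr("(?<=\\[)[^\\]]*(?=\\])", x$c, perl = TRUE))
pieces <- lapply(pieces, trimws)  # strip surrounding whitespace

# Text before the first "[" (the whole string if there is no bracket)
prefix <- trimws(sub("\\[.*$", "", x$c))

# The last bracketed string goes to the new column, NA if there is none
x$newcolumn <- vapply(pieces, function(p) {
  if (length(p) > 0) p[length(p)] else NA_character_
}, character(1))

# Any earlier bracketed strings stay in c, joined with " + " (assumed separator)
x$c <- vapply(seq_along(pieces), function(i) {
  p <- pieces[[i]]
  paste(c(prefix[i], p[-length(p)]), collapse = " + ")
}, character(1))

x
```

Because every piece is trimmed before it is pasted back together, the stray spaces from strings such as "[ text4a ]" do not end up in either column.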