Align strings of a dataframe in columns in R
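I have a large data frame, and I would like its strings aligned in columns according to their suffix (a substring). The source data frame looks like this:
(notst stands for other variable prefixes, which are ignored)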
# col1 col2 col3
# notst-s1 notst-s2 notst-x3
# notst-s1 notst-x3 notst-a5
# notst-s2 notst-a5
# notst-x3 notst-a5
The result should be:
# col1      col2      col3      col4
# notst-s1  notst-s2  notst-x3
# notst-s1            notst-x3  notst-a5
#           notst-s2            notst-a5
#                     notst-x3  notst-a5
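Edit:
Consider the whole suffix (everything after the "-"). It has no numbers. In some cases the whole string ("xxxx-spst") should match (*), because the xxxx part of the string comes in several versions.
For: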
df <- read.table(text="
col1 col2 col3
st1-ab stb-spst sta-spst
stc-spst sta-spst st4-ab
stb-spst st7-ab
st9-ba stb-spst",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)
A possible result could be (column names and order do not matter):
# col1    col2      col3      col4
# st1-ab  stb-spst  sta-spst
# st4-ab  stc-spst  sta-spst
# st7-ab  stb-spst
#         stb-spst            st9-ba
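(*) Note that in row 2, col2, "stc-spst" seems misplaced, but this is not a problem: the value stb-spst does not exist in that row, so for that particular case only the suffix ("spst") matters. In other words, when the whole string (prefix-suffix) matches a string in another row, both should go in the same column; otherwise, when the suffix matches a suffix in another row, they should go in the same column. The resulting data frame should have the same number of rows as the original and as few columns as possible.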
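The matching rule above can be sketched in base R as a first step. This is only an illustration under my own assumptions: column_key is a hypothetical helper name, the input is taken as a long vector of strings plus the row each one came from, and the sketch only computes a grouping key. It does not perform the final placement that keeps the number of columns minimal.

```r
# Hypothetical helper (not from the question): pick a grouping key per string.
# A string that occurs in more than one row is keyed by its full value;
# any other string is keyed by its suffix (the part after the last "-").
column_key <- function(values, row_ids) {
  suffix <- sub(".*-", "", values)
  # number of distinct rows in which each full string appears
  rows_per_value <- vapply(split(row_ids, values),
                           function(r) length(unique(r)), integer(1))
  shared <- unname(rows_per_value[values] > 1)
  ifelse(shared, values, suffix)
}

# the first "st" example from the question, in long form
values  <- c("st1-ab", "stb-spst", "sta-spst",
             "stc-spst", "sta-spst", "st4-ab",
             "stb-spst", "st7-ab",
             "st9-ba", "stb-spst")
row_ids <- c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4)

keys <- column_key(values, row_ids)
print(data.frame(row_ids, values, keys))
```

Strings keyed only by suffix, such as "stc-spst" here, still have to be placed in a column of the same suffix that is free in their row.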
Edit. The answer should be generic and work for:
df2 <- read.table(text="
col1 col2 col3 col4
st1-ab stb-spst sta-spst std-spst
stc-spst sta-spst st4-ab st2-ab
stb-spst st7-ab sa-ac
st9-ba stb-spst",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)
for example, as well.
Possible outcome:
# col1    col2      col3      col4      col5    col6   col7
# st1-ab  stb-spst  sta-spst  std-spst
# st4-ab  stc-spst  sta-spst            st2-ab
# st7-ab  stb-spst                              sa-ac
#         stb-spst                                     st9-ba
Example 3
df3 <- read.table(text="
col1 col2 col3 col4
st1-ab stb-spst sta-spst std-spst
stb-spst sta-ab
sta-spst st7-ab sa-ac
sta-spst stb-spst",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)
Desired output
  col1   col2    col3      col4      col5
1        st1-ab  sta-spst  stb-spst  std-spst
2        sta-ab            stb-spst
3 sa-ac  st7-ab  sta-spst
4                sta-spst  stb-spst
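Edit: Example 4. To make the task easier, you may explicitly define in the function which suffixes can have more than one possible prefix per row; in this example, "spst". Any string whose suffix differs from "spst" should then have only one possible prefix per row, and it can and must be collapsed into a single column of the resulting df, like col2 in the desired output. This is not what I originally wanted, because I would get more columns than expected. Ideally, strings that contain spst with different prefixes should appear in as few columns as possible (see above).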
df4 <- read.table(text="
col1 col2 col3 col4
st1-ab stb-spst sta-spst std-spst
stb-spst st1-ab
sta-spst st7-ab sa-ac
sta-spst stb-spst st7-ab",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)
Desired output
row_id col1   col2    col3      col4      col5
1             st1-ab  sta-spst  stb-spst  std-spst
2             st1-ab            stb-spst
3      sa-ac  st7-ab  sta-spst
4             st7-ab  sta-spst  stb-spst
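We can do this by first melting the dataset (melt), extracting the numeric index from the elements, building a row/column index from it, and assigning the elements to a matrix created from the maximum values of the index.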
library(reshape2)
# df1 is the source data frame from the question; "\\D+" strips the leading non-digits
d1 <- na.omit(transform(melt(as.matrix(df1)), v1 = as.numeric(sub("\\D+", "", value))))
m1 <- matrix("", nrow = max(d1$Var1), ncol = max(d1$v1))
m1[as.matrix(d1[c("Var1", "v1")])] <- as.character(d1$value)
d2 <- as.data.frame(m1[,!!colSums(m1!="")])
colnames(d2) <- paste0("col", seq_along(d2))
d2
#      col1     col2     col3     col4
#1 notst-s1 notst-s2 notst-x3
#2 notst-s1          notst-x3 notst-a5
#3          notst-s2          notst-a5
#4                   notst-x3 notst-a5
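To make the index-extraction step concrete, here is a minimal sketch on a plain character vector (the vector x is my own illustration, not part of the answer). It shows how the trailing number becomes the column index. It also shows why this approach only applies to the first example: the later examples in the question have suffixes without numbers.

```r
x <- c("notst-s1", "notst-s2", "notst-x3", "notst-a5")

# "\\D+" matches the first run of non-digit characters; sub() removes it,
# leaving only the trailing number, which is then used as a column index
idx <- as.numeric(sub("\\D+", "", x))
print(idx)

# a string without digits yields NA, which na.omit() would then drop
print(as.numeric(sub("\\D+", "", "stb-spst")))
```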
Matrix indexing might make this possible:
sel <- dat!=""
unq <- unique(dat[sel])
mat <- matrix(NA, nrow=nrow(dat), ncol=length(unq))
mat[cbind(row(dat)[sel], match(dat[sel], unq) )] <- dat[sel]
#     [,1]       [,2]       [,3]       [,4]
#[1,] "notst-s1" "notst-s2" "notst-x3" NA
#[2,] "notst-s1" NA         "notst-x3" "notst-a5"
#[3,] NA         "notst-s2" NA         "notst-a5"
#[4,] NA         NA         "notst-x3" "notst-a5"
where dat was imported as:
dat <- read.table(text="
col1 col2 col3
notst-s1 notst-s2 notst-x3
notst-s1 notst-x3 notst-a5
notst-s2 notst-a5
notst-x3 notst-a5",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)
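Both answers so far rely on indexing a matrix with a two-column matrix of (row, column) pairs. A small sketch of just that technique, on toy values of my own choosing:

```r
# an empty 2 x 3 character matrix
m <- matrix(NA_character_, nrow = 2, ncol = 3)

# each row of idx is one (row, column) position in m
idx <- cbind(row = c(1, 2, 2), col = c(1, 1, 3))

# assign one value per position in a single vectorised step
m[idx] <- c("a", "b", "c")
print(m)
```

In the answer above, the row part comes from row(dat)[sel] and the column part from match(dat[sel], unq), so every distinct string gets its own column.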
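Tested on the four examples, but this version does not take into account the information you added as a workaround in Example 4.
The main addition is the shuffle logic (which may be slow) that compacts the resulting data frame from right to left. assigned_by_suffix and assigned_by_single_suffix may no longer be needed, but I have not verified that.
The output is at the end of the code.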
# examples
df1 <- read.table(text="
col1 col2 col3
st1-ab stb-spst sta-spst
stc-spst sta-spst st4-ab
stb-spst st7-ab
st9-ba stb-spst",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)
df2 <- read.table(text="
col1 col2 col3 col4
st1-ab stb-spst sta-spst std-spst
stc-spst sta-spst st4-ab st2-ab
stb-spst st7-ab sa-ac
st9-ba stb-spst",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)
df3 <- read.table(text="
col1 col2 col3 col4
st1-ab stb-spst sta-spst std-spst
stb-spst sta-ab
sta-spst st7-ab sa-ac
sta-spst stb-spst",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)
df4 <- read.table(text="
col1 col2 col3 col4
st1-ab stb-spst sta-spst std-spst
stb-spst st1-ab
sta-spst st7-ab sa-ac
sta-spst stb-spst st7-ab",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)
library(reshape2)
library(tidyr)
library(dplyr)
library(stringr)
library(assertthat)
# suffix: everything after the last "-"
suffix <- function(s) {str_extract(s, "[^\\-]+$")}
# make a tall dataframe with melt, and get the suffix
dfm <- df4 %>%
  mutate(row_id = seq_along(col1)) %>%
  melt(id.vars="row_id") %>%
  select(-2) %>%
  filter(value != "") %>%
  mutate(suffix = suffix(value)) %>%
  arrange(value)
assert_that(!any(duplicated(dfm[c("row_id", "value")])))
# initialize
combined <- data.frame()
remaining <- dfm
# get the groups with more than 1 value
matched_values <- dfm %>%
  group_by(value, suffix) %>%
  summarize(n=n()) %>%
  filter(n>1) %>%
  rename(group_id = value) %>%
  ungroup()
# ... and assign the group ids that match
assigned_by_value <- remaining %>%
  inner_join(matched_values %>% select(group_id), by = c("value" = "group_id")) %>%
  mutate(group_id = value) %>%
  select(row_id, value, suffix, group_id)
combined <- combined %>% bind_rows(assigned_by_value)
remaining <- dfm %>% anti_join(combined, by=c("row_id", "value"))
# find the remaining suffixes
matched_suffixes <- remaining %>%
  group_by(suffix) %>%
  summarize(n=n()) %>%
  filter(n>1) %>%
  select(-n) %>%
  ungroup()
# ... and assign those that match
assigned_by_suffix <- remaining %>%
  inner_join(matched_suffixes, by="suffix") %>%
  mutate(group_id = suffix)
combined <- bind_rows(combined, assigned_by_suffix)
remaining <- remaining %>% anti_join(combined, by=c("row_id", "value"))
# All that remain are singles: assign matches by suffix
# (top_n(1, n) keeps the match with the largest n, head(1) keeps a single row)
assigned_by_single_suffix <- remaining %>%
  inner_join(matched_values, by = "suffix") %>%
  top_n(1, n) %>%
  head(1) %>%
  select(-n)
combined <- bind_rows(combined, assigned_by_single_suffix)
remaining <- remaining %>% anti_join(combined, by=c("row_id", "value"))
# get the remaining unmatched
unmatched <- remaining %>%
  mutate(group_id = value)
combined <- bind_rows(combined, unmatched)
remaining <- remaining %>% anti_join(combined, by=c("row_id", "value"))
assert_that(nrow(remaining) == 0)
# any overloads (duplicates) need to bump to their own column
dups <- duplicated(combined[,c("row_id", "group_id")])
combined$group_id[dups] <- combined$value[dups]
assert_that(nrow(combined) == nrow(dfm))
# spread the result
result <- spread(combined %>% select(-suffix), group_id, value, fill ="")
# Shuffle any matching suffix from right to left, so long as there
# is corresponding space and the whole column can move
# i is source (starting from right) - j is target (starting from right)
#
drop_cols <- c()
suffixes <- suffix(names(result))
for (i in (ncol(result)):3) {
  for (j in (i-1):2) {
    if (suffixes[i] == suffixes[j]) {
      non_empty <- which(result[,i] != "")     # rows of the source column to move
      can_fill <- which(result[,j] == "")      # rows of the target that can be filled
      can_move <- all(non_empty %in% can_fill) # is the source a subset of can_fill?
      # if there's space, shuffle the column down
      if (can_move) {
        # shuffle down
        result[,j] <- if_else(result[,j] != "", result[,j], result[,i])
        drop_cols <- c(drop_cols, i)
        result[,i] <- NA
        break
      }
    }
  }
}
if (!is.null(drop_cols)) {
  result <- result[,-drop_cols]
}
result
# Example 1
#   row_id     ab st9-ba sta-spst stb-spst
# 1      1 st1-ab        sta-spst stb-spst
# 2      2 st4-ab        sta-spst stc-spst
# 3      3 st7-ab                 stb-spst
# 4      4        st9-ba          stb-spst
# Example 2
#   row_id     ab sa-ac     spst st2-ab st9-ba sta-spst stb-spst
# 1      1 st1-ab       std-spst               sta-spst stb-spst
# 2      2 st4-ab       stc-spst st2-ab        sta-spst
# 3      3 st7-ab sa-ac                                 stb-spst
# 4      4                              st9-ba          stb-spst
# Example 3
#   row_id     ab sa-ac sta-spst stb-spst std-spst
# 1      1 st1-ab       sta-spst stb-spst std-spst
# 2      2 sta-ab                stb-spst
# 3      3 st7-ab sa-ac sta-spst
# 4      4              sta-spst stb-spst
# Example 4
#   row_id sa-ac st1-ab sta-spst stb-spst std-spst
# 1      1       st1-ab sta-spst stb-spst std-spst
# 2      2       st1-ab          stb-spst
# 3      3 sa-ac st7-ab sta-spst
# 4      4       st7-ab sta-spst stb-spst
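The merge test used by the shuffle loop can be isolated as a small sketch on plain vectors. can_merge and merge_cols are hypothetical helper names of my own, not part of the code above. The sketch assumes empty cells are "" as produced by spread(..., fill = "").

```r
# Hypothetical helper: a source column can be folded into a target column
# only if every row where the source has a value is empty in the target.
can_merge <- function(src, dst) {
  all(dst[src != ""] == "")
}

# Hypothetical helper: keep the target's value where present,
# otherwise take the source's value.
merge_cols <- function(src, dst) {
  ifelse(dst != "", dst, src)
}

dst <- c("stb-spst", "", "stb-spst", "stb-spst")
src <- c("", "stc-spst", "", "")

ok <- can_merge(src, dst)
merged <- if (ok) merge_cols(src, dst) else dst
print(merged)

# a source that collides with an occupied row cannot be merged
blocked <- can_merge(c("std-spst", "", "", ""), dst)
print(blocked)
```

The loop in the answer applies the same test only between columns whose names share a suffix, scanning from the rightmost column to the left.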