Group the column values uniquely in R

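I have two columns, title and text. I want to group the titles by the texts they received. I also want titles with the same name to be grouped uniquely, so that each consecutive run of the same title becomes one row.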

For example, I have:

title | text 
-------------
A     | I like...
B     | I wish...
C     | review1
C     | review2
C     | review3
D     | Detecting...
C     | review1
C     | review2
E     | New...

What I want is:

title | text 
-------------
A     | I like...
B     | I wish...
C     | review1 review2 review3
D     | Detecting...
C     | review1 review2
E     | New...

What I tried:

df %>%
    filter(title %in% sample(unique(title))) %>%
    group_by(title) %>%
    select(title, text)

But this still does not give the result I expected.

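I have not used dplyr, but base R can handle it:

# Collapse all texts belonging to one title into a single string.
# Note: this pools every row with the same title, adjacent or not.
do.line <- function(a.title) {
  c(a.title, paste(df$text[df$title == a.title], collapse = " "))
}
t(sapply(unique(df$title), do.line))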

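Another base R way:

# rle() finds runs of consecutive equal titles; grp numbers those runs
tmp <- rle(as.character(df$title))
df$grp <- rep(seq_along(tmp$lengths), tmp$lengths)
aggregate(text ~ title + grp, data = df, FUN = paste0, collapse = " ")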

  title grp                    text
1     A   1               I like...
2     B   2               I wish...
3     C   3 review1 review2 review3
4     D   4            Detecting...
5     C   5         review1 review2
6     E   6                  New...
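
To see why the rle() step separates the two runs of C, it can help to look at the intermediate values. The following is only a small sketch on the question's sample titles; the variable names `runs` and `grp` are chosen here for illustration.

```r
# Sample titles from the question
title <- c("A", "B", "C", "C", "C", "D", "C", "C", "E")

# rle() records each run of consecutive equal values and its length
runs <- rle(title)
runs$values   # "A" "B" "C" "D" "C" "E"
runs$lengths  # 1 1 3 1 2 1

# Repeat each run's index by its length to get one group id per row
grp <- rep(seq_along(runs$lengths), runs$lengths)
grp           # 1 2 3 3 3 4 5 5 6
```

Because the second block of C rows gets its own id (5), grouping on title plus grp keeps it apart from the first block (3).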

You can write a function that concatenates the unique values of a vector, and use it after group_by:

library(dplyr)

df <- data.frame(title = c('A','B','C','C','C','D','C','C','E'),
                 text = c('I like...', 'I wish...', 'review1','review2','review3',
                          'Detecting...','review1','review2', 'New...'))

unique_paste <- function(text_vec) {
  paste(unique(text_vec), collapse = " ")
}

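# id increases by one each time the title differs from the previous row
df2 <- df %>% 
  mutate(id = cumsum(title != lag(title, default = first(title)))) %>% 
  group_by(id, title) %>% 
  # summarise() returns a plain character column (do() would give a list column)
  summarise(text = unique_paste(text), .groups = "drop")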

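Here is a dplyr approach. The key is to set up group_by correctly, so that it defines the groups based on both the row position and the value in the title column.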

library(dplyr)

df %>% 
  group_by(gp = c(0, na.omit(cumsum(lead(title) != title)))) %>% 
  summarize(title = unique(title), text = paste0(text, collapse = " ")) %>% 
  select(-gp)

# A tibble: 6 × 2
  title text                   
  <chr> <chr>                  
1 A     I like...              
2 B     I wish...              
3 C     review1 review2 review3
4 D     Detecting...           
5 C     review1 review2        
6 E     New...   
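
As a possible alternative, recent dplyr versions (1.1.0 and later, as far as I know) provide consecutive_id(), which generates this kind of run id directly. The sketch below assumes such a version is installed; the column name `run` and the object `res` are made up for this example.

```r
library(dplyr)

df <- data.frame(
  title = c("A", "B", "C", "C", "C", "D", "C", "C", "E"),
  text  = c("I like...", "I wish...", "review1", "review2", "review3",
            "Detecting...", "review1", "review2", "New...")
)

res <- df %>%
  # run changes whenever title differs from the previous row
  group_by(run = consecutive_id(title), title) %>%
  summarise(text = paste(text, collapse = " "), .groups = "drop") %>%
  select(-run)

res
```

Grouping by both run and title keeps the title column in the summary without needing unique(title).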

I think you can use aggregate.

Try the following base R option:
aggregate(text ~ ., unique(df), toString)
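
One thing to be aware of with this kind of call: it groups by title alone, so non-adjacent rows with the same title end up in one row, and toString() separates the values with a comma and a space. A small sketch on the question's sample data (the name `pooled` is just for illustration):

```r
df <- data.frame(
  title = c("A", "B", "C", "C", "C", "D", "C", "C", "E"),
  text  = c("I like...", "I wish...", "review1", "review2", "review3",
            "Detecting...", "review1", "review2", "New...")
)

# unique(df) drops the repeated (C, review1) and (C, review2) rows,
# then aggregate() collapses the remaining texts per title
pooled <- aggregate(text ~ ., unique(df), toString)
pooled
```

This gives five rows with a single C row, rather than the six rows in the desired output, so it fits only if merging all C rows is acceptable.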

Pivoting comes to mind:

library(tidyverse)

# Build test data
df <- data.frame(title=c("A","B","C","C","C","D","C","C","E"),
                 text=c("I like...","I wish...","review1","review2","review3","Detecting","review1","review2","New..."))

# Combine all values in a list by pivoting
new_df <- df %>% pivot_wider(names_from=title, values_from=text, values_fn=list)

# Bring to desired format by pivoting back
new_df <- new_df %>% pivot_longer(cols=c(names(new_df)), names_to="title", values_to="text")

# Inspecting result
new_df

str(new_df)

# Example query
new_df %>%  filter(title=="C") %>% unlist()