Combine nest() and aggregate() in R?


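I collected tweets with the rtweet package. This gives me a data frame with observations (i.e. tweets) in rows and variables in columns. The variables are at both the tweet level (e.g. text, likes, hashtags) and the account level (number of followers, bio, etc.). I ran a sentiment analysis on the tweets, which added a tweet-level sentiment score variable to the data frame.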

A mock-up of what my data looks like now (in reality I have more than 100,000 observations and 115 variables):

df <- data.frame(users = c('u1', 'u2', 'u3', 'u4', 'u5', 'u1', 'u6', 'u6', 'u6', 'u1'),
           text = c('this is u1 first tweet', 
                    'this is another tweet', 
                    'hello hello', 
                    'hashtag tweettext',
                    'tweet text',
                    'this is u1 second tweet',
                    'this is u6 first tzeet',
                   'this is u6 second tweet',
                    'this is u6 third tweet',
                   'this is u1 third tweet'),
           likes= sample(1:10, 10),
           sentiment= rnorm(10, mean=0, sd=1),
           followers = c(111, 200, 300, 400, 500, 111, 666, 666, 666, 111),
           bio = paste0(rep('lorem ipsum', 10), " ", c('u1', 'u2', 'u3', 'u4', 'u5', 'u1', 'u6', 'u6', 'u6', 'u1')))
   users                    text likes   sentiment followers            bio
1     u1  this is u1 first tweet     1  0.96445407       111 lorem ipsum u1
2     u2   this is another tweet    10  1.03840459       200 lorem ipsum u2
3     u3             hello hello     7  1.76887362       300 lorem ipsum u3
4     u4       hashtag tweettext     5 -0.57165015       400 lorem ipsum u4
5     u5              tweet text     4 -1.47028289       500 lorem ipsum u5
6     u1 this is u1 second tweet     2 -1.11036644       111 lorem ipsum u1
7     u6  this is u6 first tzeet     3  0.25440339       666 lorem ipsum u6
8     u6 this is u6 second tweet     8  0.02334468       666 lorem ipsum u6
9     u6  this is u6 third tweet     9 -2.71592529       666 lorem ipsum u6
10    u1  this is u1 third tweet     6  1.18528925       111 lorem ipsum u1



df %>%
  group_by(users) %>%
  summarise(meanlikes = mean(likes),
            meansentiment = mean(sentiment))


df %>%
  select(-likes, -sentiment) %>%
  nest(data = -c(users, followers, bio))


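d1 <- df %>%
  select(-likes, -sentiment) %>%
  nest(data = -c(users, followers, bio))

d2 <- df %>%
  group_by(users) %>%
  summarise(meanlikes = mean(likes),
            meansentiment = mean(sentiment))

# assumed completion (the snippet is cut off here): join the per-user means onto d1
d1 <- d1 %>%
  left_join(d2, by = "users")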



  users                                                                    text followers
1    u1 this is u1 first tweet, this is u1 second tweet, this is u1 third tweet       111
2    u2                                                   this is another tweet       200
3    u3                                                             hello hello       300
4    u4                                                       hashtag tweettext       400
5    u5                                                              tweet text       500
6    u6 this is u6 first tzeet, this is u6 second tweet, this is u6 third tweet       666
             bio meanlikes meansentiment
1 lorem ipsum u1  4.333333    -0.2846824
2 lorem ipsum u2  6.000000    -0.5443194
3 lorem ipsum u3  2.000000     1.8001123
4 lorem ipsum u4  4.000000     1.0114402
5 lorem ipsum u5  9.000000    -0.5637166
6 lorem ipsum u6  7.000000     1.2346833



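library(dplyr)
library(stringr)

# set seed to make df reproducible (the original seed value is not shown; 123 is an arbitrary choice)
set.seed(123)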
df <- data.frame(users = c('u1', 'u2', 'u3', 'u4', 'u5', 'u1', 'u6', 'u6', 'u6', 'u1'),
                 text = c('this is u1 first tweet', 
                          'this is another tweet', 
                          'hello hello', 
                          'hashtag tweettext',
                          'tweet text',
                          'this is u1 second tweet',
                          'this is u6 first tzeet',
                          'this is u6 second tweet',
                          'this is u6 third tweet',
                          'this is u1 third tweet'),
                 likes= sample(1:10, 10),
                 sentiment= rnorm(10, mean=0, sd=1),
                 followers = c(111, 200, 300, 400, 500, 111, 666, 666, 666, 111),
                 bio = paste0(rep('lorem ipsum', 10), " ", c('u1', 'u2', 'u3', 'u4', 'u5', 'u1', 'u6', 'u6', 'u6', 'u1')))

df %>%
  group_by(users) %>%
  mutate(tweets = str_c(text, collapse = ", "),
         meanlikes = mean(likes),
         meansentiment = mean(sentiment)) %>%
  select(-text, -likes, -sentiment) %>%
  distinct()  # assumed final step (the snippet is cut off): keep one row per user

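You can group_by users and keep the first value of bio and followers, since they are the same for every row of a given user. Take the mean of likes and sentiment, and use toString to collapse text into a single comma-separated string.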

df %>%
  group_by(users) %>%
  summarise(across(c(bio, followers), first),
            across(c(likes, sentiment), mean), 
            text = toString(text))

#  users bio      followers likes sentiment text             
#  <chr> <chr>        <dbl> <dbl>     <dbl> <chr>            
#1 u1    lorem i…       111  6.67    0.0870 this is u1 first…
#2 u2    lorem i…       200  8      -0.945  this is another …
#3 u3    lorem i…       300  6       0.225  hello hello      
#4 u4    lorem i…       400  3       0.359  hashtag tweettext
#5 u5    lorem i…       500  5      -0.664  tweet text       
#6 u6    lorem i…       666  4.33    0.206  this is u6 first…
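
If the tweets should stay as separate elements rather than one pasted string, a list-column may be closer to what nest() does. The following is a minimal sketch, not part of the original answer: the `tweets` data frame and its values are made up for illustration, and it assumes dplyr 1.0 or later. The account-level columns are used as grouping keys, the numeric columns are averaged, and `text` is wrapped in `list()` so each user keeps a character vector of tweets.

```r
library(dplyr)

# Small deterministic stand-in for the tweet data (hypothetical values)
tweets <- data.frame(
  users     = c("u1", "u2", "u1", "u2", "u1"),
  text      = c("a", "b", "c", "d", "e"),
  likes     = c(1, 4, 2, 6, 3),
  sentiment = c(0.5, -1, 0, 1, 1),
  followers = c(111, 200, 111, 200, 111),
  bio       = c("bio u1", "bio u2", "bio u1", "bio u2", "bio u1"),
  stringsAsFactors = FALSE
)

by_user <- tweets %>%
  group_by(users, followers, bio) %>%   # account-level columns act as keys
  summarise(
    meanlikes     = mean(likes),
    meansentiment = mean(sentiment),
    text          = list(text),         # one character vector per user
    .groups = "drop"
  )

# The tweets of the first user, still as separate strings
by_user$text[[1]]
```

Grouping by all account-level columns avoids listing them again inside summarise(), but it only gives one row per user if those columns really are constant within each user. With 115 variables it may be easier to group by users alone and take first() of the rest, as in the answer above. tidyr::unnest(by_user, text) should expand the list-column back to one row per tweet.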